[PATCH 1/2] md: fix stale corrupted error upon mountpoint loss

guihe...@cmss.chinamobile.com

Sep 1, 2016, 5:32:26 AM
to sheep...@googlegroups.com, Gui Hecheng
From: Gui Hecheng <guihe...@cmss.chinamobile.com>

We hit this while handling an ENOENT error under the .stale directory;
the log looks like:
ERROR [gway 121713] err_to_sderr(61) /shd/obj10/.stale corrupted
ERROR [gway 121743] err_to_sderr(61) /shd/obj10/.stale corrupted
ERROR [gway 121710] err_to_sderr(61) /shd/obj10/.stale corrupted
...
The ENOENT error occurs because systemd unmounted the disk after certain
errors were found on it.
But sheep should not report the .stale directory as corrupted.
What sheep actually intends is to check that the mountpoint directory exists:
	if (stat(dir, &s) < 0) {
		sd_err("%s corrupted", dir);
		return md_handle_eio(dir);
	}
But here "dir" is a .stale directory, so md_handle_eio() checks whether
the disk exists by comparing the string "/shd/obj10/.stale" against all
the disk names such as "/shd/obj10", "/shd/obj9", etc.
No name matches, so it cannot find the bad disk and gives up.

To fix it, trim the ".stale" component with an extra dirname() call, so
that md_handle_eio() is given a searchable disk name.

Signed-off-by: Gui Hecheng <guihe...@cmss.chinamobile.com>
---
sheep/store/common.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/sheep/store/common.c b/sheep/store/common.c
index 87b3fc0..ecd3c84 100644
--- a/sheep/store/common.c
+++ b/sheep/store/common.c
@@ -45,6 +45,13 @@ int prepare_iocb(uint64_t oid, const struct siocb *iocb, bool create)
 	return flags;
 }

+static char *get_obj_dir(char *path)
+{
+	if (is_stale_path(path))
+		path = dirname(path);
+	return dirname(path);
+}
+
 int err_to_sderr(const char *path, uint64_t oid, int err)
 {
 	struct stat s;
@@ -52,7 +59,7 @@ int err_to_sderr(const char *path, uint64_t oid, int err)

 	/* Use a temporary buffer since dirname() may modify its argument. */
 	pstrcpy(p, sizeof(p), path);
-	dir = dirname(p);
+	dir = get_obj_dir(p);

 	sd_debug("%s", path);
 	switch (err) {
--
1.8.3.1
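
For readers following the reasoning in the commit message, here is a minimal
standalone sketch of why the exact-match disk lookup fails on a .stale path
and succeeds once the extra dirname() is applied. It is an illustration under
assumptions, not the sheep code: is_stale_path() is assumed to be a plain
substring test for "/.stale", and disk_names[], find_disk() and obj_dir_of()
are hypothetical stand-ins (find_disk() only mimics the string comparison
that md_handle_eio() is described as doing). The object file names used in
the usage note are made up.

```c
#include <assert.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical disk list, mirroring the names from the commit message. */
static const char *disk_names[] = { "/shd/obj10", "/shd/obj9" };

/* Assumption: a path counts as stale if it contains "/.stale". */
static int is_stale_path(const char *path)
{
	return strstr(path, "/.stale") != NULL;
}

/* Same shape as the helper added by the patch. */
static char *get_obj_dir(char *path)
{
	if (is_stale_path(path))
		path = dirname(path);	/* drop the object name, leaving ".../.stale" */
	return dirname(path);		/* drop ".stale" (or the object name if not stale) */
}

/* Hypothetical wrapper: copy first, since dirname() may modify its argument. */
static const char *obj_dir_of(const char *path, char *buf, size_t len)
{
	assert(strlen(path) < len);
	snprintf(buf, len, "%s", path);
	return get_obj_dir(buf);
}

/* Hypothetical exact-match lookup; returns the disk index or -1. */
static int find_disk(const char *dir)
{
	size_t i;

	for (i = 0; i < sizeof(disk_names) / sizeof(disk_names[0]); i++)
		if (strcmp(disk_names[i], dir) == 0)
			return (int)i;
	return -1;
}
```

With these definitions, find_disk("/shd/obj10/.stale") finds no disk, while
find_disk(obj_dir_of("/shd/obj10/.stale/0123.1", buf, sizeof(buf))) resolves
to the "/shd/obj10" entry. A non-stale path such as "/shd/obj9/0123" still
maps to "/shd/obj9", as it did before the patch.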



Liu Yuan

Sep 8, 2016, 2:51:49 AM
to guihe...@cmss.chinamobile.com, sheep...@googlegroups.com
On Thu, Sep 01, 2016 at 05:33:00PM +0800, guihe...@cmss.chinamobile.com wrote:
> From: Gui Hecheng <guihe...@cmss.chinamobile.com>
>
> We hit this while handling an ENOENT error under the .stale directory;
> the log looks like:
> ERROR [gway 121713] err_to_sderr(61) /shd/obj10/.stale corrupted
> ERROR [gway 121743] err_to_sderr(61) /shd/obj10/.stale corrupted
> ERROR [gway 121710] err_to_sderr(61) /shd/obj10/.stale corrupted
> ...
> The ENOENT error occurs because systemd unmounted the disk after certain
> errors were found on it.
> But sheep should not report the .stale directory as corrupted.
> What sheep actually intends is to check that the mountpoint directory exists:
> 	if (stat(dir, &s) < 0) {
> 		sd_err("%s corrupted", dir);
> 		return md_handle_eio(dir);
> 	}
> But here "dir" is a .stale directory, so md_handle_eio() checks whether
> the disk exists by comparing the string "/shd/obj10/.stale" against all
> the disk names such as "/shd/obj10", "/shd/obj9", etc.
> No name matches, so it cannot find the bad disk and gives up.
>
> To fix it, trim the ".stale" component with an extra dirname() call, so
> that md_handle_eio() is given a searchable disk name.

Applied these two, thanks!

Yuan