HELP: DNS-323-B1 new install of RC5 stuck on fsck Step2

Eduardo Cabrera

Jan 7, 2017, 12:21:48 PM

to Alt-F

First, let me say that I cannot blame myself enough for this. MEA CULPA! So, let's get past that and get on to the facts.

Preface: DNS-323-B1 with dual 1.5TB drives, non-RAID, ext2. It ran fine for years with D-Link firmware 1.10 plus FFP 0.5. Volume 1 was the daily-use drive, holding my whole life of backups: pictures, docs, etc., only VERY IMPORTANT files. To keep it safe, Volume 2 was not shared on the network and took a daily incremental backup of Volume 1, using the DNS323 RSync Time Machine method. It was done via scripts, a la Time Machine, so I could go back to older dates. Directories on Volume 2 reflect the day of the backup, but the files in them were just symlinks to the first copied instance of each file, so every dated directory has the whole directory of files. I had about 1,200 directories (days) recorded that way on Volume 2. OK, so now you know the importance of the data and how the Volume 2 drive was filled up.

This week, Volume 1 had issues: it would take ages to list the contents of the shared directories on the network. I took it out (after shutdown) and tried running fsck on a Lubuntu machine, to no avail. It ended up tripping S.M.A.R.T., so it's basically useless. So I went ahead and bought an HGST 2TB drive. I formatted it on the DNS-323 with ext2 (for speed), and then everything was supposed to be fine, ready to recover the data from Volume 2.

After much reading, I wanted to try Alt-F and take advantage of ext4 on the new drive, so I decided to install RC5 (despite the warnings about network corruption there, which I thought I could deal with).

Installed RC5 yesterday morning WITHOUT taking out my precious Volume2 1.5TB drive, nor having made a copy of it first. BIG MISTAKE!

Everything seemed to go fine and it installed OK, but on boot it is doing an fsck on sdb2 (my precious Volume 2) and it's taking FOREVER. It seems to be progressing, though: after 25 hrs it is still at 6% of Step 2. The clock seems to be running, as when I refresh the screen I see an updated time. This is the first time I have run it, so there is no user or password set yet. The lights were OK until the morning, when I saw both disk lights in orange, the network light blinking, and the power button not blinking anymore (it seems orange, but that might be a reflection of the other LEDs; picture attached). I attached 2 PDFs of the status page taken 3 hours apart, where you can see that the clock continues to work and that progress was made from 5% to 6%.

I suppose I could leave it going for days and HOPE that at the end my data will be OK. Unfortunately, I have too much riding on this, so I do not know which is worse: to just try my luck, let it finish, and wake up sometime next week to see that it has completed the fsck successfully (or maybe returned a mostly empty disk), or to somehow interrupt the fsck process in the cleanest way possible (maybe avoiding data loss that fsck might be causing), take the disk out, put it in a more powerful Linux machine, run fsck there, and HOPE that my data is found. My biggest concern is that I do not know how fsck pass 2 works. It checks directories against inodes (which is what I think I have lots of), so I do not know whether it saves its fixes as it progresses, only at the end of the pass, or only at the end of pass 6.

So, if anyone has in-depth knowledge of how fsck works and has a suggestion on what to do, it's welcome. I just want to save all (or as much as possible) of the data I have there. Also, if someone knows why the blue button is not blinking (a heartbeat, as it was yesterday) and the yellow lights do not blink (even though there seems to be activity on the drive), it would be really appreciated.

Alt-F 0.1RC5 Status Page.pdf

Alt-F 0.1RC5 Status Page2.pdf

IMG_3084.jpg.pdf

Joao Cardoso

Jan 7, 2017, 7:22:13 PM

to Alt-F Group

Fsck seems to be working. From the screenshot, swap is being used and CPU load is high.

Fsck needs a lot of memory and starts using disk (swap), becoming very slow; that's probably why D-Link disabled it in their firmware.

Fsck is running in automatic mode, as on any Linux system. If it finds inconsistencies that might lead to data loss, it will stop, and that will be shown in the status page.

Eventually swap might be exhausted, and fsck will stop with an out-of-memory warning.

You can *telnet* the box even without setting it up first.

Be patient...

--
You received this message because you are subscribed to the Google Groups "Alt-F" group.
To unsubscribe from this group and stop receiving emails from it, send an email to alt-f+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/alt-f.
For more options, visit https://groups.google.com/d/optout.

Eduardo Cabrera

Jan 8, 2017, 12:04:53 AM

to Alt-F

Joao, Thanks for the prompt reply and for your comments. I feel much relieved now. I was able to telnet, and run TOP to see what is going on. Attached is a screenshot of TOP showing everything that is running. As you see, fsck is running:

fsck.ext3 -p -C5 /dev/sdb2

Which tells me that my precious Volume2 was using ext3, i.e. journaled (lucky me!), so I will have a much better chance of getting most of my data back! Additionally, it shows that fsck is running with -p, the safe automatic repair ("preen") mode.

Also, I see that there is plenty of activity, most of it IO operations, so the disk is still cranking, even though the LEDs are still solid orange.

I also attached the current status page of the process: after 36 hrs, we are in Step 2 at 10%. So it will probably take another 9 days before it completes. Patience shall be rewarded...

In the meantime, any comments / suggestions? Logs to watch? Maybe some processes to kill that might speed up the completion?

Regards,

Eduardo

Screen Shot 2017-01-07 at 11.44.23 PM.png

Alt-F 0.1RC5 Status Page3.pdf

João Cardoso

Jan 10, 2017, 11:48:02 AM

to Alt-F

On Sunday, 8 January 2017 05:04:53 UTC, Eduardo Cabrera wrote:

Joao, Thanks for the prompt reply and for your comments. I feel much relieved now. I was able to telnet, and run TOP to see what is going on. Attached is a screenshot of TOP showing everything that is running. As you see, fsck is running:
fsck.ext3 -p -C5 /dev/sdb2

Which tells me that my precious Volume2 was using ext3 -> journaled (lucky me!), so I will have a much better chance of having most of my data back!. Additionally, it does show the -p passive safe auto-repair is running.

Also, I see that there is plenty of activity, most of it being io operations, so the disk is still cranking up, even though leds are still solid orange.

The LEDs are orange to warn you that a new error/warning message is shown in the Status page: "Clock is drifting" -- nothing to worry about.

I also attached the current status page of the process, after 36 hrs, we are in Step 2 at 10%. So, it will probably take another 9 more days before it completes.

The percentage is not related to the time it takes; it is the step's progress indicator, related to the "number" of files and their length.

Patience shall be rewarded...

hope so. Has it finished already?

As you can see from 'top', the VSZ column, fsck is using 883MB of (virtual -- read RAM+swap) memory, and that is the amount of swap being used, 85% of 1035MB.

The DNS-323 has 64MB of RAM, and at the time it came out disk drives were also much smaller. Although it works fine with TB drives, the fsck step has to happen now and then, and it is always going to be slow. Using ext3/4 minimizes that.

In the mean time, any comments / suggestions? Logs to watch? maybe some processes to kill that might speed up the completion?

Don't use the box for anything else. You can even stop samba, use "rcsmb stop", but it is not going to help much. You can also stop smartd, "rcsmart stop". Don't check the "autorefresh" check button in the status page -- *that* slows down the box.

The growing fsck log (/tmp/check*.log) is not helping either (what's its current size?). Its only function is to show the progress in the status page.

On a desktop Linux computer with 2/4/8GB of RAM, fsck would run much quicker. That is an option for you if you have a Linux computer: kill fsck, power off the box, move the disk to the PC and run the fsck there.

I don't think that killing fsck is dangerous, but that's your data. fsck will stop anyway when/if swap is exhausted or if there is a power cut...

sda4/sdb4 are also waiting to be fsck checked, and they consume a fraction of a second each ten seconds. You can see that in /var/log/hot_aux.log

Eduardo Cabrera

Jan 11, 2017, 6:18:59 PM

to al...@googlegroups.com

Hello Joao,

Thanks for your reply. It's been 5 days 17 hrs, and it is still going.

on your comments:

-----

Don't use the box for anythink else. You can even stop samba, use "rcsmb stop", but it is not going to help much. You can also stop smartd, "rcsmart stop". Don't check the "autorefresh" check button in the status page -- *that* slows down the box.

--------

I did stop those two smb processes and smartd. The box is not being used for anything and is on UPS power, so it can go for days. Currently I cannot tell what % of step 2 it is at, as the status page now takes a long time to come up.

-------

The fsck log growing (/tmp/check*.log) is not helping also, as it keeps growing (what's its current size?). Its only function is to show in the status page the progress completion.

-------

Currently check-sdb2.log is 155.7M in size

-rw-rw-rw- 1 root root 163295232 Jan 12 10:25 check-sdb2.log

but it seems that it has stopped growing now.

the last lines show:

2 5660319 15099941 /dev/sdb2

2 5660320 15099941 /dev/sdb2

2 5660321 15099

and it seems that it did not finish writing the last line, nor put the LF at the end.

FYI, in my flawed wisdom, I copied the check-sdb2.log file to the mounted volume, hoping to delete it or replace it with a new empty one, so that there is more memory / swap (/tmp) and the box does not run out of memory. Does that seem like a good idea? Any thoughts on how to go about replacing the log file with a new or shorter version while fsck is running?

Something that caught my eye was the status of step 2, which now showed 37488%. This is the last piece of information on the status page, and it took a long time to show up (about 1 hr) after the status page was requested. I noticed that these processes were running before the status page completed:

tr -s \b\r\001\002 \n

cat /tmp/check-sdb2.log

{status.cgi} /bin/sh status.cgi

so I assume they are all part of the status.cgi script and its processing of the data to show on the screen. The fact that the check-sdb2.log file is now more than 155MB is what causes the delays.

I now believe that the 37488% comes from reading the last line of the check-sdb2.log file, which now shows

2 5660321 15099

so the percentage was computed against 15099 instead of the full number 15099941, which was not completely written to the log file. The real figure should have been 37.49% of Step 2 complete.

At any rate, the log file check-sdb2.log no longer seems to be growing, so now we are blind to the progress of the fsck. Any thoughts or ideas on what might be going on? How to restore the logging?

----

On a desktop linux computer, with 2/4/8GB of RAM fsck would perform much quicker. That is an option for you if you have a linux computer, kill fsck, power off the box, move the disk to the PC and perform the fsck there.

------

At this stage I would rather not shut it down, and let it finish.

sda4/sdb4 are also waiting to be fsck checked, and they consume a fraction of a second each ten seconds. You can see that in /var/log/hot_aux.log

----------

The hot_aux.log file is 2.7M now, and the last four lines are:

hot_aux sda4 waiting to be fscked

hot_aux sdb4 waiting to be fscked

hot_aux sda4 waiting to be fscked

hot_aux sdb4 waiting to be fscked

so I see what you mean. Is there any way to remove those from the pending fsck status? Attached is also the latest TOP output. Please note that there are two

sleep 10

processes, which I think are related to the wait for fsck above. Is there a way to stop those processes?

Thank you for all your help; I hope my data is safe.

regards,

Eduardo

Alt-F 0.1RC5 Status Page5.pdf

Screen Shot 2017-01-12 at 5.36.54 AM.png

João Cardoso

Jan 12, 2017, 10:57:41 AM

to al...@googlegroups.com

On Wednesday, 11 January 2017 23:18:59 UTC, Eduardo Cabrera wrote:

Hello Joao,

Thanks for your reply. Its been 5 days 17 hrs, and still going.

on your comments:
-----
Don't use the box for anythink else. You can even stop samba, use "rcsmb stop", but it is not going to help much. You can also stop smartd, "rcsmart stop". Don't check the "autorefresh" check button in the status page -- *that* slows down the box.
--------
I did stop those two smb processes, and the smartd, and the box is not being used for anything and has UPS power, so it can go for days. Currently I can not tell what % of step 2 is, as it does take a long time to present the status page now.

-------
The fsck log growing (/tmp/check*.log) is not helping also, as it keeps growing (what's its current size?). Its only function is to show in the status page the progress completion.
-------
Currently check-sdb2.log is 155.7M in size
-rw-rw-rw- 1 root root 163295232 Jan 12 10:25 check-sdb2.log

That's too much for just "showing" the progress. I have to change that.

Perhaps in your case years of not doing filesystem checks now require more work from fsck.

I have done a similar fsck on a DNS-325 with an ext4 filesystem of 900GB capacity, of which 400GB are used, with 2.8 million files on it, and it produced a 17MB log file.

I removed the log file using 'rm' and fsck kept working fine until the end. In your case this would release just that amount of memory, but fsck itself needs a lot, so removing the log file will alleviate the waiting.

but it seems the it has stopped growing now.
the last lines show:
2 5660319 15099941 /dev/sdb2

2 5660320 15099941 /dev/sdb2

2 5660321 15099

and seems that it did not complete writing the last line nor put in the LF at the end.
FYI in my flawed wisdom, I copied the check-sdb2.log file to the mounted volume, in the hopes to delete it or create a new empty one,

Copying is useless, and a new one will not be created; as soon as you remove the original log you will lose the progress indication, but you still have the command line to verify that fsck is still working, and from my experience above fsck will not complain (it still generates the log internally, but it is lost).

so there is more memory / swap (/tmp) and does not run out of memory. does it seem like a good idea? any thoughts on how to proceed replcing the log file with a new or shorter version while FSCK is running?

Something that caught my eyes was the status of the step 2 which now showed 37488%. This last piece of information on the status page. It took a long time to show up (about 1 hr) after the status page was invoked. I noticed that these processes were running before the whole status page is complete:
tr -s \b\r\001\002 \n
cat /tmp/check-sdb2.log
{status.cgi} /bin/sh status.cgi
so I assume is all part of the status.cgi script and the process of manipulating data to show on the screen. The fact that that check-sdb2.log file is more than 155Mb now is causing the delays.

Yes, and there are probably errors produced given the huge log and numbers. Don't worry about that, it's just for show.

I know believe that the 37488% comes from reading the last line of the check-sbd2.log file that now shows
2 5660321 15099
and the number correspond to have taken the percentage based on 15099 vs the full number of 15099941 which was not completed on the log file. The real number should have been 37.48% of Step 2 complete.

In my case the log has 7.5 thousand lines for each of step 1 and 4, and 332K lines for each of step 2 and 3.

At any rate, now the log file check-db2.log seems is not logging anymore (growing) so now we are blind to the status of completion of the fsck. Any thoughts ideas what might be going on? how to restore the logging?

I won't tell you what to do with your data, but as the log seems to be useless now, I would just remove it.

----
On a desktop linux computer, with 2/4/8GB of RAM fsck would perform much quicker. That is an option for you if you have a linux computer, kill fsck, power off the box, move the disk to the PC and perform the fsck there.
------
At this stage I rather not shut it down, and let it finished.

sda4/sdb4 are also waiting to be fsck checked, and they consume a fraction of a second each ten seconds. You can see that in /var/log/hot_aux.log
----------
The hot_aux.log file is 2.7M now, and the last four lines are:
hot_aux sda4 waiting to be fscked

hot_aux sdb4 waiting to be fscked

hot_aux sda4 waiting to be fscked

hot_aux sdb4 waiting to be fscked

so I see why you mean. Any way to shut down those from the pending fsck status? Attached is also the latest TOP command result. Please note that there are some two
sleep 10
processes, which I would think its related to the wait for FSCK above. if there a way to stop those processes?

Don't worry about that, it represents a tiny fraction of memory and CPU. It's like you typing (not executing) something at the command line.

Thanks you for all your help, and hope my data is safe.

If that gives you some peace: in my 35 years of Unix experience I have never lost any data, either at work or at home. But I do periodic fscks...

Eduardo Cabrera

Jan 13, 2017, 9:39:36 PM

to Alt-F

Hello Joao,

Just to give you an update: the process is still going strong after 1 week. Attached are the TOP and status screen data.

As you can see, fsck is still going strong and seems not to have run out of memory/resources.

Question: do you know if the USR1 signal is supported in this version of fsck.ext3? I'd like to see if I can get a sense of the progress of the running fsck, as described at:

http://serverfault.com/questions/118791/how-do-you-get-e2fsck-to-show-progress-information

via sending signal:

kill -USR1 341

but I do not want to end up cancelling the fsck process if it does not support the USR1 signal. Obviously I cannot test it to be sure, which is why I am asking in case you know.

Other than that, I will keep waiting... Thanks for everything...

Eduardo

Screen Shot 2017-01-13 at 9.32.13 PM.png

Alt-F 0.1RC5 Status Page6.pdf

João Cardoso

Jan 14, 2017, 11:52:23 AM

to Alt-F

On Saturday, 14 January 2017 02:39:36 UTC, Eduardo Cabrera wrote:

Hello Joao,

Just to give you an update. Process still going strong after 1 week. Attached TOP and Status screen data.
As you can see the fsck is still going strong, seems has not run out of memory/resources.

But it might be stuck in some loop?

Question, do you know if signal -USR1 is supported in this version of fsck.ext3?

Yes, it is (I experimented; it's not only what the manual says).

But according to the manual page, the progress bar only applies if you run it in a console. That is not the case here, as it was started from a script and logging is already enabled.

As you can see from the ps or top output, "fsck.ext3 -p -C5", the -C5 means that progress information is sent to file descriptor 5, which the script set up as /tmp/checkxxx.log: "fsck -p -C5 /dev/sdb2 2>&1 5<> /var/log/checkxxx.log"

In the manual page:

SIGNALS

The following signals have the following effect when sent to e2fsck.

SIGUSR1
    This signal causes e2fsck to start displaying a completion bar or emitting progress information. (See discussion of the -C option.)

SIGUSR2
    This signal causes e2fsck to stop displaying a completion bar or emitting progress information.

and previously:

-C fd
    This option causes e2fsck to write completion information to the specified file descriptor so that the progress of the filesystem check can be monitored. This option is typically used by programs which are running e2fsck. If the file descriptor number is negative, then absolute value of the file descriptor will be used, and the progress information will be suppressed initially. It can later be enabled by sending the e2fsck process a SIGUSR1 signal. If the file descriptor specified is 0, e2fsck will print a completion bar as it goes about its business. This requires that e2fsck is running on a video console or terminal.

What you can do is stop the logging by sending SIGUSR2, to avoid additional and unnecessary swap being used because of it.

Have you deleted the log file?

You have ps (and top) to verify that fsck is still running, and 'cat /proc/swaps' to see the swap in use.
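A minimal sketch of those two checks, assuming a POSIX-style shell such as the BusyBox one on the box. The function name is_running is made up for this example, and 341 is simply the fsck.ext3 PID reported in this thread; use whatever ps shows on your box.

```sh
#!/bin/sh
# is_running PID: succeeds if a process with that PID exists.
# 'kill -0' sends no signal; it only tests whether the PID can be signalled.
# (As a non-root user it also fails for other users' processes.)
is_running() {
    kill -0 "$1" 2>/dev/null
}

pid=341   # fsck.ext3 PID from this thread; yours will differ
if is_running "$pid"; then
    echo "fsck (pid $pid) is still running"
else
    echo "no process with pid $pid"
fi

# Swap in use: the 'Used' column is in KB. /proc/swaps is Linux-specific.
if [ -r /proc/swaps ]; then
    cat /proc/swaps
fi
```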

The status page watches for /tmp/checkxxx and sends a 'kill -0' to the process whose pid is in /tmp/checkxxx.pid to be sure that the process is still running (you don't need to do that, you have ps telling you that); if it is alive, it performs some computation on the last line of /tmp/checkxxx.log (getting the last line means reading them all), but you can do the math yourself: 'tail -1 /tmp/checkxxx.log' gives the fsck step, the number of operations done so far in the step, and the total number of operations to perform in the step.

So you don't need to use (and wait for) the status page.
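Doing that math by hand can be sketched as follows. This is only an illustration, not the status.cgi code: the function name progress_pct is invented here, and it assumes the log lines look like the "step done total device" lines quoted in this thread. Lines that do not have all four fields are skipped, so a half-written last line cannot produce a figure like 37488%.

```sh
#!/bin/sh
# progress_pct: read "step done total device" lines on stdin and print the
# step and percentage for the last *complete* line (exactly 4 fields).
progress_pct() {
    awk 'NF == 4 { step = $1; cur = $2; total = $3 }
         END { if (total > 0) printf "step %d: %.2f%%\n", step, 100 * cur / total }'
}

# Typical use on the box (log name as in this thread):
#   tail -n 5 /tmp/check-sdb2.log | progress_pct

# Demonstration with two lines quoted in this thread; the second one is
# truncated and is therefore ignored.
printf '2 5660320 15099941 /dev/sdb2\n2 5660321 15099' | progress_pct
```

With the two sample lines above this prints "step 2: 37.49%".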

As a personal note, I would already have killed fsck and run it on a Linux PC with more RAM. Or, if a Linux PC is not available, I would start fsck from the command line (giving -C0 as an option to get the completion bar). That would avoid the need for the log file, and I would expect that some/most of the filesystem check has already been performed, so the second fsck would be faster -- but that is only educated speculation.

I like to see if I can get a sense of progress on the running fsck like described on:
http://serverfault.com/questions/118791/how-do-you-get-e2fsck-to-show-progress-information
via sending signal:
kill -USR1 341
but I do not want to end up cancelling the fsck process, if it does not support the -USR1 signal. Obviously I can not check to be sure, so that is why I asked you in case you know.

I have done limited experiments and it worked (when invoked from the command line)

Other than that, I will keep waiting... Thanks for everything...
Eduardo

Good luck (and may the force be with you :-)

Eduardo Cabrera

Jan 15, 2017, 2:07:17 AM

to al...@googlegroups.com

Hello Joao,

It's been 8 days and 14 hrs of running fsck.ext3, and it is still cranking (the process goes from STAT R to D and back to R as it waits for IO activity). Remember that 3 days ago I tried deleting the 155MB check-sdb2.log, as nothing was being written to it anymore. I also recreated it using touch /tmp/check-sdb2.log, so now it exists with size 0. I assumed the new file is no longer what fd 5 in your call to fsck.ext3 refers to, so it would not work, but I gave it a try. It did not work; nothing is writing to the new /var/log/check-sdb2.log file.

I experimented with fsck.ext3 PID 341

kill -USR2 341

kill -USR1 341

to see if it would start writing to check-sdb2.log, but it did not. Then I tried its parent, fsck PID 339:

kill -USR2 339

kill -USR1 339

It seems it did not recognize the signal, as it stopped (went to STAT Z = zombie). Now the parent (PPID) of the fsck.ext3 process is 1! So it is detached from fsck PID 339, and therefore detached from the "hot_aux.sh fsck -p defaults sdb2 ext3" shell command process, PID 327, that started it.

With no way to find out what fsck was doing (or whether it was progressing at all) I almost killed it. But, after much searching on how to see a process's output, I came up with this.

http://stackoverflow.com/questions/715751/attach-to-a-processes-output-for-viewing

look answer 3 with 106 votes. So I did:

[root@DNS-323-583F3C]# kill -USR1 341

[root@DNS-323-583F3C]# cd /proc/341/fd

[root@DNS-323-583F3C]# ls

0 1 2 3 5

[root@DNS-323-583F3C]# tail -f 5

and this is what I got:

2 5660312 15099941 /dev/sdb2

2 5660313 15099941 /dev/sdb2

2 5660314 15099941 /dev/sdb2

2 5660315 15099941 /dev/sdb2

2 5660316 15099941 /dev/sdb2

2 5660317 15099941 /dev/sdb2

2 5660318 15099941 /dev/sdb2

2 5660319 15099941 /dev/sdb2

2 5660320 15099941 /dev/sdb2

2 5660321 15099

which are the same last lines the original 155MB check-sdb2.log had before I deleted it. The funny thing is that the file is no longer on the drive, but a

more 5

command seems to show all 155MB of lines (starting at line 1 of the original check-sdb2.log file). Puzzling. It only shows this when the fsck process goes to STAT R, not while it is in STAT D.

So, if the file does not exist, why is it showing it? Also, I am seeing just the "trash" version of the check-sdb2.log file, not the current output (what was supposed to be appended to the check-sdb2.log file).

Anyway, I will continue looking at this.

Regards,

Eduardo

Alt-F 0.1RC5 Status Page7.pdf

Screen Shot 2017-01-15 at 12.27.41 AM.png

Eduardo Cabrera

Jan 15, 2017, 2:34:00 AM

to al...@googlegroups.com

Hello Joao,

It seems I was heading in the right direction.

After my prior post, I searched around for how to truncate fd 5, the file this process was attached to, as I thought it was still 155MB long and fsck could not write to it due to its size / the time needed to seek to the end of it to add its current state. Truncate is not in Busybox, so I tried the command below from the directory /proc/341/fd, which lists the file descriptors of process 341, the fsck.ext3 process.

[root@DNS-323-583F3C]# cd /proc/341/fd

[root@DNS-323-583F3C]# echo -n > 5

This actually worked, and made the file behind fd 5 a zero-size file (it was originally /tmp/check-sdb2.log, which I erased and replaced, but the new file is not linked to fd 5, so fd 5 must still point to the now empty file on the swap drive).

After I did this, while still on the directory /proc/341/fd, I executed:

[root@DNS-323-583F3C]# tail -f 5

and got:

2 9738042 15099941 /dev/sdb2

2 9738043 15099941 /dev/sdb2

2 9738044 15099941 /dev/sdb2

2 9738045 15099941 /dev/sdb2

2 9738046 15099941 /dev/sdb2

2 9738047 15099941 /dev/sdb2

2 9738048 15099941 /dev/sdb2

2 9738049 15099941 /dev/sdb2

2 9738050 15099941 /dev/sdb2

2 9738051 15099941 /dev/sdb2

Bingo! This is now telling me that the process is still going, at step 2 with 64.49% progress, and I see new output coming into the file, so I know it keeps progressing and is not stuck!

I am more confident that my data is still being checked and is hopefully safe. Stopping it and running it on a much more powerful multicore machine with 8GB of RAM is always an option, but I read everywhere that sending it a TERM signal might be risky. Still, all this might help you modify your script so that it does not append to the check-xxxx.log file, but maybe rewrites it as every new line comes in, or maybe does just what I did: truncate the file by rewriting it with an echo -n to the file. Just a thought.

Further information:

[root@DNS-323-583F3C]# ls -l

total 0

lr-x------ 1 root root 64 Jan 15 06:08 0 -> /dev/null

l-wx------ 1 root root 64 Jan 15 06:08 1 -> pipe:[331]

l-wx------ 1 root root 64 Jan 15 06:08 2 -> pipe:[331]

lrwx------ 1 root root 64 Jan 15 06:08 3 -> /dev/sdb2

lrwx------ 1 root root 64 Jan 15 06:08 5 -> /tmp/check-sdb2.log (deleted)

I wonder if I could undelete the actual file referenced by fd 5, or maybe redirect fd 5 to a new log file by executing the following in directory /proc/341/fd:

5>/tmp/check-sdb2.log

(note that this would not append but keep just the last line, i.e. the current status only)

so that the new version of check-sdb2.log starts updating and the status page shows the completion percentage again? That would save me having to calculate the progress % manually.

So, I will let it run longer, and maybe later I will get a grip of my changes and send a TERM to the process and fsck the disk in another machine. But I am not there yet.

Regards, and may the force be with me!

Eduardo


Eduardo Cabrera

Jan 15, 2017, 5:00:51 AM

to al...@googlegroups.com

Hello Joao,

Also, a couple more comments (from the directory /proc/341/fd):

tail -f 5 > /tmp/check-sdb2.log

copied the latest 10 lines to the check-sdb2.log file, and now the status page shows the progress at 65%

I tried:

exec 5 > /tmp/check-sdb2-new.log

This did not work: it did not redirect the output of fd 5 to that new file, as I had hoped. It actually kicked me out of the terminal session when it errored out, as shown below:

[root@DNS-323-583F3C]# exec 5 > /tmp/check-sdb2-new.log

-sh: exec: line 86: 5: not found

Connection closed by foreign host.

Even though I did do a

echo -n > 5

which made the logging to the erased log behind fd 5 work again (I thought I had made the erased log file zero length), this is what I got when I copied it to the other drive, which is OK:

[root@DNS-323-583F3C]# cat 5 > /mnt/sda2/check-sdb2.tmp.log

[root@DNS-323-583F3C]# cd /mnt/sda2/

[root@DNS-323-583F3C]# ls -l

total 323480

-rw-rw-rw- 1 root root 43368 Jan 6 16:02 alt-f.log

-rw-r--r-- 1 root root 163295232 Jan 12 00:04 check-sdb2.log

-rw-r--r-- 1 root root 167567396 Jan 15 09:39 check-sdb2.tmp.log

[root@DNS-323-583F3C]#

You can see that the copied file (tmp) is actually about 4 MB larger than the file was when it stopped working. Maybe it's because the file was truncated to zero, but the fsck process still had its write pointer at 163MB when it started writing again, hence most of the file is filled with zeroes.

I will continue to let it run. I will go catch some sleep now.

Regards,

Eduardo

João Cardoso

Jan 15, 2017, 1:02:14 PM

to Alt-F

On Sunday, 15 January 2017 10:00:51 UTC, Eduardo Cabrera wrote:

Hello Joao,

Also, another couple of comments (form the directory /proc/341/fd):
tail -f 5 > /tmp/check-sdb2.log

Yes, that works.

Before I removed /tmp/check-sdb2.log in my early tests, I tried to zero it with 'echo > /tmp/check-sdb2.log', trying to truncate the file, but that didn't work, so I ended up removing it. I was not sure that it would work, as files opened by more than one process have a "reference count" on them, and they are only effectively removed when all the processes referencing them terminate or close them. I forgot to examine /proc/<pid>/fd, or I would have noticed that the 155MB '5' file still existed.
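The reference-count behaviour is easy to reproduce on any Linux shell. The following is only a sketch: the descriptor number 7 and the temporary file are arbitrary choices for the demonstration, and it relies on the Linux /proc filesystem.

```sh
#!/bin/sh
# Linux-only demonstration: a deleted file lives on while a process still
# holds it open, and /proc/<pid>/fd/<n> still reaches its contents.
f=$(mktemp)
exec 7<>"$f"              # this shell now holds the file open on fd 7
echo "still here" >&7
rm "$f"                   # the name is gone, the data is not
ls -l "/proc/$$/fd/7"     # the link target ends in "(deleted)"
content=$(cat "/proc/$$/fd/7")
echo "$content"
exec 7>&-                 # closing the last reference frees the space
```

This matches the "5 -> /tmp/check-sdb2.log (deleted)" entry seen in /proc/341/fd and explains why 'more 5' could still show the whole old log.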

You are right later on when you say that fsck has a "pointer" into the file: each open file has a current write offset, kept by the OS, and a process writes at its own offset no matter what other processes do to the file. So, when more than one process works on the same file, even if one truncates it (echo > file), the other process (fsck, roughly) continues at its previously known end when it next writes, and the gap before that point reads back as zeroes. That is what happens when the file is not opened in "append" mode, which is probably the case here; in append mode every write goes to the file's current end, wherever that is.

Keeping log file sizes under control is a tough task. The usual practice is to rotate logs: create a new file when the current one exceeds a specified size, and save the current one under a new name. Another strategy is to have "ring" files, which wrap around to the beginning when the specified size is reached. But either is generally done at the application (program) level, by designing and writing it that way from the very beginning.

Programs that do open their log files in append mode cope with external truncation, so truncating the log from outside works for them.
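The effect of an external truncation on the two kinds of writer can be reproduced with a small experiment. This is only a sketch: it uses a temporary file and the shell's own fd 5 as a stand-in for the fsck log writer, and nothing in it is taken from Alt-F's scripts.

```shell
#!/bin/sh
# Stand-in for the log file; fd 5 plays the role of the long-running writer.
log=$(mktemp)

# Case 1: writer opened without append mode; it keeps its own offset.
exec 5> "$log"
printf 'aaaaaaaaaa' >&5          # writer offset is now 10
: > "$log"                       # external truncation, file size is 0
printf 'b' >&5                   # lands at offset 10, leaving a hole of zeroes
size_plain=$(( $(wc -c < "$log") ))
exec 5>&-

# Case 2: writer opened in append mode; every write goes to the current end.
: > "$log"
exec 5>> "$log"
printf 'aaaaaaaaaa' >&5
: > "$log"                       # external truncation again
printf 'b' >&5                   # appended at the new end (offset 0)
size_append=$(( $(wc -c < "$log") ))
exec 5>&-

rm -f "$log"
echo "plain=$size_plain append=$size_append"   # prints: plain=11 append=1
```

In the first case the file jumps back to its old length plus the new data, which matches a log that looks mostly zero-filled after being emptied.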

Alt-F has a simple cron-based script that tries to keep log files up to a given size, discarding the beginning and keeping the last entries, but that only works with some programs and not with others (Services->System, Cleanup, the /usr/sbin/cleanup script). Basically, what it does is save the last 32KB of the log to a new file and then overwrite the original log with it. It doesn't work for every log file, as explained before.
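The "keep the last 32KB" idea can be sketched as follows. The `trim_log` function is a hypothetical helper written for illustration; it is not the actual /usr/sbin/cleanup script, whose details are not shown in this thread.

```shell
#!/bin/sh
# trim_log FILE [BYTES]: keep only the last BYTES bytes of FILE (default 32 KB).
# Hypothetical helper, not the real Alt-F cleanup script.
trim_log() {
    log=$1
    keep=${2:-32768}
    tmp="$log.trim.$$"
    # Save the tail to a new file, then overwrite the original in place,
    # so the log keeps the same inode.
    tail -c "$keep" "$log" > "$tmp" && cat "$tmp" > "$log"
    rm -f "$tmp"
}

# Demo on a throwaway file with 5000 made-up log lines.
demo=$(mktemp)
i=0
while [ "$i" -lt 5000 ]; do
    echo "log line number $i"
    i=$((i + 1))
done > "$demo"

trim_log "$demo" 32768
after=$(( $(wc -c < "$demo") ))
last=$(tail -n 1 "$demo")
rm -f "$demo"
echo "size after trim: $after, last line: $last"
```

Cutting at a byte boundary usually leaves a partial first line, which is harmless when only the latest entries matter.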

In the fsck case, as I never had log files as huge as yours, I didn't worry much; I thought that it was a reasonable compromise between having minimal user feedback and a small processing and memory penalty. But with 155MB files I might have to rethink that... not sure how... any hints?

copied the latest 10 lines to the check-sdb2.log file, and now the status page shows the progress at 65%

I tried:
exec 5 > /tmp/check-sdb2-new.log
this did not work,

No, that doesn't work: 'exec' either replaces the current shell process with another command or redirects *its* own open files, not another process's open files.

did not redirect the output of the fd 5 to that new file, as I would have hoped. It actually kicked me out of the terminal session when it errored out as shown below:
[root@DNS-323-583F3C]# exec 5 > /tmp/check-sdb2-new.log

-sh: exec: line 86: 5: not found

Connection closed by foreign host.
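A note on the syntax: with a space between the `5` and the `>`, the shell reads `exec 5 > file` as "replace this shell with a command named 5, sending its output to file", which explains the "5: not found" message and is likely why the session ended. Without the space, `exec 5> file` opens fd 5 in the current shell only. A minimal sketch, using a temporary file as a stand-in:

```shell
#!/bin/sh
out=$(mktemp)

exec 5> "$out"        # no space between 5 and >: opens fd 5 in *this* shell
echo "hello" >&5      # write through the shell's own fd 5
exec 5>&-             # close fd 5 again

content=$(cat "$out")
rm -f "$out"
echo "$content"       # prints: hello
```

Either way, this only affects the shell that runs it; it cannot re-point a descriptor that another process, such as fsck, already holds.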

Even though I did do a
echo -n > 5
which made the logging to the erased fd 5 log work again (I thought I made the erased log file zero length) when I copied to the other drive which is OK:
[root@DNS-323-583F3C]# cat 5 > /mnt/sda2/check-sdb2.tmp.log

[root@DNS-323-583F3C]# cd /mnt/sda2/

[root@DNS-323-583F3C]# ls -l

total 323480

-rw-rw-rw- 1 root root 43368 Jan 6 16:02 alt-f.log

-rw-r--r-- 1 root root 163295232 Jan 12 00:04 check-sdb2.log

-rw-r--r-- 1 root root 167567396 Jan 15 09:39 check-sdb2.tmp.log

[root@DNS-323-583F3C]#

You can see the copied file (temp) is actually about 4 MB larger than the file was when it stopped working (167,567,396 vs. 163,295,232 bytes). Maybe it's because the file was truncated to zero length, but the fsck process still had its write pointer at 163 MB when it wrote again, so most of the file is filled with zeroes.

I will continue to let it run. I will go catch some sleep now.

Nice dreams (like having fsck finished when you wake up and no files under /mnt/sdb2/lost+found)

Eduardo Cabrera

Jan 15, 2017, 9:37:04 PM

to Alt-F

Hello Joao,

Thanks for the writeup, and for having read all that I wrote and experimented with. This process is really pushing me to learn and think outside the box. I do THANK YOU for all your wisdom. If there is any way, shape or form I can help you, let me know.

In the meantime, let me inform you that after 9 days and 10 hrs, I am at pass 2 with 71.63% completion. I will keep emptying the log, monitoring progress manually and waiting. Do you know if, given the condition of my files, pass 3, 4 or 5 will be longer or shorter than pass 2?

Regards,

Eduardo

Eduardo Cabrera

Jan 17, 2017, 12:15:50 AM

to Alt-F

Hello,

Update: After 10 days, 12 hours, FSCK is on Step 2 at 81% and still going strong. To simplify my checking on progress via the status page, I created a little fsckcorrect.sh script, to clear the original (deleted) check-sdb2.log File Descriptor 5 (still pointing to the open erased file), and copy the last line of fresh data to a new /tmp/check-sdb2.log after appending a time stamp. This is the file the Alt-F status checks. Timestamps can tell me how fast the progress has been in recent hours.

[root@DNS-323-583F3C]# cat /mnt/sda2/fsckcorrect.sh

#! /bin/sh

echo -n > /proc/341/fd/5

sleep 10

date >> /tmp/check-sdb2.log

tail -n 1 /proc/341/fd/5 >> /tmp/check-sdb2.log

[root@DNS-323-583F3C]#
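With timestamped readings like these, a rough estimate of the time left in the current step is simple arithmetic. The sketch below uses the two readings reported in this thread (71.63% after 9 days 10 hours, 81% after 10 days 12 hours) and assumes progress within step 2 stays linear, which is not guaranteed.

```shell
#!/bin/sh
# Two readings: percent done in step 2 and elapsed hours since fsck started.
p1=71.63; h1=226     # 9 days 10 hours
p2=81;    h2=252     # 10 days 12 hours

eta=$(awk -v p1="$p1" -v h1="$h1" -v p2="$p2" -v h2="$h2" 'BEGIN {
    rate = (p2 - p1) / (h2 - h1)        # percent per hour
    printf "%.1f", (100 - p2) / rate    # hours left in this step
}')
echo "about $eta hours left in step 2"  # prints: about 52.7 hours left in step 2
```

This says nothing about the later passes, whose duration depends on other properties of the filesystem.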

By the way, I copied temporarily

cat /proc/341/fd/5 > temp.log

to check the size:

-rw-r--r-- 1 root root 238656298 Jan 17 04:28 temp.log

So the file is now 238 MB. Running tail -n 1 on the file takes a long while, so no wonder the Status page takes a while parsing to the last line to show the progress % on large volumes. If the log were just 1 line long (fsck replacing the line in the log file instead of appending), then it would load the % much faster.
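One possible way to make that lookup cheaper is to ask for only the last few kilobytes and take the last line of those. Whether this is actually faster depends on the tail implementation on the box, so treat it as an idea to try; the `last_line` helper and the made-up progress lines below are for illustration only.

```shell
#!/bin/sh
# last_line FILE: print the last line of FILE, looking only at its final 4 KB.
# Hypothetical helper; assumes log lines are much shorter than 4 KB.
last_line() {
    tail -c 4096 "$1" | tail -n 1
}

# Demo on a throwaway file with made-up progress lines.
demo=$(mktemp)
i=1
while [ "$i" -le 5000 ]; do
    echo "2 $i 5000 /dev/sdX"
    i=$((i + 1))
done > "$demo"

result=$(last_line "$demo")
rm -f "$demo"
echo "$result"        # prints: 2 5000 5000 /dev/sdX
```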

Do you know if I signal

kill -USR2 341

to make it stop sending data to fd 5 would it be significantly faster? Just asking.

Regards,

Eduardo

I entered an hourly cron job to do this automatically:

[root@DNS-323-583F3C]# crontab -l

48 0 * * 6,2 /usr/bin/news.sh #!# Alt-F cron

47 0 * * * /usr/sbin/adjtime -adjust #!# Alt-F cron

@hourly /mnt/sda2/fsckcorrect.sh

So, I am not pulling the plug on the fsck process yet. Thanks for your support and follow up.

Regards,

Eduardo

Alt-F 0.1RC5 Status Page8.pdf

João Cardoso

Jan 17, 2017, 12:41:01 PM

to Alt-F

On Tuesday, 17 January 2017 05:15:50 UTC, Eduardo Cabrera wrote:

Hello,

Update: After 10 days, 12 hours, FSCK is on Step 2 at 81% and still going strong. To simplify my checking on progress via the status page, I created a little fsckcorrect.sh script, to clear the original (deleted) check-sdb2.log File Descriptor 5 (still pointing to the open erased file), and copy the last line of fresh data to a new /tmp/check-sdb2.log after appending a time stamp. This is the file the Alt-F status checks. Timestamps can tell me how fast the progress has been in recent hours.

[root@DNS-323-583F3C]# cat /mnt/sda2/fsckcorrect.sh

#! /bin/sh

echo -n > /proc/341/fd/5

sleep 10
date >> /tmp/check-sdb2.log

tail -n 1 /proc/341/fd/5 >> /tmp/check-sdb2.log

[root@DNS-323-583F3C]#

By the way, I copied temporarily
cat /proc/341/fd/5 > temp.log
to check the size:
-rw-r--r-- 1 root root 238656298 Jan 17 04:28 temp.log
So the file is now 238 MB. Running tail -n 1 on the file takes a long while, so no wonder the Status page takes a while parsing to the last line to show the progress % on large volumes. If the log were just 1 line long (fsck replacing the line in the log file instead of appending), then it would load the % much faster.

Do you know if I signal
kill -USR2 341
to make it stop sending data to fd 5 would it be significantly faster? Just asking.

No, I don't know, a new log line is produced every couple of seconds, but the situation described in the manual page is not exactly identical to yours, as initially option -C5 was given and there is no control terminal to fsck.

I also don't know how the interaction between producing a log file and a progress bar output works. There are two processes: the parent fsck, which detects the filesystem type, and the child fsck.ext3 (in your case) that it launches. The log is produced by the child and the parent only shows the child's progress bar (even when not producing a log file, internal pipes might be used). So sending the signal to the parent or to the child might have different outcomes.

I remember you saying that one of them ends up belonging to init (PID 1), so only experimentation, knowledge or close examination of the source code could tell.

I observed that sending SIGUSR2/SIGUSR1 to the parent process pauses/resumes the log being generated, although not immediately, so you might save swapping from happening by pausing the log for one hour and resuming it a few (tens of?) seconds before getting its last line.

Also, the time spent in each step depends on your fs characteristics, such as the number of files, file lengths, the folder hierarchy (wide or deep), etc., so I don't think the timing of one step gives strong indications of the timing of the others.

But you can test all this on your other filesystems, even if mounted, by giving the '-fn' option to fsck (force checking, but don't change).

I'm running

fsck -fn -C5 /mnt/md1 2>&1 5<> /tmp/po.log

on my home RAID real data mounted fs to perform these experiments.

Eduardo Cabrera

Jan 18, 2017, 11:31:38 AM

to al...@googlegroups.com

Hello Joao,

I was getting enthusiastic about being close to completion (last night it was at Step 2, 85%), but this morning I got an unresponsive box. It no longer shows the Alt-F status webpage, and when I try to telnet into it, it gets stuck after connecting:

$ telnet 192.168.1.130

Trying 192.168.1.130...

Connected to 192.168.1.130.

Escape character is '^]'.

It does not get to ask for user login. I left it waiting 30 minutes, in case it was busy but to no avail. So I did:

^]

telnet> status

Connected to 192.168.1.130.

Operating with LINEMODE option

No line editing

No catching of signals

Special characters are local values

Local character echo

No flow control

Escape character is '^]'.

^]

telnet> close

Connection closed.

So, it does connect, but nothing else. Now I wonder whether it is just extra busy or hung (but it would not complete a connection in that case).

If you recommend to restart the device, which method and in which order should I do it.

I thought trying:

1-Try keep pressing the front button by 6 seconds, the left led will start flashing, then release the button for the box to do an ordered shutdown. (unmount and kill all processes) right?

2-IF that does not work, press the reset button in the back for 10 secs to get another try at telnet port 26.

3-IF that does not work, then I would try to reset via pressing the reset button for 20 secs and see if leds go off. At that point unplug the device.

4-After attempting that, I will just unplug.

Remember my goal is to shut down as cleanly as possible to get the drive out so I can do an fsck of the drive on another machine.

Let me know if you think of some other way.

Regards,

Eduardo

João Cardoso

Jan 18, 2017, 3:08:20 PM

to Alt-F

On Wednesday, 18 January 2017 16:31:38 UTC, Eduardo Cabrera wrote:

Hello Joao,

I was getting enthusiastic about being close to completion (last night it was at Step 2, 85%), but this morning I got an unresponsive box. It no longer shows the Alt-F status webpage, and when I try to telnet into it, it gets stuck after connecting:

$ telnet 192.168.1.130
Trying 192.168.1.130...
Connected to 192.168.1.130.
Escape character is '^]'.

It does not get to ask for user login. I left it waiting 30 minutes, in case it was busy but to no avail. So I did:

^]
telnet> status
Connected to 192.168.1.130.
Operating with LINEMODE option
No line editing
No catching of signals
Special characters are local values
Local character echo
No flow control
Escape character is '^]'.
^]
telnet> close

Connection closed.

So, it does connect, but nothing else. Now I wonder whether it is just extra busy or hung (but it would not complete a connection in that case).

If you recommend to restart the device, which method and in which order should I do it.
I thought trying:

I would try 2 first (>10 but <20 secs).

If the orange leds start blinking it is good sign, even if you can't telnet. If telnet on port 26 does work, 'poweroff' will do a clean shutdown, or you might want to issue the terminating commands yourself.

Any of the three methods should start the orange leds blinking; if that does not happen, I'm afraid that the controlling daemon (sysctrl) is dead or unresponsive.

In any case, the clean shutdown (among other things) stops all services, then softly kills the remaining processes (it might take a while to kill fsck), then force-kills whatever is left, then unmounts filesystems or remounts them read-only, and finally does either 'reboot' or 'poweroff'. In any case, the critical fs is not mounted anyway.

Good luck


Eduardo Cabrera

Jan 18, 2017, 9:56:29 PM

to al...@googlegroups.com

Hello Again,

Well, nothing worked, no leds blinking, so I had to pull the plug. I got the drive on a KNOPPIX 7.2 computer, and I am able to see the whole drive. I copied the most important information on the last backup of HD_a2 (totals 700GB). I started the fsck on the drive:

fsck -C -a -t ext3 /dev/sdb2

and after about 4 hours, it was completed. lost+found has 0 files.

It seems everything is there. Thanks for all your help in the past 11 days.

Regards,

Eduardo

Alt-F 0.1RC5 Status Page9.pdf

João Cardoso

Jan 19, 2017, 1:29:35 PM

to Alt-F

On Thursday, 19 January 2017 02:56:29 UTC, Eduardo Cabrera wrote:

Hello Again,

Well, nothing worked, no leds blinking, so I had to pull the plug. I got the drive on a KNOPPIX 7.2 computer, and I am able to see the whole drive. I have not started the fsck yet, as I am copying the most important information on the last backup of HD_a2 (totals 700GB). I will start the fsck on the drive:

fsck -C -a -t ext3 /dev/sdb2

and after about 4 hours,

Yeah, and some people still want the 323 to behave like their PC ;-)

it was completed. lost+found has 0 files.

The force was indeed with you :-)

It seems everything is there. Thanks for all your help in the past 11 days.

Great that everything is now OK.

Now you know that every now and then (Disks->Filesystem, "Set mounted filesystems to be checked every") the filesystem will be checked at boot time.

So, now that the fs is clean and fixed and the backups are done, you might want to see how long fsck takes to check the fs on the 323... just to know what to expect the next time (you can use System->Utilities, RebootAndCheck)

If you do that, please let us know how long it takes.

And after all this odyssey, don't forget to apply the "network corruption fix".

Eduardo Cabrera

Jan 19, 2017, 2:02:18 PM

to al...@googlegroups.com

Hello Joao,

Yes. Now I know and will provide the timing of a full fsck. BTW, the fsck reported more than 236 mounts since the last time it was run. No wonder there was a lot of checking to do.

Also, I did apply the mandatory "network corruption fix", and verified it after shutdown.

Let me make a suggestion: maybe we can warn more "strongly", and make suggestions (like running fsck on a separate system) prior to installing on a NAS with drives larger than 500GB, as was my case. Here is the suggested addition for the "how to install" page:

WARNING FOR LARGER DRIVES: Consider performing a fsck on a separate machine prior to installation. If you have drives larger than 500GB with 400GB+ of data, especially if you run rsync to back up one volume to the other (as this creates a large number of hard links), and have been doing "unclean" reboots / shutdowns, i.e. by using the DLINK firmware, the file systems are "not clean". A file system check on those drives is mandatory for file integrity. You can run something like "fsck -V -a -M -C /mnt/sda2" on a spare system that is more powerful than the limited NAS hardware. You would temporarily install the drives on that system via SATA drive ports / cables or a USB-SATA adapter. A Knoppix live CD (even from USB, with boot code "knoppix 2") might be helpful to get a single-user console-mode Linux system on any computer/laptop. This step could literally save you days of anxious waiting, and a possible lock-up of the NAS at the end, as otherwise the Alt-F firmware would perform the required fsck operation on the slower and memory-constrained NAS hardware. Please note that this step is necessary only for the first boot after installing the firmware. It's a regular maintenance process that should have been run regularly, but the DLINK firmware did not do it automatically, and every shutdown / reboot done was unclean. You can read more at https://groups.google.com/forum/#!topic/alt-f/AjikxYbdH_k about what happens when this precautionary step is not taken before the firmware install. It's the difference between almost 11 days of crossing fingers and a 4-hour fsck on a more powerful system.

I know you warn about the size issue on the section "diagnosing installation problems", but I thought a warning to take action prior to the upgrade is in order.

That is all. Now I will get to enjoy this NAS and its newly found capabilities. I will start to digest much of this forum for know-how and tips.

I can not thank you enough for just sticking by me thru this process.

Regards,

Eduardo

Eduardo Cabrera

Jan 19, 2017, 3:00:17 PM

to al...@googlegroups.com

Hello Joao,

I might have spoken too soon. Last night, after configuring the NAS, I did a reboot and check. The file system check seemed to progress quickly (2TB drive in 15-20 mins, but it's empty). Now the 1.5TB drive, with 800GB of actual data and TONS of hard links, has already taken 13 hours and is just at step 2: 2%. So, at about 5% a day, it would take about 20 days. Since it's doing:

fsck.ext3 -fp -C5 /dev/sdb2

it will not matter that the fs shows clean. Maybe my condition is a corner case, due to the thousands of hard links created by backing up with rsync. But it's certainly a long wait. Now I am back to waiting, but I know I have my data. I believe I could send a TERM signal to the process, as it was a clean system to start with, repeat the process of completing an fsck on another system, and see if I can erase a lot or most of the hard links I have in there. Maybe that is my problem, but I can not tell for sure.

Do you know of any benchmark of the forced fsck process on DNS323 hw? Maybe getting 2TB+ drives, even though possible, is not practical on this platform.

I will experiment with ext4 format, as I am reading it has significant improvement specifically during fsck (http://www.pointsoftware.ch/en/4-ext4-vs-ext3-filesystem-and-why-delayed-allocation-is-bad/).

"A good thing with ext4 is that ‘fsck’ is always very fast; on the other hand, ‘fsck’ on a big ext3 with a huge amount of files can take hours of downtime to complete. Other new features of ext4 include journal checksum, pre-allocation of blocks and extents to reserve an adjacent list of block, helping to reduce fragmentation. It also includes delayed allocation to improve write performance; but this has dangerous consequences (more on this below)"

Otherwise, I will resort to running fsck on this volume once a year and having the data unavailable for 20 days (hoping it does not hang close to the end), but since it's a "backup" for rsync, I do not need the data every day. Also, I will not be rebooting the server daily, as I did with the prior firmware, which created this issue to start with.

Thanks again, and let me know any comments you might have. There is nothing we can do about the DNS-323's limited CPU/memory resources, but if we want to make it a practical, more reliable storage option, we might have to suggest drive size / usability limits for it. As you mentioned in your first comment on this issue, it might have been the reason why DLINK disabled it in their firmware. It would be nice to confirm that from the "horse's mouth".

Attached some files to show the current condition:

sdb2 is not mounted

check-sdb2.log is already 11 MB at 2%. It will again be more than 220 MB close to the end, and will stop reporting % at some point around 120 MB.

Finally, it might hang completely like it did last time, closer to the end.

It would be nice to know what others experience with even larger volumes 2TB+, to see if its a reasonable proposition for this platform.

Regards,

Eduardo


Alt-F 0.1RC5 Status Page10.pdf

Screen Shot 2017-01-19 at 2.49.46 PM.png

Screen Shot 2017-01-19 at 2.48.56 PM.png

Eduardo Cabrera

Jan 19, 2017, 5:11:28 PM

to al...@googlegroups.com

Hello Joao,

Once more, I believe my corner case (thousands or maybe millions of hard links due to rsync) might be shared by others, as fsck can hang in an infinite loop. Please read the seventh comment in the thread below. He is talking about a 30-day-long fsck:

https://community.netgear.com/t5/Using-your-ReadyNAS/fsck-has-taken-6-days-so-far/td-p/729875

also, the comments below, they talk about:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=411838

where there is a bug, since corrected, that caused the hang, at least from what I gathered.

At any rate, it seems that running fsck on a more powerful machine is what others also had to do. Or maybe ext4 would be the answer.

I thought I would let you know.

Regards,

Eduardo

On Thursday, January 19, 2017 at 3:00:17 PM UTC-5, Eduardo Cabrera wrote:

Hello Joao,

I might have spoken too soon. Last night, after configuring the NAS, I did a reboot and check. File Systems check seemed progressing quickly (2TB drive in 15-20 mins, but its empty). Now the 1.5TB with 800GB of actual data and TONS of hard links, is already taking 13 hours and is just at step2: 2%. So, it would take a number of days 5% a day, so about 20 days. Since its doing:
fsck.ext3 -fp -C5 /dev/sdb2
it will not matter that the fs shows clean. Maybe my condition is a corner case, due to the thousands of hardlinks created backing up with rsync. But, its certainly a long wait. Now, I am back again waiting, but I know I have my data. I believe I could do a kill term to the process, as it was a clean system to start with. repeat the process of completing an fsck on another system, and see if I can erase a lot or most of the hard links I have in there. Maybe that is my problem, but can not tell for sure.

Do you know of any benchmark of the forced fsck process on DNS323 hw? Maybe getting 2TB+ drives in RAID 0, even though possible, are not practical on this platform.
For me, I will resort to running fsck on this volume once a year and having the data unavailable for 20 days (hoping it does not hang close to the end), but since it's a "backup" for rsync, I do not need the data every day. Also, I will not be rebooting the server daily, as I did with the prior firmware, which created this issue to start with.

João Cardoso

Jan 20, 2017, 12:25:17 PM

to al...@googlegroups.com

On Thursday, 19 January 2017 22:11:28 UTC, Eduardo Cabrera wrote:

Hello Joao,

Once more, I believe my corner case (thousands or maybe millions of hard links due to rsync) might be shared by others, as fsck can hang in an infinite loop. Please read the seventh comment in the thread below. He is talking about a 30-day-long fsck:
https://community.netgear.com/t5/Using-your-ReadyNAS/fsck-has-taken-6-days-so-far/td-p/729875
also, the comments below, they talk about:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=411838

Alt-F uses e2fsprogs-1.41.14, the bug refers to 1.39.

According to the 1.39 to 1.40 change log, the bug has been solved and closed: Fix infinite loop in e2fsck on really big filesystems (Closes: #411838)

And examining the 1.41.14 source code, modifications have been made at that level (and again in later releases).

The reason why Alt-F is not using a more recent e2fsprogs: E2FSPROGS_VERSION:=1.42 this new release is too big! (and uclibc needs ftw, +4KB)

[Added: I compiled e2fsprogs-1.42.13 but it's too big by 73792 bytes; even if removing the inadyn package it's still bigger by 24640 bytes; e2fsprogs-1.43.3 is too big by 131136 bytes. This means that neither fits the dns-323 flash memory space, so there is almost no hope to see a more recent e2fsprogs version in Alt-F]

I still think the problem is aggravated by the box's low memory, which can be alleviated by running fsck from the command line and not generating a log (for a progress bar, use only -C).

But it might as well be some characteristic of your fs that triggers another fsck bug.

Several users have reported using 3TB disks/fs on the 323, and I have a 2TB disk on a DNS-325 (using the same fsck and also using rsync to perform backups), and no issues.

You might want to rebuild/recreate the filesystem on the desktop PC?


Eduardo Cabrera

Jan 23, 2017, 10:52:46 AM

to Alt-F

Hello Joao,

Thanks for your comments and efforts. I am in the process of recovering the files via rsync from sdb2 (old 1.5TB w/ ext2) to sda2 (the new 2.0TB w/ ext4). Once I have finished, I will report back how long an fsck.ext4 takes on the DNS323 hw on the new drive (with 744GB of data but no hard links) vs. the old drive (with about the same amount of data but with thousands or maybe millions of hard links), which so far looks like it would take about 20 days to check on the DNS323.

I will be purging tons of the sda2 hard links. Do you know if there is a limit on how many hard links can point to the same inode? It seems most of my data has upwards of 996 hard links pointing to one inode. I do not know if it can be higher than 999 per inode.

While I am researching, do you know a command that could count how many hard links are listed in a directory recursively? Otherwise, I will figure it out, but thanks if you know it.

Regards,

Eduardo

João Cardoso

Jan 23, 2017, 2:04:59 PM

to al...@googlegroups.com

On Monday, 23 January 2017 15:52:46 UTC, Eduardo Cabrera wrote:

Hello Joao,

Thanks for your comments and efforts.

Meanwhile, I have backported from a later e2fsprogs version the function referred to in the 411838 bug report you posted. *If* that was the cause of your long fsck time, it will be fixed, as it was using floats to perform calculations (and as the box SoC has no FPU, they were implemented in software, which is much slower).

I have also found a way to avoid the large log files and still have the progress status; now the log is created into a pipe, which has a max size of 64KB, and a background job regularly reads/empties the pipe and creates the actual log tail.
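The pipe idea can be sketched roughly like this. It is an illustration only, not the code that went into Alt-F: the fifo and status file names are made up, a loop printing made-up progress lines stands in for fsck, and the reader here keeps just the latest line.

```shell
#!/bin/sh
dir=$(mktemp -d)
fifo="$dir/progress.fifo"      # hypothetical names, for the demo only
status="$dir/status.log"
mkfifo "$fifo"

# Background reader: drain the pipe and keep only the most recent line,
# so the status file never grows.
while IFS= read -r line; do
    printf '%s\n' "$line" > "$status"
done < "$fifo" &
reader=$!

# Stand-in producer writing progress lines into the pipe.
i=1
while [ "$i" -le 100 ]; do
    echo "2 $i 100 /dev/sdX"
    i=$((i + 1))
done > "$fifo"

wait "$reader"                 # reader exits when the producer closes the pipe
last=$(cat "$status")
lines=$(( $(wc -l < "$status") ))
rm -rf "$dir"
echo "status file: $lines line, content: $last"
```

Because the pipe has a fixed capacity, the producer simply blocks when the reader falls behind, instead of filling /tmp with log data.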

I am in the process of recovering the files, via rsync from sdb2 (old 1.5TB w/ ext2) to sda2(the new 2.0TB w/ext4). Once I finished, I will report back how long it takes for an fsck.ext4 on the DNS323 hw on the new drive (with 744GB of data but no hard links) vs. the old drive (with about the same amount of data but with thousands or maybe millions of hard links), which we know so far that it would take about 20 days to do the check on the DNS323.

Please also take note of the log file size.

I will be purging tons of the sda2 hard links.

I'm not sure I understand what you mean by that. A hard link is just a different name for the same file; the file inode contains the number of names using it.

So, the only way to purge hard links is to remove files that refer to the same "final" file, and that will destroy your folder/file structure.

If /folder1/file1 is a hard link to /folder2/file2, and /folder3/file3 is also a hard link to either of them, then file1, file2 and file3 are indistinguishable and have the same inode, and the "final" file inode reference count is three.
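The same situation can be reproduced in a scratch directory. This is a small sketch with throwaway paths that mirror the example names; it also shows that removing one name only lowers the count and leaves the data reachable through the other names.

```shell
#!/bin/sh
dir=$(mktemp -d)
mkdir "$dir/folder1" "$dir/folder2" "$dir/folder3"

echo "data" > "$dir/folder2/file2"
ln "$dir/folder2/file2" "$dir/folder1/file1"   # second name for the same inode
ln "$dir/folder2/file2" "$dir/folder3/file3"   # third name

inode1=$(ls -i "$dir/folder1/file1" | awk '{print $1}')
inode3=$(ls -i "$dir/folder3/file3" | awk '{print $1}')
links=$(ls -l "$dir/folder2/file2" | awk '{print $2}')      # link count column

rm "$dir/folder2/file2"                        # remove one of the three names
links_after=$(ls -l "$dir/folder1/file1" | awk '{print $2}')
content=$(cat "$dir/folder3/file3")

rm -rf "$dir"
echo "links=$links after_rm=$links_after content=$content"
# prints: links=3 after_rm=2 content=data
```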

Do you know if there is a limit of how many hardlinks can be pointing to the same inode?

I don't know, but I would bet that in the worst case the reference count is a 32-bit variable, so about 4 billion (2^32) names could "refer" to it.

I have created 10000 hard links on the box without problem:

[root@dns-323]# ls -l lnk-9999
-rw-r--r-- 10004 jcard users 733151232 Jan 19 18:14 lnk-9999

[root@dns-323]# stat lnk-9999
File: 'lnk-9999'
Size: 733151232 Blocks: 1431944 IO Block: 4096 regular file
Device: 802h/2050d Inode: 6914 Links: 10004

It seems most of my data has upwards to 996 hard links pointed to one. I do not know if It can be higher than 999 per Inode.

While I am researching, do you know a command to could count how many hardlinks are listed in a directory recursively?

I don't know exactly what you want to do. 'ls -l' lists in the second column the number of hard links the file has, and 'ls -li' also shows its inode. Doing a 'ls -Rl /mnt/whatever' does a recursive fs listing, which can take ages... You can also use 'find /mnt/whatever -links +10' to find all files with more than 10 links (10 is just an example; see 'find --help').

[Added: for the 'find' command you might also want to use '-type f', or directories with many sub-directories will appear in the list, as the ref count on directories refers to the number of subdirectories: find /mnt/whatever -type f -links +10 ]
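Building on the find command, one possible way to get counts is to tally both the names that are hard-linked and the distinct inodes behind them. The sketch below runs on a scratch directory with made-up file names; on a real backup volume you would point find at the mount point instead, and it assumes file names without newlines.

```shell
#!/bin/sh
dir=$(mktemp -d)
mkdir "$dir/day1" "$dir/day2" "$dir/day3"

echo "photo" > "$dir/day1/a.jpg"
ln "$dir/day1/a.jpg" "$dir/day2/a.jpg"      # same file, second dated folder
ln "$dir/day1/a.jpg" "$dir/day3/a.jpg"      # same file, third dated folder
echo "doc" > "$dir/day3/only-once.txt"      # a file with a single name

# Names that share their inode with at least one other name.
names=$(( $(find "$dir" -type f -links +1 | wc -l) ))

# Distinct files (inodes) behind those names.
inodes=$(( $(find "$dir" -type f -links +1 -exec ls -i {} + \
            | awk '{print $1}' | sort -u | wc -l) ))

rm -rf "$dir"
echo "names=$names inodes=$inodes"          # prints: names=3 inodes=1
```

On a volume with a thousand dated folders, the gap between the two numbers gives an idea of how much of the directory tree is made of links rather than separate files.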


Jc Connell

Jun 27, 2019, 6:26:48 PM

to Alt-F

I've received a message that my clock is drifting 2x now. I've replaced the battery on the motherboard thinking that it would resolve this issue.

Now that I've received it a second time, I thought I would investigate more. Can I assume from this message that the "Clock is drifting" error message is safe to ignore or not worry about?

On Tuesday, January 10, 2017 at 11:48:02 AM UTC-5, João Cardoso wrote:

Rolf Pedersen

Jun 27, 2019, 6:47:29 PM

to al...@googlegroups.com

On 06/27/2019 03:26 PM, Jc Connell wrote:

I've received a message that my clock is drifting 2x now. I've replaced the battery on the motherboard thinking that it would resolve this issue.

Now that I've received it a second time, I thought I would investigate more. Can I assume from this message that the "Clock is drifting" error message is safe to ignore or not worry about?

Did the message come with advice that you can configure ntp to "run continually as a server" if you repeatedly get the drifting message? ISTR that was the case with my clock drift, and I did so: Services > Network > ntp > Configure, and there has been no clock drift message since. Read the help on the NTP setup page.
Rolf
