--
You received this message because you are subscribed to the Google Groups "Alt-F" group.
To unsubscribe from this group and stop receiving emails from it, send an email to alt-f+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/alt-f.
For more options, visit https://groups.google.com/d/optout.
Joao,

Thanks for the prompt reply and for your comments. I feel much relieved now. I was able to telnet in and run top to see what is going on. Attached is a screenshot of top showing everything that is running. As you can see, fsck is running:

fsck.ext3 -p -C5 /dev/sdb2

which tells me that my precious Volume2 was using ext3 -> journaled (lucky me!), so I will have a much better chance of getting most of my data back. Additionally, it shows that the -p (preen) safe automatic repair is running. Also, I see that there is plenty of activity, most of it I/O operations, so the disk is still cranking, even though the LEDs are still solid orange.
I also attached the current status page of the process, after 36 hrs, we are in Step 2 at 10%. So, it will probably take another 9 more days before it completes.
Patience shall be rewarded...
In the meantime, any comments / suggestions? Logs to watch? Maybe some processes to kill that might speed up the completion?
Hello Joao,

Thanks for your reply. It's been 5 days 17 hrs, and still going.

On your comments:

-----
Don't use the box for anything else. You can even stop samba, use "rcsmb stop", but it is not going to help much. You can also stop smartd, "rcsmart stop". Don't check the "autorefresh" check button in the status page -- *that* slows down the box.
-----

I did stop those two smb processes and smartd. The box is not being used for anything and has UPS power, so it can go for days. Currently I cannot tell what % of step 2 it is at, as the status page now takes a long time to load.

-----
The fsck log growing (/tmp/check*.log) is not helping also, as it keeps growing (what's its current size?). Its only function is to show in the status page the progress completion.
-----

Currently check-sdb2.log is 155.7M in size:

-rw-rw-rw- 1 root root 163295232 Jan 12 10:25 check-sdb2.log

but it seems that it has stopped growing now. The last lines show:

2 5660319 15099941 /dev/sdb2
2 5660320 15099941 /dev/sdb2
2 5660321 15099

and it seems that it did not finish writing the last line, nor put the LF at the end.

FYI, in my flawed wisdom, I copied the check-sdb2.log file to the mounted volume, in the hope of deleting it or creating a new empty one, so that there is more memory / swap (/tmp) and the box does not run out of memory. Does that seem like a good idea? Any thoughts on how to proceed with replacing the log file with a new or shorter version while fsck is running?

Something that caught my eye was the status of step 2, which now showed 37488%. This last piece of information on the status page took a long time to show up (about 1 hr) after the status page was invoked. I noticed that these processes were running before the whole status page was complete:

tr -s \b\r\001\002 \n
cat /tmp/check-sdb2.log
{status.cgi} /bin/sh status.cgi

so I assume it is all part of the status.cgi script and the process of manipulating data to show on the screen. The fact that the check-sdb2.log file is more than 155MB now is causing the delays.

I now believe that the 37488% comes from reading the last line of the check-sdb2.log file, which now shows

2 5660321 15099

so the percentage was computed against 15099 instead of the full number 15099941, which was not completely written to the log file. The real number should have been 37.48% of Step 2 complete.
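The 37488% figure can be reproduced by hand (5660321 / 15099 x 100), and a reader of the log can avoid it by ignoring incomplete lines. A minimal sketch, assuming the log holds "pass current total device" lines separated by newlines like those quoted above; last_progress is a hypothetical helper, not part of Alt-F's status.cgi:

```shell
#!/bin/sh
# Hypothetical helper: print the progress of the last COMPLETE log line.
# Assumes lines of the form "pass current total device", as in the thread.
last_progress() {
    # Only the tail of the file is read, so the size of the log does not matter.
    tail -c 4096 "$1" | awk '
        # A cut-off line such as "2 5660321 15099" has only 3 fields and is skipped.
        NF == 4 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $3 > 0 && $2 <= $3 {
            pass = $1; pct = 100 * $2 / $3; ok = 1
        }
        END {
            if (!ok) exit 1
            printf "pass %s: %.2f%%\n", pass, pct
        }'
}
```

Fed the three lines quoted above, the helper skips the cut-off last line and reports the line before it (pass 2: 37.49%).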
At any rate, the log file check-sdb2.log no longer seems to be growing, so now we are blind to the completion status of the fsck. Any thoughts or ideas on what might be going on? How to restore the logging?

-----
On a desktop linux computer, with 2/4/8GB of RAM fsck would perform much quicker. That is an option for you if you have a linux computer, kill fsck, power off the box, move the disk to the PC and perform the fsck there.
-----

At this stage I would rather not shut it down, and let it finish.

-----
sda4/sdb4 are also waiting to be fsck checked, and they consume a fraction of a second each ten seconds. You can see that in /var/log/hot_aux.log
-----

The hot_aux.log file is 2.7M now, and the last four lines are:

hot_aux sda4 waiting to be fscked
hot_aux sdb4 waiting to be fscked
hot_aux sda4 waiting to be fscked
hot_aux sdb4 waiting to be fscked
so I see what you mean. Is there any way to remove those from the pending fsck status? Attached is also the latest top output. Please note that there are two

sleep 10

processes, which I think are related to the wait for fsck above. Is there a way to stop those processes?

Thank you for all your help, and I hope my data is safe.
Hello Joao,
Just to give you an update: the process is still going strong after 1 week. Attached are the top and status screen data. As you can see, fsck is still going strong and does not seem to have run out of memory/resources.
Question: do you know if the USR1 signal is supported in this version of fsck.ext3?
I would like to see if I can get a sense of progress on the running fsck, as described on:
via sending the signal:
kill -USR1 341
but I do not want to end up cancelling the fsck process if it does not support the USR1 signal. Obviously I cannot test it to be sure, which is why I am asking in case you know.
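For context: the e2fsck man page documents SIGUSR1 as starting progress output and SIGUSR2 as stopping it, but a process with no handler for USR1 is terminated by it, which is the risk raised above. On Linux the set of caught signals can be inspected before sending anything. A minimal sketch, assuming /proc is available and that SIGUSR1 is signal 10 (true on ARM and x86, not on every architecture); the function names are made up for illustration, and whether the fsck build on the box behaves like stock e2fsck is not confirmed here:

```shell
#!/bin/sh
# mask_has_signal MASK SIGNUM: succeed if bit (SIGNUM - 1) is set in a hex
# signal mask as printed in /proc/PID/status. Only the low 32 bits are
# examined, which covers the classic signals 1..31.
mask_has_signal() {
    low=$(printf '%s' "$1" | awk '{ print substr($0, length($0) > 8 ? length($0) - 7 : 1) }')
    [ $(( (0x$low >> ($2 - 1)) & 1 )) -eq 1 ]
}

# catches_signal PID SIGNUM: succeed if the process has installed a handler
# for the signal, according to the SigCgt line of /proc/PID/status.
catches_signal() {
    mask=$(awk '/^SigCgt:/ { print $2 }' "/proc/$1/status" 2>/dev/null)
    [ -n "$mask" ] && mask_has_signal "$mask" "$2"
}

# Possible use (PID 341 is the fsck process from the thread):
#   catches_signal 341 10 && kill -USR1 341
```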
Other than that, I will keep waiting... Thanks for everything...
Eduardo
[root@DNS-323-583F3C]# kill -USR1 341
[root@DNS-323-583F3C]# cd /proc/341/fd
[root@DNS-323-583F3C]# ls
0 1 2 3 5
[root@DNS-323-583F3C]# tail -f 5
and this is what I got:
2 5660312 15099941 /dev/sdb2
2 5660313 15099941 /dev/sdb2
2 5660314 15099941 /dev/sdb2
2 5660315 15099941 /dev/sdb2
2 5660316 15099941 /dev/sdb2
2 5660317 15099941 /dev/sdb2
2 5660318 15099941 /dev/sdb2
2 5660319 15099941 /dev/sdb2
2 5660320 15099941 /dev/sdb2
2 5660321 15099
which were the same last lines of the original 155MB check-sdb2.log before I deleted it. The funny thing is that the file is no longer on the drive, but a
more 5
command seems to show all 155MB of lines (starting at line 1, from the original check-sdb2.log file). Puzzling. It only shows this when the fsck process goes to STAT: R, not while it is in STAT: D.
So, if the file does not exist, why is it showing it? Also, I am seeing just the "trash" version of the file check-sdb2.log, not the current output (or what was supposed to be appended to the check-sdb2.log file).
Anyway, I will continue looking at this.
Regards,
Eduardo
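On the puzzle in the message above: on Linux, deleting a file only removes its name. The data stays on disk until the last process holding it open closes it, and until then it remains reachable through /proc/PID/fd/N. A small self-contained demonstration, assuming a Linux /proc; it uses a scratch file and the shell's own descriptor 5, not the fsck process:

```shell
#!/bin/sh
# Show that a deleted file is still readable while a descriptor is open.
deleted_file_demo() {
    tmp=$(mktemp) || return 1
    exec 5>"$tmp"              # open the scratch file on descriptor 5
    echo "still here" >&5      # write through the descriptor
    rm -f "$tmp"               # the name is gone, the open file is not
    cat /proc/self/fd/5        # cat inherits fd 5 and reads the deleted file
    exec 5>&-                  # closing the last descriptor frees the space
}
```

Calling deleted_file_demo prints the line that was written before the file was removed, which is the same effect as "more 5" showing the old log contents.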
[root@DNS-323-583F3C]# echo -n > 5
[root@DNS-323-583F3C]# tail -f 5
and got:
2 9738042 15099941 /dev/sdb2
2 9738043 15099941 /dev/sdb2
2 9738044 15099941 /dev/sdb2
2 9738045 15099941 /dev/sdb2
2 9738046 15099941 /dev/sdb2
2 9738047 15099941 /dev/sdb2
2 9738048 15099941 /dev/sdb2
2 9738049 15099941 /dev/sdb2
2 9738050 15099941 /dev/sdb2
2 9738051 15099941 /dev/sdb2
Bingo! This now tells me that the process is still going at step 2 with 64.49% progress, and I see new output coming into the file, so I know it keeps progressing and is not stuck!
I am more confident that my data is still being checked and hopefully safe. Stopping it and running it on a much more powerful multicore, 8GB RAM machine is always an option, but I read everywhere that sending it a TERM signal might be risky. Still, all this might help you modify your script to not append to the check-xxxx.log file, but maybe rewrite it as every new line comes in, or maybe do just what I did: truncate the file by rewriting it with an echo -n to the file. Just a thought.
Further information:
[root@DNS-323-583F3C]# ls -l
total 0
lr-x------ 1 root root 64 Jan 15 06:08 0 -> /dev/null
l-wx------ 1 root root 64 Jan 15 06:08 1 -> pipe:[331]
l-wx------ 1 root root 64 Jan 15 06:08 2 -> pipe:[331]
lrwx------ 1 root root 64 Jan 15 06:08 3 -> /dev/sdb2
lrwx------ 1 root root 64 Jan 15 06:08 5 -> /tmp/check-sdb2.log (deleted)
5>/tmp/check-sdb2.log
(note that this would not append but keep just the last line - current status only)
so I can have the new version of the check-sdb2.log start updating, and the status page would then show the completion percentage again? Saving me having to calculate the progress % manually.
So, I will let it run longer, and maybe later I will get a grip of my changes and send a TERM to the process and fsck the disk in another machine. But I am not there yet.
Regards, and may the force be with me!
Eduardo
Hello Joao,

It's 8 days and 14 hrs running fsck.ext3 and it is still cranking (the process goes from STAT: R to D and back to R as it waits for I/O activity). Remember that 3 days ago I tried deleting the 155MB check-sdb2.log, as fsck was not writing to it anymore. Also, I recreated it using touch /var/log/check-sdb2.log, so now it exists and its size is 0. I assumed the fd is no longer the same as the 5 in your call to fsck.ext3, so it would not work, but I gave it a try. It did not work; nothing is writing to the new /var/log/check-sdb2.log file.
Hello Joao,
Also, another couple of comments (from the directory /proc/341/fd):

tail -f 5 > /tmp/check-sdb2.log

copied the latest 10 lines to the check-sdb2.log file, and now the status page shows the progress at 65%.

I tried:

exec 5 > /tmp/check-sdb2-new.log

This did not work; it did not redirect the output of fd 5 to that new file, as I had hoped. It actually kicked me out of the terminal session when it errored out, as shown below:

[root@DNS-323-583F3C]# exec 5 > /tmp/check-sdb2-new.log
-sh: exec: line 86: 5: not found
Connection closed by foreign host.

Earlier I did an

echo -n > 5

which made logging to the erased fd 5 log work again (and which I thought had made the erased log file zero length). Even so, this is what I got when I copied it to the other drive, which is OK:

[root@DNS-323-583F3C]# cat 5 > /mnt/sda2/check-sdb2.tmp.log
[root@DNS-323-583F3C]# cd /mnt/sda2/
[root@DNS-323-583F3C]# ls -l
total 323480
-rw-rw-rw- 1 root root 43368 Jan 6 16:02 alt-f.log
-rw-r--r-- 1 root root 163295232 Jan 12 00:04 check-sdb2.log
-rw-r--r-- 1 root root 167567396 Jan 15 09:39 check-sdb2.tmp.log
[root@DNS-323-583F3C]#
You can see the copied file (temp) is actually about 4 MB larger than the file was when it stopped working. Maybe it's because the file was truncated to zero, but the fsck process still had its write pointer at 163MB when it started again, hence most of the file is filled with zeroes.

I will continue to let it run. I will go catch some sleep now.
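That guess matches how file offsets behave: truncating a file does not move the write position of a descriptor that is already open. Unless the writer opened the log in append mode, its next write lands at the old offset and the gap reads back as zeroes (a sparse file). A minimal sketch on a scratch file; whether fsck's descriptor 5 was opened with or without append is an assumption, not something the thread shows:

```shell
#!/bin/sh
# truncate_offset_demo [append]: print the file size after a writer keeps
# writing to a file that was truncated behind its back.
truncate_offset_demo() {
    tmp=$(mktemp) || return 1
    if [ "${1:-}" = append ]; then
        exec 5>>"$tmp"         # O_APPEND: every write goes to the current end
    else
        exec 5>"$tmp"          # plain write: the descriptor keeps its own offset
    fi
    printf '0123456789' >&5    # offset is now 10
    : > "$tmp"                 # truncate the file from outside, size becomes 0
    printf 'new' >&5           # plain mode writes at offset 10, append at 0
    exec 5>&-
    wc -c < "$tmp" | tr -d ' '
    rm -f "$tmp"
}
```

Without append the size comes out as 13 (ten bytes of hole plus three new bytes); with append it is 3.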
Regards,
Eduardo
[root@DNS-323-583F3C]# crontab -l
48 0 * * 6,2 /usr/bin/news.sh #!# Alt-F cron
47 0 * * * /usr/sbin/adjtime -adjust #!# Alt-F cron
Hello,

Update: After 10 days, 12 hours, fsck is on Step 2 at 81% and still going strong. To simplify checking progress via the status page, I created a little fsckcorrect.sh script. It clears the original (deleted) check-sdb2.log through file descriptor 5 (still pointing to the open, erased file) and copies the last line of fresh data to a new /tmp/check-sdb2.log after appending a time stamp. This is the file the Alt-F status page checks. The timestamps tell me how fast the progress has been in recent hours.

[root@DNS-323-583F3C]# cat /mnt/sda2/fsckcorrect.sh
#! /bin/sh
# truncate the deleted log that fsck (PID 341) still holds open on fd 5
echo -n > /proc/341/fd/5
# give fsck time to write fresh progress lines
sleep 10
# append a timestamp, then the latest progress line, to the log the status page reads
date >> /tmp/check-sdb2.log
tail -n 1 /proc/341/fd/5 >> /tmp/check-sdb2.log
[root@DNS-323-583F3C]#
By the way, I temporarily copied it with
cat /proc/341/fd/5 > temp.log
to check the size:
-rw-r--r-- 1 root root 238656298 Jan 17 04:28 temp.log
So the file is now 238MB. tail -n 1 on the file takes a long while, so no wonder the status page takes a while parsing to the last line to show the progress % on large volumes. If the log were just 1 line long (fsck replacing the line in the log file rather than appending), then the % would load much faster.
Do you know whether, if I send
kill -USR2 341
to make it stop sending data to fd 5, fsck would be significantly faster? Just asking.
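The timestamps written by the script make a rough estimate possible. A sketch, assuming two samples taken during the same pass, each a Unix time in seconds plus the second column of a progress line; eta_hours is a hypothetical name, the extrapolation is linear, and the result says nothing about later passes:

```shell
#!/bin/sh
# eta_hours T1 C1 T2 C2 TOTAL: hours left in the current pass, extrapolated
# linearly from two (unix_time, counter) samples.
eta_hours() {
    awk -v t1="$1" -v c1="$2" -v t2="$3" -v c2="$4" -v total="$5" 'BEGIN {
        if (t2 <= t1 || c2 <= c1) exit 1      # no measurable progress
        rate = (c2 - c1) / (t2 - t1)          # counter units per second
        printf "%.1f\n", (total - c2) / rate / 3600
    }'
}
```

For example, eta_hours 0 100 3600 200 1000 prints 8.0: 100 units were done in one hour and 800 remain.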
Hello Joao,
I was getting enthusiastic about being close to completion (last night it was at Step 2, 85%), but this morning I found the box unresponsive. It no longer shows the Alt-F status webpage, and when I try to telnet into it, it gets stuck after connecting:
$ telnet 192.168.1.130
Trying 192.168.1.130...
Connected to 192.168.1.130.
Escape character is '^]'.
It does not get as far as asking for the user login. I left it waiting 30 minutes, in case it was busy, but to no avail. So I did:
^]
telnet> status
Connected to 192.168.1.130.
Operating with LINEMODE option
No line editing
No catching of signals
Special characters are local values
Local character echo
No flow control
Escape character is '^]'.
^]
telnet> close
Connection closed.
So, it does connect, but nothing else. Now I wonder whether it is just extra busy or hung (but it would not complete a connection in that case).
If you recommend restarting the device, which method should I use, and in which order?
I thought of trying:
...
Hello again,

Well, nothing worked and no LEDs were blinking, so I had to pull the plug. I put the drive in a KNOPPIX 7.2 computer, and I am able to see the whole drive. I did not start the fsck right away, as I was first copying the most important information from the last backup of HD_a2 (700GB in total). I then started the fsck on the drive:
fsck -C -a -t ext3 /dev/sdb2
and after about 4 hours it was completed. lost+found has 0 files.
It seems everything is there. Thanks for all your help in the past 11 days.
WARNING FOR LARGER DRIVES: Consider performing a fsck on a separate machine prior to installation. If you have drives larger than 500GB with 400GB+ of data, especially if you run rsync to back up one volume to the other (as this creates a large number of hard links), and have been doing "unclean" reboots / shutdowns, i.e. by using the DLINK firmware, the file systems are "not clean". A file system check on those drives is mandatory for file integrity. You can run something like "fsck -V -a -M -C /mnt/sda2" on a spare system that is more powerful than the limited NAS hardware. You would temporarily install the drives on that system via SATA drive ports / cables or a USB-SATA adapter. A Knoppix live CD (even from a USB stick, with boot code "knoppix 2") might be helpful to get a single-user console mode linux system on any computer/laptop. This step could literally save you days of anxious waiting, and a possible lock-up of the NAS at the end, as otherwise the Alt-F firmware performs the required fsck operation on the slower and memory constrained NAS hardware. Please note that this step is only necessary during the first boot after installing the firmware. It is a regular maintenance process that should have been run regularly, but the DLINK firmware did not do it automatically, and every shutdown / reboot was unclean. You can read more at https://groups.google.com/forum/#!topic/alt-f/AjikxYbdH_k about what happens when this precautionary step is not taken before the firmware install process. It is the difference between almost 11 days of crossing fingers and a 4 hr fsck on a more powerful system.
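Whether a volume is flagged clean can be read from the superblock before deciding where to run the check, for example with tune2fs -l on the device. A sketch that extracts the state from that output; fs_state is a hypothetical helper and the device name in the comment is only an example:

```shell
#!/bin/sh
# fs_state: read "tune2fs -l DEVICE" output on stdin and print the value of
# the "Filesystem state:" line (for example "clean" or "not clean").
# Exits non-zero if no such line is present.
fs_state() {
    awk -F: '/^Filesystem state:/ {
        sub(/^[ \t]+/, "", $2); print $2; found = 1
    }
    END { exit !found }'
}

# Intended use, as root (device name is an assumption):
#   tune2fs -l /dev/sdb2 | fs_state
```

Note that a forced check (fsck with -f) runs regardless of this flag, so the state mainly tells you whether an unforced check would be skipped.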
...
Hello Joao,
I might have spoken too soon. Last night, after configuring the NAS, I did a reboot and check. The file system check seemed to progress quickly (2TB drive in 15-20 mins, but it is empty). Now the 1.5TB drive, with 800GB of actual data and TONS of hard links, has already taken 13 hours and is just at step 2: 2%. So, at roughly 5% a day, it would take about 20 days. Since it is doing:
fsck.ext3 -fp -C5 /dev/sdb2
it does not matter that the fs shows clean. Maybe my condition is a corner case, due to the thousands of hard links created by backing up with rsync. But it is certainly a long wait. Now I am back to waiting again, but I know I have my data. I believe I could send a TERM to the process, as it was a clean system to start with, repeat the process of completing an fsck on another system, and see if I can erase a lot or most of the hard links I have in there. Maybe that is my problem, but I cannot tell for sure.
Do you know of any benchmark of the forced fsck process on DNS323 hw? Maybe 2TB+ drives in RAID 0, even though possible, are not practical on this platform.
For me, I will resort to getting this volume fsck'ed once a year, and having the data unavailable for 20 days (hoping it does not hang close to the end), but since it is a "backup" for rsync, I do not need the data every day. Also, I will not be rebooting the server daily, as I did on the prior firmware, which created this issue to start with.
Hello Joao,
Once more, I believe my corner case (thousands or maybe millions of hard links due to rsync) might be shared by others, as the fsck can hang in an indefinite loop. Please read the seventh comment below. He is talking about a 30 day long fsck:
Also, the comments below talk about:
...
Hello Joao,

Thanks for your comments and efforts.
I am in the process of recovering the files, via rsync from sdb2 (old 1.5TB w/ ext2) to sda2 (the new 2.0TB w/ ext4). Once I have finished, I will report back how long an fsck.ext4 takes on the DNS323 hw on the new drive (with 744GB of data but no hard links) vs. the old drive (with about the same amount of data but with thousands or maybe millions of hard links), which we know so far would take about 20 days to check on the DNS323.
I will be purging tons of the sda2 hard links.
Do you know if there is a limit of how many hardlinks can be pointing to the same inode?
[root@dns-323]# ls -l lnk-9999
-rw-r--r-- 10004 jcard users 733151232 Jan 19 18:14 lnk-9999
[root@dns-323]# stat lnk-9999
File: 'lnk-9999'
Size: 733151232 Blocks: 1431944 IO Block: 4096 regular file
Device: 802h/2050d Inode: 6914 Links: 10004
It seems most of my data has upwards of 996 hard links pointing to one inode. I do not know if it can be higher than 999 per inode.
While I am researching, do you know a command that could count how many hard links are listed in a directory recursively?
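One possible answer to the counting question, as a sketch: find can select regular files whose link count is above 1. This assumes a find that supports -links (GNU find does; a BusyBox build may not, depending on how it was compiled), and the function name is made up:

```shell
#!/bin/sh
# count_hardlinked DIR: number of regular-file names under DIR whose inode
# has more than one link. Every name is counted, so an inode with three
# names inside DIR contributes 3. Names containing newlines are miscounted.
count_hardlinked() {
    find "$1" -type f -links +1 -print | wc -l | tr -d ' '
}
```

Running it on the backup directory and on the source directory gives a quick feel for how much of the tree consists of rsync-created links; ls -l and stat, as shown above, report the link count of a single file.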
...
I've received a message that my clock is drifting 2x now. I've replaced the battery on the motherboard thinking that it would resolve this issue.
Now that I've received it a second time, I thought I would investigate more. Can I assume from this message that the "Clock is drifting" error message is safe to ignore or not worry about?
To view this discussion on the web visit https://groups.google.com/d/msgid/alt-f/ad34398f-1182-4ba2-8681-a3f5bcc22e17%40googlegroups.com.