File Copies Through NFS Horribly Slow


Craig Holcomb

unread,
Mar 22, 2020, 7:53:39 PM3/22/20
to al...@googlegroups.com

Recently, copying files through an NFS mount from one of my Raspberry Pis to my DNS-320 using the Linux cp command has slowed to a crawl. I don’t believe I have introduced any changes that would cause this; however, as I’m sure many of you have experienced in the past, I will probably have an “oh yeah” moment where I realize I did somehow cause the problem but currently can’t see the forest for the trees.


While monitoring the copy using Nautilus from my Ubuntu desktop, the slowness is painfully obvious: there are long periods of no progress, with occasional spurts of speed every now and then. Very inconsistent!


Anyway, monitoring the DNS-320 using the Alt-F Status page shows a load that starts out normal and then steadily increases to between 5 and 6, with the CPU pegged at 100%. When I look at the System Log I don’t see anything obvious (though I really don’t know what to look for). I looked just now and io shows 75%, which seems high, but I would think you would see high io during a data transfer. I have included a screenshot of the current Status page and of the System Log running processes.


I turned NFS off as an experiment and found that Samba shares work just fine for copying data from the same Raspberry Pi to the same directory on the same NAS drive that NFS seems to have a problem with. That is annoying, because my understanding is that NFS is the faster, better, and more efficient route to take.


If anyone sees something I’m missing or has suggestions for where to look for a problem I am all ears!


Screenshot from 2020-03-22 16-52-43.png


Screenshot from 2020-03-22 16-52-09.png


Jessy Hartman

unread,
Mar 22, 2020, 11:04:32 PM3/22/20
to Alt-F
I'm facing exactly the same issue: mounting the share from a Mac, the CPU goes to 100%.


Craig Holcomb

unread,
Mar 23, 2020, 3:00:46 PM3/23/20
to Alt-F
Now it's happening from another Raspberry Pi to the other DNS-320 drive!  My two NFS shares are pretty much unusable as far as writing is concerned; reading is not a problem. All of this was working just fine until sometime in the last few days, so I'm thinking that something changed on my Alt-F box, but I'm not sure what.  Did you ever have your NFS share working without problems, or did your problems start as soon as you tried to access it the first time?  Another question: do you actually have an NFS share, or are you using Samba?  Samba appears to work for me with no problem (reading, anyway... I guess I need to try writing, especially if I'm going to have to use it until I get my NFS problem fixed). By the way, I also posted on your other thread asking for some related info.

Craig Holcomb

unread,
Mar 23, 2020, 3:25:40 PM3/23/20
to Alt-F
I evidently spoke too soon about reads being okay.  Currently, while two different Raspberry Pis are writing (at an imperceptible rate) to two different NAS drives via NFS shares, my Le Potato (similar to a Raspberry Pi) is trying to access one of those NFS shares to play a video with Kodi.  Not happening!  It took forever to navigate to the directory, but loading the video, well... like I said... not happening!

João Cardoso

unread,
Mar 24, 2020, 2:44:16 PM3/24/20
to Alt-F
I believe that your issues come from file locking in Samba and/or NFS. Don't use both servers in read/write mode on the same folder: Samba and NFS don't know of each other's existence, and file locks set through one can lead to issues/deadlocks on the other, depending on how the locks are done (kernel/advisory/mandatory locks). You don't control how that happens; perhaps mounting the folder read-only over one of the protocols will solve the issue?
NFS is known to have issues with file locking; relevant mount options are 'lock/nolock' and 'sync/async', but those can also cause other issues.
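Something like this on the client could be used to experiment with those options (the server address, export path, and mount point below are just examples):

# on the Raspberry Pi client; address and paths are placeholders
sudo umount /mnt/nfs
sudo mount -t nfs -o nolock,async 192.168.1.10:/mnt/sda2/media /mnt/nfs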

Craig Holcomb

unread,
Mar 24, 2020, 3:44:45 PM3/24/20
to al...@googlegroups.com
Thanks for your response, João; what you say sounds logical.  It is just frustrating that it all worked well for months and then all of a sudden stopped working, especially how the two different NAS drives (each with samba and nfs shares) stopped working at the same time.  Anyway, I have successfully modified one of my scripts that points to one drive to now use samba instead of nfs.  The other script is not cooperating, but it uses rsync, so I'm not sure if maybe rsync has a problem with samba?  Anyway, I'm working on that.

Joao Cardoso

unread,
Mar 24, 2020, 10:18:45 PM3/24/20
to Alt-F


On Tuesday, March 24, 2020 at 7:44:45 PM UTC, Craig Holcomb wrote:
Thanks for your response, João; what you say sounds logical.

But I'm not very convinced myself :-O

The increase in load (processes waiting for CPU) that you see must be the Samba client establishing more connections, and thus more smbd processes.
Looking at 'pstree -p' in those circumstances should show several smbd processes, while in normal circumstances only a couple would appear. Also, 'smbstatus' will list the open connections and file locks, and 'top' should display one or more smbd processes with the 'D' status flag, meaning an uninterruptible process (not even a kill -9 will stop it).
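In shell terms, over ssh on the box while a slow copy is in progress:

pstree -p | grep smbd    # how many smbd processes are running?
smbstatus                # open connections and file locks
top                      # look for smbd stuck in the 'D' state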

I think that an increase in NFS workload should not translate into a higher process load, which is what appears as 'load' on the status page. A bigger NFS workload would instead translate into a slower, non-responding, give-me-a-breath box.

As for the CPU utilization, even displaying the status page is enough to take it to 100%, which is reasonable and expected -- otherwise the CPU would not be in use (and ours is a single core, not an octa-core, not even a multi-threaded CPU).
 
  It is just frustrating that it all worked well for months and then all of a sudden stopped working, especially how the two different NAS drives (each with samba and nfs shares) stopped working at the same time. 

Well, then the root cause is external... something external to the boxes triggered that behaviour.
 
Anyway, I have successfully modified one of my scripts that points to one drive to now use samba instead of nfs.  The other script is not cooperating, but it uses rsync, so I'm not sure if maybe rsync has a problem with samba?

I don't think so. I'm not even sure smb/nfs conflict by themselves, given fast enough IO and CPU.

Craig Holcomb

unread,
Mar 25, 2020, 2:14:17 PM3/25/20
to Alt-F
  It is just frustrating that it all worked well for months and then all of a sudden stopped working, especially how the two different NAS drives (each with samba and nfs shares) stopped working at the same time. 

Well, then the root cause is external... something external to the boxes triggered that behaviour.

I had also thought of this as a possibility when, during my research, I found a page on the web where someone resolved their slow NFS performance by rebooting a switch (of course, I can't find that page now).  I have rebooted my router and my main gigabit switch, which made no difference.  One of the affected Raspberry Pis is connected directly to the router and the other one goes through a 100Mbit switch to the gigabit switch to the router.  Though I did not reboot the 100Mbit switch, it is not a common denominator, as only one of the Pis is attached to it.  All of my switches are unmanaged.
 
Anyway, I have successfully modified one of my scripts that points to one drive to now use samba instead of nfs.  The other script is not cooperating, but it uses rsync, so I'm not sure if maybe rsync has a problem with samba?

I don't think so. I'm not even sure smb/nfs conflict by themselves, given fast enough IO and CPU.
 
  Anyway, I'm working on that.
The problem Samba script appears to have resolved itself after rebooting that Raspberry Pi, so everything is working using Samba now, though I still wonder whether it was faster when NFS was working.

Some of my research has found many people using nfsstat to troubleshoot.  Apparently that is not part of Alt-F or the additionally available Alt-F packages.  Looking at the package manager web pages for Entware-ng and ffp, I don't see a listing of available packages (I thought I saw those listings there before; maybe not), so I don't know if nfsstat is available in those or not. Installing from those would probably introduce additional variables into the whole scenario anyway.  The Debian package nfs-common evidently contains nfsstat, but running Debian on the box would put the scenario outside of Alt-F firmware, right? Trying to diagnose this without any tools is next to impossible, and I'm afraid I'll be stuck with Samba.  Don't get me wrong, Samba is currently working, but I'm sure you know that nagging feeling when problem resolution seems to be just outside your reach.
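From what I've read, nfsstat can also be run on the client side rather than on the NAS; on a Raspberry Pi running Raspbian/Debian, something like this should work (assuming the nfs-common package):

sudo apt-get install nfs-common
nfsstat -c    # client-side NFS/RPC call counters
nfsstat -m    # mount options actually in effect for each NFS mount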

Craig Holcomb

unread,
Mar 26, 2020, 2:29:43 PM3/26/20
to Alt-F
Okay, now things are really getting weird.  Samba is starting to exhibit the same (or at least similar) behaviour.

Between early this morning and a little after noon my time, several automated processes ran on those same two Raspberry Pis I previously mentioned.  One of the Pis runs a backup of its SD card using rsync to a file on the NAS, then spawns a process on the NAS that creates a compressed copy of that file.  The backup and the compressed copy both completed flawlessly, as they have for years now.  It failed the other day when the NFS problem started, but it has now worked for two days without a problem using Samba instead of NFS.  Okay, so no Samba problem with those processes, which completed in a little under 2 hours, which is normal.

On the other Raspberry Pi I have qBittorrent running, and early this morning it started downloading three TV show files, one at a time.  I have a script on that Pi that qBittorrent runs at the completion of a file download.  It was using NFS, but when the problem occurred I switched it to use Samba.  That worked just fine for yesterday's downloads, and it worked on the first two downloads this morning as well.  Evidently a copy was still in progress when I checked on it using Ubuntu Nautilus to the NAS drive over sftp (aha, even a third network access method).  The first TV file that copied successfully this morning was 1.3 GB and the second was 637.3 MB.  The third one apparently hung at 83.5 MB out of the 778 MB that was supposed to copy.

I had no problem using Nautilus/sftp to view the NAS drive containing the rsync backup and compressed copy, but when I first tried to look at the other NAS drive, which contains my TV video files, Nautilus kind of went out to lunch, so I went to look at the Alt-F Status page, and sure enough there was a load of 4.9 and 100% CPU.  After going back and forth between Nautilus and the Alt-F Status page several times, the load and CPU finally dropped back to normal, but the file size remained the same (83.5 MB) on the NAS TV video drive.  I then noticed that the Status page no longer showed the Raspberry Pi connected to the share.

Again, I had been using all of these network access methods (NFS, Samba, sftp) together at the same time, working flawlessly, before the problem started recently, so I doubt the problem is directly related to that.  However, today's problem seemingly coincides with using sftp and Samba at the same time, even though I was only using sftp in a read-only scenario (unless Nautilus wants to lock files you are listing, which does not seem likely).  I will try some more experimenting (maybe even with NFS, so I can try nfsstat on the client side) to see if I can find anything, but currently I'm flabbergasted.

A quick and probably related note: I just tried to recopy that failed file using Nautilus.  The copy started and was moving along just fine, so I tried to view the listing of files in that NAS disk directory using another Nautilus tab, and then both tabs hung for a while: the copy froze and the file listing tab froze.  I came over here to start typing this information, and when I went back to look it was no longer frozen, but the copy had completely stopped.  The file information on the NAS still showed the failed file's size and date/time as before, but now there was a new binary file called .giosavecFNPJA, which might be the copy up to that point (it is 17.5 MB and has a timestamp that matches about when the copy died).  I'm going to try the copy the same way again but will not attempt to display the list of files while it is copying.  This is crazy.  I will report back here later today with more results.

Craig Holcomb

unread,
Mar 26, 2020, 4:52:53 PM3/26/20
to Alt-F
Wow, my NAS shares are now totally worthless.  Neither NFS nor Samba copies work, and now I can't get sftp to work for copies either.  I am able to copy between the Pis and my Ubuntu desktop just fine, but not to the NAS -- at least not at a reasonable rate.

Here is the latest result from a straight copy within a script that worked for 2 other files earlier today from my qBittorrent Pi to the NAS using Samba:

pi@qbitbox:~ $ ./copy_specific_file_to_NAS_temp.sh
cp: error writing '/mnt/cifs/Temporary/some_tv_file.mkv': Bad file descriptor
cp: failed to close '/mnt/cifs/some_tv_file.mkv': Input/output error
pi@qbitbox:~ $ ^C
pi@qbitbox:~ $

I was able to quickly pull that file to my Ubuntu desktop with Nautilus/sftp, but pushing it to the NAS from my desktop results in the same excruciatingly slow copy I was experiencing using NFS (the reason I started this thread).  I have to step away from this for a while, as it is too frustrating to continue and I have other things in my life to get done.  I'll try to see if I can read from the NAS later... it may be that only writes are a problem, I don't know.  Currently I have that file copy in progress from my desktop to the NAS at a snail's pace using Nautilus/sftp.  It is showing 83.8 MB out of 778.0 MB copied, with anticipated completion in 2 hours and 30 minutes (normally a copy like this completes within minutes).  It has actually been incrementing upward, so that is something. I'll check on it later and report back.

Craig Holcomb

unread,
Mar 26, 2020, 8:39:19 PM3/26/20
to Alt-F
The file copy to the NAS using Nautilus/sftp worked, even though it was agonizingly slow.  Also, I just ran a test with a script that used the cp command to copy from the NAS to my Pi using the Samba share.  The speed going that direction seemed normal, and both the CPU load and percentage showed green on the NAS Status page the whole time.  While that script was running I was able to watch its progress by continually refreshing (F5) the file listing of the directory being written to on the Pi, using my desktop Ubuntu file explorer (Nautilus/sftp).  Using the file explorer for this purpose did not seem to slow the progress or cause any other perceptible problems.

At this point it appears that performance problems only occur when writing to the NAS drives, both through the Samba shares and using sftp, and that reading from those drives/shares works just fine.  I believe NFS showed the same behaviour when I still had it turned on.  Hopefully that information can help pinpoint the problem.  I'm going to look for tools to diagnose the Samba performance issue and may also turn NFS back on to use nfsstat from the client side for diagnosis.  It seems likely to me that the Samba and NFS problems (as well as sftp's) are related, and the more diagnosis and information I can glean, the more likely the problem can be found (maybe?).

Craig Holcomb

unread,
Mar 30, 2020, 12:59:19 AM3/30/20
to al...@googlegroups.com

After several days of testing and research, I now believe this is not a network problem but is actually caused by the new 8TB hard drive I installed at the end of January. It is a Seagate ST8000DM004-2CX188 (which I shucked from an external enclosure), and from what I’ve read it is based on SMR technology and therefore not recommended for NAS use. I took a chance and bought it anyway, and it performed very well up until the middle of March. I don’t understand what caused the write speed to slow down seemingly overnight, but I found many online postings from people who experienced almost exactly the same problem with this drive. One posting even alluded to their personal experience with the drive having problems only in Linux and no problems when used with Windows. The external drive is definitely marketed for Windows platforms, with nothing said about Linux in the marketing materials. Buyer beware, I guess.

The network share to the other NAS drive does not have any problems, and I can’t really find where I actually had a problem with that one, so I was mistaken in thinking there was a problem there. I was able to copy a large file across the network to that shared drive with no problem. I then determined that the new drive was the problem after logging into the NAS with SSH and using the cp command to copy that file directly from the working drive to the new drive. That copy exhibited the same performance problem I experienced across the network.  I am marking this issue as complete, although if anyone has knowledge of how to resolve this problem and make this drive work, I would rather do that than buy a new drive. It seems like it should be possible, considering it worked for two months before becoming obstinate.  Anyway, if not, I will probably be acquiring a new drive, which should eliminate this problem.
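For anyone wanting to reproduce this kind of test, the idea is simply to take the network out of the equation; what I ran amounted to something like this (paths are examples, not my real ones):

# over ssh on the NAS itself: disk-to-disk copy, no network involved
time cp /mnt/sda2/Backups/bigfile.img /mnt/sdb2/Temporary/
# or write straight from memory to the suspect disk,
# taking even the source disk out of the picture
# (if your dd supports conv=fsync, to flush before reporting)
dd if=/dev/zero of=/mnt/sdb2/Temporary/testfile bs=1M count=512 conv=fsync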

Joao Cardoso

unread,
Mar 30, 2020, 7:28:43 PM3/30/20
to al...@googlegroups.com

Have you examined its SMART status and performed a SMART long test? Disk->Utilities
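From the command line the equivalent would be something like this (the device name is just an example):

smartctl -H /dev/sdb        # overall health assessment
smartctl -t long /dev/sdb   # start the extended (long) self-test
smartctl -a /dev/sdb        # full attributes, error log, and self-test log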

Have you tried to put the disk back in its enclosure and access it that way, on a normal PC with MS Windows? On a PC because some Seagate USB drives have issues with Alt-F (due to issues in the UAS USB storage kernel module).
Or run the Seagate drive test program against the drive while it's out of the enclosure?

At least you have pinpointed the issue's root cause.

Craig Holcomb

unread,
Apr 1, 2020, 1:17:16 PM4/1/20
to al...@googlegroups.com
I did look at the SMART status from the Alt-F disk utilities link and it showed PASSED.  I have not tried any of your other suggestions yet.  Does a long test leave my existing data alone? Anyway, I'm not in a hurry to buy a new drive, because current prices are way higher than what I paid for this one, so I will be trying every possible solution to make this drive work.  I may have to reduce my video library some (I do have a lot of stuff I can delete) and move things back to the old 4TB drive (though the amount of time involved in the repartitioning and copying is a pain).  Anyway, should I keep posting to this thread for help, open a new thread, or not bother the group with this particular issue?

Craig Holcomb

unread,
Apr 3, 2020, 1:14:26 PM4/3/20
to Alt-F
I ran (or so I thought) the long test yesterday, starting at about 2:47pm my time (Central/USA), and it displayed a message saying it would complete today at 3:08am.  Should I have stayed on that screen?  I didn't, as I assumed I would be able to access the results later.  Looking at the SMART information for that drive, however, I see what appears to be the same information that was displayed before the test.  Is there some indicator I should see in that log to show that the long test was run?


João Cardoso

unread,
Apr 3, 2020, 2:53:33 PM4/3/20
to Alt-F


On Friday, 3 April 2020 18:14:26 UTC+1, Craig Holcomb wrote:
I ran (or so I thought) the long test yesterday, starting at about 2:47pm my time (Central/USA), and it displayed a message saying it would complete today at 3:08am.  Should I have stayed on that screen?

No need; the test is done internally by the drive firmware, and it doesn't affect normal disk access.
But be warned that Seagate says/said that most/some of the SMART results can't be interpreted by us, as they are proprietary! No more Seagate disks for me after reading that.

  I didn't, as I assumed I would be able to access the results later.  Looking at the SMART information for that drive, however, I see what appears to be the same information that was displayed before the test.  Is there some indicator I should see in that log to show that the long test was run?
One more line in the self-test log; compare its LifeTime(hours) with the Power_On_Hours attribute. The most recent entries appear at the top.
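Or, from the shell, the same log can be listed directly (device name is an example):

smartctl -l selftest /dev/sdb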

Craig Holcomb

unread,
Apr 4, 2020, 3:10:20 AM4/4/20
to Alt-F
Okay, well I don't see anything here that slaps me in the face and says: "Hey Dummy!  Here's your problem!"  So I'm not going to ask for your interpretation... just your opinion ;)

smartctl 6.5 2016-05-07 r4318 [armv5tel-linux-4.4.86] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST8000DM004-2CX188
Serial Number:    ZCT1J3QA
LU WWN Device Id: 5 000c50 0c3cdf441
Firmware Version: 0001
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Apr 4 01:59:58 2020 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   080   064   006    Pre-fail  Always       -       111632038
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       297
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   045    Pre-fail  Always       -       93391752
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1575 (166 13 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       17
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   055   040    Old_age   Always       -       34 (Min/Max 30/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       11
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       661
194 Temperature_Celsius     0x0022   034   045   000    Old_age   Always       -       34 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   080   064   000    Old_age   Always       -       111632038
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       476 (254 68 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       7630882294
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       657516658

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1555         -
# 2  Short offline       Completed without error       00%      1535         -
# 3  Short offline       Completed without error       00%      1521         -
# 4  Short offline       Completed without error       00%      1497         -
# 5  Short offline       Completed without error       00%      1466         -
# 6  Extended offline    Interrupted (host reset)      00%      1446         -
# 7  Short offline       Completed without error       00%      1414         -
# 8  Short offline       Completed without error       00%      1387         -
# 9  Short offline       Completed without error       00%      1363         -
#10  Short offline       Completed without error       00%      1346         -
#11  Short offline       Completed without error       00%      1322         -
#12  Short offline       Completed without error       00%      1291         -
#13  Extended offline    Interrupted (host reset)      00%      1279         -
#14  Short offline       Completed without error       00%      1244         -
#15  Short offline       Completed without error       00%      1219         -
#16  Short offline       Completed without error       00%      1197         -
#17  Short offline       Completed without error       00%      1171         -
#18  Short offline       Completed without error       00%      1162         -
#19  Short offline       Completed without error       00%      1125         -
#20  Extended offline    Completed without error       00%      1117         -
#21  Short offline       Completed without error       00%      1076         -

João Cardoso

unread,
Apr 4, 2020, 1:24:31 PM4/4/20
to Alt-F


On Saturday, 4 April 2020 08:10:20 UTC+1, Craig Holcomb wrote:
Okay, well I don't see anything here that slaps me in the face and says: "Hey Dummy!  Here's your problem!"  So I'm not going to ask for your interpretation... just your opinion ;)
...

The disk looks OK, except for a couple of parameters; see my comments.
I could only make more accurate guesses with the SMART test of a new disk of the same model for comparison.

Values start high (255 or 100, depending on the attribute) and decrease over time; lower is worse.
The WORST column is the minimum value ever attained; it might have been a transient situation, and it might recover.
THRESH is the minimum acceptable value.
Pre-fail type attributes mean that if the threshold is reached, the disk will probably stop working. Notice: probably -- SMART is not a science. There are reports of disks working fine for years after this happens.
The RAW values can't be compared/interpreted, except for the temperatures.

 
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   080   064   006    Pre-fail  Always       -       111632038
 
There was a time in the past when the Raw_Read_Error_Rate increased significantly (the value decreased to 64) relative to its current value (80). Notice it's a pre-fail type attribute.


  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       297
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   045    Pre-fail  Always       -       93391752

Similar to the raw read error rate. The seek errors probably led to the raw read error rate increase (value decrease).
Also pre-fail, and the worst value (60) went halfway from the current value (80) to the threshold (45).
 

  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1575 (166 13 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0

 I think the value being near the threshold is usual for this parameter; it probably means that no early warning will be given.


 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       17
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
 
The value being near the threshold is usual for this parameter; it probably means that no early warning will be given.


187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   055   040    Old_age   Always       -       34 (Min/Max 30/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       11
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       661
194 Temperature_Celsius     0x0022   034   045   000    Old_age   Always       -       34 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   080   064   000    Old_age   Always       -       111632038
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       476 (254 68 0)

253 as the worst value does not make sense -- Seagate's own proprietary usage?
 

241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       7630882294
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       657516658

Interpreting the raw values, you have written roughly 10x more to the disk than you have read from it.
 


SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1555         -
# 2  Short offline       Completed without error       00%      1535         -
# 3  Short offline       Completed without error       00%      1521         -
# 4  Short offline       Completed without error       00%      1497         -
# 5  Short offline       Completed without error       00%      1466         -
# 6  Extended offline    Interrupted (host reset)      00%      1446         -

The tests interrupted by host reset are probably due to the spindown timeout: the extended test takes more time to complete, and the disk goes to standby, stopping the test.
When started from the webUI, the spindown timeout is disabled, so this doesn't happen there. This behaviour pattern applies to scheduled self-tests.

Craig Holcomb

unread,
May 20, 2020, 12:05:17 PM5/20/20
to al...@googlegroups.com

Here is a weird update for anyone who cares.  Two nights ago the problem drive stopped misbehaving, and suddenly I can copy to it, either internally from the other DNS-320 drive or externally from a network drive, at what I would call normal speeds again.  In other words, without any changes on my part (at least that I know of), the excruciatingly slow write speed I was experiencing suddenly changed to what I can only refer to as blindingly fast (compared to what it had been).  I am not holding my breath that this behaviour will continue, but it is definitely a welcome change.

The average write speed had been running roughly around 170MB per hour (that is not a mistype on my part; that is how horribly slow it was writing). After I noticed the change for the better, I started moving a ton of video files from the other internal drive to that one, and in one of my moves I moved two 1.1G files in 1 minute and 13 seconds!  That is an incredible difference from 170MB per hour.  When I started moving all these video files, the Alt-F Status page showed the drive had 3.7T out of 7.2T remaining, and after I moved all the files it showed 3.6T remaining.

The drive is still writing at the fast speed today, and I don't know what changed.  Maybe some threshold was reached, or it filled up the locations I had deleted from and is writing sequentially again.  This is crazy.  Luckily this drive does not contain critical data and only has music, TV shows, and movies, because I'm not sure I trust it since it is acting so flaky. I'll take the win, however (for now).  I will post again if this behaviour changes.
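Doing the math on that: two 1.1G files in 73 seconds is about 2.2 GB / 73 s ≈ 30 MB/s, while 170MB per hour works out to roughly 0.05 MB/s -- something like a 600x difference in sustained write speed.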

Craig Holcomb

unread,
May 30, 2020, 3:43:25 AM5/30/20
to Alt-F
Here is another update: the drive is now back to writing excruciatingly slowly, but I think I understand why it slowed down (just not why the slowness is so extreme). I deleted a small directory containing a few files, which is not a wise thing to do with an SMR drive, as I assume it must then try to reclaim and reuse the space where the deleted files were. From my research I have found that rewriting is the main cause of slow writes, and many people have corroborated this behaviour.  I assume that my recent write speed increase was probably due to the drive having filled all the areas I had previously deleted from and then starting to write sequentially again. By conjecture, I postulate that the drive will speed up again once this newly deleted area has been refilled.

I believe I can prevent this problem from recurring by moving files I want to delete into a folder named FILES_TO_DELETE_WHEN_DRIVE_HAS_NO_MORE_SPACE-WILL_SLOW_DOWN_WRITES_WHEN_DELETED (or something like that), as sketched below.  I believe this will work because, during the recent period of fast writes, I found that moving directories and files did not slow down the write speed. I assume that moves only change virtual and not physical locations on the drive (meaning no deletes).  I will update again when (if?) the drive's writes speed up again.
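In shell terms the deferred-delete idea is just this (paths are examples; the key point is that an mv within the same filesystem is only a rename, so no data blocks are rewritten or freed):

# on the NAS: same filesystem on both sides of the mv
mkdir -p /mnt/sdb2/FILES_TO_DELETE
mv /mnt/sdb2/TV/old_show /mnt/sdb2/FILES_TO_DELETE/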

João Cardoso

unread,
May 30, 2020, 11:24:35 AM5/30/20
to al...@googlegroups.com
Hi Craig,

Thank you for sharing your saga and experiences. Please keep us updated, either confirming or not your educated and supported conjectures.

If they are confirmed, then SMR drives have a badly designed internal controller (as older SSD drives did, which suffered from the same symptoms you describe, although alleviated by their higher speed), and they should be treated as an "append only" backup medium, like old reel tapes. Your idea of not deleting, but instead moving (within the same filesystem?), might be interesting.

Keep going!


From wikipedia:

A number of file systems in Linux are or can be tuned for SMR drives:[27]

  • F2FS, originally designed for flash media, has a Zoned Block Device (ZBD) mode. It can be used on host-managed drives with conventional zones for metadata.
  • Btrfs ZBD support is in progress, but it already writes mostly sequentially due to its CoW nature.
  • ext4 can be experimentally tuned to write more sequentially. Ted Ts'o and Abutalib Aghayev gave a talk in 2017 on their ext4-lazy. Seagate also has a more radical "SMRFFS" extension from 2015 that makes use of the ZBC/ZAC commands.[28]
  • For other filesystems, the Linux device mapper has a dm-zoned target that maps a host-managed drive into a random-writable drive. The Linux kernel since 4.10 can perform this task without dm.[29] zonefs, from 2019, exposes the zones as files for easier access.[30]
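Note that most of those only help with host-aware or host-managed drives. On a recent kernel (4.10 or later; the Alt-F 4.4 kernel predates this) one can check what a drive reports -- the device name is an example, and drives like the ST8000DM004 are presumably device-managed, reporting "none" and hiding everything behind the internal controller:

cat /sys/block/sdb/queue/zoned    # prints none, host-aware or host-managed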

Craig Holcomb

unread,
Oct 25, 2020, 2:16:25 PM10/25/20
to Alt-F
Yesterday the SMR drive finally started writing at an acceptable speed again, and I'm pretty sure my actions caused that improvement.  My idea of waiting for the drive's cache and rewrites of previously deleted areas was a waste of time.  What seems to have solved the problem was fsck.

I hadn't really paid much attention to the Status page's information on the number of days until it would run fsck automatically.  It had obviously been trying to tell me something for a very long time, as all devices were sitting in the negative 100+ range (and the numbers were in a nice red font, duh).  I navigated to the Filesystem Maintenance page and selected Check from the FS Operations drop-down for my three main devices.  Between the three of them it took quite a long time to complete, but we're talking about approximately 12 terabytes, so the time it took does not seem unreasonable.

The last time I wrote data to the SMR drive prior to running fsck, it was getting so bad it took roughly 5 days to write 6 gigabytes.  After running fsck I wrote approximately 13 gigabytes in 3 hours.  A vast improvement, no?  I'm hoping this is the solution I've been looking for, and I will only report back here if I experience future slowdowns that cannot be resolved this way.
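For the record, from an ssh session the equivalent check would presumably be something like this for an ext filesystem (device name is an example; the filesystem has to be unmounted first):

umount /dev/sdb2
e2fsck -f /dev/sdb2    # force a full check even if the filesystem is marked clean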

Craig Holcomb

unread,
Oct 25, 2020, 2:20:00 PM10/25/20
to Alt-F
Sorry, my math was off on that last write time.  It did not take 3 hours... it only took 1.25 hours!!! Even better, am I right?
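(For the record: 13 gigabytes in 1.25 hours is roughly 10.4 GB per hour, or about 3 MB/s, versus the 6 gigabytes in 5 days -- about 0.05 GB per hour -- from before the fsck.)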