DSR failure


Hemantt Chugh

Mar 17, 2021, 9:41:58 AM
to Isilon Technical User Group
Hello Experts

We have a scenario where 2 nodes are stuck mounting /ifs with the DSR failures below, and the cluster is not under EMC support. Can anyone advise how we can skip the DSR failures below so that FlexProtect can complete successfully?

History
# Node 1 had some issues, and after its boot drive was replaced it did not come up successfully.
# In the meantime we saw boot drive health going down on node 7 and planned to get it replaced; unfortunately it also got stuck at the same point, mounting /ifs.

I need urgent help with this.

DSR failure on { 5,0,10593525760:8192 }
DSR failure on { 5,0,10593525760:8192 }
DSR failure on { 5,0,10593525760:8192 }
DSR failure on { 5,0,10593525760:8192 }
DSR failure on { 5,0,10549551104:8192 }
DSR failure on { 5,0,10549551104:8192 }
DSR failure on { 5,0,10549551104:8192 }
 

Jean-Didier stefaniak

Mar 17, 2021, 5:30:39 PM
to Isilon Technical User Group
This is a very risky situation you find yourself in. At the very least, your data may be at risk. DSR stands for Dynamic Sector Recovery, i.e. one or more drives or nodes have suffered an ECC error. Your LNN 5 has 2 blocks impacted (10593525760 and 10549551104).
If nodes 1 and 7 are also down at the same time, you may have already exceeded your N+M protection level, or be right at the threshold of doing so. You are likely experiencing partial data unavailability at this point, and you risk data loss if you are not careful every single step of the way from now on.

You say you replaced the boot drives on node#1 but you do not say how you did it. This procedure is a fairly advanced one that requires precise steps in a given order to protect the node's FS-awareness.  

It may not be what you want to hear or read, but a forum is definitely not the place to ask for help with this type of issue.
These situations need to be handled extremely carefully, as one wrong step could mean the difference between recovering and losing your data.

Contact your Dell EMC Account Manager and ask them to quote you for "Time and Materials" support.

hemant chugh

Mar 18, 2021, 3:13:51 AM
to isilon-u...@googlegroups.com
Thanks for your response.

The boot drive was replaced by a third-party vendor. It's a DR cluster, so we can remove the DSR failures to get the nodes back online, and then replicate the data from the primary cluster.

I am requesting a procedure because the customer is not giving approval to remove data, but we can overwrite it with SyncIQ; for that, the nodes need to be added back to the cluster. Currently I have put the cluster in degraded mode.

Thanks
Hemant
9686630313

--
You received this message because you are subscribed to a topic in the Google Groups "Isilon Technical User Group" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/isilon-user-group/KNquwhdInbM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to isilon-user-gr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/isilon-user-group/b1a45ae0-44cb-4afa-91ab-f6a9754416f8n%40googlegroups.com.

mandar kolhe

Mar 18, 2021, 3:39:46 AM
to isilon-u...@googlegroups.com
Hemant, DSR failures shouldn't be caused by a boot drive replacement. There is some issue with node ID 5, on the drive with lnum 0. In the logs you can find the LIN, then get the path, and use the dd command to see whether the file is readable or gives an I/O error. Support should be engaged; if the corruption spreads it would be a serious blunder. Also stop the replication and restriping jobs.
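
The readability check described here amounts to reading the file from start to end and noting where the read fails, which is what dd does. A rough Python equivalent is sketched below, assuming a path under /ifs has already been found for the file. The helper name first_read_error and its 8192-byte default block size (chosen to match the :8192 in the log lines) are illustrative only, not part of OneFS.

```python
import errno

def first_read_error(path, block_size=8192):
    """Read path sequentially; return (offset, errno name) of the first
    failed read, or None if the whole file is readable."""
    offset = 0
    # buffering=0 so each read() maps to one unbuffered read of block_size
    with open(path, "rb", buffering=0) as f:
        while True:
            try:
                chunk = f.read(block_size)
            except OSError as e:
                # e.g. (16384, "EIO") if the third block cannot be read
                return offset, errno.errorcode.get(e.errno, str(e.errno))
            if not chunk:
                return None  # reached end of file without an error
            offset += len(chunk)
```

A None result means every block was read; a tuple gives the byte offset at which the first error occurred.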


hemant chugh

Mar 18, 2021, 4:52:24 AM
to isilon-u...@googlegroups.com
Hello Mandar,

How can I read the DSR location? A LIN can be resolved with isi get; I tried that for the DSR LINs as well, see the details below. We have deleted all snapshots; only the SyncIQ failover snapshots remain. Shall I delete them as well and then try smartfail?

sudo isi get -L 4000:0001:0085:0029
isi: Could not find a path to LIN:0x4000000100850029/SNAP:18446744073709551615: Invalid argument

4.9881805 03/15 23:24 C    4    479990         DSR failure on { 5,0,10643939328:8192 } of UNKNOWN:{} owned by 4000:0001:0013:008c::HEAD: syscall failed: _sys_pctl2_advance: EIO
4.9881824 03/15 23:28 C    4    479999         DSR failure on { 3,2,13969637376:8192 } of UNKNOWN:{} owned by 4000:0001:0018:0017::HEAD: syscall failed: _sys_pctl2_advance: EIO
4.9881829 03/15 23:29 C    4    480001         DSR failure on { 5,0,10644971520:8192 } of UNKNOWN:{} owned by 4000:0001:001b:0023::HEAD: syscall failed: _sys_pctl2_advance: EIO
4.9881850 03/15 23:34 C    4    480011         DSR failure on { 5,0,10636500992:8192 } of UNKNOWN:{} owned by 4000:0001:0024:0012::HEAD: syscall failed: _sys_pctl2_advance: EIO
4.9881916 03/15 23:56 C    4    480044         DSR failure on { 5,0,10593525760:8192 } of UNKNOWN:{} owned by 4000:0001:0085:0029::HEAD: syscall failed: _sys_pctl2_advance: EIO
4.9881921 03/15 23:57 C    4    480046         DSR failure on { 5,0,10549551104:8192 } of UNKNOWN:{} owned by 4000:0001:008d:000e::HEAD: syscall failed: _sys_pctl2_advance: EIO
4.9881936 03/16 00:04 C    4    480053         DSR failure on { 5,0,10529497088:8192 } of UNKNOWN:{} owned by 4000:0001:00f2:0007::HEAD: syscall failed: _sys_pctl2_advance: EIO
4.9881942 03/16 00:05 C    4    480055         DSR failure on { 5,0,9801621504:8192 } of UNKNOWN:{} owned by 4000:0001:011f:000f::HEAD: syscall failed: _sys_pctl2_advance: EIO
4.9881944 03/16 00:05 C    4    480055         DSR failure on { 5,0,9801621504:8192 } of UNKNOWN:{} owned by 4000:0001:011f:000f::HEAD: syscall failed: _sys_pctl2_advance: EIO
4.9881954 03/16 00:10 C    4    480061         DSR failure on { 5,0,10545807360:8192 } of UNKNOWN:{} owned by 4000:0001:0173:0016::HEAD: syscall failed: _sys_pctl2_advance: EIO
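
The fields in these lines can be pulled apart with a short script to see which drives are affected. This is a minimal Python sketch, assuming the brace tuple is { devid, drive lnum, offset:size }, as the "node id 5 ... lnum 0" reading earlier in the thread suggests, and that the owner is the colon-separated LIN before ::HEAD. The function and field names are illustrative, not part of OneFS.

```python
import re
from collections import Counter

# Assumed layout: "DSR failure on { devid,lnum,offset:size } ... owned by LIN::HEAD"
DSR_RE = re.compile(
    r"DSR failure on \{\s*(\d+),(\d+),(\d+):(\d+)\s*\}"
    r"(?:.*?owned by ([0-9a-fA-F:]+?)::(\w+))?"
)

def parse_dsr_line(line):
    """Return a dict of the fields in one DSR failure line, or None."""
    m = DSR_RE.search(line)
    if m is None:
        return None
    devid, lnum, offset, size, lin, version = m.groups()
    return {
        "devid": int(devid),
        "lnum": int(lnum),
        "offset": int(offset),
        "size": int(size),
        "lin": lin,          # None when the line has no "owned by" part
        "version": version,  # e.g. "HEAD"
    }

def summarize(lines):
    """Count distinct failing blocks per (devid, lnum) pair."""
    blocks = set()
    for line in lines:
        rec = parse_dsr_line(line)
        if rec:
            blocks.add((rec["devid"], rec["lnum"], rec["offset"]))
    return Counter((d, l) for d, l, _ in blocks)
```

Run over the ten lines above, this counts eight distinct blocks on device 5, lnum 0, and one on device 3, lnum 2.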

Thanks,
Hemant Chugh
+919686630313
 Please don't print this e-mail unless you really need to. Keep our City & Country Clean & Green


hemant chugh

Mar 18, 2021, 5:12:39 AM
to isilon-u...@googlegroups.com
Hello Mandar,

Is there a way to see which snapshot each one refers to? All of them are pointing to snapshots.

sudo isi get -L 1:0a59:7ea1
isi: Could not find a path to LIN:0x10a597ea1/SNAP:18446744073709551615: No such file or directory

sudo isi get -L 4000:0001:0085:0029
isi: Could not find a path to LIN:0x4000000100850029/SNAP:18446744073709551615: Invalid argument

sudo isi get -L 4000:0001:008d:000e
isi: Could not find a path to LIN:0x40000001008d000e/SNAP:18446744073709551615: Invalid argument
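
The error messages above show the colon-separated LIN rewritten as a single hex number (4000:0001:0085:0029 becomes 0x4000000100850029). The small Python sketch below does that conversion, assuming each group after the first is zero-padded to four hex digits, as the three examples here suggest. lin_to_hex is a hypothetical helper, not an OneFS tool.

```python
def lin_to_hex(lin):
    """Convert a colon-separated LIN such as '1:0a59:7ea1' to '0x10a597ea1'."""
    groups = lin.split(":")
    # The leading group keeps its width; later groups are padded to 4 hex digits.
    digits = groups[0] + "".join(g.zfill(4) for g in groups[1:])
    return hex(int(digits, 16))
```

This makes it easier to match a LIN from the DSR log lines against the form printed by isi get. The SNAP value 18446744073709551615 in those messages is numerically 2**64 - 1.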



Thanks,
Hemant Chugh

mandar kolhe

Mar 18, 2021, 11:11:59 AM
to isilon-u...@googlegroups.com
Hello Hemant,

Yes, HEAD indicates it's a snapshot LIN. Try something like isi get -L 4000:0001:0173:0016::HEAD

EIO is an input/output error.

Do you have more devices down? Is your protection lost? Check isi_group_info and isi_classic stat -q -d -v.

There are also some bugs with snapshot LINs; for those you need to review the logs. Some are false alerts if it is a snapshot LIN in the OneFS 8.x family; I don't remember in which version it was fixed.



Hemantt Chugh

Mar 18, 2021, 1:33:55 PM
to isilon-u...@googlegroups.com
Hello Mandar

Yes, node 1 and node 7 are not able to mount /ifs after the boot drive was replaced. We are on 8.1.2.0.




mandar kolhe

Mar 28, 2021, 6:55:23 PM
to isilon-u...@googlegroups.com
What error does it give? Did you try to mount it manually?
