Let me preface this by saying that you may want to look at opening a time & materials (T&M) support case with EMC so that they can make a full attempt at getting your cluster back into protection.
Two drives down in a cluster that's protected at +1 is effectively the same as having the whole node down. The system will have to rebuild from parity any stripes that had data on those two drives, but other stripes may be completely intact and reprotectable onto the remaining nodes. Do you think there's any hope of getting the node back online? That would help FlexProtect quite a lot. Otherwise, I'll see if I can explain a bit more about what might be needed.
And 5.0.6.4 on a 9000i... pretty old release, but some of the tools support uses now are still there. The logging is a bit different, but likely similar enough that you should be able to work through the problem.
Are your files snapshotted? If so, simply running rm on a file won't free its blocks... all references to those blocks just move into the snapshot when you run rm. Removing the data for good in that case will require support involvement, hence the T&M support case.
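A quick way to check, assuming the usual OneFS behavior of exposing snapshots under /ifs/.snapshot (I believe that goes back to 5.x, but verify on your cluster; the file path below is just a placeholder):

```
# Snapshots show up as directories under /ifs/.snapshot; if nothing is
# listed here, rm should actually free the blocks.
ls /ifs/.snapshot

# To see whether a particular file is held by a snapshot, look for it
# under each snapshot root (placeholder path, substitute your own):
ls /ifs/.snapshot/*/some/dir/somefile
```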
With the DSR failure, there *should* be something logged in /var/log/messages indicating the LIN that was being accessed when the filesystem attempted DSR. If you have that, you can convert the LIN to a path with isi get -L <LIN>. You'll want to grep for the DSR failure in /var/log/messages on all nodes to find every instance, then pull the list of LINs from those messages. Also look in /var/log/restripe.log, since the restriper would have been logging EIOs while FlexProtect was running, and isi restripe -wD may have more detail about the FlexProtect failure.
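Something like this sketch is what I'd run on each node to collect the LINs. The exact message text and where the LIN appears in it are assumptions on my part, so pull up one real DSR failure line first and adjust the patterns to match:

```
# Collect DSR failure lines from syslog and the restriper log.
grep -i 'dsr' /var/log/messages /var/log/restripe.log > /tmp/dsr_failures.txt

# Extract the LINs, assuming they're printed as "lin <id>" in the line;
# tweak the regex to whatever your 5.0.6.4 logs actually show.
grep -ioE 'lin [0-9a-fx:]+' /tmp/dsr_failures.txt \
    | awk '{print $2}' | sort -u > /tmp/dsr_lins.txt
```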
Once you have the list of LINs, convert them to paths with isi get -L. Assuming they're not snapshotted, you can then rm those paths, which removes the files from the filesystem and stops FlexProtect from trying to reprotect them the next time it runs.
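As a sketch, assuming the /tmp/dsr_lins.txt file from the step above:

```
# Print the path for each LIN so you can review before deleting anything.
while read lin; do
    isi get -L "$lin"
done < /tmp/dsr_lins.txt

# After confirming a path is not snapshotted, remove it by hand rather
# than piping the list straight into rm; these deletions are one-way.
rm '/ifs/some/dir/somefile'
```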
There's another possibility: the blocks being accessed sit in damaged HDD sectors (what Isilon calls ECCs). Support may be able to recover those blocks via another procedure, but it's tricky and would require a support case even to get looked at. That's why I suggested getting that node back online, even just for the smartfail process: FlexProtect will attempt to read from a device in smartfail, even one that's read-only, if it can't get the data/parity from some other device that's already online (and not smartfailed).
So, in summary:
- With the node in smartfail, but online, FlexProtect might be able to reprotect more data than if the node's powered down
- ECCs on other drives in the cluster might prevent some stripes from being reprotected
- rm will not delete data if the data's snapshotted; it'll just move that data into the snapshot
- If you can't remove the data cleanly with rm, or there are ECCs on drives, you'll need to try to get support involved. I'm not sure what the quote would be for T&M in this case.
I think that covers the majority of issues here...