FW: Status of SGI DCE/DFS Issues

Brown, Doug

Mar 25, 1999, 3:00:00 AM

to dc...@es.net

I forgot to include the ESnet DCEWG mailing list on the message below, but I
thought you might be interested in the response from SGI regarding DCE and
DFS problems.

- Doug

-----Original Message-----
From: Brown, Doug [mailto:cdb...@sandia.gov]
Sent: Thursday, March 25, 1999 12:44 PM
To: 'Susan Bechly'
Cc: sim...@sgi.com; Beck, David; 'asci...@lanl.gov';
'pwd...@ca.sandia.gov'; 'bho...@llnl.gov'; 'b...@lanl.gov';
'qui...@llnl.gov'; 'mi...@ca.sandia.gov'
Subject: RE: Status of SGI DCE/DFS Issues

Susan,

Thank you for the information regarding SGI's work to resolve these DCE/DFS
problems. I was unaware that SGI was giving this matter so much attention,
and I am gratified to learn that this is the case. I hope you have indeed
been successful in correcting these problems, especially the one that
resulted in data corruption. We will be testing your fixes at the National
Labs in an attempt to verify that the problems are resolved.

I would like to emphasize (primarily for the benefit of your upper
management at SGI) that DCE and DFS are critical components of the DOE Labs'
strategy for information sharing in a secure environment. It is paramount
that they continue to be supported and enhanced in future versions of the
SGI operating systems.

Again, thanks for clarifying the activities that have been occurring at SGI
to support DCE and DFS.

- Doug Brown
Manager, Computer Security Technology Dept.
Sandia National Laboratories, Albuq. NM

> -----Original Message-----
> From: Susan Bechly [mailto:s...@sgi.com]
> Sent: Friday, March 19, 1999 3:32 PM
> To: cdb...@sandia.gov
> Cc: sim...@sgi.com; s...@sgi.com; dfb...@sandia.gov
> Subject: Status of SGI DCE/DFS Issues
>
>
> Dear Mr. Brown,
>
> I work on SGI's DCE/DFS product as a support engineer. At the recent
> Transarc user's conference, DECORUM, I ran into a number of people
> from the various labs who had concerns with SGI's DCE/DFS product
> based on information obtained in a meeting a few weeks ago. This
> included David Beck and John Noe from Sandia. David asked that I
> email you and let you know the status of these issues.
>
> The 1.2.2 release for IRIX 6.5.
> -------------------------------
>
> This release was pulled from the web and from distribution at the end
> of December because of a packaging problem. This had nothing to do
> with any user-reported problem or coding problem found in house. At
> the time the package was pulled we had a few problems which were
> close to resolution, so we decided to complete these fixes before
> re-releasing. These turned out to take much longer than expected.
> Also, for IRIX 6.5.4 the DFS kernel code will be released with IRIX,
> so the DCE/DFS package had to be reworked before it could support
> 6.5.4. The new package (1.2.2a) supports 6.5.2, 6.5.3, and 6.5.4. The
> package was released yesterday, March 18th, and is available at the
> following web site:
>
> http://www.sgi.com/Products/Evaluation/
>
> Apparently there was speculation by some people that the release was
> pulled due to known data corruption problems and we were keeping it
> secret from our customers. This is totally untrue.
>
> Data Corruption Problem(s)
> ---------------------------
>
> I'm not sure exactly when the meeting where the data corruption issue
> came up was held, so I don't know what information was passed on at
> that time and since the meeting. Hopefully most of this is old news.
>
> LANL reported a data corruption problem a couple of months ago. Before
> we even started the research process they told us that they believed
> it was due to a credentials problem and said that we could close the
> problem report, which we did.
>
> On February 10th LANL asked that we reopen the problem report. At this
> time the only information we had was that they had a test script which
> was doing nothing more than a copy and compare and it failed with a
> compare error. The script purged the associated files. At this point
> we started running the test script in Eagan. Engineering was fully
> informed, but without the corrupted file or any information which
> would give a clue where to look there wasn't much they could do at
> this point.
>
> On Feb. 18th, the test script running in Eagan failed. Analysis of the
> source and destination files showed that the first 64K of the
> destination file had been null filled. After byte 200000 octal the
> files were the same. PV 673644 was written. Note: this site/problem
> was on the SGI Critical Site List since it was reported so it already
> had a lot of attention and a high priority.
>
> We did not see another failure for 10 days and no failure had been
> reported by the site. Engineering started looking at possible locking
> problems and changed the script so it would capture dfstrace
> information when a failure was hit.
>
> On March 4th it was reported that a LANL employee had a script which
> forced the failure fairly regularly.
>
> March 5th - progress!! The second script supplied by LANL failed twice
> in Eagan in a short period of time. We were able to capture a clean
> dfstrace. Also we finally got an example of a corrupted file from
> LANL. It confirmed they were seeing the 64K of nulls in the corrupted
> file, therefore verifying that we were researching the same problem.
>
> March 10th - Engineering believed they found the problem. The fix
> involved a rewrite of some locking code so they wanted to do some
> pretty extensive testing before checking it into the source tree and
> giving it to LANL. During the testing another possible corruption
> problem was observed (PV 678713). More on this later.
>
> March 15th - the fix was checked into the source tree and made
> available for LANL to test. If LANL verifies that this fixes the
> problem, current plans are to release it with IRIX 6.5.4 and create a
> patch for other OSs ASAP.
>
> In between all this was a whole bunch of meetings, the creation of a
> SWAT team, and even the attention of Rick Belluzzo. Obviously it was
> taken seriously.
>
> Hopefully this fully explains the progress of this problem. It is
> probably a lot more than you wanted.
>
>
> ----------------------------
>
> PNNL -
>
> I also heard that Troy Thompson stated at this meeting that DFS file
> corruption on the SGIs was part of the reason they stopped their DFS
> project and switched to AFS. I had not heard that this was the reason
> for this decision although I can't say I blame them at all. PNNL was
> one of the first sites to really work our IRIX DCE/DFS software and
> hit a lot of problems. It took a huge amount of time on the part of
> their analyst, who was wonderful to work with (and still is). I'm
> extremely sorry that we could not meet their needs.
>
> Anyway here is what I have in my notes as far as this problem is
> concerned.
>
> The problem was reported on February 20th, 1998. It was against our
> 1.1C product on the SGI O2s. As far as I know they have only seen it
> on the O2s. It is unknown whether it happens on 1.2.2, since they
> dropped the project before they had moved much of the load to this
> release.
>
> The symptoms of this problem were that the file in the client's cache
> picked up a fragment of data which did not belong to the associated
> file. The disk file appeared to be OK. We worked with the site analyst
> to try to get enough information to point Engineering in the right
> direction. As I mentioned before, PNNL's site analyst, Paul Gjefle, is
> great as far as providing well documented information in a very timely
> manner. We were not able to reproduce the problem in house, nor had
> any other customer reported a similar problem. When PNNL dropped their
> DFS project, work on this issue stopped until recently. The PV,
> 574192, is still open. This did not, and still does not, appear to be
> the same problem as LANL hit.
>
> On February 17th another site, not one of the national labs, reported
> a data corruption problem (PV 673380). This problem seems to be closer
> to the PNNL problem, although the symptoms are a bit different. The
> corruption observed during the testing of the LANL code (PV 678713)
> also appears to have similar symptoms to the symptoms reported by this
> site. They have been given the code for the LANL fix, just in case it
> is somehow a different manifestation of the same problem. We are
> continuing to work this issue. As an FYI - an analyst from MERCK
> reminded me that MERCK had seen a problem with cache corruption quite
> some time ago. It went away when they moved the cache from an xfs
> filesystem to an efs filesystem. I passed this on to Engineering.
>
> The only other thing I have to add is that just before I left for
> class and Decorum, Lawrence Livermore Laboratory reported a corruption
> problem. They were running a script doing a copy and compare.
> Unfortunately the script did not save the files. They said they would
> reproduce it and send us further information. So far we have not
> received this information, nor have we been able to contact the
> problem originator. Hopefully it is the same problem as LANL reported
> and therefore we have a likely fix.
>
> That is the whole story as I know it. Data corruption is probably the
> most serious computer-related problem there is, and one of the
> toughest to isolate and resolve. A really bad combination. I hope that
> the fix Engineering provided addresses both problems.
>
> This memo may not have made the situation any better, but hopefully
> it shows that we are working these problems at a critical priority.
>
>
>
>
> Susan Bechly
> SGI Global Product Support
> s...@sgi.com
> (651) 683-5308
>
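
The quoted message describes a copy-and-compare test and a failure in which the first 64K of the destination file was null filled, with the files matching again after byte 200000 octal (65536 decimal, i.e. 64K). Below is a minimal sketch of such a check. It is not the LANL or SGI script, which does not appear in this thread; all function names are hypothetical. Unlike the scripts described above, this version leaves both files in place on a mismatch so they can be examined afterwards.

```python
import shutil

# "byte 200000 octal" from the quoted message; equals 65536, i.e. 64K.
NULL_FILL_END = 0o200000


def leading_null_run(data: bytes) -> int:
    """Return how many zero bytes appear at the very start of data."""
    return len(data) - len(data.lstrip(b"\x00"))


def first_difference(a: bytes, b: bytes) -> int:
    """Return the offset of the first differing byte, or -1 if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One input is a prefix of the other (or they are equal).
    return -1 if len(a) == len(b) else min(len(a), len(b))


def describe(expected: bytes, actual: bytes) -> dict:
    """Summarize how actual differs from expected (hypothetical report format)."""
    offset = first_difference(expected, actual)
    return {
        "ok": offset == -1,
        "first_diff": offset,
        "leading_nulls": leading_null_run(actual),
        # True when everything from the 64K boundary onward is identical.
        "tail_matches": expected[NULL_FILL_END:] == actual[NULL_FILL_END:],
    }


def copy_and_compare(src: str, dst: str) -> dict:
    """Copy src to dst, read both back, and report any mismatch.

    Nothing is deleted here, so a corrupted dst survives for analysis.
    """
    shutil.copyfile(src, dst)
    with open(src, "rb") as f:
        expected = f.read()
    with open(dst, "rb") as f:
        actual = f.read()
    return describe(expected, actual)
```

For a destination showing the pattern described in the message, `describe` would report a first difference at offset 0, 65536 leading nulls, and a matching tail, provided the source itself does not begin with zero bytes.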
