Availability isn't critical (ie, if a T1000 crashes for a few
days, it's OK), but we don't want to lose data or
have a bad drive corrupt files, so we would like to
use ZFS w/ RAIDZ as the underlying filesystem. We also
want to minimize power consumption (thus the Tx000
boxes)
Questions:
* Are there any PCI-Express SAS cards that work
w/ Solaris 10/sparc? Ideally, we could throw a
PCI-Express SAS card in a T1000 and attach N
terabytes of storage to it. (where N >= 30,
hopefully :-)
* Any general recommendations as to how much
storage to attach to a single T1000 or T2000?
* What sort of ZFS gotchas would we encounter on
volumes this large?
Thanks!
James
Plus zfs has snapshots and great management features, yadda yadda.
> Questions:
>
> * Are there any PCI-Express SAS cards that work
> w/ Solaris 10/sparc? Ideally, we could throw a
> PCI-Express SAS card in a T1000 and attach N
> terabytes of storage to it. (where N >= 30,
> hopefully :-)
I've been battling this one for a while. Good luck. The LSI Logic
3442E-R (they don't make a non -R [non-RAID] version) seems to work, but I can
only see the drive enclosure, not the drives. See my post "I can see
the scsi enclosure but no disks". A 3442X (same controller, different
host interface) works just fine in an x4100, so the difference on the
t1000 seems like it must be [lack of] FCode. But if that's the case,
why does the card work at all?
There's a firmware patch for the onboard LSI SAS controller, I was
going to try to apply it to the 3442E and see what happens. The card
isn't useful to me if it doesn't work, so if I kill it, then
"whatever".
If that doesn't work I was going to get an iscsi array.
What array/JBOD are you looking at? I have a promise J300s. Let me
know if you want to buy it. ;-)
> * Any general recommendations as to how much
> storage to attach to a single T1000 or T2000?
Well, isn't that based on your availability and performance requirements?
> * What sort of ZFS gotchas would we encounter on
> volumes this large?
I just learned (on zfs-discuss) that you should pay attention to the
man page when it says the maximum number of disks in a raidz should be
9. So for 16-drive arrays you'd probably want three 5-drive raidz's,
and save a drive for a hot spare (support coming in U3). Note that
zfs can combine all these raidz's into a single pool, so you don't
have to worry about free space management.
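A rough sketch of that layout (untested; the device names below are made up,
adjust for your controller):

    # three 5-disk raidz vdevs in one pool, one disk left aside as a spare
    zpool create tank \
        raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
        raidz c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
        raidz c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0
    # once U3 ships, something like "zpool add tank spare c1t15d0"
    # should cover the hot spare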
-frank
Here's a thought:
Get drive arrays that have SAS or SATA disks internally and connect
from the array to the host via FC.
For instance, the AC&NC Jetstor 416FC4 array takes 16 disks (SCSI or
SATA, unfortunately no SAS) and supports dual 4 Gb/sec FC paths.
It supports 750 GB Seagate SATA disks, too, so one array would be 12 TB
raw. Not sure about the cost... probably around USD $22,000?
SATA disks in these Jetstors do not live as long as their SCSI
counterparts, however. Our SATA disks seem to last about three years,
whereas our SCSI disks last five to seven years. That's part of the
price tradeoff. You get what you pay for. :-)
SCSI and SAS would be nice, but they're not as high capacity as SATA,
which means you need far more disks (and arrays)...
Solaris FC drivers are stable and very well supported, and FC cards for
SPARC are plentiful.
Sun even sells StorageTek (bought out by Sun a while ago) dual port 4
Gb/sec FC PCI-e HBAs for SPARC.
FC HBA options possible:
1 Gb/sec, 2 Gb/sec, 4 Gb/sec.
Single port, dual port.
PCI-X, PCI-e.
Most importantly, drivers are solid and there are at least two major
card manufacturers with good Solaris support: Emulex and Qlogic. (JNI
was bought out by AMCC then AMCC dropped FC HBA support. Too bad.)
The T2000 has 3 PCI-e slots and 2 PCI-X slots, I believe.
If you use FC to connect to your storage... you could throw in a SAN
switch. Fewer cables to the host and therefore fewer cards. Also, zoned
setups make it possible for other hosts to see the disks.
If you did that... then you could easily run Sun Cluster 3.1 (free). The
T2000 has enough network ports to do public interfaces, private
interfaces, and cluster interconnects.
Between two or more cluster nodes, availability is pretty good. You're
probably going to have to get at least two servers, anyhow... might as
well not make the disk-to-host attachment a single point of failure.
But if you're thinking of 100 TB of disk, that's hundreds to thousands
of disks... you probably really want to consider an enterprise disk
storage system (e.g. IBM DS6000, DS8000, EMC Clariion, Sun StorageTek,
Hitachi HDS, etc) for the management and support angle.
Think about it this way: if you went with 146 GB SAS disks, you'd need
about 685 of them to reach 100 TB. With 750 GB SATA disks, you'd only need
about 133.
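Quick check of that arithmetic (decimal TB/GB):

    echo "100000 / 146" | bc -l    # ~685 disks at 146 GB each
    echo "100000 / 750" | bc -l    # ~133 disks at 750 GB each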
Our average disk failure rate (starting around year 4 or year
5 of life) is about 1.5% per year, meaning with 986 disks, you'd want at
least 10 spare drives on hand for swapouts as disks fail. With 133
disks, at least 2 spares.
You had better have really good monitoring systems to keep an eye on
degrading and failed disks, especially if it's 133, 986, or even more. :)
I would still recommend FC interconnects, no matter what the disk array
is -- whether high or low end.
-Dan
P.S. I don't work for Sun or any system or storage vendors. I also hate
sales people. ;) I'm just a customer with a pile of Suns and disks.
jwa wrote:
> Hello -- we have ~100TB spread over about 45 PCs with
> SATA disks + FreeBSD. We're looking at reducing the
> number of machines we have to maintain, and are
> considering using a much smaller number of T1000 or
> T2000 boxes + SAS + ZFS/RAIDZ + as many 16-bay
> JBOD SAS arrays as we can.
>
> Availability isn't critical (ie, if a T1000 crashes for a few
> days, it's OK), but we don't want to lose data or
> have a bad drive corrupt files, so we would like to
> use ZFS w/ RAIDZ as the underlying filesystem. We also
> want to minimize power consumption (thus the Tx000
> boxes)
This doesn't answer your questions, but have you considered using the
X4500 (aka Thumper)?
http://www.sun.com/servers/x64/x4500/
With 24TB per box, 5 boxes could give you the ~100TB you wanted, or were
you hoping to reuse the drives from your existing machines?
Kind Regards,
Nathan Dietsch
Yeah, but compare that to a promise J300s (only 12 drives), about
$7200 for 9TB. Plus only about $300 for the controller (and that's
only because you are forced to pay for the RAID) vs. what, $1000 or even
$2000?
If iscsi is acceptable, the promise M500i is ~$5400 so then you get
11.2TB for $11500, still half the price of FC and no HBA to buy. I
guess the major disadvantage here is that it's slow compared to
either SAS or FC. Maybe we'll see 10Gb iscsi soon.
hmm I was going to say that given the use of zfs, maybe the Apple RAID
is acceptable. (You wouldn't want to use the Apple RAID without zfs.)
The problem with that guy is that it's Ultra-ATA not serial ATA, and
it's not as dense as other solutions. And you might end up having to
deal with Apple support. But doing the numbers, it seems expensive;
$13k for 7TB (and eats 3U of space for half the data).
hmmm I had just assumed the FC->SATA arrays were much more expensive
than SAS (given numbers like $22k) but the promise M500f is just $5k.
So 11.2TB for $11k seems a good deal. On the downside it only
supports a single controller (dual ports) so there is less
availability. (But this is the same for the Jetstor.)
Getting back to Apple, don't be confused about the dual controllers;
the Apple RAID is two distinct arrays in one box.
-frank
:-)
> The LSI Logic
> 3442E-R (they don't make it in non -R [RAID]) seems to work, but I can
> only see the drive enclosure, not the drives.
Unfortunately, that won't fit in a T1000 (although since it doesn't
seem to work, it may be moot)
> What array/JBOD are you looking at? I have a promise J300s. Let me
> know if you want to buy it. ;-)
Possibly something like the Adaptec SANblock S50:
http://www.adaptec.com/en-US/products/nas/expansion_arrays/SANbloc_S50_JBOD/
SAS port on the back, takes SAS or SATA disks.
> > * Any general recommendations as to how much
> > storage to attach to a single T1000 or T2000?
>
> Well isn't that based on your availability and performance requirements?
Put another way: assuming we're spreading out I/O across nearly
all drives, at what point (physical drive count) does performance
peak? At what rate does performance degrade?
> > * What sort of ZFS gotchas would we encounter on
> > volumes this large?
>
> I just learned (on zfs-discuss) that you should pay attention to the
> man page when it says the maximum number of disks in a raidz should be
> 9. So for 16 drive arrays probably you might want 3 5-drive raidz's,
> and save a drive for a hot spare (support coming in U3). Note that
> zfs can combine all these raidz's into a single pool, so you don't
> have to worry about free space management.
Good point.
wrt hot spare replacement, it should be possible to script something
to do the replacement; either tail the syslog or run zpool status
periodically to look for failures, and then enable spares as
appropriate. I've done this manually in only a few steps, so
it can't be that hard to script..
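Something along these lines (untested sketch; the pool and spare names are
made up) run from cron would probably cover it:

    #!/bin/sh
    # rough sketch: poll for a degraded pool and swap in a reserved spare
    POOL=tank                # made-up pool name
    SPARE=c1t15d0            # idle disk kept aside by hand (made-up name)
    if zpool status $POOL | grep 'state: DEGRADED' > /dev/null; then
        BAD=`zpool status $POOL | awk '$2 == "FAULTED" || $2 == "UNAVAIL" {print $1; exit}'`
        [ -n "$BAD" ] && zpool replace $POOL $BAD $SPARE
    fi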
James
Actually, I've thought about using SCSI -> SATA enclosures, since
it's easier to find compatible SCSI cards for Solaris than SAS.
We'd use the 750GB SATA Seagates.
> Solaris FC drivers are stable and very well supported, and FC cards for
> SPARC are plentiful.
As is SCSI .. but like SAS, I don't know where the performance peaks &
begins to degrade.
> I would still recommend FC interconnects, no matter what the disk array
> is -- whether high or low end.
Why FC vs. SCSI? Is it just about better performance, or thinner
cables? :-)
James
Yup, we have .. but it seems to be more cost-effective to use
a T1000 and a bunch of SAS->SCSI shelves. How well this
combination actually works is to be determined :)
James
Well, it's a lot easier on the host wiring for a clustered setup if you
only have to run a couple FC cables from a switch to the host... and
this also makes it easier to have a cluster of more than two hosts if
ever desired.
This also relieves slot pressure on the host systems. If you only need
to wire up, let's say, 4 FC cables to access all 12-16 arrays, then...
you'd only need two dual-port FC HBAs in each host, leaving a few slots
free for future expansion in a T2000. Or maybe you could just get by
with a T1000 and a single dual port FC HBA.
With SCSI, you'd be limited to only two hosts in a clustered setup (for
the typical SCSI-attached disk array with dual ports) and you'd have to
wire up all the arrays to both hosts. It works, but is relatively ugly,
inelegant, etc.
FC cables also seem to be more resistant to damage than SCSI, in my
experience. Both can be damaged, but SCSI cables I've had are more
sensitive to reflection due to kinks or even bends.
Distances are also *MUCH* greater with FC than with SCSI. 1.5-6m for SE
SCSI, 12m for LVD SCSI, 25m for HVD SCSI, 300-500m for FC (100-300km
with FC channel extenders).
Distance is good for two things:
1. Less pressure on selecting nearby locations if you have a data center
with little space and the available spots are a little far apart, or if some
equipment is managed by another department in another section of the
computer room.
2. Less likely to run into attenuation issues with FC than with SCSI at
computer-room distances.
Bandwidth: 4 Gb/sec FC would be 500 MB/sec; SCSI peaks out at 320 MB/sec.
10 Gb/sec FC (1250 MB/sec) is already starting to appear.
With larger SCSI setups, you need a small farm of goats for sacrifice to
appease Murphy. ;) All kidding aside, we make use of both SCSI and FC,
but for a large scale setup, it is generally a lot easier to scale
further with FC than with SCSI. When you're looking at a 100+ TB setup,
you definitely want something that will scale.
12 arrays to 2 clustered hosts via a SAN switch would require a minimum
of perhaps... (12 * 1) + (3 * 2) = 18 FC cables. Thereafter, you can add
it to more hosts just by adding (let's say) 3 FC cables to each host and
adjusting zoning on the SAN switch.
12 arrays to 2 hosts via SCSI would require a minimum of (12 * 2) = 24
bigger cables that would have to be recabled if you ever wanted to move
them to other hosts in the future.
BTW, FC-connected storage uses SCSI commands to manage things... so...
think of FC as being SCSI without SCSI cabling limitations.
-Dan
> 12 arrays to 2 clustered hosts via a SAN switch would require a minimum
> of perhaps... (12 * 1) + (3 * 2) = 18 FC cables. Thereafter, you can add
> it to more hosts just by adding (let's say) 3 FC cables to each host and
> adjusting zoning on the SAN switch.
Forgive the FC-newbie question, but I'm not following your math. 12 * 1 I
get (1 FC cable from each of the 12 arrays to the one SAN switch), but don't
understand why the two hosts would need 3 cables each (unless they're being
trunked for more bandwidth)?
--
Rich Teer, SCNA, SCSA, OpenSolaris CAB member
President,
Rite Online Inc.
Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich
?
It's in my T1000 right now.
-frank
I'd have done that also, but SCSI = problems.
-frank
I tried to buy that before I got the promise. At the time (2 months
ago), the product was vaporware, although they were taking orders
which just sat forever in some backorder queue.
Even today, you can't download documentation and you can't buy it from
the Adaptec online store. I doubt you'd get this product anytime soon.
-frank
3 was an arbitrary number. You need at least 2 if you want to do load
balancing and multipathing (for redundancy), and more than 2 if you need
greater aggregate bandwidth.
-Dan
Well, this still depends on your application: random writes? sequential
reads? etc.
But Tom's Hardware says the seagate 750 does 63.5 MB/s read (and write),
so with a SAS 4x interface at 1200MB/s, 18 drives will saturate the SAS bus.
You're likely to reach the peak at fewer than 18 drives, thanks to
cache effects. And I guess the 1200MB/s number doesn't include
protocol overhead. So 12-16 drives per interface might be a starting
ballpark. But you'd really want to put it up against your
application. Maybe you'd peak at only 6 drives, then it'd be smarter
to get lower-capacity arrays (if performance, as opposed to space or
power, is your primary concern). Maybe you'd peak at 32 drives, then
you'd want to chain arrays together and buy fewer CPUs/HBAs.
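Back-of-the-envelope, if you want to play with the numbers yourself:

    # ~63.5 MB/s per drive vs ~1200 MB/s for the SAS 4x link
    echo "1200 / 63.5" | bc -l    # ~18.9 drives before the link itself saturates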
-frank
yup, pretty easy, but U3 will make it automatic.
-frank
Hey, wow! I just got this working. Sorry this is so long, but I'm
pretty excited ... it's fun to tell the story.
In my testing, I was actually using the enclosure "live" on the x4100,
which has 2 disks mounted (and there were only 2 disks in the enclosure).
I then attached the T1000 to the enclosure (port 2 on the single controller),
and it wouldn't see the disks. I thought maybe, just maybe, it was something
to do with the x4100 "owning" those disks, somehow. So I disconnected and
shut off the x4100, and attached only the t1000. Still no love ... which
is what I expected.
Since you got me going again on this, I remember having this kind of
problem with a 3511 JBOD when attaching it to a SAN. (3511 JBOD is
unsupported as a direct attach array, it's only supported as an
expansion unit for a 3510 or 3511 RAID.) When attaching the 3511
directly, I could use it fine, but with a switch in between, I could
only see the enclosure. So based on the SAN hint I found luxadm and
thought I'd play with that.
I hooked the t1000 back up to the jbod, but this time I added a new
drive -- a seagate 750gb. Then, without doing anything else, lo and
behold, cfgadm -al saw a disk:
# cfgadm -al
Ap_Id            Type       Receptacle   Occupant     Condition
c0               scsi-bus   connected    configured   unknown
c0::dsk/c0t0d0   disk       connected    configured   unknown
c1               scsi-bus   connected    configured   unknown
c1::es/ses0      ESI        connected    configured   unknown
c1::sd28         disk       connected    configured   unknown
#
However, format didn't see it. After 'devfsadm -c disk', format saw it
as c1t13, and cfgadm output changed as well. After doing the same on
the x4100, it doesn't see the new disk. So there must be some ownership
of drives that happens. Why the t1000 couldn't see the original 2 drives
when I turned off the x4100 I still don't get.
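For the record, the sequence that got the new disk visible was roughly:

    cfgadm -al          # new disk shows up as an attachment point (c1::sd28)
    devfsadm -c disk    # build the /dev/dsk and /dev/rdsk links for it
    format              # the disk now shows up as c1t13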
Yay! This is a pretty cheap solution. Certainly performance won't be
as good as FC or SCSI disks (for one thing because the seagates are
only 7200 RPM cf 10k or 15k) but for me, zfs (or even svm; slow raid5
write is fine for my app) will take care of that. My application is
large sequential reads of static data so I should be able to take
enough advantage of striping that the inherent per-disk performance
difference won't matter. The drives themselves are probably not
nearly as reliable either. I'm ok with that.
So we have
promise j300s    $2200
lsi 3442e-r       $340
seagate 750gb    $5400   (13 x $415 -- always buy a spare!)
                 -----
                 $7940
= $0.88 / GB, excluding the computer. $1.43/GB if you add a $5k
T1000. I think that's going to be hard to beat. (unless you use
a $1000 x2100 == $0.99/GB, but there are other reasons to use T1000)
The controller on the j300s has 2 ports (plus an expansion port), so
you can dual-attach to protect against host failure; basically, this
comes free if you're already going to buy multiple hosts. (Assuming
there's a way to get 2 hosts to see the same drives.) An additional
controller for the JBOD is available for only $838 if you want.
There are SAS switches (a la SAN switches) that should be available in
the next year or so for a real storage network solution.
Oh, last note, the rackmount rails for the promise are horrendous.
Hopefully you can get some kind of third-party mounting rails -- I was
able to use APC rails.
-frank
Looks like the problem was that I was using different drivers on the
x4100 vs the t1000. On the x4100 it's S10U1 with LSI itmpt driver and
on the t1000 it's S10U2 with the Sun mpt driver. The 2 drivers are
setting the drive id's differently. This has to cause some conflicts.
If I power off either system, the other sees all the drives.
-frank
> * What sort of ZFS gotchas would we encounter on
> volumes this large?
I've got a raidz pool of six 146GB spindles. The CPU overhead
is pretty massive, and the performance bottleneck lands on the fact
that an E250 with 2x400MHz CPUs can't deliver enough CPU time to run
the spindles at anything approaching full speed. I tried a 280R with
12 x 9GB FCAL spindles and the performance was similarly sluggish.
--
Andre.
Maybe Sun's x4500 (aka Thumper) would be better?
--
Robert Milkowski
rmilko...@wp-sa.pl
http://milek.blogspot.com
That's good news -- I think we may buy a couple of these cards.
Can the LSI SAS3442E-R card be configured to just do JBOD?
The web page (
http://www.lsilogic.com/products/sas_hbas/sas3442e_r.html )
doesn't mention JBOD, just "Integrated RAID 0, 1, 1E and 10E"
I guess I could make each physical disk a single RAID0, but that seems
silly. How did you configure it?
btw, this place has the Adaptec SANbloc S50 in stock:
http://www.costcentral.com/proddetail/Adaptec_SANbloc_S50_JBOD/2221000R/J66438/froogle/
How much RAM is in the 280R? Is it CPU-bound?
What happens when you reduce the number of disks in the RAIDZ pool
to 9? The ZFS guide @
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
says "the recommended number of disks [in a RAID-Z] is between 3 and
9".
This thread has some other good RAID-Z sizing info:
http://www.opensolaris.org/jive/thread.jspa?threadID=11151&tstart=15
James
That's how I'm doing it.
> The web page (
> http://www.lsilogic.com/products/sas_hbas/sas3442e_r.html )
> doesn't mention JBOD, just "Integrated RAID 0, 1, 1E and 10E"
> I guess I could make each physical disk a single RAID0, but that seems
> silly. How did you configure it?
I didn't configure it at all. By default it seems it doesn't do any
RAID.
> btw, this place has the Adaptec SANbloc S50 in stock:
>
> http://www.costcentral.com/proddetail/Adaptec_SANbloc_S50_JBOD/2221000R/J66438/froogle/
cool.
You might want to wait just a bit ...
I thought it was working 100% but I didn't have the JBOD fully populated.
Once I put all the drives in, I found that the mpt driver isn't seeing
scsi id's over 15, and the first drive is numbered at 8 ... and 2 drive
id's are skipped (this appears to be a quirk of the phy numbering in
the j300s jbod) so I only see 6 (out of 12) drives. I updated sd.conf
but it didn't help. I have a ticket open with Sun to resolve this.
If I don't get it resolved, I'll use the itmpt driver (from lsilogic).
It's not available for download on the 3442X or 3442E page, and when
I asked support they explicitly told me it wasn't supported, but if
you go to one of the FC HBA pages you can download it. All of their
cards use the same driver, and the readme lists the 3442X and 3442E
as supported. So I think it's just a matter of poor frontline support
and inconsistent web pages.
-frank
Interesting .. are you sure the J300S is behaving properly?
Have you tried moving the card + shelf to another, non-solaris system?
James
It works correctly on the x4100 (S10U1 with lsilogic itmpt driver).
-frank
A while ago, I saw a Sun blog where someone writing code for iSCSI said
they were pleasantly surprised to see something like (from my
recollection) ~85-90% of peak theoretical performance on a 10-Gigabit
Ethernet adapter for what was not fully polished code.
They had a graph and some other comments. Impressive post.
From some quick number-crunching, iSCSI is a no-go with 100 Mbit
Ethernet; just barely acceptable for 1-2 busy disks with Gigabit
Ethernet, and decent at 10-Gigabit Ethernet. Though, you'd want a good
processor and a network card that can offload both TCP and iSCSI
protocol-related processing.
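Rough arithmetic behind that (the ~90% usable-payload figure is just a guess,
and the ~63.5 MB/s per disk is the Tom's Hardware number from earlier in the
thread):

    # GigE is ~125 MB/s raw; assume ~90% usable after TCP/iSCSI overhead
    echo "125 * 0.9 / 63.5" | bc -l    # ~1.8 busy disks' worth of bandwidth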
> hmm I was going to say that given the use of zfs, maybe the Apple RAID
> is acceptable. (You wouldn't want to use the Apple RAID without zfs.)
Indeed. It's an interesting thought. Only downside to them is that -- as
nice as the hardware is -- they really do require a Mac to manage them
from what I seem to recall. (I say this not as being biased against
Apple, but rather, as a nearly 20 year Mac user, administrator, and
developer.)
For some non-Apple shops, this is a hard thing to swallow, according to
various folks I've spoken with. For others, they just shrug and say
'ok... whatever it takes...'
> The problem with that guy is that it's Ultra-ATA not serial ATA, and
> it's not as dense as other solutions. And you might end up having to
Aye.
> deal with Apple support. But doing the numbers it seems it's expensive;
> $13k for 7TB (and eats 3U of space for half the data).
Hrmm.
> hmmm I had just assumed the FC->SATA arrays were much more expensive
> than SAS (given numbers like $22k) but the promise M500f is just $5k.
Well, it's mostly due to the highest-capacity disks, which are still
pretty new and thus pricey. For lower-capacity disks, the prices are
closer to around $11K or so.
-Dan
well, not that nice
> they really do require a Mac to manage them from what I seem to
> recall. (I say this not as being biased against Apple, but rather,
> as a nearly 20 year Mac user, administrator, and developer.)
The management app is a java app. It runs on Windows for sure, and
probably on Solaris.
-frank
Indeed! It would be nice if there were more exposure of just how good
Sun's engineering actually is...
>
> ...
> > hmm I was going to say that given the use of zfs, maybe the Apple RAID
> > is acceptable. (You wouldn't want to use the Apple RAID without zfs.)
>
> Indeed. It's an interesting thought. Only downside to them is that -- as
> nice as the hardware is -- they really do require a Mac to manage them
> from what I seem to recall. (I say this not as being biased against
> Apple, but rather, as a nearly 20 year Mac user, administrator, and
> developer.)
Are you sure about that? Admittedly when I was running an Xserve +
RAID, I was indeed using a Mac as administrative host, but aren't the
> Java tools meant to be cross-platform? (I never tried them on Windows,
we only had 1 PC in the building thank goodness).
Drats... where'd I put the URL to that great post? *sigh* If I find it
again, I'll post it here. It was *very* good.
I have vague recollections of Project Nemo and a 10-GbE adapter... and it
being a higher-performance device driver framework (within GLDv3) that
had results superior even to interrupt coalescing, amongst other things.
>> Indeed. It's an interesting thought. Only downside to them is that -- as
>> nice as the hardware is -- they really do require a Mac to manage them
>> from what I seem to recall. (I say this not as being biased against
>> Apple, but rather, as a nearly 20 year Mac user, administrator, and
>> developer.)
>
> Are you sure about that? Admittedly when I was running an Xserve +
> RAID, I was indeed using a Mac as administrative host, but aren't the
> Java tools meant to be crossplatform? (I never tried them on Windows,
> we only had 1 PC in the building thank goodness).
I goofed; thanks to you and Frank for sorting me out. :)
-Dan
Is it available? Doesn't seem like it.
-frank
You can definitely order them - I did :)
--
Robert Milkowski
rmilkow...@wp-sa.pl
http://milek.blogspot.com
It seems to work well on x86 though (using non-Sun driver).
-frank
Interesting -- we grabbed the LSI card and a J300 to play with.
Short answer: it doesn't work (yet, anyway.)
I installed the lsi driver (itmpt_install_v50701) on a
T2000, Solaris 10 update 2. Filled the J300 with 12
400gb SATA II disks. First issue I saw mirrored your
problem about Solaris only seeing some of the
disks -- in this case, only 7 of the 12:
root@gog:/kernel/drv# cfgadm -al | grep c2
Ap_Id             Type   Receptacle   Occupant     Condition
c2::dsk/c2t8d0    disk   connected    configured   unknown
c2::dsk/c2t9d0    disk   connected    configured   unknown
c2::dsk/c2t10d0   disk   connected    configured   unknown
c2::dsk/c2t11d0   disk   connected    configured   unknown
c2::dsk/c2t13d0   disk   connected    configured   unknown
c2::dsk/c2t14d0   disk   connected    configured   unknown
c2::dsk/c2t15d0   disk   connected    configured   unknown
However, probe-scsi-all could see all drives, and so could Solaris:
Sep 1 10:39:55 gog itmpt: [ID 328041 kern.info] itmpt1: target 9 is id 5000155d21bd6201, phy 1 handle b parent 9
Sep 1 10:39:55 gog itmpt: [ID 328041 kern.info] itmpt1: target 10 is id 5000155d21bd6202, phy 2 handle c parent 9
Sep 1 10:39:55 gog itmpt: [ID 328041 kern.info] itmpt1: target 11 is id 5000155d21bd6203, phy 3 handle d parent 9
Sep 1 10:39:55 gog itmpt: [ID 328041 kern.info] itmpt1: target 20 is id 5000155d21bd6204, phy 4 handle e parent 9
Sep 1 10:39:55 gog itmpt: [ID 328041 kern.info] itmpt1: target 18 is id 5000155d21bd6205, phy 5 handle f parent 9
Sep 1 10:39:55 gog itmpt: [ID 328041 kern.info] itmpt1: target 15 is id 5000155d21bd6206, phy 6 handle 10 parent 9
Sep 1 10:39:56 gog itmpt: [ID 328041 kern.info] itmpt1: target 16 is id 5000155d21bd6207, phy 7 handle 11 parent 9
Sep 1 10:39:56 gog itmpt: [ID 328041 kern.info] itmpt1: target 19 is id 5000155d21bd6208, phy 8 handle 12 parent 9
Sep 1 10:39:56 gog itmpt: [ID 328041 kern.info] itmpt1: target 17 is id 5000155d21bd6209, phy 9 handle 13 parent 9
Sep 1 10:39:56 gog itmpt: [ID 328041 kern.info] itmpt1: target 13 is id 5000155d21bd620a, phy 10 handle 14 parent 9
Sep 1 10:39:56 gog itmpt: [ID 328041 kern.info] itmpt1: target 14 is id 5000155d21bd620b, phy 11 handle 15 parent 9
Sep 1 10:39:56 gog itmpt: [ID 328041 kern.info] itmpt1: target 12 is id 5000155d21bd623e, phy 24 handle 16 parent 9
I don't know why it started numbering the drives at target 8, or
why it stopped at 15. Poking around in lsiutil, I found a
"delete persistent mappings for ALL targets" option & tried it.
A reboot later, cfgadm read:
c2::dsk/c2t0d0    disk   connected    configured   unknown
c2::dsk/c2t1d0    disk   connected    configured   unknown
c2::dsk/c2t2d0    disk   connected    configured   unknown
c2::dsk/c2t3d0    disk   connected    configured   unknown
c2::dsk/c2t4d0    disk   connected    configured   unknown
c2::dsk/c2t5d0    disk   connected    configured   unknown
c2::dsk/c2t6d0    disk   connected    configured   unknown
c2::dsk/c2t8d0    disk   connected    configured   unknown
c2::dsk/c2t9d0    disk   connected    configured   unknown
c2::dsk/c2t10d0   disk   connected    configured   unknown
c2::dsk/c2t11d0   disk   connected    configured   unknown
-- all disks visible except target 7. /kernel/drv/sd.conf didn't
have a reference to target 7. I added one & rebooted.
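The entry is just the usual sd.conf form, something like:

    name="sd" class="scsi" target=7 lun=0;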
Presto: I saw all 12 devices in cfgadm. Hooray! (After looking in
sd.conf, I noticed that it didn't have any entries after target 15,
so I bet I could have added 5 more targets and cfgadm would display
the disks -- but I still don't understand why it started at target 8)
Next, I created a 12-disk ZFS raidz volume and copied a few gb
of data over. Everything seemed fine-- then I did a zpool
scrub and *boom* .. started getting device errors on multiple
disks. I/O to the volume stalled. I went to lunch. 30 minutes
later it had come back to life, but reported the disk @ target
7 failed. I replaced it, started over from scratch (recreated
the raidz) and repeated the copy + scrub ... *boom* .. kernel
panic!
I made one more attempt, creating a RAID10 volume. Again,
disk failure during a copy + scrub. No kernel panic, though.
Questions:
(1) WTF? Why would I be able to create a raidz filesystem
& copy over gigs of data, but have things go way far
south when I do a scrub? Maybe it's a write-only device :-O
(2) What uses ssd.conf? (apparently nothing, but *should*
something?)
(3) Let's pretend it *does* work, using all the default settings.
How does SCSI ID assignment work? Is this the domain of the
J300 chassis, the controller, or the OS? I'd really like to have the
SCSI ID be assigned by the physical location in the JBOD;
this makes identifying & swapping failed disks simpler. I don't
like the idea of the SCSI ID following the disk around (the whole
WWN thing), since I'd have no way to locate that disk on the
array w/o watching the blinky (or not-so-blinky) activity lights.
James
...
Right there (because the id's start at 8), I can say almost positively that
you are not using itmpt but rather the mpt driver that is part of Solaris.
Beginning with S10U2, Solaris supports the LSI cards out of the box.
Beginning with itmpt-05.07.01, the installer does not overwrite the
Sun entries.
You need to manually change /etc/driver_aliases to use itmpt (just
for the LSI card!) instead of mpt.
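Roughly like this -- note the alias below is a guess, check prtconf -pv for
the card's actual compatible name:

    # /etc/driver_aliases: bind the card's PCI alias to itmpt instead of mpt
    # ("pciex1000,56" is a guess -- use whatever alias your card reports)
    itmpt "pciex1000,56"

Then do a reconfiguration reboot (reboot -- -r) so the new binding takes.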
...
> Presto: I saw all 12 devices in cfgadm. Hooray! (After looking in
> sd.conf, I noticed that it didn't have any entries after target 15,
> so I bet I could have added 5 more targets and cfgadm would display
> the disks -- but I still don't understand why it started at target 8)
Nope. There is a bug in the mpt driver in that it treats SAS like
regular SCSI and is limited to 16 targets. At least a partial fix
is in the works.
> Next, I created a 12-disk ZFS raidz volume and copied a few gb
> of data over. Everything seemed fine-- then I did a zpool
> scrub and *boom* .. started getting device errors on multiple
> disks. I/O to the volume stalled. I went to lunch. 30 minutes
> later it had come back to life, but reported the disk @ target
> 7 failed. I replaced it, started over from scratch (recreated
> the raidz) and repeated the copy + scrub ... *boom* .. kernel
> panic!
>
> I made one more attempt, creating a RAID10 volume. Again,
> disk failure during a copy + scrub. No kernel panic, though.
Yeah, this is another problem with the mpt driver.
I offered to send Sun my j300s and LSI cards but was turned down.
> (1) WTF? Why would I be able to create a raidz filesystem
> & copy over gigs of data, but have things go way far
> south when I do a scrub? Maybe it's a write-only device :-O
Apparently, a bug in the mpt driver.
> (2) What uses ssd.conf? (apparently nothing, but *should*
> something?)
Fibre Channel devices.
> (3) Let's pretend it *does* work, using all the default settings.
> How does SCSI ID assignment work? Is this the domain of the
> J300 chassis, the controller, or the OS? I'd really like to have the
> SCSI ID be assigned by the physical location in the JBOD;
> this makes identifying & swapping failed disks simpler. I don't
AFAICT, the HBA numbers the drives in the array consecutively based
on the phynum. But this might be modified by the controller's
automatic persistent mappings. I haven't been able to find any
documentation about it.
> like the idea of the SCSI ID following the disk around (the whole
> WWN thing), since I'd have no way to locate that disk on the
> array w/o watching the blinky (or not-so-blinky) activity lights.
You could simply label the drive/sled.
But yeah, persistence in [Solaris implementation of] SAS is probably
not good since you can't map scsi target to SASAddress like you can
with FC.
I think the SAS HBAs act like a regular scsi card to the OS to make
porting regular scsi drivers easier. But SAS is really more like FC
so drivers should really treat it as such.
-frank
eh, forget what I said earlier about mpt vs itmpt. Apparently you are
indeed using itmpt. I guess on SPARC it still doesn't work well.
> Next, I created a 12-disk ZFS raidz volume and copied a few gb
Note that raidz performs poorly with 12 disks. The manual says to use
3-9 disks. So you should create multiple raidz vdevs.
-frank
Yeah, I wasn't going for performance, just stress-testing.
Since these cards and the J300 work well under Linux, we're ditching
the Solaris SAS stuff for now. Hopefully it will be more stable in a
future release.
Now I'm starting to play with iSCSI, using the Promise Vtrak M500i
(configured w/ 15 400GB SATA drives, RAID50, presented as a single
logical device -- it doesn't do JBOD) .. I get reasonable performance
out of a single one (50MB/sec writes, up to 100MB/sec reads when
measured with 'zpool iostat'), but when I add another identically
configured M500i as a mirror, the host (T2000) becomes unusable while
the mirror syncs. (load average of 400!)
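For reference, the mirroring is just the obvious zpool setup (device names
below are made up); the attach is what kicks off the resilver:

    zpool create tank c3t0d0           # LUN from the first M500i
    zpool attach tank c3t0d0 c4t0d0    # second M500i as a mirror; resilver starts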
James
Linux on what platform? Surely not a T2000?
-frank
nope .. we had an idle 2-way Xeon Supermicro -- Ubuntu 6.06 / kernel
2.6.14 worked "out of the box". I loathe the device naming (I miss
devfs), but that's a different problem.
James
Why not use Solaris on it? Solaris/x86 and SAS (with lsi drivers)
seems to work well.
-frank
I'm downloading Solaris x86 now & am going to give it a shot.
How many of the shelves have you attached to your system?
James
Just 1.
-frank