
zfs l2arc warmup


Joar Jegleim

Mar 27, 2014, 3:50:06 AM
Hi list !

I'm struggling to get a clear understanding of how the L2ARC gets warm (ZFS).
It's a FreeBSD 9.2-RELEASE server.

From various forums I've come up with the following, which I have in my /boot/loader.conf:
# L2ARC tuning
# Maximum number of bytes written to l2arc per feed
# 8MB (actual = vfs.zfs.l2arc_write_max * (1000 / vfs.zfs.l2arc_feed_min_ms))
# so 8MB every 200ms = 40MB/s
vfs.zfs.l2arc_write_max=8388608
# Mostly only relevant at the first few hours after boot
# write_boost, speed to fill l2arc until it is filled (after boot)
# 70MB, same rule applies, multiply by 5 = 350MB/s
vfs.zfs.l2arc_write_boost=73400320
# Not sure
vfs.zfs.l2arc_headroom=2
# l2arc feeding period
vfs.zfs.l2arc_feed_secs=1
# minimum l2arc feeding period
vfs.zfs.l2arc_feed_min_ms=200
# control whether streaming data is cached or not
vfs.zfs.l2arc_noprefetch=1
# control whether feed_min_ms is used or not
vfs.zfs.l2arc_feed_again=1
# no read and write at the same time
vfs.zfs.l2arc_norw=1
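
(For a quick sanity check of what those two knobs add up to - a rough sketch only, assuming the feed thread really does run every l2arc_feed_min_ms while feed_again is enabled:)

# effective steady-state fill rate = write_max * (1000 / feed_min_ms)
sysctl -n vfs.zfs.l2arc_write_max vfs.zfs.l2arc_feed_min_ms | \
    awk 'NR==1 { max = $1 } NR==2 { printf "%.0f MB/s\n", max * (1000 / $1) / 1048576 }'
# with the values above: 8MB * 5 = 40MB/s (and 70MB * 5 = 350MB/s during write_boost)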

But what I really wonder is: how does the L2ARC actually get warmed up?
I'm thinking of 2 scenarios:

a.: when the ARC is full, data that gets evicted from the ARC is moved over to the L2ARC;
that means that files in the fs that are never accessed will never end
up in the L2ARC, right?

b.: ZFS runs through the fs in the background and fills up the L2ARC for every
file, regardless of whether it has been accessed or not (this is the
'feature' I'd like)

I suspect scenario a is what really happens, and if so, how do
people warm up the L2ARC manually?
I figured that if I rsync everything from the pool that I want
cached, it will fill up the L2ARC for me, which is what I'm doing right now.
But it takes 3-4 days to rsync the whole pool.

Is this how 'you' warm up the L2ARC, or am I missing something?

The thing with this particular pool is that it serves somewhere
between 20 and 30 million jpegs for a website. The front page of the
site presents, on every reload, a mosaic of about 36 jpegs, and the
jpegs are fetched completely at random from the pool.
I don't know which jpegs will be fetched at any given time, so I'm
installing about 2TB of L2ARC (the pool is about 1.6TB today) and I
want the whole pool to be available from the L2ARC.


Any input on my 'rsync solution' to warm up the L2ARC is much appreciated :)


--
----------------------
Joar Jegleim
Homepage: http://cosmicb.no
Linkedin: http://no.linkedin.com/in/joarjegleim
fb: http://www.facebook.com/joar.jegleim
AKA: CosmicB @Freenode

----------------------
_______________________________________________
freeb...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "freebsd-fs-...@freebsd.org"

krad

Mar 27, 2014, 5:16:09 AM
not sure if it's made it into FreeBSD yet, but

https://www.illumos.org/issues/3525

Ronald Klop

Mar 27, 2014, 5:02:11 AM
On Thu, 27 Mar 2014 08:50:06 +0100, Joar Jegleim <joar.j...@gmail.com>
wrote:
2TB of l2arc?
Why don't you put your data on SSDs, get rid of the L2ARC and buy some
extra RAM instead?
Then you don't need any warm-up.

For future questions, please provide more details about your setup. What
are the disks, what SSDs, how much RAM? How is your pool configured? Mirror,
raidz, ... Things like that.

Ronald.

Joar Jegleim

Mar 27, 2014, 6:06:02 AM
Hi,

thnx for your input.

The current setup :
2 X HP Proliant DL380 G7, 2xXeon (six core)@2667Mhz, 144GB DDR3
@1333Mhz (ecc, registered)
Each server has an external shelf with 20 1TB SATA ('sas midline') 7200RPM disks
The shelf is connected via a Smart Array P410i with 1GB cache.

The second server is a failover; I use zfs send/receive for replication
(with mbuffer). I had HAST in there for a couple of months but got cold
feet after some problems, plus I hate having an expensive server just
'sitting there'. We will in the near future start serving jpegs from both
servers, which is a setup I like a lot more.

I've set up 20 single-disk 'raid 0' logical disks in that P410i, and
built a ZFS mirror over those 20 disks (raid 10), which gives me about
~9TB of storage.

For the record, I initially used an LSI SAS 9207-4i4e SGL HBA to
connect the external shelf, but after some testing I realized I got
more performance out of the P410i with cache enabled.
I have dual power supplies as well as a UPS, and I want the performance
even with the risk it involves.

At the moment I have 2x Intel 520 480GB SSDs for L2ARC; the plan is
to add 2 more SSDs to get ~2TB of L2ARC, and to add a small SSD for
log/ZIL.

The pool has the default recordsize (128k) and atime=off; I've set
compression to lz4 and get a compressratio of 1.18x.

I've set the following sysctl's related to zfs:
# 100GB
vfs.zfs.arc_max=107374182400
# used to be 5(default) trying 1
vfs.zfs.txg.timeout="1"
# this to work with the raid ctrl cache
vfs.zfs.cache_flush_disable=1
vfs.zfs.write_limit_shift=9
vfs.zfs.txg.synctime_ms=200
# L2ARC tuning
# Maximum number of bytes written to l2arc per feed
# 8MB (actual = vfs.zfs.l2arc_write_max * (1000 / vfs.zfs.l2arc_feed_min_ms))
# so 8MB every 200ms = 40MB/s
vfs.zfs.l2arc_write_max=8388608
# Mostly only relevant at the first few hours after boot
# write_boost, speed to fill l2arc until it is filled (after boot)
# 70MB, same rule applies, multiply by 5 = 350MB/s
vfs.zfs.l2arc_write_boost=73400320
# Not sure
vfs.zfs.l2arc_headroom=2
# l2arc feeding period
vfs.zfs.l2arc_feed_secs=1
# minimum l2arc feeding period
vfs.zfs.l2arc_feed_min_ms=200
# control whether streaming data is cached or not
vfs.zfs.l2arc_noprefetch=1
# control whether feed_min_ms is used or not
vfs.zfs.l2arc_feed_again=1
# no read and write at the same time
vfs.zfs.l2arc_norw=1
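
(To keep an eye on how the cache devices behave once they're in - a minimal sketch; the arcstats kstats below exist on FreeBSD 9.x, and 'tank' stands in for the actual pool name:)

# bytes stored on the cache devices, RAM used by their headers,
# and whether reads are actually being served from L2ARC
sysctl kstat.zfs.misc.arcstats.l2_size \
       kstat.zfs.misc.arcstats.l2_hdr_size \
       kstat.zfs.misc.arcstats.l2_hits \
       kstat.zfs.misc.arcstats.l2_misses
# per-vdev view every 5 seconds, cache devices listed in their own section
zpool iostat -v tank 5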

> 2TB of l2arc?
> Why don't you put your data on SSD's, get rid of the l2arc and buy some
> extra RAM instead.
> Than you don't need any warm-up.
I'm considering this option, but today I have ~10TB of storage and
need space for future growth. Plus I like the idea that if the L2ARC
dies I only lose performance, not my data.
I also reckon I'd have to use far more expensive SSDs if I were to use
them for the main datastore; for L2ARC I can use cheaper SSDs. Those Intel
520's can deliver ~50,000 IOPS, and I need IOPS, not necessarily
bandwidth.
At least that's my understanding of this.
Open for input! :)
--
----------------------
Joar Jegleim
Homepage: http://cosmicb.no
Linkedin: http://no.linkedin.com/in/joarjegleim
fb: http://www.facebook.com/joar.jegleim
AKA: CosmicB @Freenode

----------------------

Joar Jegleim

Mar 27, 2014, 6:10:48 AM
thnx, I found a similar post over at illumos related to persistent
l2arc yesterday. It's interesting :)

But it's really not a problem for me how long it takes to warm up the
L2ARC; if it takes a week, that's OK. After all, I don't plan on
rebooting this setup very often, plus I have 2 servers, so I have the
option of letting a server warm up before I hook it into production again
after maintenance / patch upgrades and so on.

I'm just curious whether the L2ARC warms itself up, or if I
would have to do that manual rsync to force the L2ARC warmup.

Johan Hendriks

Mar 27, 2014, 6:21:03 AM
Joar Jegleim wrote:
A nice blog about the L2ARC

https://blogs.oracle.com/brendan/entry/test

https://blogs.oracle.com/brendan/entry/l2arc_screenshots

regards
Johan

Rainer Duffner

Mar 27, 2014, 6:40:18 AM
On Thu, 27 Mar 2014 08:50:06 +0100, Joar Jegleim <joar.j...@gmail.com>
wrote:

> Hi list !
>
> I struggling to get a clear understanding of how the l2arc get warm
> ( zfs). It's a FreeBSD 9.2-RELEASE server.
>

> The thing is with this particular pool is that it serves somewhere
> between 20 -> 30 million jpegs for a website. The front page of the
> site will for every reload present a mosaic of about 36 jpegs, and the
> jpegs are completely randomly fetched from the pool.
> I don't know what jpegs will be fetched at any given time, so I'm
> installing about 2TB of l2arc ( the pool is about 1.6TB today) and I
> want the whole pool to be available from the l2arc .
>
>
> Any input on my 'rsync solution' to warmup the l2arc is much
> appreciated :)
>
>



Don't you need RAM for the L2ARC, too?

http://www.richardelling.com/Home/scripts-and-programs-1/l2arc


I'd just max-out the RAM on the DL370 - you'd need to do that anyway,
according to the above spread-sheet....

Bob Friesenhahn

Mar 27, 2014, 10:26:20 AM
On Thu, 27 Mar 2014, Joar Jegleim wrote:
> Is this how 'you' do it to warmup the l2arc, or am I missing something ?
>
> The thing is with this particular pool is that it serves somewhere
> between 20 -> 30 million jpegs for a website. The front page of the
> site will for every reload present a mosaic of about 36 jpegs, and the
> jpegs are completely randomly fetched from the pool.
> I don't know what jpegs will be fetched at any given time, so I'm
> installing about 2TB of l2arc ( the pool is about 1.6TB today) and I
> want the whole pool to be available from the l2arc .

Your usage pattern is the opposite of what the ARC is supposed to do.
The ARC is supposed to keep most-often accessed data in memory (or
retired to L2ARC) based on access patterns.

It does not seem necessary for your mosaic to be truly random across
20-30 million jpegs. Random selection across 1000 jpegs which are rotated
over time would produce a similar effect.

The application building your web page mosaic can manage which files
will be included in the mosaic and achieve the same effect as a huge
cache by always building the mosaic from a known subset of files.
The 1000 jpegs used for the mosaics can be cycled over time from a
random selection, with old ones being removed. This approach assures
that in-memory caching is effective, since the same files will be
requested many times by many clients.

Changing the problem from an OS-oriented one to an
application-oriented one (better algorithm) gives you more control and
better efficiency.
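
(For illustration, one minimal way to do that rotation outside the application - a sketch only, assuming the jpegs live under /pool/jpegs, the working set is published as symlinks under /pool/mosaic-today, and filenames contain no whitespace; both paths are placeholders:)

# nightly cron job: pick a fresh random 1000-image working set
rm -f /pool/mosaic-today/*
find /pool/jpegs -type f -name '*.jpg' | \
    awk 'BEGIN { srand() } { print rand() "\t" $0 }' | sort -n | head -n 1000 | cut -f 2- | \
    while read -r f; do ln -sf "$f" /pool/mosaic-today/; done

The web front end would then only pick mosaic images from /pool/mosaic-today, so the same ~1000 files stay hot in the ARC all day.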

Bob
--
Bob Friesenhahn
bfri...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

Linda Kateley

Mar 27, 2014, 11:45:15 AM
It seems like this should be easier. The ARC and L2ARC will hold what has
been read... I don't know, maybe cat the jpegs at boot?





On 3/27/14, 9:53 AM, Karl Denninger wrote:
> That's true, but the other option if he really does want it to be
> random across the entire thing, given the size (which is not
> outrageous) and that the resource is going to be read-nearly-only, is
> to put them on SSDs and ignore the L2ARC entirely. These days that's
> not a terribly expensive answer as with a read-mostly-always
> environment you're not going to run into a rewrite life-cycle problem
> on rationally-priced SSDs (e.g. Intel 3500s).
>
> Now an ARC cache miss is not all *that* material since there is no
> seek or rotational latency penalty.
>
> HOWEVER, with that said it's still expensive compared against rotating
> rust for bulk storage, and as Bob noted a pre-select middleware
> process would result in no need for a L2ARC and allow the use of a
> pool with much-smaller SSDs for the actual online retrieval function.
>
> Whether the coding time and expense is a good trade against the lower
> hardware cost to do it the "raw" way is a fair question.

Joar Jegleim

Mar 27, 2014, 4:20:11 PM
thnx !
I'd read the first link many times before, but the second one was new to
me and great reading!
--
----------------------
Joar Jegleim
Homepage: http://cosmicb.no
Linkedin: http://no.linkedin.com/in/joarjegleim
fb: http://www.facebook.com/joar.jegleim
AKA: CosmicB @Freenode

----------------------

Joar Jegleim

Mar 27, 2014, 4:34:11 PM
> Don't you need RAM for the L2ARC, too?
>
> http://www.richardelling.com/Home/scripts-and-programs-1/l2arc
>
>
> I'd just max-out the RAM on the DL370 - you'd need to do that anyway,
> according to the above spread-sheet....
>
yeah, it does. At the moment I've got 2x 480GB SSD for L2ARC and 144GB
RAM. I haven't found a way to calculate whether I have enough RAM or
not, but I've seen posts that make me suspect I have enough for this
setup.

The link from Johan Hendriks,
https://blogs.oracle.com/brendan/entry/l2arc_screenshots, actually
mentions this at the bottom:
"
It costs some DRAM to reference the L2ARC, at a rate proportional to
record size. For example, it currently takes about 15 Gbytes of DRAM
to reference 600 Gbytes of L2ARC - at an 8 Kbyte ZFS record size. If
you use a 16 Kbyte record size, that cost would be halve - 7.5 Gbytes.
This means you shouldn't, for example, configure a system with only 8
Gbytes of DRAM, 600 Gbytes of L2ARC, and an 8 Kbyte record size - if
you did, the L2ARC would never fully populate.
"

My two 480GB SSDs will probably be full by tomorrow; they're
currently at 686GB with about 207GB left to fill.
I wonder how I can read out how much RAM is used for the L2ARC
references(?). Would that be the 'HEADER' value from top in 9.2-RELEASE
(the ARC line)? It was around 3GB yesterday and now I see it's climbed
to about 6.3GB.
(I've got the 128KB record size.)




On 27 March 2014 11:40, Rainer Duffner <rai...@ultra-secure.de> wrote:
> On Thu, 27 Mar 2014 08:50:06 +0100, Joar Jegleim <joar.j...@gmail.com>
> wrote:
>
>> Hi list !
>>
>> I struggling to get a clear understanding of how the l2arc get warm
>> ( zfs). It's a FreeBSD 9.2-RELEASE server.
>>
>
>> The thing is with this particular pool is that it serves somewhere
>> between 20 -> 30 million jpegs for a website. The front page of the
>> site will for every reload present a mosaic of about 36 jpegs, and the
>> jpegs are completely randomly fetched from the pool.
>> I don't know what jpegs will be fetched at any given time, so I'm
>> installing about 2TB of l2arc ( the pool is about 1.6TB today) and I
>> want the whole pool to be available from the l2arc .
>>
>>
>> Any input on my 'rsync solution' to warmup the l2arc is much
>> appreciated :)
>>
>>
>
>
>
> Don't you need RAM for the L2ARC, too?
>
> http://www.richardelling.com/Home/scripts-and-programs-1/l2arc
>
>
> I'd just max-out the RAM on the DL370 - you'd need to do that anyway,
> according to the above spread-sheet....
>
>
>
>
>



--
----------------------
Joar Jegleim
Homepage: http://cosmicb.no
Linkedin: http://no.linkedin.com/in/joarjegleim
fb: http://www.facebook.com/joar.jegleim
AKA: CosmicB @Freenode

----------------------

Joar Jegleim

Mar 27, 2014, 4:40:11 PM
Appreciate your input; I've talked to our devs about that and asked
them to make some finite subset of the jpegs that rotates
every night.
Actually, the whole web application is being rewritten, so I won't have
anything like that until August at best, which certainly isn't bad.

When I get that kind of feature, maybe I'll no longer need an L2ARC
covering the whole dataset.
--
----------------------
Joar Jegleim
Homepage: http://cosmicb.no
Linkedin: http://no.linkedin.com/in/joarjegleim
fb: http://www.facebook.com/joar.jegleim
AKA: CosmicB @Freenode

----------------------

Joar Jegleim

Mar 27, 2014, 4:55:01 PM
I agree, and since the devs will take this into account for our next
release (a total rewrite), I may hold off on any further hardware purchases
until then; not sure yet.
I've almost got my current 960GB of L2ARC filled up, and I'm going to see
how that affects performance.
It won't cover the whole dataset, but about 75% or so, so I reckon I
should see some improvement.

On 27 March 2014 15:53, Karl Denninger <ka...@denninger.net> wrote:
>
> On 3/27/2014 9:26 AM, Bob Friesenhahn wrote:
>>
> That's true, but the other option if he really does want it to be random
> across the entire thing, given the size (which is not outrageous) and that
> the resource is going to be read-nearly-only, is to put them on SSDs and
> ignore the L2ARC entirely. These days that's not a terribly expensive
> answer as with a read-mostly-always environment you're not going to run into
> a rewrite life-cycle problem on rationally-priced SSDs (e.g. Intel 3500s).
>
> Now an ARC cache miss is not all *that* material since there is no seek or
> rotational latency penalty.
>
> HOWEVER, with that said it's still expensive compared against rotating rust
> for bulk storage, and as Bob noted a pre-select middleware process would
> result in no need for a L2ARC and allow the use of a pool with much-smaller
> SSDs for the actual online retrieval function.
>
> Whether the coding time and expense is a good trade against the lower
> hardware cost to do it the "raw" way is a fair question.
>
> --
> -- Karl
> ka...@denninger.net
>
>



--
----------------------
Joar Jegleim
Homepage: http://cosmicb.no
Linkedin: http://no.linkedin.com/in/joarjegleim
fb: http://www.facebook.com/joar.jegleim
AKA: CosmicB @Freenode

----------------------

Joar Jegleim

Mar 27, 2014, 5:00:08 PM
On 27 March 2014 11:40, Rainer Duffner <rai...@ultra-secure.de> wrote:
> On Thu, 27 Mar 2014 08:50:06 +0100, Joar Jegleim <joar.j...@gmail.com>
> wrote:
>
>> Hi list !
>>
>> I struggling to get a clear understanding of how the l2arc get warm
>> ( zfs). It's a FreeBSD 9.2-RELEASE server.
>>
>
>> The thing is with this particular pool is that it serves somewhere
>> between 20 -> 30 million jpegs for a website. The front page of the
>> site will for every reload present a mosaic of about 36 jpegs, and the
>> jpegs are completely randomly fetched from the pool.
>> I don't know what jpegs will be fetched at any given time, so I'm
>> installing about 2TB of l2arc ( the pool is about 1.6TB today) and I
>> want the whole pool to be available from the l2arc .
>>
>>
>> Any input on my 'rsync solution' to warmup the l2arc is much
>> appreciated :)
>>
>>
>
>
>
> Don't you need RAM for the L2ARC, too?
>
> http://www.richardelling.com/Home/scripts-and-programs-1/l2arc
>
>
> I'd just max-out the RAM on the DL370 - you'd need to do that anyway,
> according to the above spread-sheet....
>
>
>
>
>



--
----------------------
Joar Jegleim
Homepage: http://cosmicb.no
Linkedin: http://no.linkedin.com/in/joarjegleim
fb: http://www.facebook.com/joar.jegleim
AKA: CosmicB @Freenode

----------------------

Joar Jegleim

Mar 27, 2014, 5:02:09 PM
Sorry for the previous empty mail.
But I think maybe kstat.zfs.misc.arcstats.l2_hdr_size would show the
L2ARC header size.
It's currently at kstat.zfs.misc.arcstats.l2_hdr_size: 4901413968, with
the L2ARC currently at 696GB, and I have the default 128KB recordsize.
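
(Extrapolating linearly from those two numbers - a rough back-of-envelope only, assuming the average cached block size stays about the same:

  4901413968 bytes / 696 GB   ~ 7 MB of header per GB of L2ARC
  2 TB of L2ARC               ~ 2000 GB * 7 MB ~ 14 GB of RAM for L2ARC headers

so comfortably within 144 GB of RAM, though as far as I understand it those headers are accounted as ARC metadata, so they compete with the 100GB arc_max.)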


On 27 March 2014 11:40, Rainer Duffner <rai...@ultra-secure.de> wrote:
> On Thu, 27 Mar 2014 08:50:06 +0100, Joar Jegleim <joar.j...@gmail.com>
> wrote:
>
>> Hi list !
>>
>> I struggling to get a clear understanding of how the l2arc get warm
>> ( zfs). It's a FreeBSD 9.2-RELEASE server.
>>
>
>> The thing is with this particular pool is that it serves somewhere
>> between 20 -> 30 million jpegs for a website. The front page of the
>> site will for every reload present a mosaic of about 36 jpegs, and the
>> jpegs are completely randomly fetched from the pool.
>> I don't know what jpegs will be fetched at any given time, so I'm
>> installing about 2TB of l2arc ( the pool is about 1.6TB today) and I
>> want the whole pool to be available from the l2arc .
>>
>>
>> Any input on my 'rsync solution' to warmup the l2arc is much
>> appreciated :)
>>
>>
>
>
>
> Don't you need RAM for the L2ARC, too?
>
> http://www.richardelling.com/Home/scripts-and-programs-1/l2arc
>
>
> I'd just max-out the RAM on the DL370 - you'd need to do that anyway,
> according to the above spread-sheet....
>
>
>
>
>



--
----------------------
Joar Jegleim
Homepage: http://cosmicb.no
Linkedin: http://no.linkedin.com/in/joarjegleim
fb: http://www.facebook.com/joar.jegleim
AKA: CosmicB @Freenode

----------------------

Joar Jegleim

Mar 28, 2014, 5:23:35 AM
On 28 March 2014 01:59, <kpn...@pobox.com> wrote:
> On Thu, Mar 27, 2014 at 11:10:48AM +0100, Joar Jegleim wrote:
>> But it's really not a problem for me how long it takes to warm up the
>> l2arc, if it takes a week that's ok. After all I don't plan on
>> reboot'ing this setup very often + I have 2 servers so I have the
>> option to let the server warmup until i hook it into production again
>> after maintenance / patch upgrade and so on .
>>
>> I'm just curious about wether or not the l2arc warmup itself, or if I
>> would have to do that manual rsync to force l2arc warmup.
>
> Have you measured the difference in performance between a cold L2ARC and
> a warm one? Even better, have you measured the performance with a cold
> L2ARC to see if it meets your performance needs?
No I haven't.
I actually started using those 2 SSDs for L2ARC the day before I sent
this mail to the list.
I haven't done this the 'right' way by producing numbers for
measurement, but I do know that the way this application works today,
it pulls random jpegs from this dataset of about 1.6TB,
consisting of many millions of files (more than 20 million). And
today this pool is served from 20 SATA 7.2K disks, which is about the
slowest solution for random read access.
Based on the huge performance gain of SSDs simply on paper, but also
on other people's graphs from the net (people who have done this more
thoroughly than me), I'm pretty confident in saying that any time the
application requests a jpeg, serving it from either RAM or SSD would be
a substantial performance gain compared to serving it from the 7.2K
array of disks.

>
> If you really do need a huge L2ARC holding most of your data have you
> considered that maybe you are taking the wrong approach to getting
> performance? Consider load balancing across multiple servers, or having
> your application itself spread the loads of pictures across multiple
> servers.
Yes I have :p but again that would mean I'd have to rewrite the
application, or I would have to mirror several servers. There
are application-related problems with mirroring several servers;
I'll skip those details here. But I have thought about
what would happen if I served those jpegs from, say, 4 backend servers,
and I really don't think it would help compared to serving from SSDs; I
would at least need 20 disks per server for there to be any
performance gain... And I'd still have the latency and all the other
disadvantages of 7.2K disks.

The next release of the application has actually taken this into
account, and in the future I will be able to spread this over 4
servers.
Later on I might spread this stuff over more backends.
At the moment the cheapest and easiest option would be to simply buy 2 more
480GB SSDs, put them in the server and make sure as much as possible
of the dataset resides in the L2ARC.

>
> If a super-huge L2ARC is really needed for the traffic _today_, what about
> when you have more traffic in 3-6-12 months? What about if you increase
> the number of pictures you are randomly choosing from? If your server is
> at the limit of its performance today then pretty soon you will outgrow
> it. Then what?
The server is actually far from any limit; in fact it has so 'little'
to do that I've been a bit puzzled trying to figure out why our front page
isn't more snappy.
These things will probably be taken care of, again, in the next
release of the application, which will give me control over 'today's'
front-page mosaic pictures, so I can either make sure the front-page jpegs
stay in the ARC, or simply serve the front-page jpegs from Varnish.


>
> What happens if your production server fails and your backup server has
> a cold L2ARC? Then what?
Performance would drop, but nothing really serious. Plus I've got 2 of
them, and my plan is to make sure the L2ARC on the second server is warm.


>
> Having more and more parts in a server also means you have more opportunities
> for a failure, and that means a higher chance of something bringing down
> the entire server. What if one of the SSD in your huge L2ARC fails in a
> way that locks the bus? This is especially important since you indicated
> you are using cheaper SSD for the L2ARC. Fewer parts -> more robust server.
Good point. Again, I have a failover server and a proxy with health
check in front, and actually I have a third 'fall-back' server too for
worst case scenarios.

>
> On the ZIL: the ZIL holds data on synchronous writes. That's it. This is
> usually a fraction of the writes being done except in some circumstances.
> Have you measured to see if, or do you otherwise know for sure, that you
> really do need a ZIL? I suggest not adding a ZIL unless you are certain
> you need it.
Yes, I only recently realized that too, and I'm really not sure if a
ZIL device is required.
Some small portion of the files (some hundred MBs) is served over NFS from
the same server; if I understand it right, a ZIL will help for NFS
stuff(?), but I'm not sure there's any gain in having a ZIL today.
On the other hand, a ZIL doesn't have to be big; I can simply buy a
128GB SSD, which is cheap today.

>
> Oh, and when I need to pull files into memory I usually use something
> like 'find . -type f -exec cat {} \; >/dev/null'. Well, actually, I
> know I have no spaces or special characters in filenames so I really
> do 'find . -type f -print | xargs cat > /dev/null'. This method is
> probably best if you use '-print0' instead plus the correct argument to
> xargs.
Thanks, this really makes sense, and I reckon it would be faster than
an rsync from another server.
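
(Spelled out, a minimal version of that warmup pass - /pool/data is a placeholder for the dataset mountpoint:)

# read every file once so its blocks pass through the ARC and become
# candidates for the L2ARC feed thread
find /pool/data -type f -print0 | xargs -0 cat > /dev/null

One caveat: with vfs.zfs.l2arc_noprefetch=1 (as in the tunables earlier in the thread), buffers that arrive via prefetch are skipped by the L2ARC feed, so a purely sequential read like this may warm less of the cache than expected.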

>
> --
> Kevin P. Neal http://www.pobox.com/~kpn/
>
> "Nonbelievers found it difficult to defend their position in \
> the presense of a working computer." -- a DEC Jensen paper

Bob Friesenhahn

Mar 28, 2014, 10:56:08 AM
On Fri, 28 Mar 2014, Joar Jegleim wrote:
> The server is actually far from any limit, in fact it has so 'little'
> to do I've been a bit put off to figure out why our frontpage won't be
> more snappy.

The lack of "snappy" is likely to be an application problem rather
than a server problem. Take care not to blame the server for an
application design problem. You may be over-building your server when
all that is actually needed is some simplification of the web content.

The design of the application is important. The design of the content
provided to the web client is important.

Something I learned about recently which could be really helpful to
you: there is a Firefox tool called "Web Developer Toolbar" which
has a "Network" option. This option will show all files loaded for a
given web page, including the time when the request was initiated, and
when it completed. You may find that the apparent latency problem is
not your server at all. You may find that there are many requests to
servers not under your control. The performance problem is likely to be
due to the design of the content passed to the browser.

For example, I just requested to initially load an
application-generated page and I see that the base page loaded in
722ms and then there were two more subsequent loads in parallel
requiring 335ms and 445ms, and then one more load subsequent to that
requiring 262ms. The entire page load time was 1.7 seconds. The load
time was dominated by the chain of dependencies. If I reload the page
(request is now 'hot' on the server) then I see several of the
response times substantially diminish, but some others remain
virtually the same, resulting in a page load time of 1.13 seconds.

From what I have been seeing, web page load times often don't have
much at all to do with the performance of the server.

Dmitry Morozovsky

Mar 28, 2014, 4:40:04 PM
On Fri, 28 Mar 2014, Joar Jegleim wrote:

[snip most of]

> > Have you measured to see if, or do you otherwise know for sure, that you
> > really do need a ZIL? I suggest not adding a ZIL unless you are certain
> > you need it.
> Yes, I only recently realized that too, and I'm really not sure if a
> zil is required.
> Some small portion of files (som hundre MB's) are served over nfs from
> the same server, if I understand it right a zil will help for nfs
> stuff (?) , but I'm not sure if it's any gain of having a zil today.
> On the other hand, a zil doesn't have to be big, I can simply buy a
> 128GB ssd which are cheap today .

Please don't forget that, unlike the L2ARC, if you lose the ZIL during a
sync write, you've effectively lost the pool.

Hence, you have two options:

- have the ZIL on an enterprise-grade SLC SSD (aircraft-grade prices ;P)
- allocate a mirrored ZIL from a fraction of the SSDs you otherwise use for
L2ARC (the rule of thumb, if I'm not mistaken, was: take all the write
throughput your underlying disks can do in 1 second, double it, and that
will be the size of your ZIL)

We (by no means at your read pressure) use the second approach, like the
following:

  pool: br
 state: ONLINE
  scan: resilvered 13.0G in 0h3m with 0 errors on Sun Aug 18 19:52:20 2013
config:

        NAME               STATE     READ WRITE CKSUM
        br                 ONLINE       0     0     0
          mirror-0         ONLINE       0     0     0
            gpt/br0        ONLINE       0     0     0
            gpt/br4        ONLINE       0     0     0
          mirror-1         ONLINE       0     0     0
            gpt/br1        ONLINE       0     0     0
            gpt/br5        ONLINE       0     0     0
          mirror-2         ONLINE       0     0     0
            gpt/br2        ONLINE       0     0     0
            gpt/br6        ONLINE       0     0     0
          mirror-3         ONLINE       0     0     0
            gpt/br3        ONLINE       0     0     0
            gpt/br7        ONLINE       0     0     0
        logs
          mirror-4         ONLINE       0     0     0
            gpt/br-zil0    ONLINE       0     0     0
            gpt/br-zil1    ONLINE       0     0     0
        cache
          gpt/br-l2arc0    ONLINE       0     0     0
          gpt/br-l2arc1    ONLINE       0     0     0

where logs/cache are like

root@briareus:~# gpart show -l da9
=>        34  234441581  da9  GPT  (111G)
          34       2014       - free -  (1M)
        2048   16777216    1  br-zil0  (8.0G)
    16779264  217661440    2  br-l2arc0  (103G)
   234440704        911       - free -  (455k)

(this is our main PostgreSQL server, with 8 SASes and 2*Intel3500 SSDs)
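
(For reference, a sketch of how such a split layout could be created and attached - device names, labels and sizes here are illustrative only, mirroring the output above:)

# carve each SSD into a small log slice and a large cache slice
gpart create -s gpt da9
gpart add -t freebsd-zfs -a 1m -l br-zil0 -s 8g da9
gpart add -t freebsd-zfs -a 1m -l br-l2arc0 da9
# (repeat for the second SSD with labels br-zil1 / br-l2arc1, e.g. on da10)

# then attach a mirrored log and two cache devices to the pool
zpool add br log mirror gpt/br-zil0 gpt/br-zil1
zpool add br cache gpt/br-l2arc0 gpt/br-l2arc1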



--
Sincerely,
D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer: ma...@FreeBSD.org ]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- ma...@rinet.ru ***
------------------------------------------------------------------------

Berend de Boer

Mar 28, 2014, 4:52:16 PM
>>>>> "Dmitry" == Dmitry Morozovsky <ma...@rinet.ru> writes:

Dmitry> Please don't forget that, unlike L2ARC, if you lost ZIL
Dmitry> during sync write, you're effectively lost the pool.

Wow, is that true? I'm using a ZIL on Amazon AWS, and these machines
are virtual, i.e. I have no guarantee they will exist. Obviously
that's not usually a problem :-)

I thought I would just lose my sync write, and my pool would still be
there?

Can people enlighten me if this is indeed correct?

--
All the best,

Berend de Boer


Dmitry Morozovsky

Mar 28, 2014, 5:16:25 PM
On Fri, 28 Mar 2014, Freddie Cash wrote:

> > > > Have you measured to see if, or do you otherwise know for sure, that
> > you
> > > > really do need a ZIL? I suggest not adding a ZIL unless you are certain
> > > > you need it.
> > > Yes, I only recently realized that too, and I'm really not sure if a
> > > zil is required.
> > > Some small portion of files (som hundre MB's) are served over nfs from
> > > the same server, if I understand it right a zil will help for nfs
> > > stuff (?) , but I'm not sure if it's any gain of having a zil today.
> > > On the other hand, a zil doesn't have to be big, I can simply buy a
> > > 128GB ssd which are cheap today .
> >
> > Please don't forget that, unlike L2ARC, if you lost ZIL during sync write,
> > you're effectively lost the pool.
> >
>
> Nope. Not even close.
>
> The ZIL is only ever read at boot time. If you lose the ZIL between the
> time the data is written to the ZIL and the time the async write of the
> data is actually done to the pool ... and the server is rebooted at that
> time, then you get an error message at pool import.
>
> You can then force the import of the pool, losing any *data* in the ZIL,
> but nothing else.
>
> It used to be (back in the pre-ZFSv13-ish days) that if you lost the ZIL
> while there was data in it that wasn't yet written to the pool, the pool
> would fault and be gone. Hence the rule-of-thumb to always mirror the ZIL.
>
> Around ZFSv14-ish, the ability to import a pool with a missing ZIL was
> added.
>
> Remember the flow of data in ZFS:
> async write request --> TXG --> disk
> sync write request  --> ZIL
>                          \--> TXG --> disk
>
> All sync writes are written to the pool as part of a normal async TXG after
> it's written sync to the ZIL. And the ZIL is only ever read during pool
> import.
>
> [Note, I'm not a ZFS developer so some of the above may not be 100%
> accurate, but the gist of it is.]

Ah, thanks, I stand corrected.

Great that we've tightened the window in which we could possibly lose precious data.

Dmitry Morozovsky

Mar 28, 2014, 5:19:26 PM
On Fri, 28 Mar 2014, Freddie Cash wrote:

[snip most again]

> Around ZFSv14-ish, the ability to import a pool with a missing ZIL was
> added.
>
> Remember the flow of data in ZFS:
> async write request --> TXG --> disk
> sync write request --> ZIL
> \--> TXG --> disk
>
> All sync writes are written to the pool as part of a normal async TXG after
> its written sync to the ZIL. And the ZIL is only ever read during pool
> import.

On the other side, doesn't that put sync-dependent systems, like databases,
at risk?

I'm thinking not about losing the transaction, but about possibly leaving
your filesystem in the middle of a (database PoV) transaction, hence
rendering your DB inconsistent?

Quick googling seems to be uncertain about it...

mikej

Mar 28, 2014, 5:35:39 PM
On 2014-03-28 17:19, Dmitry Morozovsky wrote:
> On Fri, 28 Mar 2014, Freddie Cash wrote:
>
> [snip most again]
>
>> Around ZFSv14-ish, the ability to import a pool with a missing ZIL
>> was
>> added.
>>
>> Remember the flow of data in ZFS:
>> async write request --> TXG --> disk
>> sync write request --> ZIL
>> \--> TXG --> disk
>>
>> All sync writes are written to the pool as part of a normal async
>> TXG after
>> its written sync to the ZIL. And the ZIL is only ever read during
>> pool
>> import.
>
> On the other side, doesn't it put the risk on sync-dependent, like
> database,
> systems?
>
> I'm thinking not about losing the transaction, but possibly putting
> your
> filesystem in the middle of (database PoV) transaction, hence render
> your DB
> inconsistent?
>
> Quick googling seems to be uncertain about it...

As I understand it..... (And I am always looking for an education)

Any file system that honors fsync, provided the DB actually uses fsync,
should be fine.

Any data loss then will only be determined by what transaction (log)
capabilities the DB has.

--mikej

Dmitry Morozovsky

Mar 28, 2014, 5:45:08 PM
On Fri, 28 Mar 2014, mikej wrote:

> > [snip most again]
> >
> > > Around ZFSv14-ish, the ability to import a pool with a missing ZIL was
> > > added.
> > >
> > > Remember the flow of data in ZFS:
> > > async write request --> TXG --> disk
> > > sync write request --> ZIL
> > > \--> TXG --> disk
> > >
> > > All sync writes are written to the pool as part of a normal async TXG
> > > after
> > > its written sync to the ZIL. And the ZIL is only ever read during pool
> > > import.
> >
> > On the other side, doesn't it put the risk on sync-dependent, like database,
> > systems?
> >
> > I'm thinking not about losing the transaction, but possibly putting your
> > filesystem in the middle of (database PoV) transaction, hence render your DB
> > inconsistent?
> >
> > Quick googling seems to be uncertain about it...
>
> As I understand it..... (And I am always looking for an education)
>
> Any files system that honors fsync and provided the DB uses fsync should be
> fine.
>
> Any data loss then will only be determined by what transaction (log)
> capabilities the DB has.

And?

1. The DB issues a "sync WAL" request, which is translated into fsync-like FS
requests, which (IIUC) should be directed to the ZIL.

2. The ZIL fails in the middle of the request, or, even worse, after
reporting that the ZIL transaction is done but before the ZIL contents
reach the underlying media.

3. Inconsistent DB?

I hope I'm wrong somewhere...

--
Sincerely,
D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer: ma...@FreeBSD.org ]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- ma...@rinet.ru ***
------------------------------------------------------------------------

Dmitry Morozovsky

Mar 28, 2014, 5:52:17 PM
On Fri, 28 Mar 2014, Freddie Cash wrote:

> > I'm thinking not about losing the transaction, but possibly putting your
> > filesystem in the middle of (database PoV) transaction, hence render your
> > DB
> > inconsistent?
> >
> > Quick googling seems to be uncertain about it...
> >
>
> That I don't know. Again, I'm not a ZFS code guru; just a very
> happy/active ZFS user and reader of stuff online. :)
>
> You're thinking of the small window where:
> - database writes transaction to disk
> - zfs writes the data to the ZIL on the log vdev
> - zfs returns "data is written to disk" to the DB
> - zfs queues up the write to the pool
> - the log device dies
> - the pool is forcibly exported/server loses power

Pretty much the same as what I've just written in the parallel reply ;)

> Such that the DB considers the transaction complete and the data safely
> written to disk, but it's actually only written to the ZIL on the separate
> log device (which no longer exists) and is not stored in the pool yet.

So, if the ZIL dies but the server stays alive, the sync write will still be
done, IIUC from your and others' comments? Then, of course, that shrinks the
window of danger to nearly zero.

> Yeah, that could be a problem. A very unlikely event, although not
> entirely impossible.
>
> I would think it would be up to the database to be able to roll back a
> database to prior to the corrupted transaction. If the DB has a log or
> journal or whatever, then it could be used to roll-back, no?

If it could detect this situation, yes. I'm not sure detecting this after
previously getting a "write has been done" state would be simple, though ;P

> It's still considered best practice to use a mirrored log device. It's just no
> longer required, nor does a dead log lead to a completely dead pool.

Well, I suppose the middle ground is something like "weigh your ability to
lose data, and prepare measures accordingly" ;P

For the database case, I'm not so sure, alas.

But, anyway, thank you very much for the thoughtful comments!