I am basically trying to create a write-back buffer (I know it's not
battery-backed), but the problem is that I can't seem to control the
page flush: it always seems to bypass the I/O elevator when a page
flush needs to be done, which simply dumps all the huge I/Os on the
disk and delays all the others.
This makes the disk behave really bursty. If I set the page-flush
ratio lower, then I can't utilize all the memory at all. Is there any
way to make all the writes "best effort"?
Has anyone else had any experience with this background dirty-page
flush problem for file I/O?
Thanks.
_______________________________________________
Iscsitarget-devel mailing list
Iscsitar...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/iscsitarget-devel
The way pdflush works right now prevents me from doing that, because
it has no way of restricting the flush I/O.
------------------------------------------------------------------------------
Okay, I completely agree with what you say; that's probably how the
BBU cache on a RAID card works anyway.
So to get into practical terms, what I was tweaking before is:
dirty_background_bytes -- 1024M; this is when it should start writing
out dirty pages
dirty_bytes -- 2048M
dirty_expire_centisecs -- default (30 s)
dirty_writeback_centisecs -- default (flusher wakes every 5 seconds?)
And the huge page-flush I/O destroyed all other activity, making it very bursty.
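For reference, a sketch of how that first configuration maps onto the vm sysctls (values as above; run as root):

```shell
# Original tuning: large background threshold, kernel defaults elsewhere
sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024))  # start background flush at 1 GB dirty
sysctl -w vm.dirty_bytes=$((2048 * 1024 * 1024))             # block writers once 2 GB is dirty
sysctl -w vm.dirty_expire_centisecs=3000                     # default: pages expire after 30 s
sysctl -w vm.dirty_writeback_centisecs=500                   # default: flusher wakes every 5 s
```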
What you were suggesting is probably this:
dirty_background_bytes -- 5M
dirty_bytes -- 2048M
dirty_expire_centisecs -- very long? hours?
dirty_writeback_centisecs -- default (5-second wakeup?)
So what would happen is that if a very fast client is writing, it
would eventually fill up the 2G buffer and be blocked, while the page
flush happens every 5 MB. But I am somewhat skeptical because of the
lack of control over flush I/O size: it will simply flush everything
that is available to flush, not just the expired pages.
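A sketch of that proposed variant in sysctl form (the ~4-hour expiry is just an example value for "very long", not something from this thread):

```shell
# Proposed tuning: tiny background threshold, very long expiry
sysctl -w vm.dirty_background_bytes=$((5 * 1024 * 1024))    # kick off flushing at 5 MB dirty
sysctl -w vm.dirty_bytes=$((2048 * 1024 * 1024))            # hard cap stays at 2 GB
sysctl -w vm.dirty_expire_centisecs=$((4 * 60 * 60 * 100))  # example: expire after ~4 hours
sysctl -w vm.dirty_writeback_centisecs=500                  # default 5 s wakeup
```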
4 disks
>
> What type of disks was that raid 10?
500 GB SATA, and it was doing ~700 IOPS under the same workload without any tweaking.
>
> What size is that raid 10?
1 TB, so caching it completely in memory is not feasible.
>
>> What Linux kernel?
>> 2.6.29, Ubuntu LTS
>>
>> What would you like to achieve?
>>
>> At first I was just using WT mode and relying on the RAID card
>> write buffer, but I want to use 2 GB of RAM as a secondary write
>> cache. From what I read (kernel code and the documents I could
>> find), I think the page cache is just what I need, except for
>> one thing: I can't control the page flush. Ideally I guess I
>> want to make all writes best-effort, only using available
>> bandwidth unless there's a buffer under-run. I realize that is
>> probably hard, but doable I'm sure; no one seems to care enough
>> to implement it.
>
> This document is good:
>
> http://www.westnet.com/~gsmith/content/linux-pdflush.htm
>
> Lets take a look at the tunables:
>
> dirty_bytes/dirty_ratio: total amount of dirty memory allowed
> for process before process is blocked for flushing.
>
> - I would keep this high, say 50% of total memory cause if
> this is hit the results could be unpredictable for IET, maybe
> all targets get blocked, maybe none, more investigation is
> needed here.
Exactly, this controls the total upper limit to prevent a serious
under-run. I plan to set it to 2G.
>
> dirty_background_bytes/dirty_background_ratio: total amount of
> dirty memory before a background flush operation is started.
>
> - Keep this small, say 256MB or 512MB to make sure the
> controller can swallow it up in a single operation so the
> flushes are tiny blips on the radar
The controller has a 256 MB write buffer, so I guess I should set it
to 256M here. What's weird is that, in reality, the controller
doesn't just swallow it into the write cache; that's why I am seeing
huge I/Os blocking all other activity for at least 1 second when a
page flush happens.
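One way to see whether the controller absorbs a flush in one gulp is to sample the page-cache counters while the workload runs (a sketch using /proc/meminfo; a big Dirty drop with a long-lived Writeback spike suggests the flush is stalling on the disks):

```shell
#!/bin/sh
# Sample the dirty and writeback page-cache counters once per second.
for i in 1 2 3; do
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sleep 1
done
```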
>
> dirty_expire_centisecs: total time a page can be dirty before
> it is flushed
>
> - Keep this small, say 1-3 seconds for data reliability in the
> face of an accidental power-off or kernel panic. This one
> tells you the recovery point of the volume, very important.
I don't actually care much about preserving the data, since the data
is not very important, just highly volatile; it's mostly page-swap
data anyway, and RAID 10 is used just to reduce downtime. Keeping
this small basically forces Linux to flush pages; I think I can set
it to at least minutes. I will experiment with that.
>
> Now the real benefit to page cached IET volumes is read data
> caching, you want to get your whole workload in memory as
> fast as you can, then you want to completely operate out of
> that, so only writes go to disk. Of course this won't be
> completely possible, but with enough RAM it can be
> significantly reduced so the read operations fit nicely within
> the flush operations.
I see what you mean; that's exactly how it operates now. But in WT
mode, the write operation will not succeed until the disk layer
confirms it, right? That could unnecessarily delay things.
And by the way, what about SYNC operations that IET receives? I know
upper-layer metadata operations will sync before proceeding; how does
that work in the iSCSI world?
I'm sorry, the ~700 IOPS is actually measured in the upper-layer
application with IET running in WT mode, with a 2G read cache.
The real IOPS to the disk seem to agree, somewhat less. Here's
output from iostat -x -d 1; the disk is completely stressed, by the
way. The deadline I/O scheduler is enabled; noop seems to do worse.
Device:   rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdb         0.00  1095.00  200.00  464.00  1600.00 12472.00    21.19     5.59   7.70   1.51 100.00

Device:   rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdb         0.00  1088.00  187.00  389.00  1496.00 11816.00    23.11     5.74  10.17   1.74 100.00
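For completeness, the scheduler choice mentioned above can be inspected and switched per device via sysfs (a sketch; /dev/sdb as in the iostat output, switching needs root):

```shell
cat /sys/block/sdb/queue/scheduler             # active scheduler is shown in [brackets]
echo deadline > /sys/block/sdb/queue/scheduler # switch to deadline (root)
```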
>> >
>> > What size is that raid 10?
>>
>> 1 TB, so caching it completely in memory is not feasible.
>
> Does each client read 1TB of data all the time?
No. To be precise, this is the swap paging area for 50-odd
independent virtual machines. The real data is only about 200 GB on
disk for now, but that is still beyond loading completely into
memory. For a situation like this, I think the more the better. The
active working set can be somewhat estimated from the write
bandwidth, which is about 5 MB/s, from dstat:
--dsk/sdb-- --net/eth0-
read writ| recv send
680k 5552k|4421k 2228k
628k 4544k|3533k 1491k
976k 3984k|2751k 2810k
>
> No, it's only the current active working set that matters.
>
> So, say it's mysql, and you've figured out that the max
> table size is X and the min is Y and the average join is
> 4 tables then (((X + Y) / 2) * 4) is the client's working
> set. Say you have 8 clients, multiply that by 8.
>
> Set read-ahead on block devices to be able to pull that
> working set into memory as quick as possible without
> impacting each other or the background writes.
I'm not sure that would help, since the disk is completely stressed
now. Read-ahead is disabled all the way down to the controller;
should I change that?
Do you have insight into why it doesn't behave like you said?
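If read-ahead does get re-enabled, it can be inspected and set per block device with blockdev (a sketch; the 4096-sector value is just an example, and setting it needs root):

```shell
blockdev --getra /dev/sdb       # current read-ahead, in 512-byte sectors
blockdev --setra 4096 /dev/sdb  # example: ~2 MB of read-ahead
```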
>
>> >
>> > dirty_expire_centisecs: total time a page can be dirty before
>> > it is flushed
>> >
>> > - Keep this small, say 1-3 seconds for data reliability in the
>> > face of an accidental power-off or kernel panic. This one
>> > tells you the recovery point of the volume, very important.
>>
>> I don't actually care much about preserving the data, since the data
>> is not very important, just highly volatile; it's mostly page-swap
>> data anyway, and RAID 10 is used just to reduce downtime. Keeping
>> this small basically forces Linux to flush pages; I think I can set
>> it to at least minutes. I will experiment with that.
>
> Even if the data isn't essential, don't do minutes, the default
> 30 seconds should be good enough.
>
> If the data is only swap then I would do sparse or flat files
> on top of a file system and let the file system worry about how
> best to handle the page cache.
I see. I was just worried that it might introduce even more delay,
but I can definitely give it a try.
>
>> >
>> > Now the real benefit to page cached IET volumes is read data
>> > caching, you want to get your whole workload in memory as
>> > fast as you can, then you want to completely operate out of
>> > that, so only writes go to disk. Of course this won't be
>> > completely possible, but with enough RAM it can be
>> > significantly reduced so the read operations fit nicely within
>> > the flush operations.
>>
>> I see what you mean, that's exactly how it operate now. but in WT
>> mode, the write operation will not success until disk layer confirms,
>> right? that could unnecessarily delay things up.
>
> If the disks are slow it will delay.
>
>> And by the way, what about SYNC operations that IET receives? I know
>> upper layer metadata operations will sync it before goes, how would
>> that work in the iscsi world?
>
> When IET gets a sync from disk it flushes the whole target disk
> page cache, which could mean the whole RAID 10 for some devices.
I think we might be talking about different things. I'm guessing the
upper layer is sending operations with the SYNC flag set, not just
fsync(), so does IET just submit I/O with the SYNC flag as well?