[Iscsitarget-devel] Fileio Background dirty page flush


Yucong Sun (叶雨飞)

Feb 13, 2012, 12:27:33 AM
to iscsitarget-devel
Hi,

I am basically trying to create a write-back buffer (I know it's not
battery-backed), but the problem is I can't seem to control the page
flush: it always seems to bypass the IO elevator when a page flush
needs to be done, which simply dumps all the huge IOs on the disk and
delays all the others.

This makes the disk behave really bursty. If I set the page flush
ratio lower, then I can't utilize all the memory. Is there any way to
make all the writes "best effort"?

Has anyone else had experience with this file IO background dirty
page flush problem?


Thanks.

_______________________________________________
Iscsitarget-devel mailing list
Iscsitar...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/iscsitarget-devel

Ross S. W. Walker

Feb 13, 2012, 9:48:28 AM
to Yucong Sun (叶雨飞), iscsitarget-devel
On Feb 13, 2012, at 12:28 AM, "Yucong Sun (叶雨飞)" <suny...@gmail.com> wrote:

> Hi,
>
> I am basically trying to create a write-back buffer (I know it's not
> battery-backed), but the problem is I can't seem to control the page
> flush: it always seems to bypass the IO elevator when a page flush
> needs to be done, which simply dumps all the huge IOs on the disk and
> delays all the others.
>
> This makes the disk behave really bursty. If I set the page flush
> ratio lower, then I can't utilize all the memory. Is there any way to
> make all the writes "best effort"?
>
> Has anyone else had experience with this file IO background dirty
> page flush problem?

The key is not to hold a lot of dirty pages in memory: flush frequently, but limit the size of each flush, so instead of one big flush you get a whole bunch of small flushes, which allows IO to resume quickly.

Check out this document for further information.

http://www.kernel.org/doc/Documentation/sysctl/vm.txt

Try setting the max amount of memory reserved for dirty pages to, say, 1GB; the max time a page can stay dirty to 10-50 centiseconds; the max pdflush size to about 1MB; and the max pdflush interval to about 5 centiseconds.

This is just off the top of my head, and it's highly dependent on the backend storage, so a lot of experimenting is needed.
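As a sketch, those suggestions map onto the vm.* keys documented in the vm.txt file above roughly like this (the byte values are starting points for experiment, not recommendations):

```
# /etc/sysctl.conf fragment -- rough mapping of the suggestion above
vm.dirty_bytes = 1073741824           # ~1 GB ceiling on dirty page memory
vm.dirty_background_bytes = 1048576   # start background writeback at ~1 MB
vm.dirty_expire_centisecs = 30        # a page may stay dirty ~0.3 s (10-50 suggested)
vm.dirty_writeback_centisecs = 5      # flusher thread wakes every 0.05 s
```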

-Ross
 


This e-mail, and any attachments thereto, is intended only for use by the addressee(s) named herein and may contain legally privileged and/or confidential information. If you are not the intended recipient of this e-mail, you are hereby notified that any dissemination, distribution or copying of this e-mail, and any attachments thereto, is strictly prohibited. If you have received this e-mail in error, please immediately notify the sender and permanently delete the original and any copy or printout thereof.

Yucong Sun (叶雨飞)

Feb 13, 2012, 10:52:11 AM
to Ross S. W. Walker, iscsitarget-devel
Yeah, I know that, but that basically forces me to use less RAM than
desired. I *want* to use a 2G write-back buffer; let's say I'm
comfortable having 1G constantly full (start the background flush
when 1G is full, and when it hits 1.5G, force subsequent writes to
wait for previous room to be flushed).

The nature of pdflush right now forbids me from doing that, because it
has no way of restricting the flush IO.


Ross S. W. Walker

Feb 13, 2012, 11:20:27 AM
to suny...@gmail.com, iscsitarget-devel
Yucong Sun (叶雨飞) [mailto:suny...@gmail.com] wrote:
>
> Yeah, I know that, but that basically forces me to use less RAM than
> desired. I *want* to use a 2G write-back buffer; let's say I'm
> comfortable having 1G constantly full (start the background flush
> when 1G is full, and when it hits 1.5G, force subsequent writes to
> wait for previous room to be flushed).

I wouldn't think of it as persistent, because it doesn't work that way.

It will use as much RAM as possible, but like all write-back buffers
it's only meant as a buffer for writes. If it has any data in it, it
should be writing it out. It just needs to be timed so that it
never fills up completely and stalls write operations with a
complete flush.

If you approach it like an optical disk's write buffer, where an
overrun occurs if it fills, then you will be on the right track. Just
make sure it writes out in a constant staccato pattern to allow
other operations to happen between each write.

It's an art, not a science: you just manipulate the settings
until you get what you want.

Try with two clients, one doing 100% writes of a fixed size (test
with varying sizes) and the other doing 100% reads of the same
fixed size. Keep those going with one second stat outputs while
tuning the settings until you get the desired outcome, then change
the block sizes up, then down, then up and so on. You can even run
a criss-cross test with small reads vs big writes, and change them
until you have big reads vs small writes.

Once your storage is tuned properly, your system will outperform
systems that normally cost tens of thousands of dollars.

-Ross

Yucong Sun (叶雨飞)

Feb 13, 2012, 12:27:23 PM
to Ross S. W. Walker, iscsitarget-devel
On Mon, Feb 13, 2012 at 8:20 AM, Ross S. W. Walker
<RWa...@medallion.com> wrote:
> Yucong Sun (叶雨飞) [mailto:suny...@gmail.com] wrote:
>>
>> Yeah, I know that, but that basically forces me to use less RAM than
>> desired. I *want* to use a 2G write-back buffer; let's say I'm
>> comfortable having 1G constantly full (start the background flush
>> when 1G is full, and when it hits 1.5G, force subsequent writes to
>> wait for previous room to be flushed).
>
> I wouldn't think of it as persistent cause it doesn't work that way.
>
> It will use as much RAM as possible, but as all write-back buffers
> it's only meant as a buffer for writes. If it has any data in it then
> it should be writing it out. It just needs to be timed so that it
> never fills up completely and stalls write operations with a
> complete flush.

Okay, I completely agree with what you say; that's probably how the
BBU cache on a RAID card does it anyway.

So to get into practical terms, what I was tweaking before is:

dirty_background_bytes -- 1024M, this is when it should start writing
out dirty pages
dirty_bytes -- 2048M
dirty_expire_centisecs -- default 30s
dirty_writeback_centisecs -- default 1/5 second?

And the huge page flush IO destroyed all other activity, making it very bursty.


What you were saying is probably this:

dirty_background_bytes -- 5MB
dirty_bytes -- 2048M
dirty_expire_centisecs -- very long? hours?
dirty_writeback_centisecs -- default 1/5 second?

So what would happen is that if a very fast client is writing, it
would eventually fill up the 2G buffer and be blocked, while the page
flush happens every 5MB. But I am somewhat skeptical: because of the
lack of control over the flush IO size, it will simply flush everything
that is available to flush, not just the expired pages.

Ross S. W. Walker

Feb 13, 2012, 12:48:10 PM
to Yucong Sun (叶雨飞), iscsitarget-devel
Let's start simple,

What is your setup? (Processors, NICs, HDDs, controllers, RAID, etc.)

What Linux kernel?

What would you like to achieve?

-Ross

Yucong Sun (叶雨飞)

Feb 13, 2012, 12:58:21 PM
to Ross S. W. Walker, iscsitarget-devel
What is your setup? (Processors, NICs, HDDs, controllers, RAID, etc.)

a normal Linux server, with a 256M BBU hardware RAID 10 on a PERC 6/i as the disk backend (system on another disk), running ietd trunk. It has 8 workers constantly writing highly volatile data to random locations (meaning the data normally gets changed right away after the last write)

What Linux kernel?
2.6.29, Ubuntu LTS

What would you like to achieve?

At first I was just using WT mode and relying on the RAID card write buffer, but I want to use 2G of RAM as a secondary write cache. From what I've read (kernel code and the documents I could find), I think the page cache is just what I need, except for one thing: I can't control the page flush. Ideally I want to make all writes best-effort, only using available bandwidth unless there's a buffer under-run. I realize that is probably hard, but surely doable; no one seems to care enough to implement it, though.

Now I am resorting to the idea of writing at a fixed bandwidth, let's say 100 IOPS on my 800-IOPS storage backend, so I can avoid a lot of soon-to-change writes.

Cheers.

Ross S. W. Walker

Feb 13, 2012, 1:50:55 PM
to suny...@gmail.com, iscsitarget-devel
Yucong Sun (叶雨飞) [mailto:suny...@gmail.com] wrote:
>
> What is your setup? (Processors, NICs, HDDs, controllers, RAID, etc.)
>
> a normal Linux server, with a 256M BBU hardware RAID 10 on a
> PERC 6/i as the disk backend (system on another disk), running
> ietd trunk. It has 8 workers constantly writing highly volatile
> data to random locations (meaning the data normally gets changed
> right away after the last write)

The 6/i isn't as good as the 6/e, but it will do if space is tight.

How many disks was that raid 10?

What type of disks was that raid 10?

What size is that raid 10?

> What Linux kernel?
> 2.6.29 , ubuntu lts
>
> What would you like to achieve?
>
> at first I was just using WT mode and relying on raid card
> write buffer, but I want to use 2G ram as a secondary write
> cache, from what I read (kernel code and documents i can
> found), I think page cache is just what I need, except for
> one thing, I can't control the page flush , ideally I guess I
> want to make all write in best effort mode, only use
> available bandwidth unless there's a buffer under-run, but I
> realize that it is probably hard, but doable I'm sure, but no
> one seems to care enough to implement this.

This document is good:

http://www.westnet.com/~gsmith/content/linux-pdflush.htm

Let's take a look at the tunables:

dirty_bytes/dirty_ratio: the total amount of dirty memory allowed
for a process before the process is blocked for flushing.

- I would keep this high, say 50% of total memory, because if
this limit is hit the results could be unpredictable for IET; maybe
all targets get blocked, maybe none. More investigation is
needed here.

dirty_background_bytes/dirty_background_ratio: the total amount of
dirty memory before a background flush operation is started.

- Keep this small, say 256MB or 512MB, to make sure the
controller can swallow it up in a single operation so the
flushes are tiny blips on the radar.

dirty_expire_centisecs: the total time a page can be dirty before
it is flushed.

- Keep this small, say 1-3 seconds, for data reliability in the
face of an accidental power-off or kernel panic. This one
tells you the recovery point of the volume; very important.

Now, the real benefit of page-cached IET volumes is read data
caching: you want to get your whole workload into memory as
fast as you can, then operate completely out of that, so only
writes go to disk. Of course this won't be completely possible,
but with enough RAM disk reads can be significantly reduced so
the read operations fit nicely between the flush operations.
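Pulled together, the three recommendations above might look like this as a sysctl fragment (the values are the ones suggested in this message, not tested on this hardware):

```
# /etc/sysctl.conf fragment -- the tuning described above
vm.dirty_ratio = 50                    # block writers only past 50% of RAM
vm.dirty_background_bytes = 268435456  # background flush starts at 256 MB
vm.dirty_expire_centisecs = 300        # dirty pages flushed within ~3 s
```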

-Ross



Yucong Sun (叶雨飞)

Feb 13, 2012, 2:07:25 PM
to Ross S. W. Walker, iscsitarget-devel
On Mon, Feb 13, 2012 at 10:50 AM, Ross S. W. Walker
<RWa...@medallion.com> wrote:
> Yucong Sun (叶雨飞) [mailto:suny...@gmail.com] wrote:
>>
>> What is your setup? (Processors, NICs, HDDs, controllers, RAID, etc.)
>>
>> a normal linux server,  with 256M BBU hardware raid 10
>> perc/6i as disk backend (system on another disk) , running
>> ietd trunk.  It has 8 workers constantly writing highly volatile
>> data to random locations (meaning they normally get changed
>> right away after the last write)
>
> The 6/i isn't as good as the 6/e but if space is tight.
>
> How many disks was that raid 10?

4 disks


>
> What type of disks was that raid 10?

500G SATA, and it's delivering about 700 IOPS under the same workload without any tweaks


>
> What size is that raid 10?

1TB, so caching it all in memory is not feasible.

>
>> What Linux kernel?
>> 2.6.29  , ubuntu lts
>>
>> What would you like to achieve?
>>
>> at first I was just using WT mode and relying on raid card
>> write buffer, but I want to use 2G ram as a secondary write
>> cache, from what I read (kernel code and documents i can
>> found), I think page cache is just what I need, except for
>> one thing, I can't control the page flush , ideally I guess I
>> want to make all write in best effort mode,  only use
>> available bandwidth unless there's a buffer under-run, but I
>> realize that it is probably hard, but doable I'm sure, but no
>> one seems to care enough to implement this.
>
> This document is good:
>
> http://www.westnet.com/~gsmith/content/linux-pdflush.htm
>
> Lets take a look at the tunables:
>
> dirty_bytes/dirty_ratio: total amount of dirty memory allowed
> for a process before the process is blocked for flushing.

>
> - I would keep this high, say 50% of total memory cause if
> this is hit the results could be unpredictable for IET, maybe
> all targets get blocked, maybe none, more investigation is
> needed here.

Exactly; this controls the total upper limit to prevent a serious
underrun. I plan to set it to 2G.

>
> dirty_background_bytes/dirty_background_ratio: total amount of
> dirty memory before a background flush operation is started.
>
> - Keep this small, say 256MB or 512MB to make sure the
> controller can swallow it up in a single operation so the
> flushes are tiny blips on the radar

The controller has a 256M write buffer, so I guess I should set it to
256M here. What's weird is what I observe in reality: the controller
doesn't just swallow it into the write cache, which is why I am
seeing huge IOs blocking all other activity for at least 1 second
when a page flush happens.

>
> dirty_expire_centisecs: total time a page can be dirty before
> it is flushed
>
> - Keep this small, say 1-3 seconds for data reliability in the
> face of an accidental power-off or kernel panic. This one
> tells you the recovery point of the volume, very important.

I don't actually care too much about preservation of the data, since
the data is not very important, just highly volatile; it's mostly
page-swap data anyway, and I use RAID 10 just to reduce downtime.
Keeping this small basically forces Linux to flush pages; I think I
can set it to at least minutes. I will experiment with that.

>
> Now the real benefit to page cached IET volumes is read data
> caching, you want to get your whole workload in memory as
> fast as you can, then you want to completely operate out of
> that, so only writes go to disk. Of course this won't be
> completely possible, but with enough RAM it can be
> significantly reduced so the read operations fit nicely within
> the flush operations.

I see what you mean; that's exactly how it operates now. But in WT
mode, the write operation will not succeed until the disk layer
confirms, right? That could unnecessarily delay things.

And by the way, what about SYNC operations that IET receives? I know
upper-layer metadata operations will sync before proceeding; how does
that work in the iSCSI world?

Ross S. W. Walker

Feb 13, 2012, 2:52:44 PM
to suny...@gmail.com, iscsitarget-devel
Yucong Sun (叶雨飞) [mailto:suny...@gmail.com] wrote:
> On Mon, Feb 13, 2012 at 10:50 AM, Ross S. W. Walker
> <RWa...@medallion.com> wrote:
> > Yucong Sun (叶雨飞) [mailto:suny...@gmail.com] wrote:
> >>
> >> What is your setup? (Processors, NICs, HDDs, controllers, RAID, etc.)
> >>
> >> a normal linux server,  with 256M BBU hardware raid 10
> >> perc/6i as disk backend (system on another disk) , running
> >> ietd trunk.  It has 8 workers constantly writing highly volatile
> >> data to random locations (meaning they normally get changed
> >> right away after the last write)
> >
> > The 6/i isn't as good as the 6/e but if space is tight.
> >
> > How many disks was that raid 10?
>
> 4 disks
>
> >
> > What type of disks was that raid 10?
>
> 500G sata, and it's operating with 700iops under same
> workload without any tweak

It is impossible for 4 7200 RPM disks to deliver 700 IOPS in
a RAID 10. This is of course for random IO; it makes no sense
to measure sequential IO in IOPS, as that is throughput and
is measured in bytes/sec.

SATA disks have an average seek of 8-12ms, and 7200 RPM drives
have an average rotational latency of about 4ms, so each IO takes
12-16ms to seek and rotate. This means each SATA disk can do
62-84 IOPS.

In a perfectly designed RAID 10 each disk can read independently,
giving 248-336 IOPS for reads, but write IOPS is limited to the
number of mirror pairs, which means 124-168 IOPS for writes. I
suspect the PERCs don't do independent reads, as that takes more
logic, which means more $$$, so bet that your array can only handle
124-168 IOPS both reading and writing.
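The arithmetic above can be checked quickly; this is a small sketch assuming a 10 ms average seek and 4 ms rotational latency, the midpoints of the figures just quoted:

```shell
# Estimate random IOPS for a 4-disk RAID 10 of 7200 RPM SATA drives.
# Seek plus rotational latency gives the time per random IO in milliseconds.
awk 'BEGIN {
    seek = 10; rot = 4                    # ms, averages from above
    per_disk = 1000 / (seek + rot)        # ~71 IOPS per spindle
    printf "per-disk:      %.0f IOPS\n", per_disk
    printf "RAID10 reads:  %.0f IOPS\n", per_disk * 4   # 4 independent spindles
    printf "RAID10 writes: %.0f IOPS\n", per_disk * 2   # 2 mirror pairs
}'
```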

> >
> > What size is that raid 10?
>
> 1Tb, so complete memory is not feasible.

Does each client read 1TB of data all the time?

No, it's only the current active working set that matters.

So, say it's mysql, and you've figured out that the max
table size is X and the min is Y and the average join is
4 tables; then (((X + Y) / 2) * 4) is the client's working
set. Say you have 8 clients; multiply that by 8.

Set the read-ahead on the block devices so they can pull that
working set into memory as quickly as possible without
impacting each other or the background writes.

> >
> >> What Linux kernel?
> >> 2.6.29  , ubuntu lts
> >>
> >> What would you like to achieve?
> >>
> >> at first I was just using WT mode and relying on raid card
> >> write buffer, but I want to use 2G ram as a secondary write
> >> cache, from what I read (kernel code and documents i can
> >> found), I think page cache is just what I need, except for
> >> one thing, I can't control the page flush , ideally I guess I
> >> want to make all write in best effort mode,  only use
> >> available bandwidth unless there's a buffer under-run, but I
> >> realize that it is probably hard, but doable I'm sure, but no
> >> one seems to care enough to implement this.
> >
> > This document is good:
> >
> > http://www.westnet.com/~gsmith/content/linux-pdflush.htm
> >
> > Lets take a look at the tunables:
> >
> > dirty_bytes/dirty_ratio: total amount of dirty memory allowed
> > for process before process is blocked for flushing.
> >
> > - I would keep this high, say 50% of total memory cause if
> > this is hit the results could be unpredictable for IET, maybe
> > all targets get blocked, maybe none, more investigation is
> > needed here.
>
> Exactly, this controls the total up limit to prevent serious underun,
> I plan to set it to 2G.

What this tells the kernel is: if this limit is reached, block the
process until the flush completes. All target threads will probably
be included in this calculation. Make sure it isn't hit.

> >
> > dirty_background_bytes/dirty_background_ratio: total amount of
> > dirty memory before a background flush operation is started.
> >
> > - Keep this small, say 256MB or 512MB to make sure the
> > controller can swallow it up in a single operation so the
> > flushes are tiny blips on the radar
>
> The controller has a 256M write buffer, so I guess I should set it to
> 256M here. What's weird is what I observe in reality: the controller
> doesn't just swallow it into the write cache, which is why I am
> seeing huge IOs blocking all other activity for at least 1 second
> when a page flush happens.

Then tune it down until it doesn't.

> >
> > dirty_expire_centisecs: total time a page can be dirty before
> > it is flushed
> >
> > - Keep this small, say 1-3 seconds for data reliability in the
> > face of an accidental power-off or kernel panic. This one
> > tells you the recovery point of the volume, very important.
>
> I don't actually care too much about preservation of the data, since
> the data is not very important, just highly volatile; it's mostly
> page-swap data anyway, and I use raid10 just to reduce downtime.
> Keeping this small basically forces Linux to flush pages; I think I
> can set it to at least minutes. I will experiment with that.

Even if the data isn't essential, don't use minutes; the default
30 seconds should be good enough.

If the data is only swap, then I would use sparse or flat files
on top of a file system and let the file system worry about how
best to handle the page cache.
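For example, a sparse backing file can be created without allocating any blocks up front; the path here is just for illustration:

```shell
# Create a 20 GB sparse file to serve as a fileio backing store.
# Blocks are only allocated as the initiator actually writes.
mkdir -p /tmp/iet
truncate -s 20G /tmp/iet/swap01.img
ls -lhs /tmp/iet/swap01.img   # apparent size 20G, allocated size near zero
```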

> >
> > Now the real benefit to page cached IET volumes is read data
> > caching, you want to get your whole workload in memory as
> > fast as you can, then you want to completely operate out of
> > that, so only writes go to disk. Of course this won't be
> > completely possible, but with enough RAM it can be
> > significantly reduced so the read operations fit nicely within
> > the flush operations.
>
> I see what you mean, that's exactly how it operate now. but in WT
> mode, the write operation will not success until disk layer confirms,
> right? that could unnecessarily delay things up.

If the disks are slow it will delay.

> And by the way, what about SYNC operations that IET receives? I know
> upper layer metadata operations will sync it before goes, how would
> that work in the iscsi world?

When IET gets a sync for a disk, it flushes the whole target disk's
page cache, which could mean the whole RAID 10 for some devices.

Another way is to make a big XFS file system with sparse files
for each client and serve those sparse files up over IET. Then
XFS can take care of the page cache corner cases, and if there
is a flush it only flushes that one file.
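As a sketch, the sparse-file-per-client layout might look like this in the IET config (the target name and path are made up for illustration):

```
# /etc/ietd.conf fragment -- one sparse file on XFS per client, fileio mode
Target iqn.2012-02.example.com:vm-swap01
    Lun 0 Path=/srv/xfs/swap01.img,Type=fileio
```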

Yucong Sun (叶雨飞)

Feb 13, 2012, 3:40:48 PM
to Ross S. W. Walker, iscsitarget-devel
On Mon, Feb 13, 2012 at 11:52 AM, Ross S. W. Walker

I'm sorry; the ~700 IOPS was actually measured in the upper-layer
application, with IET running in WT mode with a 2G read cache.

The real IOPS to the disk seem to agree, if somewhat less. Here's
output from iostat -x -d 1; the disk is completely stressed, by the
way. The deadline IO scheduler is enabled; noop seems to do worse.

Device:   rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb         0.00  1095.00  200.00  464.00  1600.00  12472.00     21.19      5.59   7.70   1.51 100.00
sdb         0.00  1088.00  187.00  389.00  1496.00  11816.00     23.11      5.74  10.17   1.74 100.00


>> >
>> > What size is that raid 10?
>>
>> 1Tb, so complete memory is not feasible.
>
> Does each client read 1TB of data all the time?

No, to be precise, this is the swap paging area for some 50
independent virtual machines. The real data is only about 200G on
disk for now, but that is still beyond loading completely into
memory; for a situation like this, I think the more the better. The
active working set can be somewhat estimated from the write
bandwidth, which is about 5M/s, from dstat:

--dsk/sdb-- --net/eth0-
read writ| recv send
680k 5552k|4421k 2228k
628k 4544k|3533k 1491k
976k 3984k|2751k 2810k

>
> No, it's only the current active working set that matters.
>
> So, say it's mysql, and you've figured out that the max
> table size is X and the min is Y and the average join is
> 4 tables then (((X + Y) / 2) * 4) is the client's working
> set. Say you have 8 clients, multiply that by 8.
>
> Set read-ahead on block devices to be able to pull that
> working set into memory as quick as possible without
> impacting each other or the background writes.

I'm not sure that would help, since the disk is completely stressed
now; read-ahead is disabled all the way down to the controller.
Should I change that?

Do you have insight into why it doesn't behave like you said?

>
>> >
>> > dirty_expire_centisecs: total time a page can be dirty before
>> > it is flushed
>> >
>> > - Keep this small, say 1-3 seconds for data reliability in the
>> > face of an accidental power-off or kernel panic. This one
>> > tells you the recovery point of the volume, very important.
>>
>> I don't actually care too much about preservation of the data, since
>> the data is not very important, just highly volatile; it's mostly
>> page-swap data anyway, and I use raid10 just to reduce downtime.
>> Keeping this small basically forces Linux to flush pages; I think I
>> can set it to at least minutes. I will experiment with that.
>
> Even if the data isn't essential, don't do minutes, the default
> 30 seconds should be good enough.
>
> If the data is only swap then I would do sparse or flat files
> on top of a file system and let the file system worry about how
> best to handle the page cache.

I see. I was just worried that it might introduce even more delay,
but I can definitely give it a try.

>
>> >
>> > Now the real benefit to page cached IET volumes is read data
>> > caching, you want to get your whole workload in memory as
>> > fast as you can, then you want to completely operate out of
>> > that, so only writes go to disk. Of course this won't be
>> > completely possible, but with enough RAM it can be
>> > significantly reduced so the read operations fit nicely within
>> > the flush operations.
>>
>> I see what you mean, that's exactly how it operate now. but in WT
>> mode, the write operation will not success until disk layer confirms,
>> right? that could unnecessarily delay things up.
>
> If the disks are slow it will delay.
>
>> And by the way, what about SYNC operations that IET receives? I know
>> upper layer metadata operations will sync it before goes, how would
>> that work in the iscsi world?
>
> When IET gets a sync from disk it flushes the whole target disk
> page cache, which could mean the whole RAID 10 for some devices.

I think we might be talking about different things. I'm guessing the
upper layer is sending operations with the SYNC flag, not just
fsync(), so does IET just submit IO with the SYNC flag as well?

Ross S. W. Walker

Feb 13, 2012, 4:18:41 PM
to suny...@gmail.com, iscsitarget-devel
Yucong Sun (叶雨飞) [mailto:suny...@gmail.com] wrote:
>
> I'm sorry, the ~700 iops is actually measured in upper layer
> application with IET running in WT mode, with 2G read cache.
>
> The real iops to the disk seems to agree, some what less, here's
> output from iostat -x -d 1 , the disk is completely stressed by the
> way, deadline IO is enabled, noop seems to do worse.
>
> Device:   rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
> sdb         0.00  1095.00  200.00  464.00  1600.00  12472.00     21.19      5.59   7.70   1.51 100.00
> sdb         0.00  1088.00  187.00  389.00  1496.00  11816.00     23.11      5.74  10.17   1.74 100.00

You know swap is almost 100% random in both reads and writes, so
it will always defeat write-back caching.

Best bet is to do blockio to SSD drives.

> >> >
> >> > What size is that raid 10?
> >>
> >> 1Tb, so complete memory is not feasible.
> >
> > Does each client read 1TB of data all the time?
>
> no, to be precise, this is swap paging area for 50s of independent
> virtual machine, the real data is only about 200G on disk for now, but
> it still beyond complete load into memory, For the situation like this
> , I think just the more the better. the active working set can be
> somehow determined by the write bandwidth, which is about 5m, from
> dstat
>
> --dsk/sdb-- --net/eth0-
> read writ| recv send
> 680k 5552k|4421k 2228k
> 628k 4544k|3533k 1491k
> 976k 3984k|2751k 2810k

Like I said above best bet may be SSD drives here.

> >
> > No, it's only the current active working set that matters.
> >
> > So, say it's mysql, and you've figured out that the max
> > table size is X and the min is Y and the average join is
> > 4 tables then (((X + Y) / 2) * 4) is the client's working
> > set. Say you have 8 clients, multiply that by 8.
> >
> > Set read-ahead on block devices to be able to pull that
> > working set into memory as quick as possible without
> > impacting each other or the background writes.
>
> I'm not sure that would help, since the disk is completely stressed
> now, the read-ahead is disabled all the way down to controller, should
> I change that ?

Yes, although once I realized this is swap data, I realized that
there is nothing you can do for it.


> >> The controller has a 256M write buffer, so I guess I should set it
> >> to 256M here. What's weird is what I observe in reality: the
> >> controller doesn't just swallow it into the write cache, which is
> >> why I am seeing huge IOs blocking all other activity for at least
> >> 1 second when a page flush happens.
> >
> > Then tune it down until it doesn't.
>
> Do you have insight why it doesn't behave like you said?

Swap is all random and it defeats caching; you need more back-end IOPS.

> > If the data is only swap then I would do sparse or flat files
> > on top of a file system and let the file system worry about how
> > best to handle the page cache.
>
> I see, I was just worried that it might introduce even more delay, but
> I can definitely give it a try.

I would give it a try; at worst it should be about the same.

Once again, it's swap, and swap IO is almost always 4k in size and random.

Very tough workload.

For VMware swap I put the VMs swap files on local SSD drives.

> > When IET gets a sync from disk it flushes the whole target disk
> > page cache, which could mean the whole RAID 10 for some devices.
>
> I think we might be talking about different things, I'm guess the
> upper layer is sending operations that with SYNC flag, not just
> fsync(), so does IET just submit io with SYNC flag as well?

It's different between fileio and blockio. Fileio uses
filemap_write_and_wait_range(), while blockio just unplugs and
replugs the queue on every IO.