
DBWR performance


Andrew Protasov

May 4, 2012, 2:02:50 PM
Did anybody ever push DBWR to reach max disk write speed?

I have /dev/md0, /dev/md1 and /dev/md2 with identical configuration:
raid1 on 2 3TB hdd, 6 hdd total. I can reach 180MB/sec write speed on
each of them running 3 bonnie++ concurrently. Filesystem is xfs on all
of them.

Data goes on /dev/md0, redo on /dev/md1 and undo on /dev/md2.

Execute insert /*+ append */ and write speed on /dev/md0 goes to
150-160MB/sec as seen by iostat. Good, this is close. Obviously
nothing goes to /dev/md1 and /dev/md2.

Execute insert (no append) and dbwr write speed on /dev/md0 is 0MB/sec
while lgwr on /dev/md1 reaches 130MB/sec, until the db block buffers are
full. Then dbwr speed is only 50MB/sec and it is also holding lgwr back
at 50MB/sec.

To me it looks like dbwr is incapable of writing more than 50MB/sec
and is holding everything back.

I would expect the write speed on /dev/md0 and /dev/md1 to be
180/180 MB/sec, not 50/50 MB/sec.

Adding >1 db writers or >1 slaves does not change this picture. Async
io is on, direct io is on. DB block buffers 4GB. Test table size
~10GB.
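
For reference, the two cases boil down to something like this (table names are just examples from my setup, not the exact ones), with iostat running in another terminal:

# case 1: direct-path insert - bypasses the buffer cache, dbwr stays idle
sqlplus / as sysdba <<'EOF'
truncate table big_copy;
insert /*+ append */ into big_copy select * from big_source;
commit;
EOF

# case 2: conventional insert - goes through the buffer cache and dbwr
sqlplus / as sysdba <<'EOF'
truncate table big_copy;
insert into big_copy select * from big_source;
commit;
EOF

# watch the md devices while each insert runs
iostat -k 5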

Config: fedora 16 64bit, oracle 11.2, xfs, 32GB ram, AMD FX-8120.

Andrew

Noons

May 5, 2012, 12:11:27 AM
Start here:
http://kevinclosson.wordpress.com/2012/02/06/introducing-slob-the-silly-little-oracle-benchmark/

Take the benchmark and use it to get baselines and then start changing things.
When it comes to I/O, you MUST have a defined repeatable process to test,
otherwise you'll be forever chasing your tail.

hint: if some moron emails you that Kevin is not an I/O performance expert, just
plonk the email in the rubbish bin. Don't even bother with anything else.

Andrew Protasov

May 5, 2012, 3:04:47 AM
This does not answer my question.

Did you personally ever push DBWR to reach max disk write speed?

By the way, here is an extra little nugget that could be relevant - in my
configuration DB_WRITER_PROCESSES is ignored by oracle. There is only
one process even if the parameter specifies >1. And this single process is
not enough when it tries to flush about 24GB of dirty buffers; it uses
about 99% cpu for quite some time.
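
A quick way to cross-check the parameter against what is actually running (nothing fancy):

# what the parameter says
sqlplus -s / as sysdba <<'EOF'
show parameter db_writer_processes
EOF

# what is actually there - I would expect ora_dbw0, ora_dbw1, ... one per writer
ps -ef | grep ora_dbw | grep -v grep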

Andrew

> Start here:http://kevinclosson.wordpress.com/2012/02/06/introducing-slob-the-sil...

Jonathan Lewis

May 5, 2012, 5:22:48 AM
"Andrew Protasov" <andrew....@gmail.com> wrote in message
news:a57eb2b1-e9a7-4a96...@p6g2000yqi.googlegroups.com...
I can't account for all the missing time (at present), but you have to
remember that DBWR is not allowed to write a block until LGWR has written
the redo protecting that block. This being the case my first thought would
be that you've simply pushed the system to the point where DBWR wants to
write 100 blocks (say) and has to wait for LGWR to write the redo for those
blocks to disc, so LGWR and DBWR alternate - that would halve the
throughput rate.

You didn't say how you populated the table - but you might demonstrate this
effect by populating the table with rows that are 80 bytes long, and
setting pctfree to 99 and pctused to 1. That way the redo log generated for
each block will be about 400 bytes, while the write from dbwr will be the
full block size - and this may allow dbwr to write faster because it
doesn't have to spend so much time waiting on each lgwr/dbwr cycle.
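
A minimal sketch of that sort of test - the names and row count are made up, adjust to taste:

sqlplus / as sysdba <<'EOF'
-- mostly-empty blocks: roughly 80 bytes of row data per 8K block
create table t_sparse (v varchar2(80)) pctfree 99 pctused 1;

insert into t_sparse
select rpad('x', 80, 'x') from dual connect by level <= 1000000;
commit;
EOF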

--
Regards

Jonathan Lewis
http://jonathanlewis.wordpress.com
Oracle Core (Apress 2011)
http://www.apress.com/9781430239543



Jesper Wolf Jespersen

May 5, 2012, 5:25:43 AM
Hi Andrew.

Are you sure you have all your six disks on individual SATA controllers?

If for some reason two disks are sharing a controller you may see the
disks sharing one I/O channel, thus limiting throughput.

I don't know how to check for that in Linux, but there must be a way.
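
One way that should work, for what it's worth - the resolved sysfs path shows which controller each disk hangs off (device names are examples):

# the PCI device in the path identifies the controller
for d in sda sdb sdc sdd sde sdf; do
    echo "$d -> $(readlink -f /sys/block/$d)"
done

# or, if lsscsi is installed, list the SCSI hosts and the disks attached to each
lsscsi -H
lsscsi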

Greetings from Denmark
Jesper Wolf Jespersen

Andrew Protasov

May 5, 2012, 3:57:29 PM
Jesper,

The controller is part of the AMD chipset and handles 6 sata ports
concurrently.

Here is a 3-way concurrent bonnie++ test, and each of the 6 hdds writes
close to 200MB/sec. So, hardware is not an issue.

Andrew

bonnie++ -f -d /u01 -u root -s 120g:4096 &
bonnie++ -f -d /u02 -u root -s 120g:4096 &
bonnie++ -f -d /u03 -u root -s 120g:4096 &

iostat -k 5

avg-cpu: %user %nice %system %iowait %steal %idle
5.81 0.00 11.24 48.18 0.00 34.77

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 386.60 0.00 196821.60 0 984108
sdb 393.20 0.00 200200.80 0 1001004
sdc 356.40 0.00 181664.00 0 908320
sdd 379.20 0.00 193236.00 0 966180
sde 382.00 0.00 194466.40 0 972332
sdf 383.80 0.00 195388.00 0 976940
sdg 0.00 0.00 0.00 0 0
md2 409.60 0.00 208597.60 0 1042988
md1 409.60 0.00 208800.80 0 1044004
md0 409.60 0.00 208597.60 0 1042988


Andrew Protasov

May 5, 2012, 4:59:38 PM
Jonathan,

It does not look like LGWR is the issue, because it writes much faster
than DBWR.

Here is an example. I increased db block buffers to 24GB - much bigger
than the test table size. In this case the whole table fits into the db
cache and dbwr writes nothing to hdd during a normal insert; only lgwr
writes, at a speed close to 130MB/sec.

iostat -k 5

avg-cpu: %user %nice %system %iowait %steal %idle
10.67 0.00 2.73 0.60 0.00 86.00

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 0.80 0.00 6.80 0 34
sdb 1.80 16.00 6.80 80 34
sdc 460.80 0.00 132140.20 0 660701
sdd 460.80 0.00 132140.20 0 660701
sde 0.00 0.00 0.00 0 0
sdf 0.20 1.60 0.00 8 0
sdg 0.00 0.00 0.00 0 0
md2 0.20 1.60 0.00 8 0
md1 460.80 0.00 132140.20 0 660701
md0 1.40 16.00 6.40 80 32

After the commit and some idle time DBWR starts flushing dirty db blocks,
but never reaches a sustained speed of more than 50MB/sec.

iostat -k 5

avg-cpu: %user %nice %system %iowait %steal %idle
0.83 0.00 0.80 5.76 0.00 92.61

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 374.20 0.00 42636.80 0 213184
sdb 375.40 19.20 42636.80 96 213184
sdc 1.20 0.00 134.30 0 671
sdd 1.20 0.00 134.30 0 671
sde 9.80 0.00 670.20 0 3351
sdf 9.80 0.00 670.20 0 3351
sdg 0.00 0.00 0.00 0 0
md2 5.20 0.00 665.60 0 3328
md1 0.40 0.00 133.50 0 667
md0 375.40 19.20 42636.80 96 213184

top
Tasks: 234 total, 1 running, 233 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.4%us, 1.1%sy, 0.0%ni, 90.9%id, 6.3%wa, 0.1%hi, 0.2%si, 0.0%st
Mem: 32895864k total, 14264500k used, 18631364k free, 7768k buffers
Swap: 75641852k total, 305920k used, 75335932k free, 13378736k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2839 oracle 20 0 25.3g 7.1g 7.1g S 8.6 22.8 0:12.19 oracle

[root@dataworks etc]# ps -ef|grep 2839
oracle 2839 1 2 15:40 ? 00:00:15 ora_dbw0_market
root 2920 2111 0 15:49 pts/3 00:00:00 grep --color=auto 2839

DBWR write speed never reaches LGWR write speed.

I can guess several explanations here:

1. dbwr can't perform fast enough to saturate io bandwidth
2. it does not write blocks sequentially, and random io is much slower than sequential
3. the 8k db block size is too small (bonnie++ uses a 4MB buffer and the lgwr buffer is 32MB)
4. the io calls and parameters that DBWR is using are not as fast as the ones used by LGWR, direct path or bonnie++.

Noons

May 6, 2012, 4:09:15 AM
Andrew Protasov wrote, on my timestamp of 5/05/2012 5:04 PM:
> This does not answer my question.

Your question does not have an answer unless everyone is running the same
conditions. Hence the need for a repeatable set of tests, well defined and
consistent across runs so that different options and configurations can be
evaluated.

> Did you personally ever push DBWR to reach max disk write speed?

No, I find it very hard to "push" a piece of software. A car, or a bike, yes.
A process? I'm not that good. ;-)
(just joking, don't get upset!)

> By the way here is extra little nugget that could be relevant - in my
> configuration DB_WRITER_PROCESSES is ignored by oracle. There is only
> one process even if parameter specifies>1. And this single process is
> not enough when it tries to flush about 24GB of dirty buffers and uses
> about 99% cpu for quite some time.
>

That tells us straight away that you are using lots of CPU to do writes. That
should never be the case. At most you should only see a "blip" on kernel times.
For a given process? Never. Are you absolutely sure, positive,
cross-your-fingers-hope-to-die that aio is being used and has enough aio
servers? Just turning it on in the spfile is not enough to prove it is being
ideally used. Do a strace of the dbwr process and check how it is opening the
files and what I/O options it's using - that's the only way to confirm it is
indeed being activated. Then you need to watch the stats for aio. Been a while
since I last used fedora, so others may advise on the best way of doing so.

Like I said: a repeatable benchmark is the only way to tune the I/O in a system.
You may think your test is consistently repeatable, it may well not be.
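
From memory, something along these lines should do it on any recent Linux (file names and paths are only examples):

# trace instance startup with -ff and look for O_DIRECT on the datafile opens
strace -ff -o /tmp/ora_open -e trace=open sqlplus / as sysdba <<'EOF'
startup
EOF
grep -l O_DIRECT /tmp/ora_open.*

# and check that kernel aio contexts are actually being consumed
cat /proc/sys/fs/aio-nr       # greater than zero while aio is in use
cat /proc/sys/fs/aio-max-nr   # system-wide limit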

Andrew Protasov

May 6, 2012, 1:19:43 PM
Noons,

Ok, let's check it (this is one dbwr, no slaves):

[root@dataworks u03]# ps -ef|grep ora_dbw
oracle 2871 1 2 11:44 ? 00:00:11 ora_dbw0_market
root 2945 2102 0 11:53 pts/4 00:00:00 grep --color=auto
ora_dbw

strace -p 2871
...
io_submit(140438516658176, 1, {{0x7fba6446ce08, 0, 1, 0, 24}}) = 1
io_getevents(140438516658176, 1, 128, {{0x7fba6446ce08,
0x7fba6446ce08, 131072, 0}}, {600, 0}) = 1
times({tms_utime=584, tms_stime=479, tms_cutime=0, tms_cstime=0}) =
429660093
io_submit(140438516658176, 1, {{0x7fba6446ce08, 0, 1, 0, 24}}) = 1
io_getevents(140438516658176, 1, 128, {{0x7fba6446ce08,
0x7fba6446ce08, 131072, 0}}, {600, 0}) = 1
times({tms_utime=584, tms_stime=479, tms_cutime=0, tms_cstime=0}) =
429660094
io_submit(140438516658176, 1, {{0x7fba6446ce08, 0, 1, 0, 24}}) = 1
io_getevents(140438516658176, 1, 128, {{0x7fba6446ce08,
0x7fba6446ce08, 131072, 0}}, {600, 0}) = 1
times({tms_utime=584, tms_stime=479, tms_cutime=0, tms_cstime=0}) =
429660094
io_submit(140438516658176, 1, {{0x7fba6446ce08, 0, 1, 0, 24}}) = 1
io_getevents(140438516658176, 1, 128, {{0x7fba6446ce08,
0x7fba6446ce08, 131072, 0}}, {600, 0}) = 1
times({tms_utime=584, tms_stime=479, tms_cutime=0, tms_cstime=0}) =
429660094
...
This looks like async io to me.

Andrew

Mladen Gogala

May 6, 2012, 11:41:57 PM
On Fri, 04 May 2012 11:02:50 -0700, Andrew Protasov wrote:


> Config: fedora 16 64bit, oracle 11.2, xfs, 32GB ram, AMD FX-8120.
>
> Andrew

Wow! You're worse than me:

head -1 /proc/meminfo
MemTotal: 16435008 kB

[root@medo mgogala]# head -13 /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 5
model name : AMD Phenom(tm) II X4 840 Processor
stepping : 3
microcode : 0x10000b6
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4

[root@medo mgogala]# uname -a
Linux medo.home.com 3.3.4-3.fc16.x86_64 #1 SMP Thu May 3 14:46:44 UTC
2012 x86_64 x86_64 x86_64 GNU/Linux


I have 16GB F16, running on a quad core AMD CPU. I thought that this
little desktop of mine was a monster, but your gear is an order of
magnitude larger. BTW, we have the same taste for the file system:

[oracle@medo ~]$ mount|grep xfs
/dev/sdb1 on /misc type xfs (rw,relatime,attr2,noquota)
/dev/sdb2 on /data type xfs (rw,relatime,attr2,noquota)
/dev/mapper/vg_medo-lv_home on /home type xfs (rw,relatime,attr2,noquota)


However, if you test the same disk using bonnie++ on Ext4 and XFS, you'll
see that Ext4 is usually faster. Your machine isn't a production machine,
is it? It would be irresponsible to use Fedora for production, Oracle
wouldn't support it. If your files are on XFS, you can use xfs_io to set the
"real time" attribute, which will further speed things up.

--
http://mgogala.byethost5.com

Mladen Gogala

May 6, 2012, 11:58:22 PM
On Sat, 05 May 2012 00:04:47 -0700, Andrew Protasov wrote:

> This does not answer my question.
>
> Did you personally ever push DBWR to reach max disk write speed?

You can speed things up, but you must specify a realtime area when making
the xfs filesystem and then put the file into this area using xfs_io
chattr. I used to put the redo log files, the UNDO tablespace and the
SYSTEM tablespace into that area, but it is a lot of work on my desktop,
so I usually don't do it. Also, from what you say about flushing dirty
buffers, did you set direct IO?



--
http://mgogala.byethost5.com

Noons

May 7, 2012, 8:26:45 AM
Andrew Protasov wrote, on my timestamp of 7/05/2012 3:19 AM:


> io_submit(140438516658176, 1, {{0x7fba6446ce08, 0, 1, 0, 24}}) = 1
> io_getevents(140438516658176, 1, 128, {{0x7fba6446ce08,
> 0x7fba6446ce08, 131072, 0}}, {600, 0}) = 1
> times({tms_utime=584, tms_stime=479, tms_cutime=0, tms_cstime=0}) =
> 429660094
> ...
> This looks like async io to me.


Agreed: definitely does. Direct I/O also being used? Likely you can only trace
that from the file open of the db using the -ff option and tracing sqlplus on
start.

Are there any tunables for aio for xfs in Fedora?
For example, in Aix we can specify how many "aio servers" are available and how
many requests can be queued without forcing a process to give up CPU or queue on
the equivalent of io_getevents.
Sorry: don't know enough of Fedora to advise there. But maybe Mladen will know
as he uses xfs often enough?

I reckon it still might be worth contacting Kevin: he used to do a lot of xfs
and Linux a few years ago, I'm quite sure he might have a few other avenues to
explore.

Mladen Gogala

May 7, 2012, 8:42:17 AM
On Mon, 07 May 2012 22:26:45 +1000, Noons wrote:

> Are there any tunables for aio for xfs in Fedora?

Well, to use the title of the Alec Baldwin/Meryl Streep movie, it's
complicated. XFS can be created with a "real time" area which is used for
"real time" I/O. That means it gets higher priority than any other I/O
against the file system and is done first. XFS also has full block logging,
which also slows things down. And there are elementary tricks like the
noatime mount option, which tells the FS not to maintain the "last access
time" that is updated on the inode each time an I/O against the file
happens, thereby creating a bottleneck.



--
http://mgogala.byethost5.com

Andrew Protasov

May 7, 2012, 1:19:25 PM
No, it is not production.

I would like to match DBWR write speed to the write speed of other
processes on the same filesystem setup before I start re-creating the
filesystem. Just an apples-to-apples comparison - they should be able to
do the same.

Andrew

Andrew Protasov

May 7, 2012, 1:20:38 PM
Yes,

filesystemio_options=setall

Andrew Protasov

May 7, 2012, 1:41:09 PM
I did some asynch io + direct io testing outside oracle using some C
samples from the web and found something quite disturbing: write speed for
8k blocks on xfs+raid1 (mdadm) depends on the number of concurrent aio
requests. So, this function

f: concurrent requests -> write speed MB/sec

looks like this

f(2)=192
...
f(46)=43
...
f(3000)=198

The minimum speed is reached at 46 concurrent requests. It could be a
coincidence, but the minimum matches what I see from DBWR. It could mean
DBWR uses a number of concurrent aio requests that is bad for my setup.

The same test does not show any drop like this for ext4 + no raid on a
different, slower HDD (at least for the same number of requests).
Actually the speed is almost flat at 123MB/sec for all request counts.

Now I need to do more testing to find out what exactly is broken for
aio+direct io: xfs, mdadm, or the particular HDD model. I suspect
software raid is the problem.
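
For anyone who wants to repeat this without hand-rolled C, a sweep with fio should show the same curve (assuming fio is installed; the target path is an example):

# 8k direct writes through libaio at increasing queue depths
for depth in 2 8 16 32 46 64 128 512 3000; do
    echo "=== iodepth=$depth ==="
    fio --name=qd$depth --filename=/u01/fio.test --size=2g \
        --rw=write --bs=8k --ioengine=libaio --direct=1 \
        --iodepth=$depth --runtime=30 --time_based | grep 'bw='
done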

Andrew

Mladen Gogala

May 7, 2012, 1:47:56 PM
On Mon, 07 May 2012 10:41:09 -0700, Andrew Protasov wrote:

> Now I need to do more testing to find what exactly is broken for aio
> +direct io: xfs or mdadm or particular HDD model. I suspect software
> raid is the problem.

It may also be XFS. You will have to re-create it, add the RT area, make
your files "realtime", mount without maintaining the access time, and make
sure you mount with a sufficient number and size of log buffers.
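
Roughly, from memory, something like this - check the mkfs.xfs and xfs_io man pages before trusting any of it, and the device names are only examples:

# create the filesystem with a separate realtime device
mkfs.xfs -r rtdev=/dev/sdg1 /dev/md0

# mount with noatime and larger log buffers, pointing at the rt device
mount -o rtdev=/dev/sdg1,noatime,logbufs=8,logbsize=256k /dev/md0 /u01

# mark a directory so that new files in it are created in the realtime area
xfs_io -c 'chattr +t' /u01/oradata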



--
http://mgogala.freehostia.com

Mladen Gogala

May 7, 2012, 4:43:17 PM
On Mon, 07 May 2012 10:20:38 -0700, Andrew Protasov wrote:

> Yes,
>
> filesystemio_options=setall

So, how come you're worried about flushing dirty FS buffers?
Direct I/O doesn't need that. You only need to put your files into the
realtime portion of the FS and the IO will have priority over anything
else.



--
http://mgogala.freehostia.com

andrew....@gmail.com

May 7, 2012, 5:17:28 PM
I did not say dirty FS buffers, I said dirty buffers and meant the dirty DB buffers that dbwr is not flushing fast enough.

It is like investing in a NASCAR team, buying a car that can do 200MPH, and suddenly your hired champion driver refuses to go faster than 40MPH on the racetrack :).

Andrew

Mladen Gogala

May 7, 2012, 11:19:32 PM
On Mon, 07 May 2012 10:19:25 -0700, Andrew Protasov wrote:

> No, it is not production.
>
> I would like to match DBWR write speed to other processes write speed
> for the same filesystem setup before I start re-creating filesystem.
> Just apple-to-apple comparison they should be able to do the same.
>
> Andrew

That will never happen. DBWR has a lot to take care of (pins, mutexes and
counting hits) and it mostly writes single blocks, rarely batches. So,
you will get a gazillion small write requests, not sorted by block.
You can select the deadline I/O scheduler for the drives used by the
database. That can be done like this:

echo "deadline" > /sys/block/sdb/queue/scheduler

That will load the deadline scheduler, which will probably combine the
gazillion small write requests submitted by DBWR into a few larger
I/O requests. Nevertheless, DBWR will still receive a signal for every
completed write request and there will be a few interrupts for every IO.
Asynchronous I/O threads inform the issuing process by sending signals,
and signals are never delivered before the next process activation.
If you don't mind my asking, why would you want a DBWR that reaches
disk speed? As far as I know, DBWR writes asynchronously, so it's not
like you're waiting for it to write. With 32GB, you can have a sizable SGA
which will be capable of holding things until DBWR catches up. What do
you intend to do with that kind of machine?
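
To apply it to all the member disks and verify (the disk names are the ones from your iostat output; adjust as needed):

for d in sda sdb sdc sdd sde sdf; do
    echo deadline > /sys/block/$d/queue/scheduler
done

# the active scheduler is the one shown in brackets
cat /sys/block/sd[a-f]/queue/scheduler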



--
http://mgogala.byethost5.com

andrew....@gmail.com

May 7, 2012, 11:55:06 PM
DBWR will hold you back as soon as you fill the db block buffers, because your insert can't continue until DBWR writes some of them to hdd and makes them available for your process again. That's why we have direct path insert --+ append to skip this bottleneck.

It is a box for DW stuff. Sometimes you need a personal box to test code that no developer has access to. This way you can get accurate timing that is repeatable and not screwed up by some other SQL running concurrently.

Mladen Gogala

May 8, 2012, 9:18:54 AM
On Mon, 07 May 2012 20:55:06 -0700, andrew.protasov wrote:

> DBWR will hold you back as soon as you fill db block buffers because you
> insert can't continue until DBWR writes some of them to hdd and makes
> them available for your process again. That's why he have direct path
> insert --+ append to skip this bottleneck.
>
> It is box for DW stuff. Sometime you need personal box to test code that
> no developer has access to. This way you can get accurate timing that is
> repeatable and not screwed up by some other SQL running concurrently.
>
> Andrew

Andrew, I have been a DBA for a very, very long time and haven't noticed
any such bottleneck. By far most of the problems I have are with
queries and RAC. People actually think that having 4 boxes to run
a database is just like having one box that is 4 times bigger, with the
added benefit of fault tolerance. I really haven't had a case where an
insert wasn't fast enough. Big inserts are usually done during off-peak
hours, especially in a data warehouse. The purpose of a DW environment
is to enable executing large queries and producing summary reports
without jeopardizing the primary DB. Most of the database problems that
I've ever encountered were about queries, not inserts. I find the idea
that DBWR should be able to come close to the speed of the disk rather
strange.



--
http://mgogala.byethost5.com

andrew....@gmail.com

May 8, 2012, 10:06:04 AM
Mladen,

I have quite a different point of view on this. As a developer I do DW ETL. A big part of ETL is insert/update/merge; it is not just queries. I have to do it over and over again when testing, so write performance matters big time. Yes, we can skip dbwr using sqlldr direct=y or insert --+ append (hello anti-hint party :-) ), but in some cases it can't be used, e.g. type 1 DW (merge).

Anyway, I moved the data file to a FS without the aio issue and dbwr speed actually dropped to 12MB/sec (back to 1999 :-) ). I switched off asynch and direct io using filesystemio_options=none and it is still 12MB/sec. Running strace on dbwr shows pwrite doing 128k writes (16*8k blocks) in this offset pattern:

o11,o21,o12,o22,o13,o23,...

o12-o11=128k
o22-o21=128k
o21-o11=8M

Essentially dbwr is interleaving 2 sequential output streams, which turns them into non-sequential / random writes.
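
(For the record, the pattern is easy to see with something like the following - the PID is just an example:)

# watch only the write-path system calls of the db writer
strace -tt -e trace=pwrite64,io_submit -p 2871 2>&1 | head -50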

So, I have 3 questions now:

1. Why are we using a 128k write buffer when we have 10G of dirty db buffers to write? Where did the 16-block write size come from?

2. Why are we not keeping the writes sequential, as the data came from the insert select statement? What is the possible benefit of random writes?

3. How can I fix 1 and 2?

I would prefer a solution inside Oracle, where the problem originates, before we start throwing money at something like an Adaptec RAID 6805 (flashback to dead raid controllers that took data with them) with 4GB of write-back flash cache, which may not solve it.

Mladen Gogala

May 8, 2012, 12:59:47 PM
On Tue, 08 May 2012 07:06:04 -0700, andrew.protasov wrote:


> I would prefer solution inside Oracle where problem originates before we
> start throwing money on something like Adaptec RAID 6805 (flashback to
> died raid controllers that took data with them) with 4gb write back
> flash cache, which may not solve it.

As I have already told you, DBWR is actually writing very small chunks.
Try using deadline IO scheduler which will combine them into something
larger. Also, try creating XFS with realtime area.



--
http://mgogala.byethost5.com

andrew....@gmail.com

May 8, 2012, 2:28:56 PM
Yes, I remember and will try it, but I'm not holding my breath for it to help in the synch io case. As Oracle claims, they do not use FS write caching, so each 16-block write from DBWR is an actual physical disk write. There are no multiple requests for the FS to re-order. Asynch io may improve things though.

Andrew

Noons

May 9, 2012, 12:18:21 AM
On May 8, 3:41 am, Andrew Protasov <andrew.prota...@gmail.com> wrote:


> samples from web and found something quite disturbing: write speed for
> 8k block on xfs+raid1(mdadm) depends on number of concurrent aio
> requests. So, this function
>
> f: concurrent requests->write speed MB/sec
>
> looks like this
>
> f(2)=192
> ...
> f(46)=43
> ...
> f(3000)=198

Wow, good catch! There's gotta be a way of controlling/tuning that!

> Minimum speed is reached for 46 concurrent requests. It could be
> coincidence but min speed matches what I see from DBWR. It could mean
> DBWR uses number of concurrent aio requests that is bad for my setup.

I wonder if there isn't somewhere in Oracle another "hidden parameter"
to tune that?


> Now I need to do more testing to find what exactly is broken for aio
> +direct io: xfs or mdadm or particular HDD model. I suspect software
> raid is the problem.

Interesting. If you get a chance, please post results:
bound to be useful to someone else in future.

Mladen Gogala

May 9, 2012, 8:47:04 AM
On Tue, 08 May 2012 11:28:56 -0700, andrew.protasov wrote:

> Yes, I remember and will try it, but not holding my breath for it to
> help in synch io case. As Oracle claims, they do not use FS write
> caching, so each 16 block write from DBWR is actual physical disk write.
> There are no multiple requests for FS to re-order. Asynch io may improve
> though.
>
> Andrew

The problem is that those requests are rather small. And there is another
level, between the FS and the disk, where IORBs (I/O request blocks) are
actually queued. The software that sorts the I/O requests, to make sure
that adjacent blocks are written together, is the IO scheduler or "elevator".
Oracle recommends the deadline scheduler, which is not the default. The
default is the "completely fair queuing" scheduler, which has the goal of
making sure that every process gets its fair share of the I/O pipe -
exactly the opposite of what you want to achieve here. The recommendation
is stated in MOS document 1352304.1.
The IO scheduler will help you only if there is a queue of I/O requests
waiting for the disk to service them, which is not the case here. Your
problem is that DBWR doesn't launch enough I/O requests to come even
close to saturating the pipe. The problem may also be coming from the
Linux side. I have never liked the Linux implementation of asynch I/O.
You may try with DBWR_IO_SLAVES to see how Oracle's own implementation
works.
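
Something like this, for example - the value 4 is just a starting point, and the parameter is static so the instance has to be bounced:

sqlplus / as sysdba <<'EOF'
alter system set dbwr_io_slaves=4 scope=spfile;
shutdown immediate
startup
EOF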



--
http://mgogala.byethost5.com

andrew....@gmail.com

May 9, 2012, 11:31:55 AM
So far, I have replaced xfs with ext4 and got very similar results for the same asynch io + direct io test. It does not look good for mdadm right now.

The next planned test is xfs or ext4 on the same hdd without mdadm software raid. I may also try a different chunk size with software raid (it is just a pain to wait half a day for md to sync a 3TB volume).

I have ordered an Adaptec 6805 + AFM 600, predicting the result already :-).

Andrew

andrew....@gmail.com

May 9, 2012, 11:34:38 AM
Apparently all results were with the deadline scheduler in place. I did this before the first test:

tuned-adm profile enterprise-storage

and it installed the scheduler.
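
To double-check what the profile left in place (disk names are from my box):

tuned-adm active                          # shows the currently active profile
cat /sys/block/sd[a-f]/queue/scheduler    # deadline should be the one in brackets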

Andrew

> The problem is that those requests are rather small. And there is another
> level, between FS and disk, when IORB (I/O request blocks) are actually
> queued. The software that does sorting of the I/O requests, to make sure
> that adjacent blocks are written together, is IO scheduler or "elevator".
> Oracle recommends deadline scheduler, which is not the default. The
> default is "completely fair scheduler", which has a goal of making sure
> that every process gets its fair share of the I/O pipe, exactly the
> opposite of what you want to achieve here. The recommendation is stated
> in the MOS document 1352304.1.
> IO scheduler will help you only if there is a queue of I/O requests,
> waiting on the disk to service them which is not the case here. Your
> problem is that DBWR doesn't launch enough of I/O requests to come even
> close to saturating the pipe. The problem may also be coming from the
> Linux side. I have never liked Linux implementation of the asynch I/O.
> You may try with DBWR_IO_SLAVES to see how Oracle's own implementation
> works.


Jonathan Lewis

May 9, 2012, 3:36:09 PM
"Andrew Protasov" <andrew....@gmail.com> wrote in message
news:ed5ea637-760e-440f...@cl4g2000vbb.googlegroups.com...
| Jonathan,
|
| It does not look like LGWR is an issue, because it writes much faster
| then DBWR.
|
| Here is an example. I increased db block buffers to 24GB - much bigger
| size then test table size. In this case whole table fits into db cache
| and dbwr writes nothing to hdd during normal insert, only lgwr writes
| with speed close to 130MB/sec.


dbwr and lgwr don't write "as fast as they can", they write "as fast as
they need to".
The fact that lgwr often needs to write as fast as it can is probably why
you can see it writing at peak speeds.

Unless you are seeing more than a few "free buffer wait" waits then dbwr
isn't under pressure to write.
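
(A quick way to check that from another session, for what it's worth:)

sqlplus -s / as sysdba <<'EOF'
select event, total_waits, time_waited
from   v$system_event
where  event in ('free buffer waits', 'write complete waits');
EOF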

If you want to see if you can put dbwr under pressure, you might like to
repeat your experiment to fill the buffer cache, issue a commit, and then
either start a parallel tablescan of the table or truncate the table.
Either command should cause an object checkpoint which I think should cause
all the dirty blocks from the object to go onto the high-priority queue -
which should encourage dbwr to try harder. (It's possible that the PQ will
be higher priority than the truncate).

Regarding your other posts:

a) dbwr normally walks the checkpoint queue to write - and if you are
allocating and formatting blocks as you insert data then you could be
jumping all over the table as you allocate blocks and update space
management bitmaps - the side effects could be quite pronounced if you are
using ASSM and system allocated extent sizes.

b) if it's a normal insert you will also be generating undo, which means
you will have undo blocks interspersed with the table blocks in the
checkpoint queue - which MIGHT affect the sizes of the batches that dbwr is
able to coalesce before writing.

We can't really do anything about the undo, but you could try running with
a dictionary managed tablespace with a very large extent size. Set a very
large pctfree so table blocks are full after inserting a very small amount
of data (and generating a small amount of undo and redo) - and see how this
affects the size of a single batch write.

Note - if dbwr is already running at 100% of a CPU, then this is likely to
be a side effect of polling for completion of the async writer processes -
one of the options for dbwr is to dispatch several writes to multiple o/s
processes (threads) and then spin round them checking for completion. There
may be some configuration detail that allows larger writes to be
dispatched to the operating system - but this may be something that isn't
relevant to your O/S.

--
Regards

Jonathan Lewis
http://jonathanlewis.wordpress.com
Oracle Core (Apress 2011)
http://www.apress.com/9781430239543



andrew....@gmail.com

May 9, 2012, 4:50:29 PM
Jonathan,

> If you want to see if you can put dbwr under pressure, you might like to
> repeat you experiment to fill the buffer cache, issue a commit, and then
> either start a parallel tablescan of the table or truncate the table.
> Either command should cause an object checkpoint which I think should cause
> all the dirty blocks from the object go onto the high-priority queue -
> which should encourage dbwr to try harder. (It's possible that the PQ will
> be higher priority than the truncate).

Yes, I have seen it before with truncate or drop; it forces dbwr to flush, but the write speed was the same as in the other cases.

> a) dbwr normally walks the checkpoint queue to write.- and if you are
> allocating and formatting blocks as you insert data then you could be
> jumping all over the table as you allocate blocks and update space
> management bitmaps - the side effects could be quite pronounced if you are
> using ASSM and system allocated extent sizes.

This does not look optimal to me - I would try to keep writes sequential by sorting on file offset or something, and use bigger block batches. This should be exposed as parameters in the spfile. There is a huge difference between random and sequential writes, even for the newest SSDs.

> b) if it's a normal insert you will also be generating undo, which means
> you will have undo blocks interspersed with the table blocks in the
> checkpoint queue - which MIGHT affect the sizes of the batches that dbwr is
> able to coalesce before writing.

I do not see any activity on the undo volume during the insert or after - only on data and redo logs. Update and delete statements are a different story.

> Note - if dbwr is already running at 100% of a CPU, then this is likely to
> be a side effect of polling for completion of the async writer processes -
> one of the options for dbwr is to dispatch several writes to multiple o/s
> processes (threads) and then spin round them checking for completion. There
> may be some configuration details that allows larger writes to be
> dispatched to operating system - but this may be something that isn't
> relevant to your O/S.

CPU spikes to 100% in dbwr when it has something like 10GB of dirty buffers and a truncate or drop is issued. There were dbwr slaves running too. I'll need to check if this is still an issue without slaves.

Andrew

andrew....@gmail.com

May 11, 2012, 9:51:38 PM
Current results so far:

1. mdadm blows with asynch io + direct io: xfs with no raid does not show the drop in write speed that xfs+raid1 does.

2. async io used in oracle is no good: it looks like it randomizes writes too much for dbwr.

3. dbwr slaves are a bad idea - the interface between dbwr and the slaves consumes too much cpu in dbwr.

4. multiple writers do not start if you have dbwr slaves specified.

5. multiple writers improve nothing compared to one writer for insert select and single hdd.

6. this is a winning combo so far (1 writer, no slaves, direct io):

db_writer_processes=1
filesystemio_options=directio
log_buffer=81920000

Constant write speed

lgwr - 130MB/sec
dbwr - 100MB/sec

Simple is better :-), but still far from the target of 200.

This is insert select then truncate, with one hdd and huge db block buffers to make lgwr write first and then dbwr.

I still have to see if write cache in adaptec will improve this.
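
For completeness, this is the sort of thing I set - all three parameters are static, so scope=spfile plus a bounce:

sqlplus / as sysdba <<'EOF'
alter system set db_writer_processes=1 scope=spfile;
alter system set filesystemio_options='directio' scope=spfile;
alter system set log_buffer=81920000 scope=spfile;
shutdown immediate
startup
EOF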

Andrew

Noons

May 13, 2012, 6:05:20 PM
On May 12, 11:51 am, andrew.prota...@gmail.com wrote:


> 1. mdadm blows with asynch io+direct io: xfs+no raid does not have drop in write speed as xfs+raid1.

Slowdown in raid 1 is expected. How much slowdown can be sustained
then becomes the limit.

> 2. async io used in oracle is no good: it looks like it randomizes writes too much for dbwr.

Weird. It should not affect write order? I can understand the aio
slaves returning control in a different order, but that should not affect
how dbwr sequences the issuing of writes.

> 3. dbwr slaves are bad idea - interface between dbwr and slaves consumes too much cpu in dbwr.

General consensus seems to be those are useless. I guess few have
ever made good use of them and why they are still available is a
mystery. I suppose if one does not have any multi-path ability on the
hardware/OS, then there might be a case to use them. Then again who
doesn't, in this day and age?

> 4. multiple writers do not start if you have dbwr slaves specified.

Expected.

> 5. multiple writers improve nothing compared to one writer for insert select and single hdd.

Expected. They are just multiplexing access to a bottleneck.

> 6. this is a winning combo so far (1 writer, no slaves, direct io):
>
> db_writer_processes=1
> filesystemio_options=directio
> log_buffer=81920000
>
> Constant write speed
>
> lgwr - 130MB/sec
> dbwr - 100MB/sec

Interesting the large log buffer. Any thoughts/results on/with other
sizes?


> Simple is better :-), but still far from target 200.

Aye! (I know a few who disagree with the "simple" thing - but then
again, who cares?)
;-)


> This is insert select then truncate with one hdd and huge db dblock buffers to make lgwr write first and then dbwr.
>
> I still have to see if write cache in adaptec will improve this.

You'd definitely like to use SLOB. It is ideal for this sort of
test. I've used it to test/verify and fine tune memory, log writing
speed and db writing speed. Have a read through what it can do and
how simple it is to modify its characteristics.

andrew....@gmail.com

May 14, 2012, 12:04:38 AM
Adaptec is in and here are some results with raid1:

1. Write speeds:

lgwr - 130MB/sec
dbwr - 150MB/sec
direct path - 160MB/sec

Still no cigar :-) (200).

2. Lgwr looks like it is limited by the background process running the insert select and using 100% cpu, but if I run 2 sqlplus processes doing the same thing concurrently, lgwr speed jumps to 150MB/sec briefly and drops back to 130MB/sec. None of the 3 processes is at 100% CPU, so there is some other bottleneck. Maybe just shared memory access and waiting on semaphores.

3. Bonnie++ shows write/read of 180/330 MB/sec vs 180/180 MB/sec with mdadm. This means adaptec knows the little trick of reading from both hdds in raid 1 concurrently, and mdadm does not.

4. If you (adaptec) are designing a pc adapter and know that it overheats, then you DO NOT just write everywhere in the doc that your operating temperature tops out at 55C. You take a big heatsink or a small fan or both and mount them on your PCI card, as nvidia or ati have done on all their graphics cards for the last gazillion years. The customer must not be forced to add an extra fan to cool your card.

5. The same goes for the big capacitor (which replaces the battery) hanging off the card on a twisted wire - it is not my problem to figure out where to mount it in the box - this has to be integrated on the PCI card.

Andrew

> > 6. this is a winning combo so far (1 writer, no slaves, direct io):
> >
> > db_writer_processes=1
> > filesystemio_options=directio
> > log_buffer=81920000
> >
> > Constant write speed
> >
> > lgwr - 130MB/sec
> > dbwr - 100MB/sec
>
> Interesting the large log buffer. Any thoughts/results on/with other
> sizes?

It does not matter for lgwr whether it is 8M or 80M - it is about the same.

Mladen Gogala

May 15, 2012, 8:35:42 AM
On Sun, 13 May 2012 15:05:20 -0700, Noons wrote:

> General consensus seems to be those are useless. I guess few have ever
> made good use of them and why they are still available is a mistery. I
> suppose if one does not have any muti-path ability on the hardware/OS,
> then there might be a case to use them. Then again who doesn't, in this
> day and age?

They're still available because the AIO implementation on Linux leaves much
to be desired. One of the problems with Linux AIO is the artificial
intelligence approach that Linux has taken, which doesn't leave any
possibility of configuration. On other Unix variants you can see and
configure the kernel threads that do asynchronous I/O; not so on Linux.
The DBWR slaves are simply emulating the kernel threads. Kernel threads
on Linux are created dynamically. Basically, Linux's record of handling
high I/O rates is not very good. If you want a Unix box that will handle
high I/O rates well, buy AIX.

The main problem with Linux AIO is excessive polling, which wastes CPU.
The Linux creators are aware of the fact, so they have created two
Linux-specific APIs to alleviate the problem: epoll and io_submit. Oracle is
using io_submit, but it doesn't seem to address the problem completely.

Asynchronous IO is not magic; in most implementations it is done by
lightweight kernel threads on the user's behalf. DBWR slaves are simply an
Oracle implementation of asynchronous I/O: instead of lightweight kernel
threads, there are Oracle processes. If the Linux version doesn't work well,
try Oracle's. The problem with Linux is, among other things, the PC
architecture, which doesn't have a standard implementation of smart I/O
channels. Mentioning the old ESCON channels would probably be rightfully
considered rude in the Linux world. There is the I2O story, which is now
defunct, but there is still no channel architecture on PC hardware, which
makes PC I/O a lot slower than the I/O of a true minicomputer, like the
ones offered by IBM or HP.



--
http://mgogala.byethost5.com

Noons

May 16, 2012, 6:53:48 AM
Mladen Gogala wrote, on my timestamp of 15/05/2012 10:35 PM:

> high I/O rates is not very good. If you want a Unix box that will handle
> high I/O rates well, buy AIX.

I fully concur: our P6 Aix box is astounding in its I/O capacity.
Particularly with the combo Aix 7.1 and Oracle 11.2.0.3, I'm seeing some
stunning improvements in I/O throughput.


> The main problem with Linux AIO is excessive polling, which wastes CPU.
> Linux creators are aware of the fact, so they have created two Linux
> specific API's to alleviate the problem: epoll and io_submit. Oracle is
> using io_submit but it doesn't seem to address the problem completely.

Didn't Oracle at some stage do something fancy with their own flavour of Linux?



> Oracle implementation of asynchronous I/O: instead of lightweight kernel
> threads, there are Oracle processes. If Linux version doesn't work well,
> try with Oracle's.

Makes sense.

> Problem with Linux is, among other things, in the PC
> architecture which doesn't have a standard implementation of smart I/O
> channels. Mentioning the old ESCON channels would probably be rightfully
> considered rude in the Linux world. There is an I2O story, which is now
> defunct, but there is still no channel architecture on PC hardware, which
> makes PC I/O a lot slower than the I/O of the true minicomputer, like the
> ones offered by IBM or HP.

Yeah. And the backplane on the iPower series also has some accountability in
all that. I am continually surprised at how much processing throughput our DW is
pulling.

Mladen Gogala

May 18, 2012, 8:17:49 AM
On Wed, 16 May 2012 20:53:48 +1000, Noons wrote:

> Didn't Oracle at some stage do something fancy with their own flavour of
> Linux?

Yes. They started touting RAC. RAC cannot solve the IO bottleneck but can
make much more money for Oracle.

> Yeah. And the backplane on the iPower series also has some
> accountability in all that. I am continually surprised at how much
> processing throughput our DW is pulling.

Of course it does. True minicomputers and mainframes, categories that
have been merged into one by the market pressure of cheap Linux systems,
do IO in a different way than PC-based machines. The difference is IO
channels, which are much cheaper from the processing perspective, but
require proprietary hardware (a "channel controller") which takes care of
the IO. On a true minicomputer, one puts the IORB (IO request block)
into a special location monitored by the channel processor. The
channel processor picks the request up, executes it and sends a single
interrupt to the CPU saying "I'm done, take a look". The IORB is marked
as complete. That's it. PC hardware is very chatty: the CPU sends a message
to the IO controller: are you ready to receive? The controller responds with:
no, I am busy, or yes, send the request. Then the CPU sends the request, after
which the controller completes the request, puts it in memory and sends a
message that it is done. Each message is called "an interrupt" and has
certain execution characteristics, like executing in kernel context and
preventing other interrupts from getting delivered. Much more time is
wasted than is the case with a channel architecture. Unfortunately, PCI-X
and SATA are interrupt based and all the vendors who produce equipment
want it to remain so, for compatibility reasons. That is why the I2O
attempt at addressing this performance shortcoming failed. However,
if you stack up enough Dell boxes together, you will get the same
capacity as with a single iPower box, especially if you use Exadata.
I am not sure that the price will remain low, however. Hardware is now
dirt cheap and for the price of a single Oracle license I can buy 10 very
good PCs. Having one powerful iPower machine with 8Gb or even 16Gb FC-AL
interfaces and a decent general-purpose SAN for DW is probably much more
cost effective than having a 4-way RAC with Exadata. Performance should be
comparable, if not better.




--
http://mgogala.byethost5.com

Noons

May 19, 2012, 7:50:49 AM
Mladen Gogala wrote, on my timestamp of 18/05/2012 10:17 PM:

> attempt of addressing this performance shortcoming has failed. However,
> if you stack up enough of Dell boxes together, you will get the same
> capacity as with a single iPower box, especially if you use Exadata.

That was the whole motivation to shove RAC down everyone's throats...

> I am not sure that the price will remain low, however. Hardware is now
> dirt cheap and for a price of a single Oracle license, I can buy 10 very
> good PC's. Having one powerful iPower machine with 8GB or even 16GB FC/AL
> interfaces and a decent general purpose SAN for DW is probably much more
> cost effective than having 4 way RAC, with Exadata. Performance should be
> comparable, if not better.

I can certainly confirm our 32-core iPower6 is an absolute screamer. Fastest
box I've ever seen at that price point. I do 8TB/day in a 3TB DW db - that's
an average sustained 100MB/s IO rate 24x7, if it were averaged (it isn't). That
P6 box is also running a JDE system on AS400/DB2 emulation, and just about
every other PeopleSoft db and app server in our place, in multiple vio
partitions. It hardly ever breaks a sweat!
In peak periods I've seen the SAN service the DW lpar at 730MB/s aggregate
sustained, with hardly any I/O waits. On 2x4Gbps FC cards attached to the DW
partition, that's just about as fast as one can hope to ever do I/O on such
hardware. I'm looking forward to the 8Gbps FCs we have on order!

andrew....@gmail.com

May 19, 2012, 12:16:23 PM, to wizo...@yahoo.com.au
Guys, you are missing the elephant in the room here - cost. How much do that IBM box + the storage for it cost?

This is a pc box that costs less than 3K doing 1GByte/sec sustained, which is faster than those 2 FC cards.

iostat -k 5

...

avg-cpu: %user %nice %system %iowait %steal %idle
0.13 0.00 9.13 15.07 0.00 75.68

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 4.20 12.80 10.40 64 52
sdb 1305.00 334080.00 0.00 1670400 0
sdc 1331.20 340787.20 0.00 1703936 0
sdd 1273.80 326092.80 0.00 1630464 0

avg-cpu: %user %nice %system %iowait %steal %idle
0.63 0.00 10.42 20.05 0.00 68.90

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 504.20 6884.80 0.00 34424 0
sdb 1378.60 352921.60 0.00 1764608 0
sdc 1325.20 339251.20 0.00 1696256 0
sdd 1338.20 342579.20 0.00 1712896 0

avg-cpu: %user %nice %system %iowait %steal %idle
0.28 0.00 9.30 15.31 0.00 75.11

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 1.60 6.40 0.00 32 0
sdb 1343.00 343859.20 0.00 1719296 0
sdc 1285.20 329011.20 0.00 1645056 0
sdd 1304.00 333824.00 0.00 1669120 0

And this could be improved by getting:

1. a second controller
2. more hdds
3. switching from hdd to ssd

You may not like the bus architecture and IRQs, but that stuff works and cpu cycles are cheap - there is not much use for those extra cores anyway, they are pretty idle.

11 years ago I was looking at a red hat 7 box and an HP-UX box side by side and was asking myself - why do we need that old HP 9000 crap? It was worse from a hardware, software and cost perspective.

3 years ago I did some CPU benchmarking of our SUN box using java code and int calculations - the speed was about the same as our average core 2 desktop.

My buddy is constantly complaining about SUN CPUs failing on a regular basis. The same was my experience with DEC Alphas. When was the last time you saw an Intel CPU fail?

Sorry, but PCs are the way to go, and this is where hardware has improved dramatically over the last 30 years.

Andrew

Noons

May 20, 2012, 5:58:08 AM
andrew....@gmail.com wrote, on my timestamp of 20/05/2012 2:16 AM:

> Guys, you are missing elephant in the room here - cost. How much does that IBM box + storage for it?

No dedicated storage cost. We use a single SAN for everything, so its cost is
shared. Cost of the P6? Around 400K. For a box that runs, at last count, 12
partitions inside it, each a db server or app server. With plenty of room to
spare. It sits in a standard rack using up 6 pizza-box slots, so infrastructure
costs are quite low.

> This is pc box that costs less than 3K and doing 1GBytes/sec sustained, which is faster than those 2FC cards.

Yeah, and it's running one system/db/app. I'd like to see it running 12, with
spare capacity...

> My buddy is constantly complaining about SUN CPUs failing on a regular basis. The same was my experience with DEC Alphas. When was a last time you have seen Intel CPU failed?

Last week in our Wintel cloud box. We also run a very large HP Intel
VMWare-based "private cloud". All the Wintel stuff runs on it. That's all the
MSSQL servers plus all app and web servers. Quite a few. That box cost us a LOT
more than the P6...

> Sorry, but PCs are way to go and this is where hardware have improved for the last 30 years dramatically.

No way. Any similarity between an original RS6K Power box and the current P7s
for example is pure coincidence. They have also evolved, as dramatically as PCs
or even more.
The big decider has always been how much one can stash inside one of these
boxes. 10 years ago they were dedicated to a single server. Nowadays,
everything is either lpars or vio partitions. With room to spare.
Sorry, don't know what SUN is doing on that front: it's been more than a decade
since I last touched one of them. They never overly impressed me...

andrew....@gmail.com

May 20, 2012, 12:29:34 PM, to wizo...@yahoo.com.au
OK, what is the hardware config for a Wintel box that costs > 400K?

I hope you are not talking software license cost here :-).

And you should not ignore the cost of your storage network - it usually costs way more than your box. I prefer local storage for this reason alone.

Andrew

Mladen Gogala

May 20, 2012, 1:52:06 PM
On Sat, 19 May 2012 09:16:23 -0700, andrew.protasov wrote:

> When was a last time you have seen Intel CPU failed?

Two weeks ago, on one of my production databases. There is no RAC and
yes, the entire database went down.



--
http://mgogala.byethost5.com

Mladen Gogala

May 20, 2012, 1:55:11 PM
On Sun, 20 May 2012 09:29:34 -0700, andrew.protasov wrote:

> I hope you are not talking software licenses cost here :-).
>
> And you should not ignore cost of you storage network - it usually costs
> way more than your box. I prefer local storage for this reason alone.
>
> Andrew

The problem with local storage is that it is hard to share and re-allocate to
another box in case of need.



--
http://mgogala.byethost5.com

John Hurley

May 20, 2012, 2:03:28 PM
Mladen:

# Two weeks ago, on one of my production databases. There is no RAC
and yes, the entire database went down.

Ouch ... we are not a big shop but for our most important database we
have a spare server powered off right next to prod server.

If unknown hardware failure ... pull out internal drives ( raid 1
mirror for operating system ) ... attach fiber channel connections to
spare server ... power it up ... re-configure network interfaces
( linux will not be happy exactly with different mac addresses
etc ) ... and back up and running.

Local storage not a bad idea ( depending ) for operating system and
installed software ( as long as you can move it quickly to another
machine ) but everything else better be "somewhere else" ...

andrew....@gmail.com

May 20, 2012, 2:00:45 PM
On Sunday, May 20, 2012 12:52:06 PM UTC-5, Mladen Gogala wrote:
> On Sat, 19 May 2012 09:16:23 -0700, andrew.protasov wrote:
>
> > When was a last time you have seen Intel CPU failed?
>
> Two weeks ago, on one of my production databases. There is no RAC and
> yes, the entire database went down.

Interesting. Are you sure that you get enough cooling there? What is CPU core temperature?

Andrew

Mladen Gogala

May 20, 2012, 2:21:42 PM
On Wed, 09 May 2012 08:34:38 -0700, andrew.protasov wrote:

> Apparently all results were with deadline scheduler in place. I did this
> before the first test:
>
> tuned-adm profile enterprise-storage
>
> and it installed the scheduler.
>
> Andrew

Tuned is showing a lot of promise, but I don't think it's there yet. I
use preload on my Ubuntu laptop and the effects are less than staggering.
With a single-purpose machine like a database server, I prefer doing the
tuning myself. On such machines I turn off the OOM killer by
setting vm.overcommit_memory to 1, and I usually increase the amount of
available free memory by setting min_free_kbytes, and set swappiness to 60. I
don't really trust all those AI tuning daemons when it comes to a single-purpose
machine. If you have a racing car, then you don't want a tuning
daemon that will tune it so that the oil lasts for 1000 miles and the
tires last for 2 years. You want the car to go as fast as possible and
win the race, to heck with the tires and oil. I don't have much faith in
NI (Natural Intelligence) either, much less in the form of
intelligence created by members of the same species as Meg Whitman or
Lindsay Lohan.


--
http://mgogala.byethost5.com

andrew....@gmail.com

May 20, 2012, 2:14:16 PM
That's exactly the second reason why I prefer it - I do not want it to be shared. I plan my capacity, future growth and design, and buy and allocate accordingly. If someone has a NEED, then it means they did not do their planning and design properly. It is "go away and buy your own stuff" time.

The collective farm approach is not the best one :-).

Andrew


andrew....@gmail.com

May 20, 2012, 2:39:39 PM
Is there an oracle use case with timing before and after the change where it helped?

Andrew

andrew....@gmail.com

May 20, 2012, 2:46:28 PM
What about using local storage, linux and drbd? Does anybody use it with oracle?

We used it before for FS and mysql replication, but not with oracle.

Andrew

Mladen Gogala

May 20, 2012, 4:10:27 PM
On Sun, 20 May 2012 11:39:39 -0700, andrew.protasov wrote:

> Is there oracle use case with timing before and after the change where
> it helped?

No, there is my ample experience and common sense. Is there a documented
Oracle use case where using "tuned" has helped?



--
http://mgogala.byethost5.com

Mladen Gogala

May 20, 2012, 4:11:22 PM
On Sun, 20 May 2012 11:46:28 -0700, andrew.protasov wrote:

> What about using local storage, linux and drbd? Does anybody use it with
> oracle?

That would amount to a home grown SAN?



--
http://mgogala.byethost5.com

andrew....@gmail.com

May 20, 2012, 4:40:47 PM
I found it in the redhat docs for oracle (sorry, ugly link). There is no timing there either. But it matches your recommendation for the deadline scheduler, so how bad could it really be :-).

http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CG4QFjAC&url=http%3A%2F%2Fwww.redhat.com%2Frhecm%2Frest-rhecm%2Fjcr%2Frepository%2Fcollaboration%2Fjcr%3Asystem%2Fjcr%3AversionStorage%2Fee6fe0000a0526020f35498ae39e9939%2F11%2Fjcr%3AfrozenNode%2Frh%3AresourceFile&ei=2VS5T8jCHciugQfBx_jNCg&usg=AFQjCNHnR8YHYz3LMEmQW7IG3YmhYT0VoQ

4.1.3 Automatic System Tuning for Database Storage
The tuned package in Red Hat Enterprise Linux 6.2 is recommended for automatically tuning
the system for common workloads: enterprise storage, high network throughput, and power
savings. For Oracle Database, enable the enterprise-storage profile to set the I/O scheduler,
adjust the read-ahead buffers for non-system disks, etc.
# yum install tuned
# chkconfig tuned on
# tuned-adm profile enterprise-storage

andrew....@gmail.com

May 20, 2012, 4:35:50 PM
Not really, just db block level replication as an alternative to using a standby db / Data Guard.

Andrew

Robert Klemme

May 21, 2012, 4:43:47 PM
What about the cooling of the Sun boxes of your buddy whose CPU's are
constantly failing? I can't believe there is a general weakness in
Sparc processors in this area.

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Mladen Gogala

May 21, 2012, 6:06:08 PM
On Mon, 21 May 2012 22:43:47 +0200, Robert Klemme wrote:
.

> What about the cooling of the Sun boxes of your buddy whose CPU's are
> constantly failing? I can't believe there is a general weakness in
> Sparc processors in this area.

Yeah, SPARC CPU hardware has a really bad reputation when it comes to
hardware reliability. Also, they have problems with speed: P7 can run
circles around any SPARC. There was a reason SUN hardware sales fell
sharply, approximately two years before they were bought.



--
http://mgogala.byethost5.com