
iBook and playing DVDs


Rogério Brito

May 10, 2002, 10:30:09 PM

Well, I'm not having luck getting my iBook 600MHz playing DVDs
in a decent way.

In fact, I've spent a great amount of time trying to
understand what the problems are, playing with many different
programs and getting unsatisfactory results.

I have woody installed in this iBook and, with xine (0.9.8), I
get a reasonable playback, with around 15% of dropped frames
and with ugly interlaced frames. Enabling the onefield_xv
deinterlace method works and makes the DVD more or less
watchable, though during some stages, I can see artifacts on
the image due to excessive frames being dropped.

I can't get vlc 0.3.0 working well here, because there are
issues with the sound (probably due to endianness problems,
although I'm not really sure). I already tried telling vlc all
different possibilities for endianness or signedness of the
output, but I don't know if the menu is functional, as I saw
little changes when I played with that option. Enabling the
diagnostics seems to suggest that not all frames are played.

I also tried getting mplayer to work with the iBook. It works
moderately well, but it had endianness problems with the audio
(and in other parts of the program too -- I even sent them a
patch to correct the generation of WAV files).

Mplayer can now play vob files with decent sound, and the
video is smooth (if I compile with gcc 3.0), but it seems that
the iBook can't keep up with the decoding of video frames and
the audio and the video become desynchronized. :-(

If I play back with an option to drop frames, then it is
watchable, but the interlaced frames become noticeable. I can
use the option "-npp lb" to deinterlace and it works, but the
result is not usable, as too many frames get dropped (even
though I'd have guessed that even more frames would be dropped).

The output of mplayer shows the amount of time it spends
decoding the frames, showing the frames and dealing with
audio. I notice that the percentage of time it spends with
video output is quite high in comparison to my Celeron 466MHz
(I'll make further tests).

I tried installing Michel's XFree 4.2.0 binaries and they seem
to make xine work better, dropping around 6 frames less than
with woody's X 4.1.0.1.

On the other hand, DRI (which, I heard, is supposed to enable
DMA transfers to video) wasn't enabled with XFree 4.2.0
because it complained that my r128 module is version 2.1.6,
but that version 2.2 would be needed.

The catch is that current benh's kernel (just rsync'ed) still
has r128.c with version 2.1.6. Is there any way for me to get
DMA working with this machine?

I have already tried many things that I could think of, but I
am starting to think that the possibilities are almost
exhausted. :-(

So, is there anybody here that is able to see DVDs with their
iBook?


Thanks for ANY help or comment, Roger...


P.S.: BTW, an mplayer compiled with X 4.1 headers spends 40% of its time
in video output. The same binary, with X 4.2, spends around 140% and is
visibly slower than before. Does this increase in time come from the
headers with which the program was compiled?
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rogério Brito - rbr...@iname.com - http://www.ime.usp.br/~rbrito/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


--
To UNSUBSCRIBE, email to debian-powe...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org

Michel Lanners

May 11, 2002, 2:40:05 AM
On 10 May, this message from Rogério Brito echoed through cyberspace:

>
> Well, I'm not having luck getting my iBook 600MHz playing DVDs
> in a decent way.

I think you're quite lucky with what you get ;-)

> I have woody installed in this iBook and, with xine (0.9.8), I
> get a reasonable playback, with around 15% of dropped frames
> and with ugly interlaced frames. Enabling the onefield_xv
> deinterlace method works and makes the DVD more or less
> watchable, though during some stages, I can see artifacts on
> the image due to excessive frames being dropped.

This is probably as good as it gets right now. I don't think you can do
better without using the r128's hardware accelerated iDCT and MC.

> I can't get vlc 0.3.0 working well here, because there are
> issues with the sound (probably due to endianness problems,
> although I'm not really sure). I already tried telling vlc all
> different possibilities for endianness or signedness of the
> output, but I don't know if the menu is functional, as I saw
> little changes when I played with that option. Enabling the
> diagnostics seems to suggest that not all frames are played.

vlc has had sound problems for quite a while. Here's what Jack Howarth
said re. vlc's audio settings:

[quote]
> On my debian sid machine running benh's current kernel and
> alsa 0.9beta12 I was able to get audio working under alsa, esd and
> oss. You have to select 3, the 16 bit signed big endian output from
> the audio preference section for that to work. I notified Samuel
> Hocevar and he is going to try to do something about the default, 0,
> native endian code so that it picks up 16 bit big endian on ppc.
> The audio problems with bursts of static aren't ppc specific.
> There has been and still is an audio starvation problem in vlc
> but according to Sam no one has the guts and/or time to rip into
> that code yet.
[ end quote]

I haven't tried this myself.

> The output of mplayer shows the amount of time it spends
> decoding the frames, showing the frames and dealing with
> audio. I notice that the percentage of time it spends with
> video output is quite high in comparison to my Celeron 466MHz
> (I'll make further tests).

That's essentially because of MTRR on i386. I wouldn't know hard numbers
to compare, but at least subjectively, MTRR helps a lot for the copy to
VRAM of the video out data.

> I tried installing Michel's XFree 4.2.0 binaries and they seem
> to make xine work better, dropping around 6 frames less than
> with woody's X 4.1.0.1.
>
> On the other hand, DRI (which, I heard, is supposed to enable
> DMA transfers to video) wasn't enabled with XFree 4.2.0

[snip]

Note that for DMA you need DRI, which needs large amounts of VRAM. So
you may only be able to use that in 16-bit mode. I'm usually running in
32-bit mode :(

> I have already tried many things that I could think of, but I
> am starting to think that the possibilities are almost
> exhausted. :-(

You've got it :-) No, there are a few things that still can be done:

- get ATI to open the specs for iDCT and MC in hardware. Won't happen
anytime soon.

- optimize a few of the more processor-intensive parts of the algorithms
with handcoded ASM. Good luck...

- use DMA: that should be doable with the right pieces of code.

Cheers

Michel

-------------------------------------------------------------------------
Michel Lanners | " Read Philosophy. Study Art.
23, Rue Paul Henkes | Ask Questions. Make Mistakes.
L-1710 Luxembourg |
email ml...@cpu.lu |
http://www.cpu.lu/~mlan | Learn Always. "

Jack Howarth

May 11, 2002, 12:20:08 PM
Michel,
Actually, they just started mapping out a major rewrite of
the audio code...

http://www.via.ecp.fr/ml/videolan/vlc-devel/200205/msg00022.html

...and all the followup messages in that thread.
Jack

Rogério Brito

May 12, 2002, 8:40:05 AM
On May 11 2002, Michel Lanners wrote:
> On 10 May, this message from Rogério Brito echoed through cyberspace:
> >
> > Well, I'm not having luck getting my iBook 600MHz playing DVDs
> > in a decent way.
>
> I think you're quite lucky with what you get ;-)

Ouch!

I guess that Apple is making false claims about the iBook's
ability to play DVDs, since my complete collection of DVDs
(all 4 of them :-) ) show interlacing artifacts in the image,
even when playing under MacOS X. :-(

All that I can say is that I feel cheated. :-(

On the other hand, according to MacOS X's cpu monitor, playing
DVDs only uses about 50% of CPU time, probably due to use of
hardware acceleration.

> This is probably as good as it gets right now. I don't think you can
> do better without using the r128's hardware accelerated iDCT and MC.

I feared that. :-(

> vlc has had sound problems for quite a while. Here's what Jack Howarth
> said re. vlc's audio settings:

Oh, thank you very much for that quote. I guess that it means
that I am not as much of a fool as I thought for not making
vlc work correctly. :-)

[High CPU usage during video output]


> That's essentially because of MTRR on i386. I wouldn't know hard
> numbers to compare, but at least subjectively, MTRR helps a lot for
> the copy to VRAM of the video out data.

Ouch, I miss MTRR. :-(

[XFree86 4.2.0 and DMA]


> Note that for DMA you need DRI, which needs large amounts of
> VRAM. So you may only be able to use that in 16-bit mode. I'm
> usually running in 32-bit mode :(

So, from your comment, I infer that you have an iBook also?

Anyway, I don't mind if I have to use 16-bit mode. If the
movie is "good enough", then I'm cool with that.

I'd love to get DMA running on this iBook and would appreciate
if anybody could tell me where I can get a r128 with version
2.2, as current benh's kernel only has version 2.1.6
available. :-(

Perhaps the r128 module isn't being maintained?

> You've got it :-) No, there are a few things that still can be done:
>
> - get ATI to open the specs for iDCT and MC in hardware. Won't happen
> anytime soon.

What is the problem with publishing the interfaces for that? I
thought that ATI were an open-source friendly company. :-(

> - optimize a few of the more processor-intensive parts of the algorithms
> with handcoded ASM. Good luck...

Well, I'm so pissed that I am currently even considering
learning PPC's assembly for this task. I even downloaded
Motorola's user guide for the G3. :-( The only problem now is
lack of time.

BTW, I'm using the following options to compile the programs I
am trying:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CC="gcc-3.0"
CFLAGS="-O3 -fomit-frame-pointer -ffast-math -frename-registers \
-mtune=750 -mcpu=750 -mfused-madd"
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Any suggestions on what else I should use? Perhaps gcc-3.1 is
a bit better regarding optimizations?

> - use DMA: that should be doable with the right pieces of code.

Well, that is my highest hope right now. I'd love to get a
kernel with a r128.o that allows DMA to be used.

Thank you for your comments, Roger...

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rogério Brito - rbr...@iname.com - http://www.ime.usp.br/~rbrito/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Michel Dänzer

May 12, 2002, 10:10:09 AM
On Sun, 2002-05-12 at 14:34, Rogério Brito wrote:
> On May 11 2002, Michel Lanners wrote:
> > On 10 May, this message from Rogério Brito echoed through cyberspace:
>
> I guess that Apple is making false claims about the iBook's
> ability to play DVDs, since my complete collection of DVDs
> (all 4 of them :-) ) show interlacing artifacts in the image,
> even when playing under MacOS X. :-(

Apple's DVD player is known for its lack of any deinterlacing. Shouldn't
be hard to add...

> All that I can say is that I feel cheated. :-(

Well, the actual movies shouldn't be interlaced.


> I'd love to get DMA running on this iBook and would appreciate
> if anybody could tell me where I can get a r128 with version
> 2.2, as current benh's kernel only has version 2.1.6
> available. :-(

Rumour has it that the current one in benh's tree is 2.2 except the
version is wrong (somebody posted a patch to 'fix' the version number to
linuxppc-dev), but if I were you I'd wait for confirmation from benh
before relying on that. :)

> Perhaps the r128 module isn't being maintained?

It is, kinda, by XFree86 and DRI. Both their CVS repositories have the
2.2 version, just nobody has updated the Linux kernel yet.


> > You've got it :-) No, there are a few things that still can be done:
> >
> > - get ATI to open the specs for iDCT and MC in hardware. Won't happen
> > anytime soon.
>
> What is the problem with publishing the interfaces for that? I
> thought that ATI were an open-source friendly company. :-(

They are IMHO. Every company seems to be afraid of releasing that kind
of information, I have an idea why but people tell me there's no
reason...


> > - use DMA: that should be doable with the right pieces of code.
>
> Well, that is my highest hope right now. I'd love to get a
> kernel with a r128.o that allows DMA to be used.

I'm afraid you'll be disappointed. Seems it was only good for a real
'gain' as a hack on top of 4.1. Now that it's done (more :) correctly, a
lot of cycles are still wasted waiting for the DMA transfer to complete.
Using an interrupt might help there.


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

Benjamin Herrenschmidt

May 13, 2002, 6:20:15 AM
> I guess that Apple is making false claims about the iBook's
> ability to play DVDs, since my complete collection of DVDs
> (all 4 of them :-) ) show interlacing artifacts in the image,
> even when playing under MacOS X. :-(
>
> All that I can say is that I feel cheated. :-(

Well. A lot of DVDs aren't interlaced. Some are, and then you'll
have interlacing artifacts when playing on LCD, but not when
playing on TV with the proper resolution.

On the other hand, if you're playing on TV, the ATI chip has a pretty
good deinterlacer built in.

The problem with OS X is that its current version doesn't have
all of the features OS 9 had regarding setting up the TV output.
MacOS 9 can, via the control strip, be set up to full resolution
(no black borders, meaning that a 720x576 PAL will be a _real_
interlaced PAL output), or to a "graphics" mode (deinterlaced).

> On the other hand, according to MacOS X's cpu monitor, playing
> DVDs only uses about 50% of CPU time, probably due to use of
> hardware acceleration.

Yes, it's smoother and uses less CPU. Actually, the OS X implementation
is of much higher quality than the OS 9 one in this regard, though the
system itself lacks a few of the options described above.

>> This is probably as good as it gets right now. I don't think you can
>> do better without using the r128's hardware accelerated iDCT and MC.
>
> I feared that. :-(

Well, are we _that_ far? It might be possible to properly optimise
the IDCT and MC for "normal" PPCs and gain enough performance... and it
might not. But from experience of pre-AltiVec days, properly hand-optimized
graphics code can get a huge performance boost on G3s.

>> vlc has had sound problems for quite a while. Here's what Jack Howarth
>> said re. vlc's audio settings:
>
> Oh, thank you very much for that quote. I guess that it means
> that I am not as much of a fool as I thought for not making
> vlc work correctly. :-)
>
>[High CPU usage during video output]
>> That's essentially because of MTRR on i386. I wouldn't know hard
>> numbers to compare, but at least subjectively, MTRR helps a lot for
>> the copy to VRAM of the video out data.
>
> Ouch, I miss MTRR. :-(

The current DMA implementation suffers from a lack of overlapped DMA
usage (meaning time is lost waiting for transfers to complete) and from
AGP 2x and 4x modes not working properly (which would speed things up).
I've obtained similar performance by just doing 64-bit blits using
floating-point registers. So there is probably room for improvement
here, especially if we can figure out why almost all ATI chips
tend to lock up when using AGP...

>[XFree86 4.2.0 and DMA]
>> Note that for DMA you need DRI, which needs large amounts of
>> VRAM. So you may only be able to use that in 16-bit mode. I'm
>> usually running in 32-bit mode :(
>
> So, from your comment, I infer that you have an iBook also?
>
> Anyway, I don't mind if I have to use 16-bit mode. If the
> movie is "good enough", then I'm cool with that.
>
> I'd love to get DMA running on this iBook and would appreciate
> if anybody could tell me where I can get a r128 with version
> 2.2, as current benh's kernel only has version 2.1.6
> available. :-(
>
> Perhaps the r128 module isn't being maintained?
>
>> You've got it :-) No, there are a few things that still can be done:
>>
>> - get ATI to open the specs for iDCT and MC in hardware. Won't happen
>> anytime soon.
>
> What is the problem with publishing the interfaces for that? I
> thought that ATI were an open-source friendly company. :-(

They used to be, but they also stopped responding properly to my
emails for some time; that's one reason why the TiBook Rage M6
sleep bug (apparently related to a hardware bug in the chip) isn't
fixed yet.

>> - optimize a few of the more processor-intensive parts of the algorithms
>> with handcoded ASM. Good luck...
>
> Well, I'm so pissed that I am currently even considering
> learning PPC's assembly for this task. I even downloaded
> Motorola's user guide for the G3. :-( The only problem now is
> lack of time.

You may want to look into one big bottleneck with PPC: memory
transfers. If you can find places where quantities other than fully
aligned 32-bit words are moved around, you may get a significant
performance boost by implementing a simple "deblocking" mechanism
via a couple of intermediate registers. This can most of the time
be done entirely in C.

If you find raw memcpy's, you can make sure they are 64-bit aligned
on both source and dest and then use floating-point registers to move
64 bits at a time.

Rogério Brito

May 13, 2002, 4:10:12 PM
On May 12 2002, Michel Dänzer wrote:
> On Sun, 2002-05-12 at 14:34, Rogério Brito wrote:
> > I guess that Apple is making false claims about the iBook's
> > ability to play DVDs, since my complete collection of DVDs
> > (all 4 of them :-) ) show interlacing artifacts in the image,
> > even when playing under MacOS X. :-(
>
> Apple's DVD player is known for its lack of any deinterlacing.
> Shouldn't be hard to add...

Shouldn't be hard indeed. I just sent a patch for xine 0.9.9
that added one deinterlacing mode (linear blend) in plain C
(which makes it currently the only portable mode of
deinterlacing, apart from onefield_xv, which is terrible in
terms of image quality).
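For readers curious what such a mode looks like, here is a minimal
linear-blend sketch in plain C. It illustrates the general technique only;
the function name and layout are made up and this is not the actual xine
patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative linear-blend deinterlacer: each interior output line is
 * the average of the lines above and below it, blending the two fields
 * together. Operates on one 8-bit image plane. */
static void linear_blend(const uint8_t *src, uint8_t *dst,
                         size_t width, size_t height)
{
    /* first and last lines have only one neighbour; copy them as-is */
    for (size_t x = 0; x < width; x++) {
        dst[x] = src[x];
        dst[(height - 1) * width + x] = src[(height - 1) * width + x];
    }
    for (size_t y = 1; y + 1 < height; y++)
        for (size_t x = 0; x < width; x++)
            dst[y * width + x] =
                (src[(y - 1) * width + x] + src[(y + 1) * width + x] + 1) / 2;
}
```

The blend softens the comb artifacts between fields at the cost of some
vertical resolution, which matches the trade-off discussed in the thread.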

It is not useful for me with the iBook, but should be useful
for TiBook owners, if the G4s are faster than G3s with code
that is not Altivec enabled.

The actual code didn't take more than, say, 5 or 10 lines.

I can't see why Apple can't include something like this in
their next versions of DVD player. And, obviously, they have
the knowledge for making fast hand-coded assembly versions of
them.

> > All that I can say is that I feel cheated. :-(
>
> Well, the actual movies shouldn't be interlaced.

Well, I read that also in Ben's reply and got surprised, since
without exception all DVDs that I have seen to this day are
interlaced.

> Rumour has it that the current one in benh's tree is 2.2 except the
> version is wrong (somebody posted a patch to 'fix' the version number to
> linuxppc-dev), but if I were you I'd wait for confirmation from benh
> before relying on that. :)

Yes, I went to hunt that message and actually recompiled a new
kernel with the new version, but for some reason, I couldn't
make dri work with the X binaries from your site.

But now my mood is back to the conservative mode and I am
using a standard benh kernel. :-)

> > Perhaps the r128 module isn't being maintained?
>
> It is, kinda, by XFree86 and DRI. Both their CVS repositories have
> the 2.2 version, just nobody has updated the Linux kernel yet.

I see. I'm crossing my fingers for this update to happen
soon. :-)

> > What is the problem with publishing the interfaces for that? I
> > thought that ATI were an open-source friendly company. :-(
>
> They are IMHO. Every company seems to be afraid of releasing that
> kind of information, I have an idea why but people tell me there's
> no reason...

Could you say what could be the possible reasons for ATI to
not release the specifications? I'd be really interested to
know. I promise that I won't start a flamewar if I don't like
what I read. :-)

But how "closed" is ATI to specifying the interface to hw
accel of their chips? Perhaps an organized petition would help
here?

> I'm afraid you'll be disappointed. Seems it was only good for a real
> 'gain' as a hack on top of 4.1. Now that it's done (more :)
> correctly, a lot of cycles are still wasted waiting for the DMA
> transfer to complete.

Well, just using a new X helped performance a bit with xine. I
am hoping that DMA would help more. Any tiny bit would help.

And is the CPU stalled while the DMA transfer is being
performed? I'd say no, given my limited understanding of the
matter (otherwise, there would be no improvement in doing DMA
in the first place, right?)

If it is not, then even lowering the CPU consumption would
help, as there could be more processes decoding the video,
right? Or would the actual video output be slower, in general?

OTOH, perhaps slower video operations being the bottleneck is
the reason why mplayer's video-out benchmark with X 4.2 was so
much higher than with X 4.1?

> Using an interrupt might help there.

I don't understand this, sorry. :-)


Thank you for your comments, Roger...

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rogério Brito - rbr...@iname.com - http://www.ime.usp.br/~rbrito/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Michel Dänzer

May 14, 2002, 7:30:14 PM
On Mon, 2002-05-13 at 22:04, Rogério Brito wrote:
> On May 12 2002, Michel Dänzer wrote:
> > On Sun, 2002-05-12 at 14:34, Rogério Brito wrote:
>
> > > All that I can say is that I feel cheated. :-(
> >
> > Well, the actual movies shouldn't be interlaced.
>
> Well, I read that also in Ben's reply and got surprised, since
> without exception all DVDs that I have seen to this day are
> interlaced.

Sounds like bad encodings. The original movies as shown in theaters obviously
aren't interlaced, so if a DVD is, that's purely artificial.


> > Rumour has it that the current one in benh's tree is 2.2 except the
> > version is wrong (somebody posted a patch to 'fix' the version number to
> > linuxppc-dev), but if I were you I'd wait for confirmation from benh
> > before relying on that. :)
>
> Yes, I went to hunt that message and actually recompiled a new
> kernel with the new version, but for some reason, I couldn't
> make dri work with the X binaries from your site.

What happened? My suspicion is that the other guy only tried with the 4.2 3D
driver, where 2.1 or 2.2 doesn't really make a difference.


> > > Perhaps the r128 module isn't being maintained?
> >
> > It is, kinda, by XFree86 and DRI. Both their CVS repositories have
> > the 2.2 version, just nobody has updated the Linux kernel yet.
>
> I see. I'm crossing my fingers for this update to happen
> soon. :-)

It's being worked on, see the dri-devel archives if you're interested.
Meanwhile, for 4.2 you can build the DRM from the XFree86 4.2 CVS branch or
from http://xfree86.org/~alanh/ .


> > > What is the problem with publishing the interfaces for that? I
> > > thought that ATI were an open-source friendly company. :-(
> >
> > They are IMHO. Every company seems to be afraid of releasing that
> > kind of information, I have an idea why but people tell me there's
> > no reason...
>
> Could you say what could be the possible reasons for ATI to
> not release the specifications? I'd be really interested to
> know. I promise that I won't start a flamewar if I don't like
> what I read. :-)

My idea is it has to do with certain US laws, but what do I know.

> But how "closed" is ATI to specifying the interface to hw
> accel of their chips?

They provide docs to interested developers, just not for these features.

> Perhaps an organized petition would help here?

Who knows, but I'd expect the GATOS project to have achieved something in
that case.


> > I'm afraid you'll be disappointed. Seems it was only good for a real
> > 'gain' as a hack on top of 4.1. Now that it's done (more :)
> > correctly, a lot of cycles are still wasted waiting for the DMA
> > transfer to complete.
>
> Well, just using a new X helped performance a bit with xine. I
> am hoping that DMA would help more. Any tiny bit would help.
>
> And is the CPU stalled when the DMA transfer is being
> performed? I'd say no, given my limited understanding of the
> matter (otherwise, there would be no improvement in doing DMA
> in the first place, right?)

Right, but with a catch: the DMA transfer must complete before the image can
be displayed. Right now, this is ensured by waiting for the DMA engine to go
idle, which is basically implemented as a busy loop in the DRM. So while the
transfer will be faster than with a plain memcpy, a lot of CPU cycles are
wasted.

> If it is not, then even lowering the CPU consumption would
> help, as there could be more processes decoding the video,
> right? Or would the actual video output be slower, in general?

Should be at least as fast, just hardly significantly faster yet.

> OTOH, perhaps slower video operations being the bottleneck is
> the reason why mplayer's video-out benchmark with X 4.2 was so
> much higher than with X 4.1?

I don't understand. What does that benchmark measure and how do the results
compare?

> > Using an interrupt might help there.
>
> I don't understand this, sorry. :-)

The chip could issue an interrupt when the DMA transfer is complete so
the CPU could do something useful while it's in progress.


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

Adrian Cox

May 15, 2002, 5:10:14 AM
On 15 May 2002 00:05:59 +0200
Michel D <dae...@debian.org> wrote:

> On Mon, 2002-05-13 at 22:04, Rogério Brito wrote:
> > On May 12 2002, Michel Dänzer wrote:
> > >
> > > Well, the actual movies shouldn't be interlaced.
> >
> > Well, I read that also in Ben's reply and got surprised, since
> > without exception all DVDs that I have seen to this day are
> > interlaced.
>
> Sounds like bad encodings. The original movies as shown in theaters obviously
> aren't interlaced so if a DVD is that's purely artificial.

Are you sure about this? I thought most DVDs carried either interlaced
NTSC or interlaced PAL. In the NTSC case the movie is converted to
30fps interlaced by a 3:2 pulldown (3 fields from one frame, 2 fields
from the next). In the PAL case each frame is split into two fields,
which speeds the film up slightly. Adjusting the pitch of the audio is
optional in this case. I believe that in both cases this is normally
done when mastering the DVD, to allow for manual tweaking of the
process.
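The 3:2 cadence described above can be illustrated in a few lines of C
(the function name is made up); it only tracks which film frame each
output field is drawn from, not the pixel data:

```c
#include <stddef.h>

/* Illustrative 3:2 pulldown: film frames alternately contribute 3 and
 * 2 video fields, turning 24 film frames into 60 NTSC fields (i.e. 30
 * interlaced frames) per second. field_src[] records which film frame
 * each field came from. Returns the number of fields produced. */
static size_t pulldown_32(int nframes, int field_src[], size_t max_fields)
{
    size_t nfields = 0;
    for (int f = 0; f < nframes; f++) {
        int copies = (f % 2 == 0) ? 3 : 2;   /* the 3:2 alternation */
        for (int c = 0; c < copies && nfields < max_fields; c++)
            field_src[nfields++] = f;
    }
    return nfields;
}
```

Four film frames yield ten fields, so 24 frames per second become 60 fields
per second; inverse telecine has to detect and undo exactly this pattern.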

To play back NTSC material sourced from 24fps film on a progressive-scan
display, the player needs to do an inverse telecine transformation. Do
any of the Linux players support this?

For examples of why inverse telecine is hard, see:
http://www.lukesvideo.com/telecining4.html

- Adrian Cox

Josh Huber

May 15, 2002, 12:20:12 PM
Michel Dänzer <dae...@debian.org> writes:

> Sounds like bad encodings. The original movies as shown in theaters
> obviously aren't interlaced so if a DVD is that's purely artificial.

Yes, most DVDs are interlaced. This is why you need a progressive
scan DVD player (which performs de-interlacing) to get non-interlaced
output.

see for example:

http://www.hometheaterhifi.com/volume_7_4/dvd-benchmark-part-5-progressive-10-2000.html

ttyl,

--
Josh Huber | hu...@debian.org |

Benjamin Herrenschmidt

May 15, 2002, 2:40:11 PM
>Right, but with a catch: the DMA transfer must complete before the image can
>be displayed. Right now, this is ensured by waiting for the DMA engine to go
>idle, which is basically implemented as a busy loop in the DRM. So while the
>transfer will be faster than with plain memcpy, a lot of CPU cycles are
>wasted.

Which seems silly. What should happen is:

- Decode frame 1
- Start DMA'ing frame 1
- Decode frame 2
- Queue DMA for frame 2 (or start DMA'ing it if frame 1 is done)
- Wait for frame 1 to complete DMA
- Display frame 1
- Decode frame 3

etc...

That is, you should have at least 2 buffers so you can asynchronously
DMA a frame while decoding the next frame. It only makes sense to busy
loop on the DMA if you are decoding faster than you display. If that
is not the case, then you are just throwing CPU cycles out the window.

The whole point of DMA is for the CPU to do something else while the
ATI chip is doing the transfer.
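That ordering can be sketched as a small single-process simulation in C,
with memcpy standing in for the asynchronous DMA (all names here are made
up for illustration):

```c
#include <string.h>

#define FRAME_BYTES 64
#define NUM_BUFFERS 2

/* Single-process simulation of the two-buffer scheme above: while
 * frame N is "in flight" to VRAM, frame N+1 is decoded into the other
 * buffer. Real code would start an asynchronous DMA transfer instead
 * of the memcpy used here. */
static unsigned char buffers[NUM_BUFFERS][FRAME_BYTES];
static unsigned char vram[FRAME_BYTES];

static void decode_frame(int frame, unsigned char *dst)
{
    memset(dst, frame & 0xff, FRAME_BYTES);   /* stand-in for MPEG decode */
}

static void dma_to_vram(const unsigned char *src)
{
    memcpy(vram, src, FRAME_BYTES);           /* stand-in for async DMA */
}

/* Returns the value of the last frame "displayed". */
static int play(int nframes)
{
    int displayed = -1;

    decode_frame(0, buffers[0]);
    for (int f = 0; f < nframes; f++) {
        dma_to_vram(buffers[f % NUM_BUFFERS]);     /* frame f in flight */
        if (f + 1 < nframes)                       /* overlap: decode next */
            decode_frame(f + 1, buffers[(f + 1) % NUM_BUFFERS]);
        /* here: wait for DMA of frame f to complete, then display it */
        displayed = vram[0];
    }
    return displayed;
}
```

In real code the "wait" at the end of the loop would block on a completion
interrupt rather than busy-poll, so decoding and the transfer truly overlap.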

Ben.

Michel Dänzer

May 15, 2002, 6:30:17 PM
On Wed, 2002-05-15 at 11:02, Adrian Cox wrote:
> On 15 May 2002 00:05:59 +0200
> Michel D <dae...@debian.org> wrote:
>
> > On Mon, 2002-05-13 at 22:04, Rogério Brito wrote:
> > > On May 12 2002, Michel Dänzer wrote:
> > > >
> > > > Well, the actual movies shouldn't be interlaced.
> > >
> > > Well, I read that also in Ben's reply and got surprised, since
> > > without exception all DVDs that I have seen to this day are
> > > interlaced.
> >
> > Sounds like bad encodings. The original movies as shown in theaters obviously
> > aren't interlaced so if a DVD is that's purely artificial.
>
> Are you sure about this? I thought most DVDs carried either interlaced
> NTSC or interlaced PAL. In the NTSC case the movie is converted to
> 30fps interlaced by a 3:2 pulldown (3 fields from one frame, 2 fields
> from the next). In the PAL case each frame is split into two fields,
> which speeds the film up slightly. Adjusting the pitch of the audio is
> optional in this case. I believe that in both cases this is normally
> done when mastering the DVD, to allow for manual tweaking of the
> process.

You're right, I forgot about this, but in the PAL case the two fields are
basically the same as one frame, right? Maybe that's why I haven't
noticed any interlacing yet.


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

Michel Dänzer

May 15, 2002, 6:40:04 PM
On Wed, 2002-05-15 at 13:33, Benjamin Herrenschmidt wrote:
> >Right, but with a catch: the DMA transfer must complete before the image can
> >be displayed. Right now, this is ensured by waiting for the DMA engine to go
> >idle, which is basically implemented as a busy loop in the DRM. So while the
> >transfer will be faster than with plain memcpy, a lot of CPU cycles are
> >wasted.
>
> Which seems silly. What should happen is:
>
> - Decode frame 1
> - Start DMA'ing frame 1
> - Decode frame 2
> - Queue DMA for frame 2 (or start DMA'ing it if frame 1 is done)
> - Wait for frame 1 to complete DMA
> - Display frame 1
> - Decode frame 3
>
> etc...
>
> That is, you should have at least 2 buffers so you can asynchronously
> DMA a frame while decoding the next frame. It only makes sense to busy
> loop on the DMA if you are decoding faster than you display. If that
> is not the case, then you are just throwing CPU cycles out the window.
>
> The whole point of DMA is for the CPU to do something else while the
> ATI chip is doing the transfer.

The problem is that the XVideo extension is very simple (it was
basically designed to display single images with colorspace conversion
and scaling, nothing more), so if this scheme is possible at all with
it, the app will have to do the hard work.

Things are probably better with XvMC.


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

Benjamin Herrenschmidt
May 15, 2002, 6:40:08 PM
>
>> That is you should have at least 2 buffers so you can asynchronously
>> DMA a frame while decoding the next frame. It makes only sense to busy
>> loop on the DMA if you are decoding faster than what you display. If this
>> is not the case, then you are just throwing CPU cycles by the window.
>>
>> The whole point of DMA is for the CPU to do something else while the
>> ATI chip is doing the transfer.
>
>The problem is that the XVideo extension is very simple (it was
>basically designed to display single images with colorspace conversion
>and scaling, nothing more), so if this scheme is possible at all with
>it, the app will have to do the hard work.

Doesn't xv support double buffer ?

>Things are probably better with XvMC.

--

Michel Dänzer
May 15, 2002, 7:20:05 PM
On Thu, 2002-05-16 at 00:51, Benjamin Herrenschmidt wrote:
> >
> >> That is you should have at least 2 buffers so you can asynchronously
> >> DMA a frame while decoding the next frame. It makes only sense to busy
> >> loop on the DMA if you are decoding faster than what you display. If this
> >> is not the case, then you are just throwing CPU cycles by the window.
> >>
> >> The whole point of DMA is for the CPU to do something else while the
> >> ATI chip is doing the transfer.
> >
> >The problem is that the XVideo extension is very simple (it was
> >basically designed to display single images with colorspace conversion
> >and scaling, nothing more), so if this scheme is possible at all with
> >it, the app will have to do the hard work.
>
> Doesn't xv support double buffer ?

Some drivers (including r128 and radeon) do, but that just means that
the current image will be displayed on the next retrace after it has
been transferred to video RAM.


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

be...@kernel.crashing.org
May 16, 2002, 5:20:10 AM
>> Doesn't xv support double buffer ?
>
>Some drivers (including r128 and radeon) do, but that just means that
>the current image will be displayed on the next retrace after it has
>been transferred to video RAM.

Ok, so I believe we can't blit asynchronously because the source
buffer won't be available to the X server upon exit from PutImage?

If this is not the case, that is, if X can still tap the source
buffer upon exit from PutImage, it can do async blits transparently
pretty easily.

In all cases, we would surely benefit from some way to block on the
interrupt instead of spin looping.

Ben.

Rogério Brito
May 16, 2002, 7:40:09 AM
On May 15 2002, Benjamin Herrenschmidt wrote:
> Which seems silly. What should happen is
(...)

>
> That is you should have at least 2 buffers so you can asynchronously
> DMA a frame while decoding the next frame.

Yes, this queueing scheme was basically what I meant when I
wrote in an earlier message the following:

"And is the CPU stalled when the DMA transfer is being
performed? (...) If it is not, then even lowering the CPU
consumption would help, as there could be more processes
decoding the video, right?"

> It only makes sense to busy loop on the DMA if you are decoding
> faster than what you display. If this is not the case, then you are
> just throwing CPU cycles out the window.

And in the case of the G3 decoding video from a DVD, I learned
that every cycle is important, without any question, as it
doesn't feature vector instructions.

> The whole point of DMA is for the CPU to do something else while the
> ATI chip is doing the transfer.

Yes. According to mplayer's output, the r128 wastes quite a
lot of time on this iBook outputting the video, even compared
to my Celeron 466 with a mach64 chip.

BTW, gentlemen, thank you for this quite instructive
discussion.


[]s, Roger...

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rogério Brito - rbr...@iname.com - http://www.ime.usp.br/~rbrito/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Rogério Brito
May 16, 2002, 7:40:10 AM
On May 16 2002, Michel Dänzer wrote:
> The problem is that the XVideo extension is very simple (it was
> basically designed to display single images with colorspace
> conversion and scaling, nothing more),

Which means it was not really meant for video in the sense of
a sequence of frames?

> Things are probably better with XvMC.

But will it depend on the idct and mc specifications of the
chips?

If yes, then it would probably be a problem with the r128,
right?


[]s, Roger...

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rogério Brito - rbr...@iname.com - http://www.ime.usp.br/~rbrito/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Michel Dänzer
May 16, 2002, 10:30:08 AM
On Thu, 2002-05-16 at 13:33, Rogério Brito wrote:
> On May 16 2002, Michel Dänzer wrote:
> > The problem is that the XVideo extension is very simple (it was
> > basically designed to display single images with colorspace
> > conversion and scaling, nothing more),
>
> Which means it was not meant really for videos in the sense of
> a sequence of frames?

I wouldn't say it wasn't meant for that, but I think the things we worry
about today were far from possible when it was conceived.


> > Things are probably better with XvMC.
>
> But will it depend on the idct and mc specifications of the
> chips?

No, XvMC was originally designed for that but it also supports
non-accelerated surfaces.

In contrast, Xv definitely can't take advantage of those chip features;
XvMC is a must for that.

> If yes, then it would probably be a problem with the r128,
> right?

The problem right now is that there's no XvMC driver for r128 yet.


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

Michel Dänzer
May 18, 2002, 9:40:08 AM
On Thu, 2002-05-16 at 11:57, be...@kernel.crashing.org wrote:
> >> Doesn't xv support double buffer ?
> >
> >Some drivers (including r128 and radeon) do, but that just means that
> >the current image will be displayed on the next retrace after it has
> >been transferred to video RAM.
>
> Ok, so I believe we can't blit asynchronously because the source
> buffer won't be available to the X server upon exit from PutImage?
>
> If this is not the case, that is, if X can still tap the source
> buffer upon exit from PutImage, it can do async blits transparently
> pretty easily.

The source buffer doesn't matter after we have filled the DMA buffers,
does it?

Anyway, I don't see any explicit synchronisation in the driver, so
probably the problem is the players calling XSync().


> In all cases, we would surely benefit some way to block on the interrupt
> instead of spin looping.

Yep.


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

Benjamin Herrenschmidt
May 18, 2002, 9:50:08 AM
>The source buffer doesn't matter after we have filled the DMA buffers,
>does it?

Ah, you are right, I forgot about the fact that we do an additional
copy here. Too bad we can't just DMA from the source buffer; that
sucks, remember the RAM throughput of most Macs isn't that good...
This is probably one reason why DMA doesn't show that significant a
perf improvement.

>Anyway, I don't see any explicit synchronisation in the driver, so
>probably the problem is the players calling XSync().

Well, do we wait for DMA to finish or not? If we do, then we are
doing explicit sync. If we don't, then we should at least, when
submitting frame N+1, wait for frame N to finish. The point is to
have at least one frame in advance so as not to busyloop when there
is still work to do. The ideal case would be to block on command
completion using an IRQ, of course.

XSync() can be faked. We can perfectly well decide to buffer one frame
in advance, can't we? Then XSync on the next frame, so we won't
block on double buffer setup if the second buffer isn't filled. That
makes sure we only block (or spinloop) if the decoder is feeding
us faster than the framerate, and only when we have the 2 buffers
filled. That should smooth the whole data flow and avoid a lot of
useless sleeps/busyloops.

Ben.

Michel Dänzer
May 19, 2002, 5:20:08 AM
On Fri, 2002-05-17 at 04:42, Benjamin Herrenschmidt wrote:
> >The source buffer doesn't matter after we have filled the DMA buffers,
> >does it?
>
> Ah, you are right, I forgot about the fact we did an additional copy
> here. Too bad we can't just DMA from the source buffer, that sucks,
> remember the RAM throughput of most Macs isn't that good... This is
> probably one reason why DMA doesn't show that a significant perf
> improvement.

One might be able to DMA directly from the source buffer, but one would
have to walk the pages and set up descriptor tables with the bus
addresses. Do you think that could still be better? (Assuming one could
work out alignment etc.)


> >Anyway, I don't see any explicit synchronisation in the driver, so
> >probably the problem is the players calling XSync().
>
> Well, do we wait for DMA to finish or not ? If we do, then we are
> doing explicit sync.

Again, I don't see that in the driver, but maybe I'm just blind.

> If we don't, then we should, at least, wait
> on frame N+1 wait for frame N to finish. The point is to have at
> least one frame in advance to not busyloop when there is still
> work to do. The ideal case would be to block on command completion
> using an IRQ of course.
>
> XSync() can be faked. We can perfectly decide to buffer one frame
> in advance, can't we ? Then XSync on the next frame, thus we won't
> block on double buffer setup if the second buffer isn't filled. That
> make sure we only block (or spinloop) if the decoder is feeding
> us faster than the framerate, and only when we have the 2 buffers filled.
> That should smooth the whole data flow and avoid a lot of useless
> sleeps/busyloops.

Smells like a hack; this is stretching the XSync() semantics, to say
the least. In particular, I think XSync() is the only feedback the
players get for the timing, and I wonder if such a change wouldn't
have bad effects of its own there.


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

Benjamin Herrenschmidt
May 19, 2002, 5:40:10 AM
>On Fri, 2002-05-17 at 04:42, Benjamin Herrenschmidt wrote:
>> >The source buffer doesn't matter after we have filled the DMA buffers,
>> >does it?
>>
>> Ah, you are right, I forgot about the fact we did an additional copy
>> here. Too bad we can't just DMA from the source buffer, that sucks,
>> remember the RAM throughput of most Macs isn't that good... This is
>> probably one reason why DMA doesn't show that a significant perf
>> improvement.
>
>One might be able to DMA directly from the source buffer, but one would
>have to walk the pages and set up descriptor tables with the bus
>addresses. Do you think that could still be better? (Assuming one could
>work out alignment etc.)

Provided that those source pages aren't in swap... Though I think some
of the v4l drivers used to do such tricks.

The ideal way, but probably not possible with current APIs, would be
to have control over Xv (MC ?) allocation routines so that when the client
frames are allocated, it really gets a pair of AGP memory blocks allocated
from the AGP aperture and mapped into the client process space.

Ideally, we could then make it cacheable, and then have Xv flush the cache
when feeding the frame to the ring.


>
>> >Anyway, I don't see any explicit synchronisation in the driver, so
>> >probably the problem is the players calling XSync().
>>
>> Well, do we wait for DMA to finish or not ? If we do, then we are
>> doing explicit sync.
>
>Again, I don't see that in the driver, but maybe I'm just blind.

Could be implicit as part of a wait for engine ready or a 2D sync,
though I have yet to look at the implementation. We should try to
figure out where X is actually spending those cycles. But I suspect
that since the AGP memory is mapped uncacheable (and guarded), the
time you spend blitting from the source buffer to the AGP buffer is
almost as long as blitting directly to the FB... I remember trying
your first r128 implementation on the pismo; I had approx. similar
CPU usage doing DMA blits and doing manual blits to the FB using FP
registers (64-bit bursts on the bus).

>> If we don't, then we should, at least, wait
>> on frame N+1 wait for frame N to finish. The point is to have at
>> least one frame in advance to not busyloop when there is still
>> work to do. The ideal case would be to block on command completion
>> using an IRQ of course.
>>
>> XSync() can be faked. We can perfectly decide to buffer one frame
>> in advance, can't we ? Then XSync on the next frame, thus we won't
>> block on double buffer setup if the second buffer isn't filled. That
>> make sure we only block (or spinloop) if the decoder is feeding
>> us faster than the framerate, and only when we have the 2 buffers filled.
>> That should smooth the whole data flow and avoid a lot of useless
>> sleeps/busyloops.
>
>Smells like a hack, this is stretching the XSync() semantics to say the
>least, in particular I think XSync() is the only feedback the players
>get for the timing, I wonder if such a change wouldn't have bad effects
>of its own there.

It may, but being one frame off, this isn't too bad, especially since
Xv isn't good enough to do real frame sync on things like broadcast
interlaced display.

Ben.

Michel Dänzer
May 19, 2002, 11:10:07 AM
On Sat, 2002-05-18 at 23:33, Benjamin Herrenschmidt wrote:
>
> The ideal way, but probably not possible with current APIs, would be
> to have control over Xv (MC ?) allocation routines so that when the client
> frames are allocated, it really gets a pair of AGP memory blocks allocated
> from the AGP aperture and mapped into the client process space.

AFAIK this is how XvMC works, the client allocates surfaces from the
XvMC driver.

> Ideally, we could then make it cacheable, and then have Xv flush the cache
> when feeding the frame to the ring.

Maybe we could even have the chip display the overlay out of AGP memory
directly?


> >> >Anyway, I don't see any explicit synchronisation in the driver, so
> >> >probably the problem is the players calling XSync().
> >>
> >> Well, do we wait for DMA to finish or not ? If we do, then we are
> >> doing explicit sync.
> >
> >Again, I don't see that in the driver, but maybe I'm just blind.
>
> Could be implicit as part as a wait for engine ready or a 2D sync,

That's exactly what I looked for but didn't see. :)

> though I yet have to look at the impl. We should try to figure out
> where X is actually spending those cycles.

Yep, someone who cares about this will have to do that.


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

Benjamin Herrenschmidt
May 19, 2002, 3:40:09 PM
>
>Maybe we could even have the chip display the overlay out of AGP memory
>directly?

ATI doesn't recommend that. The refresh rate of the screen is high enough
that if you display from AGP memory, you'll cause a hell of a lot more
throughput on the bus than with a single blit.

>
>> >> >Anyway, I don't see any explicit synchronisation in the driver, so
>> >> >probably the problem is the players calling XSync().
>> >>
>> >> Well, do we wait for DMA to finish or not ? If we do, then we are
>> >> doing explicit sync.
>> >
>> >Again, I don't see that in the driver, but maybe I'm just blind.
>>
>> Could be implicit as part as a wait for engine ready or a 2D sync,
>
>That's exactly what I looked for but didn't see. :)
>
>> though I yet have to look at the impl. We should try to figure out
>> where X is actually spending those cycles.
>
>Yep, someone who cares about this will have to do that.

--

Albert D. Cahalan
May 19, 2002, 4:10:06 PM
Benjamin Herrenschmidt writes:
> [somebody]

>> Maybe we could even have the chip display the overlay out
>> of AGP memory directly?
>
> ATI doesn't recommend that. The refresh rate of the screen is
> high enough that if you display from AGP memory, you'll cause
> a hell lot more throughput on the bus than with a single blit.

ATI wouldn't likely recommend uncached/guarded or not using
the hardware IDCT either though.

If this has any chance of getting me 1600x1024 at 24-bit on
my Mac Cube, then I like it. It's good even if the video
isn't full screen or full framerate.

Albert D. Cahalan
May 19, 2002, 5:10:08 PM
Michel Dänzer writes:
> On Wed, 2002-05-15 at 11:02, Adrian Cox wrote:

>> Are you sure about this? I thought most DVDs carried either interlaced
>> NTSC or interlaced PAL. In the NTSC case the movie is converted to
>> 30fps interlaced by a 3:2 pulldown (3 fields from one frame, 2 fields
>> from the next). In the PAL case each frame is split into two fields,
>> which speeds the film up slightly. Adjusting the pitch of the audio is
>> optional in this case. I believe that in both cases this is normally
>> done when mastering the DVD, to allow for manual tweaking of the
>> process.
>
> You're right, forgot about this, but in the PAL case the two fields are
> basically the same as one frame, right?

Not really. PAL is like NTSC, except:

1. slower framerate
2. higher resolution
3. different color encoding (only in analog form?)

In both cases, you can pick any two adjacent fields
and call them a frame. There aren't any pixels in
the analog version, but if you chopped the lines
into pixels you'd have each pixel at a unique
moment in time. The last pixel shown is closer in
time to the first pixel of the _next_ field than
it is to the first pixel in the same field.

The problem is easier to visualize if you imagine
time to be the Z axis. Movie film has slices that
go perpendicular to time. Video has lines that
aren't quite perpendicular to any axis, and they
don't line up in adjacent fields. Video lines
aren't even horizontal, and some fields get a
half line at the top or bottom of the display.

Look for fringes on objects that are moving
horizontally, like this:

::::::..
:::::::..
:::::::..
::::..
::..

People cheat. To convert a movie, you're supposed
to do motion tracking of objects so that you can
generate data for intermediate points in time.
To convert back, you'd do the same thing, except
that it's to your advantage to guess what kind of
cheating happened with the previous conversion.
Consumer hardware has nowhere near the CPU power
needed to do motion tracking of objects, so you
also are forced to cheat.

So you start by pretending that a field is a
snapshot in time, or that it is an average over
some period of time. You ignore the half lines.
If fields N and N+2 look the same, you're looking
at 3:2 pulldown, so use this:

average(N,N+2), N+1, N+3, N+4

Note that NTSC is a wee bit less than 60 fields
per second.

Albert D. Cahalan
May 19, 2002, 5:50:08 PM
Michel Dänzer writes:
> On Mon, 2002-05-13 at 22:04, Rogério Brito wrote:
>> On May 12 2002, Michel Dänzer wrote:
>>> On Sun, 2002-05-12 at 14:34, Rogério Brito wrote:

>>>> What is the problem with publishing the interfaces for that? I
>>>> thought that ATI were an open-source friendly company. :-(
>>>
>>> They are IMHO. Every company seems to be afraid of releasing that
>>> kind of information, I have an idea why but people tell me there's
>>> no reason...
>>
>> Could you say what could be the possible reasons for ATI to
>> not release the specifications? I'd be really interested to
>> know. I promise that I won't start a flamewar if I don't like
>> what I read. :-)
>
> My idea is it has to do with certain US laws, but what do I know.

Companies license stuff from others. ATI might not be able
to release the info. Perhaps "Who should I ask?" would be
a better question for ATI.

Plus imagine you worked there. You don't have authority
to release the info. Who does? Does anyone? In theory the
stockholders could sue if you give away anything that
might offer competitors some help. Maybe you think you
do have authority, but do you want to risk your job?

> Right, but with a catch: the DMA transfer must complete
> before the image can be displayed.

Exactly why?

It isn't real nice to just DMA one frame on top of
the next, but in this case it would be a dramatic
improvement.

At worst:

You have an 85 Hz monitor, needing 11.76 ms per frame.
If you can DMA in less time, then the DMA engine will
keep ahead of the display.

>>> Using an interrupt might help there.
>>
>> I don't understand this, sorry. :-)
>
> The chip could issue an interrupt when the DMA transfer is complete so
> the CPU could do something useful while it's in progress.

Then you suffer interrupt overhead. You should be able
to do something useful while you wait, polling when you
expect the DMA to be about done.

Daniel Jacobowitz
May 19, 2002, 7:50:09 PM
On Sun, May 19, 2002 at 07:44:35PM -0400, Albert D. Cahalan wrote:
> Look at "gcc -S" output. It's kind of hard to read though, because
> the compiler uses raw numbers ("6") instead of register names ("r6"
> or "f6").

-mregnames; I believe it has worked for at least several years.

> The cache manipulation instructions are nice.
> Even w/o the extra AltiVec ones, you can do a
> lot with 32-byte chunks of memory.

If you're willing to assume the cacheline size, which varies between
existing PowerPC implementations.

--
Daniel Jacobowitz Carnegie Mellon University
MontaVista Software Debian GNU/Linux Developer

Albert D. Cahalan
May 19, 2002, 7:50:08 PM
Rogério Brito writes:
> On May 11 2002, Michel Lanners wrote:
>> On 10 May, this message from Rogério Brito echoed through cyberspace:

> [High CPU usage during video output]
>> That's essentially because of MTRR on i386. I wouldn't know hard
>> numbers to compare, but at least subjectivly, MTRR helps a lot for
>> the copy to VRAM of the video out data.
>
> Ouch, I miss MTRR. :-(

As far as XFree86 is concerned, MTRR is a Linux kernel
feature. The driver handles similar non-Intel features;
it could do something useful on PowerPC too.

Things are a little different on PowerPC, with the
attributes specified on a per-page basis. It's not
good to have two mappings with different attributes
for the same physical address.

Still, it should be doable. You'll want memory to
be write-back cached, not guarded, and not coherent.
Then you push stuff out by hand, using the cache
control instructions to do so. (being not coherent
is great when you don't need to worry about getting
swapped out or scheduled on another CPU)

>> - optimize a few of the more processor-intensive parts
>> of the algorithms with handcoded ASM. Good luck...
>
> Well, I'm so pissed that I am currently even considering
> learning PPC's assembly for this task. I even downloaded
> Motorola's user guide for the G3. :-( The only problem
> now is lack of time.

PowerPC assembly is easy, at least w/o using AltiVec.
Just remember that Linux fails to set the LE and ILE
bits in the MSR, so a multi-byte value is stored
backwards in memory. Other than that, it's pretty sane.
The documentation is crap though; shortcut opcodes
like "li" are used all over the place in "real" code
but you can't find them in the index or opcode table.

Tricks to know:

Motorola will ship you free books if you can figure
out where to ask. Dig around on their web site.

There are 3 instructions (rlwimi & friends) that
let you rotate, shift, and mask. Learn to use them.

The FPU runs in parallel to the integer unit(s).
You'll want it in the non-recoverable mode (some
MSR bits control this) with all exceptions off.

An unsigned int up to 0x007fffff is a float, with
a factor of 2**150 to annoy you. (a 150-bit shift)
This won't work for AltiVec, or a 6xx with the NI
bit set in the FPSCR.

Look at "gcc -S" output. It's kind of hard to
read though, because the compiler uses raw numbers
("6") instead of register names ("r6" or "f6").

Interleave your code to avoid stalls.

The cache manipulation instructions are nice.
Even w/o the extra AltiVec ones, you can do a
lot with 32-byte chunks of memory.

I find that drawing on a physical piece of paper
helps with register allocation. Scatter instructions all
over the paper, using arrows to indicate dependencies.
Use "r[]" to name every register, where "[]" is just
a placeholder box you draw. Shift the instructions
around so that you avoid having an arrow between
directly adjacent instructions. As you work, label the
arrows like this: r4, memory, cr2, f11, r8. You are
done when all of the boxes have been filled in.

Linux doesn't trash any registers. Thread libraries
might trash some. There is a register reserved for
the OS, and another for a small data area... you
can abuse both of them.

Use the CTR register for call-by-pointer. You can
also use it for inner loops. Don't store useless
state in a leaf function.

Set the branch prediction bits on conditional branches.

Don't waste your time searching for a conditional
move instruction. ARM, x86, Alpha, and IA-64 all
have this, but not PowerPC. You'll have to jump if
you can't get by with some sort of computation.
If you take pride in straight-line code, PowerPC
is going to frustrate you.

If you need to divide, use mulhw with a mysterious
constant and then shift right. Search the web for
more info, or grab the constant from gcc output.

> BTW, I'm using the following options to compile the programs I
> am trying:
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> CC="gcc-3.0"
> CFLAGS="-O3 -fomit-frame-pointer -ffast-math -frename-registers \
> -mtune=750 -mcpu=750 -mfused-madd"
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> Any suggestions on what else I should use? Perhaps gcc-3.1 is
> a bit better regarding optimizations?

Try -O2 or -Os instead of -O3.

Albert D. Cahalan
May 19, 2002, 8:10:06 PM
Daniel Jacobowitz writes:
> On Sun, May 19, 2002 at 07:44:35PM -0400, Albert D. Cahalan wrote:

>> The cache manipulation instructions are nice.
>> Even w/o the extra AltiVec ones, you can do a
>> lot with 32-byte chunks of memory.
>
> If you're willing to assume the cacheline size, which
> varies between existing PowerPC implementations.

Given this:

1. not AltiVec or that 8xxx vector thing
2. fast enough to decode a DVD (7xx, not 6xx)
3. not 64-bit
4. has video hardware

Where is it not 32 bytes per cacheline?

Daniel Jacobowitz
May 19, 2002, 8:10:07 PM
On Sun, May 19, 2002 at 08:01:57PM -0400, Albert D. Cahalan wrote:
> Daniel Jacobowitz writes:
> > On Sun, May 19, 2002 at 07:44:35PM -0400, Albert D. Cahalan wrote:
>
> >> The cache manipulation instructions are nice.
> >> Even w/o the extra AltiVec ones, you can do a
> >> lot with 32-byte chunks of memory.
> >
> > If you're willing to assume the cacheline size, which
> > varies between existing PowerPC implementations.
>
> Given this:
>
> 1. not AltiVec or that 8xxx vector thing
> 2. fast enough to decode a DVD (7xx, not 6xx)
> 3. not 64-bit
> 4. has video hardware
>
> Where is it not 32 bytes per cacheline?

I was under the impression that the 4xx line were perfect matches: take
PCI video, no vector hardware, definitely 32-bit, quite fast. I might
be mistaken about the line size on those, though.

--
Daniel Jacobowitz Carnegie Mellon University
MontaVista Software Debian GNU/Linux Developer

Michel Dänzer
May 20, 2002, 8:00:09 AM
On Sun, 2002-05-19 at 22:01, Albert D. Cahalan wrote:
> Benjamin Herrensch writes:
> > [somebody]
>
> >> Maybe we could even have the chip display the overlay out
> >> of AGP memory directly?
> >
> > ATI doesn't recommend that. The refresh rate of the screen is
> > high enough that if you display from AGP memory, you'll cause
> > a hell lot more throughput on the bus than with a single blit.

I'm sure Rogério wouldn't mind loading the bus more in favour of the
CPU. :) But anyway, limiting bus traffic to one transfer per frame
shouldn't be hard either.

> ATI wouldn't likely recommend uncached/guarded or not using
> the hardware IDCT either though.
>
> If this has any chance of getting me 1600x1024 at 24-bit on
> my Mac Cube, then I like it.

The bandwidth between the chip and its video RAM matters for this, the
bus bandwidth only limits the size or fps of the video images.


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

Michel Dänzer
May 20, 2002, 8:10:05 AM
On Sun, 2002-05-19 at 23:45, Albert D. Cahalan wrote:
>
> >>> Using an interrupt might help there.
> >>
> >> I don't understand this, sorry. :-)
> >
> > The chip could issue an interrupt when the DMA transfer is complete so
> > the CPU could do something useful while it's in progress.
>
> Then you suffer interrupt overhead. You should be able
> to do something useful while you wait, polling when you
> expect the DMA to be about done.

If it was that easy, it would already be done... Doesn't sound like you
are familiar with the situation we're discussing.


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

Michel Dänzer
May 20, 2002, 8:20:09 AM
On Mon, 2002-05-20 at 01:44, Albert D. Cahalan wrote:
>
> Just remember that Linux fails to set the LE and ILE
> bits in the MSR, so a multi-byte value is stored
> backwards in memory.

No, this sentence is backwards. :)


As for the rest of your post, thanks for sharing your apparently immense
low-level knowledge, but I wish you'd also use it...


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

be...@kernel.crashing.org
May 21, 2002, 6:00:09 AM
>> ATI doesn't recommend that. The refresh rate of the screen is
>> high enough that if you display from AGP memory, you'll cause
>> a hell lot more throughput on the bus than with a single blit.
>
>ATI wouldn't likely recommend uncached/guarded or not using
>the hardware IDCT either though.

uncached/guarded is more or less mandatory on AGP as the HW isn't
cache coherent, though we would probably get better throughput
using cached mapping and explicit cache flushes.

As far as HW IDCT is concerned, well... :)

Albert D. Cahalan
May 21, 2002, 8:50:08 PM
be...@kernel.crashing.org writes:
> [Albert Cahalan]

>> ATI wouldn't likely recommend uncached/guarded or not using
>> the hardware IDCT either though.
>
> uncached/guarded is more or less mandatory on AGP as the HW isn't
> cache coherent, though we would probably get better throughput
> using cached mapping and explicit cache flushes.

Hey, wait a minute... why guarded?
Tell me where I'm wrong:

AGP memory is regular RAM on the motherboard.
(at least it isn't device registers)

Typically an app puts images (bumpmaps, textures, etc.)
in AGP memory. Triangles for 3d rendering also
get written to AGP memory.

This app is X, or an authorized local client.

It is not common to have the video card writing
to AGP memory.

If the video card does write to memory, X can
ensure that this doesn't happen to memory that
the user is busy writing to.

It is not common for the user to read AGP memory.

If the user does read from AGP memory, the X server
could flush some cache lines before telling the user
that the memory has been updated. (PowerPC uses a
physical cache, not a virtual cache)

The motherboard chipset will walk some sort of page table
when the video card tries to access AGP memory. This is
kept coherent by a Linux kernel DRI/DRM/AGP driver.

Aside from X itself, ordering isn't going to matter.
User apps won't be trying to atomically update data
structures as viewed from the video card. X might
do this.

It wouldn't be insane to update X to include all
the necessary cache-related instructions.

User apps need caching off by default, since trying to
update all the apps would be insane.

Unless user code will write to AGP memory on one
processor and read or write on another processor,
the M bit (Memory Coherency Attribute) can be
cleared. It's pointless for the CPU to waste bus
cycles trying to be coherent, since the video card
will not cooperate. All non-SMP systems should
map the AGP memory with coherency disabled.
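
To make the suggestion concrete, here is a tiny sketch in terms of the classic PowerPC WIMG page attribute bits; the bit values mirror the ppc32 kernel's `_PAGE_WRITETHRU`/`_PAGE_NO_CACHE`/`_PAGE_COHERENT`/`_PAGE_GUARDED` constants, but the helper names are made up for illustration:

```c
#include <stdint.h>

/* Classic PowerPC PTE storage attribute bits (WIMG), with the values
 * used by the ppc32 kernel headers. Helper names are illustrative. */
#define PTE_W 0x040  /* write-through */
#define PTE_I 0x020  /* cache inhibit */
#define PTE_M 0x010  /* memory coherence */
#define PTE_G 0x008  /* guarded */

/* What the kernel tends to set for AGP today: uncached + guarded. */
uint32_t agp_wimg_current(void)
{
    return PTE_I | PTE_G;
}

/* The proposal for non-SMP machines: copyback cached, non-coherent,
 * not guarded -- i.e. all four bits clear. */
uint32_t agp_wimg_nonsmp(void)
{
    return 0;
}
```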

No existing PowerPC will do unrequested prefetching
across page boundaries, or this is easily avoided
by not using memory adjacent to the boundary
between AGP memory and non-AGP memory.

If apps would at least avoid reading stuff written
by the video card, write-through caching would be OK.
Apps that read AGP memory are uncommon enough that
fixing all of them would be feasible.

be...@kernel.crashing.org

May 22, 2002, 4:50:14 AM

>Hey, wait a minute... why guarded?

Well, you are right about this, guarded isn't needed, though
the kernel tends to set guarded along with cache inhibit
automatically (it does so in ioremap, for example).

The main reason I currently set it is that I'm still trying to
figure out what is causing both the r128 and radeon drivers to
lock up when using DRI + AGP with the Apple chipset, and among
the things I suspected was a speculative access issue, but
I tend to no longer think it's related.

>Tell me where I'm wrong:
>
>AGP memory is regular RAM on the motherboard.
>(at least it isn't device registers)

Yes.

>Typically an app puts images (bumpmaps, textures, etc.)
>in AGP memory. Triangles for 3d rendering also
>get written to AGP memory.

Yes.

>This app is X, or an authorized local client.


Yes.

>It is not common to have the video card writing
>to AGP memory.

By default, the r128 and radeon DRI drivers have the card write
the ring read pointer to AGP memory, but doing so seems to be
broken on some HW (UniNorth 1.0.x and some ia64 bridges
don't deal with it properly)

>If the video card does write to memory, X can
>ensure that this doesn't happen to memory that
>the user is busy writing to.

Currently, I have tweaked r128 to write using normal PCI cycles
to a separate page of memory holding only that ring pointer,
and I have hacked radeon to not write at all, but instead have the
driver read that pointer from the card's MMIO registers. This
didn't help fix the lockup, though.

>It is not common for the user to read AGP memory.

You can't know that. If it's cacheable, writing a byte will cause
the CPU to load the entire cache line, for example.

>If the user does read from AGP memory, the X server
>could flush some cache lines before telling the user
>that the memory has been updated. (PowerPC uses a
>physical cache, not a virtual cache)

Well, more or less. On radeon+DRI, we could do a flush pass on the
indirect buffers when they get passed to the kernel driver.
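
A minimal sketch of such a flush pass (the function name and the 32-byte line size are assumptions; on PowerPC the write-back is a `dcbf` per line followed by a `sync`, and the loop compiles to nothing elsewhere):

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 32  /* L1 line size on the G3/G4 CPUs discussed here */

/* Hypothetical flush pass: before an indirect buffer is handed to the
 * card, write back every cache line it covers so the card sees the
 * data through the non-coherent AGP path. */
void flush_dcache_range(void *start, size_t len)
{
    uintptr_t addr = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)start + len;

    for (; addr < end; addr += CACHE_LINE) {
#ifdef __powerpc__
        __asm__ __volatile__("dcbf 0,%0" : : "r"(addr) : "memory");
#endif
    }
#ifdef __powerpc__
    __asm__ __volatile__("sync" : : : "memory");
#endif
}
```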

>The motherboard chipset will walk some sort of page table
>when the video card tries to access AGP memory. This is
>kept coherent by a Linux kernel DRI/DRM/AGP driver.

The uninorth driver does an explicit flush of this page table
after modifying it; I don't map it uncacheable.

>Aside from X itself, ordering isn't going to matter.
>User apps won't be trying to atomicly update data
>structures as viewed from the video card. X might
>do this.
>
>It wouldn't be insane to update X to include all
>the necessary cache-related instructions.

Actually, not X, but the DRM kernel driver.

>User apps need caching off by default, since trying to
>update all the apps would be insane.
>
>Unless user code will write to AGP memory on one
>processor and read or write on another processor,
>the M bit (Memory Coherency Attribute) can be
>cleared. It's pointless for the CPU to waste bus
>cycles trying to be coherent, since the video card
>will not cooperate. All non-SMP systems should
>map the AGP memory with coherency disabled.

I'm not too sure about that. What about one CPU writing
half a cache line of the ring buffer in AGP memory, and
another CPU writing the other half?

>No existing PowerPC will do unrequested prefetching
>across page boundaries, or this is easily avoided
>by not using memory adjacent to the boundary
>between AGP memory and non-AGP memory.

That isn't a problem, though I'm not sure about your statement
that they won't do unrequested prefetching. Do you have some
pointers to the docs?

>If apps would at least avoid reading stuff written
>by the video card, write-through cached would be OK.
>Apps that read AGP memory are uncommon enough that
>fixing all of them would be feasible.

I think we can use full caching (copyback) without too many
problems. In the r128 case, we'll have to flush from the X server
as it's directly writing to the ring (and maybe from the mesa driver
as well). On radeon, it's all done via indirect buffers and those
get passed to the kernel driver before being inserted in the ring.

So we can definitely improve throughput by making it cacheable.
The main reason I haven't worked on this yet is that I want the
driver to be stable first, to avoid mixing problems.
Currently, I haven't managed to figure out what is causing the
card lockups when AGP is used.

Ben.

Albert D. Cahalan

May 22, 2002, 6:40:09 AM

be...@kernel.crashi writes:
> [Albert Cahalan]

>> It is not common to have the video card writing
>> to AGP memory.
>
> By default, the r128 and radeon DRI drivers have the card write
> the ring read pointer to AGP memory, but doing so seems to be
> broken on some HW (UniNorth 1.0.x and some ia64 bridges
> don't deal with it properly)

This looks like just one thing that you could put
on a cache line (or page) by itself.

To view it from the CPU:

1. junk the cache line containing this value
2. load the value in the normal way

Anything else?
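
In code, those two steps might look roughly like this (a sketch with an illustrative function name; `dcbi` proper is supervisor-only, so the user-level way to "junk" the line is `dcbf`, which writes back and invalidates it):

```c
#include <stdint.h>

/* Sketch of the two steps for reading a card-written value (e.g. the
 * ring read pointer) from the CPU when its page is mapped cacheable.
 * On non-PowerPC builds this degrades to a plain load. */
uint32_t read_card_written_u32(volatile uint32_t *p)
{
#ifdef __powerpc__
    /* 1. junk the cache line containing this value
     * (dcbf writes back + invalidates; dcbi is supervisor-only) */
    __asm__ __volatile__("dcbf 0,%0; sync" : : "r"(p) : "memory");
#endif
    /* 2. load the value in the normal way */
    return *p;
}
```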

>> It is not common for the user to read AGP memory.
>
> You can't know that. If it's cacheable, writing a byte will cause
> the CPU to load the entire cache line, for example.

Arrrgh... I forget that users don't always write
full cache lines. Still, no problem unless the
cache line is shared with something incompatible.

>> User apps need caching off by default, since trying to
>> update all the apps would be insane.
>>
>> Unless user code will write to AGP memory on one
>> processor and read or write on another processor,
>> the M bit (Memory Coherency Attribute) can be
>> cleared. It's pointless for the CPU to waste bus
>> cycles trying to be coherent, since the video card
>> will not cooperate. All non-SMP systems should
>> map the AGP memory with coherency disabled.
>
> I'm not too sure about that. What about one CPU writing
> half a cache line of the ring buffer in AGP memory, and
> another CPU writing the other half ?

First of all, consider non-SMP. There is no other CPU.
You don't need coherency between the CPU and PCI,
because this isn't memory you'd swap out. Would a
user try to read() into this memory? (I hope not.)
So unless you need eieio to work in this memory,
making it non-coherent should be OK.

Now consider SMP. I'm guessing that this ring
buffer holds commands that are being issued to
the video card. If the kernel writes this data,
then perhaps you should write back and free the
cache line. It's a burst write. Maybe you can
avoid reading that cache line in again if you
can pad out to the end of it with NOPs.
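
A sketch of the NOP-padding idea, using the type-2 NOP packet header from the radeon/r128 drivers (the 32-byte line and the function name are assumptions):

```c
#include <stdint.h>

#define CACHE_LINE_DWORDS 8     /* 32-byte line / 4-byte ring entries */
#define CP_PACKET2 0x80000000u  /* radeon/r128 type-2 NOP packet header */

/* After emitting commands, pad the ring out to the end of the current
 * cache line with NOPs, so the whole line can be written back once and
 * never has to be refetched for a partial write. Illustrative only;
 * the caller would then write back (dcbf) the line just filled. */
uint32_t pad_ring_to_cache_line(uint32_t *ring, uint32_t tail)
{
    while (tail % CACHE_LINE_DWORDS != 0)
        ring[tail++] = CP_PACKET2;
    return tail;
}
```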

>> No existing PowerPC will do unrequested prefetching
>> across page boundaries, or this is easily avoided
>> by not using memory adjacent to the boundary
>> between AGP memory and non-AGP memory.
>
> That isn't a problem, though I'm not sure about your
> statement that they won't do unrequested prefetching.
> Do you have some pointers to the docs ?

This is from David S. Miller and others speculating
on linux-kernel:

"I don't think your PPC case needs the kernel mappings
messed with. I really doubt the PPC will speculatively
fetch/store to a TLB missing address..."

This is about the problem that hit Athlon users:

AGP memory was mapped uncachable.
AGP memory was covered by the normal kernel mapping.
The kernel would write to unrelated nearby memory.
The CPU would speculatively fetch into AGP memory.
(this is a load, to be used for a partial write)
The CPU would mark this cache line dirty.
The cache line is never written to.
The CPU would write out the cache line.

So the CPU ends up reading from AGP memory, and
then writing it back unchanged. Meanwhile, stuff
was written to AGP memory. The CPU is doing
something stupid, but AMD said that they had
every right to do so. Linux mapped the AGP memory
as both cacheable and not, the Athlon followed
the coherency protocol, and as documented the
motherboard didn't bother with coherency.

As far as I can tell, Motorola could claim exactly
the same thing. We BAT-map the AGP memory with
caching enabled, don't we? That's a conflict.

>> If apps would at least avoid reading stuff written
>> by the video card, write-through cached would be OK.
>> Apps that read AGP memory are uncommon enough that
>> fixing all of them would be feasible.
>
> I think we can use full caching (copyback) without too many
> problems. In the r128 case, we'll have to flush from the X server
> as it's directly writing to the ring (and maybe from the mesa driver
> as well). On radeon, it's all done via indirect buffers and those
> get passed to the kernel driver before being inserted in the ring.
>
> So we can definitely improve the throughput by letting it be
> cacheable. The main reason I didn't work on this yet is that I want
> the driver to be stable first to avoid possibly mixing problems.
> Currently, I haven't managed to figure out what is causing the
> card lockups when AGP is used.

Maybe the problems will go away if you cache the AGP memory.
It's worth a try, and makes stuff faster anyway.

Michel Dänzer

May 22, 2002, 7:00:11 PM

On Tue, 2002-05-21 at 22:21, be...@kernel.crashing.org wrote:
>
> >If the video card does write to memory, X can
> >ensure that this doesn't happen to memory that
> >the user is busy writing to.
>
> Currently, I tweaked r128 to write using normal PCI cycles
> to a separate page of memory only holding that ring pointer,
> and I hacked radeon to not write, but instead have the driver
> read that pointer from the card MMIO registers.

What about a mixed approach to avoid unnecessary bus traffic: try to read
the ring head pointer from memory, and if after a timeout the free ring space
still doesn't seem to be large enough, read it directly from the register.
This could hurt performance badly if the memory copy is often outdated though.
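
A sketch of that mixed approach (all names, sizes, and the ring-space arithmetic are illustrative, not the actual DRM code):

```c
#include <stdint.h>

#define RING_SIZE 256  /* entries; assumed a power of two here */

/* Stand-ins for the head-pointer copy the card writes to AGP memory
 * and for the card's MMIO head register (an expensive uncached read). */
volatile uint32_t agp_head_copy;
uint32_t mmio_head_register;

static uint32_t read_head_mmio(void)
{
    return mmio_head_register;  /* a real driver would do an MMIO read */
}

static uint32_t ring_space(uint32_t head, uint32_t tail)
{
    return (head - tail - 1) & (RING_SIZE - 1);
}

/* Poll the cheap memory copy first; only if it still doesn't show
 * enough free space after a timeout, fall back to the register. */
uint32_t wait_for_ring_space(uint32_t tail, uint32_t needed)
{
    for (int timeout = 0; timeout < 1000; timeout++) {
        uint32_t head = agp_head_copy;
        if (ring_space(head, tail) >= needed)
            return head;
    }
    return read_head_mmio();  /* memory copy may be stale */
}
```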


> >If apps would at least avoid reading stuff written
> >by the video card, write-through cached would be OK.
> >Apps that read AGP memory are uncommon enough that
> >fixing all of them would be feasible.
>
> I think we can use full caching (copyback) without too many
> problems. In the r128 case, we'll have to flush from the X server
> as it's directly writing to the ring (and maybe from the mesa driver
> as well). On radeon, it's all done via indirect buffers and those
> get passed to the kernel driver before being inserted in the ring.

How is r128 different from radeon in this respect?


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

be...@kernel.crashing.org

May 23, 2002, 5:40:05 AM

>What about a mixed approach to avoid unnecessary bus traffic: try to read
>the ring head pointer from memory, and if after a timeout the free ring space
>still doesn't seem to be large enough, read it directly from the register.
>This could hurt performance badly if the memory copy is often outdated though.

Well, let's first make it stable...


>
>> >If apps would at least avoid reading stuff written
>> >by the video card, write-through cached would be OK.
>> >Apps that read AGP memory are uncommon enough that
>> >fixing all of them would be feasible.
>>
>> I think we can use full caching (copyback) without too many
>> problems. In the r128 case, we'll have to flush from the X server
>> as it's directly writing to the ring (and maybe from the mesa driver
>> as well). On radeon, it's all done via indirect buffers and those
>> get passed to the kernel driver before being inserted in the ring.
>
>Where is r128 different than radeon in this respect?

The last time I looked at r128, it wrote to the ring directly IIRC,
while radeon only wrote to indirect buffers, with the kernel putting
them in the ring. This may have changed, though.

Ben.

Michel Dänzer

May 23, 2002, 7:00:19 PM

On Wed, 2002-05-22 at 21:48, be...@kernel.crashing.org wrote:
> >What about a mixed approach to avoid unnecessary bus traffic: try to read
> >the ring head pointer from memory, and if after a timeout the free ring space
> >still doesn't seem to be large enough, read it directly from the register.
> >This could hurt performance badly if the memory copy is often outdated though.
>
> Well, let's first make it stable...

Good idea. :) Do you agree with Albert's analysis? Have you tried with
caching enabled yet?


> >> >If apps would at least avoid reading stuff written
> >> >by the video card, write-through cached would be OK.
> >> >Apps that read AGP memory are uncommon enough that
> >> >fixing all of them would be feasible.
> >>
> >> I think we can use full caching (copyback) without too many
> >> problems. In the r128 case, we'll have to flush from the X server
> >> as it's directly writing to the ring (and maybe from the mesa driver
> >> as well). On radeon, it's all done via indirect buffers and those
> >> get passed to the kernel driver before being inserted in the ring.
> >
> >Where is r128 different than radeon in this respect?
>
> The last time I looked at r128, it wrote to the ring directly iirc,
> while radeon only wrote to indirect buffers, the kernel putting them
> in the ring. This may have changed though.

I'm pretty sure only the kernel ever had access to the ring in both
drivers (the only exception I know of were the ati-5-0-[01] branches in
DRI CVS, but they were never merged into the trunk, let alone an XFree86
release).


--
Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast

be...@kernel.crashing.org

May 24, 2002, 6:30:10 AM

>On Wed, 2002-05-22 at 21:48, be...@kernel.crashing.org wrote:
>> >What about a mixed approach to avoid unnecessary bus traffic: try to read
>> >the ring head pointer from memory, and if after a timeout the free ring space
>> >still doesn't seem to be large enough, read it directly from the register.
>> >This could hurt performance badly if the memory copy is often outdated though.
>>
>> Well, let's first make it stable...
>
>Good idea. :) Do you agree with Albert's analysis? Have you tried with
>caching enabled yet?

Albert's analysis makes sense, though I tried a different approach,
which was to set the Guarded bit on the BAT mapping. Anyway, I'll do
some experiments with that again when I have less pressure from work;
I can at least try to find some more info about what's going on in
the card when the CCE stops.


>
>> >> >If apps would at least avoid reading stuff written
>> >> >by the video card, write-through cached would be OK.
>> >> >Apps that read AGP memory are uncommon enough that
>> >> >fixing all of them would be feasible.
>> >>
>> >> I think we can use full caching (copyback) without too many
>> >> problems. In the r128 case, we'll have to flush from the X server
>> >> as it's directly writing to the ring (and maybe from the mesa driver
>> >> as well). On radeon, it's all done via indirect buffers and those
>> >> get passed to the kernel driver before being inserted in the ring.
>> >
>> >Where is r128 different than radeon in this respect?
>>
>> The last time I looked at r128, it wrote to the ring directly iirc,
>> while radeon only wrote to indirect buffers, the kernel putting them
>> in the ring. This may have changed though.
>
>I'm pretty sure only the kernel ever had access to the ring in both
>drivers (the only exception I know of were the ati-5-0-[01] branches in
>DRI CVS, but they were never merged into the trunk, let alone an XFree86
>release).

Ok, my memory may be failing here ;)

Ben.
