
How to efficiently handle DMA and cache on ARMv7 ? (was "Is get_user_pages() enough to prevent pages from being swapped out ?")


Laurent Pinchart

Aug 6, 2009, 6:20:05 AM
[Resent with an updated subject, this time CC'ing linux-arm-kernel]

I've spent the last few days "playing" with get_user_pages() and mlock() and
got some interesting results. It turned out that cache coherency comes into
play at some point, making the overall problem more complex.

Here's my current setup:

- OMAP processor, based on an ARMv7 core
- MMU and IOMMU
- VIPT non-aliasing data cache
- video capture driver that transfers data to memory using DMA
- video capture application that passes userspace pointers to video buffers to
the driver

My goal is to make sure that, upon DMA completion, the correct data will be
available to the userspace application.

The first problem was to pin pages to memory, to make sure they will not be
freed while the DMA is in progress. videobuf-dma-sg uses get_user_pages() for
that, and Hugh Dickins nicely explained to me why this is enough.

The second problem is to ensure cache coherency. As the userspace application
will read data from the video buffers, those buffers will end up being cached
in the processor's data cache. The driver does need to invalidate the cache
before starting the DMA operation (userspace could in theory write to the
buffers, but the data will be overwritten by DMA anyway, so there's no need to
clean the cache).

As the cache is of the VIPT (Virtual Index Physical Tag) type, cache
invalidation can either be done globally (in which case the cache is flushed
instead of being invalidated) or based on virtual addresses. In the latter case
the processor will need to look physical addresses up, either in the TLB or
through hardware table walk.

I can see three solutions to the DMA/cache problem.

1. Flushing the whole data cache right before starting the DMA transfer.
There's no API for that in the ARM architecture, so a whole I+D cache flush is
required. This is quite costly, we're talking about around 30 flushes per
second, but it doesn't involve the MMU. That's the solution that I currently
use.

2. Invalidating only the cache lines that store video buffer data. This
requires a TLB lookup or a hardware table walk, so the userspace application
MM context needs to be available (no problem there, as we're flushing in
userspace context) and all pages need to be mapped properly. This can be a
problem as, as Hugh pointed out, pages can still be unmapped from the
userspace context after get_user_pages() returns. I have experienced one oops
due to a kernel paging request failure:

Unable to handle kernel paging request at virtual address 44e12000
pgd = c8698000
[44e12000] *pgd=8a4fd031, *pte=8cfda1cd, *ppte=00000000
Internal error: Oops: 817 [#1] PREEMPT
PC is at v7_dma_inv_range+0x2c/0x44

Fixing this requires more investigation, and I'm not sure how to proceed to
find out if the page fault is really caused by pages being unmapped from the
userspace context. Help would be appreciated.

3. Mark the pages as non-cacheable. Depending on how the buffers are then used
by userspace, the additional cache misses might destroy any benefit I would
get from not flushing the cache before DMA. I'm not sure how to mark a bunch
of pages as non-cacheable though. What usually happens is that video drivers
allocate DMA-coherent memory themselves, but in this case I need to deal with
an arbitrary buffer allocated by userspace. If someone has any experience with
this, it would be appreciated.

Regards,

Laurent Pinchart

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Ben Dooks

Aug 6, 2009, 7:50:11 AM
On Thu, Aug 06, 2009 at 12:08:21PM +0200, Laurent Pinchart wrote:
> [Resent with an updated subject, this time CC'ing linux-arm-kernel]
>
> I've spent the last few days "playing" with get_user_pages() and mlock() and
> got some interesting results. It turned out that cache coherency comes into
> play at some point, making the overall problem more complex.
>
> Here's my current setup:
>
> - OMAP processor, based on an ARMv7 core
> - MMU and IOMMU
> - VIPT non-aliasing data cache
> - video capture driver that transfers data to memory using DMA
> - video capture application that passes userspace pointers to video buffers to
> the driver
>
> My goal is to make sure that, upon DMA completion, the correct data will be
> available to the userspace application.
>
> The first problem was to pin pages to memory, to make sure they will not be
> freed while the DMA is in progress. videobuf-dma-sg uses get_user_pages() for
> that, and Hugh Dickins nicely explained to me why this is enough.
>
> The second problem is to ensure cache coherency. As the userspace application
> will read data from the video buffers, those buffers will end up being cached
> in the processor's data cache. The driver does need to invalidate the cache
> before starting the DMA operation (userspace could in theory write to the
> buffers, but the data will be overwritten by DMA anyway, so there's no need to
> clean the cache).

You'll need to clean the write buffers, otherwise the CPU may have data
queued that it has yet to write back to memory.

> As the cache is of the VIPT (Virtual Index Physical Tag) type, cache
> invalidation can either be done globally (in which case the cache is flushed
> instead of being invalidated) or based on virtual addresses. In the last case
> the processor will need to look physical addresses up, either in the TLB or
> through hardware table walk.
>
> I can see three solutions to the DMA/cache problem.
>
> 1. Flushing the whole data cache right before starting the DMA transfer.
> There's no API for that in the ARM architecture, so a whole I+D cache flush is
> required. This is quite costly, we're talking about around 30 flushes per
> second, but it doesn't involve the MMU. That's the solution that I currently
> use.
>
> 2. Invalidating only the cache lines that store video buffer data. This
> requires a TLB lookup or a hardware table walk, so the userspace application
> MM context needs to be available (no problem there, as we're flushing in
> userspace context) and all pages need to be mapped properly. This can be a
> problem as, as Hugh pointed out, pages can still be unmapped from the
> userspace context after get_user_pages() returns. I have experienced one oops
> due to a kernel paging request failure:

If you already know the virtual addresses of the buffers, why do you need
a TLB lookup (or am I being dense here?)

> Unable to handle kernel paging request at virtual address 44e12000
> pgd = c8698000
> [44e12000] *pgd=8a4fd031, *pte=8cfda1cd, *ppte=00000000
> Internal error: Oops: 817 [#1] PREEMPT
> PC is at v7_dma_inv_range+0x2c/0x44
>
> Fixing this requires more investigation, and I'm not sure how to proceed to
> find out if the page fault is really caused by pages being unmapped from the
> userspace context. Help would be appreciated.
>
> 3. Mark the pages as non-cacheable. Depending on how the buffers are then used
> by userspace, the additional cache misses might destroy any benefit I would
> get from not flushing the cache before DMA. I'm not sure how to mark a bunch
> of pages as non-cacheable though. What usually happens is that video drivers
> allocate DMA-coherent memory themselves, but in this case I need to deal with
> an arbitrary buffer allocated by userspace. If someone has any experience with
> this, it would be appreciated.
>
> Regards,
>
> Laurent Pinchart
>
>

> -------------------------------------------------------------------
> List admin: http://lists.arm.linux.org.uk/mailman/listinfo/linux-arm-kernel
> FAQ: http://www.arm.linux.org.uk/mailinglists/faq.php
> Etiquette: http://www.arm.linux.org.uk/mailinglists/etiquette.php

--
Ben

Q: What's a light-year?
A: One-third less calories than a regular year.

Laurent Pinchart

Aug 6, 2009, 9:10:11 AM
Hi Ben,

On Thursday 06 August 2009 13:46:19 Ben Dooks wrote:
> On Thu, Aug 06, 2009 at 12:08:21PM +0200, Laurent Pinchart wrote:

[snip]


> >
> > The second problem is to ensure cache coherency. As the userspace
> > application will read data from the video buffers, those buffers will end
> > up being cached in the processor's data cache. The driver does need to
> > invalidate the cache before starting the DMA operation (userspace could
> > in theory write to the buffers, but the data will be overwritten by DMA
> > anyway, so there's no need to clean the cache).
>
> You'll need to clean the write buffers, otherwise the CPU may have data
> queued that it has yet to write back to memory.

Good points, thanks.

> > As the cache is of the VIPT (Virtual Index Physical Tag) type, cache
> > invalidation can either be done globally (in which case the cache is
> > flushed instead of being invalidated) or based on virtual addresses. In
> > the last case the processor will need to look physical addresses up,
> > either in the TLB or through hardware table walk.
> >
> > I can see three solutions to the DMA/cache problem.
> >
> > 1. Flushing the whole data cache right before starting the DMA transfer.
> > There's no API for that in the ARM architecture, so a whole I+D cache flush is
> > required. This is quite costly, we're talking about around 30 flushes per
> > second, but it doesn't involve the MMU. That's the solution that I
> > currently use.
> >
> > 2. Invalidating only the cache lines that store video buffer data. This
> > requires a TLB lookup or a hardware table walk, so the userspace
> > application MM context needs to be available (no problem there, as we're
> > flushing in userspace context) and all pages need to be mapped properly.
> > This can be a problem as, as Hugh pointed out, pages can still be
> > unmapped from the userspace context after get_user_pages() returns. I
> > have experienced one oops due to a kernel paging request failure:
>
> If you already know the virtual addresses of the buffers, why do you need
> a TLB lookup (or am I being dense here?)

The virtual address is used to compute the cache line index, and the physical
address is then used when comparing the cache line tag. So the processor (or
actually the CP15 coprocessor if I'm not wrong) does a TLB lookup to get the
physical address during cache invalidation/flushing.

> > Unable to handle kernel paging request at virtual address
> > 44e12000 pgd = c8698000
> > [44e12000] *pgd=8a4fd031, *pte=8cfda1cd, *ppte=00000000
> > Internal error: Oops: 817 [#1] PREEMPT
> > PC is at v7_dma_inv_range+0x2c/0x44
> >
> > Fixing this requires more investigation, and I'm not sure how to proceed
> > to find out if the page fault is really caused by pages being unmapped
> > from the userspace context. Help would be appreciated.
> >
> > 3. Mark the pages as non-cacheable. Depending on how the buffers are then
> > used by userspace, the additional cache misses might destroy any benefit
> > I would get from not flushing the cache before DMA. I'm not sure how to
> > mark a bunch of pages as non-cacheable though. What usually happens is
> > that video drivers allocate DMA-coherent memory themselves, but in this
> > case I need to deal with an arbitrary buffer allocated by userspace. If
> > someone has any experience with this, it would be appreciated.

Regards,

Laurent Pinchart

--

David Xiao

Aug 6, 2009, 2:50:11 PM
On Thu, 2009-08-06 at 06:06 -0700, Laurent Pinchart wrote:
> Hi Ben,
>
> On Thursday 06 August 2009 13:46:19 Ben Dooks wrote:
> > On Thu, Aug 06, 2009 at 12:08:21PM +0200, Laurent Pinchart wrote:
> [snip]
> > >
> > > The second problem is to ensure cache coherency. As the userspace
> > > application will read data from the video buffers, those buffers will end
> > > up being cached in the processor's data cache. The driver does need to
> > > invalidate the cache before starting the DMA operation (userspace could
> > > in theory write to the buffers, but the data will be overwritten by DMA
> > > anyway, so there's no need to clean the cache).
> >
> > You'll need to clean the write buffers, otherwise the CPU may have data
> > queued that it has yet to write back to memory.
>
> Good points, thanks.

I thought this should have been taken care of by the CPU-specific
dma_inv_range routine. However, in arch/arm/mm/cache-v7.c,
v7_dma_inv_range does not drain the write buffer, while
v6_dma_inv_range does so at the end of all the cache maintenance
operations.
So this is probably something Russell can clarify.

Another approach is to work from the opposite direction: the kernel
allocates a non-cached buffer and then mmap()s it into user space. I have
done that in a similar situation to achieve "zero-copy".


David

Cheta...@emulex.com

Aug 6, 2009, 3:20:11 PM

Something unrelated: I haven't used this specific API, but the ARM1156 has an issue. If you use the clean-cache-block MCR operation, it might result in memory corruption, so be careful. I'm not sure which of the variants (ARM1156T2-S or ARM1156T2F-S) has that erratum.


Chetan

Jamie Lokier

Aug 6, 2009, 4:20:16 PM
David Xiao wrote:
> Another approach is working from a different direction: the kernel
> allocates the non-cached buffer and then mmap() into user space. I have
> done that in similar situation to try to achieve "zero-copy".

open(O_DIRECT) does DMA to arbitrary pages allocated by userspace, and
O_DIRECT is used by some important applications, so the problem still
needs to be solved in general.

-- Jamie

Russell King - ARM Linux

Aug 6, 2009, 6:30:14 PM
On Thu, Aug 06, 2009 at 11:46:14AM -0700, David Xiao wrote:
> On Thu, 2009-08-06 at 06:06 -0700, Laurent Pinchart wrote:
> > Hi Ben,
> >
> > On Thursday 06 August 2009 13:46:19 Ben Dooks wrote:
> > > On Thu, Aug 06, 2009 at 12:08:21PM +0200, Laurent Pinchart wrote:
> > [snip]
> > > >
> > > > The second problem is to ensure cache coherency. As the userspace
> > > > application will read data from the video buffers, those buffers will end
> > > > up being cached in the processor's data cache. The driver does need to
> > > > invalidate the cache before starting the DMA operation (userspace could
> > > > in theory write to the buffers, but the data will be overwritten by DMA
> > > > anyway, so there's no need to clean the cache).
> > >
> > > You'll need to clean the write buffers, otherwise the CPU may have data
> > > queued that it has yet to write back to memory.
> >
> > Good points, thanks.
>
> I thought this should have been taken care of by the CPU specific
> dma_inv_range routine. However, In arch/arm/mm/cache-v7.c,
> v7_dma_inv_range does not drain the write buffer; and the
> v6_dma_inv_range does that in the end of all the cache maintenance
> operations.

There's no such thing as "drain write buffer" in ARMv7. There are
barriers instead, in particular dsb, which replaces the original
"drain write buffer" instruction.

As far as userspace DMA coherency, the only way you could do it with
current kernel APIs is by using get_user_pages(), creating a scatterlist
from those, and then passing it to dma_map_sg(). While the device has
ownership of the SG, userspace must _not_ touch the buffer until after
DMA has completed.
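For concreteness, here is an UNTESTED kernel-side sketch of that sequence against the APIs of the time (circa-2.6.31 get_user_pages() signature); error handling and page release are elided, and the function name is illustrative:

```c
/* UNTESTED sketch: pin a user buffer and map it for DMA. */
static int sketch_map_user_buffer(struct device *dev, unsigned long uaddr,
				  int nr_pages, struct page **pages,
				  struct scatterlist *sg)
{
	int i, pinned, nents;

	down_read(&current->mm->mmap_sem);
	pinned = get_user_pages(current, current->mm, uaddr & PAGE_MASK,
				nr_pages, 1 /* write */, 0 /* force */,
				pages, NULL);
	up_read(&current->mm->mmap_sem);
	if (pinned < nr_pages)
		return -EFAULT;	/* real code must release the pinned pages */

	sg_init_table(sg, nr_pages);
	for (i = 0; i < nr_pages; i++)
		sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);

	/* dma_map_sg() performs the cache maintenance for this direction;
	 * userspace must not touch the buffer until dma_unmap_sg(). */
	nents = dma_map_sg(dev, sg, nr_pages, DMA_FROM_DEVICE);
	return nents ? nents : -EIO;
}
```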

However, that won't work with ARMv7's speculative prefetching. I'm
afraid with such things, DMA direct into userspace mappings becomes a
_lot_ harder, and let's face it, lots of Linux drivers just aren't going
to bother supporting this - we can't currently get agreement to have an
API to map DMA coherent pages into userspace!

David Xiao

Aug 7, 2009, 2:10:07 AM
On Thu, 2009-08-06 at 15:25 -0700, Russell King - ARM Linux wrote:
> On Thu, Aug 06, 2009 at 11:46:14AM -0700, David Xiao wrote:
> > On Thu, 2009-08-06 at 06:06 -0700, Laurent Pinchart wrote:
> > > Hi Ben,
> > >
> > > On Thursday 06 August 2009 13:46:19 Ben Dooks wrote:
> > > > On Thu, Aug 06, 2009 at 12:08:21PM +0200, Laurent Pinchart wrote:
> > > [snip]
> > > > >
> > > > > The second problem is to ensure cache coherency. As the userspace
> > > > > application will read data from the video buffers, those buffers will end
> > > > > up being cached in the processor's data cache. The driver does need to
> > > > > invalidate the cache before starting the DMA operation (userspace could
> > > > > in theory write to the buffers, but the data will be overwritten by DMA
> > > > > anyway, so there's no need to clean the cache).
> > > >
> > > > You'll need to clean the write buffers, otherwise the CPU may have data
> > > > queued that it has yet to write back to memory.
> > >
> > > Good points, thanks.
> >
> > I thought this should have been taken care of by the CPU specific
> > dma_inv_range routine. However, In arch/arm/mm/cache-v7.c,
> > v7_dma_inv_range does not drain the write buffer; and the
> > v6_dma_inv_range does that in the end of all the cache maintenance
> > operations.
>
> There's no such thing as "drain write buffer" in ARMv7. There are
> barriers instead, in particular dsb, which replaces the original
> "drain write buffer" instruction.
>
Sorry, I overlooked the "DSB" instruction at the end; yes, it looks like the
CP15-based "drain write buffer" operation is deprecated in ARMv7.

> As far as userspace DMA coherency, the only way you could do it with
> current kernel APIs is by using get_user_pages(), creating a scatterlist
> from those, and then passing it to dma_map_sg(). While the device has
> ownership of the SG, userspace must _not_ touch the buffer until after
> DMA has completed.
>
> However, that won't work with ARMv7's speculative prefetching. I'm
> afraid with such things, DMA direct into userspace mappings becomes a
> _lot_ harder, and let's face it, lots of Linux drivers just aren't going
> to bother supporting this - we can't currently get agreement to have an
> API to map DMA coherent pages into userspace!

The ARMv7 speculative prefetching will then probably apply to the DMA
coherency issue in general, for both kernel and user space DMA. Could this be
addressed by having dma_unmap_sg/single() call dma_cache_maint() when the
direction is DMA_FROM_DEVICE/DMA_BIDIRECTIONAL, to invalidate the related
cache lines in case any were filled by prefetching? This assumes
dma_unmap_sg/single() is called after each DMA operation completes.

David

Laurent Pinchart

Aug 7, 2009, 3:30:19 AM
On Thursday 06 August 2009 20:46:14 David Xiao wrote:

[snip]

> Another approach is working from a different direction: the kernel
> allocates the non-cached buffer and then mmap() into user space. I have
> done that in similar situation to try to achieve "zero-copy".

That's what most drivers do. While it's probably the easiest solution in many
cases, it will sometimes introduce additional memcpy() operations that I'd
like to avoid.

Consider the following simple use case. An application wants to display
video it acquires from the device to the screen using Xv. The video buffer is
allocated by Xv. Using the v4l2 user pointer streaming method, the device can
DMA directly to the Xv buffer. Using driver-allocated buffers, a memcpy() is
required between the v4l2 buffer and the Xv buffer.

Regards,

Laurent Pinchart

Laurent Pinchart

Aug 7, 2009, 3:50:10 AM
On Friday 07 August 2009 00:25:43 Russell King - ARM Linux wrote:
>
> As far as userspace DMA coherency, the only way you could do it with
> current kernel APIs is by using get_user_pages(), creating a scatterlist
> from those, and then passing it to dma_map_sg(). While the device has
> ownership of the SG, userspace must _not_ touch the buffer until after
> DMA has completed.

If the buffers are going to be reused again and again, would it be possible to
mark the pages returned by get_user_pages() as non-cacheable instead ?

Regards,

Laurent Pinchart

Laurent Pinchart

Aug 7, 2009, 4:00:16 AM

Sorry about this, but I'm not sure I understand the speculative prefetching
cache issue completely.

My understanding is that, even if userspace doesn't touch the DMA buffer while
DMA is in progress, it could still read from locations close to the buffer,
resulting in a speculative prefetch of data in the buffer. Those data would
then end up in the D-cache, and would not be coherent with what the device
transfers.

If that's correct, how do we avoid the problem in the general case of DMA to
kernel-allocated buffers ?

Regards,

Laurent Pinchart

Russell King - ARM Linux

Aug 7, 2009, 4:10:12 AM
On Thu, Aug 06, 2009 at 10:59:26PM -0700, David Xiao wrote:
> The V7 speculative prefetching will then probably apply to DMA coherency
> issue in general, both kernel and user space DMAs. Could this be
> addressed by inside the dma_unmap_sg/single() calling dma_cache_maint()
> when the direction is DMA_FROM_DEVICE/DMA_BIDIRECTIONAL, to basically
> invalidate the related cache lines in case any filled by prefetching?
> Assuming dma_unmap_sg/single() is called after each DMA operation is
> completed.

It's something that I was going to look at, and it's probably going to
have to be something I do blind - I currently have no MPCore platform,
and even if my Realview EB worked, it doesn't use DMA at all.

However, it's not trivial - the unmap functions don't have all the
necessary information. dma_unmap_single() has the DMA address, which
we can convert to the original virtual address via dma_to_virt().
However, dma_unmap_page() can't translate back to a virtual page
since we're missing some information there.

It bugs me that the DMA API is restrictive in the information which
architectures can retain across a mapping which makes this non-trivial.
Had I known of these issues when the DMA API was originally being
discussed, I'd have suggested that we have an arch-specific dma_map
struct which could contain whatever information was required, rather
than requiring the driver to maintain the handle/size/direction/etc
between each of the calls. That would mean we could retain the virtual
address/struct page rather than having to work it back in some way.

Russell King - ARM Linux

Aug 7, 2009, 4:20:09 AM
On Fri, Aug 07, 2009 at 09:58:30AM +0200, Laurent Pinchart wrote:
> Sorry about this, but I'm not sure I understand the speculative prefetching
> cache issue completely.

The general case with speculative prefetching is that if memory is
accessible, it can be prefetched.

In other words, if we mapped devices without NX (non-exec) set, the
CPU can prefetch instructions from devices, causing random read
accesses. Yes, I know it sounds crazy, but that's what I'm told
_can_ happen.

Matthieu CASTET

Aug 7, 2009, 4:20:13 AM
Laurent Pinchart wrote:

> On Thursday 06 August 2009 20:46:14 David Xiao wrote:
>
> Think about the simple following use case. An application wants to display
> video it acquires from the device to the screen using Xv. The video buffer is
> allocated by Xv. Using the v4l2 user pointer streaming method, the device can
> DMA directly to the Xv buffer. Using driver-allocated buffers, a memcpy() is
> required between the v4l2 buffer and the Xv buffer.
>
V4L2 has an API (overlay, IIRC) that allows drivers to write directly into
framebuffer memory.
BTW, the Xv buffer is not always in video memory, and the X driver may do a
memcpy.


Matthieu

Jamie Lokier

Aug 7, 2009, 6:00:16 AM
Russell King - ARM Linux wrote:
> On Fri, Aug 07, 2009 at 09:58:30AM +0200, Laurent Pinchart wrote:
> > Sorry about this, but I'm not sure I understand the speculative prefetching
> > cache issue completely.
>
> The general case with speculative prefetching is that if memory is
> accessible, it can be prefetched.
>
> In other words, if we mapped devices without NX (non-exec) set, the
> CPU can prefetch instructions from devices, causing random read
> accesses. Yes, I know it sounds crazy, but that's what I'm told
> _can_ happen.

1. Does the architecture not prevent speculative instruction
prefetches from crossing a page boundary? It would be handy under the
circumstances.

2. Is NX available on all the CPUs with speculative prefetching
behaviour? If it is, just use that for device mappings?

-- Jamie

Russell King - ARM Linux

Aug 7, 2009, 6:00:24 AM
On Fri, Aug 07, 2009 at 10:54:27AM +0100, Jamie Lokier wrote:
> Russell King - ARM Linux wrote:
> > On Fri, Aug 07, 2009 at 09:58:30AM +0200, Laurent Pinchart wrote:
> > > Sorry about this, but I'm not sure I understand the speculative prefetching
> > > cache issue completely.
> >
> > The general case with speculative prefetching is that if memory is
> > accessible, it can be prefetched.
> >
> > In other words, if we mapped devices without NX (non-exec) set, the
> > CPU can prefetch instructions from devices, causing random read
> > accesses. Yes, I know it sounds crazy, but that's what I'm told
> > _can_ happen.
>
> 1. Does the architecture not prevent speculative instruction
> prefetches from crossing a page boundary? It would be handy under the
> circumstances.
>
> 2. Is NX available on all the CPUs with speculative prefetching
> behaviour? If it is, just use that for device mappings?

I was using it as an example. Setting NX doesn't stop _data_ speculative
prefetching to _memory_ areas (as opposed to device areas.)

Getting things like the right memory attributes in place and ensuring
people don't abuse them is the first step towards getting this stuff
right. It's an ongoing project.

Laurent Pinchart

Aug 7, 2009, 6:20:11 AM
On Friday 07 August 2009 10:12:23 Matthieu CASTET wrote:
> Laurent Pinchart wrote:
> > On Thursday 06 August 2009 20:46:14 David Xiao wrote:
> >
> > Think about the simple following use case. An application wants to
> > display video it acquires from the device to the screen using Xv. The
> > video buffer is allocated by Xv. Using the v4l2 user pointer streaming
> > method, the device can DMA directly to the Xv buffer. Using
> > driver-allocated buffers, a memcpy() is required between the v4l2 buffer
> > and the Xv buffer.
>
> V4L2 has an API (overlay, IIRC) that allows drivers to write directly into
> framebuffer memory.

That's right, but I was mostly using this as an example.

> BTW Xv buffer is not always in video memory and the X driver can do a
> memcpy.

Still, one less memcpy is better :-)

Regards,

Laurent Pinchart

Jamie Lokier

Aug 7, 2009, 6:30:17 AM
David Xiao wrote:
> > However, that won't work with ARMv7's speculative prefetching. I'm
> > afraid with such things, DMA direct into userspace mappings becomes a
> > _lot_ harder, and let's face it, lots of Linux drivers just aren't going
> > to bother supporting this - we can't currently get agreement to have an
> > API to map DMA coherent pages into userspace!
>
> The V7 speculative prefetching will then probably apply to DMA coherency
> issue in general, both kernel and user space DMAs. Could this be
> addressed by inside the dma_unmap_sg/single() calling dma_cache_maint()
> when the direction is DMA_FROM_DEVICE/DMA_BIDIRECTIONAL, to basically
> invalidate the related cache lines in case any filled by prefetching?
> Assuming dma_unmap_sg/single() is called after each DMA operation is
> completed.

If it's possible, surely it's essential, because of O_DIRECT file and
block I/O?

-- Jamie

Laurent Desnogues

Aug 7, 2009, 8:10:06 AM
On Fri, Aug 7, 2009 at 11:54 AM, Jamie Lokier<ja...@shareable.org> wrote:
>
> 1. Does the architecture not prevent speculative instruction
> prefetches from crossing a page boundary? It would be handy under the
> circumstances.

There's no such restriction in the ARMv7 architecture.


Laurent

Robin Holt

Aug 7, 2009, 9:20:10 AM
On Fri, Aug 07, 2009 at 02:07:43PM +0200, Laurent Desnogues wrote:
> On Fri, Aug 7, 2009 at 11:54 AM, Jamie Lokier<ja...@shareable.org> wrote:
> >
> > 1. Does the architecture not prevent speculative instruction
> > prefetches from crossing a page boundary? It would be handy under the
> > circumstances.
>
> There's no such restriction in ARMv7 architecture.

Doesn't it prevent them for uncached areas? I _THOUGHT_ there was an
alloc_consistent (or something like that) call on ARM which gave you
an uncached mapping where you could do DMA. I also thought there was
a dma_* set of functions which remapped as uncached before DMA begins
and remapped as normal after DMA has been completed.

Sorry for the fuzzy recollection; I am dredging this up from the 2.6.21 timeframe.

Robin

Russell King - ARM Linux

Aug 7, 2009, 3:10:11 PM
On Fri, Aug 07, 2009 at 08:15:01AM -0500, Robin Holt wrote:
> On Fri, Aug 07, 2009 at 02:07:43PM +0200, Laurent Desnogues wrote:
> > On Fri, Aug 7, 2009 at 11:54 AM, Jamie Lokier<ja...@shareable.org> wrote:
> > >
> > > 1. Does the architecture not prevent speculative instruction
> > > prefetches from crossing a page boundary? It would be handy under the
> > > circumstances.
> >
> > There's no such restriction in ARMv7 architecture.
>
> Doesn't it prevent them for uncached areas?

"Uncached areas" is very, very fuzzy. Are you talking about a non-cacheable
memory mapping, or a strongly ordered mapping?

I'm afraid that we're going to have to require more precise use of language
to describe these things - woolly statements like "uncached areas" are now
just too ambiguous.

> I _THOUGHT_ there was an
> alloc_consistent (or something like that) call on ARM which gave you
> an uncached mapping where you could do DMA.

The dma_alloc_coherent() does _remap_ memory into a strongly ordered
mapping. However, the fully cached mapping remains, which means that
the CPU can still speculatively prefetch from that memory.

Since we map the fully cached mapping using section (or even supersection)
mappings for TLB efficiency, we can't change the memory type on a
per-page basis.

> I also thought there was a dma_* set of functions which remapped as
> uncached before DMA begins and remapped as normal after DMA has been
> completed.

You're talking about the deprecated DMA bounce code there. It's
basically the same problem since it uses the dma_alloc_coherent()
interface to gain a source of DMA-able memory.

Russell King - ARM Linux

Aug 7, 2009, 3:10:12 PM
On Fri, Aug 07, 2009 at 11:23:39AM +0100, Jamie Lokier wrote:
> David Xiao wrote:
> > > However, that won't work with ARMv7's speculative prefetching. I'm
> > > afraid with such things, DMA direct into userspace mappings becomes a
> > > _lot_ harder, and let's face it, lots of Linux drivers just aren't going
> > > to bother supporting this - we can't currently get agreement to have an
> > > API to map DMA coherent pages into userspace!
> >
> > The V7 speculative prefetching will then probably apply to DMA coherency
> > issue in general, both kernel and user space DMAs. Could this be
> > addressed by inside the dma_unmap_sg/single() calling dma_cache_maint()
> > when the direction is DMA_FROM_DEVICE/DMA_BIDIRECTIONAL, to basically
> > invalidate the related cache lines in case any filled by prefetching?
> > Assuming dma_unmap_sg/single() is called after each DMA operation is
> > completed.
>
> If it's possible, surely its essential because of O_DIRECT file and
> block I/O?

The problem is that you require a _VIRTUAL_ address. The unmap functions
do not have that information passed to them, so we need some way of
maintaining that or calculating it. I've covered that issue in my
postings this morning (please follow up there instead.)

Laurent Pinchart

Aug 7, 2009, 4:20:06 PM
On Friday 07 August 2009 21:01:45 Russell King - ARM Linux wrote:
> On Fri, Aug 07, 2009 at 08:15:01AM -0500, Robin Holt wrote:
> > On Fri, Aug 07, 2009 at 02:07:43PM +0200, Laurent Desnogues wrote:
> > > On Fri, Aug 7, 2009 at 11:54 AM, Jamie Lokier<ja...@shareable.org>
wrote:
> > > > 1. Does the architecture not prevent speculative instruction
> > > > prefetches from crossing a page boundary? It would be handy under
> > > > the circumstances.
> > >
> > > There's no such restriction in ARMv7 architecture.
> >
> > Doesn't it prevent them for uncached areas?
>
> "Uncached areas" is very very fuzzy. Are you talking about a non-cachable
> memory mapping, or a strongly ordered mapping.
>
> I'm afraid that we're going to have to require more precise use of language
> to describe these things - woolly statements like "uncached areas" are now
> just too ambiguous.

Ok. Maybe the kernel mapping from L_PTE_MT_UNCACHED to strongly ordered for
ARMv6 and up (not sure how it worked for previous versions) brought some
confusion. I'll try to be more precise now.

> > I _THOUGHT_ there was an alloc_consistent (or something like that) call on
> > ARM which gave you an uncached mapping where you could do DMA.
>
> The dma_alloc_coherent() does _remap_ memory into a strongly ordered
> mapping. However, the fully cached mapping remains, which means that
> the CPU can still speculatively prefetch from that memory.

Does that mean that, in theory, all DMA transfers in the DMA_FROM_DEVICE
direction are currently broken on ARMv7 ?

The ARM Architecture Reference Manual (ARM DDI 0100I) states that

"• If the same memory locations are marked as having different memory types
(Normal, Device, or Strongly Ordered), for example by the use of synonyms in a
virtual to physical address mapping, UNPREDICTABLE behavior results.

• If the same memory locations are marked as having different cacheable
attributes, for example by the use of synonyms in a virtual to physical
address mapping, UNPREDICTABLE behavior results."

dma_alloc_coherent() ends up calling __dma_alloc(), which allocates pages
using alloc_pages(), flushes the data cache for the allocated virtual range
and then simply remaps the pages using PTEs previously allocated from the
kernel MM.

This would be broken if a fully cached Normal mapping already existed for
those physical pages. You seem to imply that's the case, but I'm not sure I
understand why.

> Since we map the fully cached mapping using section (or even supersection)
> mappings for TLB efficiency, we can't change the memory type on a
> per-page basis.
>
> > I also thought there was a dma_* set of functions which remapped as
> > uncached before DMA begins and remapped as normal after DMA has been
> > completed.
>
> You're talking about the deprecated DMA bounce code there. It's
> basically the same problem since it uses the dma_alloc_coherent()
> interface to gain a source of DMA-able memory.

Regards,

Laurent Pinchart

Russell King - ARM Linux
Aug 7, 2009, 4:30:11 PM
On Fri, Aug 07, 2009 at 10:11:40PM +0200, Laurent Pinchart wrote:
> Ok. Maybe the kernel mapping from L_PTE_MT_UNCACHED to strongly ordered for
> ARMv6 and up (not sure about how it worked for previous versions) brought some
> confusion. I'll try to be more precise now.

It's something we should correct.

> Does that mean that, in theory, all DMA transfers in the DMA_FROM_DEVICE
> direction are currently broken on ARMv7 ?

Technically, yes. I haven't had a stream of bug reports, which tends
to suggest that either the speculation isn't that aggressive in current
silicon, or we're just lucky so far.

> The ARM Architecture Reference Manual (ARM DDI 0100I) states that

Bear in mind that DDI0100 is out of date now. There's a different
document number for it (I forget what it is.)

> "• If the same memory locations are marked as having different memory types
> (Normal, Device, or Strongly Ordered), for example by the use of synonyms in a
> virtual to physical address mapping, UNPREDICTABLE behavior results.
>
> • If the same memory locations are marked as having different cacheable
> attributes, for example by the use of synonyms in a virtual to physical
> address mapping, UNPREDICTABLE behavior results."

Both of these we end up doing. The current position is "yes, umm, we're
not sure what we can do about that"... which also happens to be mine as
well. Currently, my best solution is to go for minimal lowmem and
maximal highmem - so _everything_ gets mapped in on an as required
basis.

> This would be broken if a fully cached Normal mapping already existed for
> those physical pages. You seem to imply that's the case, but I'm not sure I
> understand why.

The kernel direct mapping maps all system (low) memory with normal
memory cacheable attributes.

So using vmalloc, dma_alloc_coherent, using pages in userspace all
create duplicate mappings of pages.

David Xiao
Aug 7, 2009, 6:30:07 PM
On Fri, 2009-08-07 at 13:28 -0700, Russell King - ARM Linux wrote:

> The kernel direct mapping maps all system (low) memory with normal
> memory cacheable attributes.
>
> So using vmalloc, dma_alloc_coherent, using pages in userspace all
> create duplicate mappings of pages.
>

If we do want to remove all these duplicate mappings, as part of a
solution to deal with speculative prefetching, probably one way is
to not map all the RAM into the direct-mapped space at paging_init()
time, and instead map them on-demand by different upper layer allocation
functions, such as vmalloc/dma_alloc_coherent/do_brk/kmalloc/
get_free_pages/etc. But then the distinction between upper layer
allocation functions and non-upper layer ones must be made clear though.

I know that mapping the RAM at paging_init() time can take advantage of
1M section mapping most of the time, and thus save many 1KB L2 page
tables. But a lot of memory still ends up being remapped with L2 page
tables later on, and meanwhile 1KB might not be as "precious" as it used
to be, either :-)

David

Laurent Pinchart
Aug 10, 2009, 9:50:09 AM
On Friday 07 August 2009 22:28:29 Russell King - ARM Linux wrote:
> On Fri, Aug 07, 2009 at 10:11:40PM +0200, Laurent Pinchart wrote:
> > Ok. Maybe the kernel mapping from L_PTE_MT_UNCACHED to strongly ordered
> > for ARMv6 and up (not sure about how it worked for previous versions)
> > brought some confusion. I'll try to be more precise now.
>
> It's something we should correct.

Do you mean we should map L_PTE_MT_UNCACHED to Normal, non-cacheable memory on
ARMv6 and up ? That looks like an easy change, but I'm scared of possible side
effects.

> > Does that mean that, in theory, all DMA transfers in the DMA_FROM_DEVICE
> > direction are currently broken on ARMv7 ?
>
> Technically, yes. I haven't had a stream of bug reports which tends to
> suggest that either the speculation isn't that aggressive in current
> silicon, or we're just lucky so far.

Current silicon probably avoids prefetching memory at random. The most
probable cause of problems would be a read in kernel virtual memory at a
location just before the buffer being written by DMA. This would result in a
few bytes being corrupted for no apparent reason. As the problem would be
quite difficult to reproduce, I don't expect many people to perform an
in-depth investigation and file a bug report.

> > The ARM Architecture Reference Manual (ARM DDI 0100I) states that
>
> Bear in mind that DDI0100 is out of date now. There's a different document
> number for it (I forget what it is.)

Are you talking about the ARM Cortex A8 TRM (ARM DDI 0344D) ? I've read that
one (and I should have done so earlier, it helped me understand that the
kernel properly maps Linux PTE flags to ARM PTE flags where I thought there
was a bug).

> > "• If the same memory locations are marked as having different memory
> > types (Normal, Device, or Strongly Ordered), for example by the use of
> > synonyms in a virtual to physical address mapping, UNPREDICTABLE behavior
> > results.
> >
> > • If the same memory locations are marked as having different cacheable
> > attributes, for example by the use of synonyms in a virtual to physical
> > address mapping, UNPREDICTABLE behavior results."
>
> Both of these we end up doing. The current position is "yes, umm, we're not
> sure what we can do about that"... which also happens to be mine as well.
> Currently, my best solution is to go for minimal lowmem and maximal highmem
> - so _everything_ gets mapped in on an as required basis.

I suppose the problem will be more common in future architectures, even on
other platforms. Do we have the proper infrastructure to do so without
seriously damaging performance ?

> > This would be broken if a fully cached Normal mapping already existed for
> > those physical pages. You seem to imply that's the case, but I'm not sure I
> > understand why.
>
> The kernel direct mapping maps all system (low) memory with normal
> memory cacheable attributes.
>
> So using vmalloc, dma_alloc_coherent, using pages in userspace all
> create duplicate mappings of pages.

Right.

I'm experimenting with several solutions to the initial problem (handling DMA
and cache). Of course they all theoretically break because of the aliasing
introduced by the kernel low memory mapping combined with speculative
prefetching, but as that problem is global it won't affect the performance of
one solution relative to the others.

1. Flushing the whole cache before giving ownership of the buffer to the
device works, but is quite costly.

2. Flushing only part of the cache might work, but I'm getting unhandled
kernel paging requests. I'm investigating that.

3. Marking the userspace mapping as non-cacheable might bring a performance
improvement, so I'd like to try that.

I'd like some help with marking the mapping as non-cacheable. As pages can be
unmapped from userspace virtual memory even though get_user_pages() prevents
them from being freed, I need to either:

a. Make sure the mapping will be non-cacheable when brought back in userspace
virtual memory after a page fault. This requires marking the whole underlying
VMA as non-cacheable (vma->vm_page_prot), possibly making much more than the
video buffers uncacheable.

My plan is to retrieve a pointer to the VMA underlying the buffer, then walk
the VMA virtual addresses range to mark all associated PTEs as uncacheable. If
a PTE is not present for some reason I won't need to care, as it will be
faulted in correctly using the VMA vm_page_prot the next time it is accessed.
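Roughly, the walk I have in mind (pseudocode only; pte_mkuncached() is a
hypothetical helper that would rewrite the L_PTE_MT_* bits, and locking is
elided):

```
/* sketch: strip cacheability from every present PTE in the VMA */
for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
        pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
        pmd_t *pmd = pmd_offset(pgd, addr);
        pte_t *pte;

        if (pmd_none(*pmd))
                continue;       /* will fault in with vm_page_prot later */

        pte = pte_offset_map(pmd, addr);
        if (pte_present(*pte))
                set_pte_ext(pte, pte_mkuncached(*pte), 0); /* hypothetical */
        pte_unmap(pte);
}
flush_tlb_range(vma, vma->vm_start, vma->vm_end);
```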

I'm not sure how to handle young PTEs though. On at least ARMv7 a non-young
Linux PTE seems to result in an invalid ARM PTE (0x0000000). What exactly is
that for ? How should I care ?

b. Prevent the pages from being unmapped from the userspace virtual mapping,
in which case the whole VMA won't need to be marked as uncached (unless this
breaks coherency somewhere else).

I've read/heard that this can be done by using mlock() from userspace, but I
need a kernel-side solution. mlock() marks the VMA as VM_LOCKED among other
things. Would that be enough to prevent pages from being unmapped from
userspace virtual memory ?

Regards,

--
Laurent Pinchart

Catalin Marinas
Aug 11, 2009, 8:00:23 AM
On Thu, 2009-08-06 at 22:59 -0700, David Xiao wrote:
> The V7 speculative prefetching will then probably apply to DMA coherency
> issue in general, both kernel and user space DMAs. Could this be
> addressed by inside the dma_unmap_sg/single() calling dma_cache_maint()
> when the direction is DMA_FROM_DEVICE/DMA_BIDIRECTIONAL, to basically
> invalidate the related cache lines in case any filled by prefetching?
> Assuming dma_unmap_sg/single() is called after each DMA operation is
> completed.

Theoretically, with speculative prefetching on ARMv7 and the FROM_DEVICE
case we need to invalidate the corresponding D-cache lines both before
and after the DMA transfer, i.e. in both dma_map_sg and dma_unmap_sg,
otherwise there is a risk of stale data in the cache.

--
Catalin

David Xiao
Aug 11, 2009, 2:30:14 PM
On Tue, 2009-08-11 at 02:31 -0700, Catalin Marinas wrote:
> On Thu, 2009-08-06 at 22:59 -0700, David Xiao wrote:
> > The V7 speculative prefetching will then probably apply to DMA coherency
> > issue in general, both kernel and user space DMAs. Could this be
> > addressed by inside the dma_unmap_sg/single() calling dma_cache_maint()
> > when the direction is DMA_FROM_DEVICE/DMA_BIDIRECTIONAL, to basically
> > invalidate the related cache lines in case any filled by prefetching?
> > Assuming dma_unmap_sg/single() is called after each DMA operation is
> > completed.
>
> Theoretically, with speculative prefetching on ARMv7 and the FROM_DEVICE
> case we need to invalidate the corresponding D-cache lines both before
> and after the DMA transfer, i.e. in both dma_map_sg and dma_unmap_sg,
> otherwise there is a risk of stale data in the cache.
>
The dma_map_sg() code is already calling dma_cache_maint() to invalidate
the cache lines in the DMA_FROM_DEVICE/DMA_BIDIRECTIONAL direction
cases. The suggestion was to do something similar in the dma_unmap_sg() case
to deal with speculative prefetching on ARMv7; Russell has other postings
discussing the details and feasibility of this.

Furthermore, duplicate MMU mappings in the kernel bring more twists to
this problem as explained in this email chain as well, especially in the
case of DMA-coherent memory (dma_alloc_coherent()).

David

Steven Walter
Aug 25, 2009, 9:00:15 AM
On Thu, Aug 6, 2009 at 6:25 PM, Russell King - ARM
Linux<li...@arm.linux.org.uk> wrote:
[...]

> As far as userspace DMA coherency, the only way you could do it with
> current kernel APIs is by using get_user_pages(), creating a scatterlist
> from those, and then passing it to dma_map_sg(). While the device has
> ownership of the SG, userspace must _not_ touch the buffer until after
> DMA has completed.
[...]

Would that work on a processor with VIVT caches? It seems not. In
particular, dma_map_page uses page_address to get a virtual address to
pass to map_single(). map_single() in turn uses this address to
perform cache maintenance. Since page_address() returns the kernel
virtual address, I don't see how any cache-lines for the userspace
virtual address would get invalidated (for the DMA_FROM_DEVICE case).

If that's true, then what is the correct way to allow DMA to/from a
userspace buffer with a VIVT cache? If not true, what am I missing?

Thanks
--
-Steven Walter <steven...@gmail.com>

David Xiao
Aug 25, 2009, 6:10:09 PM
On Tue, 2009-08-25 at 05:53 -0700, Steven Walter wrote:
> On Thu, Aug 6, 2009 at 6:25 PM, Russell King - ARM
> Linux<li...@arm.linux.org.uk> wrote:
> [...]
> > As far as userspace DMA coherency, the only way you could do it with
> > current kernel APIs is by using get_user_pages(), creating a scatterlist
> > from those, and then passing it to dma_map_sg(). While the device has
> > ownership of the SG, userspace must _not_ touch the buffer until after
> > DMA has completed.
> [...]
>
> Would that work on a processor with VIVT caches? It seems not. In
> particular, dma_map_page uses page_address to get a virtual address to
> pass to map_single(). map_single() in turn uses this address to
> perform cache maintenance. Since page_address() returns the kernel
> virtual address, I don't see how any cache-lines for the userspace
> virtual address would get invalidated (for the DMA_FROM_DEVICE case).
>
> If that's true, then what is the correct way to allow DMA to/from a
> userspace buffer with a VIVT cache? If not true, what am I missing?

page_address() is basically returning page->virtual, which records the
virtual/physical mapping for both user/kernel space; and what only
matters there is highmem or not.

David

Laurent Pinchart
Aug 25, 2009, 7:20:05 PM
On Wednesday 26 August 2009 00:02:48 David Xiao wrote:
> On Tue, 2009-08-25 at 05:53 -0700, Steven Walter wrote:
> > On Thu, Aug 6, 2009 at 6:25 PM, Russell King - ARM
> > Linux<li...@arm.linux.org.uk> wrote:
> > [...]
> >
> > > As far as userspace DMA coherency, the only way you could do it with
> > > current kernel APIs is by using get_user_pages(), creating a
> > > scatterlist from those, and then passing it to dma_map_sg(). While the
> > > device has ownership of the SG, userspace must _not_ touch the buffer
> > > until after DMA has completed.
> >
> > [...]
> >
> > Would that work on a processor with VIVT caches? It seems not. In
> > particular, dma_map_page uses page_address to get a virtual address to
> > pass to map_single(). map_single() in turn uses this address to
> > perform cache maintenance. Since page_address() returns the kernel
> > virtual address, I don't see how any cache-lines for the userspace
> > virtual address would get invalidated (for the DMA_FROM_DEVICE case).
> >
> > If that's true, then what is the correct way to allow DMA to/from a
> > userspace buffer with a VIVT cache? If not true, what am I missing?
>
> page_address() is basically returning page->virtual, which records the
> virtual/physical mapping for both user/kernel space; and what only
> matters there is highmem or not.

I'm not sure I get it. Are you implying that a physical page will then be
mapped to the same address in all contexts (kernelspace and userspace
processes) ? Is that even possible ? And if not, how could page->virtual store
both the initial kernel map and all the userspace mappings ?

--
Laurent Pinchart

David Xiao
Aug 26, 2009, 1:30:17 PM
Sorry for the confusion, page_address() indeed only returns the kernel
virtual address; and in order to support VIVT cache maintenance for
user space mappings, the dma_map_sg/dma_map_page() functions or even
struct scatterlist do seem to have to be modified to pass in the virtual
address, I think.

David

Russell King - ARM Linux
Sep 1, 2009, 9:30:07 AM
On Tue, Aug 25, 2009 at 08:53:29AM -0400, Steven Walter wrote:
> On Thu, Aug 6, 2009 at 6:25 PM, Russell King - ARM
> Linux<li...@arm.linux.org.uk> wrote:
> [...]
> > As far as userspace DMA coherency, the only way you could do it with
> > current kernel APIs is by using get_user_pages(), creating a scatterlist
> > from those, and then passing it to dma_map_sg(). While the device has
> > ownership of the SG, userspace must _not_ touch the buffer until after
> > DMA has completed.
> [...]
>
> Would that work on a processor with VIVT caches? It seems not. In
> particular, dma_map_page uses page_address to get a virtual address to
> pass to map_single(). map_single() in turn uses this address to
> perform cache maintenance. Since page_address() returns the kernel
> virtual address, I don't see how any cache-lines for the userspace
> virtual address would get invalidated (for the DMA_FROM_DEVICE case).

You are correct.

> If that's true, then what is the correct way to allow DMA to/from a
> userspace buffer with a VIVT cache? If not true, what am I missing?

I don't think you read what I said (but I've also forgotten what I did
say).

To put it simply, the kernel does not support DMA direct from userspace
pages. Solutions which have been proposed in the past only work with a
sub-set of conditions (such as the one above only works with VIPT
caches.)

Russell King - ARM Linux
Sep 1, 2009, 9:40:06 AM
On Wed, Aug 26, 2009 at 10:22:11AM -0700, David Xiao wrote:
> Sorry for the confusion, page_address() indeed only returns kernel
> virtual address; and in order to support VIVT cache maintenance for the
> user space mappings, the dma_map_sg/dma_map_page() functions or even the
> struct scatterlist do seem to have to be modified to pass in virtual
> address, I think.

That's the wrong answer. When DMA happens (and therefore these functions
are called) the userspace context could already have been switched away,
which means that any userspace address information is useless.

Adding support to the existing DMA API functions so they can be used for
userspace mapped pages is simply the wrong approach - most users of those
functions are not concerned with userspace mapped pages at all, and adding
that burden onto all those users is clearly sub-optimal.

The right answer? I don't think there is one (see my previous mail.)

Laurent Pinchart
Sep 1, 2009, 9:50:05 AM

I might be missing something obvious, but I fail to see how VIVT caches could
work at all with multiple mappings. If a kernel-allocated buffer is DMA'ed to,
we certainly want to invalidate all cache lines that store buffer data. As the
cache doesn't care about physical addresses we thus need to invalidate all
virtual mappings for the buffer. If the buffer is mmap'ed in userspace I don't
see how that would be done.

--
Laurent Pinchart

Russell King - ARM Linux
Sep 1, 2009, 10:20:06 AM
On Tue, Sep 01, 2009 at 03:43:48PM +0200, Laurent Pinchart wrote:
> I might be missing something obvious, but I fail to see how VIVT caches
> could work at all with multiple mappings. If a kernel-allocated buffer
> is DMA'ed to, we certainly want to invalidate all cache lines that store
> buffer data. As the cache doesn't care about physical addresses we thus
> need to invalidate all virtual mappings for the buffer. If the buffer is
> mmap'ed in userspace I don't see how that would be done.

You need to ask MM gurus about that. I don't touch the Linux MM very
often so tend to keep forgetting how it works. However, it does work
for shared mappings of files on CPUs with VIVT caches.

Hugh Dickins
Sep 1, 2009, 1:00:42 PM
On Tue, 1 Sep 2009, Russell King - ARM Linux wrote:
> On Tue, Sep 01, 2009 at 03:43:48PM +0200, Laurent Pinchart wrote:
> > I might be missing something obvious, but I fail to see how VIVT caches
> > could work at all with multiple mappings. If a kernel-allocated buffer
> > is DMA'ed to, we certainly want to invalidate all cache lines that store
> > buffer data. As the cache doesn't care about physical addresses we thus
> > need to invalidate all virtual mappings for the buffer. If the buffer is
> > mmap'ed in userspace I don't see how that would be done.
>
> You need to ask MM gurus about that. I don't touch the Linux MM very
> often so tend to keep forgetting how it works. However, it does work
> for shared mappings of files on CPUs with VIVT caches.

I believe arch/arm/mm/flush.c __flush_dcache_aliases() is what does it.

Hugh

David Xiao
Sep 1, 2009, 2:10:07 PM
On Tue, 2009-09-01 at 06:31 -0700, Russell King - ARM Linux wrote:
> On Wed, Aug 26, 2009 at 10:22:11AM -0700, David Xiao wrote:
> > Sorry for the confusion, page_address() indeed only returns kernel
> > virtual address; and in order to support VIVT cache maintenance for the
> > user space mappings, the dma_map_sg/dma_map_page() functions or even the
> > struct scatterlist do seem to have to be modified to pass in virtual
> > address, I think.
>
> That's the wrong answer. When DMA happens (and therefore these functions
> are called) the userspace context could already have been switched away,
> which means that any userspace address information is useless.
>
The dma_map_sg/page() needs to be set up before starting DMA operations.

If the context switch happens before/when DMA occurs, that is okay since
in the case of VIVT cache all the necessary cache lines will be
invalidated/flushed anyway with every context switch.

My understanding is that there are basically two issues associated with
VIVT cache in an OS environment:
1. Address space changes. When a context switch happens, if the new
address space overlaps with the old one, as in ARM Linux, all
the related cache lines have to be invalidated/flushed, unless something
like an ASID is used together with the VIVT cache.

2. Cache-line aliasing in the same address space.
In the user space DMA case, we are assuming that these physical pages
are only mapped twice, once in user space and once in kernel
direct-mapping.
I went through the kernel code path and think the kernel direct-mapping
was already flushed/invalidated before the pages were handed over to the
user space; therefore, the proposal is to record the user space virtual
address and do the proper cache maintenance operations.



> Adding support to the existing DMA API functions so they can be used for
> userspace mapped pages is simply the wrong approach - most users of those
> functions are not concerned with userspace mapped pages at all, and adding
> that burden onto all those users is clearly sub-optimal.
>

The kernel already addresses the mmap() file case by means of the
mapping field in struct page, etc.; and I personally do not
think it is too much of a change for the user space DMA case, if we
agree the application/request is valid, of course.

David

Imre Deak
Sep 2, 2009, 11:20:06 AM

To my understanding buffers returned by dma_alloc_*, kmalloc, vmalloc
are ok:

The cache lines for direct mapping are flushed in dma_alloc_* and
vmalloc. After this you are not supposed to access the buffers
through the direct mapping until you're done with the DMA.

For kmalloc you use the direct mapping in the first place, so the
flush in dma_map_* will be enough.

For user mappings I think you'd have to do an additional flush for
the direct mapping, while the user mapping is flushed in dma_map_*.

--Imre

Imre Deak
Sep 3, 2009, 3:40:06 AM
On Wed, Sep 02, 2009 at 05:10:44PM +0200, Deak Imre (Nokia-D/Helsinki) wrote:
> On Tue, Sep 01, 2009 at 03:43:48PM +0200, ext Laurent Pinchart wrote:
> > [...]

> > I might be missing something obvious, but I fail to see how VIVT caches could
> > work at all with multiple mappings. If a kernel-allocated buffer is DMA'ed to,
> > we certainly want to invalidate all cache lines that store buffer data. As the
> > cache doesn't care about physical addresses we thus need to invalidate all
> > virtual mappings for the buffer. If the buffer is mmap'ed in userspace I don't
> > see how that would be done.
>
> To my understanding buffers returned by dma_alloc_*, kmalloc, vmalloc
> are ok:
>
> The cache lines for direct mapping are flushed in dma_alloc_* and
> vmalloc. After this you are not supposed to access the buffers
> through the direct mapping until you're done with the DMA.
>
> For kmalloc you use the direct mapping in the first place, so the
> flush in dma_map_* will be enough.
>
> For user mappings I think you'd have to do an additional flush for
> the direct mapping, while the user mapping is flushed in dma_map_*.

Based on the discussion so far, this is my understanding of how
zero-copy DMA is possible on ARM. Could you please confirm / correct
these? :

- user space passes an arbitrary buffer:
- get_user_pages(user address range)
- DMA(user address range)
- user space reads from the buffer

Problems:
- not supported according to Russell
- unhandled faults for cache ops on not-present PTEs, but patch
from Laurent fixes this
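Spelled out against the 2.6.31-era APIs, the first flow corresponds roughly to
this kernel-side sequence (a sketch only — error handling and the
driver-specific DMA programming are elided, and as noted this path is not
officially supported):

```
/* pin the user pages: stops them being freed, not unmapped */
down_read(&current->mm->mmap_sem);
n = get_user_pages(current, current->mm, uaddr & PAGE_MASK,
                   nr_pages, 1 /* write */, 0, pages, NULL);
up_read(&current->mm->mmap_sem);

/* build a scatterlist over the pinned pages */
sg_init_table(sg, n);
for (i = 0; i < n; i++)
        sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);

/* hand the buffer to the device: performs the cache maintenance
 * on the kernel direct mapping for DMA_FROM_DEVICE */
nents = dma_map_sg(dev, sg, n, DMA_FROM_DEVICE);

/* ... program the hardware and wait for completion ... */

dma_unmap_sg(dev, sg, nents, DMA_FROM_DEVICE);
for (i = 0; i < n; i++)
        page_cache_release(pages[i]);
```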

- mmap a kernel buffer to user space with cacheable mapping:
- user space writes to the buffer
- flush cache(user address range)
- DMA(kernel buffer)
- user space reads from the buffer

The additional flush cache is needed for VIVT/aliasing VIPT.
Instead of the flush cache:
- the mapping can be done with writethrough, non-writeallocate or
non-cacheable mapping, or
- for aliasing VIPT a non-aliasing user address is picked

DMA(address range) is:
- dma_map_*(address range)
- perform DMA to/from address range
- dma_unmap_*(address range)

Thanks,

Russell King - ARM Linux
Sep 3, 2009, 4:40:09 AM
On Wed, Sep 02, 2009 at 06:10:44PM +0300, Imre Deak wrote:
> To my understanding buffers returned by dma_alloc_*, kmalloc, vmalloc
> are ok:

For dma_map_*, the only pages/addresses which are valid to pass are
those returned by get_free_pages() or kmalloc. Everything else is
not permitted.

Use of vmalloc'd and dma_alloc_* pages with the dma_map_* APIs is invalid
use of the DMA API. See the notes in the DMA-mapping.txt document
against "dma_map_single".

> For user mappings I think you'd have to do an additional flush for
> the direct mapping, while the user mapping is flushed in dma_map_*.

I will not accept a patch which adds flushing of anything other than
the kernel direct mapping in the dma_map_* functions, so please find
a different approach.

Steven Walter
Sep 8, 2009, 9:10:05 AM
On Thu, Sep 3, 2009 at 4:36 AM, Russell King - ARM
Linux<li...@arm.linux.org.uk> wrote:
> On Wed, Sep 02, 2009 at 06:10:44PM +0300, Imre Deak wrote:
>> To my understanding buffers returned by dma_alloc_*, kmalloc, vmalloc
>> are ok:
>
> For dma_map_*, the only pages/addresses which are valid to pass are
> those returned by get_free_pages() or kmalloc. Everything else is
> not permitted.
>
> Use of vmalloc'd and dma_alloc_* pages with the dma_map_* APIs is invalid
> use of the DMA API. See the notes in the DMA-mapping.txt document
> against "dma_map_single".

Actually, DMA-mapping.txt seems to explicitly say that it's allowed to
use pages allocated by vmalloc:

"It is possible to DMA to the _underlying_ memory mapped into a
vmalloc() area, but this requires walking page tables to get the
physical addresses, and then translating each of those pages back to a
kernel address using something like __va()."

>> For user mappings I think you'd have to do an additional flush for
>> the direct mapping, while the user mapping is flushed in dma_map_*.
>
> I will not accept a patch which adds flushing of anything other than
> the kernel direct mapping in the dma_map_* functions, so please find
> a different approach.

What's the concern here? Just the performance overhead of the checks
and additional flushes? It seems much more desirable for the
dma_map_* API to take care of potential cache aliases than to require
every driver to manage it for itself. After all, part of the purpose
of the DMA API is to manage the cache maintenance around DMAs in an
architecture-independent way.
--
-Steven Walter <steven...@gmail.com>
