
[RFC PATCH 0/3] Aggressively allocate the pages on cma reserved memory


Joonsoo Kim
May 7, 2014, 8:40:01 PM

Hello,

This series tries to improve CMA.

CMA was introduced to provide physically contiguous pages at runtime
without permanently reserving a memory area. However, the current
implementation behaves like a plain memory reservation, because allocation
from the CMA reserved region happens only as a fallback for MIGRATE_MOVABLE
allocations, i.e. only when there is no other movable page left. In that
situation kswapd is easily woken up, since unmovable and reclaimable
allocations treat (free pages - free CMA pages) as the system's free memory,
which may already be below the high watermark. Once kswapd starts reclaiming
memory, the fallback allocation rarely happens.

In my experiment, I found that on a system with 1024 MB of memory and
512 MB reserved for CMA, kswapd is mostly invoked around the 512 MB
free-memory boundary. The invoked kswapd then reclaims until
(free pages - free CMA pages) is above the high watermark, so the free
memory reported in meminfo consistently hovers around the 512 MB boundary.
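
To make the numbers concrete, the check that wakes kswapd behaves roughly
like this hypothetical helper (a simplified sketch of the logic in
__zone_watermark_ok(), not the kernel function itself; see the hunk touched
by patch 3 for the real code):

/* simplified sketch, not the exact kernel code */
static bool watermark_ok_for(struct zone *z, long min, long lowmem_reserve,
			     bool alloc_can_use_cma)
{
	long free = zone_page_state(z, NR_FREE_PAGES);         /* ~512 MB, mostly CMA */

	if (!alloc_can_use_cma)                                 /* unmovable/reclaimable */
		free -= zone_page_state(z, NR_FREE_CMA_PAGES);  /* subtract ~512 MB */

	return free > min + lowmem_reserve;	/* fails -> kswapd is woken even though
						 * MemFree still shows ~512 MB */
}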

To fix this problem, we should allocate pages from the CMA reserved memory
more aggressively and intelligently. Patch 2 implements the solution.
Patch 1 is a simple optimization that removes a useless retry, and patch 3
removes a now-useless alloc flag, so those two are less important.
See patch 2 for a more detailed description.

This patchset is based on v3.15-rc4.

Thanks.
Joonsoo Kim (3):
CMA: remove redundant retrying code in __alloc_contig_migrate_range
CMA: aggressively allocate the pages on cma reserved memory when not
used
CMA: always treat free cma pages as non-free on watermark checking

include/linux/mmzone.h | 6 +++
mm/compaction.c | 4 --
mm/internal.h | 3 +-
mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++---------
4 files changed, 102 insertions(+), 28 deletions(-)

--
1.7.9.5


Joonsoo Kim
May 7, 2014, 8:40:02 PM

CMA was introduced to provide physically contiguous pages at runtime.
For this purpose, it reserves memory at boot time. Although the memory is
reserved, it can still be used to satisfy movable memory allocation
requests. This is beneficial for systems that need the CMA reserved memory
only infrequently, and it is one of the main purposes of introducing CMA.

However, there is a problem in the current implementation: it behaves like
a plain memory reservation. The pages in the CMA reserved region are hardly
ever used for movable memory allocation. This is caused by the combination
of the allocation and reclaim policies.

The pages in the CMA reserved region are allocated only when there is no
other movable memory left, that is, as a fallback. So by the time this
fallback allocation starts, the system is already under heavy memory
pressure. Even then, movable allocations succeed easily, since there are
many free pages in the CMA region. But that is not the case for unmovable
and reclaimable allocations, because they cannot use CMA pages: on watermark
checking they regard the system's free memory as
(free pages - free CMA pages), that is, free unmovable pages +
free reclaimable pages + free movable pages. Because movable pages are
already exhausted, the only free pages left are of unmovable and
reclaimable types, which is a really small amount, so the watermark check
fails. This wakes kswapd to make enough free memory for unmovable and
reclaimable allocations, and kswapd does so. So before we fully utilize the
pages in the CMA reserved region, kswapd starts reclaiming and tries to push
free memory above the high watermark. The watermark check used by kswapd
does not count free CMA pages either, so many movable pages get reclaimed.
After that we have plenty of movable pages again, so the fallback allocation
does not happen again. To conclude, the amount of free memory in meminfo,
which includes free CMA pages, hovers around 512 MB when I reserve 512 MB
for CMA.

I found this problem with the following experiment.

4 CPUs, 1024 MB, VIRTUAL MACHINE
make -j24

CMA reserve:            0 MB            512 MB
Elapsed-time:           234.8           361.8
Average-MemFree:        283880 KB       530851 KB

To solve this problem, I can think of the following two possible solutions:
1. allocate pages from the CMA reserved memory first, and only when they
are exhausted, allocate movable pages.
2. interleaved allocation: allocate a specific amount of memory from the
CMA reserved region, then the same way from free movable memory, and repeat.

I tested approach #1 and found a problem. Although free memory in meminfo
can stay near the low watermark, it fluctuates heavily, because too many
pages are reclaimed whenever kswapd is invoked. The reason is that
successively allocated CMA pages sit on the LRU list in that order and
kswapd reclaims them in the same order. Reclaiming them does not help
kswapd's watermark check, which excludes free CMA pages, so too many pages
end up being reclaimed, I guess.

So I implemented approach #2.
One thing to note is that we should not switch the allocation target
(movable freelist or CMA) on every allocation attempt, since that would
prevent allocated pages from being physically contiguous and could hurt the
performance of some I/O devices. To avoid this, I keep the same allocation
target for at least pageblock_nr_pages attempts, and make the two counts
reflect the ratio of free non-CMA pages to free CMA pages. With this
approach the system works very smoothly and fully utilizes the pages in the
CMA reserved region.
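
As a rough example of how the two counters get recharged (made-up numbers,
using the formula in the patch below): suppose a zone has 384 MB of free
non-CMA memory above the high watermark and 128 MB of free CMA memory. Then:

free_wmark     = free - free_cma - high_wmark = 384 MB worth of pages
free_cma       = 128 MB worth of pages
nr_try_movable = free_wmark * pageblock_nr_pages / free_cma
               = 3 * pageblock_nr_pages
nr_try_cma     = pageblock_nr_pages

so the allocator hands out about three pageblocks worth of movable pages for
every pageblock of CMA pages, and both pools drain at roughly the same rate.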

Following is the experimental result of this patch.

4 CPUs, 1024 MB, VIRTUAL MACHINE
make -j24

<Before>
CMA reserve:            0 MB            512 MB
Elapsed-time:           234.8           361.8
Average-MemFree:        283880 KB       530851 KB
pswpin:                 7               110064
pswpout:                452             767502

<After>
CMA reserve:            0 MB            512 MB
Elapsed-time:           234.2           235.6
Average-MemFree:        281651 KB       290227 KB
pswpin:                 8               8
pswpout:                430             510

There is no difference when no CMA memory is reserved (the 0 MB case).
But with CMA reserved memory (the 512 MB case), this patch lets us fully
utilize the reserved memory and the system behaves as if nothing were
reserved at all.

With this patch, we allocate pages from the CMA reserved memory
aggressively, so CMA allocation latency can increase. Below is the
experimental result for latency.

4 CPUs, 1024 MB, VIRTUAL MACHINE
CMA reserve: 512 MB
Backgound Workload: make -jN
Real Workload: 8 MB CMA allocation/free 20 times with 5 sec interval

N:                      1          4          8          16
Elapsed-time(Before):   4309.75    9511.09    12276.1    77103.5
Elapsed-time(After):    5391.69    16114.1    19380.3    34879.2

So in general we see a latency increase, and the ratio is rather large,
up to 70%. But under the heaviest workload it shows a latency decrease of
up to 55%. This test may be a worst-case scenario, yet reducing latency
matters on some systems, so I can only say that this patch has both
advantages and disadvantages in terms of latency.

Signed-off-by: Joonsoo Kim <iamjoon...@lge.com>

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fac5509..3ff24d4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -389,6 +389,12 @@ struct zone {
int compact_order_failed;
#endif

+#ifdef CONFIG_CMA
+ int has_cma;
+ int nr_try_cma;
+ int nr_try_movable;
+#endif
+
ZONE_PADDING(_pad1_)

/* Fields commonly accessed by the page reclaim scanner */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 674ade7..6f2b27b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -788,6 +788,16 @@ void __init __free_pages_bootmem(struct page *page, unsigned int order)
}

#ifdef CONFIG_CMA
+void __init init_alloc_ratio_counter(struct zone *zone)
+{
+ if (zone->has_cma)
+ return;
+
+ zone->has_cma = 1;
+ zone->nr_try_movable = 0;
+ zone->nr_try_cma = 0;
+}
+
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
{
@@ -803,6 +813,7 @@ void __init init_cma_reserved_pageblock(struct page *page)
set_pageblock_migratetype(page, MIGRATE_CMA);
__free_pages(page, pageblock_order);
adjust_managed_page_count(page, pageblock_nr_pages);
+ init_alloc_ratio_counter(page_zone(page));
}
#endif

@@ -1136,6 +1147,69 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
return NULL;
}

+#ifdef CONFIG_CMA
+static struct page *__rmqueue_cma(struct zone *zone, unsigned int order,
+ int migratetype)
+{
+ long free, free_cma, free_wmark;
+ struct page *page;
+
+ if (migratetype != MIGRATE_MOVABLE || !zone->has_cma)
+ return NULL;
+
+ if (zone->nr_try_movable)
+ goto alloc_movable;
+
+alloc_cma:
+ if (zone->nr_try_cma) {
+ /* Okay. Now, we can try to allocate the page from cma region */
+ zone->nr_try_cma--;
+ page = __rmqueue_smallest(zone, order, MIGRATE_CMA);
+
+ /* CMA pages can vanish through CMA allocation */
+ if (unlikely(!page && order == 0))
+ zone->nr_try_cma = 0;
+
+ return page;
+ }
+
+ /* Reset ratio counter */
+ free_cma = zone_page_state(zone, NR_FREE_CMA_PAGES);
+
+ /* No cma free pages, so recharge only movable allocation */
+ if (free_cma <= 0) {
+ zone->nr_try_movable = pageblock_nr_pages;
+ goto alloc_movable;
+ }
+
+ free = zone_page_state(zone, NR_FREE_PAGES);
+ free_wmark = free - free_cma - high_wmark_pages(zone);
+
+ /*
+ * free_wmark is below than 0, and it means that normal pages
+ * are under the pressure, so we recharge only cma allocation.
+ */
+ if (free_wmark <= 0) {
+ zone->nr_try_cma = pageblock_nr_pages;
+ goto alloc_cma;
+ }
+
+ if (free_wmark > free_cma) {
+ zone->nr_try_movable =
+ (free_wmark * pageblock_nr_pages) / free_cma;
+ zone->nr_try_cma = pageblock_nr_pages;
+ } else {
+ zone->nr_try_movable = pageblock_nr_pages;
+ zone->nr_try_cma = free_cma * pageblock_nr_pages / free_wmark;
+ }
+
+ /* Reset complete, start on movable first */
+alloc_movable:
+ zone->nr_try_movable--;
+ return NULL;
+}
+#endif
+
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
@@ -1143,10 +1217,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
static struct page *__rmqueue(struct zone *zone, unsigned int order,
int migratetype)
{
- struct page *page;
+ struct page *page = NULL;
+
+ if (IS_ENABLED(CONFIG_CMA))
+ page = __rmqueue_cma(zone, order, migratetype);

retry_reserve:
- page = __rmqueue_smallest(zone, order, migratetype);
+ if (!page)
+ page = __rmqueue_smallest(zone, order, migratetype);

if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
page = __rmqueue_fallback(zone, order, migratetype);
@@ -4849,6 +4927,8 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone_seqlock_init(zone);
zone->zone_pgdat = pgdat;
zone_pcp_init(zone);
+ if (IS_ENABLED(CONFIG_CMA))
+ zone->has_cma = 0;

/* For bootup, initialized properly in watermark setup */
mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);

Joonsoo Kim
May 7, 2014, 8:40:02 PM

commit d95ea5d1 ('cma: fix watermark checking') introduced the ALLOC_CMA
alloc flag and treats free CMA pages as free pages when this flag is passed
to the watermark check. The intention of that patch was to let movable page
allocations be served from the CMA reserved region without waking kswapd.
The previous patch changes the allocator so that movable allocations use
pages from the CMA reserved region aggressively, so this watermark hack
isn't needed anymore. Therefore remove it.

Signed-off-by: Joonsoo Kim <iamjoon...@lge.com>

diff --git a/mm/compaction.c b/mm/compaction.c
index 627dc2e..36e2fcd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1117,10 +1117,6 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,

count_compact_event(COMPACTSTALL);

-#ifdef CONFIG_CMA
- if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
- alloc_flags |= ALLOC_CMA;
-#endif
/* Compact each zone in the list */
for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
nodemask) {
diff --git a/mm/internal.h b/mm/internal.h
index 07b6736..a121762 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -384,7 +384,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
-#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
-#define ALLOC_FAIR 0x100 /* fair zone allocation */
+#define ALLOC_FAIR 0x80 /* fair zone allocation */

#endif /* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f2b27b..6af2fa1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1757,20 +1757,22 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
long min = mark;
long lowmem_reserve = z->lowmem_reserve[classzone_idx];
int o;
- long free_cma = 0;

free_pages -= (1 << order) - 1;
if (alloc_flags & ALLOC_HIGH)
min -= min / 2;
if (alloc_flags & ALLOC_HARDER)
min -= min / 4;
-#ifdef CONFIG_CMA
- /* If allocation can't use CMA areas don't use free CMA pages */
- if (!(alloc_flags & ALLOC_CMA))
- free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
+ /*
+ * We don't want to regard the pages on CMA region as free
+ * on watermark checking, since they cannot be used for
+ * unmovable/reclaimable allocation and they can suddenly
+ * vanish through CMA allocation
+ */
+ if (IS_ENABLED(CONFIG_CMA) && z->has_cma)
+ free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);

- if (free_pages - free_cma <= min + lowmem_reserve)
+ if (free_pages <= min + lowmem_reserve)
return false;
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
@@ -2538,10 +2540,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
unlikely(test_thread_flag(TIF_MEMDIE))))
alloc_flags |= ALLOC_NO_WATERMARKS;
}
-#ifdef CONFIG_CMA
- if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
- alloc_flags |= ALLOC_CMA;
-#endif
return alloc_flags;
}

@@ -2811,10 +2809,6 @@ retry_cpuset:
if (!preferred_zone)
goto out;

-#ifdef CONFIG_CMA
- if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
- alloc_flags |= ALLOC_CMA;
-#endif
retry:
/* First allocation attempt */
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,

Joonsoo Kim
May 7, 2014, 8:40:02 PM

We already have retry logic in migrate_pages(), which retries 10 times.
So if we keep the retry code in __alloc_contig_migrate_range() as well, we
end up trying to migrate an unmigratable page up to 50 times. There is just
one small difference in the -ENOMEM case: migrate_pages() does not retry
then, whereas the current __alloc_contig_migrate_range() does. I don't
think this is a problem, because in that case we would most likely fail
again for the same reason.

Signed-off-by: Joonsoo Kim <iamjoon...@lge.com>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dba293..674ade7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6185,7 +6185,6 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
/* This function is based on compact_zone() from compaction.c. */
unsigned long nr_reclaimed;
unsigned long pfn = start;
- unsigned int tries = 0;
int ret = 0;

migrate_prep();
@@ -6204,10 +6203,6 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
ret = -EINTR;
break;
}
- tries = 0;
- } else if (++tries == 5) {
- ret = ret < 0 ? ret : -EBUSY;
- break;
}

nr_reclaimed = reclaim_clean_pages_from_list(cc->zone,
@@ -6216,6 +6211,10 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,

ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
0, MIGRATE_SYNC, MR_CMA);
+ if (ret) {
+ ret = ret < 0 ? ret : -EBUSY;
+ break;
+ }
}
if (ret < 0) {
putback_movable_pages(&cc->migratepages);

Marek Szyprowski
May 9, 2014, 8:40:01 AM

Hello,

On 2014-05-08 02:32, Joonsoo Kim wrote:
> This series tries to improve CMA.
>
> CMA is introduced to provide physically contiguous pages at runtime
> without reserving memory area. But, current implementation works like as
> reserving memory approach, because allocation on cma reserved region only
> occurs as fallback of migrate_movable allocation. We can allocate from it
> when there is no movable page. In that situation, kswapd would be invoked
> easily since unmovable and reclaimable allocation consider
> (free pages - free CMA pages) as free memory on the system and free memory
> may be lower than high watermark in that case. If kswapd start to reclaim
> memory, then fallback allocation doesn't occur much.
>
> In my experiment, I found that if system memory has 1024 MB memory and
> has 512 MB reserved memory for CMA, kswapd is mostly invoked around
> the 512MB free memory boundary. And invoked kswapd tries to make free
> memory until (free pages - free CMA pages) is higher than high watermark,
> so free memory on meminfo is moving around 512MB boundary consistently.
>
> To fix this problem, we should allocate the pages on cma reserved memory
> more aggressively and intelligenetly. Patch 2 implements the solution.
> Patch 1 is the simple optimization which remove useless re-trial and patch 3
> is for removing useless alloc flag, so these are not important.
> See patch 2 for more detailed description.
>
> This patchset is based on v3.15-rc4.

Thanks for posting those patches. It basically reminds me of the following
discussion:
http://thread.gmane.org/gmane.linux.kernel/1391989/focus=1399524

Your approach is basically the same. I hope that your patches can be
improved in such a way that they will be accepted by the mm maintainers.
I only wonder if the third patch is really necessary. Without it, kswapd
wakeup might still be avoided in some cases.

> Thanks.
> Joonsoo Kim (3):
> CMA: remove redundant retrying code in __alloc_contig_migrate_range
> CMA: aggressively allocate the pages on cma reserved memory when not
> used
> CMA: always treat free cma pages as non-free on watermark checking
>
> include/linux/mmzone.h | 6 +++
> mm/compaction.c | 4 --
> mm/internal.h | 3 +-
> mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++---------
> 4 files changed, 102 insertions(+), 28 deletions(-)
>

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland

Michal Nazarewicz
May 9, 2014, 11:50:02 AM

On Wed, May 07 2014, Joonsoo Kim wrote:
> Signed-off-by: Joonsoo Kim <iamjoon...@lge.com>

The code looks good to me, but I don't feel competent to judge whether the
approach is beneficial or not. Still:

Acked-by: Michal Nazarewicz <min...@mina86.com>


--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michał “mina86” Nazarewicz (o o)
ooo +--<m...@google.com>--<xmpp:min...@jabber.org>--ooO--(_)--Ooo--

Michal Nazarewicz
May 9, 2014, 11:50:02 AM

On Wed, May 07 2014, Joonsoo Kim wrote:
> commit d95ea5d1('cma: fix watermark checking') introduces ALLOC_CMA flag
> for alloc flag and treats free cma pages as free pages if this flag is
> passed to watermark checking. Intention of that patch is that movable page
> allocation can be be handled from cma reserved region without starting
> kswapd. Now, previous patch changes the behaviour of allocator that
> movable allocation uses the page on cma reserved region aggressively,
> so this watermark hack isn't needed anymore. Therefore remove it.
>
> Signed-off-by: Joonsoo Kim <iamjoon...@lge.com>


Michal Nazarewicz
May 9, 2014, 11:50:03 AM

On Wed, May 07 2014, Joonsoo Kim wrote:
> We already have retry logic in migrate_pages(). It does retry 10 times.
> So if we keep this retrying code in __alloc_contig_migrate_range(), we
> would try to migrate some unmigratable page in 50 times. There is just one
> small difference in -ENOMEM case. migrate_pages() don't do retry
> in this case, however, current __alloc_contig_migrate_range() does. But,
> I think that this isn't problem, because in this case, we may fail again
> with same reason.
>
> Signed-off-by: Joonsoo Kim <iamjoon...@lge.com>

I think there was a reason for the retries in
__alloc_contig_migrate_range but perhaps those are no longer valid.

Acked-by: Michal Nazarewicz <min...@mina86.com>

Laura Abbott
May 12, 2014, 1:10:02 PM

Hi,

We have an out-of-tree implementation of #1 and so far it has worked for
us, although we weren't looking at the same metrics. I don't completely
understand the issue you pointed out with #1. It sounds like the issue is
that the CMA pages are already in use by other processes and on LRU lists,
and because the pages are on LRU lists they aren't counted towards the
watermark by kswapd. Is my understanding correct?

What metric are you using to determine that all CMA memory was fully used?
We've been checking /proc/pagetypeinfo.

>
> With this patch, we aggressively allocate the pages on cma reserved memory
> so latency of CMA can arise. Below is the experimental result about
> latency.
>
> 4 CPUs, 1024 MB, VIRTUAL MACHINE
> CMA reserve: 512 MB
> Backgound Workload: make -jN
> Real Workload: 8 MB CMA allocation/free 20 times with 5 sec interval
>
> N: 1 4 8 16
> Elapsed-time(Before): 4309.75 9511.09 12276.1 77103.5
> Elapsed-time(After): 5391.69 16114.1 19380.3 34879.2
>
> So generally we can see latency increase. Ratio of this increase
> is rather big - up to 70%. But, under the heavy workload, it shows
> latency decrease - up to 55%. This may be worst-case scenario, but
> reducing it would be important for some system, so, I can say that
> this patch have advantages and disadvantages in terms of latency.
>

Do you have any statistics related to failed migrations from this? Latency
and utilization are issues, but so is migration success. In the past we've
found that an increase in CMA utilization was related to an increase in CMA
migration failures because pages were unmigratable. The current workaround
for this is limiting CMA pages to user processes only and not the file
cache. Both of these have their own problems.

I'm going to see about running this through tests internally for comparison.
Hopefully I'll get useful results in a day or so.

Thanks,
Laura

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

Joonsoo Kim
May 12, 2014, 9:20:02 PM

Hello,

Yes, your understanding is correct.
kswapd wants to reclaim normal (non-CMA) pages, but with approach #1 the
LRU lists can hold long continuous runs of CMA pages, so the watermark
isn't restored easily.
In these results, you can tell from the MemFree stat whether more CMA
memory was used. I used /proc/zoneinfo to get more insight.

> >
> > With this patch, we aggressively allocate the pages on cma reserved memory
> > so latency of CMA can arise. Below is the experimental result about
> > latency.
> >
> > 4 CPUs, 1024 MB, VIRTUAL MACHINE
> > CMA reserve: 512 MB
> > Backgound Workload: make -jN
> > Real Workload: 8 MB CMA allocation/free 20 times with 5 sec interval
> >
> > N: 1 4 8 16
> > Elapsed-time(Before): 4309.75 9511.09 12276.1 77103.5
> > Elapsed-time(After): 5391.69 16114.1 19380.3 34879.2
> >
> > So generally we can see latency increase. Ratio of this increase
> > is rather big - up to 70%. But, under the heavy workload, it shows
> > latency decrease - up to 55%. This may be worst-case scenario, but
> > reducing it would be important for some system, so, I can say that
> > this patch have advantages and disadvantages in terms of latency.
> >
>
> Do you have any statistics related to failed migration from this? Latency
> and utilization are issues but so is migration success. In the past we've
> found that an increase in CMA utilization was related to increase in CMA
> migration failures because pages were unmigratable. The current
> workaround for this is limiting CMA pages to be used for user processes
> only and not the file cache. Both of these have their own problems.

I have the retry counts for doing an 8 MB CMA allocation 20 times.
These numbers are the average of 5 runs.

N:                  1      4      8      16
Retrying(Before):   0      0      0.6    12.2
Retrying(After):    1.4    1.8    3      3.6

If you know of any permanent failure case with file cache pages, please
let me know.

The CMA migration failure involving file cache pages that I already know
about is the problem related to the buffer_head LRU, which you mentioned
before.
Okay.
I really hope to see your results. :)
Thanks for your interest.

Thanks.

Joonsoo Kim
May 12, 2014, 10:30:02 PM

Hello,

Oh... I didn't know about that patch and discussion, because I had no
interest in CMA at that time. Your approach looks similar to my approach #1
and could have the same problem I mention in patch 2/3. Please refer to
that patch description. :)

Also, this patch and yours have different purposes. This patch is intended
to make better use of CMA pages and thus get maximum performance. Just to
avoid triggering the OOM killer, it would be possible to put this logic on
the reclaim path, but that is sub-optimal for performance, because it needs
migration in some cases.

If the second patch works as intended, only a few free CMA pages remain by
the time we approach the watermark. So the benefit of the third patch would
be marginal and we can remove ALLOC_CMA.

Thanks.

Minchan Kim
May 12, 2014, 11:00:02 PM

Hey Joonsoo,

I love this idea, but when I look at the code I don't like it.
In the allocation path, just try to allocate pages round-robin; that's the
allocator's role. If one of the migratetypes is exhausted, hand the job
over to the reclaimer with a hint (i.e. "Hey reclaimer, this is a
non-movable allocation failure, so reclaiming MIGRATE_CMA pages is
pointless") so that the reclaimer can filter them out during page scanning.
We already have a tool to achieve that (i.e. isolate_mode_t).

And couldn't we do it in zone_watermark_ok() by setting/resetting ALLOC_CMA?
If possible, that would be better, because it's the generic function for
checking free pages and for triggering the reclaim/compaction logic.

--
Kind regards,
Minchan Kim

Minchan Kim
May 12, 2014, 11:10:01 PM

On Mon, May 12, 2014 at 10:04:29AM -0700, Laura Abbott wrote:
Kswapd can reclaim MIGRATE_CMA pages unconditionally even though the
allocation that failed was non-movable. That's pointless and should be
fixed. If Joonsoo's patch makes the failure ratio higher, that's okay with
me, because we'll get more reports about it and have a chance to fix it.
It's better than hiding CMA's problems with some hack.

--
Kind regards,
Minchan Kim

Aneesh Kumar K.V
May 14, 2014, 4:50:01 AM

Another issue I am facing with the current code is atomic allocations
failing even with a large number of CMA pages around. In my case we never
reclaim, because a large part of the memory is consumed by the page cache
and, for those allocations, the free memory check doesn't account for
free_cma. I will test with this patchset and update here once I have the
results.
Can you write documentation around this?
Can you add that documentation to the commit message here?

> +
> + /* Reset complete, start on movable first */
> +alloc_movable:
> + zone->nr_try_movable--;
> + return NULL;
> +}
> +#endif
> +
> /*
> * Do the hard work of removing an element from the buddy allocator.
> * Call me with the zone->lock already held.
> @@ -1143,10 +1217,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
> static struct page *__rmqueue(struct zone *zone, unsigned int order,
> int migratetype)
> {
> - struct page *page;
> + struct page *page = NULL;
> +
> + if (IS_ENABLED(CONFIG_CMA))
> + page = __rmqueue_cma(zone, order, migratetype);

It would be better to move the migratetype check here, so that it becomes:

/* For migrate movable allocation try cma area first */
if (IS_ENABLED(CONFIG_CMA) && (migratetype == MIGRATE_MOVABLE))



>
> retry_reserve:
> - page = __rmqueue_smallest(zone, order, migratetype);
> + if (!page)
> + page = __rmqueue_smallest(zone, order, migratetype);
>
> if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
> page = __rmqueue_fallback(zone, order, migratetype);
> @@ -4849,6 +4927,8 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
> zone_seqlock_init(zone);
> zone->zone_pgdat = pgdat;
> zone_pcp_init(zone);
> + if (IS_ENABLED(CONFIG_CMA))
> + zone->has_cma = 0;
>
> /* For bootup, initialized properly in watermark setup */
> mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
> --
> 1.7.9.5
>

Aneesh Kumar K.V
May 14, 2014, 5:50:02 AM

IIUC that patch also interleaves, right?

+#ifdef CONFIG_CMA
+ unsigned long nr_free = zone_page_state(zone, NR_FREE_PAGES);
+ unsigned long nr_cma_free = zone_page_state(zone, NR_FREE_CMA_PAGES);
+
+ if (migratetype == MIGRATE_MOVABLE && nr_cma_free &&
+ nr_free - nr_cma_free < 2 * low_wmark_pages(zone))
+ migratetype = MIGRATE_CMA;
+#endif /* CONFIG_CMA */

That doesn't always prefer the CMA region. It would be nice to understand
why grouping in pageblock_nr_pages units is beneficial. Also, in your patch
you decrement nr_try_cma by only one for every 'order' allocation. Why?

+ if (zone->nr_try_cma) {
+ /* Okay. Now, we can try to allocate the page from cma region */
+ zone->nr_try_cma--;
+ page = __rmqueue_smallest(zone, order, MIGRATE_CMA);
+
+ /* CMA pages can vanish through CMA allocation */
+ if (unlikely(!page && order == 0))
+ zone->nr_try_cma = 0;
+
+ return page;
+ }


If the MIGRATE_CMA allocation above fails, should we return failure? Why
not try a MOVABLE allocation on failure (i.e. fall through the code path)?

> And, there is different purpose between this and yours. This patch is
> intended to better use of CMA pages and so get maximum performance.
> Just to not trigger oom, it can be possible to put this logic on reclaim path.
> But that is sub-optimal to get higher performance, because it needs
> migration in some cases.
>
> If second patch works as intended, there are just a few of cma free pages
> when we are toward on the watermark. So benefit of third patch would
> be marginal and we can remove ALLOC_CMA.
>
> Thanks.
>

-aneesh

Joonsoo Kim
May 14, 2014, 10:00:02 PM

Hello,

I agree with keeping the fast allocation path as simple as possible.
I will remove the runtime ratio computation from __rmqueue_cma() and
instead use a pre-computed value calculated on another path.

I am not sure whether your second suggestion (the "Hey reclaimer" part) is
good or not. My quick thought is that it could help when many free CMA
pages remain, but it would not help when there are neither free movable nor
free CMA pages. In general, most workloads mainly use movable pages, for
the page cache or anonymous mappings. Even when reclaim is triggered by a
non-movable allocation failure, the reclaimed pages are mostly used by
movable allocations afterwards, and we can handle those requests even if we
reclaim pages simply in LRU order. If we rotate the LRU list looking for
movable pages, more useful pages could end up being evicted.

This is just my quick thought, so please correct me if I am wrong.

>
> And we couldn't do it in zone_watermark_ok with set/reset ALLOC_CMA?
> If possible, it would be better becauser it's generic function to check
> free pages and cause trigger reclaim/compaction logic.

I guess your *it* means the ratio computation, right?
I don't like putting it in zone_watermark_ok(). Although it needs the free
CMA page count, which zone_watermark_ok() also uses, this computation is
for determining the ratio, not for triggering reclaim/compaction. And
zone_watermark_ok() is on an even hotter path, so putting this logic there
doesn't look better to me.

I will think of a better place to do it.

Thanks.

Joonsoo Kim
May 14, 2014, 10:00:02 PM

Hello,

Could you elaborate more on your issue?
I can't completely understand your problem.
Is your atomic allocation movable? And does that request fail even though
there are many free CMA pages?
Okay.
Okay.

> > +
> > + /* Reset complete, start on movable first */
> > +alloc_movable:
> > + zone->nr_try_movable--;
> > + return NULL;
> > +}
> > +#endif
> > +
> > /*
> > * Do the hard work of removing an element from the buddy allocator.
> > * Call me with the zone->lock already held.
> > @@ -1143,10 +1217,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
> > static struct page *__rmqueue(struct zone *zone, unsigned int order,
> > int migratetype)
> > {
> > - struct page *page;
> > + struct page *page = NULL;
> > +
> > + if (IS_ENABLED(CONFIG_CMA))
> > + page = __rmqueue_cma(zone, order, migratetype);
>
> It would be better to move the migrate check here, So that it becomes
>
> /* For migrate movable allocation try cma area first */
> if (IS_ENABLED(CONFIG_CMA) && (migratetype == MIGRATE_MOVABLE))
>
>

Okay. But there is no real difference between the current code and your
suggestion, because __rmqueue_cma() would be inlined by the compiler
anyway.

Thanks.

Joonsoo Kim
May 14, 2014, 10:10:02 PM

Hello,

This is not interleaving from my point of view. That logic allocates free
movable pages until hitting 2 * low_wmark, and only then allocates free CMA
pages. The interleaving I mean is more like a round-robin policy without
such a constraint.

>
> That doesn't always prefer CMA region. It would be nice to
> understand why grouping in pageblock_nr_pages is beneficial. Also in
> your patch you decrement nr_try_cma for every 'order' allocation. Why ?

pageblock_nr_pages is just a magic value with no particular rationale. :)
But we do need grouping, because without it we can't get physically
contiguous pages. When we allocate pages for the page cache, the readahead
logic tries to allocate 32 pages; without grouping, the disk I/O for these
pages can't be handled by a single I/O request on some devices. I'm not
that familiar with I/O devices, so please correct me if I'm wrong.

And yes, I will take the allocation 'order' into account when
incrementing/decrementing nr_try_cma.
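
Something like the following (untested sketch, on top of the code quoted
below) is what I have in mind for the order-aware accounting; the same
adjustment would apply to nr_try_movable:

/* untested sketch: account a whole 2^order block, not just one unit */
if (zone->nr_try_cma) {
	zone->nr_try_cma -= 1 << order;
	if (zone->nr_try_cma < 0)
		zone->nr_try_cma = 0;
	page = __rmqueue_smallest(zone, order, MIGRATE_CMA);

	/* CMA pages can vanish through CMA allocation */
	if (unlikely(!page && order == 0))
		zone->nr_try_cma = 0;

	return page;
}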

>
> + if (zone->nr_try_cma) {
> + /* Okay. Now, we can try to allocate the page from cma region */
> + zone->nr_try_cma--;
> + page = __rmqueue_smallest(zone, order, MIGRATE_CMA);
> +
> + /* CMA pages can vanish through CMA allocation */
> + if (unlikely(!page && order == 0))
> + zone->nr_try_cma = 0;
> +
> + return page;
> + }
>
>
> If we fail above MIGRATE_CMA alloc should we return failure ? Why
> not try MOVABLE allocation on failure (ie fallthrough the code path) ?

This patch does fall through: if __rmqueue_cma() fails, __rmqueue()
continues as usual.

Thanks.

Minchan Kim
May 14, 2014, 10:50:01 PM

Sounds good.

>
> I am not sure that whether your second suggestion(Hey relaimer part)
> is good or not. In my quick thought, that could be helpful in the
> situation that many free cma pages remained. But, it would be not helpful
> when there are neither free movable and cma pages. In generally, most
> workloads mainly uses movable pages for page cache or anonymous mapping.
> Although reclaim is triggered by non-movable allocation failure, reclaimed
> pages are used mostly by movable allocation. We can handle these allocation
> request even if we reclaim the pages just in lru order. If we rotate
> the lru list for finding movable pages, it could cause more useful
> pages to be evicted.
>
> This is just my quick thought, so please let me correct if I am wrong.

Why should the reclaimer reclaim unnecessary pages?
So your answer is that it's better because upcoming allocations can then be
satisfied without interruption. But it could reclaim too many pages before
the watermark for unmovable allocations is satisfied; sometimes you might
even see an OOM.

Moreover, how would you handle the current trouble?
For example, there is an atomic allocation and the only thing that can save
the world is kswapd, because that's one of kswapd's roles, but kswapd is
spending a lot of time reclaiming CMA pages, which is pointless, so the
allocation can easily fail.

>
> >
> > And we couldn't do it in zone_watermark_ok with set/reset ALLOC_CMA?
> > If possible, it would be better becauser it's generic function to check
> > free pages and cause trigger reclaim/compaction logic.
>
> I guess, your *it* means ratio computation. Right?

I meant just doing it in get_page_from_freelist(), like the fair zone
allocation, for consistency, but as we discussed offline, I'm not against
you if that's not the right place.


> I don't like putting it on zone_watermark_ok(). Although it need to
> refer to free cma pages value which are also referred in zone_watermark_ok(),
> this computation is for determining ratio, not for triggering
> reclaim/compaction. And this zone_watermark_ok() is on more hot-path, so
> putting this logic into zone_watermark_ok() looks not better to me.
>
> I will think better place to do it.

Yeb, Thanks!

>
> Thanks.
>

--
Kind regards,
Minchan Kim

Heesub Shin
May 14, 2014, 10:50:01 PM

Hello,
We have an out-of-tree implementation that takes exactly the approach
Minchan described, and it works, but it definitely has some side effects,
as you pointed out: distorting the LRU and evicting hot pages. I'm not
attaching code fragments in this thread for various reasons, but it should
be easy for you to reproduce. I am wondering whether it could also help in
your case.

Thanks,
Heesub

>
>>
>> And we couldn't do it in zone_watermark_ok with set/reset ALLOC_CMA?
>> If possible, it would be better becauser it's generic function to check
>> free pages and cause trigger reclaim/compaction logic.
>
> I guess, your *it* means ratio computation. Right?
> I don't like putting it on zone_watermark_ok(). Although it need to
> refer to free cma pages value which are also referred in zone_watermark_ok(),
> this computation is for determining ratio, not for triggering
> reclaim/compaction. And this zone_watermark_ok() is on more hot-path, so
> putting this logic into zone_watermark_ok() looks not better to me.
>
> I will think better place to do it.
>
> Thanks.
>

Minchan Kim
May 15, 2014, 1:10:01 AM

Hello Heesub,

Actually, I had discussed with Joonsoo that we would solve such corner
cases in the future if someone reported them, and you just did. Thanks!

LRU churning is a general problem, not CMA-specific, although CMA makes it
worse more aggressively, so I'd like to handle it as a separate topic
(i.e. patchset).

The reason we rotated those pages back to the LRU head was just to avoid
re-scanning them within one reclaim cycle. One idea I can think of is to
put a reclaim cursor at the LRU tail right before a reclaim cycle, start
scanning from the cursor, and update the cursor position on every scanning
cycle. Of course, we should rotate filtered-out pages back to the LRU's
tail, not the head, but with the cursor we can skip the pointless pages
already scanned by this reclaim cycle.

The cursor would be removed when the reclaim cycle is done, so the next
reclaim would start from the beginning and might scan unnecessarily again
until reaching a proper victim page. CPU usage would be higher, but it's
better than evicting the working set.

Any other ideas?
Kind regards,
Minchan Kim

Mel Gorman
May 15, 2014, 5:50:03 AM

On Thu, May 15, 2014 at 11:10:55AM +0900, Joonsoo Kim wrote:
> > That doesn't always prefer CMA region. It would be nice to
> > understand why grouping in pageblock_nr_pages is beneficial. Also in
> > your patch you decrement nr_try_cma for every 'order' allocation. Why ?
>
> pageblock_nr_pages is just magic value with no rationale. :)

I'm not following this discussion closely, but there is a rationale for
that value -- it's the size of a huge page on that architecture. At the
time fragmentation avoidance was implemented, this was the largest
allocation size of interest.

--
Mel Gorman
SUSE Labs

Gioh Kim
May 16, 2014, 4:10:02 AM

Hi,

I've been trying to apply CMA to my platform.
The USB host driver generated the kernel panic below when a USB mouse was
connected, because I turned on CMA and set the CMA_SIZE_MBYTES value to
zero by mistake.

I think the panic is caused by the atomic_pool in arch/arm/mm/dma-mapping.c.
A zero CMA_SIZE_MBYTES value skips CMA initialization, and then atomic_pool
is not initialized either, because __alloc_from_contiguous fails in
atomic_pool_init().

If CMA_SIZE_MBYTES is allowed to be zero, there should be defensive code
that checks whether CMA is initialized correctly, and atomic_pool
initialization should be done by __alloc_remap_buffer instead of
__alloc_from_contiguous when __alloc_from_contiguous fails.

IMPO, it is simpler and more powerful to restrict the CMA_SIZE_MBYTES_MAX
configuration to be larger than zero.


[ 1.474523] ------------[ cut here ]------------
[ 1.479150] WARNING: at arch/arm/mm/dma-mapping.c:496 __dma_alloc.isra.19+0x1b8/0x1e0()
[ 1.487160] coherent pool not initialised!
[ 1.491249] Modules linked in:
[ 1.494317] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.10.19+ #55
[ 1.500521] [<80013e20>] (unwind_backtrace+0x0/0xf8) from [<80011c60>] (show_stack+0x10/0x14)
[ 1.509064] [<80011c60>] (show_stack+0x10/0x14) from [<8001eedc>] (warn_slowpath_common+0x4c/0x6c)
[ 1.518038] [<8001eedc>] (warn_slowpath_common+0x4c/0x6c) from [<8001ef90>] (warn_slowpath_fmt+0x30/0x40)
[ 1.527616] [<8001ef90>] (warn_slowpath_fmt+0x30/0x40) from [<80017c28>] (__dma_alloc.isra.19+0x1b8/0x1e0)
[ 1.537282] [<80017c28>] (__dma_alloc.isra.19+0x1b8/0x1e0) from [<80017d7c>] (arm_dma_alloc+0x90/0x98)
[ 1.546608] [<80017d7c>] (arm_dma_alloc+0x90/0x98) from [<8034a860>] (ohci_init+0x1b0/0x278)
[ 1.555062] [<8034a860>] (ohci_init+0x1b0/0x278) from [<80332b0c>] (usb_add_hcd+0x184/0x5b8)
[ 1.563500] [<80332b0c>] (usb_add_hcd+0x184/0x5b8) from [<8034b5e0>] (ohci_platform_probe+0xd0/0x174)
[ 1.572729] [<8034b5e0>] (ohci_platform_probe+0xd0/0x174) from [<802f196c>] (platform_drv_probe+0x14/0x18)
[ 1.582401] [<802f196c>] (platform_drv_probe+0x14/0x18) from [<802f0714>] (driver_probe_device+0x6c/0x1f8)
[ 1.592064] [<802f0714>] (driver_probe_device+0x6c/0x1f8) from [<802f092c>] (__driver_attach+0x8c/0x90)
[ 1.601465] [<802f092c>] (__driver_attach+0x8c/0x90) from [<802eeec8>] (bus_for_each_dev+0x54/0x88)
[ 1.610518] [<802eeec8>] (bus_for_each_dev+0x54/0x88) from [<802efef0>] (bus_add_driver+0xd8/0x230)
[ 1.619572] [<802efef0>] (bus_add_driver+0xd8/0x230) from [<802f0de4>] (driver_register+0x78/0x14c)
[ 1.628632] [<802f0de4>] (driver_register+0x78/0x14c) from [<806ff018>] (ohci_hcd_mod_init+0x34/0x64)
[ 1.637859] [<806ff018>] (ohci_hcd_mod_init+0x34/0x64) from [<8000879c>] (do_one_initcall+0xec/0x14c)
[ 1.647088] [<8000879c>] (do_one_initcall+0xec/0x14c) from [<806dab30>] (kernel_init_freeable+0x150/0x220)
[ 1.656754] [<806dab30>] (kernel_init_freeable+0x150/0x220) from [<80509f54>] (kernel_init+0x8/0xf8)
[ 1.665895] [<80509f54>] (kernel_init+0x8/0xf8) from [<8000e398>] (ret_from_fork+0x14/0x3c)
[ 1.674264] ---[ end trace 6f1857db5ef45cb9 ]---
[ 1.678880] ohci-platform ohci-platform.0: can't setup
[ 1.684027] ohci-platform ohci-platform.0: USB bus 1 deregistered
[ 1.690362] ohci-platform: probe of ohci-platform.0 failed with error -12
[ 1.697188] ohci-platform ohci-platform.1: Generic Platform OHCI Controller
[ 1.704365] ohci-platform ohci-platform.1: new USB bus registered, assigned bus number 1
[ 1.712457] ------------[ cut here ]------------
[ 1.717096] WARNING: at arch/arm/mm/dma-mapping.c:496 __dma_alloc.isra.19+0x1b8/0x1e0()
[ 1.725105] coherent pool not initialised!
[ 1.729194] Modules linked in:
[ 1.732247] CPU: 1 PID: 1 Comm: swapper/0 Tainted: G W 3.10.19+ #55
[ 1.739404] [<80013e20>] (unwind_backtrace+0x0/0xf8) from [<80011c60>] (show_stack+0x10/0x14)
[ 1.747949] [<80011c60>] (show_stack+0x10/0x14) from [<8001eedc>] (warn_slowpath_common+0x4c/0x6c)
[ 1.756923] [<8001eedc>] (warn_slowpath_common+0x4c/0x6c) from [<8001ef90>] (warn_slowpath_fmt+0x30/0x40)
[ 1.766502] [<8001ef90>] (warn_slowpath_fmt+0x30/0x40) from [<80017c28>] (__dma_alloc.isra.19+0x1b8/0x1e0)
[ 1.776168] [<80017c28>] (__dma_alloc.isra.19+0x1b8/0x1e0) from [<80017d7c>] (arm_dma_alloc+0x90/0x98)
[ 1.785484] [<80017d7c>] (arm_dma_alloc+0x90/0x98) from [<8034a860>] (ohci_init+0x1b0/0x278)
[ 1.793933] [<8034a860>] (ohci_init+0x1b0/0x278) from [<80332b0c>] (usb_add_hcd+0x184/0x5b8)
[ 1.802370] [<80332b0c>] (usb_add_hcd+0x184/0x5b8) from [<8034b5e0>] (ohci_platform_probe+0xd0/0x174)
[ 1.811597] [<8034b5e0>] (ohci_platform_probe+0xd0/0x174) from [<802f196c>] (platform_drv_probe+0x14/0x18)
[ 1.821263] [<802f196c>] (platform_drv_probe+0x14/0x18) from [<802f0714>] (driver_probe_device+0x6c/0x1f8)
[ 1.830926] [<802f0714>] (driver_probe_device+0x6c/0x1f8) from [<802f092c>] (__driver_attach+0x8c/0x90)
[ 1.840326] [<802f092c>] (__driver_attach+0x8c/0x90) from [<802eeec8>] (bus_for_each_dev+0x54/0x88)
[ 1.849379] [<802eeec8>] (bus_for_each_dev+0x54/0x88) from [<802efef0>] (bus_add_driver+0xd8/0x230)
[ 1.858432] [<802efef0>] (bus_add_driver+0xd8/0x230) from [<802f0de4>] (driver_register+0x78/0x14c)
[ 1.867488] [<802f0de4>] (driver_register+0x78/0x14c) from [<806ff018>] (ohci_hcd_mod_init+0x34/0x64)
[ 1.876714] [<806ff018>] (ohci_hcd_mod_init+0x34/0x64) from [<8000879c>] (do_one_initcall+0xec/0x14c)
[ 1.885940] [<8000879c>] (do_one_initcall+0xec/0x14c) from [<806dab30>] (kernel_init_freeable+0x150/0x220)
[ 1.895601] [<806dab30>] (kernel_init_freeable+0x150/0x220) from [<80509f54>] (kernel_init+0x8/0xf8)
[ 1.904741] [<80509f54>] (kernel_init+0x8/0xf8) from [<8000e398>] (ret_from_fork+0x14/0x3c)
[ 1.913085] ---[ end trace 6f1857db5ef45cba ]---


I'm adding my patch to restrict CMA_SIZE_MBYTES.
This patch is based on 3.15.0-rc5

-------------------------------- 8< --------------------------------------
From 9f8e6d3c1f4bdeeeb7af3df7898b773a612c62e8 Mon Sep 17 00:00:00 2001
From: Gioh Kim <gioh...@lge.com>
Date: Fri, 16 May 2014 16:15:43 +0900
Subject: [PATCH] drivers/base/Kconfig: restrict CMA size to non-zero value

The size of the CMA area must be larger than zero.
If the size is zero, CMA cannot be initialized.

Signed-off-by: Gioh Kim <gioh.kim@lge.c>
---
drivers/base/Kconfig | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 4b7b452..19b3578 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -222,13 +222,18 @@ config DMA_CMA
if DMA_CMA
comment "Default contiguous memory area size:"

+config CMA_SIZE_MBYTES_MAX
+ int
+ default 1024
+
config CMA_SIZE_MBYTES
int "Size in Mega Bytes"
depends on !CMA_SIZE_SEL_PERCENTAGE
+ range 1 CMA_SIZE_MBYTES_MAX
default 16
help
Defines the size (in MiB) of the default memory area for Contiguous
- Memory Allocator.
+ Memory Allocator. This value must be larger than zero.

config CMA_SIZE_PERCENTAGE
int "Percentage of total memory"
--
1.7.9.5

Michal Nazarewicz
May 16, 2014, 1:50:02 PM

On Fri, May 16 2014, Gioh Kim wrote:
> If CMA_SIZE_MBYTES is allowed to be zero, there should be defense code
> to check CMA is initlaized correctly. And atomic_pool initialization
> should be done by __alloc_remap_buffer instead of
> __alloc_from_contiguous if __alloc_from_contiguous is failed.

Agreed, and this is the correct fix.

> IMPO, it is more simple and powerful to restrict CMA_SIZE_MBYTES_MAX
> configuration to be larger than zero.

No, because it makes it impossible to have CMA disabled by default and only
enabled when a command line argument is given.

Furthermore, your patch does *not* guarantee that the CMA region is always
allocated: for instance if CMA_SIZE_SEL_PERCENTAGE is selected, or if the
user explicitly passes 0 on the command line.
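
(For example, assuming the usual cma= early parameter that feeds
size_cmdline in dma-contiguous.c: one can build with CMA_SIZE_MBYTES=0 and
then opt in per machine with something like

    cma=64M

on the kernel command line, while cma=0 keeps CMA effectively disabled.
A hard lower bound in Kconfig would take that choice away.)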

Aneesh Kumar K.V
May 18, 2014, 1:40:02 PM

Joonsoo Kim <iamjoon...@lge.com> writes:

> On Wed, May 14, 2014 at 02:12:19PM +0530, Aneesh Kumar K.V wrote:
>> Joonsoo Kim <iamjoon...@lge.com> writes:
>>
>>
>>
>> Another issue i am facing with the current code is the atomic allocation
>> failing even with large number of CMA pages around. In my case we never
>> reclaimed because large part of the memory is consumed by the page cache and
>> for that, free memory check doesn't include at free_cma. I will test
>> with this patchset and update here once i have the results.
>>
>
> Hello,
>
> Could you elaborate more on your issue?
> I can't completely understand your problem.
> So your atomic allocation is movable? And although there are many free
> cma pages, that request is fail?
>

Non-movable atomic allocations are failing because we don't have anything
other than CMA pages left and kswapd has yet to catch up?


swapper/0: page allocation failure: order:0, mode:0x20
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.23-1500.pkvm2_1.5.ppc64 #1
Call Trace:
[c000000ffffcb610] [c000000000017330] .show_stack+0x130/0x200 (unreliable)
[c000000ffffcb6e0] [c00000000087a8c8] .dump_stack+0x28/0x3c
[c000000ffffcb750] [c0000000001e06f0] .warn_alloc_failed+0x110/0x160
[c000000ffffcb800] [c0000000001e5984] .__alloc_pages_nodemask+0x9d4/0xbf0
[c000000ffffcb9e0] [c00000000023775c] .alloc_pages_current+0xcc/0x1b0
[c000000ffffcba80] [c0000000007098d4] .__netdev_alloc_frag+0x1a4/0x1d0
[c000000ffffcbb20] [c00000000070d750] .__netdev_alloc_skb+0xc0/0x130
[c000000ffffcbbb0] [d000000009639b40] .tg3_poll_work+0x900/0x1110 [tg3]
[c000000ffffcbd10] [d00000000963a3a4] .tg3_poll_msix+0x54/0x200 [tg3]
[c000000ffffcbdb0] [c00000000071fcec] .net_rx_action+0x1dc/0x310
[c000000ffffcbe90] [c0000000000c1b08] .__do_softirq+0x158/0x330
[c000000ffffcbf90] [c000000000025744] .call_do_softirq+0x14/0x24
[c000000ffffc7e00] [c000000000011684] .do_softirq+0xf4/0x130
[c000000ffffc7e90] [c0000000000c1f18] .irq_exit+0xc8/0x110
[c000000ffffc7f10] [c000000000011258] .__do_irq+0xc8/0x1f0
[c000000ffffc7f90] [c000000000025768] .call_do_irq+0x14/0x24
[c00000000137b750] [c00000000001142c] .do_IRQ+0xac/0x130
[c00000000137b800] [c000000000002a64]
hardware_interrupt_common+0x164/0x180

....


Node 0 DMA: 408*64kB (C) 408*128kB (C) 408*256kB (C) 408*512kB (C) 408*1024kB (C) 406*2048kB (C) 199*4096kB (C) 97*8192kB (C) 6*16384kB (C) =
3348992kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB

meminfo details:

MemTotal: 65875584 kB
MemFree: 8001856 kB
Buffers: 49330368 kB
Cached: 178752 kB
SwapCached: 0 kB
Active: 28550464 kB
Inactive: 25476416 kB
Active(anon): 3771008 kB
Inactive(anon): 767360 kB
Active(file): 24779456 kB
Inactive(file): 24709056 kB
Unevictable: 15104 kB
Mlocked: 15104 kB
SwapTotal: 8384448 kB
SwapFree: 8384448 kB
Dirty: 0 kB

-aneesh

Gioh Kim
May 18, 2014, 9:50:02 PM

Thank you for your advice. I hadn't noticed that.

I'm adding the following according to your advice:

- range restriction for CMA_SIZE_MBYTES and *CMA_SIZE_PERCENTAGE*
  I think this can prevent a wrong kernel option.

- treat a zero size_cmdline as the default value (16 MB)
  I am not sure this helps when the cma=0 cmdline option is given together
  with base and limit options.

I don't know how to send the second patch, so please pardon me for just
copying it here.

--------------------------------- 8< -------------------------------------
From c283eaac41b044a2abb11cfd32a60fff034633c3 Mon Sep 17 00:00:00 2001
From: Gioh Kim <gioh...@lge.com>
Date: Fri, 16 May 2014 16:15:43 +0900
Subject: [PATCH] drivers/base/Kconfig: restrict CMA size to non-zero value

The size of the CMA area must be larger than zero.
If the size is zero, all physically contiguous allocations
can fail.

Signed-off-by: Gioh Kim <gioh...@lge.co.kr>
---
drivers/base/Kconfig | 14 ++++++++++++--
drivers/base/dma-contiguous.c | 3 ++-
2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 4b7b452..a7292ac 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -222,17 +222,27 @@ config DMA_CMA
if DMA_CMA
comment "Default contiguous memory area size:"

+config CMA_SIZE_MBYTES_DEFAULT
+ int
+ default 16
+
+config CMA_SIZE_MBYTES_MAX
+ int
+ default 1024
+
config CMA_SIZE_MBYTES
int "Size in Mega Bytes"
depends on !CMA_SIZE_SEL_PERCENTAGE
- default 16
+ range 1 CMA_SIZE_MBYTES_MAX
+ default CMA_SIZE_MBYTES_DEFAULT
help
Defines the size (in MiB) of the default memory area for Contiguous
- Memory Allocator.
+ Memory Allocator. This value must be larger than zero.

config CMA_SIZE_PERCENTAGE
int "Percentage of total memory"
depends on !CMA_SIZE_SEL_MBYTES
+ range 1 100
default 10
help
Defines the size of the default memory area for Contiguous Memory
diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index b056661..5b70442 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -125,7 +125,8 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
pr_debug("%s(limit %08lx)\n", __func__, (unsigned long)limit);

if (size_cmdline != -1) {
- selected_size = size_cmdline;
+ selected_size = ((size_cmdline == 0) ?
+ CONFIG_CMA_SIZE_MBYTES_DEFAULT : size_cmdline);
selected_base = base_cmdline;
selected_limit = min_not_zero(limit_cmdline, limit);
if (base_cmdline + size_cmdline == limit_cmdline)
--
1.7.9.5


On 2014-05-17 2:45 AM, Michal Nazarewicz wrote:
> On Fri, May 16 2014, Gioh Kim wrote:
>> If CMA_SIZE_MBYTES is allowed to be zero, there should be defense code
>> to check CMA is initlaized correctly. And atomic_pool initialization
>> should be done by __alloc_remap_buffer instead of
>> __alloc_from_contiguous if __alloc_from_contiguous is failed.
>
> Agreed, and this is the correct fix.
>
>> IMPO, it is more simple and powerful to restrict CMA_SIZE_MBYTES_MAX
>> configuration to be larger than zero.
>
> No, because it makes it impossible to have CMA disabled by default and
> only enabled if command line argument is given.
>
> Furthermore, your patch does *not* guarantee CMA region to always be
> allocated. If CMA_SIZE_SEL_PERCENTAGE is selected for instance. Or if
> user explicitly passes 0 on command line.
>
>
>
--

Joonsoo Kim
May 18, 2014, 10:10:02 PM

Hello,

I guess that isn't the problem. In the LRU, movable pages and CMA pages
would be interleaved, so it shouldn't take too long to find a page for a
non-movable allocation.

IMHO, in general, memory shortage is caused by movable allocations, so
distinguishing the allocation type and handling them differently has only a
marginal effect.

Anyway, I will think about it more deeply.

>
> >
> > >
> > > And we couldn't do it in zone_watermark_ok with set/reset ALLOC_CMA?
> > > If possible, it would be better becauser it's generic function to check
> > > free pages and cause trigger reclaim/compaction logic.
> >
> > I guess, your *it* means ratio computation. Right?
>
> I meant just get_page_from_freelist like fair zone allocation for consistency
> but as we discussed offline, i'm not against with you if it's not right place.

Okay :)

Thanks.

Joonsoo Kim
May 18, 2014, 10:20:02 PM

On Thu, May 15, 2014 at 10:47:18AM +0100, Mel Gorman wrote:
> On Thu, May 15, 2014 at 11:10:55AM +0900, Joonsoo Kim wrote:
> > > That doesn't always prefer CMA region. It would be nice to
> > > understand why grouping in pageblock_nr_pages is beneficial. Also in
> > > your patch you decrement nr_try_cma for every 'order' allocation. Why ?
> >
> > pageblock_nr_pages is just magic value with no rationale. :)
>
> I'm not following this discussions closely but there is rational to that
> value -- it's the size of a huge page for that architecture. At the time
> the fragmentation avoidance was implemented this was the largest allocation
> size of interest.

Hello,

Indeed, there is such a good rationale.
Thanks a lot for pointing it out.

Thanks.

Joonsoo Kim

unread,
May 18, 2014, 10:30:02 PM5/18/14
to
Hello,

I think that the third patch in this patchset would solve this problem.
Your problem may occur in the following scenario.

1. Unmovable and reclaimable free pages are nearly exhausted.
2. There are still some movable pages, so the watermark check passes.
3. A lot of movable allocations are requested.
4. Most of the movable pages get allocated.
5. The watermark check still passes, because there are a lot of
free CMA pages and these allocations are of the movable type,
so kswapd is not woken up.
6. A non-movable atomic allocation is requested => it fails.

So, the problem is in step #5. Although we have enough pages for the
movable type, we should also be prepared for allocation requests of the
other types. With my third patch, kswapd can be woken by movable
allocations, so your problem would disappear.
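
(To make the idea of the third patch concrete, here is a minimal
sketch. It is not the actual hunk; the function name and signature are
invented for illustration, and the real check in mm/page_alloc.c takes
more inputs:)

/*
 * Sketch: treat free CMA pages as non-free on the watermark check,
 * regardless of which migratetype is being allocated.
 */
static bool watermark_ok(long nr_free, long nr_free_cma, long high_wmark)
{
	/*
	 * Previously free CMA pages were subtracted only when the caller
	 * could not use them.  Subtracting them unconditionally means a
	 * burst of movable allocations served from the CMA area still
	 * pushes the zone below the watermark and wakes kswapd, which
	 * keeps a reserve for unmovable/atomic requests.
	 */
	return nr_free - nr_free_cma >= high_wmark;
}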

Thanks.

Minchan Kim

unread,
May 18, 2014, 11:00:01 PM5/18/14
to
Please don't assume an ideal LRU ordering.
A page newly allocated by the fair allocation policy sits at the head of
the LRU while old pages approach the tail, so there is a huge time gap
between them. During that time, old pages can be dropped or promoted, so
one end of the LRU could be filled with a single page type rather than
the interleaving of both types that you expect.

Additionally, if you use a synchronously backed device like ramdisk or
zram, pageout can be synchronous with the page I/O. In that case the
reclaim time is far from trivial compared to async I/O. For example, the
zram-swap case needs a page copy plus compression, and the speed depends
on your CPU.

>
> IMHO, in generally, memory shortage is made by movable allocation, so
> to distinguish allocation type and to handle them differently has
> marginal effect.

Again, please don't think only of the workloads you already know; keep
the design open to other possibilities, as long as doing so doesn't make
the code ugly.

>
> Anyway, I will think more deeply.

Yes, please.

>
> >
> > >
> > > >
> > > > And we couldn't do it in zone_watermark_ok with set/reset ALLOC_CMA?
> > > > If possible, it would be better becauser it's generic function to check
> > > > free pages and cause trigger reclaim/compaction logic.
> > >
> > > I guess, your *it* means ratio computation. Right?
> >
> > I meant just get_page_from_freelist like fair zone allocation for consistency
> > but as we discussed offline, i'm not against with you if it's not right place.
>
> Okay :)
>
> Thanks.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"do...@kvack.org"> em...@kvack.org </a>

--
Kind regards,
Minchan Kim

Joonsoo Kim

unread,
May 19, 2014, 12:50:02 AM5/19/14
to
I assumed the general case, not an ideal one.
Your example is possible, but it would be a corner case.

>
> Additionally, if you uses syncable backed device like ramdisk/zram
> or something, pageout can be synchronized with page I/O.
> In this case, reclaim time wouldn't be trivial than async I/O.
> For exmaple, zram-swap case, it needs page copy + comperssion and
> the speed depends on your CPU speed.

This is a general problem that zram-swap has, although reclaiming CMA
pages makes the situation worse.

Thanks.

Joonsoo Kim

unread,
May 19, 2014, 2:00:04 AM5/19/14
to
On Mon, May 19, 2014 at 10:47:12AM +0900, Gioh Kim wrote:
> Thank you for your advice. I didn't notice it.
>
> I'm adding followings according to your advice:
>
> - range restrict for CMA_SIZE_MBYTES and *CMA_SIZE_PERCENTAGE*
> I think this can prevent the wrong kernel option.
>
> - change size_cmdline into default value SZ_16M
> I am not sure this can prevent if cma=0 cmdline option is also with base and limit options.

Hello,

I think this problem originates from atomic_pool_init().
If the configured coherent_pool size is larger than the default CMA size,
it can still fail even if this patch is applied.

How about the patch below?
It falls back to the remapped-buffer allocation if the CMA allocation fails.

Thanks.

-----------------8<---------------------
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 6b00be1..2909ab9 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -379,7 +379,7 @@ static int __init atomic_pool_init(void)
unsigned long *bitmap;
struct page *page;
struct page **pages;
- void *ptr;
+ void *ptr = NULL;
int bitmap_size = BITS_TO_LONGS(nr_pages) * sizeof(long);

bitmap = kzalloc(bitmap_size, GFP_KERNEL);
@@ -393,7 +393,7 @@ static int __init atomic_pool_init(void)
if (IS_ENABLED(CONFIG_DMA_CMA))
ptr = __alloc_from_contiguous(NULL, pool->size, prot, &page,
atomic_pool_init);
- else
+ if (!ptr)
ptr = __alloc_remap_buffer(NULL, pool->size, gfp, prot, &page,
atomic_pool_init);
if (ptr) {

Gioh Kim

unread,
May 19, 2014, 5:20:04 AM5/19/14
to
In the __dma_alloc function, your patch makes the __alloc_from_pool path
work, but the __alloc_from_contiguous path still doesn't.
Therefore __dma_alloc sometimes works and sometimes doesn't, depending on
the gfp (__GFP_WAIT) flag. Do I understand correctly?

I think __dma_alloc should behave consistently: either both
__alloc_from_contiguous and __alloc_from_pool work, or neither of them does.


On 2014-05-19 14:55, Joonsoo Kim wrote:

Michal Nazarewicz

unread,
May 19, 2014, 4:10:02 PM5/19/14
to
On Sun, May 18 2014, Joonsoo Kim wrote:
> I think that this problem is originated from atomic_pool_init().
> If configured coherent_pool size is larger than default cma size,
> it can be failed even if this patch is applied.
>
> How about below patch?
> It uses fallback allocation if CMA is failed.

Yes, I thought about it, but __dma_alloc uses similar code:

else if (!IS_ENABLED(CONFIG_DMA_CMA))
addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller);
else
addr = __alloc_from_contiguous(dev, size, prot, &page, caller);

so it probably needs to be changed as well.
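
(Presumably something along these lines -- only a sketch, not a tested
patch, keeping the fallback inside the CMA branch so the earlier
!__GFP_WAIT pool path is not affected:)

	else if (!IS_ENABLED(CONFIG_DMA_CMA)) {
		addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller);
	} else {
		addr = __alloc_from_contiguous(dev, size, prot, &page, caller);
		/* fall back if the CMA area is missing or exhausted */
		if (!addr)
			addr = __alloc_remap_buffer(dev, size, gfp, prot,
						    &page, caller);
	}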

> -----------------8<---------------------
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index 6b00be1..2909ab9 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -379,7 +379,7 @@ static int __init atomic_pool_init(void)
> unsigned long *bitmap;
> struct page *page;
> struct page **pages;
> - void *ptr;
> + void *ptr = NULL;
> int bitmap_size = BITS_TO_LONGS(nr_pages) * sizeof(long);
>
> bitmap = kzalloc(bitmap_size, GFP_KERNEL);
> @@ -393,7 +393,7 @@ static int __init atomic_pool_init(void)
> if (IS_ENABLED(CONFIG_DMA_CMA))
> ptr = __alloc_from_contiguous(NULL, pool->size, prot, &page,
> atomic_pool_init);
> - else
> + if (!ptr)
> ptr = __alloc_remap_buffer(NULL, pool->size, gfp, prot, &page,
> atomic_pool_init);
> if (ptr) {
>

--

Minchan Kim

unread,
May 19, 2014, 7:20:02 PM5/19/14
to
I talked with Joonsoo yesterday and should post our conclusion for the
other reviewers/maintainers.

It's not a corner case; it can happen depending on the zone and CMA
configuration. For example, take a 330M highmem zone where CMA consumes
300M of the space while the normal movable area is just 30M. In that
case, unmovable allocations could cause far too much unnecessary
reclaiming of the zone, so the conclusion we reached is that we need
targeted reclaim (e.g. via isolate_mode_t).

But I'm not sure it should be part of this patchset. This patchset is
clearly an enhancement (before, it was hard to allocate pages from the
CMA area; this patchset makes that work), but it could introduce the
problem above as a side effect, so I think we could solve that issue
(i.e. too much reclaiming in an unbalanced zone) in another patchset.

Joonsoo, please mention this problem in the description when you respin,
so other MM folks can notice it and offer ideas; that would help a lot.

>
> >
> > Additionally, if you uses syncable backed device like ramdisk/zram
> > or something, pageout can be synchronized with page I/O.
> > In this case, reclaim time wouldn't be trivial than async I/O.
> > For exmaple, zram-swap case, it needs page copy + comperssion and
> > the speed depends on your CPU speed.
>
> This is a general problem what zram-swap have,
> although reclaiming cma pages worse the situation.
>
> Thanks.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"do...@kvack.org"> em...@kvack.org </a>

--
Kind regards,
Minchan Kim

Minchan Kim

unread,
May 19, 2014, 7:20:02 PM5/19/14
to
On Thu, May 15, 2014 at 11:45:31AM +0900, Heesub Shin wrote:
Heesub, just to be sure: did you try round-robin allocation like Joonsoo's
approach, and did you see this kind of LRU churning problem with it?

--
Kind regards,
Minchan Kim

Gioh Kim

unread,
May 19, 2014, 9:00:02 PM5/19/14
to


On 2014-05-20 04:59, Michal Nazarewicz wrote:
> On Sun, May 18 2014, Joonsoo Kim wrote:
>> I think that this problem is originated from atomic_pool_init().
>> If configured coherent_pool size is larger than default cma size,
>> it can be failed even if this patch is applied.

The coherent_pool size (atomic_pool.size) should be restricted to be smaller than the CMA size.

This is another issue, but I think the default atomic pool size is too small.
A single USB host port needs at most 256 KB of coherent memory (according to the USB host spec).
If a platform has several ports, it needs more than 1 MB (four ports at 256 KB each already add up to 1 MB).
Therefore the default atomic pool size should be at least 1 MB.

>>
>> How about below patch?
>> It uses fallback allocation if CMA is failed.
>
> Yes, I thought about it, but __dma_alloc uses similar code:
>
> else if (!IS_ENABLED(CONFIG_DMA_CMA))
> addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller);
> else
> addr = __alloc_from_contiguous(dev, size, prot, &page, caller);
>
> so it probably needs to be changed as well.

If the CMA option is not selected, __alloc_from_contiguous is never called,
so we don't need the fallback allocation.

And if the CMA option is selected and CMA is initialized correctly,
the CMA allocation can still fail when no CMA memory is left.
I think in that case we don't need the fallback allocation either,
because that is the normal case.

Therefore I think restricting the CMA size option, and making CMA work, can cover every case.

I also think the patch below is a good choice.
If either of you, Michal or Joonsoo, does not agree with me, please let me know.
I will make a patch including both the option restriction and the fallback allocation.

>
>> -----------------8<---------------------
>> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
>> index 6b00be1..2909ab9 100644
>> --- a/arch/arm/mm/dma-mapping.c
>> +++ b/arch/arm/mm/dma-mapping.c
>> @@ -379,7 +379,7 @@ static int __init atomic_pool_init(void)
>> unsigned long *bitmap;
>> struct page *page;
>> struct page **pages;
>> - void *ptr;
>> + void *ptr = NULL;
>> int bitmap_size = BITS_TO_LONGS(nr_pages) * sizeof(long);
>>
>> bitmap = kzalloc(bitmap_size, GFP_KERNEL);
>> @@ -393,7 +393,7 @@ static int __init atomic_pool_init(void)
>> if (IS_ENABLED(CONFIG_DMA_CMA))
>> ptr = __alloc_from_contiguous(NULL, pool->size, prot, &page,
>> atomic_pool_init);
>> - else
>> + if (!ptr)
>> ptr = __alloc_remap_buffer(NULL, pool->size, gfp, prot, &page,
>> atomic_pool_init);
>> if (ptr) {
>>
>
>
>
--

Michal Nazarewicz

unread,
May 19, 2014, 9:30:02 PM5/19/14
to
On Mon, May 19 2014, Gioh Kim wrote:
> If CMA option is not selected, __alloc_from_contiguous would not be
> called. We don't need to the fallback allocation.
>
> And if CMA option is selected and initialized correctly,
> the cma allocation can fail in case of no-CMA-memory situation.
> I thinks in that case we don't need to the fallback allocation also,
> because it is normal case.
>
> Therefore I think the restriction of CMA size option and make CMA work
> can cover every cases.

Wait, you just wrote that if CMA is not initialised correctly, it's fine
for atomic pool initialisation to fail, but if the CMA size is configured
correctly yet too small, that is somehow a worse situation? I'm a bit
confused, to be honest.

IMO, the cma=0 command line argument should be supported, as should
having a default CMA size of zero. If the CMA size is set to zero, the
kernel should behave as if CMA had not been enabled at compile time.

Gioh Kim

unread,
May 19, 2014, 10:30:02 PM5/19/14
to

On 2014-05-20 10:28, Michal Nazarewicz wrote:
> On Mon, May 19 2014, Gioh Kim wrote:
>> If CMA option is not selected, __alloc_from_contiguous would not be
>> called. We don't need to the fallback allocation.
>>
>> And if CMA option is selected and initialized correctly,
>> the cma allocation can fail in case of no-CMA-memory situation.
>> I thinks in that case we don't need to the fallback allocation also,
>> because it is normal case.
>>
>> Therefore I think the restriction of CMA size option and make CMA work
>> can cover every cases.
>
> Wait, you just wrote that if CMA is not initialised correctly, it's fine
> for atomic pool initialisation to fail, but if CMA size is initialised
> correctly but too small, this is somehow worse situation? I'm a bit
> confused to be honest.

I'm sorry for the confusion, and please forgive my poor English.
My point is that atomic_pool should be able to work both with and
without CMA.


>
> IMO, cma=0 command line argument should be supported, as should having
> the default CMA size zero. If CMA size is set to zero, kernel should
> behave as if CMA was not enabled at compile time.

It's also good if atomic_pool can work well with zero CMA size.

I can give up my patch.
But Joonsoo's patch should be applied.

Joonsoo, can you please send the full patch to maintainers?

Joonsoo Kim

unread,
May 20, 2014, 2:40:02 AM5/20/14
to
Okay. Will do :)

Thanks.

Marek Szyprowski

unread,
May 20, 2014, 7:40:02 AM5/20/14
to
Hello,

On 2014-05-20 02:50, Gioh Kim wrote:
>
>
> On 2014-05-20 04:59, Michal Nazarewicz wrote:
>> On Sun, May 18 2014, Joonsoo Kim wrote:
>>> I think that this problem is originated from atomic_pool_init().
>>> If configured coherent_pool size is larger than default cma size,
>>> it can be failed even if this patch is applied.
>
> The coherent_pool size (atomic_pool.size) should be restricted smaller
> than cma size.
>
> This is another issue, however I think the default atomic pool size is
> too small.
> Only one port of USB host needs at most 256Kbytes coherent memory
> (according to the USB host spec).

This pool is used only for allocations done in atomic context (i.e.
with the GFP_ATOMIC flag); otherwise the standard allocation path is
used. Are you sure that each USB host port really needs that much
memory allocated in atomic context?

> If a platform has several ports, it needs more than 1MB.
> Therefore the default atomic pool size should be at least 1MB.
>
>>>
>>> How about below patch?
>>> It uses fallback allocation if CMA is failed.
>>
>> Yes, I thought about it, but __dma_alloc uses similar code:
>>
>> else if (!IS_ENABLED(CONFIG_DMA_CMA))
>> addr = __alloc_remap_buffer(dev, size, gfp, prot, &page,
>> caller);
>> else
>> addr = __alloc_from_contiguous(dev, size, prot, &page, caller);
>>
>> so it probably needs to be changed as well.
>
> If CMA option is not selected, __alloc_from_contiguous would not be
> called.
> We don't need to the fallback allocation.
>
> And if CMA option is selected and initialized correctly,
> the cma allocation can fail in case of no-CMA-memory situation.
> I thinks in that case we don't need to the fallback allocation also,
> because it is normal case.
>
> Therefore I think the restriction of CMA size option and make CMA work
> can cover every cases.
>
> I think below patch is also good choice.
> If both of you, Michal and Joonsoo, do not agree with me, please
> inform me.
> I will make a patch including option restriction and fallback allocation.

I'm not sure if we need a fallback for a failed CMA allocation. The
only issue that has been mentioned here and needs to be resolved is
support for disabling CMA via the kernel command line. Right now that
fails completely.

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland

Michal Nazarewicz

unread,
May 20, 2014, 2:20:02 PM5/20/14
to
On Mon, May 19 2014, Gioh Kim wrote:
> My point is atomic_pool should be able to work with/without CMA.

Agreed.

>> IMO, cma=0 command line argument should be supported, as should having
>> the default CMA size zero. If CMA size is set to zero, kernel should
>> behave as if CMA was not enabled at compile time.

> It's also good if atomic_pool can work well with zero CMA size.

Exactly.

Gioh Kim

unread,
May 20, 2014, 8:20:02 PM5/20/14
to


On 2014-05-20 20:38, Marek Szyprowski wrote:
> Hello,
>
> On 2014-05-20 02:50, Gioh Kim wrote:
>>
>>
>> On 2014-05-20 04:59, Michal Nazarewicz wrote:
>>> On Sun, May 18 2014, Joonsoo Kim wrote:
>>>> I think that this problem is originated from atomic_pool_init().
>>>> If configured coherent_pool size is larger than default cma size,
>>>> it can be failed even if this patch is applied.
>>
>> The coherent_pool size (atomic_pool.size) should be restricted smaller than cma size.
>>
>> This is another issue, however I think the default atomic pool size is too small.
>> Only one port of USB host needs at most 256Kbytes coherent memory (according to the USB host spec).
>
> This pool is used only for allocation done in atomic context (allocations
> done with GFP_ATOMIC flag), otherwise the standard allocation path is used.
> Are you sure that each usb host port really needs so much memory allocated
> in atomic context?

I don't know why, but drivers/usb/host/ohci-hcd.c:ohci_init() calls
dma_alloc_coherent() with a zero gfp mask. Therefore it breaks if CMA is
turned on and CONFIG_CMA_SIZE_MBYTES is zero: pool->vaddr is NULL in
__alloc_from_pool, and the probe fails (see the kernel messages below).
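
(For context, the routing in the 3.10-era arch/arm/mm/dma-mapping.c
__dma_alloc looks roughly like the sketch below -- quoted from memory,
so the exact shape may differ. A gfp of zero has no __GFP_WAIT, so
ohci_init() lands in the atomic-pool path, and with a zero CMA size the
pool was never filled:)

	if (is_coherent || nommu())
		addr = __alloc_simple_buffer(dev, size, gfp, &page);
	else if (!(gfp & __GFP_WAIT))
		addr = __alloc_from_pool(size, &page);	/* gfp == 0 ends up here */
	else if (!IS_ENABLED(CONFIG_DMA_CMA))
		addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller);
	else
		addr = __alloc_from_contiguous(dev, size, prot, &page, caller);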

Below is my kernel message.
[ 24.339858] -----------[ cut here ]-----------
[ 24.344535] WARNING: at arch/arm/mm/dma-mapping.c:492 __dma_alloc.isra.19+0x25c/0x2a4()
[ 24.352554] coherent pool not initialised!
[ 24.356644] Modules linked in:
[ 24.359701] CPU: 1 PID: 711 Comm: sh Not tainted 3.10.19+ #42
[ 24.365488] [<800140e0>] (unwind_backtrace+0x0/0xf8) from [<80011f20>] (show_stack+0x10/0x14)
[ 24.374045] [<80011f20>] (show_stack+0x10/0x14) from [<8001f21c>] (warn_slowpath_common+0x4c/0x6c)
[ 24.383022] [<8001f21c>] (warn_slowpath_common+0x4c/0x6c) from [<8001f2d0>] (warn_slowpath_fmt+0x30/0x40)
[ 24.392602] [<8001f2d0>] (warn_slowpath_fmt+0x30/0x40) from [<80017f5c>] (__dma_alloc.isra.19+0x25c/0x2a4)
[ 24.402270] [<80017f5c>] (__dma_alloc.isra.19+0x25c/0x2a4) from [<800180d0>] (arm_dma_alloc+0x90/0x98)
[ 24.411580] [<800180d0>] (arm_dma_alloc+0x90/0x98) from [<8034ab54>] (ohci_init+0x1b0/0x278)
[ 24.420035] [<8034ab54>] (ohci_init+0x1b0/0x278) from [<80332e00>] (usb_add_hcd+0x184/0x5b8)
[ 24.428484] [<80332e00>] (usb_add_hcd+0x184/0x5b8) from [<8034b8d4>] (ohci_platform_probe+0xd0/0x174)
[ 24.437713] [<8034b8d4>] (ohci_platform_probe+0xd0/0x174) from [<802f1cac>] (platform_drv_probe+0x14/0x18)
[ 24.447385] [<802f1cac>] (platform_drv_probe+0x14/0x18) from [<802f0a54>] (driver_probe_device+0x6c/0x1f8)
[ 24.457049] [<802f0a54>] (driver_probe_device+0x6c/0x1f8) from [<802ef16c>] (bus_for_each_drv+0x44/0x8c)
[ 24.466537] [<802ef16c>] (bus_for_each_drv+0x44/0x8c) from [<802f09bc>] (device_attach+0x74/0x80)
[ 24.475416] [<802f09bc>] (device_attach+0x74/0x80) from [<802f0050>] (bus_probe_device+0x84/0xa8)
[ 24.484295] [<802f0050>] (bus_probe_device+0x84/0xa8) from [<802ee89c>] (device_add+0x4c0/0x58c)
[ 24.493088] [<802ee89c>] (device_add+0x4c0/0x58c) from [<802f21b8>] (platform_device_add+0xac/0x214)
[ 24.502227] [<802f21b8>] (platform_device_add+0xac/0x214) from [<8001bf3c>] (lg115x_init_usb+0xbc/0xe4)
[ 24.511618] [<8001bf3c>] (lg115x_init_usb+0xbc/0xe4) from [<80008734>] (do_user_initcalls+0x98/0x128)
[ 24.520843] [<80008734>] (do_user_initcalls+0x98/0x128) from [<80008870>] (proc_write_usercalls+0xac/0xd0)
[ 24.530512] [<80008870>] (proc_write_usercalls+0xac/0xd0) from [<80138f48>] (proc_reg_write+0x58/0x80)
[ 24.539830] [<80138f48>] (proc_reg_write+0x58/0x80) from [<800f0084>] (vfs_write+0xb0/0x1bc)
[ 24.548275] [<800f0084>] (vfs_write+0xb0/0x1bc) from [<800f04d0>] (SyS_write+0x3c/0x70)
[ 24.556287] [<800f04d0>] (SyS_write+0x3c/0x70) from [<8000e5c0>] (ret_fast_syscall+0x0/0x30)
[ 24.564726] --[ end trace c092568e2a263d21 ]--
[ 24.569345] ohci-platform ohci-platform.0: can't setup
[ 24.574498] ohci-platform ohci-platform.0: USB bus 1 deregistered
[ 24.582241] ohci-platform: probe of ohci-platform.0 failed with error -12
[ 24.590496] ohci-platform ohci-platform.1: Generic Platform OHCI Controller
[ 24.598984] ohci-platform ohci-platform.1: new USB bus registered, assigned bus number 1

>
>> If a platform has several ports, it needs more than 1MB.
>> Therefore the default atomic pool size should be at least 1MB.
>>
>>>>
>>>> How about below patch?
>>>> It uses fallback allocation if CMA is failed.
>>>
>>> Yes, I thought about it, but __dma_alloc uses similar code:
>>>
>>> else if (!IS_ENABLED(CONFIG_DMA_CMA))
>>> addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller);
>>> else
>>> addr = __alloc_from_contiguous(dev, size, prot, &page, caller);
>>>
>>> so it probably needs to be changed as well.
>>
>> If CMA option is not selected, __alloc_from_contiguous would not be called.
>> We don't need to the fallback allocation.
>>
>> And if CMA option is selected and initialized correctly,
>> the cma allocation can fail in case of no-CMA-memory situation.
>> I thinks in that case we don't need to the fallback allocation also,
>> because it is normal case.
>>
>> Therefore I think the restriction of CMA size option and make CMA work can cover every cases.
>>
>> I think below patch is also good choice.
>> If both of you, Michal and Joonsoo, do not agree with me, please inform me.
>> I will make a patch including option restriction and fallback allocation.
>
> I'm not sure if we need a fallback for failed CMA allocation. The only issue that
> have been mentioned here and needs to be resolved is support for disabling cma by
> kernel command line. Right now it will fails completely.

With cma=0 on the kernel command line, or with CONFIG_CMA_SIZE_MBYTES set
to 0, selected_size ends up as zero in dma_contiguous_reserve, so
dma_contiguous_reserve_area is never called and atomic_pool is not
initialized.

After that, dma_alloc_coherent tries to allocate via the atomic pool
(__alloc_from_pool) or via CMA (__alloc_from_contiguous). Allocation via
the atomic pool fails because atomic_pool->vaddr is NULL. And the CMA
allocation shouldn't be attempted either, because cma=0 or setting
CONFIG_CMA_SIZE_MBYTES to 0 is the same as disabling CMA. If cma=0 or
CONFIG_CMA_SIZE_MBYTES is 0, __alloc_remap_buffer should be called
instead of __alloc_from_contiguous, even if CMA is compiled in.

I'm poor at English, so I describe the problem in pseudo code:

if (CMA is compiled in) and ((cma=0 on the command line) or (CONFIG_CMA_SIZE_MBYTES=0))
	try to allocate from CMA, but CMA is not initialized
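
(One possible shape for such a guard, as a sketch only -- whether
dev_get_cma_area() is the right way to detect that no area was actually
reserved is an assumption on my part, not something settled in this
thread:)

	/* sketch: skip the CMA path when no CMA area was reserved */
	else if (IS_ENABLED(CONFIG_DMA_CMA) && dev_get_cma_area(dev))
		addr = __alloc_from_contiguous(dev, size, prot, &page, caller);
	else
		addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller);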



>
> Best regards

Laura Abbott

unread,
May 23, 2014, 9:00:02 PM5/23/14
to
On 5/12/2014 10:04 AM, Laura Abbott wrote:
>
> I'm going to see about running this through tests internally for comparison.
> Hopefully I'll get useful results in a day or so.
>
> Thanks,
> Laura
>

We ran some tests internally and found that for our purposes these patches made
the benchmarks worse vs. the existing implementation of using CMA first for some
pages. These are mostly androidisms but androidisms that we care about for
having a device be useful.

The foreground memory headroom on the device was on average about 40 MB smaller
when using these patches vs our existing implementation of something like
solution #1. By foreground memory headroom we simply mean the amount of memory
that the foreground application can allocate before it is killed by the Android
Low Memory killer.

We also found that when running a sequence of app launches these patches had
more high-priority app kills by the LMK and more alloc stalls. The test did a
total of 500 app launches (using 9 separate applications). The CMA memory in
our system is rarely used by its client and is therefore available to the
system most of the time.

Test device
- 4 CPUs
- Android 4.4.2
- 512MB of RAM
- 68 MB of CMA


Results:

Existing solution:
Foreground headroom: 200MB
Number of higher priority LMK kills (oom_score_adj < 529): 332
Number of alloc stalls: 607


Test patches:
Foreground headroom: 160MB
Number of higher priority LMK kills (oom_score_adj < 529): 459
Number of alloc stalls: 29538

We believe that the issues seen with these patches are the result of the LMK
being more aggressive. The LMK will be more aggressive because it will ignore
free CMA pages for unmovable allocations, and since most calls to the LMK are
made by kswapd (which uses GFP_KERNEL) the LMK will mostly ignore free CMA
pages. Because the LMK thresholds are higher than the zone watermarks, there
will often be a lot of free CMA pages in the system when the LMK is called,
which the LMK will usually ignore.
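
(Roughly speaking, the discounting works like the sketch below; this is
a simplified illustration of the behaviour described above, not the
actual lowmemorykiller code:)

	/* "free" memory as seen by an LMK-style shrinker */
	long other_free = global_page_state(NR_FREE_PAGES);

	/*
	 * kswapd reclaims with GFP_KERNEL, which is not __GFP_MOVABLE,
	 * so the free CMA pages are discounted and the killer sees
	 * less apparent headroom, firing earlier.
	 */
	if (!(gfp_mask & __GFP_MOVABLE))
		other_free -= global_page_state(NR_FREE_CMA_PAGES);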

Thanks,
Laura

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

Joonsoo Kim

unread,
May 25, 2014, 10:50:01 PM5/25/14
to
Hello,

Thanks a lot for testing!
If possible, please let me know nr_free_cma for both these patches and
your in-house implementation before the test starts.

I can guess at the following scenario for your test.

At boot-up, CMA memory is mostly used by native processes, because your
implementation uses CMA first for some pages. kswapd is woken up later
than with my implementation, since more non-CMA memory is free. And
during reclaim, the LMK, which reclaims memory by killing app processes,
would reclaim movable memory with high probability, since the CMA memory
is mostly held by native processes and app processes have only movable
memory.

This is just my guess, but if it is true, this is not a fair test for
this patchset. If possible, could you make nr_free_cma the same in both
implementations before testing?

Moreover, the mainline LMK implementation doesn't consider whether
memory is CMA or not. Your overall system may be highly optimized for
your own implementation, so I'm not sure whether your test setup is
appropriate for this patchset.

Anyway, I would like to optimize this for Android. :)
Please let me know more about your system.

Thanks.