question on page_count(sg->page) >=2?

5 views
Skip to first unread message

Wang Zhenyu

unread,
Nov 18, 2005, 3:24:51 AM11/18/05
to Open-iSCSI

I may lose the context for this, but do we use it to identifiy
slab pages? I'm not sure about the whole code path, so I don't know if
this ref count is right. Shouldn't use !PageSlab(page) for this?

zhen

Mike Christie

unread,
Nov 18, 2005, 5:55:07 PM11/18/05
to open-...@googlegroups.com
Do think it is possible to have the block layer just bounce those pages
for us. Simiar to when a HW LLD sets the dma_mask for pages that it
cannot dma to.

Mike Christie

unread,
Nov 22, 2005, 1:38:57 AM11/22/05
to open-...@googlegroups.com
Oh yeah zhen, in that code path, do you think we are going to hit that
page_address on highmem pages or is it impossible? Accesing page_address
on a highmem page is going to lead to some wierd stuff.

Wang Zhenyu

unread,
Nov 22, 2005, 2:11:54 AM11/22/05
to open-...@googlegroups.com
we don't set any bounce limit with scsi host, so I think it's possible.
But I'm not sure if page_address() on highmem page is really an issue...
you mean for net stack or net driver?

Mike Christie

unread,
Nov 22, 2005, 2:25:10 AM11/22/05
to open-...@googlegroups.com
What? It is :)

> you mean for net stack or net driver?
>

iscsi_tcp.c

when we hit this code snippet:

if (sg->length + sg->offset <= PAGE_SIZE &&
page_count(sg->page) >= 2) { ibuf->sg.page = sg->page;
ibuf->sg.offset = sg->offset;
ibuf->sg.length = sg->length;
} else
iscsi_buf_init_iov(ibuf, page_address(sg->page),
sg->length);


If we hit the "else" we call page_address on a page that might be a
highmem page. Doing this will lead to blow ups. I think becuase the LLD
is not setting clustering the sg->lenght + sg->offset will always be
less than a PAGE (maybe for tape and other ULDs still injecting commands
that do not obey the queue limits this will be a problem) so that leaves
the page_count check. And I was trying to ask what the page_count is
normally and what it is for XFS and what it was trying to catch (I
cannot seem to find the thread which explained it). If we hit the else
then for highmem boxes we just passed iscsi_buf_init_iov a invalid pointer.

Are you saying above it is not a issue becuase page_count will always be
greater than 2 so we never hit the "else"?


Mike Christie

unread,
Nov 22, 2005, 2:36:05 AM11/22/05
to open-...@googlegroups.com
it looks like if we hit this we will also not set the offset if there
was one.


Wang Zhenyu

unread,
Nov 22, 2005, 2:48:31 AM11/22/05
to open-...@googlegroups.com
ok. That's why I asked for page_count value, because I can't find a reasonable
answer myself. :)

>
> Are you saying above it is not a issue becuase page_count will always be
> greater than 2 so we never hit the "else"?

I do see hit on my hightmem box sometimes.

Wang Zhenyu

unread,
Nov 22, 2005, 2:50:30 AM11/22/05
to open-...@googlegroups.com
yeah!

Wang Zhenyu

unread,
Nov 22, 2005, 3:38:32 AM11/22/05
to open-...@googlegroups.com
sorry for confusing, I mean I saw page_count == 1 else case on none highmem page.
For highmem page, does it will always be >=2?

Mike Christie

unread,
Nov 29, 2005, 4:32:44 PM11/29/05
to open-...@googlegroups.com
I have not gotton a highmem page that hits the page_address path, but
for ext3 it seems like I almost always fall back to the slow path
becuase of the page_count test.

FUJITA Tomonori

unread,
Dec 3, 2005, 1:08:33 PM12/3/05
to open-...@googlegroups.com
From: Mike Christie <mich...@cs.wisc.edu>
Subject: Re: question on page_count(sg->page) >=2?
Date: Tue, 29 Nov 2005 15:32:44 -0600

> > sorry for confusing, I mean I saw page_count == 1 else case on none highmem page.
> > For highmem page, does it will always be >=2?
>
>
> I have not gotton a highmem page that hits the page_address path, but
> for ext3 it seems like I almost always fall back to the slow path
> becuase of the page_count test.

It seems that this discussion does not finish, doesn't it?

As far as I know, ext3 does I/O operations only through page cache. So
I'm not sure that the page_count test fails. If the test fails, we are
in deep trouble with highmem because the slow path assume that pages
always have valid an address.

Here, we want to use sendpage only for page cache so checking
page->mapping would be better than the page_count test, I guess.

I don't think that highmem pages hit the slow path, however, BUG_ON
for invalid page addresses would be nice in case.


Index: kernel/iscsi_tcp.c
===================================================================
--- kernel/iscsi_tcp.c (revision 445)
+++ kernel/iscsi_tcp.c (working copy)
@@ -109,12 +109,14 @@
/*
* Fastpath: sg element fits into single page
*/
- if (sg->length + sg->offset <= PAGE_SIZE && page_count(sg->page) >= 2) {
+ if (sg->length + sg->offset <= PAGE_SIZE && sg->page->mapping) {
ibuf->sg.page = sg->page;
ibuf->sg.offset = sg->offset;
ibuf->sg.length = sg->length;
- } else
+ } else {
+ BUG_ON(!page_address(sg->page));
iscsi_buf_init_iov(ibuf, page_address(sg->page), sg->length);
+ }
ibuf->sent = 0;
}

Mike Christie

unread,
Dec 4, 2005, 12:36:51 PM12/4/05
to open-...@googlegroups.com
FUJITA Tomonori wrote:
> From: Mike Christie <mich...@cs.wisc.edu>
> Subject: Re: question on page_count(sg->page) >=2?
> Date: Tue, 29 Nov 2005 15:32:44 -0600
>
>
>>>sorry for confusing, I mean I saw page_count == 1 else case on none highmem page.
>>>For highmem page, does it will always be >=2?
>>
>>
>>I have not gotton a highmem page that hits the page_address path, but
>>for ext3 it seems like I almost always fall back to the slow path
>>becuase of the page_count test.
>
>
> It seems that this discussion does not finish, doesn't it?

No :)

>
> As far as I know, ext3 does I/O operations only through page cache. So
> I'm not sure that the page_count test fails. If the test fails, we are
> in deep trouble with highmem because the slow path assume that pages
> always have valid an address.

Stick in some printks. page_count is greater than 2 on many operations
so we hit the slow path a lot of ext3.

>
> Here, we want to use sendpage only for page cache so checking
> page->mapping would be better than the page_count test, I guess.

I thought the only reason we have that code is becuae XFS and maybe
reiser are sending slab pages. Wouldn't Zhen' test for slab pages be
better if that is all we are trying to catch? Or are you saying it is a
much more common problem (not just slab pages) and so the page->mapping
test will always catch the problem.

FUJITA Tomonori

unread,
Dec 4, 2005, 11:02:21 PM12/4/05
to open-...@googlegroups.com, mich...@cs.wisc.edu
From: Mike Christie <mich...@cs.wisc.edu>
Subject: Re: question on page_count(sg->page) >=2?
Date: Sun, 04 Dec 2005 11:36:51 -0600

> > As far as I know, ext3 does I/O operations only through page cache. So
> > I'm not sure that the page_count test fails. If the test fails, we are
> > in deep trouble with highmem because the slow path assume that pages
> > always have valid an address.
>
> Stick in some printks. page_count is greater than 2 on many operations
> so we hit the slow path a lot of ext3.
>
> >
> > Here, we want to use sendpage only for page cache so checking
> > page->mapping would be better than the page_count test, I guess.
>
> I thought the only reason we have that code is becuae XFS and maybe
> reiser are sending slab pages. Wouldn't Zhen' test for slab pages be
> better if that is all we are trying to catch? Or are you saying it is a
> much more common problem (not just slab pages) and so the page->mapping
> test will always catch the problem.

Sorry. I forgot about the pages of deleted inodes. The page_count test
may fail with ext3 because of that. And page->mapping stuff does not
work too because of that.

Using slab pages is not the root cause. ext3 also uses slab pages
(kmalloced pages) for frozen_data. We can handle this. The problem is
that XFS's way to use slab pages. However, there is no way to
distinguish slab pages used in that way from others. So there seems to
be no better options now. The page_count test doesn't work even with
ext3 on highmem kernels. So we need a workaround now.

Maybe we had better not use sendpage like core-iscsi.


Index: kernel/iscsi_tcp.c
===================================================================
--- kernel/iscsi_tcp.c (revision 446)
+++ kernel/iscsi_tcp.c (working copy)
@@ -109,12 +109,14 @@
/*
* Fastpath: sg element fits into single page
*/
- if (sg->length + sg->offset <= PAGE_SIZE && page_count(sg->page) >= 2) {
+ if (sg->length + sg->offset <= PAGE_SIZE && !PageSlab(sg->page)) {

Mike Christie

unread,
Dec 5, 2005, 3:43:01 AM12/5/05
to open-...@googlegroups.com, Christoph Hellwig
What we really need to do is complete whatever discussion Christoph
wanted the nbd people to do. I do not think LLDs should have to test for
page_count, PageIsHighMem or PageSlab or anything like that :( We are
just LLDs writers. We should not have to know how different FSs uses
whatever page. kmap for highmem pages is nice for the lowly LLD writer
because it just does the right thing for us :)

Christoph what did you want the NBD people to do?

Mike Christie

unread,
Dec 5, 2005, 3:45:31 AM12/5/05
to open-...@googlegroups.com
Oh yeah for now though I am fine with this. I will run it against some
XFS and reiserfs setups I made at work to test it out.

FUJITA Tomonori

unread,
Dec 5, 2005, 8:21:21 PM12/5/05
to open-...@googlegroups.com, mich...@cs.wisc.edu
From: FUJITA Tomonori <fujita....@lab.ntt.co.jp>
Subject: Re: question on page_count(sg->page) >=2?
Date: Mon, 05 Dec 2005 13:02:21 +0900

> > I thought the only reason we have that code is becuae XFS and maybe
> > reiser are sending slab pages. Wouldn't Zhen' test for slab pages be
> > better if that is all we are trying to catch? Or are you saying it is a
> > much more common problem (not just slab pages) and so the page->mapping
> > test will always catch the problem.
>
> Sorry. I forgot about the pages of deleted inodes. The page_count test
> may fail with ext3 because of that. And page->mapping stuff does not
> work too because of that.
>
> Using slab pages is not the root cause. ext3 also uses slab pages
> (kmalloced pages) for frozen_data. We can handle this. The problem is
> that XFS's way to use slab pages. However, there is no way to
> distinguish slab pages used in that way from others. So there seems to

I have to correct my statement. We can handle ext3's kmalloced pages
if the block size is 4096 bytes. We cannot if it is 1024 or 2048
bytes.


From: Mike Christie <mich...@cs.wisc.edu>
Subject: Re: question on page_count(sg->page) >=2?
Date: Mon, 05 Dec 2005 02:43:01 -0600

> What we really need to do is complete whatever discussion Christoph
> wanted the nbd people to do. I do not think LLDs should have to test for
> page_count, PageIsHighMem or PageSlab or anything like that :( We are
> just LLDs writers. We should not have to know how different FSs uses
> whatever page. kmap for highmem pages is nice for the lowly LLD writer
> because it just does the right thing for us :)

Yep. We don't want such tests though we can live with ext3 and xfs by
using the slab page test and the page_count test. I guess that it
would be preferable to not use sendpage like core-iscsi and
nbd. sendpage is designed for particular pages. As you said,
open-iscsi is just a LLD. So we don't except anything about pages that
come from upper layers.

Mike Christie

unread,
Dec 6, 2005, 1:30:15 PM12/6/05
to Christoph Hellwig, open-...@googlegroups.com, Wang Zhenyu
Christoph Hellwig wrote:

> On Mon, Dec 05, 2005 at 02:43:01AM -0600, Mike Christie wrote:
>
>>What we really need to do is complete whatever discussion Christoph
>>wanted the nbd people to do. I do not think LLDs should have to test for
>>page_count, PageIsHighMem or PageSlab or anything like that :( We are
>>just LLDs writers. We should not have to know how different FSs uses
>>whatever page. kmap for highmem pages is nice for the lowly LLD writer
>>because it just does the right thing for us :)
>>
>>Christoph what did you want the NBD people to do?
>
>
> Right now there's no proper way to the block driver to do the right thing.
> We either need to forbid sending down non-refcounted pages or declare a way
> to communicated that we did send down such a page. I'd prefer the first
> variant. We really need to bring this issue up on lkml, best with patches.
>
> Would anyone vounteer to fix ext3, I'll take care of xfs.

I can do it when I get a chance. I am not familar with Filesystems stuff
- was looking for a project to start learning that stuff though. I have
that scatterlist stuff I am trying to get out of the way right now. Zhen
this is your thread :) do you have time to do this or should we just see
who gets free first?

FUJITA Tomonori

unread,
Dec 6, 2005, 6:27:51 PM12/6/05
to open-...@googlegroups.com, h...@infradead.org, zhenyu...@intel.com
From: Mike Christie <mich...@cs.wisc.edu>
Subject: Re: question on page_count(sg->page) >=2?
Date: Tue, 06 Dec 2005 12:30:15 -0600

Here's a patch for ext3. We don't need the page_count or the slab page
test with this.

diff -u --new-file --recursive -x GPATH -x GRTAGS -x GSYMS -x GTAGS -x CVS -x .svn -x .config -x .config.cmd -x .config.old linux-2.6.15-rc4/fs/jbd/commit.c linux-2.6.15-rc4.jbd/fs/jbd/commit.c
--- linux-2.6.15-rc4/fs/jbd/commit.c 2005-12-01 19:05:34.000000000 +0900
+++ linux-2.6.15-rc4.jbd/fs/jbd/commit.c 2005-12-07 07:50:58.000000000 +0900
@@ -261,7 +261,7 @@
struct buffer_head *bh = jh2bh(jh);

jbd_lock_bh_state(bh);
- kfree(jh->b_committed_data);
+ free_page((unsigned long) jh->b_committed_data);
jh->b_committed_data = NULL;
jbd_unlock_bh_state(bh);
}
@@ -745,14 +745,14 @@
* Otherwise, we can just throw away the frozen data now.
*/
if (jh->b_committed_data) {
- kfree(jh->b_committed_data);
+ free_page((unsigned long) jh->b_committed_data);
jh->b_committed_data = NULL;
if (jh->b_frozen_data) {
jh->b_committed_data = jh->b_frozen_data;
jh->b_frozen_data = NULL;
}
} else if (jh->b_frozen_data) {
- kfree(jh->b_frozen_data);
+ free_page((unsigned long) jh->b_frozen_data);
jh->b_frozen_data = NULL;
}

diff -u --new-file --recursive -x GPATH -x GRTAGS -x GSYMS -x GTAGS -x CVS -x .svn -x .config -x .config.cmd -x .config.old linux-2.6.15-rc4/fs/jbd/journal.c linux-2.6.15-rc4.jbd/fs/jbd/journal.c
--- linux-2.6.15-rc4/fs/jbd/journal.c 2005-12-01 19:05:34.000000000 +0900
+++ linux-2.6.15-rc4.jbd/fs/jbd/journal.c 2005-12-07 08:02:08.000000000 +0900
@@ -328,10 +328,10 @@
char *tmp;

jbd_unlock_bh_state(bh_in);
- tmp = jbd_rep_kmalloc(bh_in->b_size, GFP_NOFS);
+ tmp = (char *) __get_free_page(GFP_NOFS | __GFP_NOFAIL);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
- kfree(tmp);
+ free_page((unsigned long) tmp);
goto repeat;
}

@@ -1799,13 +1799,13 @@
printk(KERN_WARNING "%s: freeing "
"b_frozen_data\n",
__FUNCTION__);
- kfree(jh->b_frozen_data);
+ free_page((unsigned long) jh->b_frozen_data);
}
if (jh->b_committed_data) {
printk(KERN_WARNING "%s: freeing "
"b_committed_data\n",
__FUNCTION__);
- kfree(jh->b_committed_data);
+ free_page((unsigned long) jh->b_committed_data);
}
bh->b_private = NULL;
jh->b_bh = NULL; /* debug, really */
diff -u --new-file --recursive -x GPATH -x GRTAGS -x GSYMS -x GTAGS -x CVS -x .svn -x .config -x .config.cmd -x .config.old linux-2.6.15-rc4/fs/jbd/transaction.c linux-2.6.15-rc4.jbd/fs/jbd/transaction.c
--- linux-2.6.15-rc4/fs/jbd/transaction.c 2005-12-01 19:05:34.000000000 +0900
+++ linux-2.6.15-rc4.jbd/fs/jbd/transaction.c 2005-12-07 07:49:45.000000000 +0900
@@ -665,8 +665,8 @@
if (!frozen_buffer) {
JBUFFER_TRACE(jh, "allocate memory for buffer");
jbd_unlock_bh_state(bh);
- frozen_buffer = jbd_kmalloc(jh2bh(jh)->b_size,
- GFP_NOFS);
+ frozen_buffer =
+ (char *) jbd_get_free_page(GFP_NOFS);
if (!frozen_buffer) {
printk(KERN_EMERG
"%s: OOM for frozen_buffer\n",
@@ -724,7 +724,8 @@
journal_cancel_revoke(handle, jh);

out:
- kfree(frozen_buffer);
+ if (frozen_buffer)
+ free_page((unsigned long) frozen_buffer);

JBUFFER_TRACE(jh, "exit");
return error;
@@ -877,7 +878,7 @@

repeat:
if (!jh->b_committed_data) {
- committed_data = jbd_kmalloc(jh2bh(jh)->b_size, GFP_NOFS);
+ committed_data = (char *) jbd_get_free_page(GFP_NOFS);
if (!committed_data) {
printk(KERN_EMERG "%s: No memory for committed data\n",
__FUNCTION__);
@@ -903,7 +904,8 @@
jbd_unlock_bh_state(bh);
out:
journal_put_journal_head(jh);
- kfree(committed_data);
+ if (committed_data)
+ free_page((unsigned long) committed_data);
return err;
}

diff -u --new-file --recursive -x GPATH -x GRTAGS -x GSYMS -x GTAGS -x CVS -x .svn -x .config -x .config.cmd -x .config.old linux-2.6.15-rc4/include/linux/jbd.h linux-2.6.15-rc4.jbd/include/linux/jbd.h
--- linux-2.6.15-rc4/include/linux/jbd.h 2005-12-01 19:05:36.000000000 +0900
+++ linux-2.6.15-rc4.jbd/include/linux/jbd.h 2005-12-07 07:47:02.000000000 +0900
@@ -70,8 +70,8 @@
extern void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
#define jbd_kmalloc(size, flags) \
__jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
-#define jbd_rep_kmalloc(size, flags) \
- __jbd_kmalloc(__FUNCTION__, (size), (flags), 1)
+#define jbd_get_free_page(flags) \
+ __get_free_page((flags) | journal_oom_retry ? __GFP_NOFAIL : 0)

#define JFS_MIN_JOURNAL_BLOCKS 1024

Wang Zhenyu

unread,
Dec 6, 2005, 10:43:37 PM12/6/05
to Mike Christie, Christoph Hellwig, open-...@googlegroups.com

hey I'm just like you, otherwise I wouldn't raise this concern. :)
Seems Tomonori have a good catch already.

Reply all
Reply to author
Forward
0 new messages