
[BUG][ext2] XIP does not work on ext2


Andiry Xu
Nov 4, 2013, 5:40:02 PM
Hi,

While trying XIP on ext2, I found that XIP does not work with the
latest kernel.

Steps to reproduce:
Compile the kernel with the following config options:
CONFIG_BLK_DEV_XIP=y
CONFIG_EXT2_FS_XIP=y

Then run the following commands:
# mke2fs -b 4096 /dev/ram0
# mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
# dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16

dd fails with:
dd: writing `/mnt/ramdisk/test1': No space left on device

df also shows /mnt/ramdisk is 100% full. Its default size is 64MB, so a
16MB write should occupy only a quarter of the capacity.

Offending commit:
git bisect points to the following commit:
8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
Ext2: mark inode dirty after the function dquot_free_block_nodirty is called

Particularly, the following code:
@@ -1412,9 +1415,11 @@ allocated:
*errp = 0;
brelse(bitmap_bh);
- dquot_free_block_nodirty(inode, *count-num);
- mark_inode_dirty(inode);
- *count = num;
+ if (num < *count) {
+ dquot_free_block_nodirty(inode, *count-num);
+ mark_inode_dirty(inode);
+ *count = num;
+ }
return ret_block;

Note that mark_inode_dirty() is now called only when num is less than
*count. However, with the dd command I have seen cases where num >= *count.

Fix:
I've verified that the following patch fixes the issue:
diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
index 9f9992b..5446a52 100644
--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -1406,11 +1406,10 @@ allocated:

*errp = 0;
brelse(bitmap_bh);
- if (num < *count) {
+ if (num <= *count)
dquot_free_block_nodirty(inode, *count-num);
- mark_inode_dirty(inode);
- *count = num;
- }
+ mark_inode_dirty(inode);
+ *count = num;
return ret_block;

io_error:

However, I'm not familiar with the ext2 source code and cannot tell
whether this is the correct fix; it does resolve my issue.

Thanks,
Andiry
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Jan Kara
Nov 4, 2013, 7:40:02 PM
Hello,

On Mon 04-11-13 14:31:34, Andiry Xu wrote:
> While trying XIP on ext2, I found that XIP does not work with the
> latest kernel.
>
> Steps to reproduce:
> Compile the kernel with the following config options:
> CONFIG_BLK_DEV_XIP=y
> CONFIG_EXT2_FS_XIP=y
>
> Then run the following commands:
> # mke2fs -b 4096 /dev/ram0
> # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
> # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
>
> dd fails with:
> dd: writing `/mnt/ramdisk/test1': No space left on device
>
> df also shows /mnt/ramdisk is 100% full. Its default size is 64MB, so a
> 16MB write should occupy only a quarter of the capacity.
>
> Offending commit:
> git bisect points to the following commit:
> 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
> Ext2: mark inode dirty after the function dquot_free_block_nodirty is called
Thanks for the report and the bisection!
With this patch you have essentially reverted a hunk of commit
8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b, but I don't see a reason why it
should be reverted. num should never be greater than *count, and when
num == *count the code inside the if doesn't do anything useful.

I've looked into the code and I think I see the problem. It is a
long-standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
ext2_get_block() asking for 0 blocks to map (while we really want 1 block).
ext2_get_block() just passes that request on, and ext2_get_blocks() actually
allocates 1 block. And that's where the commit you identified makes a
difference: previously we returned that 1 block was allocated, while now
we return that 0 blocks were allocated, so the allocation is repeated
until all free blocks are exhausted.

Attached patch should fix the problem.

Honza

--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR
0001-ext2-Fix-fs-corruption-in-ext2_get_xip_mem.patch

Andiry Xu
Nov 4, 2013, 9:40:02 PM
Hi Jan,
Thanks for the reply. I've verified that your patch fixes my issue,
and it's definitely better than my solution.

Tested-by: Andiry Xu <andi...@gmail.com>

I have another question about ext2 XIP performance, although it's not
quite related to this thread.

I'm testing XIP with ext2 on a RAM disk; the driver is brd.c.
The RAM disk size is 2GB, and I pre-fill it to guarantee that all pages
reside in main memory.
Then I use two different applications to write to the RAM disk. One
opens the file with the O_DIRECT flag and writes with POSIX write();
the other opens with O_DIRECT, mmap()s the file into user space, and
writes the data with memcpy(). I use request sizes from 512 bytes
to 64MB.

In my understanding, the mmap version bypasses the file system and
does not enter the kernel, so it should perform better than the
write() version. However, my test results show that's not always true:
for request sizes between 8KB and 1MB, the write() version achieves
about 7GB/s of bandwidth while the mmap version only reaches 5GB/s.
The test was performed on an i7-3770K machine with 8GB of memory,
running kernel 3.12. I have also tested kernel 3.2, where mmap
performs really badly, only 2GB/s for all request sizes.

Do you know why write() outperforms mmap() in some cases? I know it's
not related to this thread, but I'd really appreciate it if you could
answer my question.

Jan Kara
Nov 5, 2013, 9:40:03 AM
Hello,
Thanks for testing!

> I have another question about ext2 XIP performance, although it's not
> quite related to this thread.
>
> I'm testing XIP with ext2 on a RAM disk; the driver is brd.c.
> The RAM disk size is 2GB, and I pre-fill it to guarantee that all pages
> reside in main memory.
> Then I use two different applications to write to the RAM disk. One
> opens the file with the O_DIRECT flag and writes with POSIX write();
> the other opens with O_DIRECT, mmap()s the file into user space, and
> writes the data with memcpy(). I use request sizes from 512 bytes
> to 64MB.
>
> In my understanding, the mmap version bypasses the file system and
> does not enter the kernel, so it should perform better than the
> write() version. However, my test results show that's not always true:
> for request sizes between 8KB and 1MB, the write() version achieves
> about 7GB/s of bandwidth while the mmap version only reaches 5GB/s.
> The test was performed on an i7-3770K machine with 8GB of memory,
> running kernel 3.12. I have also tested kernel 3.2, where mmap
> performs really badly, only 2GB/s for all request sizes.
>
> Do you know why write() outperforms mmap() in some cases? I know it's
> not related to this thread, but I'd really appreciate it if you could
> answer my question.
Well, I'm not completely sure. mmap()ed memory always works on a
page-by-page basis: you first access the page, it gets faulted in, and
then you can access it further. For small (sub-page-size) accesses this
is a win because you avoid the overhead of the syscall and the fs write
path. For accesses larger than the page size, the overhead of the
syscall and some initial checks is well hidden by other costs. I guess
write() ends up being more efficient because the write path taken for
each page is somewhat lighter than a full page fault. But you'd need to
look into perf data to get hard numbers on where the time is spent.

Honza

--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR

Andiry Xu
Nov 5, 2013, 8:30:02 PM
Hi,
Thanks for the reply. However, I filled up the whole RAM disk before
running the test, i.e. asked the brd driver to allocate all the pages
up front. Moreover, I have run the test on PMFS, a file system for
persistent memory, with the same result. PMFS reserves some physical
memory during system boot and uses it to emulate a persistent memory
device, so there shouldn't be any page faults. I think there must be
some other reason.

I will try to look into the perf data and see what happens. Thanks for
the advice.

Thanks,
Andiry

Jan Kara
Nov 7, 2013, 12:10:01 AM
Well, pages in the ramdisk are always present; that's not the issue. But
you still get a page fault to map each physical page into the process'
virtual address space the first time the process accesses that virtual
address in the mapping. The cost of setting up this virtual->physical
mapping is what I'm talking about.

If you had a process which first mmaps the file, writes to all pages in
the mapping, and *then* measures the cost of another round of writing to
the mapping, I would expect you to see speeds close to those of the
memory bus.

> Moreover I have done the test on PMFS, a file system
> for persistent memory, and the result is the same. PMFS reserves some
> physical memory during system boot and then use it to emulate the
> persistent memory device, so there shouldn't be any page fault.
Again I don't think page faults would be avoided here.

Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR

Andiry Xu
Nov 7, 2013, 3:20:02 PM
Yes, you are right; there are page faults observed with perf. I had
misunderstood page faults as copying pages between the backing store
and physical memory.

> If you had a process which first mmaps the file and writes to all pages in
> the mapping and *then* measure the cost of another round of writing to the
> mapping, I would expect you should see speeds close to those of memory bus.
>

I've tried this as well. mmap() performance improves but is still not
as good as write().
I used perf report to compare the write() and mmap() applications. For
the write() version, the top of the perf report shows:
33.33% __copy_user_nocache
4.72% ext2_get_blocks
4.42% mutex_unlock
3.59% __find_get_block

which looks reasonable.

However, for the mmap() version, the perf report looks strange:
94.98% libc-2.15.so [.] 0x000000000014698d
2.25% page_fault
0.18% handle_mm_fault

I don't know what the first item is, but it takes the majority of the cycles.

>> Moreover I have done the test on PMFS, a file system
>> for persistent memory, and the result is the same. PMFS reserves some
>> physical memory during system boot and then use it to emulate the
>> persistent memory device, so there shouldn't be any page fault.
> Again I don't think page faults would be avoided here.
>

Yes, page faults cannot be avoided in this case as well.

Thanks,
Andiry

Jan Kara
Nov 7, 2013, 4:10:02 PM
The first item means it's some userspace code in libc. My guess would
be that it's libc's memcpy() (or whatever you use to write to the
mapping). How do you access the mmap()ed region?

Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR

Andiry Xu
Nov 7, 2013, 5:00:01 PM
Like this:

fd = open(file_name, O_CREAT | O_RDWR | O_DIRECT, 0755);
dest = (char *)mmap(NULL, FILE_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
for (i = 0; i < count; i++) {
        memcpy(dest, src, request_size);
        dest += request_size;
}

Thanks,
Andiry

Jan Kara
Nov 7, 2013, 5:30:02 PM
OK, maybe libc's memcpy isn't very well optimized for your CPU? Not
sure how to tune that, though...

Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR

Andiry Xu
Nov 7, 2013, 5:50:01 PM
Hmm, I will try some different kinds of memcpy and see if there is a
difference. I just wanted to make sure I wasn't making some stupid
mistake before trying that.
Thanks a lot for your help!

Thanks,
Andiry

Andiry Xu
Nov 8, 2013, 7:30:01 PM
Your advice does make a difference. I used an optimized version of
memcpy and it does improve the mmap application's performance: on a
ramdisk with ext2 XIP, the mmap() version now achieves 11GB/s of
bandwidth, compared to 7GB/s for the POSIX write() version.

Now I wonder whether there are plans to update the memcpy() in libc..

Jan Kara
Nov 11, 2013, 5:20:01 AM
Good :).

> Now I wonder whether there are plans to update the memcpy() in libc..
You'd better ask on the glibc development list... I googled for a while
to see whether memcpy() in glibc can be tuned (for a particular
instruction set) but didn't find anything useful.

Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR