Re: Prefetching in Linux

56 views
Skip to first unread message

Jan Kara

unread,
May 16, 2007, 1:32:55 PM5/16/07
to Krzysztof Lichota, prefetc...@googlegroups.com
[adding the mailing list to CC as Krzysztof suggested]

Hello,

On Wed 16-05-07 18:48:13, Krzysztof Lichota wrote:
> Jan Kara napisał(a):
> >> Great, I am planning to get in touch with ext3 developers in general
> >> this weekend after I clean up my e2moveblocks tool for reordering files
> >> in ext3 (it will a part of my SoC project). I hope he can help me avoid
> >> some stupid mistakes ;)
> > Actually, I have written a tool which remaps blocks on ext3 filesystem
> > (together with a tool that processes blktrace output so that you can feed
> > it into the remapper) for my testing. It's source code is attached if you'd
> > like to have a look (together with some README).
>
> Thanks a lot, I have looked through it :)
> I am attaching my tool, it is a bit simpler, but it has some
> differences. It contains some corectness tests to catch errors and it
> seems pretty stable in current state. The parser in my tool does not
> support defining ranges of blocks in file, but the underlying functions
> do, so I will describe how it is designed to work.
> It reads lines from input file, each line defines file name and range of
> blocks (offset, length) - it defines order of blocks in target area,
> file can appear multiple times on list, so different parts of file can
> be scattered in target area.
> After reading layout, it tries to find free area large enough to hold
> relocated blocks - size of necessary area is estimated based on ranges
> and takes into account indirect blocks which will be relocated alongside.
> Next, blocks are copied according to given order. If there is a hole
> (i.e. sparse file), it is skipped. If the block appears many times in
> layout file, it is copied only the first time, so blocks in overlapping
> ranges in layout file or indirect blocks are copied only the first time.
> It is designed to do so to not rely on layout file to be correct and to
> keep things simple to the maximum.
>
> The decision to not do any sorting/processing in my tool was taken
> deliberately - I wanted this tool to be extremely simple and to do all
> processing in some external tool which produces layout file. This way I
> can test different strategies of layout.
> Also for the same reason I have decided to have only one target area -
> if files should be gathered in groups, user can call the tool multiple
> times and has the guarantee that files in this group will be grouped
> together (if they do not appear in some of later processed groups).
>
> Now, the differences to your solution, forgive me if I understood wrong
> what it does :)
Actually from what you write it seems your tool works quite similarly as
mine (in README there are a few comments about how tools work). I'll sketch
it here:
e2remapblocks takes list of blocks to remap. Now these blocks can be data
blocks but also directory blocks or indirect blocks (negative offsets are
used to indicate indirect blocks). The remapping itself works the same way
as you write - the tool computes number of blocks needed (since I expect
indirect blocks in the input, I don't have to count with an extra space for
them), finds space for them and remaps blocks into the space in the order
in which they were in the input file.
It is true my tool does not handle holes - offsets with holes simply
shouldn't be in the input file. I have also a tool to generate the input
file - you first use blktrace to capture blocks which are read from disk.
Then you use e2block2map to map physical disk blocks (including indirect
blocks) to inodes. This tool has also a mode in which it keeps order of
blocks as returned by blktrace. Thus you get the blocks exactly in the
order in which they were read from disk.

> For example, I think it would be useful to relocate not only file data
> blocks, but also its inode, so that they are closer, and maybe for
> directories inodes of files which are in directory. One of the possible
> improvements I see in speeding up desktop apps is prefetching filesystem
> metadata and this would make it more efficient. What do you think?
Remapping inodes is exactly the next thing on my todo list :). It should
be reasonably easy to do. My next plan is to implement the remapping on a
mounted filesystem but it will take some time for proper kernel interfaces
to be established.

> Could you put the sources of your tool on in prefetch project repository
> or is it not possible for some reason? I will put the source of my tool
> there also and we could work on design and implementation of combined tool.
Yes, I can add them - where is the repository?

> BTW. Do you have any test results for your approach? I have tried my
> tool on one system (not very thorough test though) and the effect was
> that the boot time of the system with reordered files and disabled
> Ubuntu boot readahead was similar to system with unordered files and
> enabled readehead.
I was testing KDE startup - I have quite some results including
bootcharts. I can send them to you if you are interested. Actually, for me
remapping made startup time of KDE to be quite close to the time needed
when fcache was used (that is a tool that remaps all read blocks to a
special cache device on the first run) - which is probably the best you can
hope for when remapping blocks / inodes.

Honza
--
Jan Kara <ja...@suse.cz>
SuSE CR Labs

Krzysztof Lichota

unread,
May 16, 2007, 8:56:25 PM5/16/07
to prefetc...@googlegroups.com
Jan Kara napisał(a):

>> Now, the differences to your solution, forgive me if I understood wrong
>> what it does :)
> Actually from what you write it seems your tool works quite similarly as
> mine (in README there are a few comments about how tools work). I'll sketch
> it here:
> e2remapblocks takes list of blocks to remap. Now these blocks can be data
> blocks but also directory blocks or indirect blocks (negative offsets are
> used to indicate indirect blocks). The remapping itself works the same way
> as you write - the tool computes number of blocks needed (since I expect
> indirect blocks in the input, I don't have to count with an extra space for
> them), finds space for them and remaps blocks into the space in the order
> in which they were in the input file.

OK, I have missed that assumption. In fact they are very similar :)

> It is true my tool does not handle holes - offsets with holes simply
> shouldn't be in the input file.

Unless the files on disk were removed, truncated, etc. between the time
you run e2block2map and relocator. I think it would be better not rely
on what is fed into relocator.

> I have also a tool to generate the input
> file - you first use blktrace to capture blocks which are read from disk.
> Then you use e2block2map to map physical disk blocks (including indirect
> blocks) to inodes. This tool has also a mode in which it keeps order of
> blocks as returned by blktrace. Thus you get the blocks exactly in the
> order in which they were read from disk.

You mean the relocator? I haven't seen the option to keep the order, it
seems extents in file are sorted. Can you clarify that?

>> For example, I think it would be useful to relocate not only file data
>> blocks, but also its inode, so that they are closer, and maybe for
>> directories inodes of files which are in directory. One of the possible
>> improvements I see in speeding up desktop apps is prefetching filesystem
>> metadata and this would make it more efficient. What do you think?
> Remapping inodes is exactly the next thing on my todo list :). It should
> be reasonably easy to do.

If there are no hardlinks, I think :) And if using file paths, not
inodes. If there are hardlinks the scan of the whole directory structure
would be necessary to find the other directory pointing to inode.

I think that as we have these tools, we can set some design goals and
start designing the combined tool. Then we can split the work, so that
we do not work on the same functionality :)

The goals for relocator, as I see them:
1. Relocator must be resistant to errors in input (so that user cannot
screw his disk by feeding wrong file to relocator)
2. Relocator should be simple - the logic should be outside relocator
3. Relocator should be flexible - it should take any layout generated by
the logic module and lay out files on disk (subject to filesystem
structure constraints)

Now there is a decision to make regarding goal 3 - if relocator should
be fed with detailed filesystem-specific data (like indirect blocks,
etc.) or more high-level description (like block ranges in file).
The former gives greater flexibility on laying out files, but has the
disadvantage that relocator interface becomes filesystem-specific (and
cannot be reused for example on ReiserFS) and requires logic module to
know filesystem details. It is also more "error-prone", as logic module
might "forget" to move some filesystem structure, causing disk seek.
The latter has the disadvantage that some blocks (like indirect blocks)
are laid out of logic module control, so they might be in suboptimal
order, causing disk seeks.

Which way do you think is better? I think it is a more generic question
- what should be the split of responsibilities between relocator and
logic module. I have to think it over :)

> My next plan is to implement the remapping on a
> mounted filesystem but it will take some time for proper kernel interfaces
> to be established.

Yes, I have seen messages on linux-ext4 about online defragmenter for
ext3. This would be really cool if we could change layout on the fly.
But I am not sure it is stable enough yet, so I think offline
defragmenter is much better option now.
BTW. Do you know if the interface to kernel defragmenter is suitable for
our use? If not, we should take care to suit our needs...

>> Could you put the sources of your tool on in prefetch project repository
>> or is it not possible for some reason? I will put the source of my tool
>> there also and we could work on design and implementation of combined tool.
> Yes, I can add them - where is the repository?

You have to join the project on Google
(http://code.google.com/p/prefetch) then go to
http://code.google.com/p/prefetch/source
It is SVN, instructions to access repository are on the above page.
There will be bazaar branch on Launchpad also, synchronized with SVN,
but it still is not imported.

>> BTW. Do you have any test results for your approach? I have tried my
>> tool on one system (not very thorough test though) and the effect was
>> that the boot time of the system with reordered files and disabled
>> Ubuntu boot readahead was similar to system with unordered files and
>> enabled readehead.
> I was testing KDE startup - I have quite some results including
> bootcharts. I can send them to you if you are interested.

I think the better would be to put the results on project Wiki page
(http://code.google.com/p/prefetch/w/list).
We should start gathering such results, so that they are not duplicated.
Feel free to modify the wiki, just keep it somehow organized :)
The main page (which appears when clicking "Project home") is also
wikified, you can add a link to results page there. Or maybe better call
it "Experiments" :)

> Actually, for me
> remapping made startup time of KDE to be quite close to the time needed
> when fcache was used (that is a tool that remaps all read blocks to a
> special cache device on the first run)

Interesting, I haven't heard of fcache. Can you add the information
about it to bibliography wiki page? This is example of this scattered
knowledge I want to gather in prefetch project pages...

> - which is probably the best you can
> hope for when remapping blocks / inodes.

It is likely, but I hope by prefetching in advance we can do better than
fcache. As far as I understand it does not do any readahead.
And as I understand Linux kernel cache, it cannot do that, because
there is no way to prefetch pages without saying to which file it
belongs. Am I right?

BTW. fcache does not seem to be suitable for everyday use by casual
users. Too much hassle to set up :) But it is interesting tool for
comparison experiments.

Krzysztof Lichota


signature.asc

Jan Kara

unread,
May 17, 2007, 7:02:22 AM5/17/07
to prefetc...@googlegroups.com
On Thu 17-05-07 02:56:25, Krzysztof Lichota wrote:
> Jan Kara napisał(a):
> >> Now, the differences to your solution, forgive me if I understood wrong
> >> what it does :)
> > Actually from what you write it seems your tool works quite similarly as
> > mine (in README there are a few comments about how tools work). I'll sketch
> > it here:
> > e2remapblocks takes list of blocks to remap. Now these blocks can be data
> > blocks but also directory blocks or indirect blocks (negative offsets are
> > used to indicate indirect blocks). The remapping itself works the same way
> > as you write - the tool computes number of blocks needed (since I expect
> > indirect blocks in the input, I don't have to count with an extra space for
> > them), finds space for them and remaps blocks into the space in the order
> > in which they were in the input file.
>
> OK, I have missed that assumption. In fact they are very similar :)
Yes.

> > It is true my tool does not handle holes - offsets with holes simply
> > shouldn't be in the input file.
>
> Unless the files on disk were removed, truncated, etc. between the time
> you run e2block2map and relocator. I think it would be better not rely
> on what is fed into relocator.

There are some sanity checks but yes, it's not completely foolproof.
But it's expected that you capture the blocktrace, unmount / remount
read-only the filesystem and then run e2block2map and relocator.

> > I have also a tool to generate the input
> > file - you first use blktrace to capture blocks which are read from disk.
> > Then you use e2block2map to map physical disk blocks (including indirect
> > blocks) to inodes. This tool has also a mode in which it keeps order of
> > blocks as returned by blktrace. Thus you get the blocks exactly in the
> > order in which they were read from disk.
>
> You mean the relocator? I haven't seen the option to keep the order, it
> seems extents in file are sorted. Can you clarify that?

See the README file. The option -t of e2block2map does that.

> >> For example, I think it would be useful to relocate not only file data
> >> blocks, but also its inode, so that they are closer, and maybe for
> >> directories inodes of files which are in directory. One of the possible
> >> improvements I see in speeding up desktop apps is prefetching filesystem
> >> metadata and this would make it more efficient. What do you think?
> > Remapping inodes is exactly the next thing on my todo list :). It should
> > be reasonably easy to do.
>
> If there are no hardlinks, I think :) And if using file paths, not
> inodes. If there are hardlinks the scan of the whole directory structure
> would be necessary to find the other directory pointing to inode.

Yes, that's exactly what I plan to do. Basically there's not much else
you can do... e2block2map does that when you ask it to produce file names
and it does not take too long to scan the whole directory tree.

> I think that as we have these tools, we can set some design goals and
> start designing the combined tool. Then we can split the work, so that
> we do not work on the same functionality :)
>
> The goals for relocator, as I see them:
> 1. Relocator must be resistant to errors in input (so that user cannot
> screw his disk by feeding wrong file to relocator)

Yes, the final too should definitely be reasonably foolproof.

> 2. Relocator should be simple - the logic should be outside relocator

Yep. Both our tools seem to satisfy this. :)

> 3. Relocator should be flexible - it should take any layout generated by
> the logic module and lay out files on disk (subject to filesystem
> structure constraints)

Agreed.

> Now there is a decision to make regarding goal 3 - if relocator should
> be fed with detailed filesystem-specific data (like indirect blocks,
> etc.) or more high-level description (like block ranges in file).
> The former gives greater flexibility on laying out files, but has the
> disadvantage that relocator interface becomes filesystem-specific (and
> cannot be reused for example on ReiserFS) and requires logic module to
> know filesystem details. It is also more "error-prone", as logic module
> might "forget" to move some filesystem structure, causing disk seek.
> The latter has the disadvantage that some blocks (like indirect blocks)
> are laid out of logic module control, so they might be in suboptimal
> order, causing disk seeks.
>
> Which way do you think is better? I think it is a more generic question
> - what should be the split of responsibilities between relocator and
> logic module. I have to think it over :)

Yes. This has to be decided. The layout of metadata quite influences the
performance - that's why I've chosen to let it be a part of the input. I
think there's no problem that logic module would forget to ask for remapping
something. Such bug can be easily fixed if we find it...
The argument about filesystem dependency is of more importance I think.
But both the logic module and the relocator have to know about filesystem
structure anyway so I don't think it's a big deal.

> > My next plan is to implement the remapping on a
> > mounted filesystem but it will take some time for proper kernel interfaces
> > to be established.
>
> Yes, I have seen messages on linux-ext4 about online defragmenter for
> ext3. This would be really cool if we could change layout on the fly.
> But I am not sure it is stable enough yet, so I think offline
> defragmenter is much better option now.
> BTW. Do you know if the interface to kernel defragmenter is suitable for
> our use? If not, we should take care to suit our needs...

AFAIK the current interface is via ioctl and for ext4 only. So it's not
well usable for us and people also don't consider it the final one...

> >> Could you put the sources of your tool on in prefetch project repository
> >> or is it not possible for some reason? I will put the source of my tool
> >> there also and we could work on design and implementation of combined tool.
> > Yes, I can add them - where is the repository?
>
> You have to join the project on Google
> (http://code.google.com/p/prefetch) then go to
> http://code.google.com/p/prefetch/source
> It is SVN, instructions to access repository are on the above page.
> There will be bazaar branch on Launchpad also, synchronized with SVN,
> but it still is not imported.

OK. I'll have a look at that later.

> >> BTW. Do you have any test results for your approach? I have tried my
> >> tool on one system (not very thorough test though) and the effect was
> >> that the boot time of the system with reordered files and disabled
> >> Ubuntu boot readahead was similar to system with unordered files and
> >> enabled readehead.
> > I was testing KDE startup - I have quite some results including
> > bootcharts. I can send them to you if you are interested.
>
> I think the better would be to put the results on project Wiki page
> (http://code.google.com/p/prefetch/w/list).
> We should start gathering such results, so that they are not duplicated.
> Feel free to modify the wiki, just keep it somehow organized :)
> The main page (which appears when clicking "Project home") is also
> wikified, you can add a link to results page there. Or maybe better call
> it "Experiments" :)

Good idea, will do.

> > Actually, for me
> > remapping made startup time of KDE to be quite close to the time needed
> > when fcache was used (that is a tool that remaps all read blocks to a
> > special cache device on the first run)
>
> Interesting, I haven't heard of fcache. Can you add the information
> about it to bibliography wiki page? This is example of this scattered
> knowledge I want to gather in prefetch project pages...
>
> > - which is probably the best you can
> > hope for when remapping blocks / inodes.
>
> It is likely, but I hope by prefetching in advance we can do better than
> fcache. As far as I understand it does not do any readahead.
> And as I understand Linux kernel cache, it cannot do that, because
> there is no way to prefetch pages without saying to which file it
> belongs. Am I right?

Hmm, interesting question. You're right that it would not be trivial to
do preloaling on the block-device level on which the fcache works.

> BTW. fcache does not seem to be suitable for everyday use by casual
> users. Too much hassle to set up :) But it is interesting tool for
> comparison experiments.

Making it user friendly won't be too hard... But yes, you have to have a
separate partition for it.

Jan Kara

unread,
May 17, 2007, 12:29:34 PM5/17/07
to prefetc...@googlegroups.com
On Thu 17-05-07 02:56:25, Krzysztof Lichota wrote:
> >> Could you put the sources of your tool on in prefetch project repository
> >> or is it not possible for some reason? I will put the source of my tool
> >> there also and we could work on design and implementation of combined tool.
> > Yes, I can add them - where is the repository?
>
> You have to join the project on Google
> (http://code.google.com/p/prefetch) then go to
> http://code.google.com/p/prefetch/source
> It is SVN, instructions to access repository are on the above page.
> There will be bazaar branch on Launchpad also, synchronized with SVN,
> but it still is not imported.
It seems you have to add me to the project... I have an account
ja...@ucw.cz at Google. Will you add me please?

Krzysztof Lichota

unread,
May 17, 2007, 5:12:46 PM5/17/07
to prefetc...@googlegroups.com
Jan Kara napisał(a):

I have tried to add this account, but it says it is wrong account.
I think it must be something @gmail.com or similar. Mine is
"krzysztof.lichota", without any @ or domain.
I have tried adding jan.kara and someone was added as "jankara".
I hope it is you, check if you have access :)

Krzysztof Lichota


signature.asc

Krzysztof Lichota

unread,
May 20, 2007, 5:54:50 AM5/20/07
to prefetc...@googlegroups.com
Jan Kara napisał(a):

> On Thu 17-05-07 02:56:25, Krzysztof Lichota wrote:
>>> It is true my tool does not handle holes - offsets with holes simply
>>> shouldn't be in the input file.
>> Unless the files on disk were removed, truncated, etc. between the time
>> you run e2block2map and relocator. I think it would be better not rely
>> on what is fed into relocator.
> There are some sanity checks but yes, it's not completely foolproof.
> But it's expected that you capture the blocktrace, unmount / remount
> read-only the filesystem and then run e2block2map and relocator.

There is always a possibility something will change. That's why I would
prefer to catch disk accesses by inode and range of blocks. This
wouldn't change. But in any case we can use your remapper, it is just a
matter of adding handling of paths in function which loads extents. I
can do that, it is already done in my tool.

>>> I have also a tool to generate the input
>>> file - you first use blktrace to capture blocks which are read from disk.
>>> Then you use e2block2map to map physical disk blocks (including indirect
>>> blocks) to inodes. This tool has also a mode in which it keeps order of
>>> blocks as returned by blktrace. Thus you get the blocks exactly in the
>>> order in which they were read from disk.
>> You mean the relocator? I haven't seen the option to keep the order, it
>> seems extents in file are sorted. Can you clarify that?
> See the README file. The option -t of e2block2map does that.

OK. I got the impression that e2remapblocks, which calls function
sort_inodes_extents(), sorts extents in file, but I see it is doing this
only for traversal.

>> If there are no hardlinks, I think :) And if using file paths, not
>> inodes. If there are hardlinks the scan of the whole directory structure
>> would be necessary to find the other directory pointing to inode.
> Yes, that's exactly what I plan to do. Basically there's not much else
> you can do... e2block2map does that when you ask it to produce file names
> and it does not take too long to scan the whole directory tree.

What are the numbers? When fsck is doing the directory scan, it takes
several seconds. If the remapper is run at system shutdown, we cannot
use too much time, users are not happy if shutdown takes too long (see
latest critique of long shutdown time of Vista).

Fortunately, we can fall back to directory scan if we come across hard
link, which is not very common. Or we can choose not to relocate hard
link inode, I think this is the price we can pay ;) More important part
is to relocate normal files inodes.

>> Now there is a decision to make regarding goal 3 - if relocator should
>> be fed with detailed filesystem-specific data (like indirect blocks,
>> etc.) or more high-level description (like block ranges in file).
>> The former gives greater flexibility on laying out files, but has the
>> disadvantage that relocator interface becomes filesystem-specific (and
>> cannot be reused for example on ReiserFS) and requires logic module to
>> know filesystem details. It is also more "error-prone", as logic module
>> might "forget" to move some filesystem structure, causing disk seek.
>> The latter has the disadvantage that some blocks (like indirect blocks)
>> are laid out of logic module control, so they might be in suboptimal
>> order, causing disk seeks.
>>
>> Which way do you think is better? I think it is a more generic question
>> - what should be the split of responsibilities between relocator and
>> logic module. I have to think it over :)
> Yes. This has to be decided. The layout of metadata quite influences the
> performance - that's why I've chosen to let it be a part of the input. I
> think there's no problem that logic module would forget to ask for remapping
> something. Such bug can be easily fixed if we find it...
> The argument about filesystem dependency is of more importance I think.
> But both the logic module and the relocator have to know about filesystem
> structure anyway so I don't think it's a big deal.

OK. So let's go for detailed input. We can always add "smart" mode later
if such need arises. Or preprocess the input.

>>> My next plan is to implement the remapping on a
>>> mounted filesystem but it will take some time for proper kernel interfaces
>>> to be established.
>> Yes, I have seen messages on linux-ext4 about online defragmenter for
>> ext3. This would be really cool if we could change layout on the fly.
>> But I am not sure it is stable enough yet, so I think offline
>> defragmenter is much better option now.
>> BTW. Do you know if the interface to kernel defragmenter is suitable for
>> our use? If not, we should take care to suit our needs...
> AFAIK the current interface is via ioctl and for ext4 only. So it's not
> well usable for us and people also don't consider it the final one...

I thought it is also for ext3, but I do not track linux-ext4 list
closely. If it is not, it is out of consideration for now, we just have
to make sure it suits us when it is released, so that layout can be
optimized on the fly. This would really rock :)

>> You have to join the project on Google
>> (http://code.google.com/p/prefetch) then go to
>> http://code.google.com/p/prefetch/source
>> It is SVN, instructions to access repository are on the above page.
>> There will be bazaar branch on Launchpad also, synchronized with SVN,
>> but it still is not imported.
> OK. I'll have a look at that later.

I see you haven't checked it in yet. The account name you gave me was
not accepted by Google project interface, maybe this must be address
@gmail.com? I have tried adding "jan.kara" and it added it as "jankara".
Is this your account? Can you check if you can commit and change project
pages?

>> It is likely, but I hope by prefetching in advance we can do better than
>> fcache. As far as I understand it does not do any readahead.
>> And as I understand Linux kernel cache, it cannot do that, because
>> there is no way to prefetch pages without saying to which file it
>> belongs. Am I right?
> Hmm, interesting question. You're right that it would not be trivial to
> do preloaling on the block-device level on which the fcache works.

One of the parts of my proposal is to do filesystem metadata
prefetching. Behdad Esfahbod thinks that it would be even more important
than file prefetching. I have planned to do it by instrumenting
functions in ext3 module, but if it can be done on more generic level, I
would prefer to do it that way, so that all filesystems could benefit.
I will see how fcache does it.

So, concluding - I think we should start with your tool. I will modify
it to support file paths and sparse files. I can also add inode
relocation (without hard links) and you can work on relocation of hard
links, as you already have experience with that. I also have some tests
in my tool, I can modify them to work with your tool and I will add more
tests to make sure we do not spoil anything ;)

Does this split of work sound good to you?

BTW. What do you think about using C++? I do not like reinventing the
wheel and STL with its map class and hash_map would save a lot of time
implementing things like splay trees and hash maps in your tool.
I am not saying to drop the work you already done, but in case new
structures are needed, we could use STL structures. What do you think?
What is the policy in ext2 tool about this?

Krzysztof Lichota


signature.asc

Jan Kara

unread,
May 28, 2007, 1:05:07 PM5/28/07
to prefetc...@googlegroups.com
On Sun 20-05-07 11:54:50, Krzysztof Lichota wrote:
> Jan Kara napisał(a):
> > On Thu 17-05-07 02:56:25, Krzysztof Lichota wrote:
> >>> It is true my tool does not handle holes - offsets with holes simply
> >>> shouldn't be in the input file.
> >> Unless the files on disk were removed, truncated, etc. between the time
> >> you run e2block2map and relocator. I think it would be better not rely
> >> on what is fed into relocator.
> > There are some sanity checks but yes, it's not completely foolproof.
> > But it's expected that you capture the blocktrace, unmount / remount
> > read-only the filesystem and then run e2block2map and relocator.
>
> There is always a possibility something will change. That's why I would
> prefer to catch disk accesses by inode and range of blocks. This
> wouldn't change. But in any case we can use your remapper, it is just a
> matter of adding handling of paths in function which loads extents. I
> can do that, it is already done in my tool.
If something changes before we run e2block2map or after we call
e2remapblocks, it doesn't matter - the tool is handling this well. The only
possibility for problems is when filesystem changes between e2block2map and
e2remapblocks...

> >> If there are no hardlinks, I think :) And if using file paths, not
> >> inodes. If there are hardlinks the scan of the whole directory structure
> >> would be necessary to find the other directory pointing to inode.
> > Yes, that's exactly what I plan to do. Basically there's not much else
> > you can do... e2block2map does that when you ask it to produce file names
> > and it does not take too long to scan the whole directory tree.
>
> What are the numbers? When fsck is doing the directory scan, it takes
> several seconds. If the remapper is run at system shutdown, we cannot
> use too much time, users are not happy if shutdown takes too long (see
> latest critique of long shutdown time of Vista).

Yes, it takes several seconds (maybe 15). So it's not like we can do it
on every shutdown. I indended my tool to be run once in a longer time (e.g.
month or longer). We could watch block accesses and do mapping + remapping
only if we find out substantial differences...

> Fortunately, we can fall back to directory scan if we come across hard
> link, which is not very common. Or we can choose not to relocate hard
> link inode, I think this is the price we can pay ;) More important part
> is to relocate normal files inodes.

In my case, when info about accessed files is from blktrace, I have to do
full directory scan whenever I'd like to remap inodes...

> >> It is likely, but I hope by prefetching in advance we can do better than
> >> fcache. As far as I understand it does not do any readahead.
> >> And as I understand Linux kernel cache, it cannot do that, because
> >> there is no way to prefetch pages without saying to which file it
> >> belongs. Am I right?
> > Hmm, interesting question. You're right that it would not be trivial to
> > do preloaling on the block-device level on which the fcache works.
>
> One of the parts of my proposal is to do filesystem metadata
> prefetching. Behdad Esfahbod thinks that it would be even more important
> than file prefetching. I have planned to do it by instrumenting
> functions in ext3 module, but if it can be done on more generic level, I
> would prefer to do it that way, so that all filesystems could benefit.
> I will see how fcache does it.

Why can't you do prefetching by just reading needed blocks? SUSE has
package called 'preload' which does exactly that (and other distros have
similar packages). IMO prefetching does help a bit but not too much.

> BTW. What do you think about using C++? I do not like reinventing the
> wheel and STL with its map class and hash_map would save a lot of time
> implementing things like splay trees and hash maps in your tool.
> I am not saying to drop the work you already done, but in case new
> structures are needed, we could use STL structures. What do you think?
> What is the policy in ext2 tool about this?

I'm much more proficient in C than in C++ (kernel and all core tools are
in C...). Now I agree that in case of tool like this, STL can save some
work. e2fsprogs use C but I guess you don't have to care too much... So I
prefer C but hey, it's your project :).

Krzysztof Lichota

unread,
May 28, 2007, 3:03:48 PM5/28/07
to prefetc...@googlegroups.com
Jan Kara napisał(a):

>> There is always a possibility something will change. That's why I would
>> prefer to catch disk accesses by inode and range of blocks. This
>> wouldn't change. But in any case we can use your remapper, it is just a
>> matter of adding handling of paths in function which loads extents. I
>> can do that, it is already done in my tool.
> If something changes before we run e2block2map or after we call
> e2remapblocks, it doesn't matter - the tool is handling this well. The only
> possibility for problems is when filesystem changes between e2block2map and
> e2remapblocks...

I thought the input to e2block2map was block numbers? In such case if
file is moved, then e2block2map will not find it or it will find some
other file which has block in this place now.

>> What are the numbers? When fsck is doing the directory scan, it takes
>> several seconds. If the remapper is run at system shutdown, we cannot
>> use too much time, users are not happy if shutdown takes too long (see
>> latest critique of long shutdown time of Vista).
> Yes, it takes several seconds (maybe 15). So it's not like we can do it
> on every shutdown. I indended my tool to be run once in a longer time (e.g.
> month or longer). We could watch block accesses and do mapping + remapping
> only if we find out substantial differences...

Yes, I hope we can come up with some heuristic to determine when scan
will have such impact that it is reasonable to do it.
The distance of seeks (i.e. distance between blocks groups in short time
period) seems to be a good metric here.

>> Fortunately, we can fall back to directory scan if we come across hard
>> link, which is not very common. Or we can choose not to relocate hard
>> link inode, I think this is the price we can pay ;) More important part
>> is to relocate normal files inodes.
> In my case, when info about accessed files is from blktrace, I have to do
> full directory scan whenever I'd like to remap inodes...

Yes, but I hope tracing tool can provide file name in trace, so that
this step is not necessary, as it is costly.

>> One of the parts of my proposal is to do filesystem metadata
>> prefetching. Behdad Esfahbod thinks that it would be even more important
>> than file prefetching. I have planned to do it by instrumenting
>> functions in ext3 module, but if it can be done on more generic level, I
>> would prefer to do it that way, so that all filesystems could benefit.
>> I will see how fcache does it.
> Why can't you do prefetching by just reading needed blocks?

The blocks with metadata are not cached anywhere now, as far as I know.
I mean blocks with inodes, with directory entries, etc.

> SUSE has
> package called 'preload' which does exactly that (and other distros have
> similar packages). IMO prefetching does help a bit but not too much.

preload was written by Behdad and he thinks it is useful :)


>> BTW. What do you think about using C++? I do not like reinventing the
>> wheel and STL with its map class and hash_map would save a lot of time
>> implementing things like splay trees and hash maps in your tool.
>> I am not saying to drop the work you already done, but in case new
>> structures are needed, we could use STL structures. What do you think?
>> What is the policy in ext2 tool about this?
> I'm much more proficient in C than in C++ (kernel and all core tools are
> in C...). Now I agree that in case of tool like this, STL can save some
> work. e2fsprogs use C but I guess you don't have to care too much... So I
> prefer C but hey, it's your project :).

I prefer to think it is common project :)
And I hope at some point it is merged into e2fsprogs, as it belongs there.
I think I will use C++ where appropriate and when things stabilize and
C++ is not welcome, I will rewrite appropriate parts in C. This should
speed up development.

Krzysztof Lichota


signature.asc

behdad

unread,
May 28, 2007, 4:10:40 PM5/28/07
to prefetch-devel
On May 28, 3:03 pm, Krzysztof Lichota <krzys...@lichota.net> wrote:
> > SUSE has
> > package called 'preload' which does exactly that (and other distros have
> > similar packages). IMO prefetching does help a bit but not too much.
>
> preload was written by Behdad and he thinks it is useful :)

Quick note: About a year after I started my preload project [1],
when writing my findings I found out about the SuSE preload package,
which is simply just an static list of files and dentries to
readahead.
It's not that much of a project, mostly a package only. I have
covered
that in my thesis in the section called SuSE Preload. [2].


[1] http://preload.sf.net/
[2] http://behdad.org/preload.pdf

Cheers,

behdad

Jan Kara

unread,
May 29, 2007, 3:44:45 AM5/29/07
to prefetc...@googlegroups.com
On Mon 28-05-07 21:03:48, Krzysztof Lichota wrote:
> Jan Kara napisał(a):
> >> There is always a possibility something will change. That's why I would
> >> prefer to catch disk accesses by inode and range of blocks. This
> >> wouldn't change. But in any case we can use your remapper, it is just a
> >> matter of adding handling of paths in function which loads extents. I
> >> can do that, it is already done in my tool.
> > If something changes before we run e2block2map or after we call
> > e2remapblocks, it doesn't matter - the tool is handling this well. The only
> > possibility for problems is when filesystem changes between e2block2map and
> > e2remapblocks...
>
> I thought the input to e2block2map was block numbers? In such case if
> file is moved, then e2block2map will not find it or it will find some
> other file which has block in this place now.
Yes, it can happen it won't be able to find a block (in which case you
probably don't want to relocate it anyway as the file is removed / created
on every boot) or it finds a block of a different file (which would cause
some unwanted relocation). In both cases the filesystem will be consistent
in the end. Just the layout won't be perfect.

> >> What are the numbers? When fsck is doing the directory scan, it takes
> >> several seconds. If the remapper is run at system shutdown, we cannot
> >> use too much time, users are not happy if shutdown takes too long (see
> >> latest critique of long shutdown time of Vista).
> > Yes, it takes several seconds (maybe 15). So it's not like we can do it
> > on every shutdown. I indended my tool to be run once in a longer time (e.g.
> > month or longer). We could watch block accesses and do mapping + remapping
> > only if we find out substantial differences...
>
> Yes, I hope we can come up with some heuristic to determine when scan
> will have such impact that it is reasonable to do it.
> The distance of seeks (i.e. distance between blocks groups in short time
> period) seems to be a good metric here.

Actually, over some threshold (which is not that big) it does not matter how
far we seek (the thing that takes long time in seeking is the stabilization
of the head on a particular track). But measuring how many seeks longer
than some threshold do we have to do during boot is probably a good metric.

> >> Fortunately, we can fall back to directory scan if we come across hard
> >> link, which is not very common. Or we can choose not to relocate hard
> >> link inode, I think this is the price we can pay ;) More important part
> >> is to relocate normal files inodes.
> > In my case, when info about accessed files is from blktrace, I have to do
> > full directory scan whenever I'd like to remap inodes...
> Yes, but I hope tracing tool can provide file name in trace, so that
> this step is not necessary, as it is costly.

Generally, this is impossible as in some cases kernel doesn't have the
information. Probably the hardest thing are mmapped files - when writeout
decides to write some changed data or when some data is read from disk,
kernel just has an inode but nothing more. And this kind of access is not
uncommon - loading executables or dynamic libraries uses mmap.

> >> One of the parts of my proposal is to do filesystem metadata
> >> prefetching. Behdad Esfahbod thinks that it would be even more important
> >> than file prefetching. I have planned to do it by instrumenting
> >> functions in ext3 module, but if it can be done on more generic level, I
> >> would prefer to do it that way, so that all filesystems could benefit.
> >> I will see how fcache does it.
> > Why can't you do prefetching by just reading needed blocks?
>
> The blocks with metadata are not cached anywhere now, as far as I know.
> I mean blocks with inodes, with directory entries, etc.

Kernel caches them in block device page cache. So once metadata are
needed they are cached in memory and stay there for some time.

> I prefer to think it is common project :)
> And I hope at some point it is merged into e2fsprogs, as it belongs there.
> I think I will use C++ where appropriate and when things stabilize and
> C++ is not welcome, I will rewrite appropriate parts in C. This should
> speed up development.

OK, that sounds reasonable :).

Krzysztof Lichota

unread,
May 29, 2007, 4:06:02 AM5/29/07
to prefetc...@googlegroups.com
Jan Kara napisał(a):

>> I thought the input to e2block2map was block numbers? In such case if
>> file is moved, then e2block2map will not find it or it will find some
>> other file which has block in this place now.
> Yes, it can happen it won't be able to find a block (in which case you
> probably don't want to relocate it anyway as the file is removed / created
> on every boot) or it finds a block of a different file (which would cause
> some unwanted relocation). In both cases the filesystem will be consistent
> in the end. Just the layout won't be perfect.

This is what I call a bug :)
In case of fast-changing files it would be a great problem.
Don't know if it is a common case though, we will see when first tests
are performed.

>> Yes, I hope we can come up with some heuristic to determine when scan
>> will have such impact that it is reasonable to do it.
>> The distance of seeks (i.e. distance between blocks groups in short time
>> period) seems to be a good metric here.
> Actually, over some threshold (which is not that big) it does not matter how
> far we seek (the thing that takes long time in seeking is the stabilization
> of the head on a particular track). But measuring how many seeks longer
> than some threshold do we have to do during boot is probably a good metric.

According to this article:
http://www.cs.ucsb.edu/research/tech_reports/reports/2004-18.pdf
(page 9, figure 15) the seek time starts with about 2 ms and goes nearly
linearly to 11 ms. It is quite old paper though :)

If you have more recent data, please add it to Bibliography page.


>> Yes, but I hope tracing tool can provide file name in trace, so that
>> this step is not necessary, as it is costly.
> Generally, this is impossible as in some cases kernel doesn't have the
> information. Probably the hardest thing are mmapped files - when writeout
> decides to write some changed data or when some data is read from disk,
> kernel just has an inode but nothing more. And this kind of access is not
> uncommon - loading executables or dynamic libraries uses mmap.

I have done it in my tracing tool for SquashFS (see
http://lichota.net/~krzysiek/projects/kubuntu/dapper-livecd-optimization/)
by intercepting readpage operation in address_space_operations.
Is there something wrong with this approach?

>>> Why can't you do prefetching by just reading needed blocks?
>> The blocks with metadata are not cached anywhere now, as far as I know.
>> I mean blocks with inodes, with directory entries, etc.
> Kernel caches them in block device page cache. So once metadata are
> needed they are cached in memory and stay there for some time.

OK, I didn't know about that :)

Krzysztof Lichota


signature.asc

Jan Kara

unread,
May 29, 2007, 4:48:10 AM5/29/07
to prefetc...@googlegroups.com
On Tue 29-05-07 10:06:02, Krzysztof Lichota wrote:
> Jan Kara napisał(a):
> >> I thought the input to e2block2map was block numbers? In such case if
> >> file is moved, then e2block2map will not find it or it will find some
> >> other file which has block in this place now.
> > Yes, it can happen it won't be able to find a block (in which case you
> > probably don't want to relocate it anyway as the file is removed / created
> > on every boot) or it finds a block of a different file (which would cause
> > some unwanted relocation). In both cases the filesystem will be consistent
> > in the end. Just the layout won't be perfect.
>
> This is what I call a bug :)
> In case of fast-changing files it would be a great problem.
It doesn't make sence to remap fast-changing files (as they will soo end
in a different location anyway). Prefetching them can be useful though.

> Don't know if it is a common case though, we will see when first tests
> are performed.

During boot / app startup it's not common (at least in my test case).

> >> Yes, I hope we can come up with some heuristic to determine when scan
> >> will have such impact that it is reasonable to do it.
> >> The distance of seeks (i.e. distance between blocks groups in short time
> >> period) seems to be a good metric here.
> > Actually, over some threshold (which is not that big) it does not matter how
> > far we seek (the thing that takes long time in seeking is the stabilization
> > of the head on a particular track). But measuring how many seeks longer
> > than some threshold do we have to do during boot is probably a good metric.
>
> According to this article:
> http://www.cs.ucsb.edu/research/tech_reports/reports/2004-18.pdf
> (page 9, figure 15) the seek time starts with about 2 ms and goes nearly
> linearly to 11 ms. It is quite old paper though :)

OK, I'd trust more this paper than my memory :).

> If you have more recent data, please add it to Bibliography page.
> >> Yes, but I hope tracing tool can provide file name in trace, so that
> >> this step is not necessary, as it is costly.
> > Generally, this is impossible as in some cases kernel doesn't have the
> > information. Probably the hardest thing are mmapped files - when writeout
> > decides to write some changed data or when some data is read from disk,
> > kernel just has an inode but nothing more. And this kind of access is not
> > uncommon - loading executables or dynamic libraries uses mmap.
>
> I have done it in my tracing tool for SquashFS (see
> http://lichota.net/~krzysiek/projects/kubuntu/dapper-livecd-optimization/)
> by intercepting readpage operation in address_space_operations.
> Is there something wrong with this approach?

Oh, OK, that should work. I somehow thought readpage() call gets inode
but it gets file pointer from which you can get to dentry and reconstruct
the path.

Krzysztof Lichota

unread,
May 29, 2007, 4:45:13 AM5/29/07
to prefetc...@googlegroups.com
I have created the page for ext3 layout tool design:
http://code.google.com/p/prefetch/wiki/Ext3LayoutToolDesign

Please update as you see appropriate.

Krzysztof Lichota


signature.asc

Krzysztof Lichota

unread,
May 29, 2007, 4:53:09 AM5/29/07
to prefetc...@googlegroups.com
Jan Kara napisał(a):

> On Tue 29-05-07 10:06:02, Krzysztof Lichota wrote:
>> Jan Kara napisał(a):
>>> Yes, it can happen it won't be able to find a block (in which case you
>>> probably don't want to relocate it anyway as the file is removed / created
>>> on every boot) or it finds a block of a different file (which would cause
>>> some unwanted relocation). In both cases the filesystem will be consistent
>>> in the end. Just the layout won't be perfect.
>> This is what I call a bug :)
>> In case of fast-changing files it would be a great problem.
> It doesn't make sence to remap fast-changing files (as they will soo end
> in a different location anyway). Prefetching them can be useful though.

If we could pin them in one location and allocate some space around it,
this could be useful. But you are right, it is probably not very
feasible and requires too much work.

>> Don't know if it is a common case though, we will see when first tests
>> are performed.
> During boot / app startup it's not common (at least in my test case).

I thought about logs, but probably they are only written to.

>> I have done it in my tracing tool for SquashFS (see
>> http://lichota.net/~krzysiek/projects/kubuntu/dapper-livecd-optimization/)
>> by intercepting readpage operation in address_space_operations.
>> Is there something wrong with this approach?
> Oh, OK, that should work. I somehow thought readpage() call gets inode
> but it gets file pointer from which you can get to dentry and reconstruct
> the path.

Yes, unfortunately using this approach ties the implementation to
specific file system. I hope this can be done on higher level somehow.

Krzysztof Lichota


signature.asc
Reply all
Reply to author
Forward
0 new messages