This patch-set contains UBI, which stands for Unsorted Block Images. It
is closely related to the memory technology devices (MTD) Linux
subsystem, so this new piece of software lives in drivers/mtd/ubi.
In short, UBI is a kind of LVM layer, but for flash (MTD) devices. It
makes it possible to dynamically create, delete and re-size volumes. But
the analogy is not complete. UBI also takes care of wear-leveling and
bad eraseblock handling, so it completely hides the two aspects of flash
chips which make them very difficult to work with:
1. wear of eraseblocks;
2. bad eraseblocks.
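For readers new to the idea, the mapping UBI maintains can be sketched as a
toy user-space model. The names below (eba_tbl, pick_peb, etc.) are
illustrative, not UBI's real data structures: logical eraseblocks (LEBs) map
to physical eraseblocks (PEBs), and per-PEB erase counters plus a bad flag
are what let the layer hide the two aspects above.

```c
#include <assert.h>
#include <limits.h>

/* Toy model: LEB -> PEB mapping with naive wear-leveling.  Illustrative
 * only; UBI's real structures and algorithms are more involved. */
#define NUM_PEBS 8
#define UNMAPPED (-1)

struct peb {
	int erase_count;	/* how often this PEB has been erased */
	int in_use;		/* currently backing some LEB */
	int bad;		/* marked bad, never used again */
};

static struct peb pebs[NUM_PEBS];
static int eba_tbl[NUM_PEBS];	/* LEB -> PEB, or UNMAPPED */

/* Pick the free, good PEB with the lowest erase count. */
static int pick_peb(void)
{
	int best = UNMAPPED, best_ec = INT_MAX;

	for (int i = 0; i < NUM_PEBS; i++)
		if (!pebs[i].bad && !pebs[i].in_use &&
		    pebs[i].erase_count < best_ec) {
			best = i;
			best_ec = pebs[i].erase_count;
		}
	return best;
}

static int leb_map(int lnum)
{
	int pnum = pick_peb();

	if (pnum == UNMAPPED)
		return UNMAPPED;	/* no good PEB left */
	pebs[pnum].in_use = 1;
	eba_tbl[lnum] = pnum;
	return pnum;
}

static void leb_unmap(int lnum)
{
	int pnum = eba_tbl[lnum];

	pebs[pnum].in_use = 0;
	pebs[pnum].erase_count++;	/* unmapping implies an erase */
	eba_tbl[lnum] = UNMAPPED;
}
```

A real implementation also has to persist the mapping and the erase counters
on the flash itself and rebuild them at attach time, which is where much of
the complexity lives.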
There is some documentation available at:
http://www.linux-mtd.infradead.org/doc/ubi.html
http://www.linux-mtd.infradead.org/faq/ubi.html
The sources are available via the GIT tree:
git://git.infradead.org/ubi-2.6.git (stable)
git://git.infradead.org/~dedekind/dedekind-ubi-2.6.git (devel)
One can also browse the GIT trees at http://git.infradead.org/
This is the third iteration of this patch-set, which fixes most of the
issues pointed out previously.
MAINTAINERS | 8
drivers/mtd/Kconfig | 2
drivers/mtd/Makefile | 2
drivers/mtd/ubi/Kconfig | 60 +
drivers/mtd/ubi/Kconfig.debug | 153 +++
drivers/mtd/ubi/Makefile | 7
drivers/mtd/ubi/account.c | 233 +++++
drivers/mtd/ubi/build.c | 467 +++++++++++
drivers/mtd/ubi/cdev.c | 926 ++++++++++++++++++++++
drivers/mtd/ubi/debug.c | 546 +++++++++++++
drivers/mtd/ubi/debug.h | 146 +++
drivers/mtd/ubi/eba.c | 1735 +++++++++++++++++++++++++++++++++++++++++
drivers/mtd/ubi/gluebi.c | 361 ++++++++
drivers/mtd/ubi/io.c | 1445 ++++++++++++++++++++++++++++++++++
drivers/mtd/ubi/misc.c | 167 +++
drivers/mtd/ubi/scan.c | 1478 +++++++++++++++++++++++++++++++++++
drivers/mtd/ubi/sysfs.c | 408 +++++++++
drivers/mtd/ubi/ubi.h | 867 ++++++++++++++++++++
drivers/mtd/ubi/uif.c | 842 ++++++++++++++++++++
drivers/mtd/ubi/upd.c | 359 ++++++++
drivers/mtd/ubi/vmt.c | 360 ++++++++
drivers/mtd/ubi/vtbl.c | 1387 +++++++++++++++++++++++++++++++++
drivers/mtd/ubi/wl.c | 1761 ++++++++++++++++++++++++++++++++++++++++++
fs/jffs2/fs.c | 12
fs/jffs2/os-linux.h | 6
fs/jffs2/wbuf.c | 24
include/linux/mtd/ubi.h | 196 ++++
include/mtd/Kbuild | 2
include/mtd/mtd-abi.h | 1
include/mtd/ubi-header.h | 371 ++++++++
include/mtd/ubi-user.h | 163 +++
31 files changed, 14495 insertions(+)
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
> On Wed, 14 Mar 2007 17:20:24 +0200 Artem Bityutskiy <dede...@infradead.org> wrote:
>
> ...
>
> +/**
> + * leb_get_ver - get logical eraseblock version.
> + *
> + * @ubi: the UBI device description object
> + * @vol_id: the volume ID
> + * @lnum: the logical eraseblock number
> + *
> + * The logical eraseblock has to be locked. Note, all this leb_ver stuff is
> + * obsolete and will be removed eventually. FIXME: to be removed together with
> + * leb_ver support.
> + */
> +static inline int leb_get_ver(struct ubi_info *ubi, int vol_id, int lnum)
> +{
> + int idx, leb_ver;
> +
> + idx = vol_id2idx(ubi, vol_id);
> +
> + spin_lock(&ubi->eba.eba_tbl_lock);
> + ubi_assert(ubi->eba.eba_tbl[idx].recs);
> + leb_ver = ubi->eba.eba_tbl[idx].recs[lnum].leb_ver;
> + spin_unlock(&ubi->eba.eba_tbl_lock);
> +
> + return leb_ver;
> +}
I very much doubt that the locking in this function (and in the similar
ones here) does anything useful.
> +static unsigned long long next_sqnum(struct ubi_info *ubi)
> +{
> + unsigned long long sqnum;
> +
> + spin_lock(&ubi->eba.eba_tbl_lock);
> + sqnum = ubi->eba.global_sq_cnt++;
> + spin_unlock(&ubi->eba.eba_tbl_lock);
> +
> + return sqnum;
> +}
That one makes sense.
> +static inline void leb_map(struct ubi_info *ubi, int vol_id, int lnum, int pnum)
> +{
> + int idx;
> +
> + idx = vol_id2idx(ubi, vol_id);
> + spin_lock(&ubi->eba.eba_tbl_lock);
> + ubi_assert(ubi->eba.eba_tbl[idx].recs);
> + ubi_assert(ubi->eba.eba_tbl[idx].recs[lnum].pnum < 0);
> + ubi->eba.eba_tbl[idx].recs[lnum].pnum = pnum;
> + spin_unlock(&ubi->eba.eba_tbl_lock);
> +}
I doubt if that one does.
> +/**
> + * leb_unmap - un-map a logical eraseblock.
> + *
> + * @ubi: the UBI device description object
> + * @vol_id: the volume ID
> + * @lnum: the logical eraseblock number to unmap
> + *
> + * This function un-maps a logical eraseblock and increases its version. The
> + * logical eraseblock has to be locked.
> + */
> +static inline void leb_unmap(struct ubi_info *ubi, int vol_id, int lnum)
The patch is full of nutty inlining.
Suggestion: just remove all of it. Then reintroduce inlining in only
those places where a benefit is demonstrable. Reduced code size according to
/bin/size would be a suitable metric.
> +static inline int leb2peb(struct ubi_info *ubi, int vol_id, int lnum)
> +{
> + int idx, pnum;
> +
> + idx = vol_id2idx(ubi, vol_id);
> +
> + spin_lock(&ubi->eba.eba_tbl_lock);
> + ubi_assert(ubi->eba.eba_tbl[idx].recs);
> + pnum = ubi->eba.eba_tbl[idx].recs[lnum].pnum;
> + spin_unlock(&ubi->eba.eba_tbl_lock);
> +
> + return pnum;
> +}
Again, the locking seems pointless.
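The objection can be illustrated with a user-space sketch (a pthread mutex
standing in for the spinlock; the names are illustrative, not the patch's
code): the lock around a single word load buys the caller nothing, while the
lock in next_sqnum() protects a genuine read-modify-write.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t tbl_lock = PTHREAD_MUTEX_INITIALIZER;
static int recs_pnum[16];
static unsigned long long global_sq_cnt;

/* The criticized pattern: lock, read one word, unlock.  A naturally
 * aligned int load is atomic on the relevant architectures anyway, and
 * the value may change the instant the lock is dropped, so the caller
 * gains no guarantee from the locking. */
static int leb2peb(int lnum)
{
	int pnum;

	pthread_mutex_lock(&tbl_lock);
	pnum = recs_pnum[lnum];
	pthread_mutex_unlock(&tbl_lock);
	return pnum;		/* possibly already stale here */
}

/* By contrast, this is a read-modify-write: without the lock, two
 * concurrent callers could read the same counter value and hand out
 * duplicate sequence numbers.  This is the one that makes sense. */
static unsigned long long next_sqnum(void)
{
	unsigned long long sqnum;

	pthread_mutex_lock(&tbl_lock);
	sqnum = global_sq_cnt++;
	pthread_mutex_unlock(&tbl_lock);
	return sqnum;
}
```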
>
> There's way too much code here to expect it to get decently reviewed, alas.
Yes.
/me repeats wish that Not Everything Should Be Sent to lkml. :(
> > On Wed, 14 Mar 2007 17:20:24 +0200 Artem Bityutskiy <dede...@infradead.org> wrote:
> >
> > ...
> >
> > +/**
> > + * leb_get_ver - get logical eraseblock version.
> > + *
> > + * @ubi: the UBI device description object
> > + * @vol_id: the volume ID
> > + * @lnum: the logical eraseblock number
> > + *
> > + * The logical eraseblock has to be locked. Note, all this leb_ver stuff is
> > + * obsolete and will be removed eventually. FIXME: to be removed together with
> > + * leb_ver support.
> > + */
Please use kernel-doc syntax and test it. Using and testing it
are really easy to do. It's just a simple language. Don't make
(even trivial) problems for others to clean up...
Documentation/kernel-doc-nano-HOWTO.txt
Above: no "blank" line between the function name and its parameters.
> > +static inline int leb_get_ver(struct ubi_info *ubi, int vol_id, int lnum)
> > +{
> > + int idx, leb_ver;
> > +
> > + idx = vol_id2idx(ubi, vol_id);
> > +
> > + spin_lock(&ubi->eba.eba_tbl_lock);
> > + ubi_assert(ubi->eba.eba_tbl[idx].recs);
> > + leb_ver = ubi->eba.eba_tbl[idx].recs[lnum].leb_ver;
> > + spin_unlock(&ubi->eba.eba_tbl_lock);
> > +
> > + return leb_ver;
> > +}
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
Just curious, but where would you suggest this be sent to for review then?
josh
> On Thu, Mar 15, 2007 at 02:24:10PM -0700, Randy Dunlap wrote:
> > On Thu, 15 Mar 2007 11:07:03 -0800 Andrew Morton wrote:
> >
> > >
> > > There's way too much code here to expect it to get decently reviewed, alas.
> >
> > Yes.
> >
> > /me repeats wish that Not Everything Should Be Sent to lkml. :(
>
> Just curious, but where would you suggest this be sent to for review then?
Valid question. I should have chosen some other more appropriate
patch to make that comment.
I don't see a better list for UBI patches, so lkml is OK IMO.
Here is a summary of my thinking on Linux-related mailing lists.
1. Bug reports can go to lkml or focused mailing lists.
2. Development (like patches) should go to focused mailing lists
if there is such a list and they have enough usage.
Development areas that qualify for this IMO are:
- ACPI
- ATA
- file systems
- frame buffer
- ieee1394
- MM/VM
- multimedia
- networking
- PCI
- power management, suspend/resume
- SCSI
- sound
- USB
- virtualization
(not that I expect anything close to consensus on this)
---
~Randy
Well, yes, these are integers.
> > +static inline void leb_unmap(struct ubi_info *ubi, int vol_id, int lnum)
>
> The patch is full of nutty inlining.
Yeah, this file has too many of them.
> Suggestion: just remove all of it. Then reintroduce inlining in only
> those places where a benefit is demonstrable. Reduced code size according to
> /bin/size would be a suitable metric.
OK, thanks.
> > + spin_unlock(&ubi->eba.eba_tbl_lock);
> > +
> > + return pnum;
> > +}
>
> Again, the locking seems pointless.
Thanks for comments, will be fixed.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Please raise this question in a separate thread and discuss it with the
subsystem maintainers. I was directed here by the MTD maintainer.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
If you make a statement like this, please provide reasons why you say it
about these patches and suggest something _constructive_. I do not see any
point in such a vague phrase.
And please, note, I was directed here by David Woodhouse who is MTD
maintainer because he thinks the patch is large and needs more people to
look at it, not just him.
> Documentation/kernel-doc-nano-HOWTO.txt
>
> Above: no "blank" line between the function name and its parameters.
OK, I'll look at this again, thanks.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
> On Thu, 2007-03-15 at 14:24 -0700, Randy Dunlap wrote:
> > /me repeats wish that Not Everything Should Be Sent to lkml. :(
>
> If you make a statement like this, please provide reasons why you say it
> about these patches and suggest something _constructive_. I do not see any
> point in such a vague phrase.
so do you believe that Everything (that is kernel-related) should
be sent to lkml?
> And please, note, I was directed here by David Woodhouse who is MTD
> maintainer because he thinks the patch is large and needs more people to
> look at it, not just him.
As I wrote to Mr. Boyer, that makes sense in this case.
---
~Randy
Forgive my ignorance, but why did you not implement the two features
above as device mapper layers instead? A device mapper can arbitrarily
transform I/O addresses and contents and has direct access to the
mapped device's ioctl interfaces, etc.
Writing a mapper that's driven by out of band data on the underlying
device can be done in a couple hundred lines of code. I've done it to
implement an MD5 integrity checking layer. In this case, all the I/Os
were simply remapped to be above the on-disk MD5 table and expanded to
be at least the hashed cluster size.
Even if the device mapper API is not completely up to the task (which
I strongly suspect it is), it would seem simpler to extend it than to
add 14000 lines of parallel subsystem.
--
Mathematics is the supreme nostalgia of our time.
Simply because UBI is designed for flash devices, not block devices. Note
that UBI is not for MMC/USB stick/SD/etc. flashes, which are used as block
devices, but for _bare_ flashes.
Please glance here for more information about the difference between
flashes and block devices:
http://www.linux-mtd.infradead.org/faq/general.html#L_mtd_vs_hdd
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
I'm well aware of all that. I wrote a NAND driver just last month.
Let's consider this table:
HARD drives MTD device
Consists of sectors Consists of eraseblocks
Sectors are small (512, 1024 bytes) Eraseblocks are larger (32KiB, 128KiB)
read sector and write sector read, write, and erase block
Bad sectors are re-mapped Bad eraseblocks are not hidden
HDD sectors don't wear out Eraseblocks get worn-out
If the end goal is to end up with something that looks like a block
device (which seems to be implied by adding transparent wear leveling
and bad block remapping), then I don't see any reason it can't be done
in device mapper. The 'smarts' of mtdblock could in fact be pulled up
a level. As I've pointed out already, you can already easily address
issues two, four, and five with device mapper layers.
If instead you still want the "NAND-ness" of the device exposed at the
top level so things can do raw eraseblock I/O more efficiently, then I
think instead of duplicating the device mapper framework, we should
instead think about how to integrate NAND devices more closely with
the block layer.
In the end, a block device is something which does random access
block-oriented I/O. Disk and NAND both fit that description.
--
Mathematics is the supreme nostalgia of our time.
Nope, not the end goal. It's more about wear-leveling across the entire
flash chip than it is presenting a "block like" device.
> and bad block remapping), then I don't see any reason it can't be done
> in device mapper. The 'smarts' of mtdblock could in fact be pulled up
There is nothing smart about mtdblock. And mtdblock has nothing to do
with UBI.
> a level. As I've pointed out already, you can already easily address
> issues two, four, and five with device mapper layers.
>
> If instead you still want the "NAND-ness" of the device exposed at the
> top level so things can do raw eraseblock I/O more efficiently, then I
> think instead of duplicating the device mapper framework, we should
> instead think about how to integrate NAND devices more closely with
> the block layer.
>
> In the end, a block device is something which does random access
> block-oriented I/O. Disk and NAND both fit that description.
NAND very much doesn't fit the "random access" part of that. For writes
you have to write in incrementing pages within eraseblocks.
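That write-ordering rule can be captured in a few lines. This is an
illustrative model only; real chips differ in details such as whether pages
may be skipped (this sketch allows skipping forward but never going back).

```c
#include <assert.h>

/* Toy model of the NAND programming rule: within an eraseblock, pages
 * must be programmed in ascending order, and a page cannot be
 * reprogrammed until the whole block has been erased. */
#define PAGES_PER_EB 64

struct nand_eb {
	int next_page;		/* lowest page still legal to program */
};

static int nand_prog_page(struct nand_eb *eb, int page)
{
	if (page < eb->next_page || page >= PAGES_PER_EB)
		return -1;		/* rewrite or out of range: refused */
	eb->next_page = page + 1;	/* may only move forward */
	return 0;
}

static void nand_erase_block(struct nand_eb *eb)
{
	eb->next_page = 0;	/* only an erase makes pages writable again */
}
```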
UBI is about maximizing the number of available eraseblocks to efficiently
wear-level across the largest possible area on a flash chip. MTD itself
contains no higher-level capabilities to deal with this, and UBI uses the
underlying MTD device directly, not through ioctls. This allows existing
flash-specific users (e.g. JFFS2) to run on top of UBI with minimal changes.
Your idea does have some merit, however I believe your focus is misplaced.
Rather than convert UBI to device mapper and somehow try to make it work
through mtdblock (sic), perhaps what should be done is to come up with a
better interface for MTD to present itself as a block device. I would
still find that troubling though.
josh
Disks have OOB areas with ECC, it's just nicely hidden inside the
drive. They also typically have physical sectors bigger than 512
bytes, again hidden.
> > If the end goal is to end up with something that looks like a block
> > device (which seems to be implied by adding transparent wear leveling
>
> Nope, not the end goal. It's more about wear-leveling across the entire
> flash chip than it is presenting a "block like" device.
It seems to be about spanning devices and repartitioning as well.
Hence the analogy with LVM.
> > and bad block remapping), then I don't see any reason it can't be done
> > in device mapper. The 'smarts' of mtdblock could in fact be pulled up
>
> There is nothing smart about mtdblock. And mtdblock has nothing to do
> with UBI.
Note the scare quotes. Device mapper runs on top of a block device.
And mtdblock is currently the block interface that MTD exports. And it
has 'smarts' that hide handling of sub-eraseblock I/O. I'm clearly
talking about an approach that doesn't involve UBI at all.
> > In the end, a block device is something which does random access
> > block-oriented I/O. Disk and NAND both fit that description.
>
> NAND very much doesn't fit the "random access" part of that. For writes
> you have to write in incrementing pages within eraseblocks.
And? You can't do I/O smaller than a sector on a disk.
--
Mathematics is the supreme nostalgia of our time.
Yes, it can span multiple MTDs which spreads the wear-leveling even
more. Yes, it can create/resize/remove volumes. It does that
differently than LVM, but the ideas are related. I don't see the issue
here I guess.
(UBI also has static volumes which LVM doesn't but that is an aside.)
> > > and bad block remapping), then I don't see any reason it can't be done
> > > in device mapper. The 'smarts' of mtdblock could in fact be pulled up
> >
> > There is nothing smart about mtdblock. And mtdblock has nothing to do
> > with UBI.
>
> Note the scare quotes. Device mapper runs on top of a block device.
> And mtdblock is currently the block interface that MTD exports. And it
> has 'smarts' that hide handling of sub-eraseblock I/O. I'm clearly
> talking about an approach that doesn't involve UBI at all.
Ok, but what I'm saying is that using device mapper on top of mtdblock
is not a good solution. mtdblock caches writes within an eraseblock to
a DRAM buffer of eraseblock size. If you get a power failure before
that is flushed out, you lose an entire eraseblock's worth of data.
Oops. And if you constantly flush the buffer, there's no point in
having it in the first place because it doesn't help or hide anything
then. UBI doesn't have this problem.
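The failure window Josh describes can be made concrete with a toy model
(illustrative names, not mtdblock's actual code): the flush has to erase the
physical block before writing the merged buffer back, and a power cut
between those two steps loses the whole eraseblock, not just the sector
being updated.

```c
#include <assert.h>
#include <string.h>

enum { SECT_SIZE = 512, EB_SIZE = 8 * SECT_SIZE };

struct mtdblock_emu {
	unsigned char cache[EB_SIZE];	/* DRAM write buffer */
	unsigned char flash[EB_SIZE];	/* simulated eraseblock on flash */
	int dirty;
};

static void emu_write_sector(struct mtdblock_emu *e, int sect,
			     const unsigned char *data)
{
	memcpy(e->cache + sect * SECT_SIZE, data, SECT_SIZE);
	e->dirty = 1;			/* nothing has reached flash yet */
}

static void emu_flush(struct mtdblock_emu *e)
{
	memset(e->flash, 0xff, EB_SIZE);	/* erase: old data is gone... */
	/* a power failure here loses the ENTIRE eraseblock */
	memcpy(e->flash, e->cache, EB_SIZE);	/* ...until the rewrite lands */
	e->dirty = 0;
}
```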
That's why I suggested fixing the MTD layers that present block devices
first in the part of my reply that you cut off. It seems to me that
you're really after getting flash to look like a block device, which
would enable device mapper to be used for something similar to UBI.
That's fine, but until someone does that work UBI fills a need, has
users, and has an existing implementation.
josh
On Mon, 2007-03-19 at 12:08 -0500, Matt Mackall wrote:
> On Sun, Mar 18, 2007 at 03:31:50PM -0500, Josh Boyer wrote:
> > On Sun, Mar 18, 2007 at 02:18:12PM -0500, Matt Mackall wrote:
> > >
> > > I'm well aware of all that. I wrote a NAND driver just last month.
> > > Let's consider this table:
> > >
> > > HARD drives MTD device
> > > Consists of sectors Consists of eraseblocks
> > > Sectors are small (512, 1024 bytes) Eraseblocks are larger (32KiB, 128KiB)
> > > read sector and write sector read, write, and erase block
> > > Bad sectors are re-mapped Bad eraseblocks are not hidden
> > > HDD sectors don't wear out Eraseblocks get worn-out
> > N/A NAND flash addressed in pages
> > N/A NAND flash has OOB areas
> > N/A (?) NAND flash requires ECC
>
> Disks have OOB areas with ECC, it's just nicely hidden inside the
> drive. They also typically have physical sectors bigger than 512
> bytes, again hidden.
The difference is that the hard drive has an intelligent controller
which hides all this away. NAND FLASH has none, and we have to do it in
software.
> > > If the end goal is to end up with something that looks like a block
> > > device (which seems to be implied by adding transparent wear leveling
> >
> > Nope, not the end goal. It's more about wear-leveling across the entire
> > flash chip than it is presenting a "block like" device.
>
> It seems to be about spanning devices and repartitioning as well.
> Hence the analogy with LVM.
Yes, UBI is a kind of LVM for FLASH, and we did think for quite a while
about reusing LVM before we went the UBI way.
> > > and bad block remapping), then I don't see any reason it can't be done
> > > in device mapper. The 'smarts' of mtdblock could in fact be pulled up
> >
> > There is nothing smart about mtdblock. And mtdblock has nothing to do
> > with UBI.
>
> Note the scare quotes. Device mapper runs on top of a block device.
> And mtdblock is currently the block interface that MTD exports. And it
> has 'smarts' that hide handling of sub-eraseblock I/O. I'm clearly
> talking about an approach that doesn't involve UBI at all.
MTD block has no 'smarts' at all. It is a stupid and broken hack, which
you can utilize to lose data and wear your FLASH out.
> > > In the end, a block device is something which does random access
> > > block-oriented I/O. Disk and NAND both fit that description.
> >
> > NAND very much doesn't fit the "random access" part of that. For writes
> > you have to write in incrementing pages within eraseblocks.
>
> And? You can't do I/O smaller than a sector on a disk.
Should we export block devices with a 16/32/64/128 KiB block size? If not,
we would need to put a lot of clever functionality into the MTD block
device code, which we decided to put into UBI instead, so FLASH-aware file
systems can use this shared functionality too.
If someone wants to implement an intelligent MTD block device which
allows running arbitrary filesystems, then it should be done on top of
UBI. It's not rocket science, but nobody bothers, as we have functional
FLASH filesystems which do their job better without any notion of a block
device.
A disk _IS_ fundamentally different to FLASH and all the magic which is
done inside of CF-Cards and USB-Sticks is just hiding this away. Most of
the controller chips in these devices are broken and I would never ever
store any important data on such.
The main points of UBI are:
- wear levelling across the complete device
- background handling of bitflips
- safe updates
- handling of static volumes, which are easily accessible for
bootloaders
None of this is anywhere near LVM and disks. The only LVM-like
feature is dynamic creation/deletion/resizing of volumes.
tglx
The issue is 14000 lines of patch to make a parallel subsystem.
> (UBI also has static volumes which LVM doesn't but that is an aside.)
If a static volume is simply a non-dynamic volume, then device mapper
can do that too. And countless other things. Which is not an aside.
UBI growing to do all the things that device mapper does is exactly
the thing we should be seeking to avoid.
> > > > and bad block remapping), then I don't see any reason it can't be done
> > > > in device mapper. The 'smarts' of mtdblock could in fact be pulled up
> > >
> > > There is nothing smart about mtdblock. And mtdblock has nothing to do
> > > with UBI.
> >
> > Note the scare quotes. Device mapper runs on top of a block device.
> > And mtdblock is currently the block interface that MTD exports. And it
> > has 'smarts' that hide handling of sub-eraseblock I/O. I'm clearly
> > talking about an approach that doesn't involve UBI at all.
>
> Ok, but what I'm saying is that using device mapper on top of mtdblock
> is not a good solution. mtdblock caches writes within an eraseblock to
> a DRAM buffer of eraseblock size. If you get a power failure before
> that is flushed out, you lose an entire eraseblock's worth of data.
Sigh. That's precisely why I talked about moving said smarts. This is
nothing that a higher level remapping layer can't address.
> That's why I suggested fixing the MTD layers that present block devices
> first in the part of my reply that you cut off. It seems to me that
> you're really after getting flash to look like a block device, which
> would enable device mapper to be used for something similar to UBI.
> That's fine, but until someone does that work UBI fills a need, has
> users, and has an existing implementation.
False starts that get mainlined delay or prevent things getting done
right. The question is and remains "is UBI the right way to do
things?" Not "is UBI the easiest way to do things?" or "is UBI
something people have already adopted?"
If the right way is instead to extend the block layer and device
mapper to encompass the quirks of NAND in a sensible fashion, then UBI
should not go in.
Let me draw a picture so we have something to argue about:
iSCSI/nbd(6)
|
filesystem { swap | ext3 ext3 jffs2
\ | | | /
/ \ | dm-crypt->snapshot(5) /
device mapper -| \ \ | /
| partitioning /
| | partitioning(4)
| wear leveling(3) /
| | /
| block concatenation
| | | | |
\ bad block remapping(2)
| | | |
MTD raw block { raw block devices with no smarts(1)
/ | \ \
hardware { NAND NAND NAND NAND
Notes:
1. This would provide a block device that allowed writing pages and
a secondary method for erasing whole blocks as well as a method for
querying/setting out of band information.
2. This would hide erase blocks either by using an embedded table or
out of band info. This could stack on top of block concatenation if
desired.
3. This would provide wear leveling, and probably simultaneously
provide relatively efficient and safe access to write sector
and page-sized I/O. Below this level, things had better be
comfortable with the limitations of NAND if they want to work well.
4. JFFS2 has its own wear-leveling scheme, as do several other
filesystems, so they probably want to bypass this piece of the stack.
5. We don't reimplement higher pieces of the stack (dm-crypt,
snapshot, etc.).
6. We make some things possible that simply aren't otherwise.
And this picture isn't even interesting yet. Imagine a dm-cache layer
that caches data read from disks in high-speed flash. Or using
dm-mirror to mirror writes to local flash over NBD or to a USB drive.
Neither of these can be done 'right' in a stack split between device
mapper and UBI.
--
Mathematics is the supreme nostalgia of our time.
I explained precisely what I meant by 'smarts' and why I put 'smarts'
in quotes. And here you are repeating the same exact damn thing I
responded to five lines up.
> > > > In the end, a block device is something which does random access
> > > > block-oriented I/O. Disk and NAND both fit that description.
> > >
> > > NAND very much doesn't fit the "random access" part of that. For writes
> > > you have to write in incrementing pages within eraseblocks.
> >
> > And? You can't do I/O smaller than a sector on a disk.
>
> Should we export block devices with 16/32/64/128 KiB size?
Sure, why not?
> A disk _IS_ fundamentally different to FLASH and all the magic which is
> done inside of CF-Cards and USB-Sticks is just hiding this away.
And yet they're still both block devices. That our current block layer
doesn't handle one as well as the other is something we should fix
instead of inventing a whole new full-featured but incompatible block
layer on the side.
--
Mathematics is the supreme nostalgia of our time.
It'll be much smaller after I remove "itsy-bitsy" and most of the
debugging stuff; that is in progress - wait for take 4.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Parallel systems have existed for a very long time. One is
flash->SW_or_HW_FTL->all_blkdev_stuff. The other is MTD->JFFS2. Think
about _why_ there are two of them. Hint: reliability and performance. Your
ranting basically says that only the first one makes sense. This is not
true.
We enhance the second branch, not the first; please realize this. Both
branches have their user base, and always have had.
> iSCSI/nbd(6)
> |
> filesystem { swap | ext3 ext3 jffs2
> \ | | | /
> / \ | dm-crypt->snapshot(5) /
> device mapper -| \ \ | /
> | partitioning /
> | | partitioning(4)
> | wear leveling(3) /
> | | /
> | block concatenation
> | | | | |
> \ bad block remapping(2)
> | | | |
> MTD raw block { raw block devices with no smarts(1)
> / | \ \
> hardware { NAND NAND NAND NAND
Matt, as I pointed out in the first mail, flash != block device. In your
picture I see NAND->MTD raw block. So am I right that you assume that we
already have a decent FTL? The fact is that we do not.
Please bear in mind that a decent FTL is difficult, and an FS on top of
an FTL is slow; the FTL hits performance considerably.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Simply because we want the ability to do fine-grained writes in order
to write data safely to FLASH. If we export those large sizes we
lose this ability and have to write full erase blocks for a couple of
bytes. This simply breaks JFFS2, and you can do the math yourself what
that means for the lifetime of the FLASH when you write small data chunks
in fast sequences and want to make sure that they are written to FLASH
immediately.
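Doing that math with illustrative numbers (128 KiB eraseblocks rated for
100,000 erase cycles; these figures are for illustration, not from any
datasheet): each erase cycle lets a given amount of new data land in the
block, so total data writable before the block is worn out is simply cycles
times bytes-per-erase.

```c
#include <assert.h>

/* Back-of-envelope sketch of "do the math yourself": if every small
 * write forces a full eraseblock erase, endurance is consumed per
 * write, not per eraseblock's worth of data. */
static long long bytes_until_worn_out(long long rated_cycles,
				      long long bytes_per_erase)
{
	return rated_cycles * bytes_per_erase;
}
```

With these numbers, merging writes into full 128 KiB blocks yields roughly
12 GiB of lifetime data per block, while erasing a full block for every
64-byte chunk yields only about 6.4 MB: a factor of 2048 difference.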
> > A disk _IS_ fundamentally different to FLASH and all the magic which is
> > done inside of CF-Cards and USB-Sticks is just hiding this away.
>
> And yet they're still both block devices. That our current block layer
> doesn't handle one as well as the other is something we should fix
> instead of inventing a whole new full-featured but incompatible block
> layer on the side.
And yet they are still broken and unreliable. And you can wear them out
in no time, just because they are stupid and do full eraseblock updates
when you write one sector.
No thanks. A bunch of people have done experiments with those beasts, and
they are unusable for environments where we need to make sure that
data is on FLASH.
UBI is not an incompatible block layer. It makes it possible to implement
a very clever block layer on top. And you can use just one large partition
plus small ones for your kernel image and bootloader, which still get the
benefits of data integrity (by doing background safe copies on bit
flips) and of an easy implementation in an IPL.
tglx
No, it can't: device mapper sits on top of block devices, and FLASH is no
block device. Period.
Device mapper cannot provide a simple, easy-to-decode scheme for boot
loaders. We need to be able to boot out of 512-2048 bytes of NAND FLASH
and be able to find the kernel or second-stage boot loader in this
unordered device.
And no, fixed addresses do not work. Do you want to implement device
mapper in your Initial Bootloader stage?
> > That's why I suggested fixing the MTD layers that present block devices
> > first in the part of my reply that you cut off. It seems to me that
> > you're really after getting flash to look like a block device, which
> > would enable device mapper to be used for something similar to UBI.
> > That's fine, but until someone does that work UBI fills a need, has
> > users, and has an existing implementation.
>
> False starts that get mainlined delay or prevent things getting done
> right. The question is and remains "is UBI the right way to do
> things?" Not "is UBI the easiest way to do things?" or "is UBI
> something people have already adopted?"
>
> If the right way is instead to extend the block layer and device
> mapper to encompass the quirks of NAND in a sensible fashion, then UBI
> should not go in.
No, a block layer on top of FLASH needs 80% of the functionality of UBI in
the first place. You need to implement a clever journalling block device
emulator in order to keep the data alive and the FLASH from being worn out
in no time. You need the wear levelling; otherwise you can throw
away your FLASH in no time.
> Let me draw a picture so we have something to argue about:
>
> iSCSI/nbd(6)
> |
> filesystem { swap | ext3 ext3 jffs2
> \ | | | /
> / \ | dm-crypt->snapshot(5) /
> device mapper -| \ \ | /
> | partitioning /
> | | partitioning(4)
> | wear leveling(3) /
> | | /
> | block concatenation
> | | | | |
> \ bad block remapping(2)
> | | | |
> MTD raw block { raw block devices with no smarts(1)
> / | \ \
> hardware { NAND NAND NAND NAND
>
> Notes:
> 1. This would provide a block device that allowed writing pages and
> a secondary method for erasing whole blocks as well as a method for
> querying/setting out of band information.
Forget about OOB data. OOB data is reserved for ECC. Please read the
recommendations of the NAND FLASH manufacturers. NAND gets less reliable
with higher density devices and smaller processes.
> 2. This would hide erase blocks either by using an embedded table or
> out of band info. This could stack on top of block concatenation if
> desired.
Hide erase blocks? UBI does not hide anything. It maps logical
eraseblocks, which are exposed to the clients, to arbitrary physical
eraseblocks on the FLASH device in order to provide across-device wear
levelling.
This is fundamentally different from device mapper.
> 3. This would provide wear leveling, and probably simultaneously
> provide relatively efficient and safe access to write sector
> and page-sized I/O. Below this level, things had better be
> comfortable with the limitations of NAND if they want to work well.
I don't see how this provides across-device wear levelling.
> 4. JFFS2 has its own wear-leveling scheme, as do several other
> filesystems, so they probably want to bypass this piece of the stack.
JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2's own
wear levelling sucks.
> 5. We don't reimplement higher pieces of the stack (dm-crypt,
> snapshot, etc.).
Why should we reimplement that?
> 6. We make some things possible that simply aren't otherwise.
>
> And this picture isn't even interesting yet. Imagine a dm-cache layer
> that caches data read from disks in high-speed flash. Or using
> dm-mirror to mirror writes to local flash over NBD or to a USB drive.
> Neither of these can be done 'right' in a stack split between device
> mapper and UBI.
Err. Implement a clever block layer on top of UBI and use all the
goodies you want including device mapper.
tglx
A better way would be for MTD to deliver a block dev with a rich
enough interface for JFFS2 to use efficiently in the first place. Yes,
I know that can't be done with the current block dev layer. But that's
what the source is for.
> We enhance the second branch, not the first, please, realize this. Both
> branches have their user base, and have always had.
>
> > iSCSI/nbd(6)
> > |
> > filesystem { swap | ext3 ext3 jffs2
> > \ | | | /
> > / \ | dm-crypt->snapshot(5) /
> > device mapper -| \ \ | /
> > | partitioning /
> > | | partitioning(4)
> > | wear leveling(3) /
> > | | /
> > | block concatenation
> > | | | | |
> > \ bad block remapping(2)
> > | | | |
> > MTD raw block { raw block devices with no smarts(1)
> > / | \ \
> > hardware { NAND NAND NAND NAND
>
> Matt, as I pointed in the first mail, flash != block device.
And as I pointed out, you're wrong. It is both block oriented
(eraseBLOCK??) and random access. That's what a block device is. The
fact that it doesn't look like the other things that Linux currently
calls a block device and supports well is another matter.
> In your picture I see NAND->MTD raw block. So am I right that you
> assume that we already have a decent FTL? The fact is that we do
> not.
No. Look at the picture for more than two seconds, please.
I can tell you didn't do this because you didn't manage to find (1)
which explicitly says "with no smarts". And you also cut out the footnote
where I explained what I meant by "with no smarts".
Find the spots marked (2) and (3). These are your FTL.
> Please, bear in mind that decent FTL is difficult and an FS on top of
> FTL is slow, FTL hits performance considerably.
...and if you'd actually looked at the picture, you'd have seen JFFS2
bypassing it. Along with another footnote explaining it.
--
Mathematics is the supreme nostalgia of our time.
Which of the following two properties does it lack?
- discrete blocks
- non-sequential access to blocks
When you do the obvious s/blocks/eraseblocks/, this appears to be
true.
Saying "but I can't do I/O smaller than the blocksize" doesn't change
this any more than it would for disks.
Saying "but I can do smaller I/O efficiently in some circumstances"
also doesn't change it.
In historical UNIX, some tapes were block devices too. Because they
supported seek().
> Device mapper can not provide a simple easy to decode scheme for boot
> loaders. We need to be able to boot out of 512 - 2048 byte of NAND FLASH
> and be able to find the kernel or second stage boot loader in this
> unordered device.
>
> And no, fixed addresses do not work. Do you want to implement device
> mapper into your Initialial Bootloader stage ?
This is exactly the same problem as booting on a desktop PC. But
somehow LILO manages. My first Linux box had a hell of a lot less disk
than the platform I bootstrapped (and wrote NAND drivers for) last
month had in NAND.
> > > That's why I suggested fixing the MTD layers that present block devices
> > > first in the part of my reply that you cut off. It seems to me that
> > > you're really after getting flash to look like a block device, which
> > > would enable device mapper to be used for something similar to UBI.
> > > That's fine, but until someone does that work UBI fills a need, has
> > > users, and has an existing implementation.
> >
> > False starts that get mainlined delay or prevent things getting done
> > right. The question is and remains "is UBI the right way to do
> > things?" Not "is UBI the easiest way to do things?" or "is UBI
> > something people have already adopted?"
> >
> > If the right way is instead to extend the block layer and device
> > mapper to encompass the quirks of NAND in a sensible fashion, then UBI
> > should not go in.
>
> No, block layer on top of FLASH needs 80% of the functionality of UBI in
> the first place.
Incorrect. A block-based filesystem on top of flash needs this
functionality. But a block device suitable to device mapper layering
(which then provides the functionality) does not.
> You need to implement a clever journalling block device
> emulator in order to keep the data alive and the FLASH not weared out
> within no time. You need the wear levelling, otherwise you can throw
> away your FLASH in no time.
And that's why it's in my picture.
Sorry, I meant hiding bad blocks here. That's why this layer was
labeled "bad block remapping".
> > 3. This would provide wear leveling, and probably simultaneously
> > provide relatively efficient and safe access to write sector
> > and page-sized I/O. Below this level, things had better be
> > comfortable with the limitations of NAND if they want to work well.
>
> I don't see how this provides across device wear levelling.
Because the layer immediately beneath it ("block concatenation") takes
N devices and presents one logical device.
> > 4. JFFS2 has its own wear-leving scheme, as do several other
> > filesystems, so they probably want to bypass this piece of the stack.
>
> JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2s own
> wear levelling sucks.
Ok, fine. How about LogFS, then?
> > 5. We don't reimplement higher pieces of the stack (dm-crypt,
> > snapshot, etc.).
>
> Why should we reimplement that ?
So that you can get encryption and snapshot, etc.?
> > 6. We make some things possible that simply aren't otherwise.
> >
> > And this picture isn't even interesting yet. Imagine a dm-cache layer
> > that caches data read from disks in high-speed flash. Or using
> > dm-mirror to mirror writes to local flash over NBD or to a USB drive.
> > Neither of these can be done 'right' in a stack split between device
> > mapper and UBI.
>
> Err. Implement a clever block layer on top of UBI and use all the
> goodies you want including device mapper.
If I wanted to have both device mapper and device mapper's little
brother in my kernel, I wouldn't have started this thread.
--
Mathematics is the supreme nostalgia of our time.
It appears to be, but it is not. You enforce semantics on a device which
it does not have.
> Saying "but I can't do I/O smaller than the blocksize" doesn't change
> this any more than it would for disks.
There is a huge difference. Disk block size is 512 bytes, while FLASH
erase block size is at minimum 16KiB and up to 256KiB.
Just do the math:
Write sampling data streams in 2KiB chunks to your uber device mapper on
a 1GiB device with 64KiB erase block size:
Fine grained FLASH aware writes allow 32 chunks in a block without
erasing the block.
Your method erases the block 32 times to write the same amount of data.
Result: You wear out the flash 32 times faster. Cool feature.
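The rule of three above can be spelled out as a back-of-envelope
calculation (the chunk and eraseblock sizes are the ones from the
example; the code is purely illustrative):

```python
# Back-of-envelope check of the wear figures above (illustrative only).
CHUNK = 2 * 1024         # application writes 2 KiB at a time
ERASEBLOCK = 64 * 1024   # NAND eraseblock size from the example

chunks_per_block = ERASEBLOCK // CHUNK  # 32 chunks fit into one eraseblock

# Flash-aware writes append chunk after chunk and erase the block only
# once, when it is reclaimed; an eraseblock-sized block device must erase
# and rewrite the whole 64 KiB block for every single 2 KiB chunk.
erases_flash_aware = 1
erases_block_device = chunks_per_block

print(chunks_per_block)       # chunks written per erase, flash-aware
print(erases_block_device)    # erases for the same data, eraseblock I/O
```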
> Saying "but I can do smaller I/O efficiently in some circumstances"
> also doesn't change it.
We can do it under _any_ circumstances and that _does_ change it.
Implementing a clever block device layer on top of UBI is simple and
would provide FLASH page sized I/O, i.e. 2KiB in the above example.
> In historical UNIX, some tapes were block devices too. Because they
> supported seek().
I'm impressed. How exactly are "some tapes" comparable to FLASH chips ?
Your next proposal is to throw away MTD-utils and use "mt" instead ?
> > Device mapper can not provide a simple easy to decode scheme for boot
> > loaders. We need to be able to boot out of 512 - 2048 byte of NAND FLASH
> > and be able to find the kernel or second stage boot loader in this
> > unordered device.
> >
> > And no, fixed addresses do not work. Do you want to implement device
> > mapper into your Initialial Bootloader stage ?
>
> This is exactly the same problem as booting on a desktop PC. But
> somehow LILO manages. My first Linux box had a hell of a lot less disk
> than the platform I bootstrapped (and wrote NAND drivers for) last
> month had in NAND.
No, it is not. You get the absolute sector address of your second stage
and this is a complete no-brainer. The translation is done in the DISK
device.
You simply ignore the fact that inside each disk, USB Stick, CF-CARD,
whatever - there is a more or less intelligent controller device, which
does the mapping to the physical storage location. There is _NO_ such
thing on a bare FLASH chip.
It does not matter whether your embedded device had more NAND space
than my old CP/M machine's floppy. It simply matters that even the old
CP/M floppy device had some rudimentary intelligence on board.
Furthermore, I want to be able to get the bitflip correction on my
second stage loader / kernel in the same safe way as we do it for
everything else and still be able to bootstrap that from an extremely
small bootloader.
> > > If the right way is instead to extend the block layer and device
> > > mapper to encompass the quirks of NAND in a sensible fashion, then UBI
> > > should not go in.
> >
> > No, block layer on top of FLASH needs 80% of the functionality of UBI in
> > the first place.
>
> Incorrect. A block-based filesystem on top of flash needs this
> functionality. But a block device suitable to device mapper layering
> (which then provides the functionality) does not.
How exactly does device mapper:
A) do across-device wear levelling ?
B) provide dynamic partitioning for FLASH aware file systems ?
C) do across-device wear levelling for FLASH aware file systems ?
D) do background bit-flip correction (copying affected blocks and
recycling the old ones) ?
E) allow position independent placement of the second stage bootloader ?
> > You need to implement a clever journalling block device
> > emulator in order to keep the data alive and the FLASH not weared out
> > within no time. You need the wear levelling, otherwise you can throw
> > away your FLASH in no time.
>
> And that's why it's in my picture.
Yes, it is in your picture, but:
1) it excludes FLASH aware file systems and UBI does not.
2) your picture still does not explain how it achieves the above A),
B), C), D) and E)
Your extra path for partitioning(4) and JFFS2 is just a weird hack,
which makes your proposal completely absurd.
> > > Let me draw a picture so we have something to argue about:
> > >
> > > iSCSI/nbd(6)
> > > |
> > > filesystem { swap | ext3 ext3 jffs2
> > > \ | | | /
> > > / \ | dm-crypt->snapshot(5) /
> > > device mapper -| \ \ | /
> > > | partitioning /
> > > | | partitioning(4)
> > > | wear leveling(3) /
> > > | | /
> > > | block concatenation
> > > | | | | |
> > > \ bad block remapping(2)
> > > | | | |
> > > MTD raw block { raw block devices with no smarts(1)
> > > / | \ \
> > > hardware { NAND NAND NAND NAND
> > > Notes:
Let me draw an UBI picture:
VFS -Layer
|
(Future VFS-crypto)
/ \
| |
______ device mapper / fs |
/ | |
/ | |
| Do whatever you don't |
| want to do with FLASH |
| whether it makes sense |
| or not. |
| | |
Block layer UBI block device Flash aware filesystems
| \ /
device \ /
driver |
| |
_____|_______ |
| | |
| Device | UBI
| resident | |
| "UBI" | |
|___________| |
|
MTD-CORE
|
nand base driver
/ | \
device driver device driver device driver
| |
NAND NAND NAND
No notes: it's simple and self-explanatory.
> > > 1. This would provide a block device that allowed writing pages and
> > > a secondary method for erasing whole blocks as well as a method for
> > > querying/setting out of band information.
> >
> > Forget about OOB data. OOB data is reserved for ECC. Please read the
> > recommendations of the NAND FLASH manufacturers. NAND gets less reliable
> > with higher density devices and smaller processes.
> >
> > > 2. This would hide erase blocks either by using an embedded table or
> > > out of band info. This could stack on top of block concatenation if
> > > desired.
> >
> > Hide erase blocks ? UBI does not hide anything. It maps logical
> > eraseblocks, which are exposed to the clients to arbitrary physical
> > eraseblocks on the FLASH device in order to provide across device wear
> > levelling.
>
> Sorry, I meant hiding bad blocks here. That's why this layer was
> labeled "bad block remapping".
Oh well, a separate bad block remapper. And how is the logical mapping
of wear-levelled erase blocks done ?
> > > 3. This would provide wear leveling, and probably simultaneously
> > > provide relatively efficient and safe access to write sector
> > > and page-sized I/O. Below this level, things had better be
> > > comfortable with the limitations of NAND if they want to work well.
> >
> > I don't see how this provides across device wear levelling.
>
> Because the layer immediately beneath it ("block concatenation") takes
> N devices and presents one logical device.
And how is the wear levelling done on this logical device in
device mapper ?
How is it ensured that the wear average is maintained also across the
partitions which are used by JFFS2 or other FLASH aware filesystems ?
> > > 4. JFFS2 has its own wear-leving scheme, as do several other
> > > filesystems, so they probably want to bypass this piece of the stack.
> >
> > JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2s own
> > wear levelling sucks.
>
> Ok, fine. How about LogFS, then?
LogFS can easily leverage UBI's wear algorithm.
> > > 5. We don't reimplement higher pieces of the stack (dm-crypt,
> > > snapshot, etc.).
> >
> > Why should we reimplement that ?
>
> So that you can get encryption and snapshot, etc.?
1. On top of a clever block device.
2. UBI can do snapshots by design.
3. Encryption should be done on the VFS layer and not below the
filesystem layer. Doing it inside the block layer or the device mapper
is broken by design.
> > > 6. We make some things possible that simply aren't otherwise.
> > >
> > > And this picture isn't even interesting yet. Imagine a dm-cache layer
> > > that caches data read from disks in high-speed flash. Or using
> > > dm-mirror to mirror writes to local flash over NBD or to a USB drive.
> > > Neither of these can be done 'right' in a stack split between device
> > > mapper and UBI.
> >
> > Err. Implement a clever block layer on top of UBI and use all the
> > goodies you want including device mapper.
>
> If I wanted to have both device mapper and device mapper's little
> brother in my kernel, I wouldn't have started this thread.
You still did not explain how device mapper does:
- across-device wear levelling
- dynamic partitioning for FLASH aware file systems
- across-device wear levelling for FLASH aware file systems
- simple boot loader support
- fine grained I/O
UBI is not device mapper's little brother. It is the software version of
the silicon in a CF-CARD / USB-Stick, but it does a better job and
allows clever usage of FLASH instead of enforcing eraseblock-sized I/O
units. Does your CF-Card / USB-Stick do that ?
Just think about the 1GiB USB stick, which would present you 64KiB I/O
units instead of 2KiB ones.
Your signature is a nice intellectual signboard, but the ancient simple
rule of three just tells me that you are off by a factor of 32.
tglx
Why the hell would JFFS2 need a block device interface ?
What's the gain ?
> > We enhance the second branch, not the first, please, realize this. Both
> > branches have their user base, and have always had.
> >
> > > iSCSI/nbd(6)
> > > |
> > > filesystem { swap | ext3 ext3 jffs2
> > > \ | | | /
> > > / \ | dm-crypt->snapshot(5) /
> > > device mapper -| \ \ | /
> > > | partitioning /
> > > | | partitioning(4)
> > > | wear leveling(3) /
> > > | | /
> > > | block concatenation
> > > | | | | |
> > > \ bad block remapping(2)
> > > | | | |
> > > MTD raw block { raw block devices with no smarts(1)
> > > / | \ \
> > > hardware { NAND NAND NAND NAND
> >
> > Matt, as I pointed in the first mail, flash != block device.
>
> And as I pointed out, you're wrong. It is both block oriented
> (eraseBLOCK??) and random access. That's what a block device is. The
> fact that it doesn't look like the other things that Linux currently
> calls a block device and supports well is another matter.
It very much does matter, as it is not a block device. It is a FLASH
device, and you can do as many comparisons of eraseBLOCK as you want,
you do not turn FLASH into a DISK.
Again: disks (including CF-Cards and USB-Sticks) have intelligent
controllers, which abstract the hardware oddities away and present you a
block device.
> > In your picture I see NAND->MTD raw block. So am I right that you
> > assume that we already have a decent FTL? The fact is that we do
> > not.
>
> No. Look at the picture for more than two seconds, please.
>
> I can tell you didn't do this because you didn't manage to find (1)
> which explicitly says "with no smarts". And you also cut out the footnote
> where I explained what I meant by "with no smarts".
>
> Find the spots marked (2) and (3). These are your FTL.
And where please are (2) and (3) inside of device mapper ?
> > Please, bear in mind that decent FTL is difficult and an FS on top of
> > FTL is slow, FTL hits performance considerably.
>
> ...and if you'd actually looked at the picture, you'd have seen JFFS2
> bypassing it. Along with another footnote explaining it.
The (4) partitioning and JFFS2 on top is a step back from the current
UBI functionality. With UBI we can have resizable partitioning even for
JFFS2, and JFFS2 can utilize the UBI wear levelling, which is way better
than the crude heuristics of JFFS2.
You want to force FLASH into device mapper for some strange and
non-obvious reason. Just the coincidence of "eraseBLOCK" and
"BLOCKdevice" is not really convincing.
You impose the usage of eraseblock size on FLASH, which is simply wrong:
DISK has a 1:1 relationship of "eraseblock" and minimal I/O. FLASH has
not. I did the math in a different mail and I'm not buying your
factor-32 FLASH lifetime reduction for the price of having a bunch of
lines of code less in the kernel.
If you really want to run ext3, xfs or whatever on top of FLASH, please
go and do the homework on CF-Cards and USB-Sticks. Run them into the
fast wearout death. And device mapper does nothing to avoid that.
Running ext3 on top of FLASH with a minimal I/O size of erase block size
is simply braindead.
tglx
Sigh. That's the current /dev/mtdblock method, not my method. You're too
fixated on what you think I'm saying to hear what I'm saying.
> > Saying "but I can do smaller I/O efficiently in some circumstances"
> > also doesn't change it.
>
> We can do it under _any_ circumstances and that _does_ change it.
> Implementing a clever block device layer on top of UBI is simple and
> would provide FLASH page sized I/O, i.e. 2Kib in the above example.
Yes. I know. I've written a complete (non-Linux) FTL. I know what's
entailed.
> > In historical UNIX, some tapes were block devices too. Because they
> > supported seek().
>
> I'm impressed. How exactly are "some tapes" comparable to FLASH chips ?
>
> Your next proposal is to throw away MTD-utils and use "mt" instead ?
Don't be an ass. I'm pointing out that not all block devices are disks.
> > > Device mapper can not provide a simple easy to decode scheme for boot
> > > loaders. We need to be able to boot out of 512 - 2048 byte of NAND FLASH
> > > and be able to find the kernel or second stage boot loader in this
> > > unordered device.
> > >
> > > And no, fixed addresses do not work. Do you want to implement device
> > > mapper into your Initialial Bootloader stage ?
> >
> > This is exactly the same problem as booting on a desktop PC. But
> > somehow LILO manages. My first Linux box had a hell of a lot less disk
> > than the platform I bootstrapped (and wrote NAND drivers for) last
> > month had in NAND.
>
> No, it is not. You get the absolute sector address of your second stage
> and this is a complete nobrainer. The translation is done in the DISK
> device.
LILO and friends manage to boot systems that use software RAID and
LVM. There are multiple methods. Some use block lists, some use tiny
boot partitions, etc. All of them are applicable to controllerless NAND.
> You simply ignore the fact, that inside each disk, USB Stick, CF-CARD,
> whatever - there is a more or less intellegent controller device, which
> does the mapping to the physical storage location. There is _NO_ such
> thing on a bare FLASH chip.
How many times do I have to tell you that I wrote a driver for
controllerless NAND just last month?
> How exactly does device mapper:
>
> A) across device wear levelling ?
The same way UBI does, but encapsulated in a device mapper layer.
> B) dynamic partitioning for FLASH aware file systems ?
See above.
> C) across device wear levelling for FLASH aware file systems ?
See above.
> D) background bit-flip corrections (copying affected blocks and recylce
> the old one) ?
See above.
> E) allow position independent placement of the second stage bootloader ?
See way above to my LILO response.
> > > You need to implement a clever journalling block device
> > > emulator in order to keep the data alive and the FLASH not weared out
> > > within no time. You need the wear levelling, otherwise you can throw
> > > away your FLASH in no time.
> >
> > And that's why it's in my picture.
>
> Yes, it is in your picture, but:
>
> 1) it excludes FLASH aware file systems and UBI does not.
> 2) your picture does still not explain how it does achive the above A),
> B), C), D) and E)
>
> Your extra path for partitioning(4) and JFFS2 is just a weird hack,
> which makes your proposal completely absurd.
No, it's just there to show the flexibility of device mapper. But I have
the sneaking suspicion you have no idea how device mapper works.
In brief: device mapper takes one or more devices, applies a mapping
to them, and returns a new device. For example, take various spans of
/dev/hda1 and /dev/sda3 and present them as new-device1. Take
new-device1 and transform it with dm-crypt to get new-device2. The
kernel doesn't decide how to do this, any more than it decides where
to mount your filesystems. Userspace does.
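The mapping described above can be sketched in a few lines (the device
names and span lengths are invented for this illustration; real dm
tables are loaded from userspace with dmsetup and are expressed in
512-byte sectors):

```python
# Illustrative sketch of device mapper's linear concatenation: a table of
# (length, backing device, offset) spans presented as one logical device.
# Device names and span sizes are made up for this example.
table = [
    (1000, "/dev/hda1", 500),  # logical sectors 0..999    -> hda1 at 500
    (2000, "/dev/sda3", 0),    # logical sectors 1000..2999 -> sda3 at 0
]

def map_sector(table, sector):
    """Translate a logical sector into (backing device, physical sector)."""
    start = 0
    for length, dev, offset in table:
        if sector < start + length:
            return dev, offset + (sector - start)
        start += length
    raise ValueError("sector beyond end of mapped device")

print(map_sector(table, 999))   # last sector of the first span
print(map_sector(table, 1000))  # first sector of the second span
```

Stacking works the same way: the resulting logical device can itself be
the backing device of another mapping (dm-crypt, snapshot, and so on).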
> > > > 5. We don't reimplement higher pieces of the stack (dm-crypt,
> > > > snapshot, etc.).
> > >
> > > Why should we reimplement that ?
> >
> > So that you can get encryption and snapshot, etc.?
>
> 1. On top of a clever block device.
>
> 2. UBI can do snapshots by design.
Oh, so you HAVE reimplemented it.
> 3. Encryption should be done on the VFS layer and not below the
> filesystem layer. Doing it inside the block layer or the device mapper
> is broken by design.
That's highly debatable and not a topic for this thread.
--
Mathematics is the supreme nostalgia of our time.
Yes, by using fixed addresses, which is not what I want.
> > You simply ignore the fact, that inside each disk, USB Stick, CF-CARD,
> > whatever - there is a more or less intellegent controller device, which
> > does the mapping to the physical storage location. There is _NO_ such
> > thing on a bare FLASH chip.
>
> How many times do I have to tell you that I wrote a driver for
> controllerless NAND just last month?
Wow. I'm impressed because I'm pulling my opinion out of thin air.
> > How exactly does device mapper:
> >
> > A) across device wear levelling ?
>
> The same way UBI does, but encapsulated in a device mapper layer.
Does the device mapper do that ?
> > B) dynamic partitioning for FLASH aware file systems ?
>
> See above.
Does the device mapper do that ?
> > C) across device wear levelling for FLASH aware file systems ?
>
> See above.
Look at your own drawing.
> > D) background bit-flip corrections (copying affected blocks and recylce
> > the old one) ?
>
> See above.
Repeating patterns do not impress me. Your drawing tells otherwise.
> > E) allow position independent placement of the second stage bootloader ?
>
> See way above to my LILO response.
Neither LILO nor GRUB has search capabilities for randomly located
second stage loaders.
> > > > You need to implement a clever journalling block device
> > > > emulator in order to keep the data alive and the FLASH not weared out
> > > > within no time. You need the wear levelling, otherwise you can throw
> > > > away your FLASH in no time.
> > >
> > > And that's why it's in my picture.
> >
> > Yes, it is in your picture, but:
> >
> > 1) it excludes FLASH aware file systems and UBI does not.
> > 2) your picture does still not explain how it does achive the above A),
> > B), C), D) and E)
> >
> > Your extra path for partitioning(4) and JFFS2 is just a weird hack,
> > which makes your proposal completely absurd.
>
> No, it's just there to show the flexibility of device mapper. But I have
> the sneaking suspicion you have no idea how device mapper works.
Sigh. Layering violation == flexibility.
> In brief: device mapper takes one or more devices, applies a mapping
> to them, and returns a new device. For example, take various spans of
> /dev/hda1 and /dev/sda3 and present them as new-device1. Take
> new-device1 and transform it with dm-crypt to get new-device2. The
> kernel doesn't decide how to do this, any more than it decides where
> to mount your filesystems. Userspace does.
I know how it works. But your blurb does not answer any of my questions.
> > > > > 5. We don't reimplement higher pieces of the stack (dm-crypt,
> > > > > snapshot, etc.).
> > > >
> > > > Why should we reimplement that ?
> > >
> > > So that you can get encryption and snapshot, etc.?
> >
> > 1. On top of a clever block device.
> >
> > 2. UBI can do snapshots by design.
>
> Oh, so you HAVE reimplemented it.
No, it already works.
> > 3. Encryption should be done on the VFS layer and not below the
> > filesystem layer. Doing it inside the block layer or the device mapper
> > is broken by design.
>
> That's highly debatable and not a topic for this thread.
I see, you define what has to be discussed.
tglx
This is where we disagree obviously. However, getting UBI into mainline
won't delay or prevent your proposal from getting done. That's like
saying having ext3 in mainline prevents other filesystems from getting
created. There is nothing wrong with having different subsystems that
overlap in a few areas.
What you're proposing seems like it would take at least several weeks to
even get close to what is needed in terms of reliability and the
required wear-leveling if it is indeed possible to implement. And it
would likely duplicate some of the wear-leveling and bad block handling
code that is present in UBI anyway. In the meantime, the need for UBI
exists today and there is a working, tested implementation available.
josh
You have failed to clearly define what "block" means until now, yet you
blame me for not understanding you. So let's assume block = eraseblock
for the further conversation.
OK. Suppose we have done what you say, although I _do not_ think it
makes a lot of sense. So, now we have a block device with a 128KiB block
size. We have LVM, dm-wl or whatever stuff. Fine.
Do you realize that 128KiB is a _huge_ block size, and performance will
suck, and suck a lot, if you utilize, say, ext2 or whatever block device
FS?
Do you realize that I may not be satisfied with slow I/O? Do I have the
right to a faster one? Thanks if yes.
To make it faster I have to have a way to do finer grained I/O:
read/write to different positions of a 128KiB block. Do you realize how
much you will abuse all the generic block device infrastructure if you
try to add this? Note, all levels up to LVM will need to have this. I
believe it is braindead ((c) tglx) to add this feature.
Also, in UBI we have the following features:
1. data type hints: you may basically help UBI to pick the optimal
eraseblock if you specify the data life-time - is it long-lived data, or
short-lived/temporary data.
2. Some other ones, which I do not want to describe now.
Are you offering to add this stuff to device mapper?
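The point of the life-time hint can be illustrated with a toy picker
(this is not UBI's actual selection code, and the erase counts are made
up): short-lived data can go to the most-worn free eraseblock, since it
will be erased and freed again soon, while long-lived data should get
the least-worn one, so it does not pin a fresh block forever.

```python
# Toy illustration of data-lifetime hints (not UBI's real algorithm).
free_blocks = {0: 7, 1: 100, 2: 3, 3: 250}  # eraseblock -> erase count

def pick_eraseblock(blocks, lifetime):
    """Pick a free eraseblock according to the caller's lifetime hint."""
    if lifetime == "short":
        chosen = max(blocks, key=blocks.get)  # most worn: freed soon anyway
    else:
        chosen = min(blocks, key=blocks.get)  # least worn: held long-term
    del blocks[chosen]
    return chosen

print(pick_eraseblock(dict(free_blocks), "short"))  # most-worn block
print(pick_eraseblock(dict(free_blocks), "long"))   # least-worn block
```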
So, your approach only makes sense if you are going to work with flash
as a block device with block size = eraseblock size. No finer grained
access at all. It is fine, some users may be OK with this. But please,
do not be so naive - the performance will suck a _lot_. Let alone I
doubt it will really fit the DM infrastructure.
We are working on a different approach. And in general, the picture
which Thomas drew for you makes _much more_ sense. Please, do not be so
stuck to your way; it is not bad or good, it is just _different_, and it
has obvious limitations which we do not want to have, thus we go the
other way.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
As a suggestion, let's stop right here and see if we can get both
sides talking in a more constructive fashion. Maybe it's just me, but
I see both sides talking past each other in a rather dramatic fashion.
Linux seems to allow multiple implementations of things at the edge
(such as filesystems), but not at the core (device mapper went in, a
competing volume manager didn't; it's unlikely that we would have two
competing block device layers or two VFS layers, etc.). The question
then is whether UBI and dm are close enough that they should be one
subsystem or not.
There are a number of red herrings that have been introduced in this
discussion; of *course* the existing block device layer can handle
FLASH devices; Matt is proposing that they be extended. And of
*course* you wouldn't propose to use ext2 on top of a 128k blocksize,
any more than you would force a flash filesystem to use a 4k or 512
byte blocksize; there are plenty of configurations that won't make
sense, and by itself this isn't an indictment of the core idea that
the block device layer and dm should be augmented to encompass flash
functionality.
As far as who gets to do the work, unfortunately sometimes the people
submitting the new code have to make the changes suggested by the
reviewer. That's one of the prices that gets paid for mainline
inclusion. It's different when someone asks for a completely new
feature, especially for code that is already in mainline; then, "feel
free to send a patch" is perfectly accepted. But if it's a matter of
refactoring the code to fit in some other framework, that's often up
to the submitter to do, not the reviewer.
Of course, it remains to be seen whether or not this is a good idea to
do in the first place; but some of the arguments being used to shoot
down Matt's suggestion aren't really good ones to begin with.
> To make it faster I have to have a way to do finer grained I/O:
> read/write to different positions of 128KiB block. Do you realize how
> much you will abuse all the generic block device infrastructure if you
> try to add this? Note, all levels up to LVM will need to have this. A
> believe it is braindead ((c) tglx) to add this feature.
OK, and this could be it. But I suspect one of the things which may
be missing, and which would make it easier for you to explain why what
UBI is doing is so different from the dm and block device stack, is to
include the contents of:
http://www.linux-mtd.infradead.org/faq/ubi.html
http://www.linux-mtd.infradead.org/doc/ubi.html
and some system level documentation in a Documentation/ubi.txt file as
part of the patch set. I don't think people completely understand the
high-level architecture of what UBI is trying to achieve. What are
the interfaces at the top and the bottom of the stack? For example,
the fact that UBI exports Logical Erase blocks that are not a
power-of-two (possibly 128k minus 128 bytes) means that it certainly
might not be a good match for the dm stack. But why is that the case?
I can imagine good reasons for it, but a high-level description of the
design decisions would be very useful.
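The odd LEB size mentioned above falls out of simple arithmetic: UBI
keeps two small headers at the start of every physical eraseblock, and
each header must occupy at least one minimum-write unit of the flash.
The figures below are illustrative, not taken from the patch:

```python
# Why a logical eraseblock is not a power of two (illustrative figures):
# each physical eraseblock starts with two UBI headers, and each header
# occupies at least one minimum-write unit of the underlying flash.
def leb_size(peb_size, min_io_unit):
    ec_header = min_io_unit   # erase-counter header
    vid_header = min_io_unit  # volume-identifier header
    return peb_size - ec_header - vid_header

# Flash writable in 64-byte units: 128 KiB minus 128 bytes, as in the text.
print(leb_size(128 * 1024, 64))    # 130944
# NAND with 2 KiB pages: the headers cost two whole pages (128 KiB - 4 KiB).
print(leb_size(128 * 1024, 2048))  # 126976
```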
It would also help people understand why there are so many "units" in
UBI, since hopefully the high-level documentation would explain why
they fit together, and perhaps why some of the units weren't folded
together. What value do they add as separate components?
There are hints of the overall system architecture in some of the
individual comments for data structures, but even reading all of those,
there isn't quite enough for people to figure out what it is; and that
may be causing some of these comments of people saying there's too
much code to evaluate, or why didn't you do it *this* way?
Regards,
- Ted
Teo, the units will go away. I'll leave only 4 of them:
1. I/O, just to hide some I/O related complexities.
2. Scanning: just because I am planning to add other device attaching
methods, without scanning.
3. Wear-leveling, just because I want to improve the algorithm in the
future. Changing the algorithm means changing data structures, so I want
to keep them separate.
4. EBA - because I want to keep all mapping-related stuff in one place.
Well, this does not have to be called a unit, just mapping-related code
in one file. Also, the long-term plan is to have the table on-flash
(currently it is in RAM, which does not scale well).
Everything else will be folded together. No itsy-bitsy units. I've
almost finished this re-structuring and am now bug-fixing.
P.S.: I'll let other folks comment on the other stuff.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Perhaps, yes. Though I've been trying to be open to Matt's suggestions.
Please don't mistake confusion for hostility.
> There are a number of red herrings that have been introduced in this
> discussion; of *course* the existing block device layer can handle
> FLASH devices; Matt is proposing that they be extended. And of
Sure. But the larger question is *should* it be extended to do so.
> *course* you wouldn't propose to use ext2 on top of a 128k blocksize,
> any more than you would force a flash filesystem to use a 4k or 512
> byte blocksize; there are plenty of configurations that won't make
Except that flash filesystems don't use block devices at all. They use
MTD interfaces.
> sense, and by itself this isn't an indictment of the core idea that
> the block device layer and dm should be augmented to encompass flash
> functionality.
This is where the concept starts to lose me. Augmented how? To not use
MTD at all (obviously with the exception of the low-level flash
drivers)? How is that not going to duplicate MTD? Etc, etc.
Look at it from this point of view. MTD is the existing interface for
dealing with flash devices. UBI was written to solve issues with flash
devices. UBI was written on top of MTD. Suggesting that the UBI
developers go off and hack the block layer to work with flash devices
just to use dm seems completely foreign. Most of the boards UBI is used
on disable the block layer as much as they can because it's not needed.
This is the biggest source of confusion/contention. Making the somewhat
magical jump to representing flash as a block device, without a bit more
detail as to how it's really going to cope with the requirements of
flash and why it's a great idea to do so, is a bit hard to swallow.
Discussing the device mapper extensions is sort of pointless until this
is figured out.
> high-level architecture of what UBI is trying to achieve. What are
> the interfaces at the top and the bottom of the stack? For example,
> the fact that UBI exports Logical Erase blocks that are not a
> power-of-two (possibly 128k minus 128 bytes) means that it certainly
> might not be a good match for the dm stack. But why is that the case?
> I can imagine good reasons for it, but a high-level description of the
> design decisions would be very useful.
Basically because you need to store metadata in each eraseblock (and not
in OOB). That metadata consumes space, reducing the usable storage in
each eraseblock and making it no longer a power of two.
> It would also help people understand why there are so many "units" in
> UBI, since hopefully the high-level documentation would explain why
> they fit together, and perhaps why some of the units weren't folded
> together. What value do they add as separate components?
Artem is reworking the units per your (and others') suggestions. The
debugging code is also being worked on.
> There are hints of the overall system architecture in some of the
> indivdual comments for data structures, but even reading all of those,
> there isn't quite enough for people to figure out what it is; and that
> may be causing some of these comments of people saying there's too
> much code to evaluate, or why didn't you do it *this* way?
Some of that can probably be added, sure. Though to be fair, it'll add
even more lines to the patch and those links have been posted 4 times
already. They're even posted at the start of _this_ thread. Having it
in the patch under Documentation/ is a good idea, but you can't force
people to read that before they comment on things.
josh
> On Tue, 2007-03-20 at 09:52 -0400, Theodore Tso wrote:
>> As a suggestion, let's stop right here and see if we can get both
>> sides talking in a more constructive fashion. Maybe it's just me, but
>> I see both sides talking past each other in a rather dramatic fashion.
>
> Perhaps, yes. Though I've been trying to be open to Matt's suggestions.
> Please don't mistake confusion for hostility.
>
>> There are a number of red herrings that have been introduced in this
>> discussion; of *course* the existing block device layer can handle
>> FLASH devices; Matt is proposing that they be extended. And of
>
> Sure. But the larger question is *should* it be extended to do so.
>
>> *course* you wouldn't propose to use ext2 on top of a 128k blocksize,
>> any more than you would force a flash filesystem to use a 4k or 512
>> byte blocksize; there are plenty of configurations that won't make
>
> Except that flash filesystems don't use block devices at all. They use
> MTD interfaces.
>
>> sense, and by itself this isn't an indictment of the core idea that
>> the block device layer and dm should be augmented to encompass flash
>> functionality.
>
> This is where the concept starts to lose me. Augmented how? To not use
> MTD at all (obviously with the exception of the low-level flash
> drivers)? How is that not going to duplicate MTD? Etc, etc.
What Matt and Ted are looking at is the question 'are flash devices close enough
to other block devices that it would make sense to change the existing linux
definition of a block device to handle the special requirements of flash?'
if the block device layer can be reasonably modified to accommodate flash, then
doing so greatly improves flexibility and maintainability. It would also reduce
the overall code size, since existing features of the block layer (for example
snapshots) would not need to be duplicated or re-written for a flash block
layer.
if not then so be it.
everyone understands that flash has different requirements from a hard drive as
a block device. what isn't clear to the people reading this thread (and
reviewing the code) is why you believe that it is _so_ different that it's
impossible to consider extending the linux definition of a block device.
the fact that the native eraseblock size is significantly larger isn't a factor.
the fact that you erase in large blocks and then write in smaller blocks is a
difference, and one that the current block layer doesn't understand. but this is
a difference that the current block layer could be changed to understand. it's
not something that would justify a separate-but-equal block layer for flash
devices.
as Ted notes, the idea that block sizes may not be powers of 2 (128k-128b from
his e-mail) _may_ end up being a big enough difference that it's not worth
teaching the existing block layer how to deal with, but it's not clear why you
are using this odd size.
this is why you are being asked for further explanations.
David Lang
I am _not_ a block device layer expert. But I think it is a silly idea to
abuse it by adding the possibility of reading/writing from/to the middle
of a block. Isn't it obvious that the assumption that a block is the
_minimal_ I/O unit sits _deep_ inside the design?
We also need a few other features as well, like data life-time hints to
help the wear-leveling engine pick an optimal eraseblock. And there are
more features we need to have. Do you want to add all of those to the
block device infrastructure?
Thomas wrote about how one can reuse all the block device goodies, like
LVM, FSes etc. He drew a picture; just scroll back and glance at it. This
makes much more sense.
Guess why we still do not have a decent FTL? Because it is _difficult_.
Now, when we have UBI one can implement FTL much, much easier. It
becomes really possible now. Because UBI already hides many complexities
of flash, and FTL layer should not care about many things. It may
concentrate on FTL problems, for example on a smart garbage collector,
which is also a difficult thing. Also, with UBI, for example, the FTL
layer may store on-flash tables with block mappings, because UBI takes
care of wear-leveling. I mean, the FTL may update those tables as many
times as it wants, without worrying that the corresponding eraseblocks
wear out.
After we have implemented an FTL, we can re-use all the block device
infrastructure - LVM, dm-crypt, ext3 and 4, and so on. This does make
sense. And this is what Thomas's picture shows.
So please, look at UBI as a low-level layer which just hides flash
complexities like wear and bad blocks. It also does write-failure
recovery automatically - this is a very important feature. These are
essentially the things which make our life horrible, and UBI kicks
them out. I am not a newbie in the area and I know how difficult it is
to develop on top of raw flash. Yes, it allows creating volumes, but
this is not the main feature of it. It just comes naturally.
And one note: UBI is flash type independent, so you can use it on top of
NOR/NAND/DataFlash/AG-AND/ECC'd NOR and so on, as long as MTD support
exists. For example, we do not use OOB at all. I write this just because
Matt always used NAND as an example, just for clarification.
> as Ted notes, the idea that block sizes may not be powers of 2 (128k-128b from
> his e-mail) _may_ end up being a big enough difference that it's not worth
> teaching the existing block layer how to deal with, but it's not clear why you
> are using this odd size.
The eraseblock size is a power of 2. We store the erase counter (needed
for wear-levelling) and the logical-to-physical eraseblock mapping in
each eraseblock. Thus, we reduce the usable size.
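As a sanity check, the arithmetic can be sketched in a few lines. The two 64-byte header sizes below are assumptions chosen to reproduce the "128k minus 128 bytes" figure mentioned earlier in the thread, not a statement of the actual on-flash layout:

```python
# Why a logical eraseblock is not a power of two: per-eraseblock metadata
# (erase counter, logical-to-physical mapping) is stored in-band.
# Header sizes below are illustrative assumptions.
PEB_SIZE = 128 * 1024    # physical eraseblock size, a power of two
EC_HEADER_SIZE = 64      # assumed size of the erase-counter header
VID_HEADER_SIZE = 64     # assumed size of the mapping (volume ID) header

def leb_size(peb_size, header_sizes):
    """Usable logical eraseblock size after subtracting in-band metadata."""
    return peb_size - sum(header_sizes)

print(leb_size(PEB_SIZE, [EC_HEADER_SIZE, VID_HEADER_SIZE]))  # 130944
```

That is 128 KiB minus 128 bytes: still a nicely aligned size for flash I/O, but no longer a power of two, which is what makes it awkward for layers that assume power-of-two block sizes.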
We do not want to have any on-flash table, because we end up with a
chicken-and-egg problem: the tables are updated often, so they cannot
sit in fixed eraseblocks. They should constantly change position to
ensure wear-leveling. This is very difficult and less robust.
> this is why you are being asked for further explanations.
Although we do not have shiny documentation, all the _essentials_ are
explained in the existing, not-so-shiny docs, so those really interested
can find them there. I mean, if one does not know much about the area
and does not spend time exploring it, we cannot really help. But anyway,
we will try to write better docs; it is just a question of time.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
I've seen no real proposals about how this could be done, so it's a
purely academic question. But I'm dubious. The block layer is optimised,
perhaps unsurprisingly, for block devices. Making it handle our special
case might be possible, but I don't really think it's likely to fly once
it becomes real code and not just mental self-abuse. Hell, we haven't
even got block _discard_ support merged yet, because it's too esoteric
for people to care about.
The MTD API does need to be re-thought. It's no longer quite so
unthinkable that we'll encounter flash sizes above 4GiB, and the way we
(theoretically) handle asynchronous erases while read and write are
synchronous is a bit icky. I'm not averse to using queues and making it
look a _bit_ like block devices in some respects, but in practice I
don't think it'll be very close at all.
The MTD API is intended to represent the raw capabilities of the
underlying flash devices, with all the bizarre restrictions and features
that the various types of flash chip have. If you want to use it as a
block device, that's what translation layers are for.
--
dwmw2
No. We don't have a decent FTL because it's _pointless_. We've got basic
implementations of FTL, NFTL, INFTL etc. for compatibility with PCMCIA
stuff and DiskOnChip, but the fact remains that pretending to be a
normal block device with atomically-overwritten 512-byte sectors is just
_stupid_. You end up implementing a kind of pseudo-filesystem to do
that, and then on top of that you put a 'normal' filesystem with no real
knowledge about what's underneath. It's crap -- and as we currently have
it, the top level file system doesn't even get to tell the underlying
FTL that a given block can be discarded because it's no longer used. So
during garbage collection the FTL even ends up copying crap around the
medium that's no longer relevant.
This isn't DOS. We don't have to make our storage available through the
restricted interface that INT 13h offers us. We can, and do, do better
than that. And that's why we don't have a decent FTL implementation.
--
dwmw2
Absolutely, and so let's focus on that.
> Except that flash filesystems don't use block devices at all. They use
> MTD interfaces.
Yes, so that would be the first issue. We would need to change the flash
filesystems to use the block interface, and expand the block device
layer to use MTD. Now, maybe Matt is conversant with what would be
involved in doing this, but I will admit to being MTD ignorant. But if
there is a huge impedance mismatch right there, that might be enough to
kill it right there.
> > high-level architecture of what UBI is trying to achieve. What are
> > the interfaces at the top and the bottom of the stack? For example,
> > the fact that UBI exports Logical Erase blocks that are not a
> > power-of-two (possibly 128k minus 128 bytes) means that it certainly
> > might not be a good match for the dm stack. But why is that the case?
> > I can imagine good reasons for it, but a high-level description of the
> > design decisions would be very useful.
>
> Basically because you need to store metadata in each eraseblock (and not
> in OOB). That metadata consumes space reducing the usable storage in
> each eraseblock by an amount and making it no longer a power-of-two.
So this is probably a stupid question, but what drives the design
decision to store the metadata in-band instead of out-of-band (and you
don't have to answer me here; putting it in the overall system
architecture document is just as good, and probably better. :-)
> > It would also help people understand why there are so many "units" in
> > UBI, since hopefully the high-level documentation would explain why
> > they fit together, and perhaps why some of the units weren't folded
> > together. What value do they add as separate components?
>
> Artem is reworking the units per your (and other's) suggestions. The
> debugging code is also being worked on.
As I mentioned to you on IRC, in the future if there are pending
changes in response to reviewer comments, it might be a good idea to
mention that, so that reviewers know not to make those comments again,
or worry that the comments have been ignored.
> > There are hints of the overall system architecture in some of the
> > indivdual comments for data structures, but even reading all of those,
> > there isn't quite enough for people to figure out what it is; and that
> > may be causing some of these comments of people saying there's too
> > much code to evaluate, or why didn't you do it *this* way?
>
> Some of that can probably be added, sure. Though to be fair, it'll add
> even more lines to the patch and those links have been posted 4 times
> already. They're even posted at the start of _this_ thread. Having it
> in the patch under Documentation/ is a good idea, but you can't force
> people to read that before they comment on things.
Well, having spent some time looking at the FAQs and all of the
kernel doc comments embedded in the header files and source files,
there are sections that I would move to an overall system architecture
document, but there is still a lot that was missing that makes it
hard to review the patches. I'm sure a lot of it is my own ignorance,
but that's probably one of the challenges with the UBI layer; far
more people have a basic background in, say, scheduling or VM or
filesystems than have a basic background in flash devices.
Regards,
- Ted
Because
a. Many flashes have no out-of-band area; we want to support them as well.
b. Modern MLC NAND flashes use the _whole_ OOB for ECC, and this is the
modern trend.
I will update the FAQ and add this there later.
> As I mentioned to you in IRC, in the future if there is pending
> changes in response to reviewer comments, it might be a good idea to
> mention that, so that reviewers know not make those comments again, or
> worry that the comments had been ignored.
Ted, I wrote to you twice that your point was understood and this would
be fixed. You should not think your comments are ignored, because they
are not.
> Well, having spent some time looking at the FAQ's and all of the
> comments kernel docs embedded in the header files and source files,
> there are sections that I would move to an overall system architecture
> documentation, but there is still a lot that was missing that makes it
> hard to review the patches. I'm sure a lot of it is my own ignorance,
> but that's probably one of the challenges with the UBI layer; not as
> more people have a basic background in say scheduling or VM or
> filesystem than there are people who have a basic background in flash
> devices.
Docs and FAQ will be improved, this is a question of time.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
While I agree with you, I still think a decent FTL (a) makes sense and
(b) is difficult.
a. Some people may be satisfied with an FTL and enjoy all the block
device-related software, which is a huge benefit, although it costs you
performance. Yes, an FTL moves garbage around, but who cares, as long as
the performance fits the system requirements.
b. It is certainly not easy.
But anyway, I agree with what you say, although you seem to be too
assertive.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Ok, now we have reached the absurd. UBI quite fundamentally cannot do
wear leveling as well as LogFS can, simply because UBI has zero
knowledge of the _contents_ of its blocks. Knowing whether a block is
90% garbage or not makes a great difference.
Also LogFS currently requires erasesizes of 2^n.
Thomas, I can give you my opinion on this flamewar in private - after
you have cooled off.
Jörn
--
When I am working on a problem I never think about beauty. I think
only how to solve the problem. But when I have finished, if the
solution is not beautiful, I know it is wrong.
-- R. Buckminster Fuller
Last time I talked to you about that, you said it would be possible and
fixable. We talked about several mechanisms, which would allow a
filesystem or other users to hint such things to UBI.
Even if the LogFS wear levelling is so superior, it CAN'T do
across-device wear levelling.
tglx
Note the word "currently". And yes, we did talk about hints. Back then
I still believed in UBI. That has changed and I would like to spare
myself another flamewar, so please leave it at that.
> Even if the LogFS wear levelling is so superior, it CAN'T do across
> device wear levelling.
Correct. And I don't see any problem with this. I see two classes of
usecases for flash, with some amount of overlap in between.
1. Small amounts of flash.
Here the flash contains a large ratio of read-only data. Bootloader,
kernel, etc. Having wear levelling across the device will gain you
something. This is what you designed UBI for.
2. Large amounts of flash.
Just to be precise, large can go well into the Terabyte range and
beyond. I don't mean large as in "the biggest embedded device I worked
on last year" - that is still small.
Even if such flashes still contain a bootloader and a kernel, that will
occupy less than 1% of the device. Wear leveling across the device is
fairly pointless here. This is what I designed LogFS for.
There is some middle ground where a combination of UBI and LogFS may
make sense. LogFS can still make sense for devices as small as 64MiB.
But I'm not too concerned about that because flashes will continue to
grow and the advantages of cross-device wear leveling will continue to
diminish.
Jörn
--
"Security vulnerabilities are here to stay."
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001
Exactly. Although it is true that it cannot be _as good_ as the FS, one
can optimize this by asking the FS beforehand and make it quite OK. And
eraseblock movement is not such a frequent event - we do it only once the
erase counter difference is more than 4Ki (although this is tunable).
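That trigger can be sketched as follows; the function name and the choice of source/target blocks are simplified illustrations, not the actual UBI algorithm:

```python
WL_THRESHOLD = 4096  # the "4Ki" from the mail; tunable in the real code

def wear_levelling_candidates(erase_counts):
    """Return (source, target) eraseblock indices when the erase-counter
    spread exceeds the threshold, else None. Simplified sketch: move the
    contents of the least-worn block into the most-worn one, so the
    little-erased block becomes available for new writes."""
    lo = min(range(len(erase_counts)), key=erase_counts.__getitem__)
    hi = max(range(len(erase_counts)), key=erase_counts.__getitem__)
    if erase_counts[hi] - erase_counts[lo] > WL_THRESHOLD:
        return lo, hi
    return None

print(wear_levelling_candidates([12, 4500, 9000]))  # (0, 2)
print(wear_levelling_candidates([12, 40, 100]))     # None
```

The point of the threshold is exactly what the mail says: moving an eraseblock is not free, so it happens only when the counters have drifted far apart, not on every write.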
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Still you need to have a solution for handling bitflips in those
bootloader and kernel areas.
I don't dispute that, on a Terabyte solid state disk which is used in a
totally different way, UBI is not necessarily the right tool.
> There is some middle ground where a combination of UBI and LogFS may
> make sense. LogFS can still make sense for devices as small as 64MiB.
> But I'm not too concerned about that because flashes will continue to
> grow and the advantages of cross-device wear leveling will continue to
> diminish.
Flashes will grow, but this will not change the embedded use case with a
relatively small flash and the bootloader / kernel / rootfs / datafs
scenario, where UBI is the right tool to use.
There is no hammer for all nails and I don't see device mapper doing
what UBI does right now.
tglx
Correct. It may make sense to use UBI for that, I don't know. What I
do know is that UBI cannot make wear leveling decisions as well as
LogFS.
And that is all I care about wrt. this discussion.
Jörn
--
Joern's library part 8:
http://citeseer.ist.psu.edu/plank97tutorial.html
So let's discuss this in a different thread when you post a request to
include LogFS into mainline.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
I really did not want to become involved in this. So please be nice and
leave the flamethrower in your weapon closet or I will disappear again
before you can say "fire".
On Tue, 20 March 2007 21:32:40 +0000, David Woodhouse wrote:
> On Tue, 2007-03-20 at 10:58 -0800, David Lang wrote:
> > What Matt and Ted are looking at is the question 'are flash devices close enough
> > to other block devices that it would make sense to change the existing linux
> > definition of a block device to handle the special requirements of flash'
>
> I've seen no real proposals about how this could be done, so it's a
> purely academic question.
What you have seen and shot down were patches to make mtd more generic.
So let me just assume both mtd and jffs2 were generic, even though they
currently aren't.
In very broad terms, an mtd is a device with:
1. a read operation
2. a write operation
3. an erase operation
4. a minimal write blocksize
5. a minimal erase blocksize
6. a method to query bad eraseblocks
7. a method to mark bad eraseblocks
Anything else? There are many more fields, but I believe these are the
essentials. point() and unpoint() were omitted, because they are just
one option to provide XIP; filemap_xip.c is another, used for block
devices.
In very broad terms, a block device has:
1. a read operation
2. a write operation
3. some devices have an ioctl() for erase, but that is uncommon
4. a blocksize
What is missing? Obviously the erase operation needs to become a
first-class citizen and block devices need two fields for the two
meaningful blocksizes. And they need methods to query and set bad
blocks.
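The two summaries above can be put side by side in a short sketch; the class names, operation names, and sizes are illustrative Python, not the actual kernel structures:

```python
# Sketch of the contrast: what an mtd exposes versus what the block layer
# assumes. Names and sizes are illustrative, not kernel definitions.
from dataclasses import dataclass, field
from typing import Set

@dataclass
class BlockDeviceOps:
    # the block layer's model: one blocksize; erase is at best an
    # uncommon ioctl, not a first-class operation
    blocksize: int = 512
    ops: Set[str] = field(default_factory=lambda: {"read", "write"})

@dataclass
class MTDOps:
    # what raw flash additionally needs, per the list above
    min_write_size: int = 2048        # e.g. a NAND page
    min_erase_size: int = 128 * 1024  # e.g. a NAND eraseblock
    ops: Set[str] = field(default_factory=lambda: {
        "read", "write", "erase", "is_bad", "mark_bad"})

# the delta the block layer would have to learn:
print(sorted(MTDOps().ops - BlockDeviceOps().ops))  # ['erase', 'is_bad', 'mark_bad']
```

The delta is small when written down like this, which is Jörn's point; the open question is whether teaching the block layer those three operations and the second blocksize is worth the churn.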
So far it looks simple enough. Obviously there are many messy details
left out, so it will be a lot of work in practice. So the question is:
is it worth it?
What are the gains from combining mtd and block devices?
[ And at this point I would like to state again that I don't want to
become involved in the UBI discussion. The question whether two
separate subsystems make sense is quite independent, and I don't want
both discussions to get mixed up. ]
Jörn
--
He who knows others is wise.
He who knows himself is enlightened.
-- Lao Tsu
Artem, no need to be defensive. You did tell me that you were going
to address them; but then you resubmitted patches where they weren't
addressed. Normally, patch authors take all of the comments, clear them
all, and then in the next repost of the patch either explain why it
wasn't feasible to handle some of the comments, *OR* why some of the
comments were so hard to handle that they wouldn't be handled until a
future version of the patch. Furthermore, in a patch of the size that
you are submitting, a listing of what you *did* fix would also be
good.
And at this point, I don't doubt that you are at some point going to
heed my comments --- but note that doing so will involve a massive
refactoring of the code, which will tend to invalidate the reviews
done of this current (take 3) version of the patches; so I am a bit
curious what your motivation was in reposting this round of the
patches.
Believe it or not, the people who are responding on this thread are
trying to help. Otherwise they would just be ignoring you and UBI.
Keeping this thought in mind and trying to help them, where in some
cases perhaps they are lacking the knowledge and experience of those
who have been working on UBI and have spent many months thinking about
the problem, may help keep things more constructive.
Regards,
- Ted
Yes. However, nobody has actually reviewed any of the _code_ in this
round. So while it may have been somewhat superfluous from a submission
standpoint, at least no code review has been wasted and we are getting
some fairly decent design discussions.
josh
Well, in take 3 I _already_ removed quite _a lot_ of itsy-bitsy units,
and I thought it was enough, so I submitted it. But later I realized
that I should go further (e.g., dispose of per-unit data structures),
and started more re-work. Yes, I should have given notice about this
new re-work; apologies.
Point taken, will be fixed :-)
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
On Wed, 2007-03-21 at 09:50 -0400, Theodore Tso wrote:
> Keeping this thought in mind and trying to help them, where in some
> cases perhaps they are lacking in the same knowledge and experience
> and those who have been working on UBI and have spent many months
> thinking about the problem, may help keep things more constructive.
since I am one of those who are working on UBI together with Artem, I
would like to try to describe how we are using UBI and how it solves
some problems we could not solve to our satisfaction with existing
solutions.
Our system is using only NAND flashes, and we use a NAND flash controller
to make the 1st stage boot-code (we call it IPL - initial program loader)
appear contiguous in memory. The size of the IPL is limited to 4 KiB.
In these 4 KiB of code we initialize the processor and scan the NAND
flash for copies of our 2nd stage boot-loader (SPL), which is stored in
one static UBI volume, or even redundantly in multiple ones. If the SPL
has one-bit flips, we can correct them in the IPL before loading it into
RAM.
We want to correct bit-flips as soon as possible to avoid getting
uncorrectable errors. UBI is doing that transparently for us by copying
the block with the bitflip to a free block. The logic in the IPL is able
to cope with the situation where UBI is interrupted when doing that.
Imagine you have boot-code at a fixed location (skipping bad blocks, of
course): if you erase a block of the SPL to write it back to the same
block (to remove bit-errors), you will not be robust against power-loss
anymore! You could put a 2nd temporary SPL there before you do this,
but that is not nice either.
The static UBI volumes are so simply structured that it is possible to
load them into RAM using only a few KiB of code. We do not even look at
UBI's volume information table to be able to do this.
UBI solves here:
1. the possibility to boot a controller with limited resources using NAND
2. transparent bitflip correction on read-only data, e.g. boot-code,
kernel, initrd. Note that the mechanisms here are robust against
power-loss; that is also very important to us
We wanted to use JFFS2 and found that the traditional update mechanisms
did not ensure that an interrupted update attempt can be detected as
such. The UBI volume update ensures that the volume is only usable
after it was updated completely.
3. Update mechanism which ensures that incomplete data cannot be used
We found that putting certain flash content at fixed locations with
fixed sizes is especially cumbersome if raw NAND is used, e.g. if you
consider that bad blocks have to be skipped. Resizing partitions is a
pain.
UBI helps us to get rid of those limitations. We can resize the UBI
volumes and because UBI takes care to find the volume data even our
second stage boot-code can be located anywhere on the chip.
4. Volume resizing is easy
Because we want to ensure maximum lifetime for our systems, we want
bitflips to be corrected immediately when they are found. Feature 2 of
UBI does this for us.
I think that the largest portion of what we put in our NAND flashes is
code and data that is basically read-only. Nevertheless, data is written
during operation and, as already pointed out, maximum lifetime is
important for us, and wear-leveling across the whole flash chip helps
especially with our usage pattern. UBI's ability to copy blocks
transparently, e.g. a read-only block with a small erase count to a free
block with a relatively high erase count, helps to get this done.
5. wear-leveling across the whole flash chip
We found that being able to use the same code update mechanisms for
NAND/NOR/? based systems is a nice side effect too. That was one reason,
beside others (see previous mails), to put the UBI metadata into the
data section of the flashes and therefore sacrifice some space for data,
and of course the usable size of a block is not 2^n anymore. I think it
was a good decision, because if we had put it in the NAND OOB area, the
discussion here might be limited to NAND users only.
Regards,
Frank Haverkamp
several of these things sound like they would be useful to other block devices
as well
> UBI solves here:
> 1. possibility to boot a controller with limited resources using NAND
> 2. transparent bitflip correction on read only data e.g. boot-code,
> kernel, initrd. Note that the mechanisms here are robust against
> power-loss, that is also very important to us
>
> We wanted to use JFFS2 and found that the traditional update mechanisms
> did not ensure that an interrupted update attempt can be detected as
> such. The UBI volume update ensures, that the volume is only usable
> after it was updated completely.
a dm layer that detects and remaps soft errors before they become hard errors is
useful for hard drives as well.
> 3. Update mechanism which ensures that incomplete data cannot be used
>
> We found that putting certain flash content at fixed locations with
> fixed size is especially cumbersome if raw NAND is used e.g. if you
> consider that bad blocks have to be skipped. Resizing partitions is a
> pain.
>
> UBI helps us to get rid of those limitations. We can resize the UBI
> volumes and because UBI takes care to find the volume data even our
> second stage boot-code can be located anywhere on the chip.
>
> 4. Volume resizing is easy
>
> Because we want to ensure that we gain maximum lifetime for our systems,
> we want that bitflips are corrected immediately when they are found.
> Feature 2. of UBI does this for us.
>
> I think that the largest portion of what we put in our NAND flashes is
> code and data and basically read-only. Nevertheless data is written
> during operation and, as already pointed out, maximum lifetime is
> important for us, and wear-leveling across the whole flash chip helps
> especially with our usage-pattern. UBI's ability to copy blocks
> transparently e.g. a read-only block with small erase count to a free
> block with relatively high erase count, helps to get this done.
both of these also sound useful as dm layers (in fact lvm already does some of
the resizing things)
> 5. wear-leveling across the whole flash chip
>
> We found that being able to use the same code update mechanisms for
> NAND/NOR/? based systems is a nice side effect too. That was one reason
> beside others (see previous mails) to put the UBI metadata into the data
> section of the flashes and therefore sacrifice some space for data and
> of course that the usable size of a block is not 2^n anymore. I think it
> was a good decision, because if we had put it in the NAND OOB area,
> the discussion here might be limited to NAND users only.
Wear leveling would also be useful on other block devices (think CD-RAM, for
example).
Cross-device wear leveling sounds a lot like putting the wear-leveling layer
above an lvm-like layer that stitches the separate flash chips together into
one logical device.
Additionally, if wear leveling is an optional layer in dm then it can be left
out when it's not appropriate (like when the FS has features that make it
unnecessary, or when it's read-only so you don't have writes to worry about, or
even if it's read-only 99.9999% of the time, so that writes are so rare that
all the writes in the expected lifetime of the device won't cause problems).
David Lang
This patch-set contains UBI, which stands for Unsorted Block Images. This
is closely related to the memory technology devices Linux subsystem (MTD),
so this new piece of software is from drivers/mtd/ubi.
In short, UBI provides wear-levelling support across the whole flash chip.
UBI completely hides 2 aspects of flash chips which make them very difficult to
work with:
1. wear of eraseblocks;
2. bad eraseblocks.
UBI also makes it possible to dynamically create, delete and re-size flash
partitions (UBI volumes), so some analogy to LVM may be drawn here.
There is some documentation available at:
http://www.linux-mtd.infradead.org/doc/ubi.html
http://www.linux-mtd.infradead.org/faq/ubi.html
The sources are available via the GIT tree:
git://git.infradead.org/ubi-2.6.git (stable)
git://git.infradead.org/~dedekind/dedekind-ubi-2.6.git (devel)
One can also browse the GIT trees at http://git.infradead.org/
This is the 4th iteration of the post, which fixes most of the issues
pointed out previously:
- Removed "itsy-bitsy" units
- Removed a lot of debugging stuff
- Fixed kernel-doc
- Fixed inline damage in eba.c
MAINTAINERS | 8
drivers/mtd/Kconfig | 2
drivers/mtd/Makefile | 2
drivers/mtd/ubi/Kconfig | 58 +
drivers/mtd/ubi/Kconfig.debug | 104 ++
drivers/mtd/ubi/Makefile | 7
drivers/mtd/ubi/build.c | 843 ++++++++++++++++++++
drivers/mtd/ubi/cdev.c | 724 +++++++++++++++++
drivers/mtd/ubi/debug.c | 224 +++++
drivers/mtd/ubi/debug.h | 161 +++
drivers/mtd/ubi/eba.c | 1132 ++++++++++++++++++++++++++++
drivers/mtd/ubi/gluebi.c | 337 ++++++++
drivers/mtd/ubi/io.c | 1263 +++++++++++++++++++++++++++++++
drivers/mtd/ubi/kapi.c | 545 +++++++++++++
drivers/mtd/ubi/misc.c | 105 ++
drivers/mtd/ubi/scan.c | 1374 +++++++++++++++++++++++++++++++++
drivers/mtd/ubi/scan.h | 167 ++++
drivers/mtd/ubi/ubi.h | 560 +++++++++++++
drivers/mtd/ubi/upd.c | 348 ++++++++
drivers/mtd/ubi/vmt.c | 811 ++++++++++++++++++++
drivers/mtd/ubi/vtbl.c | 809 ++++++++++++++++++++
drivers/mtd/ubi/wl.c | 1698 ++++++++++++++++++++++++++++++++++++++++++
fs/jffs2/fs.c | 12
fs/jffs2/os-linux.h | 6
fs/jffs2/wbuf.c | 24
include/linux/mtd/ubi.h | 191 ++++
include/mtd/Kbuild | 2
include/mtd/mtd-abi.h | 1
include/mtd/ubi-header.h | 360 ++++++++
include/mtd/ubi-user.h | 161 +++
30 files changed, 12039 insertions(+)
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Actually, no. LogFS is not broken, there is nothing to fix.
And there is no fundamental reason why UBI should export blocks with
non-power-of-two sizes. UBI currently consists of two parts that are
intimately intertwined in the current implementation, but have
relatively little connection otherwise.
1. Logical volume management.
2. Static volumes.
Logical volume management can just as easily move its management
information into a table, instead of having it spread across all blocks.
Blocks can keep their original size. Since you have to scan flash
anyway, you can also scan for a table, compare a magical number and do
some extra check to protect yourself against a UBI image inside some
logical volume. No big deal.
Static volumes can keep a header inside their volumes. The tiny
first-stage bootloader is currently scanning flash and can continue to
do so. But at least this header no longer causes trouble for LogFS or
any other UBI user.
UBI is just as broken as LogFS is. It works with every user in mainline
(which comes down to JFFS2). LogFS works with every MTD device in
mainline. The only combination that doesn't work is LogFS on UBI - due
to deliberate design decisions on both sides.
Jörn
--
Joern's library part 8:
http://citeseer.ist.psu.edu/plank97tutorial.html
> On Wed, 21 March 2007 12:25:34 +0100, Thomas Gleixner wrote:
>> On Wed, 2007-03-21 at 12:05 +0100, Jörn Engel wrote:
>>>
>>> Also LogFS currently requires erasesizes of 2^n.
>>
>> Last time I talked to you about that, you said it would be possible and
>> fixable.
>
> Actually, no. LogFS is not broken, there is nothing to fix.
>
> And there is no fundamental reason why UBI should export blocks with
> non-power-of-two sizes. UBI currently consists of two parts that are
> intimately intertwined in the current implementation, but have
> relatively little connection otherwise.
>
> 1. Logical volume management.
> 2. Static volumes.
>
> Logical volume management can just as easily move its management
> information into a table, instead of having it spread across all blocks.
> Blocks can keep their original size. Since you have to scan flash
> anyway, you can also scan for a table, compare a magical number and do
> some extra check to protect yourself against a UBI image inside some
> logical volume. No big deal.
If you are being paranoid about write cycles, putting the write count in the
block you are writing avoids doing an erase/write elsewhere.
Although, since you can flip bits to 1 without requiring an erase, you could
sacrifice some space and say that your table has a normal counter for the
number of times the block has been erased, plus a 'tally counter' where you
turn one bit on each time you erase the block, and when you fill up the tally
counter you re-write the entire table, clearing all the tallies. If you have
relatively large eraseblocks it seems like you could afford to sacrifice the
space in your master table to avoid erases of it.
IIRC someone said that the count per block was 128 bits? If so you could have a
master table with a 32-bit integer (about 4 billion erases) + 96 bits of tally,
so that you would only have to re-write the table when a block on the flash has
been erased 96 times. With this approach the wear on your table is unlikely to
be a factor until the point where you are losing a lot of other eraseblocks due
to wear (at which point you could shift to a different block for your table, or
retire the flash).
David Lang
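[Editorial note: the tally-counter scheme described above can be sketched in a
few lines of C. This is a hypothetical illustration of the idea only, not UBI
code; the `struct erase_entry` layout, the 96-bit tally width and all names are
invented for the example. Following Jörn's correction below, the tally starts
in the erased all-1s state and each erase programs one more bit to 0, which
flash permits without erasing the table block; the full count is the 32-bit
base plus the number of programmed tally bits, and only a full tally forces the
expensive rewrite of the table.]

```c
#include <stdint.h>

#define TALLY_WORDS 3              /* 3 x 32 bits = 96 tally bits */

/* One master-table entry: hypothetical layout, not UBI's on-flash format. */
struct erase_entry {
    uint32_t base;                 /* rewritten only when the tally is full */
    uint32_t tally[TALLY_WORDS];   /* starts all-1s (erased flash state)   */
};

void entry_init(struct erase_entry *e)
{
    e->base = 0;
    for (int i = 0; i < TALLY_WORDS; i++)
        e->tally[i] = 0xFFFFFFFFu;
}

/* Count tally bits already programmed to 0. */
static uint32_t tally_count(const struct erase_entry *e)
{
    uint32_t n = 0;
    for (int i = 0; i < TALLY_WORDS; i++) {
        uint32_t w = ~e->tally[i];
        while (w) {                /* Kernighan popcount */
            w &= w - 1;
            n++;
        }
    }
    return n;
}

uint32_t erase_count(const struct erase_entry *e)
{
    return e->base + tally_count(e);
}

/*
 * Record one erase of the tracked eraseblock. Programming a 1 bit to 0
 * needs no erase of the table block; only a full tally (96 recorded
 * erases) forces the whole table to be rewritten.
 */
void record_erase(struct erase_entry *e)
{
    for (int i = 0; i < TALLY_WORDS; i++) {
        if (e->tally[i]) {
            e->tally[i] &= e->tally[i] - 1;   /* clear lowest set bit */
            return;
        }
    }
    /* tally full: fold it into the base and reset (the table rewrite) */
    e->base += TALLY_WORDS * 32;
    for (int i = 0; i < TALLY_WORDS; i++)
        e->tally[i] = 0xFFFFFFFFu;
    e->tally[0] &= e->tally[0] - 1;           /* record this erase */
}
```

Reading the count is cheap, and the table block is rewritten only once per 96
erases of a given eraseblock, matching the estimate above.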
[ This was not a request for UBI to be changed. The only purpose was to
illustrate that LogFS is not broken. The previous thread suggested
otherwise and I just couldn't leave it at that. ]
> If you are being paranoid about write cycles, putting the write count in the
> block you are writing avoids doing an erase/write elsewhere.
>
> Although, since you can flip bits to 1 without requiring an erase, you
[ vice versa. you can flip bits to 0 without erasing. ]
> could sacrifice some space and say that your table has a normal counter for
> the number of times the block has been erased, plus a 'tally counter' where
> you turn one bit on each time you erase the block, and when you fill up the
> tally counter you re-write the entire table, clearing all the tallies. If you
> have relatively large eraseblocks it seems like you could afford to
> sacrifice the space in your master table to avoid erases of it
Or you could have a table and any number of updates to it. Erase one
block, append a small update marker to the table. There are plenty of
options. All have in common that code would be more complicated.
Another advantage is that erase counts don't get reset if the race
against a power failure during erase is lost.
Whether the advantages of power-of-two blocksizes and safe erase counts
are worth it, I leave for others to decide.
Jörn
--
Fools ignore complexity. Pragmatists suffer it.
Some can avoid it. Geniuses remove it.
-- Perlis's Programming Proverb #58, SIGPLAN Notices, Sept. 1982
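[Editorial note: the alternative mentioned above, "a table and any number of
updates to it", can be sketched similarly. Again a hypothetical illustration
with invented names and sizes, not UBI or LogFS code: the table block carries
per-eraseblock base counts plus an append-only log of update markers, and
appending a marker only writes into still-erased space, so the table block
itself is erased only when the log fills up and is compacted.]

```c
#include <stdint.h>
#include <string.h>

#define NBLOCKS   64               /* eraseblocks tracked (example size)  */
#define LOG_SLOTS 128              /* update markers per table generation */
#define SLOT_FREE 0xFFFF           /* erased flash reads back as all-1s   */

/* Hypothetical table-block layout: base counts plus an append-only log. */
struct erase_table {
    uint32_t base[NBLOCKS];        /* counts as of the last compaction */
    uint16_t log[LOG_SLOTS];       /* one block-index marker per erase */
};

void table_init(struct erase_table *t)
{
    memset(t->base, 0, sizeof(t->base));
    memset(t->log, 0xFF, sizeof(t->log));   /* all slots free */
}

/* Fold the log into the base counts and clear it (the expensive erase). */
static void table_compact(struct erase_table *t)
{
    for (int i = 0; i < LOG_SLOTS && t->log[i] != SLOT_FREE; i++)
        t->base[t->log[i]]++;
    memset(t->log, 0xFF, sizeof(t->log));
}

/* Record one erase of `block` by appending a marker; no table erase
 * is needed unless the log is full. */
void record_erase(struct erase_table *t, uint16_t block)
{
    for (int i = 0; i < LOG_SLOTS; i++) {
        if (t->log[i] == SLOT_FREE) {
            t->log[i] = block;      /* append into erased space */
            return;
        }
    }
    table_compact(t);               /* log full: compact, then append */
    t->log[0] = block;
}

uint32_t erase_count(const struct erase_table *t, uint16_t block)
{
    uint32_t n = t->base[block];
    for (int i = 0; i < LOG_SLOTS && t->log[i] != SLOT_FREE; i++)
        if (t->log[i] == block)
            n++;
    return n;
}
```

Unlike the tally scheme, one log can absorb erases of any mix of blocks before
a compaction is needed, at the cost of scanning the log on every count lookup.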
And on NAND flash you can't just do it in multiple cycles one bit at a
time. The 'tally' trick isn't viable there.
--
dwmw2
You can on NAND. ECC is done in software. And for a data structure as
simple as the 'tally', foregoing ECC is not a huge problem - most
bitflips are easily detected, and the remaining ones only cause an
off-by-a-few error on the erase count.
On NOR with transparent (hardware) ECC you can't.
Jörn
--
Homo Sapiens is a goal, not a description.
-- unknown
You're only allowed a limited number of write cycles to each page
though. So you can't just clear the bits in a 2112-byte page one at a
time; typically when you clear the fifth bit, the contents of the whole
page become undefined until the next erase cycle.
--
dwmw2
That limitation stems from ECC and ECC is done in software. Currently
everyone and his dog is doing ECC in chunks of 256 bytes on NAND. So
your minimum write size is 256 bytes _if you care about ECC_. If you
don't care, you can write single bits on NAND, just as you can on NOR.
Controlling ECC in software means we are quite flexible. Given
sufficient incentive, we can change the rules quite significantly.
Jörn
--
You can't tell where a program is going to spend its time. Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you've proven that's where the bottleneck is.
-- Rob Pike
No, on NAND flash it's a limitation of the hardware. The number of write
cycles you can perform to a given page is limited. Exceed it and the
contents of that page become undefined due to leakage, until you next
erase it.
--
dwmw2
Are you sure? Do you have any specs or similar that state this?
So far I have only encountered this limitation by word of mouth. And
such a myth coming from ECC effects is nothing that would surprise me.
Jörn
--
The cheapest, fastest and most reliable components of a computer
system are those that aren't there.
-- Gordon Bell, DEC laboratories
Right and you cannot write to random locations in a page. The write
chunks have to be in consecutive order. If you write 0xAA to offset 0,
you cannot rewrite it to 0x00 later without risking corruption.
tglx
See pp 6 and 31 of http://david.woodhou.se/TC58DVAM72AF1FT_030124.pdf
for example.
--
dwmw2
False. There is.
> UBI currently consists of two parts that are
> intimately intertwined in the current implementation, but have
> relatively little connection otherwise.
False. They do have connection.
> 1. Logical volume management.
> 2. Static volumes.
>
> Logical volume management can just as easily move its management
> information into a table, instead of having it spread across all blocks.
> Blocks can keep their original size. Since you have to scan flash
> anyway, you can also scan for a table, compare a magical number and do
> some extra check to protect yourself against a UBI image inside some
> logical volume. No big deal.
First off, I have seen these "no big deal" statements for years already, and no
decent implementation proven by real-world usage. Could we please
move these academic discussions to another thread?
Second, it is much more robust to keep the erase counter and mapping
information on a per-eraseblock basis than to keep any on-flash table -
you may always scan the whole media and gracefully recover from errors and
corruptions. And you do not lose much in case of corruption.
Third, it is much simpler than keeping any on-flash table, and it is thus
robust. We do not need a journal to update any table.
Fourth, if needed, an on-flash table may be _added_ to increase scalability,
so "since you have to scan flash anyway" may become false when there is
a real need for better scalability. For now scanning is OK. And still, the
scanning method will be a good fall-back way to recover from errors.
> UBI is just as broken as LogFS is. It works with every user in mainline
> (which comes down to JFFS2). LogFS works with every MTD device in
> mainline. The only combination that doesn't work is LogFS on UBI - due
> to deliberate design decisions on both sides.
You are welcome to discuss other things irrelevant to this thread.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
You could wait a day, then reread what I wrote. Maybe you will notice
that what I wrote is not identical to what we have discussed about a
year ago and you seem to have read.
You may also want to reread this:
||[ This was not a request for UBI to be changed. The only purpose was to
||illustrate that LogFS is not broken. The previous thread suggested
||otherwise and I just couldn't leave it at that. ]
Jörn
--
tglx1 thinks that joern should get a (TM) for "Thinking Is Hard"
-- Thomas Gleixner