This patch-set contains UBI, which stands for Unsorted Block Images. It
is closely related to the memory technology devices (MTD) Linux
subsystem, so this new piece of software lives in drivers/mtd/ubi.
In short, UBI is a kind of LVM layer, but for flash (MTD) devices. It
makes it possible to dynamically create, delete and re-size volumes. But
the analogy is not complete. UBI also takes care of wear-leveling and
bad eraseblock handling, so it completely hides the two aspects of flash
chips which make them very difficult to work with:
1. wear of eraseblocks;
2. bad eraseblocks.
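For readers new to the idea, the mapping UBI maintains can be sketched as a
toy user-space model. The names below (eba_tbl, pick_peb, etc.) are
illustrative, not UBI's real data structures: logical eraseblocks (LEBs) map
to physical eraseblocks (PEBs), and per-PEB erase counters plus a bad flag
are what let the layer hide the two aspects above.

```c
#include <assert.h>
#include <limits.h>

/* Toy model: LEB -> PEB mapping with naive wear-leveling.  Illustrative
 * only; UBI's real structures and algorithms are more involved. */
#define NUM_PEBS 8
#define UNMAPPED (-1)

struct peb {
	int erase_count;	/* how often this PEB has been erased */
	int in_use;		/* currently backing some LEB */
	int bad;		/* marked bad, never used again */
};

static struct peb pebs[NUM_PEBS];
static int eba_tbl[NUM_PEBS];	/* LEB -> PEB, or UNMAPPED */

/* Pick the free, good PEB with the lowest erase count. */
static int pick_peb(void)
{
	int best = UNMAPPED, best_ec = INT_MAX;

	for (int i = 0; i < NUM_PEBS; i++)
		if (!pebs[i].bad && !pebs[i].in_use &&
		    pebs[i].erase_count < best_ec) {
			best = i;
			best_ec = pebs[i].erase_count;
		}
	return best;
}

static int leb_map(int lnum)
{
	int pnum = pick_peb();

	if (pnum == UNMAPPED)
		return UNMAPPED;	/* no good PEB left */
	pebs[pnum].in_use = 1;
	eba_tbl[lnum] = pnum;
	return pnum;
}

static void leb_unmap(int lnum)
{
	int pnum = eba_tbl[lnum];

	pebs[pnum].in_use = 0;
	pebs[pnum].erase_count++;	/* unmapping implies an erase */
	eba_tbl[lnum] = UNMAPPED;
}
```

A real implementation also has to persist the mapping and the erase counters
on the flash itself and rebuild them at attach time, which is where much of
the complexity lives.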
There is some documentation available at:
http://www.linux-mtd.infradead.org/doc/ubi.html
http://www.linux-mtd.infradead.org/faq/ubi.html
The sources are available via the GIT tree:
git://git.infradead.org/ubi-2.6.git (stable)
git://git.infradead.org/~dedekind/dedekind-ubi-2.6.git (devel)
One can also browse the GIT trees at http://git.infradead.org/
This is the third iteration of this patch-set, which fixes most of the
issues pointed out previously.
MAINTAINERS | 8
drivers/mtd/Kconfig | 2
drivers/mtd/Makefile | 2
drivers/mtd/ubi/Kconfig | 60 +
drivers/mtd/ubi/Kconfig.debug | 153 +++
drivers/mtd/ubi/Makefile | 7
drivers/mtd/ubi/account.c | 233 +++++
drivers/mtd/ubi/build.c | 467 +++++++++++
drivers/mtd/ubi/cdev.c | 926 ++++++++++++++++++++++
drivers/mtd/ubi/debug.c | 546 +++++++++++++
drivers/mtd/ubi/debug.h | 146 +++
drivers/mtd/ubi/eba.c | 1735 +++++++++++++++++++++++++++++++++++++++++
drivers/mtd/ubi/gluebi.c | 361 ++++++++
drivers/mtd/ubi/io.c | 1445 ++++++++++++++++++++++++++++++++++
drivers/mtd/ubi/misc.c | 167 +++
drivers/mtd/ubi/scan.c | 1478 +++++++++++++++++++++++++++++++++++
drivers/mtd/ubi/sysfs.c | 408 +++++++++
drivers/mtd/ubi/ubi.h | 867 ++++++++++++++++++++
drivers/mtd/ubi/uif.c | 842 ++++++++++++++++++++
drivers/mtd/ubi/upd.c | 359 ++++++++
drivers/mtd/ubi/vmt.c | 360 ++++++++
drivers/mtd/ubi/vtbl.c | 1387 +++++++++++++++++++++++++++++++++
drivers/mtd/ubi/wl.c | 1761 ++++++++++++++++++++++++++++++++++++++++++
fs/jffs2/fs.c | 12
fs/jffs2/os-linux.h | 6
fs/jffs2/wbuf.c | 24
include/linux/mtd/ubi.h | 196 ++++
include/mtd/Kbuild | 2
include/mtd/mtd-abi.h | 1
include/mtd/ubi-header.h | 371 ++++++++
include/mtd/ubi-user.h | 163 +++
31 files changed, 14495 insertions(+)
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
> On Wed, 14 Mar 2007 17:20:24 +0200 Artem Bityutskiy <dede...@infradead.org> wrote:
>
> ...
>
> +/**
> + * leb_get_ver - get logical eraseblock version.
> + *
> + * @ubi: the UBI device description object
> + * @vol_id: the volume ID
> + * @lnum: the logical eraseblock number
> + *
> + * The logical eraseblock has to be locked. Note, all this leb_ver stuff is
> + * obsolete and will be removed eventually. FIXME: to be removed together with
> + * leb_ver support.
> + */
> +static inline int leb_get_ver(struct ubi_info *ubi, int vol_id, int lnum)
> +{
> + int idx, leb_ver;
> +
> + idx = vol_id2idx(ubi, vol_id);
> +
> + spin_lock(&ubi->eba.eba_tbl_lock);
> + ubi_assert(ubi->eba.eba_tbl[idx].recs);
> + leb_ver = ubi->eba.eba_tbl[idx].recs[lnum].leb_ver;
> + spin_unlock(&ubi->eba.eba_tbl_lock);
> +
> + return leb_ver;
> +}
I very much doubt that the locking in this function (and in the similar
ones here) does anything useful.
> +static unsigned long long next_sqnum(struct ubi_info *ubi)
> +{
> + unsigned long long sqnum;
> +
> + spin_lock(&ubi->eba.eba_tbl_lock);
> + sqnum = ubi->eba.global_sq_cnt++;
> + spin_unlock(&ubi->eba.eba_tbl_lock);
> +
> + return sqnum;
> +}
That one makes sense.
> +static inline void leb_map(struct ubi_info *ubi, int vol_id, int lnum, int pnum)
> +{
> + int idx;
> +
> + idx = vol_id2idx(ubi, vol_id);
> + spin_lock(&ubi->eba.eba_tbl_lock);
> + ubi_assert(ubi->eba.eba_tbl[idx].recs);
> + ubi_assert(ubi->eba.eba_tbl[idx].recs[lnum].pnum < 0);
> + ubi->eba.eba_tbl[idx].recs[lnum].pnum = pnum;
> + spin_unlock(&ubi->eba.eba_tbl_lock);
> +}
I doubt if that one does.
> +/**
> + * leb_unmap - un-map a logical eraseblock.
> + *
> + * @ubi: the UBI device description object
> + * @vol_id: the volume ID
> + * @lnum: the logical eraseblock number to unmap
> + *
> + * This function un-maps a logical eraseblock and increases its version. The
> + * logical eraseblock has to be locked.
> + */
> +static inline void leb_unmap(struct ubi_info *ubi, int vol_id, int lnum)
The patch is full of nutty inlining.
Suggestion: just remove all of it. Then reintroduce inlining in only
those places where a benefit is demonstrable. Reduced code size according to
/bin/size would be a suitable metric.
> +static inline int leb2peb(struct ubi_info *ubi, int vol_id, int lnum)
> +{
> + int idx, pnum;
> +
> + idx = vol_id2idx(ubi, vol_id);
> +
> + spin_lock(&ubi->eba.eba_tbl_lock);
> + ubi_assert(ubi->eba.eba_tbl[idx].recs);
> + pnum = ubi->eba.eba_tbl[idx].recs[lnum].pnum;
> + spin_unlock(&ubi->eba.eba_tbl_lock);
> +
> + return pnum;
> +}
Again, the locking seems pointless.
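The objection can be illustrated with a user-space sketch (a pthread mutex
standing in for the spinlock; the names are illustrative, not the patch's
code): the lock around a single word load buys the caller nothing, while the
lock in next_sqnum() protects a genuine read-modify-write.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t tbl_lock = PTHREAD_MUTEX_INITIALIZER;
static int recs_pnum[16];
static unsigned long long global_sq_cnt;

/* The criticized pattern: lock, read one word, unlock.  A naturally
 * aligned int load is atomic on the relevant architectures anyway, and
 * the value may change the instant the lock is dropped, so the caller
 * gains no guarantee from the locking. */
static int leb2peb(int lnum)
{
	int pnum;

	pthread_mutex_lock(&tbl_lock);
	pnum = recs_pnum[lnum];
	pthread_mutex_unlock(&tbl_lock);
	return pnum;		/* possibly already stale here */
}

/* By contrast, this is a read-modify-write: without the lock, two
 * concurrent callers could read the same counter value and hand out
 * duplicate sequence numbers.  This is the one that makes sense. */
static unsigned long long next_sqnum(void)
{
	unsigned long long sqnum;

	pthread_mutex_lock(&tbl_lock);
	sqnum = global_sq_cnt++;
	pthread_mutex_unlock(&tbl_lock);
	return sqnum;
}
```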
>
> There's way too much code here to expect it to get decently reviewed, alas.
Yes.
/me repeats wish that Not Everything Should Be Sent to lkml. :(
> > On Wed, 14 Mar 2007 17:20:24 +0200 Artem Bityutskiy <dede...@infradead.org> wrote:
> >
> > ...
> >
> > +/**
> > + * leb_get_ver - get logical eraseblock version.
> > + *
> > + * @ubi: the UBI device description object
> > + * @vol_id: the volume ID
> > + * @lnum: the logical eraseblock number
> > + *
> > + * The logical eraseblock has to be locked. Note, all this leb_ver stuff is
> > + * obsolete and will be removed eventually. FIXME: to be removed together with
> > + * leb_ver support.
> > + */
Please use kernel-doc syntax and test it. Using and testing it
are really easy to do. It's just a simple language. Don't make
(even trivial) problems for others to clean up...
Documentation/kernel-doc-nano-HOWTO.txt
Above: no "blank" line between the function name and its parameters.
> > +static inline int leb_get_ver(struct ubi_info *ubi, int vol_id, int lnum)
> > +{
> > + int idx, leb_ver;
> > +
> > + idx = vol_id2idx(ubi, vol_id);
> > +
> > + spin_lock(&ubi->eba.eba_tbl_lock);
> > + ubi_assert(ubi->eba.eba_tbl[idx].recs);
> > + leb_ver = ubi->eba.eba_tbl[idx].recs[lnum].leb_ver;
> > + spin_unlock(&ubi->eba.eba_tbl_lock);
> > +
> > + return leb_ver;
> > +}
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
Just curious, but where would you suggest this be sent to for review then?
josh
> On Thu, Mar 15, 2007 at 02:24:10PM -0700, Randy Dunlap wrote:
> > On Thu, 15 Mar 2007 11:07:03 -0800 Andrew Morton wrote:
> >
> > >
> > > There's way too much code here to expect it to get decently reviewed, alas.
> >
> > Yes.
> >
> > /me repeats wish that Not Everything Should Be Sent to lkml. :(
>
> Just curious, but where would you suggest this be sent to for review then?
Valid question. I should have chosen some other more appropriate
patch to make that comment.
I don't see a better list for UBI patches, so lkml is OK IMO.
Here is a summary of my thinking on Linux-related mailing lists.
1. Bug reports can go to lkml or focused mailing lists.
2. Development (like patches) should go to focused mailing lists
if there is such a list and they have enough usage.
Development areas that qualify for this IMO are:
- ACPI
- ATA
- file systems
- frame buffer
- ieee1394
- MM/VM
- multimedia
- networking
- PCI
- power management, suspend/resume
- SCSI
- sound
- USB
- virtualization
(not that I expect anything close to consensus on this)
---
~Randy
Well, yes, these are integers.
> > +static inline void leb_unmap(struct ubi_info *ubi, int vol_id, int lnum)
>
> The patch is full of nutty inlining.
Yeah, this file has too many of them.
> Suggestion: just remove all of it. Then reintroduce inlining in only
> those places where a benefit is demonstrable. Reduced code size according to
> /bin/size would be a suitable metric.
OK, thanks.
> > + spin_unlock(&ubi->eba.eba_tbl_lock);
> > +
> > + return pnum;
> > +}
>
> Again, the locking seems pointless.
Thanks for comments, will be fixed.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Please raise this question in a separate thread and discuss it with the
subsystem maintainers. I was directed here by the MTD maintainer.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
If you make a statement like this, please provide reasons why you say it
about these patches and suggest something _constructive_. I do not see any
point in such a vague phrase.
And please, note, I was directed here by David Woodhouse who is MTD
maintainer because he thinks the patch is large and needs more people to
look at it, not just him.
> Documentation/kernel-doc-nano-HOWTO.txt
>
> Above: no "blank" line between the function name and its parameters.
OK, I'll look at this again, thanks.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
> On Thu, 2007-03-15 at 14:24 -0700, Randy Dunlap wrote:
> > /me repeats wish that Not Everything Should Be Sent to lkml. :(
>
> If you make a statement like this, please provide reasons why you say it
> about these patches and suggest something _constructive_. I do not see any
> point in such a vague phrase.
so do you believe that Everything (that is kernel-related) should
be sent to lkml?
> And please, note, I was directed here by David Woodhouse who is MTD
> maintainer because he thinks the patch is large and needs more people to
> look at it, not just him.
As I wrote to Mr. Boyer, that makes sense in this case.
---
~Randy
Forgive my ignorance, but why did you not implement the two features
above as device mapper layers instead? A device mapper can arbitrarily
transform I/O addresses and contents and has direct access to the
mapped device's ioctl interfaces, etc.
Writing a mapper that's driven by out of band data on the underlying
device can be done in a couple hundred lines of code. I've done it to
implement an MD5 integrity checking layer. In this case, all the I/Os
were simply remapped to be above the on-disk MD5 table and expanded to
be at least the hashed cluster size.
Even if the device mapper API is not completely up to the task (which
I strongly suspect it is), it would seem simpler to extend it than to
add 14000 lines of parallel subsystem.
--
Mathematics is the supreme nostalgia of our time.
Simply because UBI is designed for flash devices, not block devices. Note
that UBI is not for MMC/USB stick/SD/etc. flashes, which are used as block
devices, but for _bare_ flashes.
Please glance here for more information about the difference between
flashes and block devices:
http://www.linux-mtd.infradead.org/faq/general.html#L_mtd_vs_hdd
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
I'm well aware of all that. I wrote a NAND driver just last month.
Let's consider this table:
HARD drives MTD device
Consists of sectors Consists of eraseblocks
Sectors are small (512, 1024 bytes) Eraseblocks are larger (32KiB, 128KiB)
read sector and write sector read, write, and erase block
Bad sectors are re-mapped Bad eraseblocks are not hidden
HDD sectors don't wear out Eraseblocks get worn-out
If the end goal is to end up with something that looks like a block
device (which seems to be implied by adding transparent wear leveling
and bad block remapping), then I don't see any reason it can't be done
in device mapper. The 'smarts' of mtdblock could in fact be pulled up
a level. As I've pointed out already, you can already easily address
issues two, four, and five with device mapper layers.
If instead you still want the "NAND-ness" of the device exposed at the
top level so things can do raw eraseblock I/O more efficiently, then I
think instead of duplicating the device mapper framework, we should
instead think about how to integrate NAND devices more closely with
the block layer.
In the end, a block device is something which does random access
block-oriented I/O. Disk and NAND both fit that description.
--
Mathematics is the supreme nostalgia of our time.
Nope, not the end goal. It's more about wear-leveling across the entire
flash chip than it is presenting a "block like" device.
> and bad block remapping), then I don't see any reason it can't be done
> in device mapper. The 'smarts' of mtdblock could in fact be pulled up
There is nothing smart about mtdblock. And mtdblock has nothing to do
with UBI.
> a level. As I've pointed out already, you can already easily address
> issues two, four, and five with device mapper layers.
>
> If instead you still want the "NAND-ness" of the device exposed at the
> top level so things can do raw eraseblock I/O more efficiently, then I
> think instead of duplicating the device mapper framework, we should
> instead think about how to integrate NAND devices more closely with
> the block layer.
>
> In the end, a block device is something which does random access
> block-oriented I/O. Disk and NAND both fit that description.
NAND very much doesn't fit the "random access" part of that. For writes
you have to write in incrementing pages within eraseblocks.
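That write-ordering rule can be captured in a few lines. This is an
illustrative model only; real chips differ in details such as whether pages
may be skipped (this sketch allows skipping forward but never going back).

```c
#include <assert.h>

/* Toy model of the NAND programming rule: within an eraseblock, pages
 * must be programmed in ascending order, and a page cannot be
 * reprogrammed until the whole block has been erased. */
#define PAGES_PER_EB 64

struct nand_eb {
	int next_page;		/* lowest page still legal to program */
};

static int nand_prog_page(struct nand_eb *eb, int page)
{
	if (page < eb->next_page || page >= PAGES_PER_EB)
		return -1;		/* rewrite or out of range: refused */
	eb->next_page = page + 1;	/* may only move forward */
	return 0;
}

static void nand_erase_block(struct nand_eb *eb)
{
	eb->next_page = 0;	/* only an erase makes pages writable again */
}
```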
UBI is about maximizing the number of available eraseblocks to efficiently
wear-level across the largest possible area on a flash chip. MTD itself
contains no higher-level capabilities to deal with this, and UBI uses the
underlying MTD device directly, not through ioctls. This allows existing
flash-specific users (e.g. JFFS2) to run on top of UBI with minimal changes.
Your idea does have some merit, however I believe your focus is misplaced.
Rather than convert UBI to device mapper and somehow try to make it work
through mtdblock (sic), perhaps what should be done is to come up with a
better interface for MTD to present itself as a block device. I would
still find that troubling though.
josh
Disks have OOB areas with ECC, it's just nicely hidden inside the
drive. They also typically have physical sectors bigger than 512
bytes, again hidden.
> > If the end goal is to end up with something that looks like a block
> > device (which seems to be implied by adding transparent wear leveling
>
> Nope, not the end goal. It's more about wear-leveling across the entire
> flash chip than it is presenting a "block like" device.
It seems to be about spanning devices and repartitioning as well.
Hence the analogy with LVM.
> > and bad block remapping), then I don't see any reason it can't be done
> > in device mapper. The 'smarts' of mtdblock could in fact be pulled up
>
> There is nothing smart about mtdblock. And mtdblock has nothing to do
> with UBI.
Note the scare quotes. Device mapper runs on top of a block device.
And mtdblock is currently the block interface that MTD exports. And it
has 'smarts' that hide handling of sub-eraseblock I/O. I'm clearly
talking about an approach that doesn't involve UBI at all.
> > In the end, a block device is something which does random access
> > block-oriented I/O. Disk and NAND both fit that description.
>
> NAND very much doesn't fit the "random access" part of that. For writes
> you have to write in incrementing pages within eraseblocks.
And? You can't do I/O smaller than a sector on a disk.
--
Mathematics is the supreme nostalgia of our time.
Yes, it can span multiple MTDs which spreads the wear-leveling even
more. Yes, it can create/resize/remove volumes. It does that
differently than LVM, but the ideas are related. I don't see the issue
here I guess.
(UBI also has static volumes which LVM doesn't but that is an aside.)
> > > and bad block remapping), then I don't see any reason it can't be done
> > > in device mapper. The 'smarts' of mtdblock could in fact be pulled up
> >
> > There is nothing smart about mtdblock. And mtdblock has nothing to do
> > with UBI.
>
> Note the scare quotes. Device mapper runs on top of a block device.
> And mtdblock is currently the block interface that MTD exports. And it
> has 'smarts' that hide handling of sub-eraseblock I/O. I'm clearly
> talking about an approach that doesn't involve UBI at all.
Ok, but what I'm saying is that using device mapper on top of mtdblock
is not a good solution. mtdblock caches writes within an eraseblock to
a DRAM buffer of eraseblock size. If you get a power failure before
that is flushed out, you lose an entire eraseblock's worth of data.
Oops. And if you constantly flush the buffer, there's no point in
having it in the first place because it doesn't help or hide anything
then. UBI doesn't have this problem.
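The failure window Josh describes can be made concrete with a toy model
(illustrative names, not mtdblock's actual code): the flush has to erase the
physical block before writing the merged buffer back, and a power cut
between those two steps loses the whole eraseblock, not just the sector
being updated.

```c
#include <assert.h>
#include <string.h>

enum { SECT_SIZE = 512, EB_SIZE = 8 * SECT_SIZE };

struct mtdblock_emu {
	unsigned char cache[EB_SIZE];	/* DRAM write buffer */
	unsigned char flash[EB_SIZE];	/* simulated eraseblock on flash */
	int dirty;
};

static void emu_write_sector(struct mtdblock_emu *e, int sect,
			     const unsigned char *data)
{
	memcpy(e->cache + sect * SECT_SIZE, data, SECT_SIZE);
	e->dirty = 1;			/* nothing has reached flash yet */
}

static void emu_flush(struct mtdblock_emu *e)
{
	memset(e->flash, 0xff, EB_SIZE);	/* erase: old data is gone... */
	/* a power failure here loses the ENTIRE eraseblock */
	memcpy(e->flash, e->cache, EB_SIZE);	/* ...until the rewrite lands */
	e->dirty = 0;
}
```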
That's why I suggested fixing the MTD layers that present block devices
first in the part of my reply that you cut off. It seems to me that
you're really after getting flash to look like a block device, which
would enable device mapper to be used for something similar to UBI.
That's fine, but until someone does that work UBI fills a need, has
users, and has an existing implementation.
josh
On Mon, 2007-03-19 at 12:08 -0500, Matt Mackall wrote:
> On Sun, Mar 18, 2007 at 03:31:50PM -0500, Josh Boyer wrote:
> > On Sun, Mar 18, 2007 at 02:18:12PM -0500, Matt Mackall wrote:
> > >
> > > I'm well aware of all that. I wrote a NAND driver just last month.
> > > Let's consider this table:
> > >
> > > HARD drives MTD device
> > > Consists of sectors Consists of eraseblocks
> > > Sectors are small (512, 1024 bytes) Eraseblocks are larger (32KiB, 128KiB)
> > > read sector and write sector read, write, and erase block
> > > Bad sectors are re-mapped Bad eraseblocks are not hidden
> > > HDD sectors don't wear out Eraseblocks get worn-out
> > N/A NAND flash addressed in pages
> > N/A NAND flash has OOB areas
> > N/A (?) NAND flash requires ECC
>
> Disks have OOB areas with ECC, it's just nicely hidden inside the
> drive. They also typically have physical sectors bigger than 512
> bytes, again hidden.
The difference is that the hard drive has an intelligent controller
which hides all this away. NAND FLASH has none, and we have to do it in
software.
> > > If the end goal is to end up with something that looks like a block
> > > device (which seems to be implied by adding transparent wear leveling
> >
> > Nope, not the end goal. It's more about wear-leveling across the entire
> > flash chip than it is presenting a "block like" device.
>
> It seems to be about spanning devices and repartitioning as well.
> Hence the analogy with LVM.
Yes, UBI is a kind of LVM for FLASH, and we did think for quite a while
about reusing LVM before we went the UBI way.
> > > and bad block remapping), then I don't see any reason it can't be done
> > > in device mapper. The 'smarts' of mtdblock could in fact be pulled up
> >
> > There is nothing smart about mtdblock. And mtdblock has nothing to do
> > with UBI.
>
> Note the scare quotes. Device mapper runs on top of a block device.
> And mtdblock is currently the block interface that MTD exports. And it
> has 'smarts' that hide handling of sub-eraseblock I/O. I'm clearly
> talking about an approach that doesn't involve UBI at all.
MTD block has no 'smarts' at all. It is a stupid and broken hack, which
you can utilize to lose data and wear your FLASH out.
> > > In the end, a block device is something which does random access
> > > block-oriented I/O. Disk and NAND both fit that description.
> >
> > NAND very much doesn't fit the "random access" part of that. For writes
> > you have to write in incrementing pages within eraseblocks.
>
> And? You can't do I/O smaller than a sector on a disk.
Should we export block devices with a 16/32/64/128 KiB block size? If not,
we would need to put a lot of clever functionality into the MTD block
device code, which we decided to put into UBI instead, so FLASH-aware file
systems can use this shared functionality too.
If someone wants to implement an intelligent MTD block device which
allows running arbitrary filesystems, then it should be done on top of
UBI. It's not rocket science, but nobody bothers, as we have functional
FLASH filesystems which do their job better without any notion of a block
device.
A disk _IS_ fundamentally different to FLASH and all the magic which is
done inside of CF-Cards and USB-Sticks is just hiding this away. Most of
the controller chips in these devices are broken and I would never ever
store any important data on such.
The main points of UBI are:
- wear levelling across the complete device
- background handling of bitflips
- safe updates
- handling of static volumes, which are easily accessible for
bootloaders
None of this is anywhere near LVM and disks. The only LVM-like
feature is dynamic creation/deletion/resizing of volumes.
tglx
The issue is 14000 lines of patch to make a parallel subsystem.
> (UBI also has static volumes which LVM doesn't but that is an aside.)
If a static volume is simply a non-dynamic volume, then device mapper
can do that too. And countless other things. Which is not an aside.
UBI growing to do all the things that device mapper does is exactly
the thing we should be seeking to avoid.
> > > > and bad block remapping), then I don't see any reason it can't be done
> > > > in device mapper. The 'smarts' of mtdblock could in fact be pulled up
> > >
> > > There is nothing smart about mtdblock. And mtdblock has nothing to do
> > > with UBI.
> >
> > Note the scare quotes. Device mapper runs on top of a block device.
> > And mtdblock is currently the block interface that MTD exports. And it
> > has 'smarts' that hide handling of sub-eraseblock I/O. I'm clearly
> > talking about an approach that doesn't involve UBI at all.
>
> Ok, but what I'm saying is that using device mapper on top of mtdblock
> is not a good solution. mtdblock caches writes within an eraseblock to
> a DRAM buffer of eraseblock size. If you get a power failure before
> that is flushed out, you lose an entire eraseblock's worth of data.
Sigh. That's precisely why I talked about moving said smarts. This is
nothing that a higher level remapping layer can't address.
> That's why I suggested fixing the MTD layers that present block devices
> first in the part of my reply that you cut off. It seems to me that
> you're really after getting flash to look like a block device, which
> would enable device mapper to be used for something similar to UBI.
> That's fine, but until someone does that work UBI fills a need, has
> users, and has an existing implementation.
False starts that get mainlined delay or prevent things getting done
right. The question is and remains "is UBI the right way to do
things?" Not "is UBI the easiest way to do things?" or "is UBI
something people have already adopted?"
If the right way is instead to extend the block layer and device
mapper to encompass the quirks of NAND in a sensible fashion, then UBI
should not go in.
Let me draw a picture so we have something to argue about:
iSCSI/nbd(6)
|
filesystem { swap | ext3 ext3 jffs2
\ | | | /
/ \ | dm-crypt->snapshot(5) /
device mapper -| \ \ | /
| partitioning /
| | partitioning(4)
| wear leveling(3) /
| | /
| block concatenation
| | | | |
\ bad block remapping(2)
| | | |
MTD raw block { raw block devices with no smarts(1)
/ | \ \
hardware { NAND NAND NAND NAND
Notes:
1. This would provide a block device that allowed writing pages and
a secondary method for erasing whole blocks as well as a method for
querying/setting out of band information.
2. This would hide erase blocks either by using an embedded table or
out of band info. This could stack on top of block concatenation if
desired.
3. This would provide wear leveling, and probably simultaneously
provide relatively efficient and safe access to write sector
and page-sized I/O. Below this level, things had better be
comfortable with the limitations of NAND if they want to work well.
4. JFFS2 has its own wear-leveling scheme, as do several other
filesystems, so they probably want to bypass this piece of the stack.
5. We don't reimplement higher pieces of the stack (dm-crypt,
snapshot, etc.).
6. We make some things possible that simply aren't otherwise.
And this picture isn't even interesting yet. Imagine a dm-cache layer
that caches data read from disks in high-speed flash. Or using
dm-mirror to mirror writes to local flash over NBD or to a USB drive.
Neither of these can be done 'right' in a stack split between device
mapper and UBI.
--
Mathematics is the supreme nostalgia of our time.
I explained precisely what I meant by 'smarts' and why I put 'smarts'
in quotes. And here you are repeating the same exact damn thing I
responded to five lines up.
> > > > In the end, a block device is something which does random access
> > > > block-oriented I/O. Disk and NAND both fit that description.
> > >
> > > NAND very much doesn't fit the "random access" part of that. For writes
> > > you have to write in incrementing pages within eraseblocks.
> >
> > And? You can't do I/O smaller than a sector on a disk.
>
> Should we export block devices with 16/32/64/128 KiB size?
Sure, why not?
> A disk _IS_ fundamentally different to FLASH and all the magic which is
> done inside of CF-Cards and USB-Sticks is just hiding this away.
And yet they're still both block devices. That our current block layer
doesn't handle one as well as the other is something we should fix
instead of inventing a whole new full-featured but incompatible block
layer on the side.
--
Mathematics is the supreme nostalgia of our time.
It'll be much smaller after I remove "itsy-bitsy" and most of the
debugging stuff; that is in progress - wait for take 4.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Parallel systems have existed for a very long time. One is
flash->SW_or_HW_FTL->all_blkdev_stuff. The other is MTD->JFFS2. Think
about _why_ there are two of them. Hint: reliability and performance. Your
ranting basically says that only the first one makes sense. This is not
true.
We enhance the second branch, not the first; please realize this. Both
branches have their user base, and always have had.
> iSCSI/nbd(6)
> |
> filesystem { swap | ext3 ext3 jffs2
> \ | | | /
> / \ | dm-crypt->snapshot(5) /
> device mapper -| \ \ | /
> | partitioning /
> | | partitioning(4)
> | wear leveling(3) /
> | | /
> | block concatenation
> | | | | |
> \ bad block remapping(2)
> | | | |
> MTD raw block { raw block devices with no smarts(1)
> / | \ \
> hardware { NAND NAND NAND NAND
Matt, as I pointed out in the first mail, flash != block device. In your
picture I see NAND->MTD raw block. So am I right that you assume that we
already have a decent FTL? The fact is that we do not.
Please bear in mind that a decent FTL is difficult, and an FS on top of
an FTL is slow; the FTL hits performance considerably.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Simply because we want the ability to do fine-grained writes in order
to write data safely to FLASH. If we export those large sizes we
lose this ability and have to write full erase blocks for a couple of
bytes. This simply breaks JFFS2, and you can do the math yourself what
that means for the lifetime of the FLASH when you write small data chunks
in fast sequences and want to make sure that they are written to FLASH
immediately.
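Doing that math with illustrative numbers (128 KiB eraseblocks rated for
100,000 erase cycles; these figures are for illustration, not from any
datasheet): each erase cycle lets a given amount of new data land in the
block, so total data writable before the block is worn out is simply cycles
times bytes-per-erase.

```c
#include <assert.h>

/* Back-of-envelope sketch of "do the math yourself": if every small
 * write forces a full eraseblock erase, endurance is consumed per
 * write, not per eraseblock's worth of data. */
static long long bytes_until_worn_out(long long rated_cycles,
				      long long bytes_per_erase)
{
	return rated_cycles * bytes_per_erase;
}
```

With these numbers, merging writes into full 128 KiB blocks yields roughly
12 GiB of lifetime data per block, while erasing a full block for every
64-byte chunk yields only about 6.4 MB: a factor of 2048 difference.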
> > A disk _IS_ fundamentally different to FLASH and all the magic which is
> > done inside of CF-Cards and USB-Sticks is just hiding this away.
>
> And yet they're still both block devices. That our current block layer
> doesn't handle one as well as the other is something we should fix
> instead of inventing a whole new full-featured but incompatible block
> layer on the side.
And yet they are still broken and unreliable. And you can wear them out
in no time, just because they are stupid and do full eraseblock updates
when you write one sector.
No thanks. A bunch of people have done experiments with those beasts, and
they are unusable for environments where we need to make sure that
data is on FLASH.
UBI is not an incompatible block layer. It makes it possible to implement
a very clever block layer on top. And you can use just one large partition
plus small ones for your kernel image and bootloader, which still get the
benefits of data integrity (by doing background safe copies on bit
flips) and of an easy implementation in an IPL.
tglx
No, it can't: device mapper sits on top of block devices, and FLASH is no
block device. Period.
Device mapper cannot provide a simple, easy-to-decode scheme for boot
loaders. We need to be able to boot out of 512-2048 bytes of NAND FLASH
and be able to find the kernel or second-stage boot loader in this
unordered device.
And no, fixed addresses do not work. Do you want to implement device
mapper in your Initial Bootloader stage?
> > That's why I suggested fixing the MTD layers that present block devices
> > first in the part of my reply that you cut off. It seems to me that
> > you're really after getting flash to look like a block device, which
> > would enable device mapper to be used for something similar to UBI.
> > That's fine, but until someone does that work UBI fills a need, has
> > users, and has an existing implementation.
>
> False starts that get mainlined delay or prevent things getting done
> right. The question is and remains "is UBI the right way to do
> things?" Not "is UBI the easiest way to do things?" or "is UBI
> something people have already adopted?"
>
> If the right way is instead to extend the block layer and device
> mapper to encompass the quirks of NAND in a sensible fashion, then UBI
> should not go in.
No, a block layer on top of FLASH needs 80% of the functionality of UBI in
the first place. You need to implement a clever journalling block device
emulator in order to keep the data alive and the FLASH from being worn out
in no time. You need the wear levelling; otherwise you can throw
away your FLASH in no time.
> Let me draw a picture so we have something to argue about:
>
> iSCSI/nbd(6)
> |
> filesystem { swap | ext3 ext3 jffs2
> \ | | | /
> / \ | dm-crypt->snapshot(5) /
> device mapper -| \ \ | /
> | partitioning /
> | | partitioning(4)
> | wear leveling(3) /
> | | /
> | block concatenation
> | | | | |
> \ bad block remapping(2)
> | | | |
> MTD raw block { raw block devices with no smarts(1)
> / | \ \
> hardware { NAND NAND NAND NAND
>
> Notes:
> 1. This would provide a block device that allowed writing pages and
> a secondary method for erasing whole blocks as well as a method for
> querying/setting out of band information.
Forget about OOB data. OOB data is reserved for ECC. Please read the
recommendations of the NAND FLASH manufacturers. NAND gets less reliable
with higher density devices and smaller processes.
> 2. This would hide erase blocks either by using an embedded table or
> out of band info. This could stack on top of block concatenation if
> desired.
Hide erase blocks? UBI does not hide anything. It maps logical
eraseblocks, which are exposed to the clients, to arbitrary physical
eraseblocks on the FLASH device in order to provide across-device wear
levelling.
This is fundamentally different from device mapper.
> 3. This would provide wear leveling, and probably simultaneously
> provide relatively efficient and safe access to write sector
> and page-sized I/O. Below this level, things had better be
> comfortable with the limitations of NAND if they want to work well.
I don't see how this provides across-device wear levelling.
> 4. JFFS2 has its own wear-leveling scheme, as do several other
> filesystems, so they probably want to bypass this piece of the stack.
JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2's own
wear levelling sucks.
> 5. We don't reimplement higher pieces of the stack (dm-crypt,
> snapshot, etc.).
Why should we reimplement that?
> 6. We make some things possible that simply aren't otherwise.
>
> And this picture isn't even interesting yet. Imagine a dm-cache layer
> that caches data read from disks in high-speed flash. Or using
> dm-mirror to mirror writes to local flash over NBD or to a USB drive.
> Neither of these can be done 'right' in a stack split between device
> mapper and UBI.
Err. Implement a clever block layer on top of UBI and use all the
goodies you want including device mapper.
tglx
A better way would be for MTD to deliver a block dev with a rich
enough interface for JFFS2 to use efficiently in the first place. Yes,
I know that can't be done with the current block dev layer. But that's
what the source is for.
> We enhance the second branch, not the first, please, realize this. Both
> branches have their user base, and have always had.
>
> > iSCSI/nbd(6)
> > |
> > filesystem { swap | ext3 ext3 jffs2
> > \ | | | /
> > / \ | dm-crypt->snapshot(5) /
> > device mapper -| \ \ | /
> > | partitioning /
> > | | partitioning(4)
> > | wear leveling(3) /
> > | | /
> > | block concatenation
> > | | | | |
> > \ bad block remapping(2)
> > | | | |
> > MTD raw block { raw block devices with no smarts(1)
> > / | \ \
> > hardware { NAND NAND NAND NAND
>
> Matt, as I pointed in the first mail, flash != block device.
And as I pointed out, you're wrong. It is both block oriented
(eraseBLOCK??) and random access. That's what a block device is. The
fact that it doesn't look like the other things that Linux currently
calls a block device and supports well is another matter.
> In your picture I see NAND->MTD raw block. So am I right that you
> assume that we already have a decent FTL? The fact is that we do
> not.
No. Look at the picture for more than two seconds, please.
I can tell you didn't do this because you didn't manage to find (1)
which explicitly says "with no smarts". And you also cut out the footnote
where I explained what I meant by "with no smarts".
Find the spots marked (2) and (3). These are your FTL.
> Please, bear in mind that decent FTL is difficult and an FS on top of
> FTL is slow, FTL hits performance considerably.
...and if you'd actually looked at the picture, you'd have seen JFFS2
bypassing it. Along with another footnote explaining it.
--
Mathematics is the supreme nostalgia of our time.
Which of the following two properties does it lack?
- discrete blocks
- non-sequential access to blocks
When you do the obvious s/blocks/eraseblocks/, this appears to be
true.
Saying "but I can't do I/O smaller than the blocksize" doesn't change
this any more than it would for disks.
Saying "but I can do smaller I/O efficiently in some circumstances"
also doesn't change it.
In historical UNIX, some tapes were block devices too. Because they
supported seek().
> Device mapper can not provide a simple easy to decode scheme for boot
> loaders. We need to be able to boot out of 512 - 2048 byte of NAND FLASH
> and be able to find the kernel or second stage boot loader in this
> unordered device.
>
> And no, fixed addresses do not work. Do you want to implement device
> mapper into your Initialial Bootloader stage ?
This is exactly the same problem as booting on a desktop PC. But
somehow LILO manages. My first Linux box had a hell of a lot less disk
than the platform I bootstrapped (and wrote NAND drivers for) last
month had in NAND.
> > > That's why I suggested fixing the MTD layers that present block devices
> > > first in the part of my reply that you cut off. It seems to me that
> > > you're really after getting flash to look like a block device, which
> > > would enable device mapper to be used for something similar to UBI.
> > > That's fine, but until someone does that work UBI fills a need, has
> > > users, and has an existing implementation.
> >
> > False starts that get mainlined delay or prevent things getting done
> > right. The question is and remains "is UBI the right way to do
> > things?" Not "is UBI the easiest way to do things?" or "is UBI
> > something people have already adopted?"
> >
> > If the right way is instead to extend the block layer and device
> > mapper to encompass the quirks of NAND in a sensible fashion, then UBI
> > should not go in.
>
> No, block layer on top of FLASH needs 80% of the functionality of UBI in
> the first place.
Incorrect. A block-based filesystem on top of flash needs this
functionality. But a block device suitable to device mapper layering
(which then provides the functionality) does not.
> You need to implement a clever journalling block device
> emulator in order to keep the data alive and the FLASH not weared out
> within no time. You need the wear levelling, otherwise you can throw
> away your FLASH in no time.
And that's why it's in my picture.
Sorry, I meant hiding bad blocks here. That's why this layer was
labeled "bad block remapping".
> > 3. This would provide wear leveling, and probably simultaneously
> > provide relatively efficient and safe access to write sector
> > and page-sized I/O. Below this level, things had better be
> > comfortable with the limitations of NAND if they want to work well.
>
> I don't see how this provides across device wear levelling.
Because the layer immediately beneath it ("block concatenation") takes
N devices and presents one logical device.
> > 4. JFFS2 has its own wear-leving scheme, as do several other
> > filesystems, so they probably want to bypass this piece of the stack.
>
> JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2s own
> wear levelling sucks.
Ok, fine. How about LogFS, then?
> > 5. We don't reimplement higher pieces of the stack (dm-crypt,
> > snapshot, etc.).
>
> Why should we reimplement that ?
So that you can get encryption and snapshot, etc.?
> > 6. We make some things possible that simply aren't otherwise.
> >
> > And this picture isn't even interesting yet. Imagine a dm-cache layer
> > that caches data read from disks in high-speed flash. Or using
> > dm-mirror to mirror writes to local flash over NBD or to a USB drive.
> > Neither of these can be done 'right' in a stack split between device
> > mapper and UBI.
>
> Err. Implement a clever block layer on top of UBI and use all the
> goodies you want including device mapper.
If I wanted to have both device mapper and device mapper's little
brother in my kernel, I wouldn't have started this thread.
--
Mathematics is the supreme nostalgia of our time.
It appears to be, but it is not. You enforce semantics on a device which
it does not have.
> Saying "but I can't do I/O smaller than the blocksize" doesn't change
> this any more than it would for disks.
There is a huge difference. Disk block size is 512 bytes, while FLASH
erase block size is at minimum 16KiB and up to 256KiB.
Just do the math:
Write sampling data streams in 2KiB chunks to your uber device mapper on
a 1GiB device with 64KiB erase block size:
Fine grained FLASH aware writes allow 32 chunks in a block without
erasing the block.
Your method erases the block 32 times to write the same amount of data.
Result: You wear out the flash 32 times faster. Cool feature.
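The rule of three above can be spelled out as a back-of-envelope
calculation (the chunk and eraseblock sizes are the ones from the
example; the code is purely illustrative):

```python
# Back-of-envelope check of the wear figures above (illustrative only).
CHUNK = 2 * 1024         # application writes 2 KiB at a time
ERASEBLOCK = 64 * 1024   # NAND eraseblock size from the example

chunks_per_block = ERASEBLOCK // CHUNK  # 32 chunks fit into one eraseblock

# Flash-aware writes append chunk after chunk and erase the block only
# once, when it is reclaimed; an eraseblock-sized block device must erase
# and rewrite the whole 64 KiB block for every single 2 KiB chunk.
erases_flash_aware = 1
erases_block_device = chunks_per_block

print(chunks_per_block)       # chunks written per erase, flash-aware
print(erases_block_device)    # erases for the same data, eraseblock I/O
```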
> Saying "but I can do smaller I/O efficiently in some circumstances"
> also doesn't change it.
We can do it under _any_ circumstances and that _does_ change it.
Implementing a clever block device layer on top of UBI is simple and
would provide FLASH page sized I/O, i.e. 2KiB in the above example.
> In historical UNIX, some tapes were block devices too. Because they
> supported seek().
I'm impressed. How exactly are "some tapes" comparable to FLASH chips ?
Your next proposal is to throw away MTD-utils and use "mt" instead ?
> > Device mapper can not provide a simple easy to decode scheme for boot
> > loaders. We need to be able to boot out of 512 - 2048 byte of NAND FLASH
> > and be able to find the kernel or second stage boot loader in this
> > unordered device.
> >
> > And no, fixed addresses do not work. Do you want to implement device
> > mapper into your Initialial Bootloader stage ?
>
> This is exactly the same problem as booting on a desktop PC. But
> somehow LILO manages. My first Linux box had a hell of a lot less disk
> than the platform I bootstrapped (and wrote NAND drivers for) last
> month had in NAND.
No, it is not. You get the absolute sector address of your second stage
and this is a complete no-brainer. The translation is done in the DISK
device.
You simply ignore the fact that inside each disk, USB Stick, CF-CARD,
whatever - there is a more or less intelligent controller device, which
does the mapping to the physical storage location. There is _NO_ such
thing on a bare FLASH chip.
It does not matter whether your embedded device had more NAND space
than my old CP/M machine's floppy. It simply matters that even the old
CP/M floppy device had some rudimentary intelligence on board.
Furthermore, I want to be able to get the bitflip correction on my
second stage loader / kernel in the same safe way as we do it for
everything else and still be able to bootstrap that from an extremely
small bootloader.
> > > If the right way is instead to extend the block layer and device
> > > mapper to encompass the quirks of NAND in a sensible fashion, then UBI
> > > should not go in.
> >
> > No, block layer on top of FLASH needs 80% of the functionality of UBI in
> > the first place.
>
> Incorrect. A block-based filesystem on top of flash needs this
> functionality. But a block device suitable to device mapper layering
> (which then provides the functionality) does not.
How exactly does device mapper:
A) do across-device wear levelling ?
B) provide dynamic partitioning for FLASH aware file systems ?
C) do across-device wear levelling for FLASH aware file systems ?
D) do background bit-flip correction (copying affected blocks and
recycling the old ones) ?
E) allow position independent placement of the second stage bootloader ?
> > You need to implement a clever journalling block device
> > emulator in order to keep the data alive and the FLASH not weared out
> > within no time. You need the wear levelling, otherwise you can throw
> > away your FLASH in no time.
>
> And that's why it's in my picture.
Yes, it is in your picture, but:
1) it excludes FLASH aware file systems and UBI does not.
2) your picture still does not explain how it achieves the above A),
B), C), D) and E)
Your extra path for partitioning(4) and JFFS2 is just a weird hack,
which makes your proposal completely absurd.
> > > Let me draw a picture so we have something to argue about:
> > >
> > > iSCSI/nbd(6)
> > > |
> > > filesystem { swap | ext3 ext3 jffs2
> > > \ | | | /
> > > / \ | dm-crypt->snapshot(5) /
> > > device mapper -| \ \ | /
> > > | partitioning /
> > > | | partitioning(4)
> > > | wear leveling(3) /
> > > | | /
> > > | block concatenation
> > > | | | | |
> > > \ bad block remapping(2)
> > > | | | |
> > > MTD raw block { raw block devices with no smarts(1)
> > > / | \ \
> > > hardware { NAND NAND NAND NAND
> > > Notes:
Let me draw an UBI picture:
VFS -Layer
|
(Future VFS-crypto)
/ \
| |
______ device mapper / fs |
/ | |
/ | |
| Do whatever you don't |
| want to do with FLASH |
| whether it makes sense |
| or not. |
| | |
Block layer UBI block device Flash aware filesystems
| \ /
device \ /
driver |
| |
_____|_______ |
| | |
| Device | UBI
| resident | |
| "UBI" | |
|___________| |
|
MTD-CORE
|
nand base driver
/ | \
device driver device driver device driver
| |
NAND NAND NAND
No notes: it's simple and self-explanatory.
> > > 1. This would provide a block device that allowed writing pages and
> > > a secondary method for erasing whole blocks as well as a method for
> > > querying/setting out of band information.
> >
> > Forget about OOB data. OOB data is reserved for ECC. Please read the
> > recommendations of the NAND FLASH manufacturers. NAND gets less reliable
> > with higher density devices and smaller processes.
> >
> > > 2. This would hide erase blocks either by using an embedded table or
> > > out of band info. This could stack on top of block concatenation if
> > > desired.
> >
> > Hide erase blocks ? UBI does not hide anything. It maps logical
> > eraseblocks, which are exposed to the clients to arbitrary physical
> > eraseblocks on the FLASH device in order to provide across device wear
> > levelling.
>
> Sorry, I meant hiding bad blocks here. That's why this layer was
> labeled "bad block remapping".
Oh well, a separate bad block remapper. And how is the logical mapping
of wear-levelled erase blocks done ?
> > > 3. This would provide wear leveling, and probably simultaneously
> > > provide relatively efficient and safe access to write sector
> > > and page-sized I/O. Below this level, things had better be
> > > comfortable with the limitations of NAND if they want to work well.
> >
> > I don't see how this provides across device wear levelling.
>
> Because the layer immediately beneath it ("block concatenation") takes
> N devices and presents one logical device.
And how is the wear levelling done on this logical device in
device mapper ?
How is it ensured that the wear average is maintained also across the
partitions which are used by JFFS2 or other FLASH aware filesystems ?
> > > 4. JFFS2 has its own wear-leving scheme, as do several other
> > > filesystems, so they probably want to bypass this piece of the stack.
> >
> > JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2s own
> > wear levelling sucks.
>
> Ok, fine. How about LogFS, then?
LogFS can easily leverage UBI's wear algorithm.
> > > 5. We don't reimplement higher pieces of the stack (dm-crypt,
> > > snapshot, etc.).
> >
> > Why should we reimplement that ?
>
> So that you can get encryption and snapshot, etc.?
1. On top of a clever block device.
2. UBI can do snapshots by design.
3. Encryption should be done on the VFS layer and not below the
filesystem layer. Doing it inside the block layer or the device mapper
is broken by design.
> > > 6. We make some things possible that simply aren't otherwise.
> > >
> > > And this picture isn't even interesting yet. Imagine a dm-cache layer
> > > that caches data read from disks in high-speed flash. Or using
> > > dm-mirror to mirror writes to local flash over NBD or to a USB drive.
> > > Neither of these can be done 'right' in a stack split between device
> > > mapper and UBI.
> >
> > Err. Implement a clever block layer on top of UBI and use all the
> > goodies you want including device mapper.
>
> If I wanted to have both device mapper and device mapper's little
> brother in my kernel, I wouldn't have started this thread.
You still did not explain how device mapper does:
- across-device wear levelling
- dynamic partitioning for FLASH aware file systems
- across-device wear levelling for FLASH aware file systems
- simple boot loader support
- fine grained I/O
UBI is not device mapper's little brother. It is the software version of
the silicon in a CF-CARD / USB-Stick, but it does a better job and
allows clever usage of FLASH instead of enforcing eraseblock-sized I/O
units. Does your CF-Card / USB-Stick do that ?
Just think about the 1GiB USB stick, which would present you 64KiB I/O
units instead of 2KiB ones.
Your signature is a nice intellectual signboard, but the ancient simple
rule of three just tells me that you are off by a factor of 32.
tglx
Why the hell would JFFS2 need a block device interface ?
What's the gain ?
> > We enhance the second branch, not the first, please, realize this. Both
> > branches have their user base, and have always had.
> >
> > > iSCSI/nbd(6)
> > > |
> > > filesystem { swap | ext3 ext3 jffs2
> > > \ | | | /
> > > / \ | dm-crypt->snapshot(5) /
> > > device mapper -| \ \ | /
> > > | partitioning /
> > > | | partitioning(4)
> > > | wear leveling(3) /
> > > | | /
> > > | block concatenation
> > > | | | | |
> > > \ bad block remapping(2)
> > > | | | |
> > > MTD raw block { raw block devices with no smarts(1)
> > > / | \ \
> > > hardware { NAND NAND NAND NAND
> >
> > Matt, as I pointed in the first mail, flash != block device.
>
> And as I pointed out, you're wrong. It is both block oriented
> (eraseBLOCK??) and random access. That's what a block device is. The
> fact that it doesn't look like the other things that Linux currently
> calls a block device and supports well is another matter.
It very much does matter, as it is not a block device. It is a FLASH
device, and you can do as many comparisons of eraseBLOCK as you want,
you do not turn FLASH into a DISK.
Again: disks (including CF-Cards and USB-Sticks) have intelligent
controllers, which abstract the hardware oddities away and present you a
block device.
> > In your picture I see NAND->MTD raw block. So am I right that you
> > assume that we already have a decent FTL? The fact is that we do
> > not.
>
> No. Look at the picture for more than two seconds, please.
>
> I can tell you didn't do this because you didn't manage to find (1)
> which explicitly says "with no smarts". And you also cut out the footnote
> where I explained what I meant by "with no smarts".
>
> Find the spots marked (2) and (3). These are your FTL.
And where please are (2) and (3) inside of device mapper ?
> > Please, bear in mind that decent FTL is difficult and an FS on top of
> > FTL is slow, FTL hits performance considerably.
>
> ...and if you'd actually looked at the picture, you'd have seen JFFS2
> bypassing it. Along with another footnote explaining it.
The (4) partitioning and JFFS2 on top is a step back from the current
UBI functionality. With UBI we can have resizable partitioning even for
JFFS2, and JFFS2 can utilize the UBI wear levelling, which is way better
than the crude heuristics of JFFS2.
You want to force FLASH into device mapper for some strange and
non-obvious reason. Just the coincidence of "eraseBLOCK" and
"BLOCKdevice" is not really convincing.
You impose the usage of eraseblock size on FLASH, which is simply wrong:
DISK has a 1:1 relationship of "eraseblock" and minimal I/O. FLASH has
not. I did the math in a different mail and I'm not buying your
factor-32 FLASH lifetime reduction for the price of having a bunch of
lines of code less in the kernel.
If you really want to run ext3, xfs or whatever on top of FLASH, please
go and do the homework on CF-Cards and USB-Sticks. Run them into the
fast wearout death. And device mapper does nothing to avoid that.
Running ext3 on top of FLASH with a minimal I/O size of erase block size
is simply braindead.
tglx
Sigh. That's the current /dev/mtdblock method, not my method. You're too
fixated on what you think I'm saying to hear what I'm saying.
> > Saying "but I can do smaller I/O efficiently in some circumstances"
> > also doesn't change it.
>
> We can do it under _any_ circumstances and that _does_ change it.
> Implementing a clever block device layer on top of UBI is simple and
> would provide FLASH page sized I/O, i.e. 2Kib in the above example.
Yes. I know. I've written a complete (non-Linux) FTL. I know what's
entailed.
> > In historical UNIX, some tapes were block devices too. Because they
> > supported seek().
>
> I'm impressed. How exactly are "some tapes" comparable to FLASH chips ?
>
> Your next proposal is to throw away MTD-utils and use "mt" instead ?
Don't be an ass. I'm pointing out that not all block devices are disks.
> > > Device mapper can not provide a simple easy to decode scheme for boot
> > > loaders. We need to be able to boot out of 512 - 2048 byte of NAND FLASH
> > > and be able to find the kernel or second stage boot loader in this
> > > unordered device.
> > >
> > > And no, fixed addresses do not work. Do you want to implement device
> > > mapper into your Initialial Bootloader stage ?
> >
> > This is exactly the same problem as booting on a desktop PC. But
> > somehow LILO manages. My first Linux box had a hell of a lot less disk
> > than the platform I bootstrapped (and wrote NAND drivers for) last
> > month had in NAND.
>
> No, it is not. You get the absolute sector address of your second stage
> and this is a complete nobrainer. The translation is done in the DISK
> device.
LILO and friends manage to boot systems that use software RAID and
LVM. There are multiple methods. Some use block lists, some use tiny
boot partitions, etc. All of them are applicable to controllerless NAND.
> You simply ignore the fact, that inside each disk, USB Stick, CF-CARD,
> whatever - there is a more or less intellegent controller device, which
> does the mapping to the physical storage location. There is _NO_ such
> thing on a bare FLASH chip.
How many times do I have to tell you that I wrote a driver for
controllerless NAND just last month?
> How exactly does device mapper:
>
> A) across device wear levelling ?
The same way UBI does, but encapsulated in a device mapper layer.
> B) dynamic partitioning for FLASH aware file systems ?
See above.
> C) across device wear levelling for FLASH aware file systems ?
See above.
> D) background bit-flip corrections (copying affected blocks and recylce
> the old one) ?
See above.
> E) allow position independent placement of the second stage bootloader ?
See way above to my LILO response.
> > > You need to implement a clever journalling block device
> > > emulator in order to keep the data alive and the FLASH not weared out
> > > within no time. You need the wear levelling, otherwise you can throw
> > > away your FLASH in no time.
> >
> > And that's why it's in my picture.
>
> Yes, it is in your picture, but:
>
> 1) it excludes FLASH aware file systems and UBI does not.
> 2) your picture does still not explain how it does achive the above A),
> B), C), D) and E)
>
> Your extra path for partitioning(4) and JFFS2 is just a weird hack,
> which makes your proposal completely absurd.
No, it's just there to show the flexibility of device mapper. But I have
the sneaking suspicion you have no idea how device mapper works.
In brief: device mapper takes one or more devices, applies a mapping
to them, and returns a new device. For example, take various spans of
/dev/hda1 and /dev/sda3 and present them as new-device1. Take
new-device1 and transform it with dm-crypt to get new-device2. The
kernel doesn't decide how to do this, any more than it decides where
to mount your filesystems. Userspace does.
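The mapping described above can be sketched in a few lines (the device
names and span lengths are invented for this illustration; real dm
tables are loaded from userspace with dmsetup and are expressed in
512-byte sectors):

```python
# Illustrative sketch of device mapper's linear concatenation: a table of
# (length, backing device, offset) spans presented as one logical device.
# Device names and span sizes are made up for this example.
table = [
    (1000, "/dev/hda1", 500),  # logical sectors 0..999    -> hda1 at 500
    (2000, "/dev/sda3", 0),    # logical sectors 1000..2999 -> sda3 at 0
]

def map_sector(table, sector):
    """Translate a logical sector into (backing device, physical sector)."""
    start = 0
    for length, dev, offset in table:
        if sector < start + length:
            return dev, offset + (sector - start)
        start += length
    raise ValueError("sector beyond end of mapped device")

print(map_sector(table, 999))   # last sector of the first span
print(map_sector(table, 1000))  # first sector of the second span
```

Stacking works the same way: the resulting logical device can itself be
the backing device of another mapping (dm-crypt, snapshot, and so on).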
> > > > 5. We don't reimplement higher pieces of the stack (dm-crypt,
> > > > snapshot, etc.).
> > >
> > > Why should we reimplement that ?
> >
> > So that you can get encryption and snapshot, etc.?
>
> 1. On top of a clever block device.
>
> 2. UBI can do snapshots by design.
Oh, so you HAVE reimplemented it.
> 3. Encryption should be done on the VFS layer and not below the
> filesystem layer. Doing it inside the block layer or the device mapper
> is broken by design.
That's highly debatable and not a topic for this thread.
--
Mathematics is the supreme nostalgia of our time.
Yes, by using fixed addresses, which is not what I want.
> > You simply ignore the fact, that inside each disk, USB Stick, CF-CARD,
> > whatever - there is a more or less intellegent controller device, which
> > does the mapping to the physical storage location. There is _NO_ such
> > thing on a bare FLASH chip.
>
> How many times do I have to tell you that I wrote a driver for
> controllerless NAND just last month?
Wow. I'm impressed because I'm pulling my opinion out of thin air.
> > How exactly does device mapper:
> >
> > A) across device wear levelling ?
>
> The same way UBI does, but encapsulated in a device mapper layer.
Does the device mapper do that ?
> > B) dynamic partitioning for FLASH aware file systems ?
>
> See above.
Does the device mapper do that ?
> > C) across device wear levelling for FLASH aware file systems ?
>
> See above.
Look at your own drawing.
> > D) background bit-flip corrections (copying affected blocks and recylce
> > the old one) ?
>
> See above.
Repeating patterns do not impress me. Your drawing tells otherwise.
> > E) allow position independent placement of the second stage bootloader ?
>
> See way above to my LILO response.
Neither LILO nor GRUB has search capabilities for randomly located
second stage loaders.
> > > > You need to implement a clever journalling block device
> > > > emulator in order to keep the data alive and the FLASH not weared out
> > > > within no time. You need the wear levelling, otherwise you can throw
> > > > away your FLASH in no time.
> > >
> > > And that's why it's in my picture.
> >
> > Yes, it is in your picture, but:
> >
> > 1) it excludes FLASH aware file systems and UBI does not.
> > 2) your picture does still not explain how it does achive the above A),
> > B), C), D) and E)
> >
> > Your extra path for partitioning(4) and JFFS2 is just a weird hack,
> > which makes your proposal completely absurd.
>
> No, it's just there to show the flexibility of device mapper. But I have
> the sneaking suspicion you have no idea how device mapper works.
Sigh. Layering violation == flexibility.
> In brief: device mapper takes one or more devices, applies a mapping
> to them, and returns a new device. For example, take various spans of
> /dev/hda1 and /dev/sda3 and present them as new-device1. Take
> new-device1 and transform it with dm-crypt to get new-device2. The
> kernel doesn't decide how to do this, any more than it decides where
> to mount your filesystems. Userspace does.
I know how it works. But your blurb does not answer any of my questions.
> > > > > 5. We don't reimplement higher pieces of the stack (dm-crypt,
> > > > > snapshot, etc.).
> > > >
> > > > Why should we reimplement that ?
> > >
> > > So that you can get encryption and snapshot, etc.?
> >
> > 1. On top of a clever block device.
> >
> > 2. UBI can do snapshots by design.
>
> Oh, so you HAVE reimplemented it.
No, it already works.
> > 3. Encryption should be done on the VFS layer and not below the
> > filesystem layer. Doing it inside the block layer or the device mapper
> > is broken by design.
>
> That's highly debatable and not a topic for this thread.
I see, you define what has to be discussed.
tglx
This is where we disagree obviously. However, getting UBI into mainline
won't delay or prevent your proposal from getting done. That's like
saying having ext3 in mainline prevents other filesystems from getting
created. There is nothing wrong with having different subsystems that
overlap in a few areas.
What you're proposing seems like it would take at least several weeks to
even get close to what is needed in terms of reliability and the
required wear-leveling if it is indeed possible to implement. And it
would likely duplicate some of the wear-leveling and bad block handling
code that is present in UBI anyway. In the meantime, the need for UBI
exists today and there is a working, tested implementation available.
josh
You have failed to clearly define what "block" means until now, yet you
blame me for not understanding you. So let's assume block = eraseblock
for the further conversation.
OK. Suppose we have done what you say, although I _do not_ think it
makes a lot of sense. So, now we have a block device with a 128KiB block
size. We have LVM, dm-wl or whatever stuff. Fine.
Do you realize that 128KiB is a _huge_ block size, and performance will
suck, and suck a lot, if you utilize, say, ext2 or whatever block device
FS?
Do you realize that I may not be satisfied with slow I/O? Do I have the
right to a faster one? Thanks if yes.
To make it faster I have to have a way to do finer grained I/O:
read/write to different positions of a 128KiB block. Do you realize how
much you will abuse all the generic block device infrastructure if you
try to add this? Note, all levels up to LVM will need to have this. I
believe it is braindead ((c) tglx) to add this feature.
Also, in UBI we have the following features:
1. data type hints: you may basically help UBI to pick the optimal
eraseblock if you specify the data life-time - is it long-lived data, or
short-lived/temporary data.
2. Some other ones, which I do not want to describe now.
Are you offering to add this stuff to device mapper?
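The point of the life-time hint can be illustrated with a toy picker
(this is not UBI's actual selection code, and the erase counts are made
up): short-lived data can go to the most-worn free eraseblock, since it
will be erased and freed again soon, while long-lived data should get
the least-worn one, so it does not pin a fresh block forever.

```python
# Toy illustration of data-lifetime hints (not UBI's real algorithm).
free_blocks = {0: 7, 1: 100, 2: 3, 3: 250}  # eraseblock -> erase count

def pick_eraseblock(blocks, lifetime):
    """Pick a free eraseblock according to the caller's lifetime hint."""
    if lifetime == "short":
        chosen = max(blocks, key=blocks.get)  # most worn: freed soon anyway
    else:
        chosen = min(blocks, key=blocks.get)  # least worn: held long-term
    del blocks[chosen]
    return chosen

print(pick_eraseblock(dict(free_blocks), "short"))  # most-worn block
print(pick_eraseblock(dict(free_blocks), "long"))   # least-worn block
```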
So, your approach only makes sense if you are going to work with flash
as a block device with block size = eraseblock size. No finer grained
access at all. It is fine, some users may be OK with this. But please,
do not be so naive - the performance will suck a _lot_. Let alone I
doubt it will really fit the DM infrastructure.
We are working on a different approach. And in general, the picture
which Thomas drew for you makes _much more_ sense. Please, do not be so
stuck to your way; it is not bad or good, it is just _different_, and it
has obvious limitations which we do not want to have, thus we go the
other way.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
As a suggestion, let's stop right here and see if we can get both
sides talking in a more constructive fashion. Maybe it's just me, but
I see both sides talking past each other in a rather dramatic fashion.
Linux seems to allow multiple implementations of things at the edge
(such as filesystems), but not at the core (device mapper went in, a
competing volume manager didn't; it's unlikely that we would have two
competing block device layers or two VFS layers, etc.). The question
then is whether UBI and dm are close enough that they should be one
subsystem or not.
There are a number of red herrings that have been introduced in this
discussion; of *course* the existing block device layer can handle
FLASH devices; Matt is proposing that they be extended. And of
*course* you wouldn't propose to use ext2 on top of a 128k blocksize,
any more than you would force a flash filesystem to use a 4k or 512
byte blocksize; there are plenty of configurations that won't make
sense, and by itself this isn't an indictment of the core idea that
the block device layer and dm should be augmented to encompass flash
functionality.
As far as who gets to do the work, unfortunately sometimes the people
submitting the new code have to make the changes suggested by the
reviewer. That's one of the prices that gets paid for mainline
inclusion. It's different when someone asks for a completely new
feature, especially for code that is already in mainline; then, "feel
free to send a patch" is perfectly accepted. But if it's a matter of
refactoring the code to fit in some other framework, that's often up
to the submitter to do, not the reviewer.
Of course, it remains to be seen whether or not this is a good idea to
do in the first place; but some of the arguments being used to shoot
down Matt's suggestion aren't really good ones to begin with.
> To make it faster I have to have a way to do finer grained I/O:
> read/write to different positions of 128KiB block. Do you realize how
> much you will abuse all the generic block device infrastructure if you
> try to add this? Note, all levels up to LVM will need to have this. A
> believe it is braindead ((c) tglx) to add this feature.
OK, and this could be it. But I suspect one of the things which may
be missing, and which would make it easier for you to explain why what
UBI is doing is so different from the dm and block device stack, is to
include the contents of:
http://www.linux-mtd.infradead.org/faq/ubi.html
http://www.linux-mtd.infradead.org/doc/ubi.html
and some system level documentation in a Documentation/ubi.txt file as
part of the patch set. I don't think people completely understand the
high-level architecture of what UBI is trying to achieve. What are
the interfaces at the top and the bottom of the stack? For example,
the fact that UBI exports Logical Erase blocks that are not a
power-of-two (possibly 128k minus 128 bytes) means that it certainly
might not be a good match for the dm stack. But why is that the case?
I can imagine good reasons for it, but a high-level description of the
design decisions would be very useful.
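The odd LEB size mentioned above falls out of simple arithmetic: UBI
keeps two small headers at the start of every physical eraseblock, and
each header must occupy at least one minimum-write unit of the flash.
The figures below are illustrative, not taken from the patch:

```python
# Why a logical eraseblock is not a power of two (illustrative figures):
# each physical eraseblock starts with two UBI headers, and each header
# occupies at least one minimum-write unit of the underlying flash.
def leb_size(peb_size, min_io_unit):
    ec_header = min_io_unit   # erase-counter header
    vid_header = min_io_unit  # volume-identifier header
    return peb_size - ec_header - vid_header

# Flash writable in 64-byte units: 128 KiB minus 128 bytes, as in the text.
print(leb_size(128 * 1024, 64))    # 130944
# NAND with 2 KiB pages: the headers cost two whole pages (128 KiB - 4 KiB).
print(leb_size(128 * 1024, 2048))  # 126976
```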
It would also help people understand why there are so many "units" in
UBI, since hopefully the high-level documentation would explain why
they fit together, and perhaps why some of the units weren't folded
together. What value do they add as separate components?
There are hints of the overall system architecture in some of the
individual comments for data structures, but even reading all of those,
there isn't quite enough for people to figure out what it is; and that
may be causing some of these comments of people saying there's too
much code to evaluate, or why didn't you do it *this* way?
Regards,
- Ted
Teo, the units will go away. I'll leave only 4 of them:
1. I/O, just to hide some I/O related complexities.
2. Scanning: just because I am planning to add other device attaching
methods, without scanning.
3. Wear-leveling, just because I want to improve the algorithm in the
future. Changing the algorithm means changing data structures, so I want
to keep them separate.
4. EBA - because I want to keep all mapping-related stuff in one place.
Well, this does not have to be called a unit, just mapping-related code
in one file. Also, the long-term plan is to have the table on-flash
(currently it is in RAM, which does not scale well).
Everything else will be folded together. No itsy-bitsy units. I've
almost finished this re-structuring and am now bug-fixing.
P.S.: I'll let other folks comment on the other stuff.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Perhaps, yes. Though I've been trying to be open to Matt's suggestions.
Please don't mistake confusion for hostility.
> There are a number of red herrings that have been introduced in this
> discussion; of *course* the existing block device layer can handle
> FLASH devices; Matt is proposing that they be extended. And of
Sure. But the larger question is *should* it be extended to do so.
> *course* you wouldn't propose to use ext2 on top of a 128k blocksize,
> any more than you would force a flash filesystem to use a 4k or 512
> byte blocksize; there are plenty of configurations that won't make
Except that flash filesystems don't use block devices at all. They use
MTD interfaces.
> sense, and by itself this isn't an indictment of the core idea that
> the block device layer and dm should be augmented to encompass flash
> functionality.
This is where the concept starts to lose me. Augmented how? To not use
MTD at all (obviously with the exception of the low-level flash
drivers)? How is that not going to duplicate MTD? Etc, etc.
Look at it from this point of view. MTD is the existing interface for
dealing with flash devices. UBI was written to solve issues with flash
devices. UBI was written on top of MTD. Suggesting that the UBI
developers go off and hack the block layer to work with flash devices
just to use dm seems completely foreign. Most of the boards UBI is used
on disable the block layer as much as they can because it's not needed.
This is the biggest source of confusion/contention. Making the somewhat
magical jump to representing flash as a block device, without a bit more
detail as to how it's really going to cope with the requirements of
flash and why it's a great idea to do so, is a bit hard to swallow.
Discussing the device mapper extensions is sort of pointless until this
is figured out.
> high-level architecture of what UBI is trying to achieve. What are
> the interfaces at the top and the bottom of the stack? For example,
> the fact that UBI exports Logical Erase blocks that are not a
> power-of-two (possibly 128k minus 128 bytes) means that it certainly
> might not be a good match for the dm stack. But why is that the case?
> I can imagine good reasons for it, but a high-level description of the
> design decisions would be very useful.
Basically because you need to store metadata in each eraseblock (and not
in OOB). That metadata consumes space, reducing the usable storage in
each eraseblock and making it no longer a power of two.
> It would also help people understand why there are so many "units" in
> UBI, since hopefully the high-level documentation would explain why
> they fit together, and perhaps why some of the units weren't folded
> together. What value do they add as separate components?
Artem is reworking the units per your (and others') suggestions. The
debugging code is also being worked on.
> There are hints of the overall system architecture in some of the
> indivdual comments for data structures, but even reading all of those,
> there isn't quite enough for people to figure out what it is; and that
> may be causing some of these comments of people saying there's too
> much code to evaluate, or why didn't you do it *this* way?
Some of that can probably be added, sure. Though to be fair, it'll add
even more lines to the patch and those links have been posted 4 times
already. They're even posted at the start of _this_ thread. Having it
in the patch under Documentation/ is a good idea, but you can't force
people to read that before they comment on things.
josh
> On Tue, 2007-03-20 at 09:52 -0400, Theodore Tso wrote:
>> As a suggestion, let's stop right here and see if we can get both
>> sides talking in a more constructive fashion. Maybe it's just me, but
>> I see both sides talking past each other in a rather dramatic fashion.
>
> Perhaps, yes. Though I've been trying to be open to Matt's suggestions.
> Please don't mistake confusion for hostility.
>
>> There are a number of red herrings that have been introduced in this
>> discussion; of *course* the existing block device layer can handle
>> FLASH devices; Matt is proposing that they be extended. And of
>
> Sure. But the larger question is *should* it be extended to do so.
>
>> *course* you wouldn't propose to use ext2 on top of a 128k blocksize,
>> any more than you would force a flash filesystem to use a 4k or 512
>> byte blocksize; there are plenty of configurations that won't make
>
> Except that flash filesystems don't use block devices at all. They use
> MTD interfaces.
>
>> sense, and by itself this isn't an indictment of the core idea that
>> the block device layer and dm should be augmented to encompass flash
>> functionality.
>
> This is where the concept starts to lose me. Augmented how? To not use
> MTD at all (obviously with the exception of the low-level flash
> drivers)? How is that not going to duplicate MTD? Etc, etc.
What Matt and Ted are looking at is the question 'are flash devices close enough
to other block devices that it would make sense to change the existing linux
definition of a block device to handle the special requirements of flash?'
if the block device layer can be reasonably modified to accommodate flash, then
doing so greatly improves flexibility and maintainability. It would also reduce
the overall code size, since existing features of the block layer (for example
snapshots) would not need to be duplicated or re-written for a flash block
layer.
if not then so be it.
everyone understands that flash has different requirements from a hard drive as
a block device. what isn't clear to the people reading this thread (and
reviewing the code) is why you believe that it is _so_ different that it's
impossible to consider extending the linux definition of a block device.
the fact that the native eraseblock size is significantly larger isn't a factor.
the fact that you erase in large blocks and then write in smaller blocks is a
difference, and one that the current block layer doesn't understand. but this is
a difference that the current block layer could be changed to understand. it's
not something that would justify a separate-but-equal block layer for flash
devices.
as Ted notes, the idea that block sizes may not be powers of 2 (128k-128b from
his e-mail) _may_ end up being a big enough difference that it's not worth
teaching the existing block layer how to deal with, but it's not clear why you
are using this odd size.
this is why you are being asked for further explanations.
David Lang
I am _not_ a block device layer expert. But I think it is a silly idea to
abuse it by adding the possibility of reading/writing from/to the middle
of a block. Isn't it obvious that the assumption that a block is the
_minimal_ I/O unit sits _deep_ inside the design?
We also need a few other features as well, like data life-time hints to
help the wear-leveling engine pick an optimal eraseblock. And there are
more features we need to have. Do you want to add all of those to the
block device infrastructure?
Thomas wrote about how one can reuse all the block device goodies, like
LVM, FSes etc. He drew a picture; just scroll back and glance at it. This
makes much more sense.
Guess why we still do not have a decent FTL? Because it is _difficult_.
Now, when we have UBI one can implement FTL much, much easier. It
becomes really possible now. Because UBI already hides many complexities
of flash, and FTL layer should not care about many things. It may
concentrate on FTL problems, for example on a smart garbage collector,
which is also a difficult thing. Also, with UBI, for example, the FTL
layer may store on-flash tables with block mappings, because UBI takes
care of wear-leveling. I mean, the FTL may update those tables as many
times as it wants, without worrying that the corresponding eraseblocks
wear out.
After we have implemented an FTL, we can re-use all the block device
infrastructure - LVM, dm-crypt, ext3 and 4, and so on. This does make
sense. And this is what Thomas's picture shows.
So please, look at UBI as a low-level layer which just hides flash
complexities like wear and bad blocks. It also does write-failure
recovery automatically - this is a very important feature. These are
essentially the things which make our life horrible, and UBI kicks
them out. I am not a newbie in the area and I know how difficult it is
to develop on top of raw flash. Yes, it allows creating volumes, but
this is not the main feature of it. It just comes naturally.
And one note: UBI is flash type independent, so you can use it on top of
NOR/NAND/DataFlash/AG-AND/ECC'd NOR and so on, as long as MTD support
exists. For example, we do not use OOB at all. I write this just because
Matt always used NAND as an example, just for clarification.
> as Ted notes, the idea that block sizes may not be powers of 2 (128k-128b from
> his e-mail) _may_ end up being a big enough difference that it's not worth
> teaching the existing block layer how to deal with, but it's not clear why you
> are using this odd size.
The eraseblock size is a power of 2. We store the erase counter (needed
for wear-levelling) and the logical-to-physical eraseblock mapping in
each eraseblock. Thus, we reduce the usable size.
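As a sanity check, the arithmetic can be sketched in a few lines. The two 64-byte header sizes below are assumptions chosen to reproduce the "128k minus 128 bytes" figure mentioned earlier in the thread, not a statement of the actual on-flash layout:

```python
# Why a logical eraseblock is not a power of two: per-eraseblock metadata
# (erase counter, logical-to-physical mapping) is stored in-band.
# Header sizes below are illustrative assumptions.
PEB_SIZE = 128 * 1024    # physical eraseblock size, a power of two
EC_HEADER_SIZE = 64      # assumed size of the erase-counter header
VID_HEADER_SIZE = 64     # assumed size of the mapping (volume ID) header

def leb_size(peb_size, header_sizes):
    """Usable logical eraseblock size after subtracting in-band metadata."""
    return peb_size - sum(header_sizes)

print(leb_size(PEB_SIZE, [EC_HEADER_SIZE, VID_HEADER_SIZE]))  # 130944
```

That is 128 KiB minus 128 bytes: still a nicely aligned size for flash I/O, but no longer a power of two, which is what makes it awkward for layers that assume power-of-two block sizes.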
We do not want to have any on-flash table, because we end up with a
chicken-and-egg problem: the tables are updated often, so they cannot
sit in fixed eraseblocks. They should constantly change position to
ensure wear-leveling. This is very difficult and less robust.
> this is why you are being asked for further explanations.
Although we do not have shiny documentation, all the _essentials_ are
explained in the existing, not-so-shiny docs, so those really interested
can find them there. I mean, if one does not know much about the area
and does not spend time exploring it, we cannot really help. But anyway,
we will try to write better docs; it is just a question of time.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
I've seen no real proposals about how this could be done, so it's a
purely academic question. But I'm dubious. The block layer is optimised,
perhaps unsurprisingly, for block devices. Making it handle our special
case might be possible, but I don't really think it's likely to fly once
it becomes real code and not just mental self-abuse. Hell, we haven't
even got block _discard_ support merged yet, because it's too esoteric
for people to care about.
The MTD API does need to be re-thought. It's no longer quite so
unthinkable that we'll encounter flash sizes above 4GiB, and the way we
(theoretically) handle asynchronous erases while read and write are
synchronous is a bit icky. I'm not averse to using queues and making it
look a _bit_ like block devices in some respects, but in practice I
don't think it'll be very close at all.
The MTD API is intended to represent the raw capabilities of the
underlying flash devices, with all the bizarre restrictions and features
that the various types of flash chip have. If you want to use it as a
block device, that's what translation layers are for.
--
dwmw2
No. We don't have a decent FTL because it's _pointless_. We've got basic
implementations of FTL, NFTL, INFTL etc. for compatibility with PCMCIA
stuff and DiskOnChip, but the fact remains that pretending to be a
normal block device with atomically-overwritten 512-byte sectors is just
_stupid_. You end up implementing a kind of pseudo-filesystem to do
that, and then on top of that you put a 'normal' filesystem with no real
knowledge about what's underneath. It's crap -- and as we currently have
it, the top level file system doesn't even get to tell the underlying
FTL that a given block can be discarded because it's no longer used. So
during garbage collection the FTL even ends up copying crap around the
medium that's no longer relevant.
This isn't DOS. We don't have to make our storage available through the
restricted interface that INT 13h offers us. We can, and do, do better
than that. And that's why we don't have a decent FTL implementation.
--
dwmw2
Absolutely, and so let's focus on that.
> Except that flash filesystems don't use block devices at all. They use
> MTD interfaces.
Yes, so that would be the first issue. We would need to change the flash
filesystems to use the block interface, and expand the block device
layer to use MTD. Now, maybe Matt is conversant with what would be
involved in doing this, but I will admit to being MTD ignorant. But if
there is a huge impedance mismatch right there, that might be enough to
kill it right there.
> > high-level architecture of what UBI is trying to achieve. What are
> > the interfaces at the top and the bottom of the stack? For example,
> > the fact that UBI exports Logical Erase blocks that are not a
> > power-of-two (possibly 128k minus 128 bytes) means that it certainly
> > might not be a good match for the dm stack. But why is that the case?
> > I can imagine good reasons for it, but a high-level description of the
> > design decisions would be very useful.
>
> Basically because you need to store metadata in each eraseblock (and not
> in OOB). That metadata consumes space reducing the usable storage in
> each eraseblock by an amount and making it no longer a power-of-two.
So this is probably a stupid question, but what drives the design
decision to store the metadata in-band instead of out-of-band (and you
don't have to answer me here; putting it in the overall system
architecture document is just as good, and probably better. :-)
> > It would also help people understand why there are so many "units" in
> > UBI, since hopefully the high-level documentation would explain why
> > they fit together, and perhaps why some of the units weren't folded
> > together. What value do they add as separate components?
>
> Artem is reworking the units per your (and other's) suggestions. The
> debugging code is also being worked on.
As I mentioned to you on IRC, in the future if there are pending
changes in response to reviewer comments, it might be a good idea to
mention that, so that reviewers know not to make those comments again,
or worry that the comments have been ignored.
> > There are hints of the overall system architecture in some of the
> > indivdual comments for data structures, but even reading all of those,
> > there isn't quite enough for people to figure out what it is; and that
> > may be causing some of these comments of people saying there's too
> > much code to evaluate, or why didn't you do it *this* way?
>
> Some of that can probably be added, sure. Though to be fair, it'll add
> even more lines to the patch and those links have been posted 4 times
> already. They're even posted at the start of _this_ thread. Having it
> in the patch under Documentation/ is a good idea, but you can't force
> people to read that before they comment on things.
Well, having spent some time looking at the FAQs and all of the
kernel doc comments embedded in the header files and source files,
there are sections that I would move to an overall system architecture
document, but there is still a lot that was missing that makes it
hard to review the patches. I'm sure a lot of it is my own ignorance,
but that's probably one of the challenges with the UBI layer; far
more people have a basic background in, say, scheduling or VM or
filesystems than have a basic background in flash devices.
Regards,
- Ted
Because
a. Many flashes have no out-of-band area; we want to support them as well.
b. Modern MLC NAND flashes use the _whole_ OOB for ECC, and this is the
modern trend.
I will update the FAQ and add this there later.
> As I mentioned to you in IRC, in the future if there is pending
> changes in response to reviewer comments, it might be a good idea to
> mention that, so that reviewers know not make those comments again, or
> worry that the comments had been ignored.
Ted, I wrote to you twice that your point was understood and this would
be fixed. You should not think your comments are ignored, because they
are not.
> Well, having spent some time looking at the FAQ's and all of the
> comments kernel docs embedded in the header files and source files,
> there are sections that I would move to an overall system architecture
> documentation, but there is still a lot that was missing that makes it
> hard to review the patches. I'm sure a lot of it is my own ignorance,
> but that's probably one of the challenges with the UBI layer; not as
> more people have a basic background in say scheduling or VM or
> filesystem than there are people who have a basic background in flash
> devices.
Docs and FAQ will be improved, this is a question of time.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
While I agree with you, I still think a decent FTL (a) makes sense and
(b) is difficult.
a. Some people may be satisfied with an FTL and enjoy all the block
device-related software, which is a huge benefit, although it costs you
performance. Yes, an FTL moves garbage around, but who cares, as long as
the performance fits the system requirements.
b. It is certainly not easy.
But anyway, I agree with what you say, although you seem to be too
assertive.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Ok, now we have reached the absurd. UBI quite fundamentally cannot do
wear leveling as well as LogFS can, simply because UBI has zero
knowledge of the _contents_ of its blocks. Knowing whether a block is
90% garbage or not makes a great difference.
Also LogFS currently requires erasesizes of 2^n.
Thomas, I can give you my opinion on this flamewar in private - after
you have cooled off.
Jörn
--
When I am working on a problem I never think about beauty. I think
only how to solve the problem. But when I have finished, if the
solution is not beautiful, I know it is wrong.
-- R. Buckminster Fuller
Last time I talked to you about that, you said it would be possible and
fixable. We talked about several mechanisms, which would allow a
filesystem or other users to hint such things to UBI.
Even if the LogFS wear levelling is so superior, it CAN'T do
across-device wear levelling.
tglx
Note the word "currently". And yes, we did talk about hints. Back then
I still believed in UBI. That has changed and I would like to spare
myself another flamewar, so please leave it at that.
> Even if the LogFS wear levelling is so superior, it CAN'T do across
> device wear levelling.
Correct. And I don't see any problem with this. I see two classes of
usecases for flash, with some amount of overlap in between.
1. Small amounts of flash.
Here the flash contains a large ratio of read-only data. Bootloader,
kernel, etc. Having wear levelling across the device will gain you
something. This is what you designed UBI for.
2. Large amounts of flash.
Just to be precise, large can go well into the Terabyte range and
beyond. I don't mean large as in "the biggest embedded device I worked
on last year" - that is still small.
Even if such flashes still contain a bootloader and a kernel, that will
occupy less than 1% of the device. Wear leveling across the device is
fairly pointless here. This is what I designed LogFS for.
There is some middle ground where a combination of UBI and LogFS may
make sense. LogFS can still make sense for devices as small as 64MiB.
But I'm not too concerned about that because flashes will continue to
grow and the advantages of cross-device wear leveling will continue to
diminish.
Jörn
--
"Security vulnerabilities are here to stay."
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001
Exactly. Although it is true that it cannot be _as good_ as the FS, one
can optimize this by asking the FS beforehand and make it quite OK. And
eraseblock movement is not such a frequent event - we do it only once the
erase counter difference is more than 4Ki (although this is tunable).
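That trigger can be sketched as follows; the function name and the choice of source/target blocks are simplified illustrations, not the actual UBI algorithm:

```python
WL_THRESHOLD = 4096  # the "4Ki" from the mail; tunable in the real code

def wear_levelling_candidates(erase_counts):
    """Return (source, target) eraseblock indices when the erase-counter
    spread exceeds the threshold, else None. Simplified sketch: move the
    contents of the least-worn block into the most-worn one, so the
    little-erased block becomes available for new writes."""
    lo = min(range(len(erase_counts)), key=erase_counts.__getitem__)
    hi = max(range(len(erase_counts)), key=erase_counts.__getitem__)
    if erase_counts[hi] - erase_counts[lo] > WL_THRESHOLD:
        return lo, hi
    return None

print(wear_levelling_candidates([12, 4500, 9000]))  # (0, 2)
print(wear_levelling_candidates([12, 40, 100]))     # None
```

The point of the threshold is exactly what the mail says: moving an eraseblock is not free, so it happens only when the counters have drifted far apart, not on every write.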
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Still you need to have a solution for handling bitflips in those
bootloader and kernel areas.
I don't dispute that, on a Terabyte solid state disk which is used in a
totally different way, UBI is not necessarily the right tool.
> There is some middle ground where a combination of UBI and LogFS may
> make sense. LogFS can still make sense for devices as small as 64MiB.
> But I'm not too concerned about that because flashes will continue to
> grow and the advantages of cross-device wear leveling will continue to
> diminish.
Flashes will grow, but this will not change the embedded use case with a
relatively small flash and the bootloader / kernel / rootfs / datafs
scenario, where UBI is the right tool to use.
There is no hammer for all nails and I don't see device mapper doing
what UBI does right now.
tglx
Correct. It may make sense to use UBI for that, I don't know. What I
do know is that UBI cannot make wear leveling decisions as well as
LogFS.
And that is all I care about wrt. this discussion.
Jörn
--
Joern's library part 8:
http://citeseer.ist.psu.edu/plank97tutorial.html
So let's discuss this in a different thread when you post a request to
include LogFS into mainline.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
I really did not want to become involved in this. So please be nice and
leave the flamethrower in your weapon closet or I will disappear again
before you can say "fire".
On Tue, 20 March 2007 21:32:40 +0000, David Woodhouse wrote:
> On Tue, 2007-03-20 at 10:58 -0800, David Lang wrote:
> > What Matt and Ted are looking at is the question 'are flash devices close enough
> > to other block devices that it would make sense to change the existing linux
> > definition of a block device to handle the special requirements of flash'
>
> I've seen no real proposals about how this could be done, so it's a
> purely academic question.
What you have seen and shot down were patches to make mtd more generic.
So let me just assume both mtd and jffs2 were generic, even though they
currently aren't.
In very broad terms, an mtd is a device with:
1. a read operation
2. a write operation
3. an erase operation
4. a minimal write blocksize
5. a minimal erase blocksize
6. a method to query bad eraseblocks
7. a method to mark bad eraseblocks
Anything else? There are many more fields, but I believe these are the
essentials. point() and unpoint() were omitted, because they are just
one option to provide XIP; filemap_xip.c is another, used for block
devices.
In very broad terms, a block device has:
1. a read operation
2. a write operation
3. some devices have an ioctl() for erase, but that is uncommon
4. a blocksize
What is missing? Obviously the erase operation needs to become a
first-class citizen and block devices need two fields for the two
meaningful blocksizes. And they need methods to query and set bad
blocks.
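The two summaries above can be put side by side in a short sketch; the class names, operation names, and sizes are illustrative Python, not the actual kernel structures:

```python
# Sketch of the contrast: what an mtd exposes versus what the block layer
# assumes. Names and sizes are illustrative, not kernel definitions.
from dataclasses import dataclass, field
from typing import Set

@dataclass
class BlockDeviceOps:
    # the block layer's model: one blocksize; erase is at best an
    # uncommon ioctl, not a first-class operation
    blocksize: int = 512
    ops: Set[str] = field(default_factory=lambda: {"read", "write"})

@dataclass
class MTDOps:
    # what raw flash additionally needs, per the list above
    min_write_size: int = 2048        # e.g. a NAND page
    min_erase_size: int = 128 * 1024  # e.g. a NAND eraseblock
    ops: Set[str] = field(default_factory=lambda: {
        "read", "write", "erase", "is_bad", "mark_bad"})

# the delta the block layer would have to learn:
print(sorted(MTDOps().ops - BlockDeviceOps().ops))  # ['erase', 'is_bad', 'mark_bad']
```

The delta is small when written down like this, which is Jörn's point; the open question is whether teaching the block layer those three operations and the second blocksize is worth the churn.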
So far it looks simple enough. Obviously there are many messy details
left out, so it will be a lot of work in practice. So the question is:
is it worth it?
What are the gains from combining mtd and block devices?
[ And at this point I would like to state again that I don't want to
become involved in the UBI discussion. The question whether two
separate subsystems make sense is quite independent, and I don't want
both discussions to get mixed up. ]
Jörn
--
He who knows others is wise.
He who knows himself is enlightened.
-- Lao Tsu
Artem, no need to be defensive. You did tell me that you were going
to address them; but then you resubmitted patches where they weren't
addressed. Normally, patch authors take all of the comments, clear them
all, and then in the next repost of the patch either explain why it
wasn't feasible to handle some of the comments, *OR* why some of the
comments were so hard to handle that they wouldn't be handled until a
future version of the patch. Furthermore, in a patch of the size that
you are submitting, a listing of what you *did* fix would also be
good.
And at this point, I don't doubt that you are at some point going to
heed my comments --- but note that doing so will involve a massive
refactoring of the code, which will tend to invalidate the reviews
done of this current (take 3) version of the patches; so I am a bit
curious what your motivation was in reposting this round of the
patches.
Believe it or not, the people who are responding on this thread are
trying to help. Otherwise they would just be ignoring you and UBI.
Keeping this thought in mind and trying to help them, where in some
cases perhaps they are lacking the knowledge and experience of those
who have been working on UBI and have spent many months thinking about
the problem, may help keep things more constructive.
Regards,
- Ted
Yes. However, nobody has actually reviewed any of the _code_ in this
round. So while it may have been somewhat superfluous from a submission
standpoint, at least no code review has been wasted and we are getting
some fairly decent design discussions.
josh
Well, in take 3 I _already_ removed quite _a lot_ of itsy-bitsy units,
and I thought it was enough, so I submitted it. But later I realized
that I should go further (e.g., dispose of per-unit data structures),
and started more re-work. Yes, I should have given notice about this
new re-work; apologies.
Point taken, will be fixed :-)
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
On Wed, 2007-03-21 at 09:50 -0400, Theodore Tso wrote:
> Keeping this thought in mind and trying to help them, where in some
> cases perhaps they are lacking in the same knowledge and experience
> and those who have been working on UBI and have spent many months
> thinking about the problem, may help keep things more constructive.
since I am one of those who are working on UBI together with Artem, I
would like to try to describe how we are using UBI and how it solves
some problems we could not solve to our satisfaction with existing
solutions.
Our system is using only NAND flashes, and we use a NAND flash controller
to make the 1st stage boot-code (we call it IPL - initial program loader)
appear contiguous in memory. The size of the IPL is limited to 4 KiB.
In these 4 KiB of code we initialize the processor and scan the NAND
flash for copies of our 2nd stage boot-loader (SPL), which is stored in
one static UBI volume, or even redundantly in multiple ones. If the SPL
has one-bit flips, we can correct them in the IPL before loading it into
RAM.
We want to correct bit-flips as soon as possible to avoid getting
uncorrectable errors. UBI is doing that transparently for us by copying
the block with the bitflip to a free block. The logic in the IPL is able
to cope with the situation where UBI is interrupted when doing that.
Imagine you have boot-code at a fixed location (skipping bad blocks, of
course): if you erase a block of the SPL to write it back to the same
block (to remove bit-errors), you will not be robust against power-loss
anymore! You could put a 2nd temporary SPL there before you do this,
but that is not nice either.
The static UBI volumes are so simply structured that it is possible to
load them into RAM using only a few KiB of code. We do not even look at
UBI's volume information table to be able to do this.
UBI solves here:
1. the possibility to boot a controller with limited resources using NAND
2. transparent bitflip correction on read-only data, e.g. boot-code,
kernel, initrd. Note that the mechanisms here are robust against
power-loss; that is also very important to us
We wanted to use JFFS2 and found that the traditional update mechanisms
did not ensure that an interrupted update attempt can be detected as
such. The UBI volume update ensures that the volume is only usable
after it was updated completely.
3. Update mechanism which ensures that incomplete data cannot be used
We found that putting certain flash content at fixed locations with
fixed sizes is especially cumbersome if raw NAND is used, e.g. if you
consider that bad blocks have to be skipped. Resizing partitions is a
pain.
UBI helps us to get rid of those limitations. We can resize the UBI
volumes and because UBI takes care to find the volume data even our
second stage boot-code can be located anywhere on the chip.
4. Volume resizing is easy
Because we want to ensure maximum lifetime for our systems, we want
bitflips to be corrected immediately when they are found. Feature 2 of
UBI does this for us.
I think that the largest portion of what we put in our NAND flashes is
code and data that is basically read-only. Nevertheless, data is written
during operation and, as already pointed out, maximum lifetime is
important for us, and wear-leveling across the whole flash chip helps
especially with our usage pattern. UBI's ability to copy blocks
transparently, e.g. a read-only block with a small erase count to a free
block with a relatively high erase count, helps to get this done.
5. wear-leveling across the whole flash chip
We found that being able to use the same code update mechanisms for
NAND/NOR/? based systems is a nice side effect too. That was one reason,
beside others (see previous mails), to put the UBI metadata into the
data section of the flashes and therefore sacrifice some space for data,
and of course the usable size of a block is not 2^n anymore. I think it
was a good decision, because if we had put it in the NAND OOB area, the
discussion here might be limited to NAND users only.
Regards,
Frank Haverkamp
several of these things sound like they would be useful to other block devices
as well
> UBI solves here:
> 1. possibility to boot a controller with limited resources using NAND
> 2. transparent bitflip correction on read only data e.g. boot-code,
> kernel, initrd. Note that the mechanisms here are robust against
> power-loss, that is also very important to us
>
> We wanted to use JFFS2 and found that the traditional update mechanisms
> did not ensure that an interrupted update attempt can be detected as
> such. The UBI volume update ensures, that the volume is only usable
> after it was updated completely.
a dm layer that detects and remaps soft errors before they become hard errors is
useful for hard drives as well.
> 3. Update mechanism which ensures that incomplete data cannot be used
>
> We found that putting certain flash content at fixed locations with
> fixed size is especially cumbersome if raw NAND is used e.g. if you
> consider that bad blocks have to be skipped. Resizing partitions is a
> pain.
>
> UBI helps us to get rid of those limitations. We can resize the UBI
> volumes and because UBI takes care to find the volume data even our
> second stage boot-code can be located anywhere on the chip.
>
> 4. Volume resizing is easy
>
> Because we want to ensure that we gain maximum lifetime for our systems,
> we want that bitflips are corrected immediately when they are found.
> Feature 2. of UBI does this for us.
>
> I think that the largest portion of what we put in our NAND flashes is
> code and data and basically read-only. Nevertheless data is written
> during operation and, as already pointed out, maximum lifetime is
> important for us, and wear-leveling across the whole flash chip helps
> especially with our usage-pattern. UBI's ability to copy blocks
> transparently e.g. a read-only block with small erase count to a free
> block with relatively high erase count, helps to get this done.
both of these also sound useful as dm layers (in fact lvm already does some of
the resizing things)
> 5. wear-leveling across the whole flash chip
>
> We found that being able to use the same code update mechanisms for
> NAND/NOR/? based systems is a nice side effect too. That was one reason
> beside others (see previous mails) to put the UBI metadata into the data
> section of the flashes and therefore sacrifice some space for data and
> of course that the usable size of a block is not 2^n anymore. I think it
> was a good decision, because if we had put it in the NAND OOB area,
> the discussion here might be limited to NAND users only.
Wear leveling would also be useful on other block devices (think CD-RAM, for
example).
Cross-device wear leveling sounds a lot like putting the wear-leveling layer
above an lvm-like layer that stitches the separate flash chips together into
one logical device.
Additionally, if wear leveling is an optional layer in dm then it can be left
out when it's not appropriate (like when the FS has features that make it
unnecessary, or when it's read-only so you don't have writes to worry about, or
even if it's read-only 99.9999% of the time, so that writes are so rare that
all the writes in the expected lifetime of the device won't cause problems).
David Lang
This patch-set contains UBI, which stands for Unsorted Block Images. This
is closely related to the memory technology devices Linux subsystem (MTD),
so this new piece of software is from drivers/mtd/ubi.
In short, UBI provides wear-levelling support across the whole flash chip.
UBI completely hides 2 aspects of flash chips which make them very difficult to
work with:
1. wear of eraseblocks;
2. bad eraseblocks.
UBI also makes it possible to dynamically create, delete and re-size flash
partitions (UBI volumes), so some analogy to LVM may be drawn here.
There is some documentation available at:
http://www.linux-mtd.infradead.org/doc/ubi.html
http://www.linux-mtd.infradead.org/faq/ubi.html
The sources are available via the GIT tree:
git://git.infradead.org/ubi-2.6.git (stable)
git://git.infradead.org/~dedekind/dedekind-ubi-2.6.git (devel)
One can also browse the GIT trees at http://git.infradead.org/
This is the 4th iteration of the post, which fixes most of the issues
pointed out previously:
- Removed "itsy-bitsy" units
- Removed a lot of debugging stuff
- Fixed kernel-doc
- Fixed inline damage in eba.c
MAINTAINERS | 8
drivers/mtd/Kconfig | 2
drivers/mtd/Makefile | 2
drivers/mtd/ubi/Kconfig | 58 +
drivers/mtd/ubi/Kconfig.debug | 104 ++
drivers/mtd/ubi/Makefile | 7
drivers/mtd/ubi/build.c | 843 ++++++++++++++++++++
drivers/mtd/ubi/cdev.c | 724 +++++++++++++++++
drivers/mtd/ubi/debug.c | 224 +++++
drivers/mtd/ubi/debug.h | 161 +++
drivers/mtd/ubi/eba.c | 1132 ++++++++++++++++++++++++++++
drivers/mtd/ubi/gluebi.c | 337 ++++++++
drivers/mtd/ubi/io.c | 1263 +++++++++++++++++++++++++++++++
drivers/mtd/ubi/kapi.c | 545 +++++++++++++
drivers/mtd/ubi/misc.c | 105 ++
drivers/mtd/ubi/scan.c | 1374 +++++++++++++++++++++++++++++++++
drivers/mtd/ubi/scan.h | 167 ++++
drivers/mtd/ubi/ubi.h | 560 +++++++++++++
drivers/mtd/ubi/upd.c | 348 ++++++++
drivers/mtd/ubi/vmt.c | 811 ++++++++++++++++++++
drivers/mtd/ubi/vtbl.c | 809 ++++++++++++++++++++
drivers/mtd/ubi/wl.c | 1698 ++++++++++++++++++++++++++++++++++++++++++
fs/jffs2/fs.c | 12
fs/jffs2/os-linux.h | 6
fs/jffs2/wbuf.c | 24
include/linux/mtd/ubi.h | 191 ++++
include/mtd/Kbuild | 2
include/mtd/mtd-abi.h | 1
include/mtd/ubi-header.h | 360 ++++++++
include/mtd/ubi-user.h | 161 +++
30 files changed, 12039 insertions(+)
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Actually, no. LogFS is not broken, there is nothing to fix.
And there is no fundamental reason why UBI should export blocks with
non-power-of-two sizes. UBI currently consists of two parts that are
intimately intertwined in the current implementation, but have
relatively little connection otherwise.
1. Logical volume management.
2. Static volumes.
Logical volume management can just as easily move its management
information into a table, instead of having it spread across all blocks.
Blocks can keep their original size. Since you have to scan flash
anyway, you can also scan for a table, compare a magical number and do
some extra check to protect yourself against a UBI image inside some
logical volume. No big deal.
Static volumes can keep a header inside their volumes. The tiny
first-stage bootloader is currently scanning flash and can continue to
do so. But at least this header no longer causes trouble for LogFS or
any other UBI user.
UBI is just as broken as LogFS is. It works with every user in mainline
(which comes down to JFFS2). LogFS works with every MTD device in
mainline. The only combination that doesn't work is LogFS on UBI - due
to deliberate design decisions on both sides.
Jörn
--
Joern's library part 8:
http://citeseer.ist.psu.edu/plank97tutorial.html
> On Wed, 21 March 2007 12:25:34 +0100, Thomas Gleixner wrote:
>> On Wed, 2007-03-21 at 12:05 +0100, Jörn Engel wrote:
>>>
>>> Also LogFS currently requires erasesizes of 2^n.
>>
>> Last time I talked to you about that, you said it would be possible and
>> fixable.
>
> Actually, no. LogFS is not broken, there is nothing to fix.
>
> And there is no fundamental reason why UBI should export blocks with
> non-power-of-two sizes. UBI currently consists of two parts that are
> intimately intertwined in the current implementation, but have
> relatively little connection otherwise.
>
> 1. Logical volume management.
> 2. Static volumes.
>
> Logical volume management can just as easily move its management
> information into a table, instead of having it spread across all blocks.
> Blocks can keep their original size. Since you have to scan flash
> anyway, you can also scan for a table, compare a magical number and do
> some extra check to protect yourself against a UBI image inside some
> logical volume. No big deal.
If you are being paranoid about write cycles, putting the write count in the
block you are writing avoids doing an erase/write elsewhere.
Although, since you can flip bits to 1 without requiring an erase, you could
sacrifice some space and say that your table has a normal counter for the
number of times the block has been erased, plus a 'tally counter' where you
turn one bit on each time you erase the block, and when you fill up the tally
counter you re-write the entire table, clearing all the tallies. If you have
relatively large eraseblocks it seems like you could afford to sacrifice the
space in your master table to avoid erases of it.
IIRC someone said that the count per block was 128 bits? If so you could have a
master table with a 32-bit integer (about 4 billion erases) + 96 bits of tally,
so that you would only have to re-write the table when a block on the flash has
been erased 96 times. With this approach the wear on your table is unlikely to
be a factor until the point where you are losing a lot of other eraseblocks due
to wear (at which point you could shift to a different block for your table, or
retire the flash).
David Lang
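[Editorial note: the tally-counter scheme described above can be sketched in a
few lines of C. This is a hypothetical illustration of the idea only, not UBI
code; the `struct erase_entry` layout, the 96-bit tally width and all names are
invented for the example. Following Jörn's correction below, the tally starts
in the erased all-1s state and each erase programs one more bit to 0, which
flash permits without erasing the table block; the full count is the 32-bit
base plus the number of programmed tally bits, and only a full tally forces the
expensive rewrite of the table.]

```c
#include <stdint.h>

#define TALLY_WORDS 3              /* 3 x 32 bits = 96 tally bits */

/* One master-table entry: hypothetical layout, not UBI's on-flash format. */
struct erase_entry {
    uint32_t base;                 /* rewritten only when the tally is full */
    uint32_t tally[TALLY_WORDS];   /* starts all-1s (erased flash state)   */
};

void entry_init(struct erase_entry *e)
{
    e->base = 0;
    for (int i = 0; i < TALLY_WORDS; i++)
        e->tally[i] = 0xFFFFFFFFu;
}

/* Count tally bits already programmed to 0. */
static uint32_t tally_count(const struct erase_entry *e)
{
    uint32_t n = 0;
    for (int i = 0; i < TALLY_WORDS; i++) {
        uint32_t w = ~e->tally[i];
        while (w) {                /* Kernighan popcount */
            w &= w - 1;
            n++;
        }
    }
    return n;
}

uint32_t erase_count(const struct erase_entry *e)
{
    return e->base + tally_count(e);
}

/*
 * Record one erase of the tracked eraseblock. Programming a 1 bit to 0
 * needs no erase of the table block; only a full tally (96 recorded
 * erases) forces the whole table to be rewritten.
 */
void record_erase(struct erase_entry *e)
{
    for (int i = 0; i < TALLY_WORDS; i++) {
        if (e->tally[i]) {
            e->tally[i] &= e->tally[i] - 1;   /* clear lowest set bit */
            return;
        }
    }
    /* tally full: fold it into the base and reset (the table rewrite) */
    e->base += TALLY_WORDS * 32;
    for (int i = 0; i < TALLY_WORDS; i++)
        e->tally[i] = 0xFFFFFFFFu;
    e->tally[0] &= e->tally[0] - 1;           /* record this erase */
}
```

Reading the count is cheap, and the table block is rewritten only once per 96
erases of a given eraseblock, matching the estimate above.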
[ This was not a request for UBI to be changed. The only purpose was to
illustrate that LogFS is not broken. The previous thread suggested
otherwise and I just couldn't leave it at that. ]
> If you are being paranoid about write cycles, putting the write count in the
> block you are writing avoids doing an erase/write elsewhere.
>
> Although, since you can flip bits to 1 without requiring an erase, you
[ vice versa. you can flip bits to 0 without erasing. ]
> could sacrifice some space and say that your table has a normal counter for
> the number of times the block has been erased, plus a 'tally counter' where
> you turn one bit on each time you erase the block, and when you fill up the
> tally counter you re-write the entire table, clearing all the tallies. If you
> have relatively large eraseblocks it seems like you could afford to
> sacrifice the space in your master table to avoid erases of it
Or you could have a table and any number of updates to it. Erase one
block, append a small update marker to the table. There are plenty of
options. All have in common that code would be more complicated.
Another advantage is that erase counts don't get reset if the race
against a power failure during erase is lost.
Whether the advantages of power-of-two blocksizes and safe erase counts
are worth it, I leave for others to decide.
Jörn
--
Fools ignore complexity. Pragmatists suffer it.
Some can avoid it. Geniuses remove it.
-- Perlis's Programming Proverb #58, SIGPLAN Notices, Sept. 1982
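[Editorial note: the alternative mentioned above, "a table and any number of
updates to it", can be sketched similarly. Again a hypothetical illustration
with invented names and sizes, not UBI or LogFS code: the table block carries
per-eraseblock base counts plus an append-only log of update markers, and
appending a marker only writes into still-erased space, so the table block
itself is erased only when the log fills up and is compacted.]

```c
#include <stdint.h>
#include <string.h>

#define NBLOCKS   64               /* eraseblocks tracked (example size)  */
#define LOG_SLOTS 128              /* update markers per table generation */
#define SLOT_FREE 0xFFFF           /* erased flash reads back as all-1s   */

/* Hypothetical table-block layout: base counts plus an append-only log. */
struct erase_table {
    uint32_t base[NBLOCKS];        /* counts as of the last compaction */
    uint16_t log[LOG_SLOTS];       /* one block-index marker per erase */
};

void table_init(struct erase_table *t)
{
    memset(t->base, 0, sizeof(t->base));
    memset(t->log, 0xFF, sizeof(t->log));   /* all slots free */
}

/* Fold the log into the base counts and clear it (the expensive erase). */
static void table_compact(struct erase_table *t)
{
    for (int i = 0; i < LOG_SLOTS && t->log[i] != SLOT_FREE; i++)
        t->base[t->log[i]]++;
    memset(t->log, 0xFF, sizeof(t->log));
}

/* Record one erase of `block` by appending a marker; no table erase
 * is needed unless the log is full. */
void record_erase(struct erase_table *t, uint16_t block)
{
    for (int i = 0; i < LOG_SLOTS; i++) {
        if (t->log[i] == SLOT_FREE) {
            t->log[i] = block;      /* append into erased space */
            return;
        }
    }
    table_compact(t);               /* log full: compact, then append */
    t->log[0] = block;
}

uint32_t erase_count(const struct erase_table *t, uint16_t block)
{
    uint32_t n = t->base[block];
    for (int i = 0; i < LOG_SLOTS && t->log[i] != SLOT_FREE; i++)
        if (t->log[i] == block)
            n++;
    return n;
}
```

Unlike the tally scheme, one log can absorb erases of any mix of blocks before
a compaction is needed, at the cost of scanning the log on every count lookup.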
And on NAND flash you can't just do it in multiple cycles one bit at a
time. The 'tally' trick isn't viable there.
--
dwmw2
You can on NAND. ECC is done in software. And for a data structure as
simple as the 'tally', foregoing ECC is not a huge problem - most
bitflips are easily detected, and the remaining ones only cause an
off-by-a-few error on the erase count.
On NOR with transparent (hardware) ECC you can't.
Jörn
--
Homo Sapiens is a goal, not a description.
-- unknown
You're only allowed a limited number of write cycles to each page
though. So you can't just clear the bits in a 2112-byte page one at a
time; typically when you clear the fifth bit, the contents of the whole
page become undefined until the next erase cycle.
--
dwmw2
That limitation stems from ECC and ECC is done in software. Currently
everyone and his dog is doing ECC in chunks of 256 bytes on NAND. So
your minimum write size is 256 bytes _if you care about ECC_. If you
don't care, you can write single bits on NAND, just as you can on NOR.
Controlling ECC in software means we are quite flexible. Given
sufficient incentive, we can change the rules quite significantly.
Jörn
--
You can't tell where a program is going to spend its time. Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you've proven that's where the bottleneck is.
-- Rob Pike
No, on NAND flash it's a limitation of the hardware. The number of write
cycles you can perform to a given page is limited. Exceed it and the
contents of that page become undefined due to leakage, until you next
erase it.
--
dwmw2
Are you sure? Do you have any specs or similar that state this?
So far I have only encountered this limitation by word of mouth. And
such a myth coming from ECC effects is nothing that would surprise me.
Jörn
--
The cheapest, fastest and most reliable components of a computer
system are those that aren't there.
-- Gordon Bell, DEC laboratories
Right and you cannot write to random locations in a page. The write
chunks have to be in consecutive order. If you write 0xAA to offset 0,
you cannot rewrite it to 0x00 later without risking corruption.
tglx
See pp 6 and 31 of http://david.woodhou.se/TC58DVAM72AF1FT_030124.pdf
for example.
--
dwmw2
False. There is.
> UBI currently consists of two parts that are
> intimately intertwined in the current implementation, but have
> relatively little connection otherwise.
False. They do have connection.
> 1. Logical volume management.
> 2. Static volumes.
>
> Logical volume management can just as easily move its management
> information into a table, instead of having it spread across all blocks.
> Blocks can keep their original size. Since you have to scan flash
> anyway, you can also scan for a table, compare a magical number and do
> some extra check to protect yourself against a UBI image inside some
> logical volume. No big deal.
First off, I have seen these "no big deal" statements for years already, and no
decent implementation proven by real-world usage. Could we please
move these academic discussions to another thread?
Second, it is much more robust to keep the erase counter and mapping
information on a per-eraseblock basis than to keep any on-flash table -
you may always scan the whole media and gracefully recover from errors and
corruptions. And you do not lose much in case of corruption.
Third, it is much simpler than keeping any on-flash table, and it is thus
robust. We do not need a journal to update any table.
Fourth, if needed, an on-flash table may be _added_ to increase scalability,
so "since you have to scan flash anyway" may become false when there is
a real need for better scalability. For now scanning is OK. And still, the
scanning method will be a good fall-back way to recover from errors.
> UBI is just as broken as LogFS is. It works with every user in mainline
> (which comes down to JFFS2). LogFS works with every MTD device in
> mainline. The only combination that doesn't work is LogFS on UBI - due
> to deliberate design decisions on both sides.
You are welcome to discuss other things irrelevant to this thread.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
You could wait a day, then reread what I wrote. Maybe you will notice
that what I wrote is not identical to what we have discussed about a
year ago and you seem to have read.
You may also want to reread this:
||[ This was not a request for UBI to be changed. The only purpose was to
||illustrate that LogFS is not broken. The previous thread suggested
||otherwise and I just couldn't leave it at that. ]
Jörn
--
tglx1 thinks that joern should get a (TM) for "Thinking Is Hard"
-- Thomas Gleixner