introduce dm-snap-mv

Cong Meng

Oct 6, 2010, 4:33:10 AM

to linux-...@vger.kernel.org, linux-...@vger.kernel.org, dm-d...@redhat.com, Andrew Morton, Alexander Viro, Christoph Hellwig, Nick Piggin

Hello everyone,

I am very glad to introduce my work, dm-snap-mv, here.

what is dm-snap-mv
-------------------
dm-snap-mv is a target module for device-mapper that can take
multiple snapshots of an existing block device (the origin device).

All snapshots are stored on a separate block device (the COW device).

Copy-on-write is used, so only the data that differs from the origin is
saved on the COW device.
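
As a rough illustration of the copy-on-write idea described above, here is a minimal in-memory sketch in C. It is not dm-snap-mv code and says nothing about its on-disk format; the chunk size, the device size and every name in it (toy_snapshot, origin_write, snapshot_read, snapshot_used_chunks) are assumptions made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_CHUNKS  8   /* hypothetical tiny device: 8 chunks */
#define CHUNK_SIZE 16  /* hypothetical chunk size in bytes */

/* One snapshot: which chunks were preserved, and the preserved data. */
struct toy_snapshot {
	bool copied[NR_CHUNKS];           /* true once the old chunk was saved */
	char cow[NR_CHUNKS][CHUNK_SIZE];  /* stands in for the COW device */
};

/* Write to the origin; save the old chunk for the snapshot first. */
static void origin_write(char origin[][CHUNK_SIZE], struct toy_snapshot *snap,
			 int chunk, const char *data)
{
	assert(chunk >= 0 && chunk < NR_CHUNKS);
	if (!snap->copied[chunk]) {
		memcpy(snap->cow[chunk], origin[chunk], CHUNK_SIZE);
		snap->copied[chunk] = true;
	}
	memcpy(origin[chunk], data, CHUNK_SIZE);
}

/* Read the snapshot's view: the saved copy if there is one, else the origin. */
static const char *snapshot_read(char origin[][CHUNK_SIZE],
				 const struct toy_snapshot *snap, int chunk)
{
	assert(chunk >= 0 && chunk < NR_CHUNKS);
	return snap->copied[chunk] ? snap->cow[chunk] : origin[chunk];
}

/* Number of chunks the snapshot occupies in the COW store. */
static int snapshot_used_chunks(const struct toy_snapshot *snap)
{
	int i, used = 0;

	for (i = 0; i < NR_CHUNKS; i++)
		used += snap->copied[i];
	return used;
}
```

After a single origin_write() to one chunk, the snapshot still reads the old contents of that chunk, while snapshot_used_chunks() reports 1: only the changed chunk takes up space in the COW store.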

features
--------
1. snapshot of origin
2. snapshot of snapshot
3. instant snapshot creation and deletion
4. origin and snapshots are concurrently readable/writable
5. rollback snapshot to origin
6. multiple origin devices share a COW device
7. multiple COW devices are supported

diagram
-------

         +---------------------+    +---------------+     +--------------+
 read <--|                     |-----               |     |              |
         |    origin dm_dev    |    |  origin_dev   | COW |              |
write ---|                     |---->               |----->  snapshot-1  |
         +---------------------+    +--|------------+     |     ...      |
                                       |                  |     ...      |
         +---------------------+    +--|------------------+     ...      |
 read <--|                     |-------+-                    snapshot-N  |
         |  snapshot-X dm_dev  |    |              cow_dev               |
write ---|                     |---->                                    |
         +---------------------+    +------------------------------------+

dm_dev: device-mapper device, created by "dmsetup create ..."
COW: copy-on-write

download the source
-----------------------
http://github.com/mcpacino/dm-snap-mv

git clone git://github.com/mcpacino/dm-snap-mv.git

a kernel patch
--------------
For now, dm-snap-mv depends on the kernel patch below, which lets __getblk()
return a 4K buffer head even when the block size of the disk is NOT 4K.

Signed-off-by: Cong Meng <mcpa...@gmail.com>
---
fs/buffer.c | 7 ++-----
1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 3e7dca2..f7f9d33 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1051,10 +1051,7 @@ grow_buffers(struct block_device *bdev, sector_t block, int size)
 	pgoff_t index;
 	int sizebits;

-	sizebits = -1;
-	do {
-		sizebits++;
-	} while ((size << sizebits) < PAGE_SIZE);
+	sizebits = PAGE_CACHE_SHIFT - bdev->bd_inode->i_blkbits;

 	index = block >> sizebits;

@@ -2924,7 +2921,7 @@ int submit_bh(int rw, struct buffer_head * bh)
 	 */
 	bio = bio_alloc(GFP_NOIO, 1);

-	bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
+	bio->bi_sector = bh->b_blocknr << (bh->b_bdev->bd_inode->i_blkbits - 9);
 	bio->bi_bdev = bh->b_bdev;
 	bio->bi_io_vec[0].bv_page = bh->b_page;
 	bio->bi_io_vec[0].bv_len = bh->b_size;
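
To make the effect of the two hunks easier to see, the following userspace sketch repeats the arithmetic before and after the patch. It assumes 4 KiB pages. The function names and the TOY_ constants are invented for the example; only the expressions themselves are taken from the diff.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12                     /* assumes 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Pre-patch grow_buffers(): shift derived from the requested buffer size. */
static int sizebits_from_size(unsigned long size)
{
	int sizebits = -1;

	assert(size > 0 && size <= TOY_PAGE_SIZE);
	do {
		sizebits++;
	} while ((size << sizebits) < TOY_PAGE_SIZE);
	return sizebits;
}

/* Post-patch grow_buffers(): shift derived from the device block size (log2). */
static int sizebits_from_blkbits(unsigned int blkbits)
{
	assert(blkbits >= 9 && blkbits <= TOY_PAGE_SHIFT);
	return TOY_PAGE_SHIFT - (int)blkbits;
}

/* Pre-patch submit_bh(): block number scaled by the buffer size. */
static unsigned long long sector_from_size(unsigned long long blocknr,
					   unsigned long size)
{
	return blocknr * (size >> 9);
}

/* Post-patch submit_bh(): block number scaled by the device block size. */
static unsigned long long sector_from_blkbits(unsigned long long blocknr,
					      unsigned int blkbits)
{
	assert(blkbits >= 9);
	return blocknr << (blkbits - 9);
}
```

For a 4096-byte buffer on a device with 512-byte blocks (blkbits = 9), the pre-patch loop yields a shift of 0 and the post-patch expression yields 3. Block 10 maps to sector 80 before the patch and to sector 10 after it. As far as this arithmetic goes, the patched code counts block numbers in device blocks rather than in units of the buffer size.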

Daniel Phillips

Oct 7, 2010, 6:05:10 PM

to Cong Meng, linux-...@vger.kernel.org, linux-...@vger.kernel.org, dm-d...@redhat.com, Andrew Morton, Alexander Viro, Christoph Hellwig, Nick Piggin

Hi Meng,

The patch looks sensible, however the question is: why do you want to
do this? Would it not be better to generalize your metadata format to
accommodate the device's native blocksize?

Regards,

Daniel

Christoph Hellwig

Oct 8, 2010, 4:23:46 AM

to Daniel Phillips, Cong Meng, linux-...@vger.kernel.org, linux-...@vger.kernel.org, dm-d...@redhat.com, Andrew Morton, Alexander Viro, Christoph Hellwig, Nick Piggin

On Thu, Oct 07, 2010 at 02:31:14PM -0700, Daniel Phillips wrote:
> Hi Meng,
>
> The patch looks sensible, however the question is: why do you want to
> do this? Would it not be better to generalize your metadata format to
> accommodate the device's native blocksize?

Even if it uses fixed 4k sectors, it should just read them in smaller
chunks or, even better, stop using buffer heads and just read them
manually using submit_bio. BHs really shouldn't be used outside of
filesystems, and even there they are slowly on their way out.

McPacino

Oct 8, 2010, 5:01:14 AM

to Daniel Phillips, Christoph Hellwig, linux-...@vger.kernel.org, linux-...@vger.kernel.org, dm-d...@redhat.com, Andrew Morton, Alexander Viro, Nick Piggin

Hi Daniel,

It is very cumbersome to deal with small block sizes (512, 1024 or 2048 bytes).

I use a fixed-size (4096-byte) block to store the exception list.

If the block size is 512 bytes, I have to invoke __getblk() 8 times to read
the exception list. Even more cumbersome, my exception list struct might
be scattered across 8 non-contiguous memory segments.

In my code, each exception is represented by a 6-byte struct. A 4K
block can hold an exception list of at most about 670 exceptions.
If I used a 512-byte block for an exception list, the number would be
only about 80. That is too small.
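
The numbers above can be checked with a few lines of C. This is only a sketch: the 6-byte exception size comes from the message, while the per-block header size is not stated in the thread, so it is left as a parameter. The function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define EXCEPTION_SIZE 6  /* bytes per exception, as stated in the message */

/* Upper bound on exceptions that fit in one block after a header. */
static size_t max_exceptions(size_t block_size, size_t header_size)
{
	assert(header_size <= block_size);
	return (block_size - header_size) / EXCEPTION_SIZE;
}

/* Number of block reads needed to fetch a list of list_size bytes. */
static size_t getblk_calls(size_t list_size, size_t block_size)
{
	assert(block_size > 0);
	return (list_size + block_size - 1) / block_size;
}
```

With no header at all, a 4096-byte block holds at most 682 six-byte entries and a 512-byte block at most 85. The figures quoted in the message (about 670 and about 80) are somewhat lower, presumably because part of each block is used for something else. A 4096-byte list read through 512-byte blocks takes 8 reads, matching the 8 __getblk() calls mentioned earlier.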

So it would really be a big help to me if __getblk() could read a 4K
buffer head in any case.

PS: Is there no other kernel component with a demand like mine?
I am studying the ext FS code now.

Regards.

Cong Meng.

McPacino

Oct 8, 2010, 5:14:38 AM

to Christoph Hellwig, Daniel Phillips, linux-...@vger.kernel.org, linux-...@vger.kernel.org, dm-d...@redhat.com, Andrew Morton, Alexander Viro, Nick Piggin

Hi Christoph,

If I use bios directly, I have to take care of caching myself.
BHs can be released by the kernel when necessary.

Is there any existing code that uses bios to read/write metadata
blocks? How does it handle the timing of freeing the bios? I would
really like to learn something from it.

Regards.

Cong Meng.

Christoph Hellwig

Oct 8, 2010, 5:25:34 AM

to McPacino, Christoph Hellwig, Daniel Phillips, linux-...@vger.kernel.org, linux-...@vger.kernel.org, dm-d...@redhat.com, Andrew Morton, Alexander Viro, Nick Piggin

On Fri, Oct 08, 2010 at 05:14:27PM +0800, McPacino wrote:
> Hi Christoph,
>
> If I use bios directly, I have to take care of caching myself.
> BHs can be released by the kernel when necessary.
>
> Is there any existing code that uses bios to read/write metadata
> blocks? How does it handle the timing of freeing the bios? I would
> really like to learn something from it.

If you actually need caching just use the pagecache, e.g.
read_mapping_page to read in your data. That completely abstracts
away the underlying block size.

McPacino

Oct 8, 2010, 5:28:17 AM

to Christoph Hellwig, Daniel Phillips, linux-...@vger.kernel.org, linux-...@vger.kernel.org, dm-d...@redhat.com, Andrew Morton, Alexander Viro, Nick Piggin

Thank you very very much...

Daniel Phillips

Oct 8, 2010, 9:24:58 AM

to Christoph Hellwig, McPacino, linux-...@vger.kernel.org, linux-...@vger.kernel.org, dm-d...@redhat.com, Andrew Morton, Alexander Viro, Nick Piggin

On Friday 08 October 2010, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 05:14:27PM +0800, McPacino wrote:
> > Hi Christoph,
> >
> > If I use bios directly, I have to take care of caching myself.
> > BHs can be released by the kernel when necessary.
> >
> > Is there any existing code that uses bios to read/write metadata
> > blocks? How does it handle the timing of freeing the bios? I would
> > really like to learn something from it.
>
> If you actually need caching just use the pagecache, e.g.
> read_mapping_page to read in your data. That completely abstracts
> away the underlying block size.

And that will automatically give him the PAGE_CACHE_SIZE objects he
wants. I still don't understand why his model cannot be generalized
to arbitrary block size specifiable at create time.

Regards,

Daniel

McPacino

Oct 13, 2010, 12:45:43 PM

to Christoph Hellwig, Daniel Phillips, linux-...@vger.kernel.org, linux-...@vger.kernel.org, dm-d...@redhat.com, Andrew Morton, Alexander Viro, Nick Piggin

On Fri, Oct 8, 2010 at 5:24 PM, Christoph Hellwig <h...@lst.de> wrote:
> On Fri, Oct 08, 2010 at 05:14:27PM +0800, McPacino wrote:
>> Hi Christoph,
>>
>> If I use bios directly, I have to take care of caching myself.
>> BHs can be released by the kernel when necessary.
>>
>> Is there any existing code that uses bios to read/write metadata
>> blocks? How does it handle the timing of freeing the bios? I would
>> really like to learn something from it.
>
> If you actually need caching just use the pagecache, e.g.
> read_mapping_page to read in your data. That completely abstracts
> away the underlying block size.

Hi Christoph,

My understanding is that read_mapping_page_async() works in a submit-and-wait
way. What I want is a function that works in a submit-and-callback way, just
like submit_bh(). Is there something like that in the kernel?

thanks.
cong meng.

Mikulas Patocka

Oct 19, 2010, 3:58:57 PM

to Cong Meng, linux-...@vger.kernel.org, device-mapper development, linux-...@vger.kernel.org, Andrew Morton, Alexander Viro, Christoph Hellwig, Nick Piggin, Daniel Phillips

Hi

Thanks for the development!

I'm already developing something similar:
paper: http://people.redhat.com/mpatocka/papers/shared-snapshots.pdf
kernel code: http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/r22/
user code: http://people.redhat.com/mpatocka/patches/userspace/new-snapshots/lvm-2.02.73/

Please change your code to conform to the existing interface. My kernel
code is extensible: you can plug arbitrary on-disk formats into it, and
Daniel Phillips' Zumastor snapshots are another example. Use this
interface so that your code can be used with the existing userspace
infrastructure.

Mikulas


Daniel Phillips

Nov 5, 2010, 11:27:19 PM

to Mikulas Patocka, Cong Meng, linux-...@vger.kernel.org, device-mapper development, linux-...@vger.kernel.org, Andrew Morton, Alexander Viro, Christoph Hellwig, Nick Piggin

Hi Mikulas,

Whoops, I didn't notice this mail for ages. Yes, the Zumastor front end is a very flexible and effective piece of code, never mind that it is entirely written in Bash. I'm not proud of that fact by any means, but managers at Google vetoed a rewrite to a more sensible language such as Python. Regardless, it worked fine then and is working now perfectly well in Bash, and is largely decoupled from the details of the underlying block device replication scheme. Adapting it to, say, DRBD or one of the efforts to improve LVM snapshots would just be a few lines. Oh, and I did rewrite it in Python anyway, mostly, which took about a day. If anybody wants that code I can point them at it.

/me makes the sign of the beast in the general direction of Google management.

Regards,

Daniel