Preallocation of data files

285 views

Jan B.

Dec 20, 2013, 10:02:02 PM
to weed-fil...@googlegroups.com
Hello, the *.dat and *.idx files grow progressively. Because there are many files on the hard drive, they become fragmented on the native filesystem. I think the files should be preallocated to a large size. Am I wrong?

Philippe GASSMANN

Dec 21, 2013, 3:04:04 AM
to weed-fil...@googlegroups.com

I think fragmentation is only an issue when reading files in the same order they were written to the volume.

In normal use, reads of files stored in weedfs will most likely be random, so having the volume file fragmented should not strongly affect performance, since it will be randomly accessed anyway.

Still, it would be a great idea to benchmark fragmented vs contiguous files on different kinds of filesystems.

On Dec 21, 2013, 04:02, "Jan B." <jan....@gmail.com> wrote:
Hello, the *.dat and *.idx files grow progressively. Because there are many files on the hard drive, they become fragmented on the native filesystem. I think the files should be preallocated to a large size. Am I wrong?

--
You received this message because you are subscribed to the Google Groups "Weed File System" group.
To unsubscribe from this group and stop receiving emails from it, send an email to weed-file-syst...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Chris Lu

Dec 21, 2013, 5:51:53 AM
to weed-fil...@googlegroups.com
The feature to pre-allocate disk space is only available on some specific host file systems. I'm not likely to go for filesystem-specific function calls unless absolutely necessary, or unless someone already has a piece of generic Go code, rather than C wrapper code, to allocate disk space.

Having a big block size in the host file system would help in theory. But it may just be premature optimization, since many file systems already try very hard to minimize fragmentation.

Chris

Amir Keibi

Nov 23, 2016, 5:33:21 PM
to Seaweed File System, weed-fil...@googlegroups.com
Well, running the latest version of seaweed out of the box on an XFS file system generated 83% fragmentation. And this is with Speculative Preallocation.

Amir Keibi

Nov 23, 2016, 5:44:36 PM
to Seaweed File System, weed-fil...@googlegroups.com
Also, "I think fragmentation is only an issue when reading files in the same order they were written to the volume" is just not true. The Haystack paper specifically mentions XFS because of its ability to preallocate large files. The only context in which this preallocation matters is fragmentation.

Chris Lu

Nov 23, 2016, 8:23:12 PM
to Seaweed File System, weed-fil...@googlegroups.com
83% seems like a large percentage. How is this value calculated?

When reading a specific file, how many of the files are split across 2 different blocks, 3 different blocks, etc.? That would translate to 2 disk seeks, 3 disk seeks.

Using XFS's preallocation is of course ideal. We may need a cgo build just for this, sacrificing the pure-Go promise.

Chris


Amir Keibi

Nov 24, 2016, 1:07:02 PM
to Seaweed File System, weed-fil...@googlegroups.com
Please see the attached screenshot.

As for preallocation, I'm not sure Go should be forgone. I tried the following code (fd is a *os.File) and it works on Linux. All that needs to be done then is to start writing from the beginning of the file (not the end) and keep track of where the start offset for the next file is.

	// Preallocate 10 GB (requires "syscall"; Linux only).
	var length int64 = 10240000000
	if e := fd.Truncate(length); e != nil {
		// handle error
	}
	var mode uint32 = 1 // 1 == FALLOC_FL_KEEP_SIZE
	var offset int64 = 0
	if e := syscall.Fallocate(int(fd.Fd()), mode, offset, length); e != nil {
		// handle error
	}
XFS_without_sparse_file_19313_uploaded.png

Chris Lu

Nov 24, 2016, 1:21:38 PM
to Seaweed File System
Thanks! This is nice! We will use this piece of code.

Now we need to decide where to remember the start offset for the next file entry. The last entry in the index file could be used to calculate this. And the "weed fix" tool needs a way to determine the correct end of file when iterating over the file entries.

Seems quite doable. Just need some careful implementation.

Chris


Amir Keibi

Nov 28, 2016, 12:33:51 PM
to Seaweed File System
Cool! 

Can I help? Or do you have a ticket I could follow?

Qin Liu

Jan 5, 2017, 11:27:34 AM
to Seaweed File System
Hi Chris,

Is there any plan to fix this problem?
In my benchmark (described in another thread, https://groups.google.com/forum/#!topic/seaweedfs/9RhxU1J9bK8), the random read of seaweedfs is ~2 times slower than the result obtained with fio. I think fragmentation may be the reason. As shown below, the files generated by weed contain many more fragments than a normal file.

The number of extents in a 373G file used for the fio random read test:
$ filefrag random-read.1.0
random-read.1.0: 2 extents found

The number of extents in weed’s dat files (187G in total):
$ filefrag volume1/benchmark_*dat
volume1/benchmark_1.dat: 7 extents found
volume1/benchmark_2.dat: 8 extents found
volume1/benchmark_3.dat: 8 extents found
volume1/benchmark_4.dat: 7 extents found
volume1/benchmark_5.dat: 7 extents found
volume1/benchmark_6.dat: 7 extents found
volume1/benchmark_7.dat: 7 extents found

The fragmentation factor of our file system after weed’s benchmark:
$ sudo xfs_db -r /dev/sda
xfs_db> frag
actual 224, ideal 20, fragmentation factor 91.07%
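As a sanity check on those numbers: xfs_db's fragmentation factor can be reproduced as (actual - ideal) / actual, a formula inferred here from the output itself ((224 - 20) / 224 = 91.07%). A tiny sketch:

```go
package main

import "fmt"

// fragFactor reproduces xfs_db's "fragmentation factor" from the actual
// and ideal extent counts, assuming the formula (actual - ideal) / actual.
func fragFactor(actual, ideal int) string {
	return fmt.Sprintf("%.2f%%", float64(actual-ideal)/float64(actual)*100)
}

func main() {
	fmt.Println(fragFactor(224, 20)) // prints 91.07%
}
```

Note that under this formula a ~50% factor still means files hold roughly twice the ideal number of extents, which is why 50% after defragmentation is still "pretty bad".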

On Friday, November 25, 2016 at 2:21:38 AM UTC+8, ChrisLu wrote:

Chris Lu

Jan 5, 2017, 11:31:08 AM
to seaw...@googlegroups.com
Did you try reading from more volumes? How many threads does your fio test use?

Qin Liu

Jan 5, 2017, 11:40:44 AM
to seaw...@googlegroups.com
I've tried with 123 volumes and the result is similar.
The test with fio has 16 threads.

best,

Qin Liu


Chris Lu

Jan 5, 2017, 11:44:10 AM
to seaw...@googlegroups.com
Is there any tool to defragment the dat files and compare the difference?


Qin Liu

Jan 5, 2017, 11:57:15 AM
to seaw...@googlegroups.com
I've tried `xfs_fsr` and it didn't help much. `xfs_fsr` reduced the fragmentation factor of XFS from ~90% to ~50%, which is still pretty bad.

best,

Qin Liu


Chris Lu

Jan 5, 2017, 12:15:26 PM
to Seaweed File System
You can run "weed benchmark -write=true -read=false" to generate the .dat files. Stop the volume server, run your tool to defragment the .dat files, restart the volume servers, and run "weed benchmark -write=false -read=true".

ChrisLu

Jan 5, 2017, 1:57:19 PM
to Seaweed File System
Please let me know the result. If this improves the performance, we can add the fallocate() optimization.

Chris

Qin Liu

Jan 6, 2017, 10:45:23 AM
to seaw...@googlegroups.com
After defragmentation, the read throughput is 62.8 MB/s (it was ~50 MB/s before).
The improvement is not as significant as I expected...

By the way, the syscall `fallocate()` has an option, `FALLOC_FL_KEEP_SIZE`, which keeps the file size unchanged after preallocation. I think this would be useful for implementing the optimization.

best,

Qin Liu

Chris Lu

Jan 6, 2017, 12:35:23 PM
to Seaweed File System
Thanks for the update and stats!

The difference is big enough. The remaining "performance difference" could be due to the benchmark tool itself. I have seen other people use some C code to get higher performance numbers on SeaweedFS.

Chris



Chris Lu

Jan 8, 2017, 6:24:32 PM
to Seaweed File System
I have added support to pre-allocate volume file disk space.


Please report the performance difference by running the benchmark.

Chris

Qin Liu

Jan 9, 2017, 2:48:43 AM
to seaw...@googlegroups.com
Random read 2MB files:
weed before preallocation: 330 MB/s
weed after preallocation: 391 MB/s
fio: 639 MB/s

Write performance:
weed before preallocation: 859 MB/s
weed after preallocation: 953 MB/s
fio: 681 MB/s

This benchmark is conducted on a new server, since the disk of my previous test server is broken.

best,

Qin Liu

Chris Lu

Jan 9, 2017, 2:53:14 AM
to Seaweed File System
Thanks. Please post the "fio" command you used.

Chris

Qin Liu

Jan 9, 2017, 3:11:04 AM
to seaw...@googlegroups.com
I use the following config file:

[random-read]
rw=randread
blocksize=2M
size=500G
directory=/DATA_RAID/weed-test
ioengine=libaio
iodepth=16
invalidate=1
direct=1


best,

Qin Liu

Qin Liu

Jan 9, 2017, 3:12:27 AM
to seaw...@googlegroups.com
Sorry, the write performance was tested using `dd`, not `fio`:
dd if=/dev/zero of=/DATA_RAID/tmp bs=1G count=50

best,

Qin Liu

Chris Lu

Jan 9, 2017, 3:17:05 AM
to Seaweed File System
You are using 16 concurrent jobs in your "fio" read test. The SeaweedFS benchmark by default uses 7 volumes to read and write, and access to each volume is serialized. To increase the number of volumes, use this URL
