# [TL] Asking for help with a disk-defragmentation-like algorithm problem I couldn't solve


### Sisong Hou

Aug 2, 2017, 9:58:19 AM

The regions [oldPos, oldPos+copyLen) may overlap one another; the regions [newPos, newPos+copyLen) never overlap one another, but there may be gaps between them.

### Sisong Hou

Aug 2, 2017, 10:20:36 AM

### HaoPeiQiang

Aug 2, 2017, 10:43:44 AM


### Changsheng Jiang

Aug 2, 2017, 11:22:17 AM
Assume the filesystem supports zero holes (sparse files).

Remove all non-referred bytes from the old file in place, and update all positions accordingly.

Read a small array from the end of the old file, and truncate the file after the read. Write that array to its destinations (possibly as partial writes), update all blocks, and delete blocks that become empty. Repeat until the old file is empty.


### Sisong Hou

Aug 2, 2017, 8:50:44 PM


### my412140706

Aug 2, 2017, 9:03:04 PM

### Sisong Hou

Aug 2, 2017, 9:03:23 PM
“Remove all non referred bytes from the old file in place, and update all positions accordingly.”
This is an O(N) algorithm and should be acceptable time-wise: sort the C array by oldPos; with a little analysis, this compaction can be done directly on disk.
“Read a small array from the end of the old file, truncate it after read.”
This is also an O(N) algorithm and should be acceptable time-wise.
“Write that array to its destinations (could be partial writes) and update all blocks, delete empty blocks.”
Isn't this an O(N*N) algorithm? Because space is limited, the target New file can only be a small file at this point, and each placement of old data needs an O(N) insertion algorithm.
I don't understand what "delete empty blocks" means. @Changsheng Jiang

### Sisong Hou

Aug 2, 2017, 10:01:40 PM

### Xinyu LIU

Aug 2, 2017, 10:25:01 PM

[src, src+len) and [dst, dst+len) may overlap; the operation may clobber the data at src, but the data at dst must end up correct and complete.

Larry, LIU Xinyu
https://github.com/liuxinyu95/AlgoXY

e^(πi)+1 = 0

### Sisong Hou

Aug 2, 2017, 10:57:44 PM

### Sisong Hou

Aug 3, 2017, 3:05:10 AM
The memmove part is easy to solve: partition into regions and swap byte by byte; no extra memory is needed.

### Xinyu LIU

Aug 7, 2017, 5:47:24 AM
memmove is not as simple as it looks. Consider these examples:
memmove(20, 50, 100)

memmove(20, 5, 100)


### Quinn Lu

Aug 7, 2017, 1:37:32 PM

### Changsheng Jiang

Aug 7, 2017, 8:51:14 PM
I was thinking of something like pwrite; it may not be available, and it is not tight in size.

As you, Sisong Hou, already pointed out, without pwrite we could modify the original file in place in three steps:

First, Compact: after this, all non-referenced bytes are removed.

Second, Expand: after this, every byte is referenced exactly once as source and as destination.

Third, Shuffle. This could be implemented as: pick one remaining block (s_0 (source), d_0 (destination), l_0 (length)); find a block that has d_0 as source, (s_1 = d_0, d_1, l_1), ..., until we get s_i = s_0; finish this cycle (limiting the length to the minimal length among its blocks) and remove every (s_j, d_j, min len) from the blocks, which may split a block into two (or three, with the middle piece done). This approach is data-move optimal (one read and one write per byte), but it has a scaling problem: even a very small initial block count can produce cycles of 1-byte blocks, so the block count can grow to almost the file size.

We could break cycles earlier and start from the longest block to keep the block count small:

Remove ID (identity, source == destination) blocks up front (or, from the start, only add non-ID blocks), and maintain the blocks as an ordered map source => (dest, len) plus an ordered set (len, source).

While there are non-ID blocks, pick the block (s, d, l) with the largest l and remove it from the blocks.

If [s, s + l) and [d, d + l) do not intersect, swap [s, s + l) with [d, d + l), and replace every occurrence of [d, d + l) in the blocks' sources with [s, s + l) accordingly.

If [s, s + l) and [d, d + l) do intersect, no remaining block has source bytes in the intersection range. We can move [s, s + l) to [d, d + l), move the non-intersecting part of [d, d + l) back to s, and update the source references accordingly.

For example, assume s < d and s + l > d:

permute the data [s .. d .. s+l .. d+l) into [s+l .. d+l) followed by [s .. s+l) (an in-place rotation), and replace all source references into [s + l, d + l) with [s, d).

In each replacement, we may split at most two blocks, each into at most two remaining pieces (a block can split into three, with the middle piece done). All ID blocks produced by the replacement should be dropped.

In every step, including the removal after the pick, the block count changes by at most +1/-2, and the total block length decreases by at least the maximal block length and at most twice the maximal block length.

It's a little hard to prove the complexity, but I guess it should be O(N * log(S) * log(N) + S * (byte I/O)), where S is the initial total block length and N is the initial number of blocks.

Changsheng Jiang

### Changsheng Jiang

Aug 7, 2017, 9:04:28 PM
Oh, sorry: the block count may decrease by more than 2 in a shuffle step, since some ID blocks may be separated only by non-ID blocks.

Changsheng Jiang

### Xinyu LIU

Aug 8, 2017, 12:01:58 AM
memmove is neither simple nor complicated :-)

```c
/* Note: the argument order here is (src, dst, len), the reverse of the standard memmove. */
char* memmove(char* src, char* dst, unsigned len) {
    char* ret = dst;
    if (src < dst)                 /* regions may overlap: copy backward */
        while (len--) *(dst + len) = *(src + len);
    else                           /* copy forward */
        while (len--) *dst++ = *src++;
    return ret;
}
```


### Sisong Hou

Aug 8, 2017, 9:43:50 PM

### Sisong Hou

Aug 8, 2017, 10:03:37 PM

* I don't see how the proposed solution keeps the maximum operation length large, so that the number of operations stays near O(N);
* The loser-tree idea, ... O(M*S) total data read and written;

### Long Cheng

Aug 9, 2017, 2:15:10 AM

### Sisong Hou

Aug 9, 2017, 3:26:39 AM
to: Long Cheng

(I'm not familiar with Xdelta, but the patches it produces are far, far larger than those of bsdiff and HDiffPatch; it is not designed for the same purpose.)

### Sisong Hou

Aug 9, 2017, 10:16:58 AM

Aug 18, 2017, 5:13:23 AM
This problem doesn't need any special algorithm at all: just use mmap together with a 64-bit environment, which is easy to get nowadays.
The python build must also be 64-bit.
mmap the whole file into memory; current 64-bit systems support at least a 48-bit address space, which is more than enough.


### Sisong Hou

Aug 19, 2017, 8:10:58 AM
Thanks, a nice idea: offload the algorithm onto an existing, similar mechanism, and the code is easy to write! Some comparable implementations do it this way, but I worry that the speed could degrade badly in some scenarios, similar to virtual-memory swapping when RAM runs out? If so, it's probably not as fast as a purpose-built algorithm.

to Long Cheng:
I tried the xdelta3 you recommended; on my Mac I installed it with $brew install xdelta
For two 500+ MB source tarballs I ran $xdelta3 -e -9 -s gcc-4.7.0.tar gcc-4.8.0.tar gcc-4.7.0--4.8.0.tar.delta3
But the results were unimpressive:
Patch size: the xdelta3 patch is 96MB, while bsdiff4.3+bz2 gives 11.8MB;
my HDiff2+bz2 gives 8.4MB (7.3MB with lzma); HDiff2.1 (the diff-memory-limited build, default parameters)+bz2 gives 12.6MB (10.3MB with lzma)
Memory usage: xdelta3 400+MB (200+MB without -e -9), bsdiff 4.4GB, HDiff2 3.0GB, HDiff2.1 memory-limited 72MB
Speed: xdelta3 146 seconds, bsdiff 366 seconds, HDiff2 69 seconds, HDiff2.1 memory-limited 21 seconds;

I'm not familiar with xdelta3; am I using it incorrectly? Or do xdelta3's strengths lie elsewhere?

### Sisong Hou

Aug 20, 2017, 1:24:07 AM
My xdelta3 usage should be correct; patches between other test files come out normal. A few cases are on the large side, the rest are within the normal patch-size range.

Aug 21, 2017, 1:31:17 AM
With mmap, the operating system decides which parts of the file to read into memory.

### Long Cheng

Aug 29, 2017, 11:09:35 PM

memory options:

```
-B bytes     source window size
-W bytes     input window size
-P size      compression duplicates window
-I size      instruction buffer size (0 = unlimited)
```