I was thinking of something like pwrite; it may not be available, and it is not tight in size.
As you, Sisong Hou, already pointed out, without pwrite we could modify the original file in place in three steps:
First, Compact: after this, all non-referenced bytes are removed.
Second, Expand: after this, every byte is referenced exactly once as a source and exactly once as a destination.
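To make the post-Expand state concrete, here is a minimal C++ sketch of a block triple and a check for one half of the invariant (source ranges pairwise disjoint; a full check would also verify coverage). `Block` and `sourcesDisjoint` are names introduced here for illustration, not taken from HDiffPatch:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustration only: one "copy" unit of the in-place plan.
struct Block { uint64_t source, dest, len; };

// Necessary half of the post-Expand invariant: no two source ranges overlap.
// (Destinations can be checked the same way by sorting on 'dest'.)
bool sourcesDisjoint(std::vector<Block> blocks) {
    std::sort(blocks.begin(), blocks.end(),
              [](const Block& a, const Block& b) { return a.source < b.source; });
    for (size_t i = 1; i < blocks.size(); ++i)
        if (blocks[i - 1].source + blocks[i - 1].len > blocks[i].source)
            return false;
    return true;
}
```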
Third, Shuffle. This could be implemented as follows: pick one remaining block (s_0 (source), d_0 (destination), l_0 (length)); find the block which has d_0 as its source, (s_1 = d_0, d_1, l_1); and so on, until we reach s_i = s_0. Finish this cycle (limiting the length to the minimal length along it) and remove all the (s_j, d_j, min len) pieces from the blocks; this may split a block into two blocks (or three, with the middle piece done). This approach is data-move optimal (one read and one write per byte), but it has a scaling problem: even a very small initial number of blocks may decompose into cycles of 1-byte blocks, numbering almost as many as the bytes in the file.
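For illustration, here is the cycle walk at the byte level (the block-level version described above consumes blocks as it goes and needs no per-byte bitmap; the bitmap here is only to keep the sketch short). The function name and the `dest` encoding are mine:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// dest must be a permutation of [0, n): the byte at offset i has to end up
// at offset dest[i]. Each byte is read once and written once.
void shuffleByCycles(std::vector<uint8_t>& buf, const std::vector<size_t>& dest) {
    std::vector<bool> done(buf.size(), false);
    for (size_t start = 0; start < buf.size(); ++start) {
        if (done[start] || dest[start] == start) continue;  // skip finished / fixed points
        uint8_t carry = buf[start];          // the byte travelling around the cycle
        size_t i = start;
        while (!done[i]) {
            done[i] = true;
            std::swap(carry, buf[dest[i]]);  // drop carry at its target, pick up the displaced byte
            i = dest[i];
        }
    }
}
```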
We could break the cycles early and start from the longest block to keep the block count small:
Remove identity blocks (or, from the start, only add non-identity blocks), and maintain the blocks as an ordered map source => (dest, len) and an ordered set (len, source).
While there are non-identity blocks, pick a block (s, d, l) with the largest l and remove it from the blocks.
If [s, s + l) and [d, d + l) do not intersect, swap the data in [s, s + l) with the data in [d, d + l), and replace all source references to [d, d + l) in the remaining blocks with the corresponding positions in [s, s + l).
If [s, s + l) and [d, d + l) do intersect, no remaining block has source bytes in the intersection range (those bytes were the picked block's own source, and every byte is a source exactly once). We can move [s, s + l) to [d, d + l), move the non-intersecting part of [d, d + l) into the vacated part of [s, s + l), and update the source references accordingly.
For example, assume s < d and s + l > d:
permute the data [s ... d ... s + l ... d + l) to [s + l ... d + l) followed by [s ... s + l) (a single rotation of the region [s, d + l)), and replace all source references in [s + l, d + l) with [s, d).
Each such replacement splits at most two blocks (the ones straddling the ends of the replaced range), each into at most two pieces. Any identity blocks produced by a replacement should be dropped.
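Putting the above together, here is a compilable sketch of the bookkeeping, operating on an in-memory buffer for clarity (a real implementation would issue file reads and writes instead). `Plan`, `step`, and `remapSources` are hypothetical names, not from any existing codebase:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

struct Plan {
    std::map<uint64_t, std::pair<uint64_t, uint64_t>> bySource; // source -> (dest, len)
    std::set<std::pair<uint64_t, uint64_t>> byLen;              // (len, source)

    void insert(uint64_t s, uint64_t d, uint64_t l) {
        if (l == 0 || s == d) return;               // drop identity blocks
        bySource[s] = {d, l};
        byLen.insert({l, s});
    }

    // Redirect source references from [from, from+len) to [to, to+len).
    // Only the (at most two) blocks straddling the range ends get split.
    void remapSources(uint64_t from, uint64_t to, uint64_t len) {
        const uint64_t end = from + len;
        std::vector<std::array<uint64_t, 3>> rebuilt;   // (s, d, l) to reinsert
        auto it = bySource.lower_bound(from);
        if (it != bySource.begin()) {                   // previous block may straddle 'from'
            auto prev = std::prev(it);
            if (prev->first + prev->second.second > from) it = prev;
        }
        while (it != bySource.end() && it->first < end) {
            uint64_t s = it->first, d = it->second.first, l = it->second.second;
            byLen.erase({l, s});
            it = bySource.erase(it);
            uint64_t a = std::max(s, from), b = std::min(s + l, end);
            if (s < a)     rebuilt.push_back({s, d, a - s});               // left piece, unmoved
            rebuilt.push_back({to + (a - from), d + (a - s), b - a});      // middle, remapped
            if (s + l > b) rebuilt.push_back({b, d + (b - s), s + l - b}); // right piece, unmoved
        }
        for (auto& r : rebuilt) insert(r[0], r[1], r[2]);
    }
};

// One greedy step: retire the block (s, d, l) with the largest l.
bool step(Plan& p, std::vector<uint8_t>& buf) {
    if (p.byLen.empty()) return false;
    auto [l, s] = *p.byLen.rbegin();
    uint64_t d = p.bySource[s].first;
    p.byLen.erase({l, s});
    p.bySource.erase(s);
    auto at = [&](uint64_t i) { return buf.begin() + static_cast<std::ptrdiff_t>(i); };
    if (s + l <= d || d + l <= s) {
        // Disjoint: swap the ranges; old [d, d+l) content now lives at [s, s+l).
        std::swap_ranges(at(s), at(s + l), at(d));
        p.remapSources(d, s, l);
    } else if (s < d) {
        // Overlap, s < d < s+l: rotate [s, d+l) so [s, s+l) lands at [d, d+l);
        // old [s+l, d+l) content now lives at [s, d).
        std::rotate(at(s), at(s + l), at(d + l));
        p.remapSources(s + l, s, d - s);
    } else {
        // Overlap, d < s < d+l: symmetric rotation;
        // old [d, s) content now lives at [d+l, s+l).
        std::rotate(at(d), at(s), at(s + l));
        p.remapSources(d, d + l, s - d);
    }
    return true;
}
```

Repeatedly calling step until it returns false completes the Shuffle; each call does an amortized constant number of ordered map/set operations plus the byte moves.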
In every step, including the removal after the pick, the block count changes by at most +1 (and by at least -2), and the total block length decreases by at least the maximal block length and by at most twice the maximal block length.
It's a little hard to prove the complexity, but I guess it should be O(N * log(S) * log(N)) operations plus O(S) bytes of I/O, where S is the initial total block length and N is the initial number of blocks.
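For what it's worth, a rough way to see an O(N * log(S)) bound on the step count, under the unproven assumption that the live block count stays within some constant factor c of N: the maximal block length is at least the average of the remaining lengths, so

$$S_{t+1} \le S_t\left(1 - \frac{1}{cN}\right) \quad\Longrightarrow\quad S_t \le S\,e^{-t/(cN)},$$

and since every live block has length at least 1, everything is retired once S_t < 1, i.e. after t = O(N * log(S)) steps; at O(log(N)) per ordered map/set operation that matches the first term of the guess.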