I was thinking of something like pwrite; it may not be available, and it is not tight in size.
As you, Sisong Hou, already pointed out, without pwrite we could modify the original file in place in three steps:
First, Compact: after this, all non-referenced bytes are removed.
Second, Expand: after this, every byte is referenced exactly once as a source and exactly once as a destination.
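To make the post-Expand state concrete, here is a minimal C++ sketch of a block triple and a check for one half of the invariant (source ranges pairwise disjoint; a full check would also verify coverage). `Block` and `sourcesDisjoint` are names introduced here for illustration, not taken from HDiffPatch:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustration only: one "copy" unit of the in-place plan.
struct Block { uint64_t source, dest, len; };

// Necessary half of the post-Expand invariant: no two source ranges overlap.
// (Destinations can be checked the same way by sorting on 'dest'.)
bool sourcesDisjoint(std::vector<Block> blocks) {
    std::sort(blocks.begin(), blocks.end(),
              [](const Block& a, const Block& b) { return a.source < b.source; });
    for (size_t i = 1; i < blocks.size(); ++i)
        if (blocks[i - 1].source + blocks[i - 1].len > blocks[i].source)
            return false;
    return true;
}
```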
Third, Shuffle. This could be implemented as follows: pick one remaining block (s_0 (source), d_0 (destination), l_0 (length)); find the block which has d_0 as its source, (s_1 = d_0, d_1, l_1); and so on, until we reach s_i = s_0. Finish this cycle (limiting the length to the minimal length along it) and remove all the (s_j, d_j, min len) pieces from the blocks; this may split a block into two blocks (or three, with the middle piece done). This approach is data-move optimal (one read and one write per byte), but it has a scaling problem: even a very small initial number of blocks may decompose into cycles of 1-byte blocks, numbering almost as many as the bytes in the file.
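For illustration, here is the cycle walk at the byte level (the block-level version described above consumes blocks as it goes and needs no per-byte bitmap; the bitmap here is only to keep the sketch short). The function name and the `dest` encoding are mine:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// dest must be a permutation of [0, n): the byte at offset i has to end up
// at offset dest[i]. Each byte is read once and written once.
void shuffleByCycles(std::vector<uint8_t>& buf, const std::vector<size_t>& dest) {
    std::vector<bool> done(buf.size(), false);
    for (size_t start = 0; start < buf.size(); ++start) {
        if (done[start] || dest[start] == start) continue;  // skip finished / fixed points
        uint8_t carry = buf[start];          // the byte travelling around the cycle
        size_t i = start;
        while (!done[i]) {
            done[i] = true;
            std::swap(carry, buf[dest[i]]);  // drop carry at its target, pick up the displaced byte
            i = dest[i];
        }
    }
}
```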
We could break the cycles early and start from the longest block to keep the block count small:
Remove identity blocks (or, from the start, only add non-identity blocks), and maintain the blocks as an ordered map source => (dest, len) and an ordered set (len, source).
While there are non-identity blocks, pick a block (s, d, l) with the largest l and remove it from the blocks.
If [s, s + l) and [d, d + l) do not intersect, swap the data in [s, s + l) with the data in [d, d + l), and replace all source references to [d, d + l) in the remaining blocks with the corresponding positions in [s, s + l).
If [s, s + l) and [d, d + l) do intersect, no remaining block has source bytes in the intersection range (those bytes were the picked block's own source, and every byte is a source exactly once). We can move [s, s + l) to [d, d + l), move the non-intersecting part of [d, d + l) into the vacated part of [s, s + l), and update the source references accordingly.
For example, assume s < d and s + l > d:
permute the data [s ... d ... s + l ... d + l) to [s + l ... d + l) followed by [s ... s + l) (a single rotation of the region [s, d + l)), and replace all source references in [s + l, d + l) with [s, d).
Each such replacement splits at most two blocks (the ones straddling the ends of the replaced range), each into at most two pieces. Any identity blocks produced by a replacement should be dropped.
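Putting the above together, here is a compilable sketch of the bookkeeping, operating on an in-memory buffer for clarity (a real implementation would issue file reads and writes instead). `Plan`, `step`, and `remapSources` are hypothetical names, not from any existing codebase:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

struct Plan {
    std::map<uint64_t, std::pair<uint64_t, uint64_t>> bySource; // source -> (dest, len)
    std::set<std::pair<uint64_t, uint64_t>> byLen;              // (len, source)

    void insert(uint64_t s, uint64_t d, uint64_t l) {
        if (l == 0 || s == d) return;               // drop identity blocks
        bySource[s] = {d, l};
        byLen.insert({l, s});
    }

    // Redirect source references from [from, from+len) to [to, to+len).
    // Only the (at most two) blocks straddling the range ends get split.
    void remapSources(uint64_t from, uint64_t to, uint64_t len) {
        const uint64_t end = from + len;
        std::vector<std::array<uint64_t, 3>> rebuilt;   // (s, d, l) to reinsert
        auto it = bySource.lower_bound(from);
        if (it != bySource.begin()) {                   // previous block may straddle 'from'
            auto prev = std::prev(it);
            if (prev->first + prev->second.second > from) it = prev;
        }
        while (it != bySource.end() && it->first < end) {
            uint64_t s = it->first, d = it->second.first, l = it->second.second;
            byLen.erase({l, s});
            it = bySource.erase(it);
            uint64_t a = std::max(s, from), b = std::min(s + l, end);
            if (s < a)     rebuilt.push_back({s, d, a - s});               // left piece, unmoved
            rebuilt.push_back({to + (a - from), d + (a - s), b - a});      // middle, remapped
            if (s + l > b) rebuilt.push_back({b, d + (b - s), s + l - b}); // right piece, unmoved
        }
        for (auto& r : rebuilt) insert(r[0], r[1], r[2]);
    }
};

// One greedy step: retire the block (s, d, l) with the largest l.
bool step(Plan& p, std::vector<uint8_t>& buf) {
    if (p.byLen.empty()) return false;
    auto [l, s] = *p.byLen.rbegin();
    uint64_t d = p.bySource[s].first;
    p.byLen.erase({l, s});
    p.bySource.erase(s);
    auto at = [&](uint64_t i) { return buf.begin() + static_cast<std::ptrdiff_t>(i); };
    if (s + l <= d || d + l <= s) {
        // Disjoint: swap the ranges; old [d, d+l) content now lives at [s, s+l).
        std::swap_ranges(at(s), at(s + l), at(d));
        p.remapSources(d, s, l);
    } else if (s < d) {
        // Overlap, s < d < s+l: rotate [s, d+l) so [s, s+l) lands at [d, d+l);
        // old [s+l, d+l) content now lives at [s, d).
        std::rotate(at(s), at(s + l), at(d + l));
        p.remapSources(s + l, s, d - s);
    } else {
        // Overlap, d < s < d+l: symmetric rotation;
        // old [d, s) content now lives at [d+l, s+l).
        std::rotate(at(d), at(s), at(s + l));
        p.remapSources(d, d + l, s - d);
    }
    return true;
}
```

Repeatedly calling step until it returns false completes the Shuffle; each call does an amortized constant number of ordered map/set operations plus the byte moves.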
In every step, including the removal after the pick, the block count changes by at most +1 (and by at least -2), and the total block length decreases by at least the maximal block length and by at most twice the maximal block length.
It's a little hard to prove the complexity, but I guess it should be O(N * log(S) * log(N)) operations plus O(S) bytes of I/O, where S is the initial total block length and N is the initial number of blocks.
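For what it's worth, a rough way to see an O(N * log(S)) bound on the step count, under the unproven assumption that the live block count stays within some constant factor c of N: the maximal block length is at least the average of the remaining lengths, so

$$S_{t+1} \le S_t\left(1 - \frac{1}{cN}\right) \quad\Longrightarrow\quad S_t \le S\,e^{-t/(cN)},$$

and since every live block has length at least 1, everything is retired once S_t < 1, i.e. after t = O(N * log(S)) steps; at O(log(N)) per ordered map/set operation that matches the first term of the guess.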