This is an interesting problem. I've tried a few approaches, with
eventual, tentative success. Someone suggested copying the bytes back
toward the start of the file and then truncating it, which led to the
following (erroneous) code I called 'copyback':
------------------------------------------------------------------------
#include <stdlib.h>
#include <sys/stat.h>
#include <stdio.h>
int main(int argc, char** argv) {
char* infile = argv[1];
int skip_bytes = atoi(argv[2]);
struct stat buf;
stat(infile, &buf);
off_t total_bytes = buf.st_size;
/* Use buffered files to keep things simpler */
FILE* in = fopen(infile, "r");
fseek(in, skip_bytes, SEEK_SET);
FILE* out = fopen(infile, "a");
fseek(out, 0, SEEK_SET);
off_t bytes_to_copy = total_bytes - skip_bytes;
off_t i;
for(i=0; i < bytes_to_copy; i++) {
unsigned char c;
fread(&c, 1, 1, in);
fwrite(&c, 1, 1, out);
}
fclose(in);
ftruncate(fileno(out), bytes_to_copy);
fclose(out);
}
------------------------------------------------------------------------
The problem is that, at least on OpenBSD, a file opened in append mode
always writes to the end of the file, and the fseek call is ignored.
(The C standard specifies this for "a" mode, so it is not an OpenBSD
quirk.)
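For what it's worth, here is a minimal sketch of how the copyback idea
could sidestep the append-mode problem, assuming POSIX pread and pwrite
are available. It opens the file once with O_RDWR and passes explicit
offsets, so no seek position is involved. The function name copyback()
and the 64KB chunk size are my own choices for illustration, not part of
the program above:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (an illustration, not the original program):
 * remove the first `skip` bytes of `path` in place.
 * Returns 0 on success, -1 on error. */
int copyback(const char *path, off_t skip) {
    assert(path != NULL);
    int fd = open(path, O_RDWR);   /* read/write, deliberately not O_APPEND */
    if (fd < 0) return -1;

    off_t total = lseek(fd, 0, SEEK_END);
    if (total < 0 || skip < 0 || skip > total) { close(fd); return -1; }

    char buf[1 << 16];             /* assumed chunk size: 64KB */
    off_t src = skip, dst = 0;
    /* Copy front to back. dst is always behind src, and each chunk is
     * read completely before it is written, so nothing unread is
     * overwritten. */
    while (src < total) {
        size_t want = sizeof buf;
        if ((off_t)want > total - src) want = (size_t)(total - src);
        ssize_t got = pread(fd, buf, want, src);
        if (got <= 0) { close(fd); return -1; }
        ssize_t done = 0;
        while (done < got) {       /* pwrite may write less than asked */
            ssize_t w = pwrite(fd, buf + done, (size_t)(got - done),
                               dst + done);
            if (w <= 0) { close(fd); return -1; }
            done += w;
        }
        src += got;
        dst += got;
    }
    int rc = ftruncate(fd, dst);   /* drop the now-duplicated tail */
    if (close(fd) < 0) rc = -1;
    return rc;
}
```

Calling copyback("infile", skip_bytes) from a small main() would give
the same command-line behavior as the program above. Like any in-place
copy, an interruption partway through leaves the file partly shifted.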
I imagined a utility called slurp that would allow for the following:
slurp infile | sed 1,5d > outfile
slurp would read a file and print it to stdout while removing it from
disk at the same time. The obstacle to writing slurp is that truncate
can only work at the end of the file and not the head. This led me to
think of using tac and truncate, but the re-reversing of the input file
cannot be done without creating a temp file or loading it completely in
memory.
For slurp to work it would have to be a very low level utility, or even
a syscall. It would essentially 'cdr' down the block chain on the
filesystem of a file, re-updating the head pointer and freeing blocks as
it iterates through the file, and tossing the blocks in the garbage as
it goes. There would also be no turning back after a call to slurp and
the input file would be gone, unless you captured the output.
A final, working solution is to use mmap and ftruncate, provided you
have a 64-bit system that can map 100GB of address space. This is
copyback-mmap.c:
------------------------------------------------------------------------
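#include <stdlib.h>
#include <sys/stat.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>
int main(int argc, char** argv) {
    if(argc != 3) {
        fprintf(stderr, "usage: %s file skip_bytes\n", argv[0]);
        return 1;
    }
    char* infile = argv[1];
    off_t skip_bytes = atoll(argv[2]);
    int in = open(infile, O_RDWR);
    if(in < 0) { perror("open"); return 1; }
    struct stat buf;
    if(fstat(in, &buf) < 0) { perror("fstat"); return 1; }
    off_t total_bytes = buf.st_size;
    if(skip_bytes <= 0 || skip_bytes > total_bytes) {
        fprintf(stderr, "skip_bytes out of range\n");
        return 1;
    }
    off_t bytes_to_copy = total_bytes - skip_bytes;
    char* bytes = mmap(NULL, total_bytes, PROT_READ | PROT_WRITE,
                       MAP_SHARED, in, 0);
    if(bytes == MAP_FAILED) { perror("mmap"); return 1; }
    /* Slide everything after the skipped prefix to the start of the file */
    memmove(bytes, bytes+skip_bytes, bytes_to_copy);
    /* Flush the whole mapping; pass the full length rather than relying
       on how a length of 0 is treated */
    if(msync(bytes, total_bytes, MS_SYNC) < 0) perror("msync");
    munmap(bytes, total_bytes);
    fsync(in); /* just to be sure */
    if(ftruncate(in, bytes_to_copy) < 0) perror("ftruncate");
    close(in);
    return 0;
}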
------------------------------------------------------------------------
You can use this with the following script:
------------------------------------------------------------------------
#!/bin/bash
input=$1
nlines=5
# Find the number of bytes to skip for the first $nlines lines
skip_bytes=$(head -n "$nlines" < "$input" | wc -c)
copyback-mmap "$input" "$skip_bytes"
------------------------------------------------------------------------
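The byte count that the script gets from head and wc could also be
computed in C, which would let everything live in one program. A rough
sketch follows; the name bytes_in_first_lines() is hypothetical, and it
is meant to match what head -n N | wc -c reports, including the case
where the file has fewer than N lines:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: number of bytes occupied by the first `nlines`
 * lines of `path` (the whole file if it has fewer lines).
 * Returns -1 on error. */
off_t bytes_in_first_lines(const char *path, long nlines) {
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;

    off_t count = 0;
    long seen = 0;
    int c;
    /* Count bytes until `nlines` newlines have been consumed or EOF */
    while (seen < nlines && (c = getc(f)) != EOF) {
        count++;
        if (c == '\n') seen++;
    }
    int failed = ferror(f);
    fclose(f);
    return failed ? -1 : count;
}
```

The result can be passed as the skip_bytes argument in place of the
value the script computes.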
The operation of copyback-mmap is totally dependent on the
implementation of the host operating system's virtual memory manager. If
you have a quality OS, I think it should work.
A long post, but I think I may finally have arrived at a solution.
Hopefully it works for you. As usual, there are no guarantees or
warranties of fitness for any purpose for this code, and I assume no
liability if you choose to use it on your possibly valuable 100GB of
data.
--
Burton Samograd