
modify file with little memory left


Ed Morton

Feb 7, 2013, 9:18:30 AM
If I have a file that's 100G in size and I have 50G memory total left on my
machine, how do I delete the first 5 lines from that file?

I can't manually create a tmp file as I don't have enough memory; similarly, sed
-i will fail to create its under-the-table temp file, and ed will fail to load
my original file into its buffer.

What's left?

Ed.

pk

Feb 7, 2013, 9:28:12 AM
Count how many bytes those 5 lines take, and write a program (in C or
whatever) that moves everything starting from byte N+1 back N bytes,
finally call truncate() to remove the excess N bytes at the end. Obviously,
during the whole process the file is inconsistent.

Ed Morton

Feb 7, 2013, 9:32:32 AM
Thanks, but I'm kinda hoping there's a shell command to do that or a tool that
does real in-place editing.

Ed.

paul

Feb 7, 2013, 9:33:23 AM
With some scripting, you can use dd(1) and truncate(1).
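
One way that recipe might look (a sketch only, assuming GNU dd's
iflag=skip_bytes and GNU truncate; untested, so try it on scratch data
first):

n=$(head -n 5 file | wc -c)     # bytes taken up by the first 5 lines
dd if=file of=file bs=1M iflag=skip_bytes skip="$n" conv=notrunc
truncate -s "-$n" file

conv=notrunc keeps dd from truncating the file when it opens it, and
since each block is written n bytes behind where it was read, the copy
is safe to do in place.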


pk

Feb 7, 2013, 9:36:20 AM
Ed Morton <morto...@gmail.com> wrote:

> Thanks, but I'm kinda hoping there's a shell command to do that or a tool
> that does real in-place editing.

I'm not aware of such a tool.


Kenny McCormack

Feb 7, 2013, 9:37:27 AM
In article <kf0dt1$34t$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
...
>> Count how many bytes those 5 lines take, and write a program (in C or
>> whatever) that moves everything starting from byte N+1 back N bytes,
>> finally call truncate() to remove the excess N bytes at the end. Obviously,
>> during the whole process the file is inconsistent.
>>
>
>Thanks, but I'm kinda hoping there's a shell command to do that or a tool that
>does real in-place editing.

Well, there will be (a shell tool) once you (or someone else) have written it.

Seriously, it doesn't look that hard. I think a Perl jockey could (and
no doubt will, within the next 24 hours) bash it out in about 10 lines.

I could probably do it in TAWK - or even GAWK with call_any() [*].
But I will leave it to the Perl jocks.

[*] My private version of GAWK that has access to system/library calls.

--
Modern Christian: Someone who can take time out from
complaining about "welfare mothers popping out babies we
have to feed" to complain about welfare mothers getting
abortions that PREVENT more babies to be raised at public
expense.

Janis Papanagnou

Feb 7, 2013, 12:03:29 PM
Hi Ed

I don't think that there's a standard shell solution.

Where I'd look is the in-place r/w redirection operators in
conjunction with the seek facilities of ksh93. I've tried them out in
the past, but I'm not fluent with those redirections, so I can't
provide an example off the top of my head. You may find what you need
in the ksh docs here:
http://www2.research.att.com/sw/download/man/man1/ksh.html
search for: <>file
and either of the following: <#((...)) or >#((...)) or <#pattern
I suppose it's possible to combine it with dd and process substitution
to make it work.

Hope it's of some use.

Janis

>
> Ed.

Barry Margolin

Feb 7, 2013, 12:21:17 PM
In article <kf0d2p$vd7$1...@dont-email.me>,
You seem to be confusing disk space with memory, but everyone seems to
have understood you, and I'm certainly not going to nit-pick. You got
lucky; there are usually pedants just waiting to pounce on errors like
this.

Try this:

gzip filename
zcat filename.gz | tail -n +6 > filename
rm filename.gz

Assuming the file compresses at least 2x, you should have enough space.
Most kinds of text files get much better compression than this.
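
A slightly more defensive phrasing of the same idea (my wording, not
part of the original suggestion) chains the steps with && so that a
failure at any step leaves the compressed copy intact:

gzip filename &&
zcat filename.gz | tail -n +6 > filename &&
rm filename.gz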

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***

Burton Samograd

Feb 7, 2013, 12:49:55 PM
This is an interesting problem. I've tried a few ways, with eventual
hesitant success. Someone suggested copying back the bytes and then
truncating the file, leading to the following (erroneous) code I called
'copyback':

------------------------------------------------------------------------
#include <stdlib.h>
#include <sys/stat.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char** argv) {
    char* infile = argv[1];
    int skip_bytes = atoi(argv[2]);

    struct stat buf;
    stat(infile, &buf);
    off_t total_bytes = buf.st_size;

    /* Use buffered files to keep things simpler */
    FILE* in = fopen(infile, "r");
    fseek(in, skip_bytes, SEEK_SET);
    FILE* out = fopen(infile, "a");  /* the bug discussed below: "a" pins writes to EOF */
    fseek(out, 0, SEEK_SET);

    off_t bytes_to_copy = total_bytes - skip_bytes;
    off_t i;

    /* Copy byte-by-byte from offset skip_bytes down to offset 0 */
    for (i = 0; i < bytes_to_copy; i++) {
        unsigned char c;
        fread(&c, 1, 1, in);
        fwrite(&c, 1, 1, out);
    }
    fclose(in);
    ftruncate(fileno(out), bytes_to_copy);
    fclose(out);
    return 0;
}
------------------------------------------------------------------------

The problem is that, at least on OpenBSD, opening the file in append
mode will always write to the end of the file and the fseek call is
ignored.

I imagined a utility called slurp that would allow for the following:

slurp infile | sed 1,5d > outfile

slurp would read a file and print it to stdout while removing it from
disk at the same time. The obstacle to writing slurp is that truncate
can only work at the end of the file and not the head. This lead me to
think of using tac and truncate, but the re-reversing of the input file
cannot be done without creating a temp file or loading it completely in
memory.

For slurp to work it would have to be a very low-level utility, or even
a syscall. It would essentially 'cdr' down the block chain of a file on
the filesystem, updating the head pointer and freeing blocks as it
iterates through the file, tossing the blocks in the garbage as it goes.
There would be no turning back after a call to slurp: the input file
would be gone unless you captured the output.

A final, working solution is to use mmap and ftruncate, if you have a
64-bit system that can map 100GB of address space, using copyback-mmap.c:

------------------------------------------------------------------------
#include <stdlib.h>
#include <sys/stat.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char** argv) {
    char* infile = argv[1];
    int skip_bytes = atoi(argv[2]);

    struct stat buf;
    stat(infile, &buf);
    off_t total_bytes = buf.st_size;

    int in = open(infile, O_RDWR);

    char* bytes = mmap(NULL, total_bytes, PROT_READ | PROT_WRITE,
                       MAP_SHARED, in, 0);

    off_t bytes_to_copy = total_bytes - skip_bytes;

    /* Shift everything after the skipped region to the front */
    memmove(bytes, bytes + skip_bytes, bytes_to_copy);

    if (msync(bytes, total_bytes, MS_SYNC) < 0) perror("msync");

    munmap(bytes, total_bytes);
    fsync(in); /* just to be sure */

    ftruncate(in, bytes_to_copy);

    close(in);
    return 0;
}
------------------------------------------------------------------------

You can use this with the following script:

------------------------------------------------------------------------
#!/bin/bash

input=$1
nlines=5

# Find the number of bytes to skip for the first 5 lines
skip_bytes=$(head -n "$nlines" < "$input" | wc -c)

copyback-mmap "$input" "$skip_bytes"
------------------------------------------------------------------------
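
For reference, building and invoking the helper directly might look
like this (a sketch; the 27 is a hypothetical byte count for the
first 5 lines):

cc -O2 -o copyback-mmap copyback-mmap.c
./copyback-mmap datafile 27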

The operation of copyback-mmap is totally dependent on the
implementation of the host operating system's virtual memory manager. If
you have a quality OS, I think it should work.

A long post, but I think I finally might have gotten to the solution.
Hopefully it works for you. As usual, there are no guarantees or
warranties of fitness for purpose for the code, and I assume no
liability if you choose to use it on your possibly valuable 100GB of
data.

--
Burton Samograd

Ivan Shmakov

Feb 7, 2013, 1:12:01 PM
>>>>> Barry Margolin <bar...@alum.mit.edu> writes:
>>>>> Ed Morton <morto...@gmail.com> wrote:

[Assuming that a mention of "core" makes this one fit for AFC.]

>> If I have a file that's 100G in size and I have 50G memory total
>> left on my machine, how do I delete the first 5 lines from that
>> file?

[...]

> You seem to be confusing disk space with memory, but everyone seems
> to have understood you, and I'm certainly not going to nit-pick. You
> got lucky, there are usually pedants just waiting to pounce on errors
> like this.

Now, what to do with all those folks who still examine their
/core/ files, now that "core memory" hasn't been used for some
30 years?

Moreover, I'd argue that the proper term for the concept is
"filesystem space", as filesystems of today may reside on disks
(as in: HDD), flash chips (as in: SSD), or entirely in RAM (as
in: tmpfs; unless swap is used, that is). Which also means that
the du(1) and df(1) commands are somewhat ill-named, and should
probably be renamed to, say, fsu(1) and fsf(1).

[...]

PS. FWIW, my own definition of "memory" would allow for any device
capable of storing information. And so would the one given in
the Wikipedia article [1].

[1] http://en.wikipedia.org/wiki/Computer_memory

--
FSF associate member #7257

Janis Papanagnou

Feb 7, 2013, 2:30:33 PM
Here's my try with a ksh93 loop, starting the read operations from the
pattern "20" as example...

while IFS= read -r line
do printf "%s\n" "$line"
done <afile <#"20" <>afile >#((0))
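
(My reading of those redirections: <afile opens afile for input,
<#"20" seeks the read position forward to the first line matching
"20", <>afile opens the file again for reading and writing without
truncation, and >#((0)) rewinds that descriptor to byte 0, so the
loop overwrites the file from the top with the lines it reads from
further down.)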

Sample data file is created by...

seq 1 2000 > afile

Data before the call of above script...

$ hat afile
1
2
3
4
5
...
1996
1997
1998
1999
2000

and after the call...

$ hat afile
20
21
22
23
24
...
1996
1997
1998
1999
2000

Ed Morton

Feb 7, 2013, 9:48:30 PM
Thanks all for the pointers and scripts.

Ed.

Shmuel Metz

Feb 7, 2013, 9:47:02 PM
In <87vca4d...@violet.siamics.net>, on 02/07/2013
at 06:12 PM, Ivan Shmakov <onei...@gmail.com> said:

> Now, what to do with all those folks who still examine their
> /core/ files, now that "core memory" hasn't been used for some
> 30 years?

Weren't they still using core in the 1980s for high-radiation
environments and for applications requiring nonvolatile memory?

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to spam...@library.lspace.org

Geoff Clare

Feb 8, 2013, 8:26:07 AM
Burton Samograd wrote:

> /* Use buffered files to keep things simpler */
> FILE* in = fopen(infile, "r");
> fseek(in, skip_bytes, SEEK_SET);
> FILE* out = fopen(infile, "a");
> fseek(out, 0, SEEK_SET);
[...]
> The problem is that, at least on OpenBSD, opening the file in append
> mode will always writie to the end of the file and the fseek call is
> ignored.

If you want to fopen() a file for normal writing (not appending) without
truncating it, use mode "r+".

--
Geoff Clare <net...@gclare.org.uk>

Burton Samograd

Feb 8, 2013, 10:01:34 AM
D'oh. I was sure that existed but I missed it on the man page. Thanks
for the reminder. :)

--
Burton Samograd

Scott Lurndal

Feb 8, 2013, 10:38:00 AM
Shmuel (Seymour J.) Metz <spam...@library.lspace.org.invalid> writes:
>In <87vca4d...@violet.siamics.net>, on 02/07/2013
> at 06:12 PM, Ivan Shmakov <onei...@gmail.com> said:
>
>> Now, what to do with all those folks who still examine their
>> /core/ files, now that "core memory" hasn't been used for some
>> 30 years?
>
>Weren't they still using core in the 1980's for high radiation
>environments and for applications requiring nonvolatile memory?
>

If I recall correctly, the original space shuttle computers used
core memory.

From Wikipedia:

For example, the Space Shuttle flight computers initially used core memory,
which preserved the contents of memory even through the Challenger's disintegration
and subsequent plunge into the sea in 1986

Janis Papanagnou

Feb 9, 2013, 5:40:15 AM
On 07.02.2013 20:30, Janis Papanagnou wrote:
> On 07.02.2013 18:03, Janis Papanagnou wrote:
>> Am 07.02.2013 15:32, schrieb Ed Morton:
>>> On 2/7/2013 8:28 AM, pk wrote:
>>>> On Thu, 07 Feb 2013 08:18:30 -0600, Ed Morton <morto...@gmail.com>
>>>> wrote:
>>>>
>>>>> If I have a file that's 100G in size and I have 50G memory total left on
>>>>> my machine, how do I delete the first 5 lines from that file?
>>>>>
>>>>> [...]
>>>>
>>>> Count how many bytes those 5 lines take, and write a program (in C or
>>>> whatever) that moves everything starting from byte N+1 back N bytes,
>>>> finally call truncate() to remove the excess N bytes at the end.
>>>> Obviously, during the whole process the file is inconsistent.
>>>
>>> Thanks, but I'm kinda hoping there's a shell command to do that or a
>>> tool that does real in-place editing.
>>
>> I don't think that there's a standard shell solution.
>>
>> Where I'd look into would be the inplace r/w redirection operations in
>> conjunction with the seek facilities of ksh93. [...]
>
> Here's my try with a ksh93 loop, [...]

Completing the script to determine the starting point and truncate the
final data, e.g. as in:

skiplines=5
content=$( sed -n "$((skiplines+1)){p;q}" afile ) ## ???
position=$( grep -b "${content}" afile | head -1 ) ## ???
offset=${position%:*}

while IFS= read -r line
do printf "%s\n" "${line}"
done <afile <#((offset)) <>afile >#((0))

truncate -s-"${offset}" afile

I wonder whether there's a more appropriate tool to do the sed and grep
pre-processing, maybe in one command in a single early terminating pass.
Above, sed is used to obtain the pattern in the line to start with, and
grep -b to get the byte offset for the final truncate. Can sed (or some
other Unix tool) determine the byte offset of a matching pattern or of
an actual line number in a text file?

Janis

Icarus Sparry

Feb 9, 2013, 6:09:14 PM
I would have used sed or head to get the first n lines, and piped them to
wc to count the characters.
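
In other words, something like (using Janis's afile example):

skiplines=5
offset=$(head -n "$skiplines" afile | wc -c)  # bytes before the first kept line

which gets the offset in a single early-terminating pass.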

Jon LaBadie

Feb 10, 2013, 11:18:51 AM
Not tested on a space-strapped system, but it worked on Janis' "seq 1 2000 > data" test file.

tail -n +6 ff | tee ff > /dev/null

Jon

Luuk

Feb 10, 2013, 1:33:20 PM
tested, seems to work!.... ;)

opensuse:/home/luuk/tmp/mnt # ls -la
total 53
drwxr-xr-x 3 root root 1024 Feb 10 19:30 .
drwxrwxr-x 37 luuk users 36864 Feb 10 19:26 ..
drwx------ 2 root root 12288 Feb 10 19:25 lost+found
opensuse:/home/luuk/tmp/mnt # seq 1 15000 >data
opensuse:/home/luuk/tmp/mnt # df | grep loop0
/dev/loop0 121 93 22 81% /home/luuk/tmp/mnt
opensuse:/home/luuk/tmp/mnt # tail -n +6 data | tee data >/dev/null
opensuse:/home/luuk/tmp/mnt # head data
6
7
8
9
10
11
12
13
14
15
opensuse:/home/luuk/tmp/mnt # df | grep loop0
/dev/loop0 121 87 28 76% /home/luuk/tmp/mnt
opensuse:/home/luuk/tmp/mnt #

Janis Papanagnou

Feb 10, 2013, 2:15:21 PM
$ seq 1 2000 >data
$ tail -n +6 data | tee data >/dev/null

With 2000 data lines this creates an empty (destroyed) data file
when run with ksh or bash on my system. With larger data files the
files are corrupted, also with zsh.

I think you cannot guarantee any order in which the processes in
the pipeline are run, and you cannot guarantee when the data file
will be reset by the one process or read by the other.

You shouldn't try to tackle the OP's problem with such code. I am
sure only in-place file manipulations will solve the OP's problem
reliably.

Janis

>
> Jon
>

Robert Bonomi

Feb 10, 2013, 6:23:23 PM
In article <kf0d2p$vd7$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
>If I have a file that's 100G in size and I have 50G memory total left on my
>machine, how do I delete the first 5 lines from that file?
>
>I can't manually create a tmp file as I don't have enough memory; similarly, sed
>-i will fail to create its under-the-table temp file, and ed will fail to load
>my original file into its buffer.
>
>What's left?

Rude, crude, and *ugly* but should work:

count=`head -5 ${file} |wc -c`
dd if=${file} bs=${count} skip=1 of=${file}

Testing (FBSD 7.2) shows same inode and creation timestamp, so it should
'do the right thing' with a large file.


Test in your environment before using on irreplaceable data.

Using if=/of= instead of redirection is important. So is the order.




Geoff Clare

Feb 11, 2013, 8:42:53 AM
On Linux and Solaris, the file ends up empty. Perhaps FBSD's dd has
a special case for when the input and output files are the same file.

(Also note that for Solaris you need to remove the leading spaces
from the output of wc.)

Adding conv=notrunc would allow dd to overwrite the file as desired,
but it would not change the file size, resulting in some left-over
data at the end.
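
Putting those two observations together, something like this should
work with GNU dd and GNU truncate (an untested sketch building on the
command above):

count=`head -5 ${file} | wc -c`
dd if=${file} bs=${count} skip=1 of=${file} conv=notrunc
truncate -s "-${count}" ${file}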

--
Geoff Clare <net...@gclare.org.uk>

Robert Bonomi

Feb 16, 2013, 12:49:16 PM
In article <t39nu9-...@leafnode-msgid.gclare.org.uk>,
Geoff Clare <net...@gclare.org.uk> wrote:
>Robert Bonomi wrote:
>
>> In article <kf0d2p$vd7$1...@dont-email.me>,
>> Ed Morton <morto...@gmail.com> wrote:
>>>If I have a file that's 100G in size and I have 50G memory total left on my
>>>machine, how do I delete the first 5 lines from that file?
>>>
>>>I can't manually create a tmp file as I don't have enough memory; similarly, sed
>>>-i will fail to create its under-the-table temp file, and ed will fail to load
>>>my original file into its buffer.
>>>
>>>What's left?
>>
>> Rude, crude, and *ugly* but should work:
>>
>> count=`head -5 ${file} |wc -c`
>> dd if=${file} bs=${count} skip=1 of=${file}
>>
>> Testing (FBSD 7.2) shows same inode and creation timestamp, so it should
>> 'do the right thing' with a large file.
>>
>>
>> Test in your environment before using on irreplaceable data.
>
>On Linux and Solaris, the file ends up empty. Perhaps FBSD's dd has
>a special case for when the input and output files are the same file.

Did you use 'if='/'of='? vs '<'/'>'? (the latter will *not* work)

If the former, try reversing the order of the two options.
There is a dependency in the order of the file open operations, and, therefore,
on the direction of option processing.

There is also a dependency on when the truncate() is done.
One can do:
    skip
    seek
    truncate
    copy
    close
or:
    skip
    seek
    copy
    truncate
    close

I haven't gone through the source, but FBSD dd(1) appears to use the
latter logic. No need to special-case in == out; it "just works". <grin>

It wouldn't surprise me if 'classic dd' used 'open with truncate' unless
'seek' or 'notrunc', which were special-cased.

>(Also note that for Solaris you need to remove the leading spaces
>from the output of wc.)

'shell dependent' whether leading whitespace is stripped in `` substitution,
probably. :)

Geoff Clare

Feb 20, 2013, 8:55:13 AM
Robert Bonomi wrote:

> In article <t39nu9-...@leafnode-msgid.gclare.org.uk>,
> Geoff Clare <net...@gclare.org.uk> wrote:
>>Robert Bonomi wrote:
>>
>>> In article <kf0d2p$vd7$1...@dont-email.me>,
>>> Ed Morton <morto...@gmail.com> wrote:
>>>>If I have a file that's 100G in size and I have 50G memory total left on my
>>>>machine, how do I delete the first 5 lines from that file?
>>>>
>>>>I can't manually create a tmp file as I don't have enough memory; similarly, sed
>>>>-i will fail to create its under-the-table temp file, and ed will fail to load
>>>>my original file into its buffer.
>>>>
>>>>What's left?
>>>
>>> Rude, crude, and *ugly* but should work:
>>>
>>> count=`head -5 ${file} |wc -c`
>>> dd if=${file} bs=${count} skip=1 of=${file}
>>>
>>> Testing (FBSD 7.2) shows same inode and creation timestamp, so it should
>>> 'do the right thing' with a large file.
>>>
>>>
>>> Test in your environment before using on irreplaceable data.
>>
>>On Linux and Solaris, the file ends up empty. Perhaps FBSD's dd has
>>a special case for when the input and output files are the same file.
>
> Did you use 'if='/of=' ?? vs '<'/'>'?? (the latter will *not* work)

I used exactly the dd command from your post (using copy&paste to
ensure I made no typos), preceded by file= and count= assignments.

> If the former, try reversing the order of the two options.
> There is a dependency in the order of the file open operations, and, therefore,
> on the direction of option processing.

Putting of= before if= would be less likely to work, but I tried it anyway
and got the same result.

> It wouldn't surprise me if 'classic dd' used 'open with truncate' unless
> 'seek' or 'notrunc', which were special-cased.

Using truss on Solaris, and strace on Linux, shows that this is indeed
what's happening:

Solaris:

open64("t1", O_RDONLY) = 3
creat64("t1", 0666) = 4

Linux:

open("t1", O_RDONLY|O_LARGEFILE) = 3
open("t1", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3

(For those who spotted that both opens return 3 on Linux and wonder why,
it's because there is a dup2(3, 0) and close(3) between them.)

>>(Also note that for Solaris you need to remove the leading spaces
>>from the output of wc.)
>
> 'shell dependent' whether leading whitespace is stripped in `` substitution,
> probably. :)

Nothing so subtle. The output from Solaris wc has leading whitespace
whereas the output from Linux wc does not.

--
Geoff Clare <net...@gclare.org.uk>