
Awk on very large files with parallel


charlemagn...@gmail.com

Jun 27, 2017, 11:37:25 AM
This is a technique for using awk on very large files (tens of gigabytes) that are too big to fit into memory or too slow to process in one pass. Some awk operations, such as emulating sort and uniq, require the whole file to be held in memory; others may simply run too slowly.

Examples:

Uniq, i.e. "uniq file":
awk '\!s[$0]++' file

Uniq and sort, i.e. "sort file | uniq" (uses gawk's asorti):
awk '{\!s[$0]++};END{asorti(s,sd);for(e in sd) print sd[e]}' file

The technique uses GNU parallel to split the large file into roughly 1 MB blocks, process each block in parallel (fast, low memory!), then recombine the results into a single file. This is derived from the parallel man page.

Awk uniq with parallel:

cat file | parallel --pipe --files awk\ \'\\\!s\[\$0\]\+\+\'\ | parallel -Xj1 cat {} ';' rm {}

--pipe takes input from stdin, in this case the cat command

--files means output 1M blocks from the input file in a temporary filename in /tmp or wherever --tmpdir is set to.

awk is the Unix command to run on each block. The command requires backslash escapes for spaces and non-letter characters, including a trailing space. It is ugly, but it can be simplified in a wrapper script.
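
For example, putting the awk program in its own file and running it with awk -f avoids most of the escaping (a rough, untested sketch; the file name uniq.awk is made up):

  # uniq.awk -- print each line only the first time it is seen
  !s[$0]++

  cat file | parallel --pipe --files awk -f uniq.awk | parallel -Xj1 cat {} ';' rm {}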

The file names in /tmp are passed to the second parallel which cat's each temp file to stdout, then rm's them.

System resources can be controlled in the first parallel with -j (max number of awk jobs to run at once) or with --delay (seconds to pause between each awk command). Normally parallel runs as many jobs at once as there are CPUs.
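
For example, to run at most 4 awk jobs with a half-second pause between launches (a sketch; the numbers are arbitrary, uniq.awk as in the sketch above):

  cat file | parallel -j4 --delay 0.5 --pipe --files awk -f uniq.awk | parallel -Xj1 cat {} ';' rm {}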

For a progress bar showing percent complete, install 'pv' (apt-get install pv), then:

pv file | parallel --pipe --files awk\ \'\\\!s\[\$0\]\+\+\'\ | parallel -Xj1 cat {} ';' rm {} > out

Kaz Kylheku

Jun 27, 2017, 11:48:29 AM
On 2017-06-27, charlemagn...@gmail.com <charlemagn...@gmail.com> wrote:
> This is a technique for awk to deal with very large files (10s of gigs) that are too big to fit into memory or simply too slow. Some awk operations require the whole file to be in memory, such as sort and uniq. Or the operation may just run too slow.
>
> Examples:
>
> Uniq ie. "uniq file"
> awk '\!s[$0]++' file
       ^

What???

$ awk '\!seen[$0] {blah}'
awk: \!seen[$0] {blah}
awk: ^ backslash not last character on line

> cat file | parallel --pipe --files awk\ \'\\\!s\[\$0\]\+\+\'\ | parallel -Xj1 cat {} ';' rm {}

You're gonna face a backslash over this posting.

charlemagn...@gmail.com

Jun 27, 2017, 12:53:47 PM
On Tuesday, June 27, 2017 at 11:48:29 AM UTC-4, Kaz Kylheku wrote:

> You're gonna face a backslash over this posting.

Ahh, sorry, I'm the last man on earth not using bash :) Adjust as needed.

Janis Papanagnou

Jun 27, 2017, 1:07:28 PM
On 27.06.2017 17:48, Kaz Kylheku wrote:
> On 2017-06-27, charlemagn...@gmail.com <charlemagn...@gmail.com> wrote:
>> This is a technique for awk to deal with very large files (10s of gigs)
>> that are too big to fit into memory or simply too slow. Some awk operations
>> require the whole file to be in memory, such as sort and uniq. Or the
>> operation may just run too slow.

Without diving into "parallel" I fail to see how that should work in the
general case when there's no significant compactification in effect; I'd
think you'd need an external solution (e.g. a Merge Sort[*], as opposed
to parallelizing). Or mind to explain how the sorted file components will
merge to an overall sorted result file in case of using "parallel"?

>>
>> Examples:
>>
>> Uniq ie. "uniq file"
>> awk '\!s[$0]++' file
> ^
>
> What???
>
> $ awk '\!seen[$0] {blah}'
> awk: \!seen[$0] {blah}
> awk: ^ backslash not last character on line
>
>> cat file | parallel --pipe --files awk\ \'\\\!s\[\$0\]\+\+\'\ | parallel -Xj1 cat {} ';' rm {}

(Looks like a UUOC and UUO"--pipe". No?)

>
> You're gonna face a backslash over this posting.

Without diving into parallel details, I suppose using "awk -f" would probably
remove most of this horrible quoting mess.

Anyway, I suppose using an external sort/uniq would probably be faster, more
flexible, and more legible. And the (GNU) sort I am using on my system even
seems to have a concurrency option --parallel=N in case we need performance.
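
For example (a sketch; the numbers are arbitrary):

  sort -u --parallel=4 -S 2G file > file.sorted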

With all that shell-level code presented here, is there any advantage to using
the "parallel" tool (with or without all those quotes) instead of using the
shell tools directly?

Janis

[*] https://en.wikipedia.org/wiki/Merge_sort

Janis Papanagnou

Jun 27, 2017, 1:15:58 PM
I am also not using bash (but still another POSIX shell). Do you mean
that this is a bash issue, or are you working in an MS environment where
all the quoting mess is a consequence of that? In that case the awk -f
suggestion that is standard for MS command interpreters is advisable for
your environment. All POSIX shells behave very similarly WRT quoting.

Janis

Kenny McCormack

Jun 27, 2017, 1:20:21 PM
In article <900791a4-189c-459f...@googlegroups.com>,
Heh - I still prefer tcsh (it is a better shell) and use it when I can.
Unfortunately, since most systems these days come with bash as default, it
can be painful to have to re-setup/re-configure each new system you come
across. It's easier to just do as the natives do. So, one learns just
enough bash to get along...

As far as this thread itself goes, I assume it was more of a demo of 'parallel',
rather than a fully serious recommendation to the old-hands found here (I'm
looking at you, J) to change their ways. TBH, as much as I admire
'parallel' from what I've read of it, it looks like a steep learning curve in
order to really figure out what it is about. I.e., you'd really have to be
serious about wanting to learn it - i.e., you'd really have to have some
big tasks on your hands - in order for it to be worth the journey.

--
"I have a simple philosophy. Fill what's empty. Empty what's full. And
scratch where it itches."

Alice Roosevelt Longworth

Kenny McCormack

Jun 27, 2017, 1:21:28 PM
In article <oiu3sc$492$1...@news-1.m-online.net>,
No, we're talking about (t)csh.

You can run along, now.

(MS's got nothin' to do with it.)

--
If Jeb is Charlie Brown kicking a football-pulled-away, Mitt is a '50s
housewife with a black eye who insists to her friends the roast wasn't
dry.

Kaz Kylheku

Jun 27, 2017, 1:23:42 PM
On 2017-06-27, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 27.06.2017 18:53, charlemagn...@gmail.com wrote:
>> On Tuesday, June 27, 2017 at 11:48:29 AM UTC-4, Kaz Kylheku wrote:
>>
>>> You're gonna face a backslash over this posting.
>>
>> Ahh sorry I'm the last man on earth not using bash :) Adjust as needed.
>
> I am also not using bash (but still another POSIX shell). Do you mean

awk '\!expr ...'

is either non-POSIX awk, non-POSIX shell, or both.

Kenny McCormack

Jun 27, 2017, 1:27:27 PM
In article <201706271...@kylheku.com>,
We're talking about (t)csh here.

You can run along, now.

Janis Papanagnou

Jun 27, 2017, 1:33:03 PM
On 27.06.2017 19:21, Kenny McCormack wrote:
> In article <oiu3sc$492$1...@news-1.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 27.06.2017 18:53, charlemagn...@gmail.com wrote:
>>> On Tuesday, June 27, 2017 at 11:48:29 AM UTC-4, Kaz Kylheku wrote:
>>>
>>>> You're gonna face a backslash over this posting.
>>>
>>> Ahh sorry I'm the last man on earth not using bash :) Adjust as needed.
>>
>> I am also not using bash (but still another POSIX shell). Do you mean
>> that this is a bash issue, or are you working in a MS environment and
>> all the quoting mess is a consequence of that? In this case the awk -f
>> standard suggestion for MS command interpreters is advisable for your
>> environment. All POSIX shells behave widely similary WRT quoting.
>
> No, we're talking about (t)csh.

Aha, that wasn't clear to me from the OP's posting. I suppose "awk -f"
will nonetheless also solve most of those quoting issues in csh then.
No?

Generally, I am interested to understand how much of the presented quoting
mess is a consequence of the shell (tcsh, as you say), and how much is a
result of using "parallel".

Janis

Ben Bacarisse

Jun 27, 2017, 2:19:39 PM
charlemagn...@gmail.com writes:

> This is a technique for awk to deal with very large files (10s of
> gigs) that are too big to fit into memory or simply too slow. Some awk
> operations require the whole file to be in memory, such as sort and
> uniq. Or the operation may just run too slow.
>
> Examples:
>
> Uniq ie. "uniq file"
> awk '\!s[$0]++' file

Eh?

> Awk uniq with parallel:
>
> cat file | parallel --pipe --files awk\ \'\\\!s\[\$0\]\+\+\'\ | parallel -Xj1 cat {} ';' rm {}
>
> --pipe takes input from stdin, in this case the cat command
>
> --files means output 1M blocks from the input file in a temporary
> filename in /tmp or wherever --tmpdir is set to.

That's not what the man page says. --pipe splits the input into blocks
of records.

> awk is the unix command to run on the block. The command requires \
> escapes for spaces and non-letter characters including a trailing
> space. It is ugly but can be simplified in a wrapper script.

The quoting is not right here.

> The file names in /tmp are passed to the second parallel which cat's
> each temp file to stdout, then rm's them.

So it won't work. You need to run uniq on the concatenation of the now
processed data blocks.
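
I.e. something along these lines (an untested sketch; uniq.awk stands for a
made-up file containing the !s[$0]++ one-liner):

  cat file | parallel --pipe --files awk -f uniq.awk | parallel -Xj1 cat {} ';' rm {} | awk -f uniq.awk > out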

<snip>
--
Ben.

charlemagn...@gmail.com

Jun 27, 2017, 2:25:28 PM
On Tuesday, June 27, 2017 at 1:07:28 PM UTC-4, Janis Papanagnou wrote:
> Or mind to explain how the sorted file components will
> merge to an overall sorted result file in case of using "parallel"?

Right, I only showed the example for uniq. For sort, instead of doing cat in the second parallel, use sort -m, which does a k-way merge (along with -S to limit memory usage). It opens all the temp files, reads the first line of each, sorts and outputs that set, and so on. One could use sort in the first parallel as well, but I was trying to show how to use awk + parallel for large jobs. Awk has the advantage of being able to do many things, not just sorting.

The backslash escaping is a relic of parallel, not the shell (except the !, which is tcsh-specific, as Kenny rightly picked up). It is solvable in other ways, e.g. with awk -f as you say.

My explanation of --pipe was incomplete: it means "spread jobs into multiple blocks from stdin", not merely "read from stdin".

Kenny McCormack

Jun 27, 2017, 2:35:00 PM
In article <oiu4sc$4f8$1...@news-1.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
...
>I suppose "awk -f" will nonetheless also solve most of those quoting
>issues in csh then. No?

I don't think we're here to discuss shell quoting issues. If people want
to continue in this vein, they should copy this thread over to a shell
group and continue it there.

>Generally I am interested to understand what of the presented quoting
>mess is a consequence of the shell (tcsh, as you say), and what is a
>result of using "parallel".

I don't think we're here to discuss shell quoting issues. If people want
to continue in this vein, they should copy this thread over to a shell
group and continue it there.

The shell quoting issues were primarily an artifact of the OP trying to
post the whole thing in one go - i.e., as a single stream rather than
presenting it as multiple files (which is difficult to do in a medium like
Usenet). I would imagine that if anyone (including OP himself) were to
actually implement this, they would, of course, put things into multiple files.

And also, of course, there is the fact that Usenetters tend to have a
visceral/emotional response/reaction to anything that smacks of a
non-sh-like shell being used.

--
The plural of "anecdote" is _not_ "data".

charlemagn...@gmail.com

Jun 27, 2017, 2:44:31 PM
On Tuesday, June 27, 2017 at 2:19:39 PM UTC-4, Ben Bacarisse wrote:

> > The file names in /tmp are passed to the second parallel which cat's
> > each temp file to stdout, then rm's them.
>
> So it won't work. You need to run uniq on the concatenation of the now
> processed data blocks.

You're right. Hmm... the uniq example won't work.

A sort example using GNU sort:

cat $1 | parallel --pipe --files sort -S512M | parallel -Xj1 sort -S1024M -m {} ';' rm {}

A sort example using awk:

cat $1 | parallel --pipe --files awk <stuff> | parallel -Xj1 sort -S1024M -m {} ';' rm {}

For other awk commands, the sort in the second parallel could be replaced with cat, depending on what the awk command is doing.
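
E.g. a per-block filter whose results can simply be concatenated (a sketch, with bash quoting; the pattern is made up):

  cat file | parallel --pipe --files "awk '/ERROR/'" | parallel -Xj1 cat {} ';' rm {} > matches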

Janis Papanagnou

Jun 27, 2017, 3:23:27 PM
On 27.06.2017 20:34, Kenny McCormack wrote:
> In article <oiu4sc$4f8$1...@news-1.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> ...
>> I suppose "awk -f" will nonetheless also solve most of those quoting
>> issues in csh then. No?
>
> I don't think we're here to discuss shell quoting issues. If people want
> to continue in this vein, they should copy this thread over to a shell
> group and continue it there.

Better tell that to the OP, in the first place, who was the inventor of
all that shell stuff here.

>
>> Generally I am interested to understand what of the presented quoting
>> mess is a consequence of the shell (tcsh, as you say), and what is a
>> result of using "parallel".
>
> I don't think we're here to discuss shell quoting issues. If people want
> to continue in this vein, they should copy this thread over to a shell
> group and continue it there.

In that case the OP should rather have posted this in comp.unix.shell, so
better tell that to the OP, in the first place, who was the inventor of
all that shell stuff here.

>
> The shell quoting issues were primarily an artifact of the OP trying to
> post the whole thing in one go - i.e., as a single stream rather than
> presenting it as multiple files (which is difficult to do in a medium like
> Usenet). I would imagine that if anyone (including OP himself) were to
> actually implement this, they would, of course, put things into multiple files.
>
> And also, of course, is the fact that Useneters tend to have a
> visceral/emotional response/reaction to anything that smacks of a
> non-sh-like shell being used.

One (additional) problem with the OP's code/posting is that it's not even
valid awk code (as Kaz has shown), and I recall that in former times you
complained here about such shell code and about invalid awk code (due to
shell effects). I also see emotional responses here, specifically from
yourself, where you are obviously biased depending on whether a poster is
using your favourite shell or not. I don't care what you (or the OP) are
actually using. But I do care to understand whether any (even severe) issues
with posted code stem from a specific environment, or from the advertised
tool, or from awk.

Janis

Janis Papanagnou

Jun 27, 2017, 3:25:07 PM
I see. Thanks for clarifying all the non-obvious issues.

Janis

charlemagn...@gmail.com

Jun 27, 2017, 3:59:25 PM
On Tuesday, June 27, 2017 at 3:23:27 PM UTC-4, Janis Papanagnou wrote:
>
> Better tell that to the OP, in the first place, who was the inventor of
> all that shell stuff here.

The shell escaping is not obvious and took me a while to figure out, so I thought it would be worthwhile including in the example, since this is the awk forum. Normally parallel doesn't require escaping, and I still haven't figured out why it needs it for awk, but it is probably something to do with the single quotes used by awk conflicting with parallel's internal use. The escaping is the same for bash and tcsh, except that tcsh uses \\\! while bash uses \\! for that uniq example, which doesn't work anyway, so we can move on from that one hopefully.

Markus Gnam

Jun 27, 2017, 4:48:25 PM
As Janis already mentioned, it seems what you are looking for is a merge sort
algorithm. Your latest code shown on my "A list comparison tool" thread fails
for big files. I use a custom merge sort to deal with this problem in my code.
An interesting topic: I think this issue can't be solved without a merge sort.

charlemagn...@gmail.com

Jun 27, 2017, 5:19:04 PM
On Tuesday, June 27, 2017 at 4:48:25 PM UTC-4, Markus Gnam wrote:

> As Janis already mentioned, it seems what you are looking for is a merge sort
> algorithm.

I posted in reply to Janis that for sorting you can use "sort -m", which is a merge sort, and then posted an example of it in another post. But not everything needs sorting; it depends on the awk command in the first parallel.

> Your latest code shown on my "A list comparison tool" thread fails
> for big files.

This is true; it works 99% of the time for me since most files don't exceed memory, but if I need it, now I know your library can handle large files, which is good. Can it adjust the temporary file location and memory usage limits?

Markus Gnam

Jun 27, 2017, 5:35:01 PM
Yes, memory usage limits can be adjusted as an option.
The temporary file location is the Windows Temp path, which can't be adjusted
at the moment. However, please let us not talk about this tool any longer :-)
I'm glad the old thread is finally finished.

I don't know much about parallel programming with AWK yet.
It sounds really interesting.

Ian Zimmerman

Jun 27, 2017, 7:22:41 PM
On 2017-06-27 19:07, Janis Papanagnou wrote:

> I'd think you'd need an external solution (e.g. a Merge Sort[*], as
> opposed to parallelizing).

Isn't merge sort essentially what sort(1) does, at least for inputs
large enough to not fit in RAM?

--
Please *no* private Cc: on mailing lists and newsgroups
Personal signed mail: please _encrypt_ and sign
Don't clear-text sign:
http://primate.net/~itz/blog/the-problem-with-gpg-signatures.html

Janis Papanagnou

Jun 27, 2017, 7:59:25 PM
On 28.06.2017 01:22, Ian Zimmerman wrote:
> On 2017-06-27 19:07, Janis Papanagnou wrote:
>
>> I'd think you'd need an external solution (e.g. a Merge Sort[*], as
>> opposed to parallelizing).
>
> Isn't merge sort essentially what sort(1) does, at least for inputs
> large enough to not fit in RAM?

I'd hope and expect sort(1) would do such a thing - since the algorithms
have long existed - but I don't know. With memory-optimized usage it can
use any ordinary N*log(N) algorithm to build maximum-length runs of data
and merge those in as few passes as necessary using merge sort. But for
data that already fits in memory it would usually be faster to just skip
the [external] merge sort and use (e.g.) quicksort.
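
The classic external pattern looks roughly like this with the standard tools
(a sketch; the chunk size is arbitrary):

  split -l 1000000 big.txt chunk.
  for f in chunk.*; do sort -o "$f" "$f"; done
  sort -m chunk.* > big.sorted
  rm chunk.*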

Janis

Joe User

Jun 27, 2017, 9:34:43 PM
charlemagn...@gmail.com wrote:

> Awk uniq with parallel:
>
> cat file | parallel --pipe --files awk\ \'\\\!s\[\$0\]\+\+\'\ | parallel
> -Xj1 cat {} ';' rm {}
>

Just a friendly warning about parallel:

I was using Debian Jessie.

I had a script to poll 20 URLs (with curl) every so often.

It looked like parallel would allow all of the accesses to be done in
parallel with timeouts and so on, with returned text being merged in order.
It seemed like a simple thing, and very useful.

It reliably crashed my operating system, every time. So, I put parallel
away as a good idea that can do bad things.

I tried using ulimit to limit the number of spawned subprocesses and memory
use, but I just couldn't get it to work. I'd be curious to hear if you get
a useful implementation. Maybe I need to dig it up now that I have upgraded
to Debian Stretch. It didn't work on Jessie.


charlemagn...@gmail.com

Jun 28, 2017, 11:05:27 AM
On Tuesday, June 27, 2017 at 9:34:43 PM UTC-4, Joe User wrote:

> I was using Debian Jessie.
>
> I had a script to poll 20 URL's (with curl) every so often.
>
> It looked like parallel would allow all of the accesses to be done in
> parallel with timeouts and so on, with returned text being merged in order.
> It seemed like a simple thing, and very useful.

I do this very thing with parallel on Mint (Ubuntu) in a VirtualBox, checking upwards of 2 URLs per second for 24 hrs at a time. There are dozens of log files being updated, so huge amounts of data with open files. It works perfectly; I've never had a problem, and it's remarkable how reliable it's been. Not sure what's happening in your case; it certainly stresses many parts of the system. Maybe try the --delay option so it checks one URL every X seconds (or sub-seconds), and combine it with -j to control how many are running at once.
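
E.g. something like (a sketch; urls.txt, poll.log and the numbers are made up):

  parallel -j4 --delay 0.5 curl -s -m 10 {} :::: urls.txt >> poll.log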

Bruce Horrocks

Jun 28, 2017, 4:52:39 PM
On 28/06/2017 02:34, Joe User wrote:
> I had a script to poll 20 URL's (with curl) every so often.
>
> It looked like parallel would allow all of the accesses to be done in
> parallel with timeouts and so on, with returned text being merged in order.

Text retrieved by curl won't be in order, by definition, because the
processes are running in parallel. You'll need to save output to
numbered files and then join them in sequence afterwards if that is what
you want.
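
E.g. (a sketch; urls.txt and the output file names are made up):

  i=0
  while read -r url; do
      i=$((i+1))
      curl -s "$url" > "out.$i" &
  done < urls.txt
  wait
  for j in $(seq 1 "$i"); do cat "out.$j"; done > combined.txt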

> It seemed like a simple thing, and very useful.
>
> It reliably crashed my operating system, every time. So, I put parallel
> away as a good idea that can do bad things.
>
> I tried using ulimit to limit the number of spawned subprocesses and memory
> uses, but I just couldn't get it to work. I'd be curious to hear if you get
> a useful implementation. Maybe I need to dig it up now that I have upgraded
> to Debian Stretch. It didn't work on Jessie.

Try something like this to run no more than two curls at a time.

# Bash script
# One entry per curl invocation (arguments only; placeholders here).
declare -a arr=("curl command 1"
                "curl command 2"
                "curl command 3"
                "curl command 4"
               )

# Block until fewer than 2 background jobs are running.
function max2 {
    while [ "$(jobs -r | wc -l)" -ge 2 ]   # -r: count only running jobs
    do
        sleep 5
    done
}

for curlcmd in "${arr[@]}"
do
    max2; /usr/bin/curl $curlcmd &
done
wait    # for the last jobs to finish

--
Bruce Horrocks
Surrey
England
(bruce at scorecrow dot com)

Joe User

Jun 28, 2017, 11:39:01 PM
Bruce Horrocks wrote:

> Text retrieved by curl won't be in order, by definition, because the
> processes are running in parallel. You'll need to save output to
> numbered files and then join them in sequence afterwards if that is what
> you want.

The man page for parallel says:

"GNU parallel makes sure output from the commands is the same output as
you would get had you run the commands sequentially. This makes it possible
to use output from GNU parallel as input for other programs."

That was a useful feature for me, but I never could get it to work.
Apparently, parallel manages temporary files as necessary.

Maybe I need to try the newer revision of parallel, with Debian Stretch.




Joe User

Jun 28, 2017, 11:42:21 PM
charlemagn...@gmail.com wrote:

> I do this very thing with parallel on Mint (Ubuntu) in a VirtualBox.
> Checking upwards of 2 URLs per second for 24hrs at a time. There are
> dozens of log files being updated so huge amounts of data with open files.
> Works perfectly never had a problem, it's remarkable how reliable its
> been.

Thanks for telling me that. It's the only success story I've gotten from an
actual user.

I'll have to try a newer version.

charlemagn...@gmail.com

Jun 29, 2017, 1:00:29 AM
It sounds like --keep-order will work: it will "Keep sequence of output same as the order of input". Also combine it with a slight --delay 0.2 so they don't all start simultaneously.
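
E.g. (a sketch; urls.txt is made up):

  parallel --keep-order -j8 --delay 0.2 curl -s {} :::: urls.txt > combined.txt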