read ahead or before

Mag Gam

Jul 26, 2008, 3:02:48 PM

I have been trying to do this instead of placing everything in a hash/
array and compare in the END block.

For example, if I have a file like this

111
2222
333
333
4445
3434

Notice there is a duplicate "333". How can I test if the next line is
the same as the current line? I suppose I can use getline() but is
there another clever way of achieving this?

Also, how can I check for previous line?

TIA

pk

Jul 26, 2008, 3:14:22 PM

On Saturday 26 July 2008 21:02, Mag Gam wrote:

> I have been trying to do this instead of placing everything in a hash/
> array and compare in the END block.
>
> For example, if I have a file like this
>
> 111
> 2222
> 333
> 333
> 4445
> 3434
>
> Notice there is a duplicate "333". How can I test if the next line is
> the same as the current line? I suppose I can use getline() but is
> there another clever way of achieving this?

I don't know if that can be considered more clever, however you can just
save the value of the previous line:

awk '{ if ($0 == prev) {
         # ... this line is the same as the previous line
       }
       prev = $0 }' file

What are you trying to do? What's the underlying problem?

If you just want to remove duplicates, you can do

awk '!a[$0]++' file

> Also, how can I check for previous line?

See above.
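
To make that concrete, here is a minimal sketch (the file name sample.txt and the message texts are made up for illustration). awk only reads forward, so one way to answer "is the next line the same?" without getline is to hold each line back until its successor has been read. Either way only one saved line is kept in memory.

```shell
# Recreate the sample data from the question (sample.txt is a made-up name).
printf '%s\n' 111 2222 333 333 4445 3434 > sample.txt

# Look back: report every line that equals the line before it.
back=$(awk 'NR > 1 && $0 == prev { print NR ": same as previous line" }
            { prev = $0 }' sample.txt)
echo "$back"

# Look ahead: hold each line until its successor has been read, then
# report the held line if the successor turned out to be identical.
ahead=$(awk 'NR > 1 && $0 == held { print (NR - 1) ": same as next line" }
             { held = $0 }' sample.txt)
echo "$ahead"
```

With the sample data this should report line 4 as a repeat of its predecessor and line 3 as having an identical successor.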

--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.

Mag Gam

Jul 27, 2008, 10:05:42 AM

Thanks for the response.

The underlying problem is, the file is huge; it's close to 15g and I
would like to compare.

What I am trying to do is compare the current line to the next line
(or vice versa, the 2nd line to the 1st).

With the hash solution, I was able to get the answer. However my
sysadmin is complaining I am taking up too much memory.

Janis Papanagnou

Jul 27, 2008, 12:11:17 PM

Mag Gam wrote:
> Thanks for the response.

[Please don't top-post!]

>
> The underlying problem is, the file is huge; it's close to 15g and I
> would like to compare.

(Never measured files in grams, so I can't help you here.)

>
> What I am trying to do is compare the current line to the next line
> (or vice versa, the 2nd line to the 1st).

Have you tried pk's proposal? - Which solves what you've asked for.

>
> With the hash solution, I was able to get the answer. However my
> sysadmin is complaining I am taking up too much memory.

You already told us that your own hash solution doesn't fit your
needs. So just use pk's solution. What's the problem?

Janis

loki harfagr

Jul 27, 2008, 1:48:06 PM

On Sun, 27 Jul 2008 18:11:17 +0200, Janis Papanagnou wrote:

> Mag Gam wrote:
>> Thanks for the response.
>
> [Please don't top-post!]
>
>
>> The underlying problem is, the file is huge; it's close to 15g and I
>> would like to compare.
>
> (Never measured files in grams, so I can't help you here.)

Ah Janis, the poor OP wasn't meaning grams but gravitational levels
and under 15 g that's certainly difficult to cure any file ~;O)

>> What I am trying to do is compare the current line to the next line
>> (or vice versa, the 2nd line to the 1st).
>
> Have you tried pk's proposal? - Which solves what you've asked for.
>
>
>> With the hash solution, I was able to get the answer. However my
>> sysadmin is complaining I am taking up too much memory.
>
> You already told us that your own hash solution doesn't fit your needs.
> So just use pk's solution. What's the problem?

I suspect pk's solution (though very good) may, in the OP case,
still consume a lot of memory in the a[] buffer if by 'chance'
the input overgravitated file has a lot of different lines ;-)

If that's the point, here is a possible way to drastically reduce
the memory usage. It is certainly not the golf contest winner of the
month, but it comes quite close to being listed among the samples of
obfuscated style ;D)
Anyway:

$ awk '{n++;n%=1;a[n]=a[n+1];a[n+1]=$0;if(a[n+1]==a[n]){print "Mind the gap";next}}1'

That way, if the OP's sysadmin still has a problem with memory usage, that
leaves us with a few hypotheses: the server might be upgraded from a
Z80-MSX-16KB to a power machine like an AtariST520 or even a PDP-20; or the
sysadmin has to be seen as human and may need some vacation time (like the
week I just had ;-); or maybe the file has extremely looong records...

(to OP: Replace the ``Londoner'' message by whatever you need, other msg or action)
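
For anyone who would rather not decode the one-liner: since n%=1 always leaves n at 0, a[0] holds the previous line and a[1] the current one, so only two lines are ever stored. The following sketch spells that reading out with plain variables (prev and dup are names chosen here, not part of the original, and sample.txt is a made-up file name) and checks that both forms give the same output on the OP's sample data.

```shell
printf '%s\n' 111 2222 333 333 4445 3434 > sample.txt

# The one-liner as posted.
golfed=$(awk '{n++;n%=1;a[n]=a[n+1];a[n+1]=$0;if(a[n+1]==a[n]){print "Mind the gap";next}}1' sample.txt)

# The same logic with a plain variable instead of the two-slot array.
plain=$(awk '{ dup = ($0 == prev); prev = $0 }
             dup { print "Mind the gap"; next }
             1' sample.txt)

echo "$plain"
[ "$golfed" = "$plain" ] && echo "both forms agree"
```

On the sample data the second 333 should be replaced by the message and every other line passed through unchanged.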

--
have space suit : "VMSBUX:B0...@GOHH.GO"
will travel : tr "MLKJHGFDSQNBVCXWPOIUYTREZA" "a-z"

Ted Davis

Jul 27, 2008, 4:01:31 PM

Functionally, this is the same as pk's suggestion; it's just written out
in a fuller (C-like) and, hopefully, clearer form. Since you didn't say
what you want to do with the lines after suppressing adjacent duplicates,
I wrote it to print the non-duplicate lines as it encounters them. This
should not be sensitive to the file size because it stores only one line
at a time.

{
    if ($0 != Prev) print $0
    Prev = $0
}

In minimalist awk format, that's
$0 != Prev {print}
{Prev = $0}

As a command line program that could be (minimalist format)

awk '$0!=Prev{print}{Prev=$0}' source > target

(tested under Fedora and XP (as a script file - all variations tested
under Linux) with your sample data)
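
As a cross-check, suppressing adjacent duplicates is also what the standard uniq utility does, so on the sample data the two should agree. A small sketch (the file names source, target and target.uniq are just placeholders):

```shell
printf '%s\n' 111 2222 333 333 4445 3434 > source

# The one-liner from above, next to plain uniq on the same input.
awk '$0!=Prev{print}{Prev=$0}' source > target
uniq source > target.uniq

cat target
cmp -s target target.uniq && echo "awk and uniq agree"
```

The awk form is the one to keep if something other than printing needs to happen at each duplicate.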

BTW, "gigabytes" is usually abbreviated GB (Gb would be "gigabits").
Abbreviations for SI prefixes for units larger than kilo are all upper
case - all those smaller than mega are in lower case - the full prefixes
are in lower case unless the language requires initial capitals (k and K
have an unofficial byte/bit context usage: k = 1000; K = 1024).

--

T.E.D. (tda...@mst.edu) MST (Missouri University of Science and Technology)
used to be UMR (University of Missouri - Rolla).
.

Janis Papanagnou

Jul 27, 2008, 5:08:17 PM

loki harfagr wrote:
> On Sun, 27 Jul 2008 18:11:17 +0200, Janis Papanagnou wrote:
>
>
>>Mag Gam wrote:
>>
>>>Thanks for the response.
>>
>>[Please don't top-post!]
>>
>>
>>
>>>The underlying problem is, the file is huge; it's close to 15g and I
>>>would like to compare.
>>
>>(Never measured files in grams, so I can't help you here.)
>
>
> Ah Janis, the poor OP wasn't meaning grams but gravitational levels
> and under 15 g that's certainly difficult to cure any file ~;O)

:-) Frankly, I wasn't sure whether he could have meant gravity ;-)

>
>
>>>What I am trying to do is compare the current line to the next line
>>>(or vice versa, the 2nd line to the 1st).
>>
>>Have you tried pk's proposal? - Which solves what you've asked for.
>>
>>
>>
>>>With the hash solution, I was able to get the answer. However my
>>>sysadmin is complaining I am taking up too much memory.
>>
>>You already told us that your own hash solution doesn't fit your needs.
>>So just use pk's solution. What's the problem?
>
>
> I suspect pk's solution (though very good) may, in the OP case,
> still consume a lot of memory in the a[] buffer if by 'chance'
> the input overgravitated file has a lot of different lines ;-)

Oh, I meant his first proposal, the one without a[]...

awk '{ if ($0 == prev) {
         # ... this line is the same as the previous line
       }
       prev = $0 }' file

Janis

Janis Papanagnou

Jul 27, 2008, 5:40:48 PM

If we're going to go minimalist, maybe even...

awk '$0!=prev;{prev=$0}' source > target
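
One small caveat with all of these prev-style one-liners, offered as a hedged aside: prev starts out empty, so an empty first line compares equal to it and is silently dropped. If that matters for the data at hand, a possible guard is to let the first record through unconditionally (NR==1 is the only addition; the input and the file name blank-first.txt are invented for the demonstration):

```shell
# Invented input: two empty lines followed by two identical lines.
printf '\n\na\na\n' > blank-first.txt

# As posted: the leading empty line is lost along with its duplicate,
# so only "a" is left.
awk '$0!=prev;{prev=$0}' blank-first.txt | wc -l

# With the guard: one empty line survives, then "a".
awk 'NR==1 || $0!=prev;{prev=$0}' blank-first.txt | wc -l
```

The two line counts should come out as 1 and 2 respectively.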

Janis

Sashi

Jul 30, 2008, 9:31:41 PM

> If you just want to remove duplicates, you can do
> awk '!a[$0]++' file

Typical wizardry in awk.
Can someone please explain why/how this works?

Thanks,
Sashi

Grant

Jul 30, 2008, 10:18:09 PM

awk '(!$0 in a) { # if not seen
a[$0]++ # add $0 to seen list a[]
print # and print $0
}' file

Grant.
--
http://bugsplatter.mine.nu/

Ed Morton

Jul 31, 2008, 3:17:45 AM

The value of a[$0] is being tested against zero (to produce the default action
of printing the current record) then incremented. If it's zero then !a[$0] is
true so the record is printed.

It's the same as:

awk '
!a[$0]{print}
{a[$0]++}
' file

or, even wordier:

awk '
(a[$0] == 0){print}
{a[$0] = a[$0] + 1}
' file

Imagine an input file like this:

foo
bar
foo

The first time "foo" is read and stored in $0, a["foo"] has the value zero so
the record is printed. Then a["foo"] is incremented to 1.

Now "bar" is read and stored in $0, a["bar"] has the value zero so the record is
printed. Then a["bar"] is incremented to 1.

Now "foo" is read and stored in $0, a["foo"] has the value one so the record is
NOT printed since "a[$0]" is non-zero (i.e. true) so "!a[$0]" is false. Then
a["foo"] is incremented to 2.

So the output would be:

foo
bar

and the array "a" would contain a count of the occurrences of each record in the
input file:

$ cat file
foo
bar
foo
$ awk '!a[$0]++; END{for (i in a) printf "a[\"%s\"]=%s\n",i,a[i]}' file
foo
bar
a["foo"]=2
a["bar"]=1
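
It may be worth spelling out, as a side note, how this differs from the prev-based one-liners elsewhere in the thread: !a[$0]++ drops every repeat no matter where it occurs, at the price of keeping one array entry per distinct line, while the prev comparison keeps a single line but only catches repeats that are adjacent. A quick sketch using the same three-line input (the variable names seen and adjacent are chosen here for illustration):

```shell
printf '%s\n' foo bar foo > file

# Remembers every distinct line: the second "foo" is dropped.
seen=$(awk '!a[$0]++' file)

# Remembers one line only: the second "foo" is not adjacent, so it stays.
adjacent=$(awk '$0!=prev;{prev=$0}' file)

printf 'array version:\n%s\n' "$seen"
printf 'prev version:\n%s\n' "$adjacent"
```

So the two approaches are interchangeable only when equal lines always sit next to each other, for example in sorted input.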

Regards,

Ed.

Ed Morton

Jul 31, 2008, 3:19:24 AM

On 7/30/2008 9:18 PM, Grant wrote:
> On Wed, 30 Jul 2008 18:31:41 -0700 (PDT), Sashi <smal...@gmail.com> wrote:
>
>
>>>If you just want to remove duplicates, you can do
>>>awk '!a[$0]++' file
>>
>>Typical wizardry in awk.
>>Can someone please explain why/how this works?
>
>
> awk '(!$0 in a) { # if not seen

ITYM:

awk '!($0 in a) {

> a[$0]++ # add $0 to seen list a[]

No need for the "++":

a[$0]
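
The reason for moving the parenthesis, sketched under the usual awk precedence rules (unary ! binds more tightly than in): (!$0 in a) is read as ((!$0) in a), which asks whether 0 or 1 is an index of a[] rather than whether the line is. A small demonstration, with the input and the variable names wrong and right invented for the purpose:

```shell
printf '%s\n' foo bar foo > file

# As first posted: the condition is never true here, so nothing prints.
wrong=$(awk '(!$0 in a) { a[$0]; print }' file)

# Corrected: test membership first, then negate. The bare a[$0]
# reference is enough to create the array element.
right=$(awk '!($0 in a) { a[$0]; print }' file)

echo "first version printed: [$wrong]"
printf '%s\n' "$right"
```

The corrected form should print foo and bar once each, the same result as '!a[$0]++' gives.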

Grant

Jul 31, 2008, 3:46:21 AM

On Thu, 31 Jul 2008 02:19:24 -0500, Ed Morton <mor...@lsupcaemnt.com> wrote:

>
>
>On 7/30/2008 9:18 PM, Grant wrote:
>> On Wed, 30 Jul 2008 18:31:41 -0700 (PDT), Sashi <smal...@gmail.com> wrote:
>>
>>
>>>>If you just want to remove duplicates, you can do
>>>>awk '!a[$0]++' file
>>>
>>>Typical wizardry in awk.
>>>Can someone please explain why/how this works?
>>
>>
>> awk '(!$0 in a) { # if not seen
>
>ITYM:
>
> awk '!($0 in a) {

Thanks Ed, that's what I had in mind...

>
>> a[$0]++ # add $0 to seen list a[]
>
>No need for the "++":
>
> a[$0]
>
>> print # and print $0
>> }' file
>>
>> Grant.

--
http://bugsplatter.id.au/