For example, if I have a file like this
111
2222
333
333
4445
3434
Notice there is a duplicate "333". How can I test if the next line is
the same as the current line? I suppose I can use getline() but is
there another clever way of achieving this?
Also, how can I check for previous line?
TIA
> I have been trying to do this instead of placing everything in a hash/
> array and compare in the END block.
>
> For example, if I have a file like this
>
> 111
> 2222
> 333
> 333
> 4445
> 3434
>
> Notice there is a duplicate "333". How can I test if the next line is
> the same as the current line? I suppose I can use getline() but is
> there another clever way of achieving this?
I don't know if that can be considered more clever, however you can just
save the value of the previous line:
awk '{if ($0==prev) {
  # ... this line is the same as previous line
}
prev=$0}' file
What are you trying to do? What's the underlying problem?
If you just want to remove duplicates, you can do
awk '!a[$0]++' file
> Also, how can I check for previous line?
See above.
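As a minimal sketch of the "remember the previous line" idea, the following reports every line that repeats its predecessor, together with its line number. The function name report_adjacent_dups and the output format are made up for illustration. The NR > 1 guard is an added assumption, so that an empty first line is not mistaken for a duplicate of the still-empty prev.

```shell
# Hypothetical helper: print "LINENO: TEXT" for every line equal to the one before it.
# Only one previous line is kept in memory, regardless of file size.
report_adjacent_dups() {
    awk 'NR > 1 && ($0 "") == prev { print NR ": " $0 }   # "" forces a string comparison
         { prev = $0 }' "$@"
}

# Sample data from the question
printf '%s\n' 111 2222 333 333 4445 3434 | report_adjacent_dups
```

On the sample data this prints "4: 333". Called with a file name instead of a pipe, it reads that file.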
--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.
The underlying problem is, the file is huge; it's close to 15g and I
would like to compare.
What I am trying to do is, compare the current line to the next line
(or vice versa, the 2nd line to the 1st).
With the hash solution, I was able to get the answer. However my
sysadmin is complaining I am taking up too much memory.
[Please don't top-post!]
>
> The underlying problem is, the file is huge; it's close to 15g and I
> would like to compare.
(Never measured files in gram, so I can't help you here.)
>
> What I am trying to do is, compare the current line to the next line
> (or vice versa, the 2nd line to the 1st).
Have you tried pk's proposal? - Which solves what you've asked for.
>
> With the hash solution, I was able to get the answer. However my
> sysadmin is complaining I am taking up too much memory.
You already told us that your own hash solution doesn't fit your
needs. So just use pk's solution. What's the problem?
Janis
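For the "is the next line the same?" direction, one possible sketch that avoids getline is to hold each line back by one record and act on it once its successor has been read. The helper name mark_next_same and the " [same as next]" tag are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: print every line, tagging those whose NEXT line is identical.
# Memory use is one held line, independent of file size.
mark_next_same() {
    awk '
        NR > 1 {
            # held is the previous record; $0 is its successor ("" forces a string comparison)
            printf "%s%s\n", held, ((held "") == $0 ? " [same as next]" : "")
        }
        { held = $0 }
        END { if (NR > 0) print held }   # the last line has no successor
    ' "$@"
}

printf '%s\n' 111 2222 333 333 4445 3434 | mark_next_same
```

On the sample data only the first "333" gets the tag; every other line is printed unchanged.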
> Mag Gam wrote:
>> Thanks for the response.
>
> [Please don't top-post!]
>
>
>> The underlying problem is, the file is huge; it's close to 15g and I
>> would like to compare.
>
> (Never measured files in gram, so I can't help you here.)
Ah Janis, the poor OP wasn't meaning grams but gravitational levels
and under 15 g that's certainly difficult to cure any file ~;O)
>> What I am trying to do is, compare the current line to the next line
>> (or vice versa, the 2nd line to the 1st).
>
> Have you tried pk's proposal? - Which solves what you've asked for.
>
>
>> With the hash solution, I was able to get the answer. However my
>> sysadmin is complaining I am taking up too much memory.
>
> You already told us that your own hash solution doesn't fit your needs.
> So just use pk's solution. What's the problem?
I suspect pk's solution (though very good) may, in the OP case,
still consume a lot of memory in the a[] buffer if by 'chance'
the input overgravitated file has a lot of different lines ;-)
If that's the point, I propose here a possible way to
drastically reduce the memory usage, certainly not the
golf contest winner of the month but quite close to list
in obfuscating style samples ;D)
Anyway:
$ awk '{n++;n%=1;a[n]=a[n+1];a[n+1]=$0;if(a[n+1]==a[n]){print "Mind the gap";next}}1'
That way, if the OP's sysadmin still has a problem with mem usage, that'd leave us
with a few hypotheses: the server might upgrade from Z80-MSX-16KB towards
power machines like AtariST520 or even a PDP-20, or the sysadmin has to
be seen as human and may need to have some vacation time (like this week I
just had ;-) or maybe the file has extremely looong records...
(to OP: Replace the ``Londoner'' message by whatever you need, other msg or action)
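For readers untangling the one-liner: n%=1 always leaves n at 0, so a[] only ever holds two elements, a[0] (the previous line) and a[1] (the current line). A more readable sketch of the same behaviour, with the hypothetical names flag_repeats, older and newer, and an added NR > 1 guard for an empty first line, might look like this:

```shell
# Hypothetical, de-obfuscated version of the two-slot buffer idea.
flag_repeats() {
    awk '
        {
            older = newer          # the line read last time
            newer = $0             # the line just read
            if (NR > 1 && (newer "") == older) {
                print "Mind the gap"   # placeholder, as in the one-liner
                next                   # skip the default print below
            }
        }
        1                          # always-true pattern: print the line
    ' "$@"
}

printf '%s\n' 111 2222 333 333 4445 3434 | flag_repeats
```

On the sample data the second "333" is replaced by the message and every other line passes through.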
--
have space suit : "VMSBUX:B0...@GOHH.GO"
will travel : tr "MLKJHGFDSQNBVCXWPOIUYTREZA" "a-z"
Functionally, this is the same as PK's suggestion; it's just written out
in a fuller (C-like) and, hopefully, clearer form. Since you didn't say
what you want to do with the lines after suppressing adjacent duplicates,
I wrote it to print the non-duplicate lines as it encounters them. This
should not be sensitive to the file size because it stores only one line
at a time.
{
    if( $0 != Prev ) print $0
    Prev = $0
}
In minimalist awk format, that's
$0 != Prev {print}
{Prev = $0}
As a command line program that could be (minimalist format)
awk '$0!=Prev{print}{Prev=$0}' source > target
(tested under Fedora and XP (as a script file - all variations tested
under Linux) with your sample data)
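To make the trade-off between the two approaches in this thread concrete, here is a small sketch contrasting adjacent-only suppression with the hash-based '!a[$0]++'. The function names and the four-line input are invented for illustration. The adjacent form keeps a single line in memory but only catches repeats that sit next to each other; the hash form catches every repeat but stores each distinct line.

```shell
# Hypothetical helpers for comparing the two approaches.
dedupe_adjacent() {
    # One line of state; "" forces a string comparison, NR == 1 keeps an empty first line.
    awk 'NR == 1 || ($0 "") != prev { print } { prev = $0 }' "$@"
}
dedupe_global() {
    # One array element per distinct line.
    awk '!seen[$0]++' "$@"
}

printf '%s\n' foo foo bar foo | dedupe_adjacent   # foo, bar, foo
printf '%s\n' foo foo bar foo | dedupe_global     # foo, bar
```

If the big file is sorted (or duplicates are otherwise known to be adjacent), both give the same result and the first one is the cheap choice.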
BTW, "gigabytes" is usually abbreviated GB (Gb would be "gigabits").
Abbreviations for SI prefixes for units larger than kilo are all upper
case - all those smaller than mega are in lower case - the full prefixes
are in lower case unless the language requires initial capitals (k and K
have an unofficial byte/bit context usage: k = 1000; K = 1024).
--
T.E.D. (tda...@mst.edu) MST (Missouri University of Science and Technology)
used to be UMR (University of Missouri - Rolla).
:-) Frankly, I wasn't sure whether he could have meant gravity ;-)
>
>
>>>What I am trying to do is, compare the current line to the next line
>>>(or vice versa, the 2nd line to the 1st).
>>
>>Have you tried pk's proposal? - Which solves what you've asked for.
>>
>>
>>
>>>With the hash solution, I was able to get the answer. However my
>>>sysadmin is complaining I am taking up too much memory.
>>
>>You already told us that your own hash solution doesn't fit your needs.
>>So just use pk's solution. What's the problem?
>
>
> I suspect pk's solution (though very good) may, in the OP case,
> still consume a lot of memory in the a[] buffer if by 'chance'
> the input overgravitated file has a lot of different lines ;-)
Oh, I meant his first proposal, the one without a[]...
awk '{if ($0==prev) {
  # ... this line is the same as previous line
}
prev=$0}' file
Janis
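One caveat worth sketching for numeric-looking data like the sample, assuming an awk that follows the POSIX comparison rules: when both sides of == come from input and look like numbers, they are compared as numbers, so adjacent lines "10" and "10.0" would count as the same. Concatenating an empty string onto one operand forces a string comparison. The helper name adjacent_dups_exact is hypothetical.

```shell
# Hypothetical helper: print lines that are textually identical to the previous line.
adjacent_dups_exact() {
    awk 'NR > 1 && ($0 "") == prev { print }   # ($0 "") is a string, so == compares text
         { prev = $0 }' "$@"
}

# "10" and "10.0" are numerically equal but textually different
printf '%s\n' 10 10.0 10.0 | adjacent_dups_exact
```

Here only the third line is reported, because it is the only one that matches its predecessor character for character.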
If we're going to go minimalist, maybe even...
awk '$0!=prev;{prev=$0}' source > target
Janis
Typical wizardry in awk.
Can someone please explain why/how this works?
Thanks,
Sashi
awk '(!$0 in a) { # if not seen
a[$0]++ # add $0 to seen list a[]
print # and print $0
}' file
Grant.
--
http://bugsplatter.mine.nu/
The value of a[$0] is being tested against zero (to produce the default action
of printing the current record) then incremented. If it's zero then !a[$0] is
true so the record is printed.
It's the same as:
awk '
!a[$0]{print}
{a[$0]++}
' file
or, even wordier:
awk '
(a[$0] == 0){print}
{a[$0] = a[$0] + 1}
' file
Imagine an input file like this:
foo
bar
foo
The first time "foo" is read and stored in $0, a["foo"] has the value zero so
the record is printed. Then a["foo"] is incremented to 1.
Now "bar" is read and stored in $0, a["bar"] has the value zero so the record is
printed. Then a["bar"] is incremented to 1.
Now "foo" is read and stored in $0, a["foo"] has the value one so the record is
NOT printed since "a[$0]" is non-zero (i.e. true) so "!a[$0]" is false. Then
a["foo"] is incremented to 2.
So the output would be:
foo
bar
and the array "a" would contain a count of the occurrences of each record in the
input file:
$ cat file
foo
bar
foo
$ awk '!a[$0]++; END{for (i in a) printf "a[\"%s\"]=%s\n",i,a[i]}' file
foo
bar
a["foo"]=2
a["bar"]=1
Regards,
Ed.
On 7/30/2008 9:18 PM, Grant wrote:
> On Wed, 30 Jul 2008 18:31:41 -0700 (PDT), Sashi <smal...@gmail.com> wrote:
>
>
>>>If you just want to remove duplicates, you can do
>>>awk '!a[$0]++' file
>>
>>Typical wizardry in awk.
>>Can someone please explain why/how this works?
>
>
> awk '(!$0 in a) { # if not seen
ITYM:
awk '!($0 in a) {
> a[$0]++ # add $0 to seen list a[]
No need for the "++":
a[$0]
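Putting both corrections together, a sketch of the wordy "seen list" version could read as follows. The function name first_occurrences and the array name seen are hypothetical. The parentheses matter because "!" binds more tightly than "in", and a bare reference to seen[$0] is enough to create the element so that the "in" test succeeds on later occurrences.

```shell
# Hypothetical helper: print each distinct line the first time it appears.
first_occurrences() {
    awk '!($0 in seen) {   # true only the first time this exact line appears
             seen[$0]      # a bare reference creates the element; no value needed
             print
         }' "$@"
}

printf '%s\n' foo bar foo | first_occurrences
```

This prints "foo" and "bar" once each. Like '!a[$0]++', it still keeps one array element per distinct line, so it has the same memory behaviour on a huge file.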
>
>
>On 7/30/2008 9:18 PM, Grant wrote:
>> On Wed, 30 Jul 2008 18:31:41 -0700 (PDT), Sashi <smal...@gmail.com> wrote:
>>
>>
>>>>If you just want to remove duplicates, you can do
>>>>awk '!a[$0]++' file
>>>
>>>Typical wizardry in awk.
>>>Can someone please explain why/how this works?
>>
>>
>> awk '(!$0 in a) { # if not seen
>
>ITYM:
>
> awk '!($0 in a) {
Thanks Ed, that's what I had in mind...
>
>> a[$0]++ # add $0 to seen list a[]
>
>No need for the "++":
>
> a[$0]
>
>> print # and print $0
>> }' file
>>
>> Grant.