Need an awk, sed, perl or bash script for some magic...

Koppe74

Sep 7, 2008, 6:15:57 PM

I have a file containing the absolute path for a number
of files, that looks something like this:
> /home/bok/www.site.org/f/233/1/index.html
> /home/bok/www.site.org/f/233/2/index.html
> /home/bok/www.site.org/f/375/1/index.html
> /home/bok/www.site.org/f/419/1/index.html
> /home/bok/www.site.org/f/419/2/index.html
> /home/rtm/www.site.org/f/238/1/index.html
> /home/rtm/www.site.org/f/233/2/index.html
> /home/rtm/www.site.org/f/532/1/index.html

That is, lots of lines in the form:
www.site.org/f/<num>/1/index.html
usually -- but not always -- followed by an
almost identical line where 1 is replaced by 2:
www.site.org/f/<num>/2/index.html

I need a script that outputs only the 1-lines that
*are not followed* by an almost identical 2-line.
I.e. the 3rd and last line in the example on top.

I don't really care what it's written in, though I
would think awk, perl -- or perhaps sed -- would be
the right tools for the job. I tried writing it in awk,
but discovered that I actually really suck at that
language... and I have no experience in perl. I
could probably do it myself in C++, but that seems
like something of an overkill...

David W. Hodgins

Sep 7, 2008, 8:42:28 PM

On Sun, 07 Sep 2008 18:15:57 -0400, Koppe74 <kop...@gmail.com> wrote:

> I have a file containing the absolute path for a number
> of files, that looks something like this:
>> /home/bok/www.site.org/f/233/1/index.html
>> /home/bok/www.site.org/f/233/2/index.html
>> /home/bok/www.site.org/f/375/1/index.html
>> /home/bok/www.site.org/f/419/1/index.html
>> /home/bok/www.site.org/f/419/2/index.html
>> /home/rtm/www.site.org/f/238/1/index.html
>> /home/rtm/www.site.org/f/233/2/index.html
>> /home/rtm/www.site.org/f/532/1/index.html

> I need a script that output only the 1-lines that
> *are not followed* by an almost identical 2-line.
> I.e. the 3rd and last line in the example on top.

See if the following is what you want ...
#!/bin/bash

set -u # Treat use of unset variables as an error, to help catch typos.

FileInput="/home/dave/test.txt"

IFS=$'\n' # Separate command output by newlines only
LinesSorted=(`sort "$FileInput"`) # One array element per line
unset IFS

NumberOfLines=${#LinesSorted[@]}

for ((LineIndexCurrent=0; LineIndexCurrent < NumberOfLines; LineIndexCurrent++)); do

  LineCurrent="${LinesSorted[$LineIndexCurrent]}"

  LineUpToLastSlash="${LineCurrent%[/]*}" # Extract string up to last slash
  LengthUpToLastSlash="${#LineUpToLastSlash}"
  RightPart="${LineCurrent:$LengthUpToLastSlash}" # e.g. /index.html

  LeftPart="${LineUpToLastSlash%[/]*}" # Everything before the /1 or /2
  LengthLeftPart="${#LeftPart}"
  MiddlePart="${LineUpToLastSlash:$LengthLeftPart}" # e.g. /1

  LineIndexNext=0
  if [[ "$MiddlePart" == "/1" ]] ; then
    let LineIndexNext="$LineIndexCurrent + 1"
    if [ "$LineIndexNext" -ge "$NumberOfLines" ] ; then
      # A /1 line at the very end of the list has nothing following it
      echo " Line found = $LineCurrent"
    else
      # Split the next line the same way as the current one
      LineNext="${LinesSorted[$LineIndexNext]}"
      LineUpToLastSlash="${LineNext%[/]*}"
      LengthUpToLastSlash="${#LineUpToLastSlash}"
      LineNextRightPart="${LineNext:$LengthUpToLastSlash}"
      LineNextLeftPart="${LineUpToLastSlash%[/]*}"
      LengthLeftPart="${#LineNextLeftPart}"
      LineNextMiddlePart="${LineUpToLastSlash:$LengthLeftPart}"
      # Report the /1 line unless the next line is its /2 counterpart
      if [[ "$LeftPart" != "$LineNextLeftPart" || "$LineNextMiddlePart" != "/2" || "$RightPart" != "$LineNextRightPart" ]] ; then
        echo " Line found = $LineCurrent"
      fi
    fi
  fi

done

If your newsreader doesn't handle format=flowed, the above will be messed up, so
I've uploaded a copy to http://www.ody.ca/~dwhodgins/showuniq
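
The splitting step in the script relies only on bash parameter expansion. As a minimal sketch of that one step in isolation (split_path is a hypothetical helper name, not part of the script above), this prints the three parts of one sample path on separate lines:

```shell
#!/bin/bash
# Hypothetical helper: split <left>/<middle>/<right> using parameter expansion only.
split_path() {
    local line="$1"
    local up_to_last_slash="${line%/*}"           # strip the shortest trailing "/..." (the file name)
    local right="${line:${#up_to_last_slash}}"    # what was stripped, e.g. /index.html
    local left="${up_to_last_slash%/*}"           # strip the /1 or /2 component as well
    local middle="${up_to_last_slash:${#left}}"   # the /1 or /2 component itself
    printf '%s\n%s\n%s\n' "$left" "$middle" "$right"
}

split_path "/home/bok/www.site.org/f/233/1/index.html"
# prints:
# /home/bok/www.site.org/f/233
# /1
# /index.html
```

Two lines form a 1/2 pair when their left and right parts are equal and the middle parts are /1 and /2.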

Regards, Dave Hodgins

--
Change nomail.afraid.org to ody.ca to reply by email.
(nomail.afraid.org has been set up specifically for
use in usenet. Feel free to use it yourself.)

Kees Theunissen

Sep 8, 2008, 4:10:09 AM

Koppe74 wrote:
> I have a file containing the absolute path for a number
> of files, that looks something like this:
>> /home/bok/www.site.org/f/233/1/index.html
>> /home/bok/www.site.org/f/233/2/index.html
>> /home/bok/www.site.org/f/375/1/index.html
>> /home/bok/www.site.org/f/419/1/index.html
>> /home/bok/www.site.org/f/419/2/index.html
>> /home/rtm/www.site.org/f/238/1/index.html
>> /home/rtm/www.site.org/f/233/2/index.html
>> /home/rtm/www.site.org/f/532/1/index.html
>
> That is, lots of lines in the form:
> www.site.org/f/<num>/1/index.html
> usually -- but not always -- followed by an
> almost identical line where 1 is replaced by 2:
> www.site.org/f/<num>/2/index.html
>
>I need a script that output only the 1-lines that
>*are not followed* by an almost identical 2-line.
>I.e. the 3rd and last line in the example on top.

If these assumptions are met:
1) the file is sorted like above with all /2/ lines
directly following the /1/ line, and
2) there are no duplicate lines in the list, and
3) there are no /3/ or higher numbered variants
of these lines, and
4) there are no "orphaned" /2/ lines (i.e. /2/ lines
without the /1/ variant being present),
then you could simply turn /2/ lines into their /1/
variant and only list the unique lines of the modified
list.

Something like:
sed -e "s#/2/index#/1/index#" your_list | uniq --unique
would give the requested output.

This is a fast and straightforward solution for huge amounts
of data if _all_ of the above assumptions are met.
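
To see the substitution trick at work, here is a small sketch run on made-up data that satisfies all four assumptions (only_unpaired is a hypothetical wrapper name, and the sample paths are invented for the illustration):

```shell
#!/bin/bash
# Hypothetical wrapper around the sed | uniq pipeline; reads the list on stdin.
# Assumes a sorted list, no duplicates, no /3/ variants and no orphaned /2/ lines.
only_unpaired() {
    sed -e 's#/2/index#/1/index#' | uniq -u   # -u is the short form of --unique
}

# 233 and 419 have both variants; 375 and 532 only have the /1/ line.
printf '%s\n' \
    /home/bok/www.site.org/f/233/1/index.html \
    /home/bok/www.site.org/f/233/2/index.html \
    /home/bok/www.site.org/f/375/1/index.html \
    /home/bok/www.site.org/f/419/1/index.html \
    /home/bok/www.site.org/f/419/2/index.html \
    /home/bok/www.site.org/f/532/1/index.html \
    | only_unpaired
# prints:
# /home/bok/www.site.org/f/375/1/index.html
# /home/bok/www.site.org/f/532/1/index.html
```

After the substitution every paired /1/ line appears twice in a row, so uniq drops both copies and only the unpaired lines remain.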

If assumption 1) is not met then you could start the
pipeline with a "sort":
sort your_list | \
sed -e "s#/2/index#/1/index#" | \
uniq --unique

If assumption 2) is not met then you could start the
pipeline with a "uniq":
uniq your_list | \
sed -e "s#/2/index#/1/index#" | \
uniq --unique
to handle the case of two consecutive identical /1/
lines. Or, more generally:
sort your_list | \
uniq | \
sed -e "s#/2/index#/1/index#" | \
uniq --unique
Note that this last example suppresses the output of /1/ lines
without a /2/ line directly following if there are duplicate /1/
lines elsewhere in the file directly followed by a /2/ line.
(Actually this suppresses the /1/ line if a corresponding /2/
is found anywhere in the file -even in cases where assumption 1)
is not met- ).

If assumption 3) is not met and you just want to ignore
the higher numbered variants:
sed -e "s#/2/index#/1/index#" your_list | \
uniq --unique | \
grep '/1/index'
If you don't want to ignore these variants please supply
additional information on how to handle such a case.

If assumption 4 is not met that would break the whole logic
behind my solution. Such a case can be handled but requires
more logic than the simple line substitution and unique filtering
I used.
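
One possible shape for that extra logic, as a rough sketch only: remember which /1/ lines have a /2/ counterpart anywhere in the list and print the rest. The function name unpaired_ones and the awk variable names are made up for this example, and like the sort-based pipeline it matches a /2/ line anywhere in the file, not just directly after the /1/ line.

```shell
#!/bin/bash
# Hypothetical sketch: reads the list on stdin and prints each /1/ line that
# has no matching /2/ line anywhere in the input. Orphaned /2/ lines, duplicate
# /1/ lines and /3/-or-higher variants are ignored.
unpaired_ones() {
    awk '
        {
            key = $0
            sub(/\/2\/index/, "/1/index", key)   # map a /2/ line onto its /1/ twin
        }
        $0 != key { has_two[key] = 1; next }      # the line was a /2/ variant
        /\/1\/index/ && !(key in seen) {          # first sighting of a /1/ line
            seen[key] = 1
            order[++n] = key                      # remember input order
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in has_two))
                    print order[i]
        }
    '
}
```

It would be called as "unpaired_ones < your_list". Unlike the uniq-based pipelines it does not need sorted input, at the cost of holding the /1/ lines in memory.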

Regards,

Kees.

--
Kees Theunissen.

pk

Sep 8, 2008, 1:56:58 PM

awk -F'/' -v OFS='/' '$6==p{i++;next} i==1{print c}
{c=$0;i=1;p=$6}END{if(i==1)print c}' file

or, using a different approach

awk -F'/' '{b[$6]=$0;a[$6]++}END{for(i in a)if(a[i]==1)print b[i]}' file

although the latter method does not print the lines in the same order.
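
Both one-liners key on $6 alone, i.e. the number after /f/ (with -F'/' the first field is empty, so $6 is 233 in /home/bok/www.site.org/f/233/1/index.html), and so they do not distinguish /home/bok from /home/rtm. A possible variant, sketched here under the assumption that every line ends in /1/index.html or /2/index.html, compares the whole path instead and keeps the input order (ones_not_followed is a made-up name):

```shell
#!/bin/bash
# Hypothetical sketch: reads the list on stdin and prints every /1/ line whose
# immediate successor is not the same path with /1/ replaced by /2/.
ones_not_followed() {
    awk '
        {
            if (pending != "" && $0 != expected)
                print pending                 # the previous /1/ line had no /2/ twin right after it
            pending = ""
            if ($0 ~ /\/1\/index\.html$/) {
                pending = $0                  # hold the /1/ line until the next line is seen
                expected = $0
                sub(/\/1\/index\.html$/, "/2/index.html", expected)
            }
        }
        END { if (pending != "") print pending }
    '
}
```

Called as "ones_not_followed < file". On the sample list from the original post this prints the 3rd, 6th and last lines, since the 6th line (238) is directly followed by a 2-line for a different number (233).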