Search for words in one file that are not in another file

tiba...@gmail.com

Sep 15, 2017, 9:25:45 AM
to ack users
Hi,
I have two files (file1 and file2), each containing one word per line.
I am trying to get the words in file1 that are not in file2, saving the result in file3.
I need an awk command equivalent to the following grep:
grep -F -x -v -f file2 file1 > file3
grep takes a long time and ends up being killed, because file2 is about 40000 lines long and file1 is about 25000.

If I wrote a C program, it would need two nested for loops, which would take time too.

Please, I would appreciate any suggestions.

thanks

James E Keenan

Sep 15, 2017, 9:34:14 AM
to ack-...@googlegroups.com, tiba...@gmail.com
From the description you have provided I would first try the Unix
command-line program 'comm'.

Try 'man comm' and look in particular at the '-3' option.

Thank you very much.
Jim Keenan

tiba...@gmail.com

Sep 15, 2017, 10:03:05 AM
to ack users
Thank you.
I have tried comm and it is fast. The only thing is that I am looking for a one-sided difference:
words that are in file2 but not in file1 should not be output, only words that are in file1 but not in file2.
So I solved it with this bash script:

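while IFS= read -r n; do
    # awk stops at the first match in file2; the END block reports the result,
    # so A is always either "found" or "notfound" (never empty).
    A=$(awk -v word="$n" '$1 == word { found = 1; exit } END { print (found ? "found" : "notfound") }' file2)
    if [ "$A" = 'notfound' ]; then
        echo "$n" >> f2
    fi
done < file1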

It is faster than grep, but any faster way is still appreciated.

Thank you so much for introducing comm; I may use it for other problems.
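
On the original request for an awk equivalent of the grep command: a minimal sketch of the common two-file awk idiom, which reads file2 once into an array and then makes a single pass over file1, so no per-word rescan of file2 is needed. The sample words and the scratch directory below are made up for illustration; they are not the poster's data. One caveat of this idiom: if file2 is empty, the NR == FNR test also matches file1 and nothing is printed.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' pear apple fig cherry > "$tmp/file1"
printf '%s\n' cherry banana apple > "$tmp/file2"

# While reading the first file named (file2), NR == FNR holds:
# remember each line as an array key and skip to the next line.
# While reading the second file (file1), print lines never seen in file2.
awk 'NR == FNR { seen[$0]; next } !($0 in seen)' "$tmp/file2" "$tmp/file1" > "$tmp/file3"
cat "$tmp/file3"
```

The output keeps the order of file1 and neither file has to be sorted, at the cost of holding all of file2 in memory.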

Rob Hoelz

Sep 15, 2017, 11:17:02 AM
to ack-...@googlegroups.com
I'd like to recommend an alternative to comm: combine
(https://joeyh.name/code/moreutils/). I find its human readable
boolean operators easier to remember than comm's numeric options!

-Rob

Bill Ricker

Sep 15, 2017, 11:51:23 AM
to ack-...@googlegroups.com
On Fri, Sep 15, 2017 at 11:15 AM, Rob Hoelz <r...@hoelz.ro> wrote:
> On Fri, 15 Sep 2017 07:03:04 -0700 (PDT)
> tiba...@gmail.com wrote:
>
>> Thank you.
>> I have tried comm and it is fast. The only thing is that I am looking
>> for a one-sided difference: words that are in file2 but not in file1
>> should not be output, only words that are in file1 but not in file2.
>> So I solved it with this bash script:

In `comm` that's the -23 option.
Column 1 is words only in file 1.
-23 is minus 2,3, omit columns 2 (file 2 words) and 3 (both files words).

The other key to `comm` is that both files must be sorted in the same collation order (the one `sort` uses in the current locale).
So the shell command or alias needed is
comm -23 <(sort "$file1") <(sort "$file2")
using bash's <() process substitution to hand each sort's output to comm as if it were a file.
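
A small worked example of the -23 usage, spelled out with intermediate sorted files so it also runs in shells without <(). The sample words, the scratch directory, and the .sorted file names are made up for illustration. Setting LC_ALL=C is an assumption on my part; it simply keeps sort and comm on the same byte-wise ordering.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' pear apple fig cherry > "$tmp/file1"
printf '%s\n' cherry banana apple > "$tmp/file2"

# Use one collation for both sort and comm.
export LC_ALL=C
sort "$tmp/file1" > "$tmp/file1.sorted"
sort "$tmp/file2" > "$tmp/file2.sorted"

# -2 drops the "only in file2" column, -3 drops the "in both" column,
# leaving column 1: lines that appear only in file1.
comm -23 "$tmp/file1.sorted" "$tmp/file2.sorted" > "$tmp/file3"
cat "$tmp/file3"
```

With this data the result is fig and pear, in sorted order rather than in file1's original order.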

> I'd like to recommend an alternative to comm: combine
> (https://joeyh.name/code/moreutils/). I find its human readable
> boolean operators easier to remember than comm's numeric options!

Yeah, -1 "minus one" meaning omit column one (file one only) and
-3 meaning omit column three (both files) is not very mnemonic.
It didn't take 35 years to become "natural" to me but ...
it's still not entirely natural, I just hear the author's peculiar mnemonic.