deleting lines from trigram output


BobS

Mar 22, 2015, 9:34:44 AM
to nltk-...@googlegroups.com

I have processed a text file to create trigrams and have written the output to a file. Here is an example of the output file:

earnings,evening,earnings,65303.132245
earnings,growth,2016,65161.018628
earnings,140,earnings,65130.507383
lower,earnings,2015,65096.017731
2015,base,earnings,65022.922881
earnings,reconcile,earnings,64946.317262
2015,earnings,interest,64848.114762
income,growth,earnings,64787.065685
terms,earnings,2015,64782.769760
adjusted,earnings,growth,64488.788412

Some of the lines include a year as one of the trigram terms, for example 2015,base,earnings,65022.922881. Other lines include a term that is a number but not a year, for example earnings,140,earnings,65130.507383. I don't care about the trigrams with such small numbers and would like to delete them, but I do want to keep trigrams where the number/year is 2000 or greater. So the logic that I propose is:

Read the line.
If any of the first three terms is a number less than 2000, delete the line.
Otherwise, write the line to a file.

I would appreciate suggestions as to how to code this logic in Python. I have looked for examples and found some that delete a line based on certain keywords, but I did not find any code similar to what I want to do. My Python skills are very basic and I need some help with this.

avitalp

May 4, 2015, 5:02:58 PM
to nltk-...@googlegroups.com
Hi there,

Here's one way:
import fileinput


for line in fileinput.input("data.txt", inplace = 1):
    fields = line.strip('\n').split(',')
    delete_line = False

    if len(fields) > 2:
        # Check first 3 terms
        for i in range(0, 3):
            if fields[i].isdigit() and int(fields[i]) < 2000:
                delete_line = True
                break

        if not delete_line:
            # Please note the comma at the end, it prevents a newline
            print(line),


I haven't tested it, but it should work as-is in Python 2.7 using the fileinput module. It assumes your output data is in a file called data.txt. From your description of the logic, I understood that lines are to be kept if the first 3 terms are all strings (e.g. line 1 from your example output file).

Anyway, quick sketch, hope it helps! :)
--Avi
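
For readers on Python 3, the same keep/drop rule can be sketched as a small function and tried on a few lines from the sample before touching any file. This is only an illustration under assumptions: the names `keep_line` and `sample` are made up here, and the Python 2 idiom `print(line),` is replaced by `print(line, end="")`, since the trailing comma no longer suppresses the newline in Python 3.

```python
def keep_line(line):
    """Return True if the line should be kept (hypothetical helper name)."""
    fields = line.rstrip("\n").split(",")
    if len(fields) < 3:
        # Mirrors the posted code, which silently drops lines with fewer than 3 fields.
        return False
    # Drop the line if any of the first three terms is a number below 2000.
    return not any(f.isdigit() and int(f) < 2000 for f in fields[:3])


# A few lines copied from the example output in the question.
sample = [
    "earnings,evening,earnings,65303.132245\n",
    "earnings,growth,2016,65161.018628\n",
    "earnings,140,earnings,65130.507383\n",
    "lower,earnings,2015,65096.017731\n",
]

kept = [line for line in sample if keep_line(line)]
for line in kept:
    # Each line still carries its own newline, so suppress print's.
    print(line, end="")
```

With these four sample lines, only the one containing 140 is dropped; the lines containing 2015 and 2016 are kept because those values are not below 2000.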

Alexis Dimitriadis

May 5, 2015, 10:12:00 AM
to nltk-...@googlegroups.com
Don't use this code: it will overwrite your original files with the edited versions (because of the inplace argument). Or use it only with appropriate preparations, such as working on a copy.

Also, test your code. You strip the newlines and never add them back, and your program will crash as soon as it encounters a mixed letter-digit word like "2b".

Alexis


Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis
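
One way to avoid overwriting the input, sketched here for Python 3, is to read from one file and write the kept lines to a different one. The function name `filter_trigram_file`, the `min_year` parameter, and the output filename in the usage note are assumptions made for illustration. Unlike the posted code, this sketch also keeps lines with fewer than three fields (including blank lines) rather than dropping them.

```python
def filter_trigram_file(src_path, dst_path, min_year=2000):
    """Copy src_path to dst_path, skipping lines whose first three
    comma-separated terms include a number below min_year.
    Returns the number of lines written. The source file is never modified."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Only the first three fields are trigram terms; the fourth is the score.
            terms = line.rstrip("\n").split(",")[:3]
            if any(t.isdigit() and int(t) < min_year for t in terms):
                continue  # skip this line
            dst.write(line)
            kept += 1
    return kept
```

A call such as `filter_trigram_file("data.txt", "data_filtered.txt")` would leave data.txt untouched. If in-place editing is still preferred, `fileinput.input` also accepts a `backup` argument (for example `backup=".bak"`) that keeps a copy of the original file alongside the edited one.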


Alexis Dimitriadis

May 5, 2015, 10:13:44 AM
to nltk-...@googlegroups.com
Correction: I see that you don't strip newlines from the output. Apologies for the accusation. But you should handle mixed digit-letter words, and should warn clearly if your program edits its input in place.

Alexis
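
On the mixed digit-letter point, a quick check of how such terms behave may help. As far as I can tell, `"2b".isdigit()` returns False, so in a guard of the form `term.isdigit() and int(term) < 2000` the `int()` call is never reached for that term. The remaining risk is unusual characters such as a superscript "²", which `isdigit()` accepts but `int()` rejects. The sketch below (Python 3.7 or later, because of `str.isascii`) uses a hypothetical helper `is_small_number` that restricts the test to plain ASCII digits.

```python
def is_small_number(term, threshold=2000):
    """True only for terms made purely of ASCII digits whose value is
    below threshold (hypothetical helper, not from the posted code)."""
    # isascii() filters out characters like superscript digits before int() runs.
    return term.isascii() and term.isdigit() and int(term) < threshold


# Print, for each term: the term, what isdigit() says, and the helper's verdict.
for term in ["140", "2015", "2b", "earnings", ""]:
    print(repr(term), term.isdigit(), is_small_number(term))
```

Under this rule a mixed term like "2b" is treated as an ordinary word, so a trigram containing it would be kept; whether that is the desired behavior depends on the data.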



BobS

May 6, 2015, 3:38:41 PM
to nltk-...@googlegroups.com
Thanks so much for your suggestion. I will use it soon and let you know the outcome.
Cheers, BobS

avitalp

May 10, 2015, 11:08:44 PM
to nltk-...@googlegroups.com
Alexis, thank you for your suggestion about clearer warnings for code that edits files in place; I will include them in future. When posting, I automatically assumed BobS would work with a copy of the data rather than run my code on his original set.

I didn't see any mixed letter-digit words in the sample provided so I didn't address it in code. But, in any case, my code was offered as a helpful suggestion and example rather than bullet-proof code :)

--Avi