Download Parser



Jun 4, 2021, 7:42:32 AM
to WordSmith Tools
Hello Mike,
thank you so much for putting together the Download Parser to help with processing texts from newspaper archives.

Unfortunately, I can't seem to get it to work. I am using a file of newspaper texts that I have converted with the WST text converter to Unicode text files.
In the "First parse" tab, it finds my file (just one for testing) and recognises that there are 3 texts in it: "3 texts found". But nothing else happens: when I click Done, a popup states "No data found", and I cannot find the file with the tags around the headers anywhere. I have added the fields that my data has (the capitalisation is different: for example, it's "Byline:" and "Length:" rather than "BYLINE:" and "LENGTH:") and I have deselected the fields that are not in my data.
In the parsed files folder, the HEADLINES etc. folders remain empty. The author file also includes no data.
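For anyone hitting the same capitalisation mismatch, here is a minimal Python sketch of matching header fields regardless of case and wrapping them in tags. It is not how the Download Parser works internally: the field list, the tag format and the function name `tag_headers` are all assumptions made up for illustration.

```python
import re

# Hypothetical field names; a real export may use others.
FIELDS = ["Byline", "Length"]

# Match a field label at the start of a line, ignoring case,
# so "Byline:" and "BYLINE:" are treated the same.
FIELD_RE = re.compile(
    r"^(?P<name>" + "|".join(re.escape(f) for f in FIELDS) + r"):[ \t]*(?P<value>.*)$",
    re.IGNORECASE | re.MULTILINE,
)


def tag_headers(text: str) -> str:
    """Wrap each recognised header line in an upper-case tag (assumed format)."""

    def wrap(match: re.Match) -> str:
        name = match.group("name").upper()
        value = match.group("value").strip()
        return f"<{name}>{value}</{name}>"

    return FIELD_RE.sub(wrap, text)


if __name__ == "__main__":
    sample = "Some headline\nByline: A. Writer\nLENGTH: 512 words\nBody text."
    print(tag_headers(sample))
```

Lines that do not start with one of the listed field names are left untouched, so the body text passes through unchanged.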
Can you point me to what is going wrong, by any chance?
Thank you very much!

All the best,

Mike Scott

Jun 4, 2021, 12:25:59 PM
to WordSmith Tools
Thanks for this, Viola.

Looks as if LexisNexis may have changed format. The Download Parser parses .TXT format downloads, which LexisNexis allowed one to get, up to 500 at a time. The Word format you downloaded doesn't have quite the same form. I think it'd be perfectly possible to adapt the Download Parser to find the mark-up present in the Word .docx download -- but I would need to have spare time for that. At present I'm working hard on the 64-bit WordSmith... Maybe in a few months?

Sorry! -- Mike


Jun 11, 2021, 9:05:18 AM
to WordSmith Tools
Thank you Mike, just saw this! Yes, they keep changing everything all the time, it's so difficult to keep up! (And I'm not sure it's worth the effort...)

If anybody else stumbles across this, I found a package that can "clean" the articles, importing them into R data frames, for anybody who is happy to use R: 

