How to clean data from Nexis Uni (LexisNexis) newspaper articles


Paolo Orrù

Dec 30, 2023, 12:27:13 PM
to WordSmith Tools
Dear Mike,
I remember seeing a talk on YouTube some years ago in which you explained how to get rid of specific chunks of text (mostly "metadata" such as "date", "language" and numbers) from newspaper articles downloaded from LexisNexis.

I can't find the video anymore.
Could I kindly ask you to explain the procedure?

Thank you very much for all your work.

Paolo

Mike Scott

Jan 1, 2024, 4:35:51 AM
to WordSmith Tools
Paolo, hi
Sorry for the delay -- I was away.

1. You will find a Download Parser at https://lexically.net/wordsmith/support/extras.html
2. Most of that program's functions are now in the Corpus Checker utility in WordSmith 8.
3. In that utility I have also worked hard on methods to check the relevance of downloaded texts to your search term, etc.

HNY and Cheers -- Mike

Paolo Orrù

Jan 2, 2024, 7:22:47 AM
to WordSmith Tools
Hi Mike,
thank you so much and happy new year!

Paolo Orrù

May 4, 2024, 4:15:14 AM
to WordSmith Tools
Hello Mike,
thanks to your help I was able to put tags around the metadata in my news article corpus.
Now I would like to ask whether there is a way, using the Text Converter tool, to delete every chunk of text within brackets from the files.
I can use the corpus with the tags, of course, but I would also like to have a clean version of the texts.

For example, I would like to remove all lines of this kind from multiple files. Is there a way to do this?

<h>«Il digitale? All' Europa manca una piattaforma per competere»</h>

<h>Corriere della Sera (Italy)</h>

<h>27 gennaio 2019 domenica</h>

<h>RIBATTUTA Edizione</h>

<h>Copyright 2019 RCS Mediagroup All Rights Reserved</h>

<h> </h>

<h>Section: ECONOMIA; Pag. 27</h>


Mike Scott

May 7, 2024, 1:08:34 PM
to WordSmith Tools
Paolo, hi

Sorry for the delay, we had a bank holiday, family visit etc.

1. You can already choose to exclude such paired tag sequences from concordances, word lists, etc. by defining the start and end tags (such as <h> and </h>) in Tag Handling (see https://lexically.net/downloads/version9/HTML/tags_to_exclude.html). In general that is the best approach, as you don't lose that information from the corpus and might want it some day.
2. I don't think WordSmith offers a way to permanently cut only those paired tags, together with the text they enclose, out of your corpus. (You can remove all tags in the Text Converter, but that would leave the header strings in your examples still in the text.) I'm now studying the best way to offer that option, and I will post here when I've done it.
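
Until such an option exists, a permanently clean copy can be produced outside WordSmith with a small script. The following is a minimal Python sketch, not a WordSmith feature. It assumes that the files are UTF-8 plain text with a .txt extension, that <h> and </h> are never nested, and that the cleaned files should go to a separate folder so the tagged originals are kept. The function names and folder names are hypothetical.

```python
import re
from pathlib import Path

# One <h>...</h> pair (non-greedy, may span lines), plus any trailing
# spaces/tabs and a single following newline.
HEADER_RE = re.compile(r"<h>.*?</h>[ \t]*\n?", re.DOTALL)
# Three or more consecutive newlines, left behind once headers are gone.
BLANK_RUN_RE = re.compile(r"\n{3,}")


def strip_headers(text: str) -> str:
    """Return text with every <h>...</h> chunk (tags and content) removed."""
    without_headers = HEADER_RE.sub("", text)
    collapsed = BLANK_RUN_RE.sub("\n\n", without_headers)
    return collapsed.strip() + "\n"


def clean_corpus(src_dir: str, dst_dir: str, encoding: str = "utf-8") -> int:
    """Write a cleaned copy of each .txt file in src_dir into dst_dir.

    The originals are left untouched. Returns the number of files written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob("*.txt")):
        cleaned = strip_headers(path.read_text(encoding=encoding))
        (dst / path.name).write_text(cleaned, encoding=encoding)
        count += 1
    return count
```

Calling clean_corpus("tagged", "clean") would then fill the (hypothetical) folder "clean" with header-free copies of the files in "tagged". If the files are not UTF-8, the encoding argument would need changing accordingly.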

Cheers, Mike
