Twitter data: multiple independent texts in one txt file?


David Adler

Dec 30, 2021, 8:35:53 AM
to AntConc-Discussion
Hi,

For a small project I am working with Twitter data. The corpus contains around 600,000 tweets. My main problem is that these are a large number of texts that are independent of one another. When I combine them into a handful of files (one for each month, for instance), the collocates are “contaminated” by adjacent tweets.
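
To make the contamination concrete, here is a minimal Python sketch. It is not AntConc's implementation; the `collocates` function, the two example tweets, and the window size of 2 are all assumptions chosen for illustration. It counts the words within a fixed window around a node word, once over the merged text and once tweet by tweet.

```python
from collections import Counter


def collocates(tokens, node, span=2):
    """Count words within `span` tokens left or right of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - span)
        # Left context plus right context, excluding the node itself.
        window = tokens[lo:i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts


# Two made-up tweets that have nothing to do with each other.
tweets = ["great coffee this morning", "traffic was terrible today"]

# One running text: the window around "morning" crosses the tweet boundary.
merged = " ".join(tweets).split()
across = collocates(merged, "morning")

# Each tweet as its own text: the window stops at the tweet boundary.
within = Counter()
for tweet in tweets:
    within.update(collocates(tweet.split(), "morning"))

print("merged file:", sorted(across))
print("per tweet:  ", sorted(within))
```

In the merged version, "traffic" and "was" show up as collocates of "morning" although they belong to the next tweet; in the per-tweet version they do not.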

My first impulse was to use one file per tweet. But given the number of tweets, that will likely overwhelm even the new, super-fast AntConc 4.0. :)

Is there a way to include multiple independent texts in one txt-file? Or, in other words, is there a way to separate texts so that collocates cannot extend beyond the separator?

I hope I have made my problem reasonably clear.

David




Laurence Anthony

Dec 30, 2021, 12:16:14 PM
to ant...@googlegroups.com
Hi David,

I think the best approach is to use one file per tweet. In AntConc 4, there will be zero performance loss except in the Plot tool, which needs to show results on a per-file basis. For that tool, I would recommend using the graphical view, which is much, much faster than the tabular view.

I'm very interested to hear how this goes. My FireAnt tool currently generates tweet data as a single file. So, I'm thinking of adding a new indexer to AntConc that will process single files using a 'one-file-per-line' concept. The result would be equivalent to what you are thinking of doing.
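
Until such an indexer exists, the same effect can be approximated outside AntConc. The following Python sketch assumes the input is a UTF-8 text file with exactly one tweet per line; the function name `split_into_files`, the file names, and the `tweet_` prefix are hypothetical. It writes each non-empty line to its own numbered .txt file.

```python
from pathlib import Path


def split_into_files(source, out_dir, prefix="tweet"):
    """Write every non-empty line of `source` to its own .txt file in `out_dir`.

    Returns the number of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(source, encoding="utf-8") as handle:
        for line in handle:
            text = line.strip()
            if not text:
                continue  # skip blank lines so no empty files are created
            written += 1
            # Zero-padded numbering keeps the files in order when sorted by name.
            target = out / f"{prefix}_{written:06d}.txt"
            target.write_text(text + "\n", encoding="utf-8")
    return written


# Example call (hypothetical paths):
# split_into_files("tweets_2021.txt", "tweets_split")
```

The output directory can then be loaded as a corpus of individual files. Tweets that contain line breaks would need to be flattened to a single line first.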

Laurence.

###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################



David Adler

Dec 30, 2021, 1:12:56 PM
to AntConc-Discussion
Hey, I just had a short meeting with a friend who is also trying to use social media data with AntConc. For her, the one-file-per-tweet approach seems to work at first glance, but with a much smaller corpus. With my 600,000 files, even Finder/Explorer has some difficulty coping. So something along the lines of one file per line would be great. Alternatively, a separator tag could be added alongside the _ for annotations and the <> for comments. I think this could be useful for any corpus that contains many very short independent texts (comments, press releases, other social media posts).

I will have a look at FireAnt. FYI, the link to the Mac version seems to be broken: https://www.laurenceanthony.net/software/fireant/releases/FireAnt205/FireAnt.zip

Laurence Anthony

Dec 31, 2021, 12:29:51 AM
to ant...@googlegroups.com
Hi David,

Thanks for the input. I think the one file per line feature will be the first thing to add to AntConc 4.1. 

Thanks also for letting me know the link to FireAnt on MacOS was broken. I've fixed it now, but as you'll find, I used the .pkg approach. So, I need to update all my apps to use the new .dmg approach.

Laurence.


David Adler

Dec 31, 2021, 4:51:18 AM
to AntConc-Discussion
Yes, I downloaded the .pkg. As I understand it, the problem lies with ownership rather than permissions, so fixing it is very simple for any of the .pkg apps.

In Terminal/Shell it is only a one-liner:

sudo chown -R [username] [path to application]

David Adler

Jan 6, 2022, 5:33:01 PM
to AntConc-Discussion
Hi Laurence,

I tried to hotfix my problem by including a repeated separator word, which I can then filter out with a stop-word list. That should effectively prevent collocates from being collected beyond the scope of a single tweet.
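
The padding step can be sketched in Python as follows. The separator word `ZZSEP`, the function name `join_with_separator`, and the default span of 5 are assumptions for illustration; the separator must be a word that never occurs in the tweets themselves, and the number of repetitions should be at least as large as the collocate window on either side. The sketch also assumes that stop-listed words are only hidden from the results and still occupy their positions in the text, which I have not verified for AntConc.

```python
def join_with_separator(tweets, separator="ZZSEP", span=5):
    """Join tweets into one text, padding each boundary with `span` separator words.

    With at least `span` separators between two tweets, a collocate window of
    `span` words to the left or right cannot reach into the neighbouring tweet.
    """
    pad = " ".join([separator] * span)
    cleaned = [t.strip() for t in tweets if t.strip()]
    return f" {pad} ".join(cleaned)


# Example with a window of 3 words on each side:
text = join_with_separator(
    ["great coffee this morning", "traffic was terrible today"],
    span=3,
)
print(text)
```

The separator word then goes into the stop-word list so that it does not appear among the collocates itself.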

My only problem now is that the menu for the stop-word list seems to have changed, and somehow I am not able to find it. I looked under Tool settings -> Words. Am I just not seeing the obvious here, or is there a different way of using stop-word lists in 4.0.2?

I had a look at the help document but did not find an answer to this at first sight.

All the best,

David

Dominique Fry

Jan 10, 2022, 7:29:32 PM
to AntConc-Discussion
Does AntConc work with the latest macOS Catalina 10.15.07? I'm able to install the software, but every time I try to load a txt file, it crashes.

I'd appreciate your advice.

Dominique

David Adler

Jan 11, 2022, 3:43:24 PM
to AntConc-Discussion
Hey Dominique, have you downloaded the 4.0.2 .dmg file? The behaviour you describe seems to match 4.0.1, which was installed via a .pkg file.
I am running AntConc 4.0.2 on Big Sur (11) without any problems.