Creating / Editing a corpus


Stephen Gourlay

Jan 22, 2022, 11:15:41 AM
to AntConc-Discussion
Thanks very much, Laurence, for this fantastic tool. I particularly like the ability to load PDF files, which saves the hassle of creating .txt files from papers.

Is it possible to create a corpus in stages? I only want some of the files in a folder, so I have to pick them out rather than load the whole folder. I can add many at once, but once I've created the corpus I cannot work out how to add more. So I guess I have to copy the wanted files to a new folder, then load that folder.

Best wishes
Stephen

Stephen Gourlay

Jan 23, 2022, 5:40:29 AM
to AntConc-Discussion
On reflection, from a Corpus Linguistics perspective I guess it's unlikely a corpus will be added to after creation. But from the point of view of researchers using AntConc to analyse a collection of documents, it's possible that the collection might need editing, and particularly adding to, over time. E.g. I have a collection of PDFs on a topic that I create a corpus from, but then I find more relevant papers that I want to add, either because new ones have been published or because I have discovered a better search string. Loading new files into a folder and re-creating the corpus is obviously the best way to create an updated corpus.
Best wishes
Stephen

Emma Goldsmith

Feb 4, 2022, 5:54:51 AM
to AntConc-Discussion
Adding my two cents here: 
I'm also missing the ability to add files to a quick temporary corpus and to my own pre-built corpora.
Translators and editors often create a quick corpus when working in an unfamiliar field and then add files on the fly when we find useful references as we work.
Cheers,
Emma

Laurence Anthony

Feb 4, 2022, 6:44:40 AM
to ant...@googlegroups.com
Thanks for the comment, Emma.

Yes, this was certainly a strength of the 3x version of AntConc. But I think it's a more niche case. Also, the "Create KWIC corpus" option in the File menu effectively allows you to rapidly rebuild a corpus. The important thing is keeping your source files in one place and adding new ones to that source folder. Then, it's a simple step to rebuild the corpus. There are huge benefits to the AntConc 4x version, too. Once you have a stable set of source files, you can access that corpus directly from the corpus manager, using it as either a target or reference corpus. If you switch between corpora for different projects, I think this becomes a much better way to work.

What do you think?

Laurence.



###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################


--
You received this message because you are subscribed to the Google Groups "AntConc-Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antconc+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antconc/54d37174-4902-40f0-a4c5-da17585a458cn%40googlegroups.com.

Stephen Gourlay

Feb 6, 2022, 6:21:44 AM
to ant...@googlegroups.com
My solution is to search for the files on the topic then select those
files in the list, and copy them to a new folder which is the source
for AntConc's corpus. When I find new papers on the topic these files
can be copied into this copy folder, and corpus creation run again.
Works best with a program that indexes file contents.
Best wishes
Stephen
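
The copy-then-rebuild routine described above can be scripted. What follows is a minimal Python sketch, not something AntConc itself provides: the function name `sync_into_source` and the folder names in the usage note are hypothetical, and it assumes files are identified by name, so two different files with the same name would overwrite each other. It copies only files that are missing from the source folder or that have changed, after which the corpus is rebuilt in AntConc as usual.

```python
import shutil
from pathlib import Path


def sync_into_source(found_files, source_dir):
    """Copy files into the corpus source folder, skipping unchanged ones.

    Returns the list of destination paths that were actually copied.
    """
    source_dir = Path(source_dir)
    source_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, found_files):
        dest = source_dir / src.name
        if dest.exists():
            s, d = src.stat(), dest.stat()
            # Treat the file as unchanged if the size matches and the copy
            # is at least as recent as the original (compared in whole seconds).
            if s.st_size == d.st_size and int(d.st_mtime) >= int(s.st_mtime):
                continue
        shutil.copy2(src, dest)  # copy2 also preserves the modification time
        copied.append(dest)
    return copied
```

For example, `sync_into_source(Path("search_results").glob("*.pdf"), "antconc_source")` would bring a hypothetical `antconc_source` folder up to date; running it again without new files copies nothing.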

Emma Goldsmith

Feb 6, 2022, 8:59:03 AM
to AntConc-Discussion
I agree that keeping files in an AntConc-source folder makes sense. That indeed is part of my workflow.
Unfortunately, though, rebuilding every time I add a file there is time-consuming when big PDF files are involved, because of conversion time. I'm guessing that AntFileConverter is working under the hood to do that, because corpus rebuilding time is very similar to AntFileConverter processing time.

Since corpus building is amazingly fast with txt files, one solution would be to convert big pdfs to txt manually, store in the AntConc-source folder, and rebuild my corpus with a mixture of txt and pdf files. 
It would be a pity to have to do that, though. The ability to add files on the fly to an existing corpus would avoid a lot of this sort of manual work.  

Laurence Anthony

Feb 6, 2022, 10:24:03 AM
to ant...@googlegroups.com
Hi Emma,

As you have noticed, AntConc is converting the PDFs on the fly when building the corpus. Normally, the corpus is static so you only have to do this once (making AntFileConverter redundant). But, if you are constantly adding files, then it would make sense to convert all your PDFs to text at the start. This is good practice in my opinion because you can then see exactly what is going into the corpus.

Then, there would be no need to mix PDFs and text files. Your source folder would only contain text files and rebuilding the corpus would then be very fast. In effect, you'd be recreating the environment that you have to create to use the old version of AntConc (where PDFs have to be converted before they can be used).

Laurence.
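
With a text-only source folder, each PDF needs converting only once. The sketch below illustrates that idea under stated assumptions and is not how AntConc or AntFileConverter work internally: the function names `convert_new_pdfs` and `pdftotext_convert` are hypothetical, and the default converter assumes the external `pdftotext` command-line tool (from Poppler or Xpdf) is installed. Any other converter can be passed in instead. Only PDFs with no text file yet, or with a text file older than the PDF, are converted.

```python
import subprocess
from pathlib import Path


def pdftotext_convert(pdf_path, txt_path):
    """Convert one PDF using the external `pdftotext` tool (assumed installed)."""
    subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)],
        check=True,
    )


def convert_new_pdfs(pdf_dir, txt_dir, convert=pdftotext_convert):
    """Convert PDFs in pdf_dir to .txt files in txt_dir, skipping up-to-date ones.

    Returns the list of text files that were (re)generated.
    """
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    # Note: this pattern is case-sensitive on most non-Windows systems.
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        txt = txt_dir / (pdf.stem + ".txt")
        # Skip the PDF if its text file exists and is not older than the PDF.
        if txt.exists() and txt.stat().st_mtime >= pdf.stat().st_mtime:
            continue
        convert(pdf, txt)
        converted.append(txt)
    return converted
```

Calling `convert_new_pdfs("papers", "antconc_source")` (both folder names are placeholders) after adding a few PDFs converts just those files, so the subsequent corpus rebuild works from text files only.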



José Berjano

Feb 6, 2022, 11:43:10 AM
to ant...@googlegroups.com
Hi,

Could I be removed from this conversation?

I don't work with antconc. 

Thank you.

José Berjano

JFlorian

Feb 11, 2022, 4:01:44 PM
to ant...@googlegroups.com
Jose,

At the bottom of every email is this message:
You received this message because you are subscribed to the Google Groups "AntConc-Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antconc+u...@googlegroups.com.

Just follow the instructions to unsub from this list.

Cordially,
Judy

Timothy Barton

Jan 10, 2023, 6:19:46 AM
to AntConc-Discussion
Chiming in on an old thread I found while looking up how to add files to an existing corpus. Would you consider adding a button or a pop-up window or something when PDFs are converted, asking whether we'd like AntConc to save the txt files for future reference?

Laurence, you mentioned that adding additional reference files is a niche case, but you have a lot of users who are editors or translators, and I think we use it like this often. For me at least, it's almost always how we work. Perhaps there could be a way for you to allow us to keep the texts already added, with the caveat that we have to keep them in the same location, so that AntConc can still find them? It's not too bad if all my files are in the same place, but sometimes I work with a style guide that's in one folder, a translation memory that's in another and a collection of documents that's in another folder. Adding these manually each time is very slow.

Tim

Laurence Anthony

Jan 10, 2023, 7:06:19 AM
to ant...@googlegroups.com
Hi Tim,

Am I right in assuming that you want to keep the .txt files so that you can add additional files and re-build the corpus? To this aim, I'm actually thinking now how to allow users to simply add files to an existing corpus, or delete files from an existing corpus. If this was possible, would you still want AntConc to save a .txt version of each file as well?

On this line of thought, would it be better to simply have AntConc be able to generate a .txt version of the entire corpus as a batch process through a file option at any time? This would mean you could have complete control of when and where the .txt files are generated. 

Let me know which option you prefer. Both are possible, but my preference is probably for adding the first option as soon as possible and maybe adding the second option later.

Laurence.


