Searching across many corpora


Forrest McSweeney

Sep 3, 2024, 11:43:16 PM
to AntConc-Discussion
Hello,

I am looking to use AntConc to perform some keyword searches across a corpus I have made. I have installed the program and watched your tutorial videos on the "Corpus Manager," but I still have a significant question: can AntConc perform text searches across multiple corpora? I have nearly 1000 corpora of ancient Chinese medical books. Each corpus is a subdirectory named after the book's title and containing that book's text files, and all of the subdirectories sit inside one huge macrocorpus directory. I need to search across all of them for keywords. It is clear what AntConc can do for an individual corpus, but can it do the same over many? Do I need to combine all of the text files from the individual directories into a single corpus?

Thank you for your help,

Forrest 

Laurence Anthony

Sep 3, 2024, 11:45:43 PM
to ant...@googlegroups.com
Hi Forrest,

When you say you have "1000 corpora", do you mean "1000 files"? Or do you actually mean "1000 different sets of multiple files"?

Once you confirm this, I'll try to answer your question.

Regards,

Laurence.



###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################



Forrest McSweeney

Sep 4, 2024, 12:14:48 AM
to AntConc-Discussion
Hello Professor Anthony,

I mean 1000 different sets of text files. The sets are all organized into separate directories, and the name of each directory is the title of the text to which its files belong. I did not organize it this way, so my hands are nearly tied.

For clarification, the directory layout looks like this:

macrocorpus/
    corpussubdirectory(title1)/
        textfile(s)_containing_content(title1)
    corpussubdirectory(title2)/
        textfile(s)_containing_content(title2)
    ...etc.

Thank you,

Forrest

Troy E. Spier

Nov 22, 2024, 4:17:53 AM
to AntConc-Discussion
With so many different files, you might have better luck loading them all at once, conducting your search, saving the results to a text file, and then parsing the text file itself. It wouldn't seem to make much sense to combine all the files from Corpus 1 into a single file, Corpus 2 into a single file, etc., as you would then only be looking at the level of the corpus and not the individual files in each corpus.
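
One way to load everything at once without losing track of which book each file came from is to copy the files into a single flat folder first, prefixing each filename with the name of its book directory. Below is a minimal Python sketch of that idea. It assumes the layout described above (one subdirectory per title, with .txt files inside); the function name `flatten_corpus`, the `__` separator, and the folder names are illustrative choices only, not anything AntConc requires.

```python
import shutil
from pathlib import Path


def flatten_corpus(macro_dir, out_dir, sep="__"):
    """Copy every .txt file under macro_dir into the single folder out_dir.

    Each copy is renamed "<book directory><sep><original name>", so the
    book title stays visible wherever file names are listed.
    out_dir should live outside macro_dir so copies are not re-scanned.
    Returns the list of new file paths.
    """
    macro = Path(macro_dir)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(macro.rglob("*.txt")):
        rel = src.relative_to(macro)
        if len(rel.parts) < 2:
            continue  # skip loose files sitting directly in macro_dir
        # The first path component below macro_dir is the book title.
        book = rel.parts[0]
        # Fold any deeper subfolders into the name to avoid collisions.
        dest = out / (book + sep + "_".join(rel.parts[1:]))
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

Calling `flatten_corpus("macrocorpus", "macrocorpus_flat")` would leave the original directories untouched and produce one folder that can be loaded in a single step, with results still traceable to the individual file and book.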

dl6...@gmail.com

Jan 8, 2025, 6:15:30 AM
to AntConc-Discussion

I am returning to AntConc in my Ubuntu workflow, and to this forum.
I have installed AntConc 4.3.1 after a two-year break from using it, so I need to refresh my knowledge.
Regard me as a beginner.
I am interested in a game plan for scanning a constellation of datasets.
For the more modest job of scanning notebooks on my desktop, I use the Recoll search engine.
This works very well. I found that I had over 500 notebooks of one particular MIME type on my desktop.
They were CherryTree notebooks; CherryTree offers a hierarchy of notes per file.
In Recoll, hover over the query field to see the "cheat sheet".

My thinking is that if the OP has 1000 corpora in a "huge macrocorpus directory", these could first be indexed (in whole or in part) using Recoll. Note, however, that plenty of free disk space (or a second drive) is needed to hold the Recoll index. Target directories can be selected for indexing, so the entire desktop need not be indexed.

Recoll, powerful as it is, does not offer the features of AntConc, but it might be used in a toolchain in concert with AntConc. Recoll would run the search at the beginning; a Python script would then use recollq (Recoll's command-line query tool) to present AntConc with a list of files for the next stage of processing. Some conversion to text will be required in the toolchain.
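
A rough sketch of what the Recoll end of such a toolchain might look like in Python. It assumes that `recollq` is on the PATH, that its `-b` option prints one bare `file://` URL per hit, and that the Recoll index already covers the corpus directory. The function names and the idea of writing the hits to a list file are made up for illustration. AntConc itself is not scripted here; the sketch only produces the list of matching files to load next.

```python
import subprocess
from pathlib import Path


def parse_recollq_urls(output):
    """Turn recollq's bare URL output into a list of local paths."""
    paths = []
    for line in output.splitlines():
        line = line.strip()
        # Keep only local-file hits; skip headers and other URL schemes.
        if line.startswith("file://"):
            # Assumption: the path is stored as-is, not percent-encoded.
            paths.append(Path(line[len("file://"):]))
    return paths


def recoll_hits(query):
    """Run recollq for the query and return the matching file paths."""
    result = subprocess.run(
        ["recollq", "-b", query],
        capture_output=True,
        text=True,
    )
    return parse_recollq_urls(result.stdout)


def write_file_list(paths, list_file):
    """Save one path per line for the next stage of the toolchain."""
    Path(list_file).write_text(
        "\n".join(str(p) for p in paths) + "\n", encoding="utf-8"
    )
```

Something like `write_file_list(recoll_hits("some keyword"), "hits.txt")` would then give a plain list of candidate files. recollq may cap the number of results it returns, so it is worth checking its help output for the result-count option before relying on the list being complete.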

This is just a flash of an idea. The same thinking applies to scanning other MIME types in Recoll, such as email archives, and then passing the results on to AntConc in a toolchain.

I am an advocate of using toolchains.

My first thought on launching AntConc was how to improve the rather bland look and feel. I learned today, by searching, that styles can be tweaked through QtCreator: adding a dark mode, for example, and perhaps larger fonts for these old eyes.

dl6...@gmail.com

Jan 8, 2025, 6:22:06 AM
to AntConc-Discussion