Question about metadata


Jeff Prucher

Mar 4, 2026, 11:32:21 PM
to AntConc-Discussion
I've read the 2023 thread about metadata, and it explains most of it well, but I don't fully understand the instruction to align the doc_id in the metadata.csv file with the doc_id in the corpus db. To do that, I'd have to know the corpus ids for each file ahead of time.

I did some looking into the db using SQLite, and I think, but could use confirmation, that the doc_ids are assigned in ascending alphanumeric order, first by folder and then by file name, with the caveat that numbers are sorted numerically rather than lexicographically (so "9" sorts before "10"). Additionally, punctuation sorts after numbers and letters, which is different from the sort order in Google Sheets (which, whatever) or BBEdit, which puts punctuation first.
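
A minimal sketch of the ordering described above, for pre-sorting a metadata file outside AntConc. It is not AntConc's own code: the function names are hypothetical, and placing digits before letters and ignoring case are assumptions that go beyond what the post states. The result should be checked against the file list in the corpus manager.

```python
import re
from pathlib import PurePosixPath

# Split a name into runs of ASCII digits or single non-digit characters.
_CHUNKS = re.compile(r"(\d+)|(\D)", re.ASCII)


def natural_key(name):
    """Sort key: numbers compared numerically, punctuation after letters.

    Assumption: digits sort before letters and case is ignored.
    """
    key = []
    for match in _CHUNKS.finditer(name):
        digits, other = match.groups()
        if digits is not None:
            key.append((0, int(digits), ""))      # numeric run, e.g. 9 < 10
        elif other.isalnum():
            key.append((1, 0, other.lower()))     # letter
        else:
            key.append((2, 0, other))             # punctuation sorts last
    return key


def path_key(path):
    """Sort by each folder name first, then by file name."""
    p = PurePosixPath(path)
    return ([natural_key(part) for part in p.parent.parts], natural_key(p.name))


print(sorted(["file10.txt", "file9.txt", "file_a.txt", "filea.txt"], key=natural_key))
```

With these assumptions the example prints `['file9.txt', 'file10.txt', 'filea.txt', 'file_a.txt']`. Sorting the metadata rows with `path_key` and comparing the first and last few rows against the corpus manager's file list is a quick way to see whether the assumed order matches.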

So two questions, I guess:
1. Am I correct that doc_id is assigned in alphanumeric order?
2. And if so, what is the actual order so I can tell BBEdit how to sort my metadata file? 

I have 40,000 files in my corpus so I can't just manually move a few rows around.

Thanks,
Jeff Prucher

Laurence Anthony

Mar 4, 2026, 11:50:52 PM
to ant...@googlegroups.com
Hi Jeff,
Yes, the ordering is exactly as you describe. You can test it by looking at the order of the files as they appear in the file list in the corpus manager. 
But, I should say that I'm currently working on an upgrade to the whole metadata mechanism in AntConc. My plan is the following:
1) The doc_id is determined by the file name only (no parent folder names, or file extensions included)
e.g. S820429, S820423, S498348, ...
This means that the ordering of the files becomes irrelevant.
2) The metadata table needs to simply provide a doc_id column, and any other doc level info, which can then be searched.
e.g. 
doc_id, category, genre
S820429, maths, journal
S820423, physics, presentation
...

It basically matches the system used in corpora like BNC.
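
A minimal sketch of how the planned scheme could be used, assuming the doc_id is simply the file name with its folder and extension removed. The helper names and the example path are hypothetical, and how AntConc will treat names containing several dots is not stated in this thread.

```python
import csv
import io
from pathlib import Path


def doc_id_from_path(path):
    """Assumed rule: file name only, no parent folders, no extension."""
    return Path(path).stem


# Example metadata table, using the rows from the message above.
METADATA_CSV = """doc_id, category, genre
S820429, maths, journal
S820423, physics, presentation
"""


def load_metadata(text):
    """Index metadata rows by doc_id so that row order no longer matters."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return {row["doc_id"]: row for row in reader}


metadata = load_metadata(METADATA_CSV)
row = metadata[doc_id_from_path("corpus/science/S820429.txt")]
print(row["category"], row["genre"])
```

Because each row is looked up by doc_id, the metadata file can be kept in any order, which is the point of the proposed change.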

Would this work well for you?

Laurence.




###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################


--
You received this message because you are subscribed to the Google Groups "AntConc-Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antconc+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/antconc/44951ce6-6350-4466-b526-eb3d9517f037n%40googlegroups.com.

Jeff Prucher

Mar 5, 2026, 5:33:17 PM
to AntConc-Discussion
Yes, I think that would be much more straightforward. The only issue I can see would be collisions between file names in different directories; as long as there was some kind of notification about it so you could rename a file, I think it would be easier to use.
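
A minimal sketch of a collision check that could be run on a corpus folder before loading it, assuming the doc_id would be the file name without its extension. The function names are hypothetical and this is not part of AntConc.

```python
from collections import defaultdict
from pathlib import Path


def find_stem_collisions(paths):
    """Group paths by file name without extension; keep groups with duplicates."""
    by_stem = defaultdict(list)
    for p in paths:
        by_stem[Path(p).stem].append(str(p))
    return {stem: files for stem, files in by_stem.items() if len(files) > 1}


def corpus_files(root):
    """List every file under root, including files in subfolders."""
    return sorted(p for p in Path(root).rglob("*") if p.is_file())


print(find_stem_collisions(["a/x.txt", "b/x.txt", "a/y.txt"]))
```

Calling `find_stem_collisions(corpus_files(root))` on a real corpus folder lists the names that would need renaming. Under the assumed rule, two files in the same folder that differ only by extension would also be reported.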

Jeff

Laurence Anthony

Mar 5, 2026, 7:25:51 PM
to ant...@googlegroups.com
Yes, good point. The problem with user-created corpora is that you never know how they will prepare them. But, even though I could easily auto-rename the files on input so that they are always unique, the user might then get confused about what they are looking at. They also often don't want to mess about with their files and perhaps won't know how to quickly rename hundreds or thousands of them. So, it's always a bit of a balancing act. Let me think about this a bit more.

Jeff Prucher

Mar 10, 2026, 1:09:04 PM
to ant...@googlegroups.com
I've gotten my metadata working, so thank you for all the help. I'm posting my solution in case someone else has the same problem.

Basically, if you can't figure out how to align your file names with AntConc's doc_id (e.g., due to personal confusion or having a lot of really dumb file names that sort differently in every system you try -- I'm not saying which was my problem, but it might have been both):

1. Load the files (without metadata) into an AntConc corpus
2. Open the corpus .db file in some kind of SQL editor (I used https://sqlitebrowser.org/)
3. Run this query: "select doc_id, doc_file_name from docs"
4. This will give you the doc_id for every file name
5. Export this list as .csv or .tsv
6. As long as your metadata is keyed to filenames, you should now be able to add the metadata to this file and load it into a new corpus
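
The steps above can also be scripted. This is a minimal sketch using Python's built-in sqlite3 and csv modules. The table and column names (docs, doc_id, doc_file_name) come from the query in step 3; the function names are hypothetical, and whether doc_file_name holds a bare file name or a full path is an assumption to check against the exported values.

```python
import csv
import sqlite3


def export_doc_ids(db_path):
    """Steps 2-4: read (doc_id, doc_file_name) pairs from the corpus .db file."""
    conn = sqlite3.connect(db_path)
    try:
        query = "SELECT doc_id, doc_file_name FROM docs ORDER BY doc_id"
        return conn.execute(query).fetchall()
    finally:
        conn.close()


def merge_metadata(doc_rows, metadata_by_file, fields):
    """Step 6: attach metadata keyed by file name to each doc_id.

    Returns the merged rows and the file names that had no metadata.
    """
    merged, missing = [], []
    for doc_id, file_name in doc_rows:
        meta = metadata_by_file.get(file_name)
        if meta is None:
            missing.append(file_name)
            continue
        merged.append({"doc_id": doc_id, **{f: meta.get(f, "") for f in fields}})
    return merged, missing


def write_metadata_csv(rows, fields, out_path):
    """Step 5: write the result as a .csv with doc_id as the first column."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["doc_id", *fields])
        writer.writeheader()
        writer.writerows(rows)
```

A typical run would pass the result of `export_doc_ids` to `merge_metadata` along with a dictionary built from the existing metadata file, then review the `missing` list before writing the CSV. With 40,000 files, that list is the quickest way to spot file names that failed to match.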

Jeff Prucher

Laurence Anthony

Mar 10, 2026, 1:13:19 PM
to ant...@googlegroups.com
Those are great instructions! Note that you could load the metadata table back into the corpus that you extracted the doc_ids from and it will work there, too. 

Saying that, I'm putting the final touches to AntConc 4.4, which now handles metadata much more elegantly, with a viewer and selector directly in the main window.

Laurence.


