Automatically compiling glossaries with GPT-4 and WebPilot

Tom Gally

May 27, 2023, 9:33:10 AM
to hon...@googlegroups.com
After Honyaku started around 1994, and earlier on the Twics BBS that preceded it, a topic that came up frequently was the compilation of glossaries. Some early Honyakkers shared glossaries with each other, and after personal websites became possible, some people posted their glossaries to their sites. I don’t know how important such hand-built glossaries are to translators these days—search engines and Wikipedia may have replaced them for many people—but it is now possible to compile glossaries automatically with GPT-4.

Below are a couple of exchanges I had just now with GPT-4 using a free plugin called WebPilot. I asked it to extract bilingual pairs from English and Japanese Wikipedia pages about the same topic.



I haven’t gotten around to figuring out how to make my own GPT-4 plugins yet. I imagine, though, that it should be possible to automate the process on a large scale: have the plugin automatically find Japanese/English pairs of Wikipedia pages, compile a glossary for each pair of pages, write that glossary to an external database, and repeat.
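To give a rough idea of what I mean, here is an untested sketch in Python of such a pipeline, calling the API directly rather than going through a plugin. It assumes the requests library, the MediaWiki langlinks and extracts queries, and the OpenAI chat-completions endpoint; the prompt wording, the example topic, and the JSON-file "database" are only placeholders.

import json
import os
import requests

WIKI_API = "https://{lang}.wikipedia.org/w/api.php"

def japanese_title(english_title):
    """Return the Japanese Wikipedia title linked from an English page, if any."""
    r = requests.get(WIKI_API.format(lang="en"), params={
        "action": "query", "titles": english_title,
        "prop": "langlinks", "lllang": "ja", "format": "json",
    })
    for page in r.json()["query"]["pages"].values():
        for link in page.get("langlinks", []):
            return link["*"]
    return None

def page_extract(lang, title):
    """Fetch the plain-text introduction of a Wikipedia page."""
    r = requests.get(WIKI_API.format(lang=lang), params={
        "action": "query", "titles": title, "prop": "extracts",
        "explaintext": 1, "exintro": 1, "format": "json",
    })
    return next(iter(r.json()["query"]["pages"].values())).get("extract", "")

def extract_term_pairs(en_text, ja_text):
    """Ask GPT-4 to list corresponding English/Japanese terms, one pair per line."""
    prompt = (
        "The following two texts describe the same topic in English and Japanese. "
        "List the technical terms that correspond to each other, one pair per "
        "line, in the form 'English term<TAB>Japanese term'.\n\n"
        "ENGLISH:\n" + en_text + "\n\nJAPANESE:\n" + ja_text
    )
    r = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]},
        json={"model": "gpt-4",
              "messages": [{"role": "user", "content": prompt}]},
    )
    return r.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    en_title = "Photosynthesis"            # placeholder example topic
    ja_title = japanese_title(en_title)
    if ja_title:
        pairs = extract_term_pairs(page_extract("en", en_title),
                                   page_extract("ja", ja_title))
        with open("glossary.json", "w", encoding="utf-8") as f:
            json.dump({"topic": en_title, "pairs": pairs.splitlines()},
                      f, ensure_ascii=False, indent=2)

Looping that over a list of page titles—or over all English articles that have Japanese language links—would give the large-scale version.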

It should also be possible to have it create something more dictionary-like, with parts-of-speech labels, multiple senses, example sentences, etc. I don’t know if people will be needing dictionaries anymore, though. Ouch.
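Just to illustrate what I have in mind by "dictionary-like," an extracted entry might be structured something like this (the field names are only my own invention, not any standard format):

entry = {
    "headword": "光合成",
    "reading": "こうごうせい",
    "pos": "noun",
    "senses": [
        {
            "gloss": "photosynthesis",
            "example_ja": "植物は光合成によって酸素を作り出す。",
            "example_en": "Plants produce oxygen through photosynthesis.",
        },
    ],
}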

Tom Gally

Oroszlany Balazs

May 27, 2023, 10:07:22 AM
to hon...@googlegroups.com
Very interesting. 

Would you get a similar result if you removed the hyperlinks and cross-references, i.e., uploaded both articles as plain text and compared those?

Balazs Oroszlany

Matthew Schlecht

May 27, 2023, 10:23:52 AM
to hon...@googlegroups.com
I miss those days when listmembers shared their glossary content on-list!

This might indeed be one crest of the wave of the future in our profession.
I'll start with two caveats, and then comment more positively.
<> Depending on the content, timeliness, and any political/culture war overtones, the use of Wikipedia pages for reliable content is an iffy proposition at best.
<> AI has the tendency to produce some answer, even if not the correct answer. It seems that AI "wants" to please and "wants" to seem intelligent, even if it must resort to producing alternative facts.

Where something like this would be extremely helpful in my JA>EN work is probably not so much with the content of Wikipedia pages, but rather with the content of extant (and internet-accessible) glossaries, or of publications in which some/many of the Japanese terms are glossed in English.
A non-comprehensive list would be:
plant (especially weed) names
insect (especially pest) names
plant disease (fungal, bacterial, viral) names 
agricultural terminology
macro- and micro-anatomical features (both human and other species)
disease names (both human and other species)
symptom names (humans)
clinical trial terminology
laboratory techniques and equipment

If it were feasible and not economically prohibitive, I could see configuring a digital assistant to run searches on a regular basis and store the gloss pairs for future use. That's because the content comes and goes, and newer content is always coming along. My first instinct would be to vet the pairs myself, but perhaps I could teach my digital assistant to vet the pairs itself by looking for and confirming occurrences in other sources (three would be nice).
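In rough Python terms, the vetting step I have in mind might look something like the following, where search_sources() is only a placeholder for whatever search API or local corpus the assistant would actually query:

MIN_CONFIRMATIONS = 3   # three independent confirmations would be nice

def search_sources(query):
    """Placeholder: return the text of documents found for the query."""
    raise NotImplementedError("plug in a real search API or local corpus here")

def confirmed(en_term, ja_term):
    """Check whether the gloss pair co-occurs in enough independent sources."""
    hits = sum(1 for doc in search_sources(en_term) if ja_term in doc)
    return hits >= MIN_CONFIRMATIONS

def vet(pairs):
    """Keep only the gloss pairs that pass the confirmation check."""
    return [(en, ja) for en, ja in pairs if confirmed(en, ja)]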

One content area for which I would advise much caution in using an AI assist would be chemical names, on which I have commented on other occasions in this list. This is because of the huge amount of garbage content available, and because of AI's "desire" to produce some answer, even if not correct.

For a while, we had a separate Honyaku glossary compendium prepared and made available by one of the listmembers.
That was a great help while it lasted, but software evolution and the time and effort needed to maintain it led to its demise.
Maybe an AI assistant could make this come true for us again - a Honyaku list glossary of terms with context derived from the list content. 

Matthew Schlecht, PhD
Word Alchemy Translation, Inc.
Newark, DE, USA
wordalchemytranslation.com

Tom Gally

May 27, 2023, 8:27:46 PM
to hon...@googlegroups.com
Yes, it is possible to upload your own documents to GPT-4 and have it extract glossaries from them.

However, there is a limit to the amount of text it can accept at one time—about four thousand words of English. Beyond that, it either refuses to do anything or, in the case of chats, starts forgetting what came earlier. Any glossary-creating program would have to process the documents in pieces.
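The chunking itself is trivial; in Python, something like this rough word-count-based split would do (a real tool would count tokens rather than words and try to break at paragraph boundaries):

def split_into_chunks(text, max_words=4000):
    """Split a long document into pieces of roughly max_words words each."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

Each chunk would then be sent to the model separately, and the partial glossaries merged and de-duplicated afterward.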

Anthropic’s Claude LLM supposedly has a much larger input window, but I have not been able to get access to it yet.

The extracted glossaries will not be perfect, of course. But GPT-4’s amazing ability to infer the meanings of words from context should make them very useful to human translators.

A lot of work is going on in various places to develop LLMs, both commercial and open-source, that can be run locally—thus alleviating the confidentiality issues with uploading documents to a cloud-based commercial service. I have read that companies are also developing LLMs that can be custom-trained on a client’s document repository.

This week, I’m going to be giving a couple of online workshops for the staff members at the university where I am still employed part-time. I plan to show them how they can use GPT-4 in their work: translating and summarizing documents, writing and polishing e-mails, composing Excel functions, brainstorming, etc. I am sure that they will be able to use it for many tasks, but it would be even better if it could be trained on the university’s huge volume of internal documentation.

Tom Gally