Can I generate a wordlist from a literary text in Toolbox?


Ian Scales

May 9, 2020, 1:47:22 AM
to Shoebox/Toolbox Field Linguist's Toolbox
I had a vague idea that I could generate a word list from a literary text within Toolbox. 

However, on reading the Toolbox Reference Manual (and not making much sense of it) I am doubting that.

Is there a way to do it? I don't need anything fancy. Just a list of each space-delimited text string ("word"), preferably in alphabetical order, and ideally with frequency of occurrence of each word found. 

The text is a 20,000 word txt file - actually an old local court transcript in an Oceanic language. 

BTW I am a long-time Toolbox user, but only for vernacular dictionary compilation. Never tried text analysis before. 

Tony Naden

May 9, 2020, 4:16:14 AM
to shoeboxtoolbox-fiel...@googlegroups.com
1. Break/number the text under Tools.
2. Set up a Text Corpus under Project, referring to the markers you assigned with 'break/number'.
3. Use Word List under Tools, with the Corpus and the text marker that you set up in steps 1 and 2.

--
You received this message because you are subscribed to the Google Groups "Shoebox/Toolbox Field Linguist's Toolbox" group.
To unsubscribe from this group and stop receiving emails from it, send an email to shoeboxtoolbox-field-ling...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/shoeboxtoolbox-field-linguists-toolbox/dd237881-58d4-411f-acea-a11b7913c336%40googlegroups.com.


--
Address: "Lost Marbles", 31, Reading Road,
Pangbourne, Berks., RG8 7HY  -

Tel.: 01189842368

Keep us, good Lord,
under the shadow of your mercy
in this time of uncertainty and distress.
Sustain and support the anxious and fearful,
and lift up all who are brought low;
that we may rejoice in your comfort
knowing that nothing can separate us from your love
in Christ Jesus our Lord.
Amen.


ToolBox Support

May 9, 2020, 9:23:22 AM
to shoeboxtoolbox-fiel...@googlegroups.com
Thank you, Tony, for a quick (and accurate) response.

Actually, though, you can do a wordlist without any markers in the text. It has to be a plain text file, not a Word document or anything else with formatting.
Toolbox will produce a Word List; you just can't have any references. So I set up the text corpus like this:

[image.png: screenshot of the Text Corpus setup, not preserved]
and then specified my files list. It would be important to specify a Language Encoding (the second item requested) so Toolbox would display the data in the right font -- especially if the data is legacy data.

For the Word List box itself, you should specify no references. But then I got the following:

[image.png: screenshot of the result, not preserved]

The really big advantage of references is that in the Word List you can Right Click on a Reference and Toolbox will jump to the text (if it's loaded in Toolbox) and will highlight the word, which makes the Word List almost as good as a Concordance, and better in some ways.

Ian, about the Reference Manual. It's my opinion that much of it doesn't make much sense. Any suggestions you (or anyone else) might have are welcomed. One thing that might help clarify a bit is the section on "Text, Organizing and Preparing". It's actually under Interlinear and should at least be referenced by the Word List / Text Corpus section. Anyway, look for it on page 191 of the Reference Manual. Several of the following sub-headings describe the process of automatically setting up reference markers, called "Breaking and Numbering" or just "Numbering", for a text.

Also, if you would like help getting the text corpus set up, or getting the text broken and numbered, I'll be glad to help. Contact me through this list or directly at Toolbox @ sil.org.

Tony, I loved the prayer at the end of your message!

Karen
Toolbox Support


Ian Scales

May 9, 2020, 6:09:43 PM
to Shoebox/Toolbox Field Linguist's Toolbox
Thanks Tony and Karen for your answers. I can understand Karen's procedure a bit more. Actually, it's this kind of concrete example with the screenshots that would make the reference manual so much better. 

By the way, between posting my question here and getting an answer, I also found an unrelated software app from Freie Universität Berlin called TextSTAT that "reads plain text files ... [and] produces word frequency lists and concordances from these files."
This is simple to use and does the job too. 

Let me also take the opportunity to thank Karen for her long-time dedication to running support for Toolbox here over the years.

Mugung Hwa

May 9, 2020, 9:07:32 PM
to shoeboxtoolbox-fiel...@googlegroups.com
If you or one of your colleagues is familiar with Unix or PowerShell, Stack Overflow has two solutions outside of Toolbox:


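Below is Doug McIlroy's celebrated Unix pipeline (it reads the text on standard input and takes the number of words to print as its first argument, ${1}); Windows 10 can run Ubuntu these days through the Windows Subsystem for Linux.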

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

The Stack Overflow post and its responses show a number of different PowerShell approaches.  Good luck!
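
For the narrower request at the top of the thread (every space-delimited string, in alphabetical order, with a count), a variant of the pipeline above might look like the sketch below. It assumes a POSIX shell with the standard tr, grep, sort and uniq; the function name wordlist and the sample tokens are made up for illustration. Unlike the A-Za-z version, it splits only on whitespace, so accented letters and apostrophes stay inside the word, but so does any attached punctuation.

```shell
#!/bin/sh
# wordlist (hypothetical name): read text on stdin and print
# "count token" for every whitespace-delimited string, sorted alphabetically.
wordlist() {
    tr -s '[:space:]' '\n' |   # one token per line; squeeze runs of whitespace
    grep -v '^$' |             # drop the empty line left by leading whitespace
    sort |                     # alphabetical; exact order depends on the locale
    uniq -c                    # prefix each distinct token with its count
}

# Made-up sample input; for a real file use: wordlist < transcript.txt
printf 'bb aa\n  aa  bb bb\n' | wordlist
```

To rank by frequency instead of alphabetically, the result can be piped through sort -rn, as in the pipeline above.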

 

Nick Thieberger

May 10, 2020, 3:10:00 AM
to shoeboxtoolbox-field-linguists-toolbox
For corpus-type functions such as word counts, you could use AntConc, which is free and regularly updated.

Nick

John R. Rennison

Aug 2, 2020, 5:06:15 PM
to shoeboxtoolbox-fiel...@googlegroups.com
I have recently been encouraged to make an Android app of my Koromfe
dictionary, which is maintained in a fairly complex Toolbox file. Some
of you may know the paper version, on

https://www.univie.ac.at/linguistics/personal/john/dict_A4.pdf

Webonary only accepts Flex files, but it seems that my file is simply
too big and/or complex for Flex. Setting up the custom fields (e.g. for
French and German glosses) is a pain, and it just doesn't work.

Does anyone have any relevant experience?

ToolBox Support

Aug 2, 2020, 6:38:44 PM
to shoeboxtoolbox-fiel...@googlegroups.com
Hi, John. I don't know about size limits (congratulations on creating such a complex dictionary!), but LexiquePro will put a dictionary into the LIFT format which works for Webonary. It's a way to avoid FLEX.

It's very possible that LexiquePro has size limits also, though, or even that Webonary does.

I would expect LexiquePro to be able to handle the four languages. It's based on the MDF model which assumed three gloss languages (which it named "English", "national" and "regional"). LP has little to no support at this time, but I can help with dealing with some of the oddities of that program that you are bound to encounter. 

How large is your dictionary? (I see the version you linked to has 203 pages, but I'm not sure what that amounts to for file size and your intro sounds like you planned to add a bunch more French.) What size does FLEX choke on? If it's close, maybe extensive use of abbreviations of some sort would help. (I could help you make the changes.)

Have you considered publishing in three editions, as Koromfe-English and as Koromfe-French and as Koromfe-German? Toolbox can export just specific fields, so you could export for each (second) language. (I agree it's not a nice solution.)

Those are a few initial thoughts. Let me know if I can help. 

Karen
Toolbox Support



Tony Naden

Aug 3, 2020, 4:08:36 AM
to shoeboxtoolbox-fiel...@googlegroups.com
LP had no problems outputting my Ghana Kusaal dictionary at 129 pages, file 15.5 MB.  The Mampruli was 1095 pages and 730 MB (because: pictures); LP handled that OK as well, but MS Word struggled to remain stable with the export. I had to split it into fascicles to edit, but after the final struggle to merge I have got a usable biggy-big file.

John R. Rennison

Aug 5, 2020, 11:53:30 AM
to shoeboxtoolbox-fiel...@googlegroups.com
Thank you Karen (nice to hear from you again) and Tony for your quick
responses. Your advice prompted me to try Lexique Pro yet again, and
surprisingly it worked – apart from the ordering of the example fields.
If a record had more than one example, the “example” bundles (which
include some markers that I added) were re-ordered as

\xv 1

\xv 2
\xe 1
\xf 1
\xg 1
\xnt 1
\xps 1

\xv 3
\xe 2
\xf 2
\xg 2
\xnt 2
\xps 2

etc.

I discovered that the correct ordering can be achieved if \xps (=part of
speech of the example) is brought to the top of the bundle – even though
it still appears AFTER the rest of the bundle in Lexique Pro. Clearly
I’ll need a CCT to re-order these bundles for Lexique Pro vs. RTF output.
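
As a rough illustration of the re-ordering described above, here is a sketch in awk rather than a CC table. It assumes every example bundle starts at \xv, that every following line beginning with \x belongs to the same bundle, and that fields are not wrapped over several lines; the function name xps_first and the sample record are hypothetical.

```shell
#!/bin/sh
# xps_first (hypothetical name): within each example bundle, print the
# \xps field first and leave the other fields in their original order.
# Assumptions: a bundle starts at \xv; every following line that begins
# with \x belongs to it; each field sits on a single line.
xps_first() {
    awk '
    function flush(   i) {
        for (i = 1; i <= n; i++) if (buf[i] ~ /^\\xps( |$)/) print buf[i]
        for (i = 1; i <= n; i++) if (buf[i] !~ /^\\xps( |$)/) print buf[i]
        n = 0
    }
    /^\\xv( |$)/    { flush(); buf[++n] = $0; next }  # a new bundle starts
    n > 0 && /^\\x/ { buf[++n] = $0; next }           # still inside the bundle
                    { flush(); print }                # any other line ends it
    END             { flush() }
    '
}

# Made-up sample record; for a real file use: xps_first < dictionary.txt
xps_first <<'EOF'
\lx sample
\xv 1
\xe 1
\xps 1
\xv 2
\xe 2
\xps 2
EOF
```

Each bundle comes out with its \xps line first, followed by \xv and the remaining fields. A real CC table would have to cover whatever the sketch ignores, such as wrapped field contents.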

Unfortunately, HTML export does not work, and Flex refuses to even start
importing the exported .lift file. I’ll report again once I have a
usable CCT, if there is anything worth reporting.

Incidentally, re. length of file: My own experience is pretty much the
same as Tony’s, except that sometimes Word 2000 does manage to open the
RTF file that Toolbox exports. My solution is to open the RTF file with
Word 365 and export it as a “Word 97-2003 document”. Then I can run the
“FinishExporting…” macro.

Thanks also for the hint that .lift is enough for Webonary. I’ll be
happy to give Flex a miss. Actually, I was criticised recently for using
“15 year old software”. In fact, I’ve been using the Windows version of
Shoebox/Toolbox for at least 24 years, and the DOS version before that,
and I’m proud of it. Good work, dear developers!

I will refrain from adding more about Flex, except to ask how Microsoft
would fare if they brought out a new version of Word that was completely
incompatible with previous ones.

All the best,
John

ToolBox Support

Aug 6, 2020, 11:14:41 AM
to shoeboxtoolbox-fiel...@googlegroups.com
Thanks, John, for your kind words about Toolbox. 

You should be able to run the Finish Exporting macro from Word 365, but they hide it. A tab appears labelled "Add-ins", to the far right next to the "Help" tab. If you click on it, you will see the "Finish exporting from Shoebox" macro. Though I understand if you prefer the non-ribbon Word 2000.

Do you need help creating the CC table? I'm available, if so.

Karen
Toolbox Support



hattonjohn gmail

Aug 6, 2020, 6:27:42 PM
to shoeboxtoolbox-fiel...@googlegroups.com
> I will refrain from adding more about Flex, except to ask how Microsoft
> would fare if they brought out a new version of Word that was completely
> incompatible with previous ones.

Ah, but see, how many versions of MS Word would we get if we received it (and support like Karen's) for free for 24 years? Some folks might not realize that almost everyone in SIL receives no salary; instead, we raise our own funds to give you Toolbox and FLEx for free.  Even our office space and equipment come from the funds we individually raise. We don't get grants (though we try and are still trying!). That would be my answer to your question: if only we had 1/1000th of the resources that the Microsoft Word team has to work with, that'd be great!

Since this seems to be a point of contention after almost 20 years, perhaps it would help the healing if I say a few words about what we were thinking in introducing FLEx with a more constrained model? In essence, we did not see how we could provide all the features of FLEx without having a baked-in model, and we needed those features to empower a much wider range of people in working on dictionaries and texts, collaboratively. We knew that this trade-off between flexibility and capability was one that would not suit many people. And that's OK; there is a reason we have MS Word, Adobe InDesign, LaTeX, etc.

Note, I'm not saying that we could not have built FLEx on top of SFM. We just did not see how to do it at the time, with the resources we had.

On the subject of conversion: One of the beauties of Toolbox is that you "bring your own model". While this is very powerful, it makes automatic conversion to some other model impossible, because the semantics of your model are not encoded in your database.  There is no way for the converter to know what maps to what. To help with this, we have built many tools. Consider SOLID, which allows you to add some semantics to your SFM model. SOLID can use that information both to do some structural checking for you and, if you want to convert, to interpret your SFM and move it to LIFT/FLEx.  I would guess that thousands of people have used various SIL tools to migrate, so it's not as if we abandoned SFM. If you've found that SOLID cannot be taught your model, and if our (free) conversion experts have also thrown up their hands, then that's testament to Toolbox's enduring flexibility, not our indifference to your needs.

Jokes about FLEx's ironic name (for which I take the blame) are always welcome :-)


ToolBox Support

Aug 7, 2020, 12:15:40 AM
to shoeboxtoolbox-fiel...@googlegroups.com
Thanks, John, for writing. We considered writing something of that nature about the Toolbox-FLEx difference, but I think you explained it better and certainly with more authority than we could have.

Karen
Toolbox Support
