The maximum number of groups and stores has been increased from 127 to 1000.
'Next process in chain' takes the name of an Export Process. When I'm chaining CC tables -- or more accurately, chaining Export Processes -- I also tick both of the check-box options that I included in the red box above:
Overwrite file without asking - because I usually run various tests, and the names of the intermediate files are only useful for debugging.
Automatically open exported file - so Toolbox can feed it to the next export process.
Thank you, Karen, for the comprehensive and concise answer about how you organize the development of a chain of CCT processes in Toolbox, using this approach instead of doing everything in one large CCT table. I have used this approach before, but the helpful new information is how you number/label the processes in order to keep track of what is happening.
In addition, what you describe gives me an idea of how I can organize the conversion process in a different way than I originally thought: in a more iterative and also reproducible way. I am thinking of splitting the original Toolbox database into different parts. I am not yet sure which criteria I want to use for the splitting; it could be according to difficulty, semantic domain(s), or the structural complexity of the entry. I can then treat issues in the parts separately and merge them together again at the end. The first thing I want to try is to create one database part where records have only one sense and a second part with records that have more than one sense. The issue with semantic domains is that they apply at the sense level and not at the lexical entry level.
To keep track of what is happening, I think I need to give each record of the original database an identification number. If I choose that record id well (probably going beyond simple numbering), it will help with keeping track of what is happening. Do you have an example or examples of CCTs you can share which assign a record id to each record, and maybe also the same id combined with the sense number to each sense? This would be applied to the original database, so that I can split it into a senses database, treat the senses separately, and later merge everything together again at the end.
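Something along the lines of the following is what I have in mind -- just an untested sketch, where the store names (recnum, sn), the group names and the added markers \id and \snid are placeholders I made up, and where every record is assumed to start with an \lx field and every sense with an MDF-style "\sn 1" line:

c untested sketch: number records and senses
begin > store(recnum) '0' endstore
        use(main)

group(main)
'\lx ' > incr(recnum) dup use(lxline)    c new record: bump the counter
'\sn ' > dup store(sn) use(snline)       c capture the sense number that follows \sn

group(lxline)
nl > nl '\id ' out(recnum) nl use(main)  c after the \lx line, add an \id field

group(snline)
nl > endstore out(sn) nl '\snid ' out(recnum) '.' out(sn) nl use(main)   c e.g. \snid 17.2

The \snid value combines the record id with the sense number, so the pieces could be matched up again after splitting the database and treating the senses separately.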
Another, parallel approach I am thinking of is separating "simple" and "difficult" records of the original database. Manual conversion of a smaller subset is also an option. A large number of records might not pose problems and could be converted automatically, and it might be easier to run an iteration/feedback loop over the original database a few times. An example is a case where, let's say, a \pl field has been used for different purposes: e.g. plural if the entry is a noun, imperative if it is a verb.
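For that \pl case I picture a small context-dependent CCT, roughly like the untested sketch below; the switch name (verb), the crude "\ps v" test and the target marker \imp are only placeholders, and it assumes \ps comes before \pl within a record:

begin > use(main)

group(main)
'\lx ' > dup clear(verb)                     c new record: forget the previous part of speech
'\ps v' > dup set(verb)                      c crude test: the \ps value starts with "v"
'\pl ' > if(verb) '\imp ' else dup endif     c in verb entries rename the marker, otherwise keep \pl

Whatever such a crude test misses would simply stay in the "difficult" subset for manual treatment.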
--Hannes
P.S. Here are some details about one of the Toolbox database files I want to work with:
MawTbxDB records size                                    "8066"

"Note: mawDBrecColl = mawDatabaseRecordCollection"
mawDBrecColl := MawTbxDB records
    reject: [:rec | rec keyFieldValue beginsWith: '%'].
mawDBrecColl := mawDBrecColl
    reject: [:rec | rec keyFieldValue beginsWith: '* USERS GUIDE'].
mawDBrecColl size                                        "8063"

dbRecordsWithSingleSense := mawDBrecColl
    select: [:record | record numberOfSenses = 1].
dbRecordsWithSingleSense size                            "4337"

dbRecordsWithNoSnField := mawDBrecColl
    select: [:record | record numberOfSenses = 0].
dbRecordsWithNoSnField size                              "3280"

dbRecordsWithSimpleEntry := mawDBrecColl
    select: [:record | record numberOfSenses <= 1].
dbRecordsWithSimpleEntry size                            "7617"

(((7617 / 8066) * 1000) rounded / 10) asFloat            "94.4 percent"
8066 - 7617                                              "449 entries are not simple entries"
So the split according to senses will be slightly different:
a) one database part will contain records with just one sense or no sense indication at all (which in effect also means a single sense): 7617 out of 8066 records
b) another database part will contain records with more than one sense: 449 out of 8066 records.
So the CCT needed would be one that picks out records with either one \sn field or none, and another one that picks out the rest (i.e. records with two or more senses, marked by \sn fields).
--Hannes
I thought, why not ask AI - and here it is. I hope you find it helpful:
Bart.
I don’t know of any published hard limit like “max N groups” for Toolbox/CC tables, and I can’t point you to an official number. What I have seen in practice is that CC tables can handle very large numbers of rules, but using 8,000–10,000 group(...) sections is unusual and is where you’re most likely to hit an internal implementation limit (or at least serious load/compile-time pain).
What I’d recommend for your semdom mapping use case:
1) Don’t model each mapping as a “group”A CC group is a state in the CC “state machine” (you jump between groups with use(...)). You normally keep groups to a small number (e.g., group(first), group(main), maybe group(copy) etc.). That’s exactly how CC tables are typically written. (downloads.languagetechnology.org)
For a semantic-domain ID-to-ID mapping, you almost always want:
1–3 groups total, and
8,000–10,000 mapping lines (rules) inside a single mapping group.
So: 10k rules is fine-ish; 10k groups is the risky part.
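A minimal skeleton of that shape -- untested, with an invented group name (map) and invented target codes -- assuming each \sd field carries exactly one code on its own line with no trailing spaces:

begin > use(map)

group(map)                          c the one mapping group, generated from the thesaurus DB
'\sd 1.2.3' nl > '\sd SD-0103' nl   c anchoring on the marker and the line end keeps it boundary-safe
'\sd 1.2'   nl > '\sd SD-0100' nl
'\sd 1.1'   nl > '\sd SD-0099' nl
c ...plus the other 8,000-10,000 generated lines...

If a field can contain several codes, the bucket and boundary patterns under points 2 and 3 below cover that case.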
2) If performance becomes an issue, use "dispatch groups" (few dozen, not thousands)
If you're mapping lots of short codes and CC starts to feel slow, you can still keep it readable by using a small "router":
group(main) decides which bucket based on the first character(s)
jump into group(g_1), group(g_2), … or group(g_01) … group(g_99)
each bucket contains only the mappings for that prefix
That gives you maybe 10–100 groups, not 10,000, but it cuts down the number of rules CC has to consider at each point.
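Schematically it could look like this untested sketch, with invented group names and codes; back(3) pushes the matched " 1." back onto the input so the bucket group can re-match the whole code, and fol(delim) is the boundary check from point 3 below. It assumes separators have already been normalized so every code is preceded by a space:

begin > store(delim) ' ' nl endstore   c characters allowed to follow a complete code
        use(main)

group(main)                            c the router: looks only at the leading " 1." / " 2."
' 1.' > ' 1.' back(3) use(g_1)
' 2.' > ' 2.' back(3) use(g_2)

group(g_1)                             c bucket: only the 1.* mappings
' 1.2.3' fol(delim) > ' SD-0103' use(main)
' 1.2'   fol(delim) > ' SD-0100' use(main)
' 1.'               > ' 1.' use(main)  c unmapped 1.* code: pass it through and return

group(g_2)                             c bucket: only the 2.* mappings
' 2.1' fol(delim) > ' SD-0201' use(main)
' 2.'             > ' 2.' use(main)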
3) Protect yourself against partial matches (the #1 semdom hazard)
With hierarchical IDs like 1.2 vs 1.2.3, a naive rule can accidentally rewrite inside a longer ID.
Two common CC-friendly strategies:
Normalize separators + enforce boundaries (e.g., ensure every code is surrounded by spaces, then match ' 1.2.3 ' not '1.2.3')
Sort rules by descending length of the source string before emitting them (longer IDs first), so 1.2.3.4 is handled before 1.2.3.
(You can generate the table in that order from your thesaurus DB.)
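Put together, a flat version of both strategies could look like this untested fragment (codes and target values are invented); fol(delim) tests the boundary character without consuming it, so two adjacent codes can share a single separator:

begin > store(delim) ' ;' nl endstore   c legal characters after a code: space, semicolon, newline

c mapping rules, emitted longest source string first
' 1.2.3.4' fol(delim) > ' SD-1234'
' 1.2.3'   fol(delim) > ' SD-0103'
' 1.2'     fol(delim) > ' SD-0100'

The leading space in each match is the other half of the boundary; that is what the normalization step has to guarantee.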
4) Unicode note (probably irrelevant for numeric IDs, but good hygiene)
If there's any chance your processing includes non-ASCII, put the Unicode mode header in, e.g. begin > utf8, so CC treats code points as characters. (downloads.languagetechnology.org)
5) When to switch to another language
Stay with CC tables if:
the mapping is essentially key → value substitutions
you can make boundary-safe matches
you value auditability/readability (CC is great for that)
Consider a scripting language (Python, etc.) if:
you need true lookup semantics (hash/dict) over huge files for speed
you have one-to-many mappings with complex context rules
you need robust tokenization across messy punctuation/markup and want unit tests
In short:
Generate a CCT with few groups (1–10, maybe 100 max), not 10k groups.
Put all mapping pairs as rules in one mapping group (or bucketed by prefix).
Generate rules boundary-safe and length-descending.
Before committing, do a stress test: run the stand-alone CC/CCWin on a representative export to measure load time and throughput (CCWin + debugger exists specifically for developing/testing CC tables). (groups.google.com)
If you want, paste:
an example of your current source \sd (or whatever marker) format (how values are separated), and
5–10 sample mappings (including tricky ones like 1.2 vs 1.2.3)
…and I’ll sketch a robust CC-table pattern for the “normalize → map → cleanup” pipeline that scales to 10k mapping rules without needing thousands of groups.