Known limit for number of groups in a ConsistentChangesTable (CCT) in Toolbox?


H. Hirzel

Jan 5, 2026, 1:10:19 AM
to shoeboxtoolbox-fiel...@googlegroups.com
Hi

I plan to write (actually generate) a ConsistentChangesTable (CCT) from
a database, and the table will then apply changes to another database. It
will do so using groups; there will be around 8000...10000 groups.

It is about mapping semantic domain values back and forth between one
system and another (the Semdom.org system). The changes will, I expect,
depend on the data (i.e. a thesaurus database).

Is there a known limit to the number of groups a CCT can handle in
Toolbox? Or do I need to program this in another language? I like simple
CCT tables because they are very readable.

Regards

Hannes


Karen Buseman

Jan 5, 2026, 12:40:13 PM
to shoeboxtoolbox-fiel...@googlegroups.com
Hi, Hannes,

Yes, there is a limit. I'm not 100% sure which version of CC is in Toolbox, but I believe it has the 1000-group limit.
The ReadMe for version 1.8.5 of CC says:
"The maximum number of groups and stores has been increased from 127 to 1000."
I thought that same limit and increase applied to switches as well, but that wasn't in the ReadMe. Maybe that was done at a different time.

Note that in Toolbox it is also possible to chain CC tables. The limit is on groups per table, so you can get your 10,000 groups that way if you can have the different tables each focus on only one part of the data and ignore the rest, passing it on to the next table.
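For instance, here is a minimal, untested sketch of one focused table in such a chain; the \sem marker and the renaming are made up for illustration. CC copies any text that no rule matches straight to the output, so the rest of each record simply flows on to the next table in the chain.

c Sketch only: this step does one job (rename a marker) and
c passes everything else through unchanged.
begin > use(main)

group(main)
'\sem ' > '\sd '      c made-up example: normalize an old marker name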

The chaining allows the output of one table to feed the next. This is usually done with Standard Format export processes, though I have occasionally made an MDF process the final process in the chain. I've chained 15 or 20 processes together using this approach. There's no practical limit that I'm aware of. I've set people up with at least 40 export processes (various of which are chained).
[screenshot: Export Process settings, showing the "Next process in chain" field and the two check-box options marked with a red box]
The "Next process in chain" field takes the name of an Export Process. When I'm chaining CC tables -- or more accurately, chaining Export Processes -- I also check both of the check-box options that I included in the red box above.
Overwrite file without asking - because I usually run various tests, and the names of the intermediate files are only useful for debugging.
Automatically open exported file - so Toolbox can feed it to the next export process
If you do CTRL+SHIFT+ALT+K when you run Toolbox, it will keep the intermediate files. I find this useful for debugging if something goes wrong. For a normal production run, Toolbox will delete all the intermediate files at the end, keeping just the output of the final chain.

To keep my sanity, I usually number the export processes -- and have that number as part of the name of the CC table as well. For example, export process 2 - combine will use 2 - combine.cct and generate the output file 2 - combine - temp.txt. Or whatever. Then the table and the output sort together in the directory.

If a chained process is given to a user, then I name the head of the chain something informative, but I put "xx" or "zz" in front of the names of the steps they don't need to worry about so they will sort to the end of the list of processes. And I tell them not to worry about any of those processes. If I have various chains doing different things, then, again, for my own sake I name them with an abbreviation that's helpful like "xxF-2-combine" (for the Full export chain) and "xxVE-2-reorder" (for the Vernacular-English export chain) etc.

It's even possible to skip steps in the chain or go to a different "branch" of a chain. I've only done that once and will have to look up how it's done, if you want to do it. But the time I did it, it was a way to avoid duplicating a bunch of export processes.

Karen
Toolbox Support



H. Hirzel

Jan 6, 2026, 3:07:47 AM
to shoeboxtoolbox-fiel...@googlegroups.com

Thank you, Karen, for the comprehensive and concise answer about how you organize the development of a chain of CCT processes in Toolbox, using this approach instead of doing everything in one large CCT table. I have used this approach before, but the helpful new information is how you number/label the processes in order to keep track of what is happening.

In addition, what you describe gives me an idea of how to organize the conversion process in a different way than I originally thought: in a more iterative and also reproducible way. I am thinking of splitting the original Toolbox database into different parts. I am not yet sure which criteria I want to use for the splitting. It could be according to difficulty and/or semantic domain(s), or the structural complexity of the entry. I can then treat the issues in each part separately and merge the parts together again at the end. The first thing I want to try is to create one database part where records have only one sense and a second part with records that have more than one sense. The issue with semantic domains is that they apply at the sense level, not at the lexical entry level.

For keeping track of what is happening, I think I need to give each record of the original database an identification number. If I choose that record id well, probably going beyond simple numbering, it will help me keep track of what is happening. Do you have an example or examples of CCTs you can share which assign a record id to each record, and maybe also the same id combined with the sense number to each sense? This would be applied to the original database, so that I can split it into a senses database, treat the senses separately, and later merge everything back together at the end.
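Something along these lines is what I have in mind (an untested sketch; the \id and \snid marker names are invented, and it is assumed that each record starts with \lx and each sense with \sn, with the counters kept in stores via incr() and written out with out()):

c Untested sketch: insert a running record id after each \lx line
c and a "record id.sense number" line after each \sn line.
begin > store(recid) '0' endstore
        store(snid)  '0' endstore
        use(main)

group(main)
'\lx ' > incr(recid)
         store(snid) '0' endstore     c sense counter restarts with each record
         set(afterlx)
         '\lx '
'\sn ' > incr(snid)
         set(aftersn)
         '\sn '
nl > nl
     if(afterlx) clear(afterlx) '\id ' out(recid) nl endif
     if(aftersn) clear(aftersn) '\snid ' out(recid) '.' out(snid) nl endif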

Another, parallel approach I am thinking of is to separate "simple" and "difficult" records of the original database. Manual conversion of a smaller subset is also an option. A large number of records might not pose problems and could be converted automatically, and it might be easier to run an iteration/feedback loop over the original database a few times. An example of a difficult case is one where, let's say, a \pl field has been used for different purposes - e.g. plural if the entry is a noun, imperative if it is a verb.

--Hannes

H. Hirzel

Jan 6, 2026, 7:46:38 AM
to shoeboxtoolbox-fiel...@googlegroups.com

P.S. Here are some details about one of the Toolbox database files I want to work with:


MawTbxDB records size.                      "8066"

"Note: mawDBrecColl = mawDatabaseRecordCollection"
mawDBrecColl := MawTbxDB records reject: [:rec | rec keyFieldValue beginsWith: '%'].
mawDBrecColl := mawDBrecColl reject: [:rec | rec keyFieldValue beginsWith: '* USERS GUIDE'].
mawDBrecColl size.                          "8063"

dbRecordsWithSingleSense := mawDBrecColl select: [:record | record numberOfSenses = 1].
dbRecordsWithSingleSense size.              "4337"

dbRecordsWithNoSnField := mawDBrecColl select: [:record | record numberOfSenses = 0].
dbRecordsWithNoSnField size.                "3280"

dbRecordsWithSimpleEntry := mawDBrecColl select: [:record | record numberOfSenses <= 1].
dbRecordsWithSimpleEntry size.              "7617"

(((7617 / 8066) * 1000) rounded / 10) asFloat.   "94.4 percent"

8066 - 7617.                                "449 entries are not simple entries"

So the split according to senses will be slightly different:

a) one database part will be records with just one sense or no sense indication at all (which in effect is also just one sense): 7617 out of 8066 records

b) another database part will be records with more than one sense: 449 out of 8066 records.

So the CCTs needed would be one which picks out records with either one \sn field or none, and another one which picks out the rest (i.e. records with two or more \sn fields).
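A first, untested sketch of the second of those tables (the one that keeps only the multi-sense records) might look like this. It assumes \lx and \sn occur only as field markers, buffers each record in a store, and outputs it only if a second \sn was seen; the complementary table would output the record when the switch is NOT set.

c Untested sketch: keep only records with two or more \sn fields.
begin > use(main)

define(flush) > if(multi) out(rec) endif    c emit the buffered record only if it had 2+ senses

group(main)
'\lx ' > endstore do(flush)          c close and emit/drop the previous record
         clear(multi) clear(sawsn)
         store(rec) '\lx '           c start buffering the new record
'\sn ' > if(sawsn) set(multi) endif  c a second \sn makes this a multi-sense record
         set(sawsn)
         '\sn '
endfile > endstore do(flush)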

--Hannes

Barthelémy

Jan 6, 2026, 9:04:03 AM
to Shoebox/Toolbox Field Linguist's Toolbox

I thought, why not ask AI - and here it is. I hope you find it helpful:
Bart.

I don’t know of any published hard limit like “max N groups” for Toolbox/CC tables, and I can’t point you to an official number. What I have seen in practice is that CC tables can handle very large numbers of rules, but using 8,000–10,000 group(...) sections is unusual and is where you’re most likely to hit an internal implementation limit (or at least serious load/compile-time pain).

What I’d recommend for your semdom mapping use case:

1) Don’t model each mapping as a “group”

A CC group is a state in the CC “state machine” (you jump between groups with use(...)). You normally keep groups to a small number (e.g., group(first), group(main), maybe group(copy) etc.). That’s exactly how CC tables are typically written. (downloads.languagetechnology.org)

For a semantic-domain ID-to-ID mapping, you almost always want:

  • 1–3 groups total, and

  • 8,000–10,000 mapping lines (rules) inside a single mapping group.

So: 10k rules is fine-ish; 10k groups is the risky part.
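A sketch of that shape, with made-up domain codes and assuming the values to be mapped sit in an \sd field:

c Sketch only: one small control group plus one large generated mapping group.
begin > use(main)

group(main)
'\sd ' > '\sd ' use(map)       c map only inside \sd fields

group(map)
nl > nl use(main)              c end of the field: back to pass-through mode
'1.2.3' > 'SD-0456'            c ... thousands of generated lines like these ...
'1.2'   > 'SD-0450'

The generator only has to emit the lines of group(map); the two control rules stay fixed.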

2) If performance becomes an issue, use “dispatch groups” (few dozen, not thousands)

If you’re mapping lots of short codes and CC starts to feel slow, you can still keep it readable by using a small “router”:

  • group(main) decides which bucket based on the first character(s)

  • jump into group(g_1), group(g_2), … or group(g_01) … group(g_99)

  • each bucket contains only the mappings for that prefix

That gives you maybe 10–100 groups, not 10,000, but it cuts down the number of rules CC has to consider at each point.
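A sketch of such a router, assuming the domain ids begin with their top-level digit; fol() only peeks at the next character without consuming it, so the full id is still available to the bucket group (codes again made up):

c Sketch only: route on the first digit of the domain id.
begin > store(d1) '1' endstore
        store(d2) '2' endstore
        use(main)

group(main)
'\sd ' fol(d1) > '\sd ' use(g_1)
'\sd ' fol(d2) > '\sd ' use(g_2)
c ... one routing rule per top-level prefix ...

group(g_1)
nl > nl use(main)
'1.2.3' > 'SD-0456'    c only the mappings whose source starts with 1

group(g_2)
nl > nl use(main)
'2.1' > 'SD-0790'      c only the mappings whose source starts with 2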

3) Protect yourself against partial matches (the #1 semdom hazard)

With hierarchical IDs like 1.2 vs 1.2.3, a naive rule can accidentally rewrite inside a longer ID.

Two common CC-friendly strategies:

  • Normalize separators + enforce boundaries (e.g., ensure every code is surrounded by spaces, then match ' 1.2.3 ' not '1.2.3')

  • Sort rules by descending length of the source string before emitting them (longer IDs first), so 1.2.3.4 is handled before 1.2.3.

(You can generate the table in that order from your thesaurus DB.)
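For example, the generated rules could come out like this (codes made up), with every code wrapped in spaces and the longer source strings emitted first:

c boundary-safe, longest-first mapping rules
' 1.2.3.4 ' > ' SD-0461 '
' 1.2.3 '   > ' SD-0456 '
' 1.2 '     > ' SD-0450 '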

4) Unicode note (probably irrelevant for numeric IDs, but good hygiene)

If there’s any chance your processing includes non-ASCII, put the Unicode mode header in, e.g. begin > utf8, so CC treats code points as characters. (downloads.languagetechnology.org)

5) When to switch to another language

Stay with CC tables if:

  • the mapping is essentially key → value substitutions

  • you can make boundary-safe matches

  • you value auditability/readability (CC is great for that)

Consider a scripting language (Python, etc.) if:

  • you need true lookup semantics (hash/dict) over huge files for speed

  • you have one-to-many mappings with complex context rules

  • you need robust tokenization across messy punctuation/markup and want unit tests

My concrete advice for your plan
  • Generate a CCT with few groups (1–10, maybe 100 max), not 10k groups.

  • Put all mapping pairs as rules in one mapping group (or bucketed by prefix).

  • Generate rules boundary-safe and length-descending.

  • Before committing, do a stress test: run the stand-alone CC/CCWin on a representative export to measure load time and throughput (CCWin + debugger exists specifically for developing/testing CC tables). (groups.google.com)

If you want, paste:

  1. an example of your current source \sd (or whatever marker) format (how values are separated), and

  2. 5–10 sample mappings (including tricky ones like 1.2 vs 1.2.3)

…and I’ll sketch a robust CC-table pattern for the “normalize → map → cleanup” pipeline that scales to 10k mapping rules without needing thousands of groups.
