Known limit for number of groups in a ConsistentChangesTable (CCT) in Toolbox?


H. Hirzel

Jan 5, 2026, 1:10:19 AM
to shoeboxtoolbox-fiel...@googlegroups.com
Hi

I plan to write (actually generate) a ConsistentChangesTable (CCT) from
a database; the generated table will then apply changes to another database.
It will do so using groups, and there will be around 8,000-10,000 of them.

It is about mapping semantic domain values back and forth between one
system and another (the Semdom.org system). The changes, I guess, will
depend on the data (i.e. a thesaurus database).

Is there a known limit to the number of groups a CCT can handle in
Toolbox? Or do I need to program this in another language? I like simple
CCT tables because they are very readable.

Regards

Hannes


Karen Buseman

Jan 5, 2026, 12:40:13 PM
to shoeboxtoolbox-fiel...@googlegroups.com
Hi, Hannes,

Yes, there is a limit. I'm not 100% sure which version of CC is in Toolbox, but I believe it's the 1000-group limit.
The ReadMe for version 1.8.5 of CC says
The maximum number of groups and stores has been increased from 127 to 1000.
I thought that same limit and increase was true of switches as well, but it wasn't in the ReadMe. Maybe that was done at a different time.

Note that in Toolbox, it is also possible to chain CC tables. The limit is on groups per table, so you can get your 10,000 groups that way if each table can focus on only one part of the data and ignore the rest, passing it on to the next table.

The chaining allows the output of one table to feed the next. This is usually done with Standard Format export processes, though I have occasionally made an MDF process the final process in the chain. I've chained 15 or 20 processes together using this approach. There's no practical limit that I'm aware of. I've set people up with at least 40 export processes (various of which are chained).
[screenshot: the Export Process setup dialog, showing the Next process in chain box and the two check-box options outlined in red]
The Next process in chain is the name of an Export Process. When I'm chaining CC tables -- or more accurately, chaining Export Processes -- I also check both of the check-box options shown in the red box above.
Overwrite file without asking - because I usually run various tests, and the names of the intermediate files are only useful for debugging.
Automatically open exported file - so Toolbox can feed it to the next export process
If you do CTRL+SHIFT+ALT+K when you run Toolbox, it will keep the intermediate files. I find this useful for debugging if something goes wrong. For a normal production run, Toolbox will delete all the intermediate files at the end, keeping just the output of the final process in the chain.

To keep my sanity, I usually number the export processes -- and have that as part of the name of the CC table as well. For example, export process 2 - combine will use 2 - combine.cct and generate the output file 2 - combine - temp.txt. Or whatever. Then the table and the output sort together in the directory.

If a chained process is given to a user, then I name the head of the chain something informative, but I put "xx" or "zz" in front of the names of the steps they don't need to worry about so they will sort to the end of the list of processes. And I tell them not to worry about any of those processes. If I have various chains doing different things, then, again, for my own sake I name them with an abbreviation that's helpful like "xxF-2-combine" (for the Full export chain) and "xxVE-2-reorder" (for the Vernacular-English export chain) etc.

It's even possible to skip steps in the chain or go to a different "branch" of a chain. I've only done that once and will have to look up how that's done, if you want to do it. But the time I did it, it was a way to avoid duplicating a bunch of export processes.

Karen
Toolbox Support



H. Hirzel

Jan 6, 2026, 3:07:47 AM
to shoeboxtoolbox-fiel...@googlegroups.com

Thank you, Karen, for the comprehensive and concise answer about how you organize the development of a chain of CCT processes in Toolbox, using that approach instead of doing everything in one large CCT table. I have used this approach before, but the helpful new information is how you number/label the processes in order to keep track of what is happening.

In addition, what you describe gives me an idea for organizing the conversion process in a different way than I originally thought -- more iterative and also more reproducible. I am thinking of splitting the original Toolbox database into different parts. I am not yet sure which criteria I want to use for the splitting; it could be according to difficulty and/or semantic domain(s) or the structural complexity of the entry. I can then treat issues in the parts separately and merge them together again at the end. The first thing I want to try is to create one database part where records have only one sense and a second database part with records that have more than one sense. The issue with semantic domains is that they apply at the sense level and not at the lexical entry level.

For keeping track of what is happening, I think I need to give each record of the original database an identification number. If I choose that record id well -- probably something beyond simple numbering -- it will help me keep track of what is happening. Do you have an example or examples of CCTs you can share which assign a record id to each record, and maybe also the same id combined with the sense number to each sense? This would be applied to the original database, so that I can split it into a senses database, treat the senses separately, and later merge everything together again at the end.

Another, parallel approach I am thinking of is separating "simple" and "difficult" records of the original database. Manual conversion of a smaller subset is also an option. A large number of records might not pose problems and may be converted automatically, and it might be easier to have an iteration/feedback loop over the original database a few times. An example is a case where, let's say, a \pl field has been used for different purposes -- e.g. plural if the entry is a noun, imperative if it is a verb.

--Hannes

H. Hirzel

Jan 6, 2026, 7:46:38 AM
to shoeboxtoolbox-fiel...@googlegroups.com

P.S. Here are some details about one of the Toolbox database files I want to work with:


MawTbxDB records size
    8066

"Note: mawDBrecColl = mawDatabaseRecordCollection"

mawDBrecColl := MawTbxDB records reject: [:rec | rec keyFieldValue beginsWith: '%'].
mawDBrecColl := mawDBrecColl reject: [:rec | rec keyFieldValue beginsWith: '* USERS GUIDE'].
mawDBrecColl size
    8063

dbRecordsWithSingleSense := mawDBrecColl select: [:record | record numberOfSenses = 1].
dbRecordsWithSingleSense size
    4337

dbRecordsWithNoSnField := mawDBrecColl select: [:record | record numberOfSenses = 0].
dbRecordsWithNoSnField size
    3280

dbRecordsWithSimpleEntry := mawDBrecColl select: [:record | record numberOfSenses <= 1].
dbRecordsWithSimpleEntry size
    7617

(((7617 / 8066) * 1000) rounded / 10) asFloat
    94.4 "percent"

8066 - 7617
    449 "entries that are not simple entries"

So the split according to senses will be slightly different:

a) one database part will be records with just one sense or no sense indication at all, thus in fact also only one sense: 7617 out of 8066 records

b) another database part with records with more than one sense: 449 records out of 8066 records.

So one CCT needed would be one which picks out records with either one or no \sn field, and another one would pick out the rest (i.e. records with two or more \sn fields).
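
A very rough, untested first sketch of the first of these (a companion table with if(Multi) instead of ifn(Multi) would keep the multi-sense records instead):

c sketch only: pass through records that have zero or one \sn field
c (assumes every record starts with \lx)
define(MaybeOut) > ifn(Multi) out(Rec) endif
                   store(Rec) endstore
group(Main)
'\lx ' > do(MaybeOut)
         clear(Multi) clear(SeenSn)
         store(Rec) dup
'\sn ' > if(SeenSn) set(Multi) endif
         set(SeenSn) dup
endfile > do(MaybeOut)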

--Hannes

Barthelémy

Jan 6, 2026, 9:04:03 AM
to Shoebox/Toolbox Field Linguist's Toolbox

I thought, why not ask AI -- and here it is. I hope you find it helpful:
Bart.

I don’t know of any published hard limit like “max N groups” for Toolbox/CC tables, and I can’t point you to an official number. What I have seen in practice is that CC tables can handle very large numbers of rules, but using 8,000–10,000 group(...) sections is unusual and is where you’re most likely to hit an internal implementation limit (or at least serious load/compile-time pain).

What I’d recommend for your semdom mapping use case:

1) Don’t model each mapping as a “group”

A CC group is a state in the CC “state machine” (you jump between groups with use(...)). You normally keep groups to a small number (e.g., group(first), group(main), maybe group(copy) etc.). That’s exactly how CC tables are typically written. (downloads.languagetechnology.org)

For a semantic-domain ID-to-ID mapping, you almost always want:

  • 1–3 groups total, and

  • 8,000–10,000 mapping lines (rules) inside a single mapping group.

So: 10k rules is fine-ish; 10k groups is the risky part.
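
To make that concrete, a generated table along those lines might look roughly like this -- a sketch only, with made-up domain codes standing in for the real pairs from the thesaurus database, and \sd assumed as the marker:

group(Main)
'\sd ' > dup use(Map)
c -------------
group(Map)
nl      > dup use(Main)
'1.2.3' > '8.3.7'
'1.2'   > '8.3'
'4.9.6' > '2.1.4'
c ...plus the rest of the 8,000-10,000 generated mapping lines

Anything in the field that has no rule (separators, unmapped codes) simply passes through unchanged.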

2) If performance becomes an issue, use “dispatch groups” (few dozen, not thousands)

If you’re mapping lots of short codes and CC starts to feel slow, you can still keep it readable by using a small “router”:

  • group(main) decides which bucket based on the first character(s)

  • jump into group(g_1), group(g_2), … or group(g_01) … group(g_99)

  • each bucket contains only the mappings for that prefix

That gives you maybe 10–100 groups, not 10,000, but it cuts down the number of rules CC has to consider at each point.
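
A minimal sketch of that router idea (made-up codes and group names; codes assumed to be separated by '; '; back(1) just pushes the peeked character back onto the input so the bucket group can match the whole code):

group(Main)
'\sd ' > dup use(Dispatch)
c -------------
group(Dispatch)
'1' > dup back(1) use(g_1)
'2' > dup back(1) use(g_2)
nl  > dup use(Main)
c -------------
group(g_1)
'1.2.3' > '8.3.7' use(Dispatch)
'1.2'   > '8.3'   use(Dispatch)
'; '    > dup use(Dispatch)
nl      > dup use(Main)
c -------------
group(g_2)
'2.4.1' > '3.1.6' use(Dispatch)
'; '    > dup use(Dispatch)
nl      > dup use(Main)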

3) Protect yourself against partial matches (the #1 semdom hazard)

With hierarchical IDs like 1.2 vs 1.2.3, a naive rule can accidentally rewrite inside a longer ID.

Two common CC-friendly strategies:

  • Normalize separators + enforce boundaries (e.g., ensure every code is surrounded by spaces, then match ' 1.2.3 ' not '1.2.3')

  • Sort rules by descending length of the source string before emitting them (longer IDs first), so 1.2.3.4 is handled before 1.2.3.

(You can generate the table in that order from your thesaurus DB.)
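
A sketch of the boundary-safe version (assuming a first pass has already put a single space on both sides of every code): because the spaces are part of each match, ' 1.2 ' can never rewrite the front of an unlisted longer code such as 1.2.8.

group(Map)
' 1.2.3 ' > ' 8.3.7 '
' 1.2 '   > ' 8.3 '
c a cleanup pass afterwards would restore the original separators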

4) Unicode note (probably irrelevant for numeric IDs, but good hygiene)

If there’s any chance your processing includes non-ASCII, put the Unicode mode header in, e.g. begin > utf8, so CC treats code points as characters. (downloads.languagetechnology.org)

5) When to switch to another language

Stay with CC tables if:

  • the mapping is essentially key → value substitutions

  • you can make boundary-safe matches

  • you value auditability/readability (CC is great for that)

Consider a scripting language (Python, etc.) if:

  • you need true lookup semantics (hash/dict) over huge files for speed

  • you have one-to-many mappings with complex context rules

  • you need robust tokenization across messy punctuation/markup and want unit tests

My concrete advice for your plan
  • Generate a CCT with few groups (1–10, maybe 100 max), not 10k groups.

  • Put all mapping pairs as rules in one mapping group (or bucketed by prefix).

  • Generate rules boundary-safe and length-descending.

  • Before committing, do a stress test: run the stand-alone CC/CCWin on a representative export to measure load time and throughput (CCWin + debugger exists specifically for developing/testing CC tables). (groups.google.com)

If you want, paste:

  1. an example of your current source \sd (or whatever marker) format (how values are separated), and

  2. 5–10 sample mappings (including tricky ones like 1.2 vs 1.2.3)

…and I’ll sketch a robust CC-table pattern for the “normalize → map → cleanup” pipeline that scales to 10k mapping rules without needing thousands of groups.

Karen Buseman

Jan 8, 2026, 2:13:47 PM
to Shoebox/Toolbox Field Linguist's Toolbox
Hi, Hannes,

Sorry to be slow getting back to you. Some days are wilder than others.

I had occasion in the past to split a database as you are requesting.
  • First, split entries with multiple sense numbers. 
  • Next split those with multiple parts of speech.
  • Third split entries with multiple glosses.
Hmmm. I thought I had four steps. It was all based on the hierarchy, anyway. Splitting at each level, beginning with the highest. (You don't need the hierarchy specified in Toolbox, just need to  know it.) I did replace the headword with a number, partly so I could check things. :) 

The tables were all fairly similar, mostly looking for different markers for the split. Here is one:
============================================
c table to split entries with multiple parts of speech
c assume all relevant info is at the top of the entry
c Modified to prevent improper split
c with sequential ps/pn sets (eg, \ps N, \ps V)

define(OutRecord) > out(StartOfRecord,PS,AfterPS)
                    store(PS,AfterPS) endstore
group(Main)
'\lx ' > do(OutRecord)
         store(StartOfRecord) dup
         set(FirstPS)
'\ps ' > ifn(FirstPS) do(OutRecord) endif
         clear(FirstPS)
         store(PS) dup
         use(PS)
endfile > do(OutRecord) dup
c -------------
group(PS)
nl '\ps ' > dup
nl '\pn ' > dup
nl '\' > store(AfterPS) dup
         use(Main)
=======================================================

I'll be glad to  discuss the details further. 

I haven't digested the message from Barthelémy yet. I'm surprised at such a detailed answer from AI.

Karen
Toolbox Support

Karen Buseman

Jan 8, 2026, 3:11:30 PM
to Shoebox/Toolbox Field Linguist's Toolbox
One quick comment as I read through the  AI response from Barthelémy:
It emphasizes the importance of your sorting the CC rules by length. This is totally unnecessary: CC sorts the rules by length itself.
In fact, it is difficult (and pointless) to try to have CC do otherwise. 
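
For example, with these two rules in either order, CC still maps 1.2.3 to 8.3.7, never to 8.3.3:

group(Main)
'1.2'   > '8.3'
'1.2.3' > '8.3.7'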

Also, the article seems fixated on having just a single large CC table for the whole process. I find that a very large CC table is usually an unnecessary complication. There are exceptions, and the SD conversion back and forth sounds like it will involve a large number of changes ("rules" in the article) at some point, but many groups and switches are not usually called for.

I'm still digesting the rest of the suggestions.

Karen
Toolbox Support

H. Hirzel

Jan 8, 2026, 3:14:46 PM
to shoeboxtoolbox-fiel...@googlegroups.com, Barthelémy

Hi Barthelémy

Thank you for your idea of asking AI and posting the answer. Very useful indeed: the suggestion is to organize the task differently -- a few groups with many rules. Combined with the answer from Karen Buseman -- chain CCT processes which work on partial databases that are later combined again -- that makes a lot of sense.

Also the conclusion:

Stay with CC tables if:

  • the mapping is essentially key → value substitutions

  • you can make boundary-safe matches

  • you value auditability/readability (CC is great for that)

The FieldWorks Language Explorer software has, in the 'Bulk edit' area, the feature to include CCT processes, which is an additional reason to go for CCTs -- for input and/or output.

I need some time to digest the information and do some experiments on a sub-database. In the course of this process I am happy to develop a Toolbox example with typical cases and edge cases and post it to this list. Maybe more than once.

--Hannes


H. Hirzel

Jan 8, 2026, 3:55:54 PM
to shoeboxtoolbox-fiel...@googlegroups.com, Karen Buseman, H. Hirzel

Hi Karen

On 08/01/2026 8:13 pm, Karen Buseman wrote:
Hi, Hannes,

Sorry to be slow getting back to you. Some days are wilder than others.

No, you are not slow at all -- on the contrary. I need to digest the information.

As I wrote in the postscript note about the Maw database, 94% of the entries have only a single sense -- one \sn (sense number) field or none. So that would be a database part to start with. I guess I need to have a closer look at these 94%.


I had occasion in the past to split a database as you are requesting.
  • First, split entries with multiple sense numbers.  
So this is the step the title of this mail is about. The derived database "DB2" will only have entries with one sense?

  • Next split those with multiple parts of speech.
You mean you work on "DB2" to get a "DB3" where I only have one part of speech for each entry?

  • Third split entries with multiple glosses.

"DB3" --> "DB4": DB4 has only one gloss per entry. Right?

Hmmm. I thought I had four steps.
The fact that there are entries with definition fields and/or gloss fields -- not necessarily for all records -- might complicate the situation. Maybe the fourth step was about this?

It was all based on the hierarchy, anyway. Splitting at each level, beginning with the highest. (You don't need the hierarchy specified in Toolbox, just need to  know it.) I did replace the headword with a number, partly so I could check things. :)

So you had something like

\lx fie
\ps N
\ge house

replaced by

\lx 0001
\lxc fie
\ps N
\ge house

?

I will have a look at your CCT proposal tomorrow and see if I understand the general approach for constructing this type of CCTs.

I have the impression that crucial parts are missing -

define(OutRecord) > out(StartOfRecord,PS,AfterPS)
                    store(PS,AfterPS) endstore

The part where the stores are filled.

My general question regarding all this splitting into individual senses: How do you make sure that you can combine everything together later?

There is also the question about how to deal with the issue of multiple senses vs. homonyms.

Hannes

Karen Buseman

Jan 8, 2026, 5:02:43 PM
to shoeboxtoolbox-fiel...@googlegroups.com, H. Hirzel
Hi, Hannes,

I used poor terminology. I didn't mean to actually split the database into multiple files. I mean to split the entries, or perhaps the better term is to duplicate parts of some of the entries. So the first pass searches for everything with multiple senses. Then it divides those entries, leaving everything with only one sense alone. If you want the information about sense numbers for later, you can change the marker. I tend to do things like \snO for "sense number Original". Or, depending on how things are going, you can leave the \sn as is, if nothing will be looking at it any more.
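
Something like the following is the general shape of that first pass, modeled on the \ps table from my earlier message -- a sketch only; the markers and details would need adjusting to the real data:

c sketch: split entries with multiple senses, one output entry per sense
c the original sense number is kept under \snO
define(OutRecord) > out(StartOfRecord,SN,AfterSN)
                    store(SN,AfterSN) endstore
group(Main)
'\lx ' > do(OutRecord)
         store(StartOfRecord) dup
         set(FirstSN)
'\sn ' > ifn(FirstSN) do(OutRecord) endif
         clear(FirstSN)
         store(SN) '\snO ' use(SN)
endfile > do(OutRecord) dup
c -------------
group(SN)
nl '\' > store(AfterSN) dup
         use(Main)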

The next pass watches for multiple \ps fields, again, duplicating the rest of the entry appropriately.

You asked about DB2 -> DB3, etc. Yes. Each step feeds the next, gradually changing things to entries with a single anything -- multiple entries having been generated from some of the entries, and other entries left alone. But it's still one database, not divided into multiple files.

You said something about when the stores are filled...it happens when a store is opened, as the data passes through the table.
group(Main)
'\lx ' > do(OutRecord)
         store(StartOfRecord) dup   c opening a store allows any following contents to "fall into" the store,
                                    c so this collects everything after the \lx, until it finds a \ps (below)
         set(FirstPS)
'\ps ' > ifn(FirstPS) do(OutRecord) endif
         clear(FirstPS)
         store(PS) dup              c at this point, the PS store is opened (NOTE: that closes the other store)
         use(PS)                    c CC then switches control to the group(PS)
group(PS)                           c this group collects the ps contents (the marker is already in the store)
nl '\ps ' > dup                     c apparently this is looking for parts of speech that have glosses,
nl '\pn ' > dup                     c since this lets others (especially \pn) be included
nl '\' > store(AfterPS) dup         c then, when a different marker is encountered, it begins
         use(Main)                  c collecting the contents of the record following the ps
                                    c and returns to the Main group, waiting for the next entry

This is one of the times that the CC Debugger is so useful -- you can see the data being collected into the stores.

I hope this clarifies instead of confusing things further. 

Karen
Toolbox Support

Karen Buseman

Jan 8, 2026, 5:23:19 PM
to shoeboxtoolbox-fiel...@googlegroups.com, H. Hirzel
Hi, Hannes,

You had two other questions:

     1) My general question regarding all this splitting into individual senses: How do you make sure that you can combine everything together later?

     2) There is also the question about how to deal with the issue of multiple senses vs. homonyms.

1) To recombine later, you have to leave a trail. This can be done by leaving information about which entry you divided. I usually leave the original lexeme (though I have often replaced it with a number). I have left original sub-entry information. (Ah, that was my other split!) And even the original sense numbers can be kept. Then, when trying to recombine, I have made pseudo headwords from the original lexeme plus the original subentry plus.... whatever is relevant, and in the order I want things combined. Then gradually clean up the bits, having multiple tables that gradually work back through the combining, basically reversing the duplication process and reducing the pseudo headword until it is just the original lexeme.
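
For instance (using the \snO idea from earlier, plus a made-up \lxO marker for the original lexeme), one split-off piece might look something like:

\lx fie-2
\lxO fie
\snO 2
\ps N
\ge house

Sorting on the pseudo headword then brings the pieces of an original entry back together in the right order, and later tables strip the extra parts off again until only the original lexeme is left.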

2) Multiple senses and homonyms are constructed differently in the entry. You may have a homonym number. It has a different tag from a sense number. But, yes, it does take some "bookkeeping". Anything you want to rejoin, you need to have kept information as to where it came from. I've had the pseudo headwords I mentioned in #1 with as many as four elements in a single "word", allowing me to sort the pieces back together. Again, just as the dividing was multiple steps, so is the recombining -- in the opposite order.

We can get really specific on this if you want. Send me your data to Toolbox @ sil.org (without the spaces, the group abbreviates anything it recognizes as an email address) -- we can even get online together and discuss details.

Your questions are welcome! Keep them coming.

Karen
Toolbox Support
