The maximum number of groups and stores has been increased from 127 to 1000.
'Next process in chain' takes the name of an Export Process. When I'm chaining CC tables -- or more accurately, chaining Export Processes -- I also tick both of the check-box options that I included in the red box above:
Overwrite file without asking - because I usually run various tests, and the names of the intermediate files are only useful for debugging.
Automatically open exported file - so Toolbox can feed it to the next export process.
Thank you, Karen, for the comprehensive and concise answer about how you organize the development of a chain of CCT processes in Toolbox, using this approach instead of doing everything in one large CCT table. I have used this approach before, but the helpful new information is how you number/label the processes in order to keep track of what is happening.
In addition, what you describe gives me an idea of how I can organize the conversion process in a different way than I originally thought: in a more iterative and also reproducible way. I am thinking of splitting the original Toolbox database into different parts. I am not yet sure which criteria I want to use for the splitting; it could be according to difficulty, semantic domain(s), or the structural complexity of the entry. I can then treat issues in the parts separately and merge them together again at the end. The first thing I want to try is to create one database part where records have only one sense and a second part with records that have more than one sense. The issue with semantic domains is that they apply at the sense level and not at the lexical entry level.
To keep track of what is happening, I think I need to give each record of the original database an identification number. If I choose that record id well (probably going beyond simple numbering), it will help with keeping track of what is happening. Do you have an example or examples of CCTs you can share which assign a record id to each record, and maybe also the same id combined with the sense number to each sense? This would be applied to the original database, so that I can split it into a senses database, treat the senses separately, and later merge everything together again at the end.
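Something along the lines of the following is what I have in mind -- just an untested sketch, where the store names (recnum, sn), the group names and the added markers \id and \snid are placeholders I made up, and where every record is assumed to start with an \lx field and every sense with an MDF-style "\sn 1" line:

c untested sketch: number records and senses
begin > store(recnum) '0' endstore
        use(main)

group(main)
'\lx ' > incr(recnum) dup use(lxline)    c new record: bump the counter
'\sn ' > dup store(sn) use(snline)       c capture the sense number that follows \sn

group(lxline)
nl > nl '\id ' out(recnum) nl use(main)  c after the \lx line, add an \id field

group(snline)
nl > endstore out(sn) nl '\snid ' out(recnum) '.' out(sn) nl use(main)   c e.g. \snid 17.2

The \snid value combines the record id with the sense number, so the pieces could be matched up again after splitting the database and treating the senses separately.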
Another, parallel approach I am thinking of is separating "simple" and "difficult" records of the original database. Manual conversion of a smaller subset is also an option. A large number of records might not pose problems and could be converted automatically, and it might be easier to run an iteration/feedback loop over the original database a few times. An example is a case where, let's say, a \pl field has been used for different purposes: e.g. plural if the entry is a noun, imperative if it is a verb.
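For that \pl case I picture a small context-dependent CCT, roughly like the untested sketch below; the switch name (verb), the crude "\ps v" test and the target marker \imp are only placeholders, and it assumes \ps comes before \pl within a record:

begin > use(main)

group(main)
'\lx ' > dup clear(verb)                     c new record: forget the previous part of speech
'\ps v' > dup set(verb)                      c crude test: the \ps value starts with "v"
'\pl ' > if(verb) '\imp ' else dup endif     c in verb entries rename the marker, otherwise keep \pl

Whatever such a crude test misses would simply stay in the "difficult" subset for manual treatment.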
--Hannes
P.S. Here are some details about one of the Toolbox database files I want to work with:
MawTbxDB records size                                    "8066"

"Note: mawDBrecColl = mawDatabaseRecordCollection"
mawDBrecColl := MawTbxDB records
    reject: [:rec | rec keyFieldValue beginsWith: '%'].
mawDBrecColl := mawDBrecColl
    reject: [:rec | rec keyFieldValue beginsWith: '* USERS GUIDE'].
mawDBrecColl size                                        "8063"

dbRecordsWithSingleSense := mawDBrecColl
    select: [:record | record numberOfSenses = 1].
dbRecordsWithSingleSense size                            "4337"

dbRecordsWithNoSnField := mawDBrecColl
    select: [:record | record numberOfSenses = 0].
dbRecordsWithNoSnField size                              "3280"

dbRecordsWithSimpleEntry := mawDBrecColl
    select: [:record | record numberOfSenses <= 1].
dbRecordsWithSimpleEntry size                            "7617"

(((7617 / 8066) * 1000) rounded / 10) asFloat            "94.4 percent"
8066 - 7617                                              "449 entries are not simple entries"
So the split according to senses will be slightly different:
a) one database part will contain records with just one sense or no sense indication at all (which in effect also means a single sense): 7617 out of 8066 records
b) another database part will contain records with more than one sense: 449 out of 8066 records.
So the CCT needed would be one that picks out records with either one \sn field or none, and another one that picks out the rest (i.e. records with two or more senses, marked by \sn fields).
--Hannes
I thought, why not ask AI - and here it is. I hope you find it helpful:
Bart.
I don’t know of any published hard limit like “max N groups” for Toolbox/CC tables, and I can’t point you to an official number. What I have seen in practice is that CC tables can handle very large numbers of rules, but using 8,000–10,000 group(...) sections is unusual and is where you’re most likely to hit an internal implementation limit (or at least serious load/compile-time pain).
What I’d recommend for your semdom mapping use case:
1) Don’t model each mapping as a “group”A CC group is a state in the CC “state machine” (you jump between groups with use(...)). You normally keep groups to a small number (e.g., group(first), group(main), maybe group(copy) etc.). That’s exactly how CC tables are typically written. (downloads.languagetechnology.org)
For a semantic-domain ID-to-ID mapping, you almost always want:
1–3 groups total, and
8,000–10,000 mapping lines (rules) inside a single mapping group.
So: 10k rules is fine-ish; 10k groups is the risky part.
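A minimal skeleton of that shape -- untested, with an invented group name (map) and invented target codes -- assuming each \sd field carries exactly one code on its own line with no trailing spaces:

begin > use(map)

group(map)                          c the one mapping group, generated from the thesaurus DB
'\sd 1.2.3' nl > '\sd SD-0103' nl   c anchoring on the marker and the line end keeps it boundary-safe
'\sd 1.2'   nl > '\sd SD-0100' nl
'\sd 1.1'   nl > '\sd SD-0099' nl
c ...plus the other 8,000-10,000 generated lines...

If a field can contain several codes, the bucket and boundary patterns under points 2 and 3 below cover that case.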
2) If performance becomes an issue, use "dispatch groups" (few dozen, not thousands)
If you're mapping lots of short codes and CC starts to feel slow, you can still keep it readable by using a small "router":
group(main) decides which bucket based on the first character(s)
jump into group(g_1), group(g_2), … or group(g_01) … group(g_99)
each bucket contains only the mappings for that prefix
That gives you maybe 10–100 groups, not 10,000, but it cuts down the number of rules CC has to consider at each point.
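Schematically it could look like this untested sketch, with invented group names and codes; back(3) pushes the matched " 1." back onto the input so the bucket group can re-match the whole code, and fol(delim) is the boundary check from point 3 below. It assumes separators have already been normalized so every code is preceded by a space:

begin > store(delim) ' ' nl endstore   c characters allowed to follow a complete code
        use(main)

group(main)                            c the router: looks only at the leading " 1." / " 2."
' 1.' > ' 1.' back(3) use(g_1)
' 2.' > ' 2.' back(3) use(g_2)

group(g_1)                             c bucket: only the 1.* mappings
' 1.2.3' fol(delim) > ' SD-0103' use(main)
' 1.2'   fol(delim) > ' SD-0100' use(main)
' 1.'               > ' 1.' use(main)  c unmapped 1.* code: pass it through and return

group(g_2)                             c bucket: only the 2.* mappings
' 2.1' fol(delim) > ' SD-0201' use(main)
' 2.'             > ' 2.' use(main)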
3) Protect yourself against partial matches (the #1 semdom hazard)
With hierarchical IDs like 1.2 vs 1.2.3, a naive rule can accidentally rewrite inside a longer ID.
Two common CC-friendly strategies:
Normalize separators + enforce boundaries (e.g., ensure every code is surrounded by spaces, then match ' 1.2.3 ' not '1.2.3')
Sort rules by descending length of the source string before emitting them (longer IDs first), so 1.2.3.4 is handled before 1.2.3.
(You can generate the table in that order from your thesaurus DB.)
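Put together, a flat version of both strategies could look like this untested fragment (codes and target values are invented); fol(delim) tests the boundary character without consuming it, so two adjacent codes can share a single separator:

begin > store(delim) ' ;' nl endstore   c legal characters after a code: space, semicolon, newline

c mapping rules, emitted longest source string first
' 1.2.3.4' fol(delim) > ' SD-1234'
' 1.2.3'   fol(delim) > ' SD-0103'
' 1.2'     fol(delim) > ' SD-0100'

The leading space in each match is the other half of the boundary; that is what the normalization step has to guarantee.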
4) Unicode note (probably irrelevant for numeric IDs, but good hygiene)
If there's any chance your processing includes non-ASCII, put the Unicode mode header in, e.g. begin > utf8, so CC treats code points as characters. (downloads.languagetechnology.org)
5) When to switch to another language
Stay with CC tables if:
the mapping is essentially key → value substitutions
you can make boundary-safe matches
you value auditability/readability (CC is great for that)
Consider a scripting language (Python, etc.) if:
you need true lookup semantics (hash/dict) over huge files for speed
you have one-to-many mappings with complex context rules
you need robust tokenization across messy punctuation/markup and want unit tests
In short:
Generate a CCT with few groups (1–10, maybe 100 max), not 10k groups.
Put all mapping pairs as rules in one mapping group (or bucketed by prefix).
Generate rules boundary-safe and length-descending.
Before committing, do a stress test: run the stand-alone CC/CCWin on a representative export to measure load time and throughput (CCWin + debugger exists specifically for developing/testing CC tables). (groups.google.com)
If you want, paste:
an example of your current source \sd (or whatever marker) format (how values are separated), and
5–10 sample mappings (including tricky ones like 1.2 vs 1.2.3)
…and I’ll sketch a robust CC-table pattern for the “normalize → map → cleanup” pipeline that scales to 10k mapping rules without needing thousands of groups.