Unsupervised Tokenization for Unsupervised Grammar Learning Re: How are things going?

Anton Kolonin @ Gmail

May 3, 2022, 1:15:47 AM
to linasv...@gmail.com, lang-learn, link-grammar

Hi Linas,

> What are the "freedom" models? I guess I should read the cited ncbi article...!?

Well, yes, briefly speaking, it is like the number of possible ngram-to-ngram transitions observed along the text, moving in the + or - direction.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2655800/

"We define freedom at character transitions as the size of the set of characters that were observed to follow a substring of the corpus. To calculate freedom, we counted the children of the trie node representing the string (Figure 1). ... We analyzed text for specific statistical properties indicative of token boundaries. First, we calculated the forward- and backward-reading freedom of transitions following each substring of the text, lengths one through seven. We chose not to calculate the freedom substrings with a length of greater than seven because it was clear from previous work that the algorithm performed optimally using strings of length three through five, as is discussed later in the paper. We then used these values to assign additional statistical properties to characters, including increase and decrease in freedom. We assigned a “peak” value to each character transition, computed by adding the value of the preceding increase in freedom to the following decrease in freedom. We characterized token boundaries based on the sum of their forward- and backward-reading peak values."

I have got astonishingly good results (F1=0.96-0.99) on unsupervised tokenization of English and Russian on quite different corpora.

Now I am gracefully stuck doing the same for Chinese....  

Do you have some insights on Chinese tokenization and parsing from the "unsupervised" perspective?

Can Link Grammar be applied for Chinese at all?

Best regards,

-Anton


On 10/04/2022 02:09, Linas Vepstas wrote:

> In meantime, I have managed to get F1=0.96 on unsupervised tokenization learning ;-)

Bravo!
I assume that this is the key sentence:
> Improved the "freedom" models removing the low-frequency "tails" for each of the corpora

Yes, in my work, I have found that removing the low-frequency tails does improve things ... but that you have to be careful.  It's subtle. Remove too much, and things won't work. Also, there are several ways to slice; some work better than others.
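For instance, one simple way to slice (just an illustration of the general idea, not necessarily what either of us actually does) is to drop n-grams below a raw count threshold before computing the transition/freedom statistics:

from collections import Counter

def prune_low_frequency_tails(ngram_counts, min_count=5):
    # Drop n-grams whose raw count is below min_count, so that rare, noisy
    # continuations do not inflate the downstream freedom counts.
    return Counter({g: c for g, c in ngram_counts.items() if c >= min_count})

Whether you threshold on raw counts, on relative frequency, or separately per corpus is exactly the kind of slicing choice that matters.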

What are the "freedom" models? I guess I should read the cited ncbi article...!?

> https://github.com/aigents/pygents/blob/main/notebooks/nlp/TokenizerTest-Runs-100.ipynb

> involving Brown, Gutenberg Adult/Children and custom social media corpora.

> What I found is that MI is not working for tokenization, BTW.


Umm, I am not sure what you are saying, or what that means.

-- First, MI is not a magic formula; it is just a stepping stone, a tool from which more complex algos can be built.  For example, if one can somehow create/find a vector, then vector similarity is more accurate than MI. If one does not have a vector, then MI can be used to build a vector.
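To sketch that last step (a toy illustration of the general recipe, nothing more): compute pointwise MI between each item and its observed contexts, treat the row of MI values as the item's vector, and then compare items by cosine similarity.

import math
from collections import Counter, defaultdict

def pmi_vectors(pairs):
    # pairs: list of (item, context) co-occurrences.
    # Returns {item: {context: PMI}}; each item's row of PMI values is its vector.
    pair_counts = Counter(pairs)
    item_counts, ctx_counts = Counter(), Counter()
    for (item, ctx), n in pair_counts.items():
        item_counts[item] += n
        ctx_counts[ctx] += n
    total = sum(pair_counts.values())
    vectors = defaultdict(dict)
    for (item, ctx), n in pair_counts.items():
        vectors[item][ctx] = math.log(n * total / (item_counts[item] * ctx_counts[ctx]))
    return vectors

def cosine(u, v):
    # Cosine similarity between two sparse dict-of-floats vectors.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0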

-- I don't know how you used MI for tokenization. Was it letter-by-letter? I imagine that would work very badly. Is it by blocks of letters? For example, in French, if I give you the words partez, partons, entrez, entrons, soupconnez, soupconnes, parlez, parlons and then try all possible splittings of these words into two parts, then MI should be able to easily tell you that -ez and -ons are the suffixes. But if you try all possible splittings into three, four, or five parts, then MI will fail.
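Here is roughly what I mean for the two-part case (a toy sketch, one of several ways MI could be wired into this, and certainly not anyone's production code): count character bigrams over the word list and cut each word where the pointwise MI across the transition is lowest, since low MI across a boundary means the two halves combine freely.

import math
from collections import Counter

WORDS = ["partez", "partons", "entrez", "entrons",
         "soupconnez", "soupconnes", "parlez", "parlons"]

def split_by_min_pmi(words):
    # Count character bigrams over the word list, then split each word at
    # the transition whose pointwise MI is lowest.
    bigrams = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    left = Counter(b[0] for b in bigrams.elements())
    right = Counter(b[1] for b in bigrams.elements())
    total = sum(bigrams.values())

    def pmi(a, b):
        return math.log(bigrams[a + b] * total / (left[a] * right[b]))

    splits = {}
    for w in words:
        cut = min(range(1, len(w)), key=lambda i: pmi(w[i - 1], w[i]))
        splits[w] = (w[:cut], w[cut:])
    return splits

# e.g. split_by_min_pmi(WORDS)['partez'] -> ('part', 'ez') and
# ['parlons'] -> ('parl', 'ons'); on such a tiny list some of the other
# words still split in the wrong place -- the counts are simply too sparse,
# which is part of the point about MI being only a stepping stone.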

-- I imagine there are several ways of using MI to build vectors for segmentation, but this email is too long, so I won't start inventing new things.

In the long run, I am trying to build the infrastructure for coupled systems, where knowledge from high-level layers can be used to improve the workings of lower-level layers. For example, learned grammar can be used to improve segmentation (by telling you where segmentation failed).

The problem is, of course, that building infrastructure takes a long time. And then after it is built, running the experiments and figuring out how to make things work also takes a long time. It's a very slow process.

Linas
 
