regular expression search -- letters with diacritics

Aaron Broadwell

Sep 23, 2018, 10:42:22 AM
to FLEx list
Can anyone help me with the following regular expression problem?  I would like to be able to search my corpus for examples of vowels marked for accent in combination with certain suffixes.  E.g. if the corpus has these words:

holatáco
itimico
tòomaco
toomaco
tómanco
êyeco
eyema
eyéma

I would like the regular expression search to find words with the suffix -co and an accented vowel earlier in the word.  (That is, it should return holatáco, tòomaco, tómanco, êyeco.)

I have tried the regular expression [áàâéèêìíîòóô]\w*co\s, but alas this does not seem to work.  It returns any instance of [aeio], and not just the vowels with a diacritic.  It seems to make no difference whether the box Match diacritics is checked or not.

This suggests to me that there is something special I should be doing to search over a group of characters with diacritics, but I cannot figure out what it is.

Can anyone help?

Aaron Broadwell

Sep 23, 2018, 6:30:34 PM
to FLEx list
To answer my own question ( :-) )....

After reading various sources on regular expressions and some trial and error, I was able to figure out that \p{M} in the regular expression will find any diacritic.  M is the Unicode general category "Mark" (combining marks), so \p{M} means 'a character that is a combining mark', which in practice is a diacritic.

So the expression \p{M}\w*co\b  comes fairly close to what I want.  It finds a diacritic, some number of word-forming characters, co and a boundary.

[It is not exactly what I want, since this expression also finds the tilde over nasal vowels, but it is close enough.]
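For anyone who wants to check this idea outside FLEx, here is a minimal Python sketch. Python's built-in `re` module does not support `\p{M}`, so the sketch tests the Unicode category with `unicodedata` instead. The helper name `has_mark_before_co` is made up for illustration, and the explicit NFD decomposition is an assumption standing in for whatever FLEx does internally.

```python
import unicodedata

# The sample words from the first post, written with escapes so the
# accented vowels are unambiguously precomposed characters.
WORDS = ["holat\u00e1co", "itimico", "t\u00f2omaco", "toomaco",
         "t\u00f3manco", "\u00eayeco", "eyema", "ey\u00e9ma"]

def has_mark_before_co(word):
    """True if the word ends in 'co' and has a combining mark before it.

    Hypothetical helper: it mimics the intent of \\p{M}\\w*co\\b on
    decomposed text, not FLEx's actual search code.
    """
    decomposed = unicodedata.normalize("NFD", word)
    if not decomposed.endswith("co"):
        return False
    stem = decomposed[:-2]
    # Unicode general categories Mn, Mc and Me all start with "M".
    return any(unicodedata.category(ch).startswith("M") for ch in stem)

matches = [w for w in WORDS if has_mark_before_co(w)]
print(matches)
```

As with `\p{M}`, this also accepts a tilde: a word such as "tãco" would be returned, because the combining tilde is a mark like any other.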

I'll pass on my own progress on this problem, in the event that anyone else needs to search for diacritics in FLEx...

Jeff Heath

Sep 24, 2018, 5:39:02 AM
to FLEx list
It would be good to get a developer opinion, but I believe you have found a bug. I would think that:

[áàâéèêìíîòóô]\w*co\b

should give the desired result (i.e. not including the vowels with tilde, as your second solution does), as long as you have "Match diacritics" selected. But in my quick test with your data, I also see it still selecting words without a diacritic, like "itimico" and "toomaco".

I seem to recall that FLEx uses a fully decomposed representation of characters "under the hood", but that shouldn't nullify the fact that when I define a set of characters to match in square brackets, it should ONLY match those characters. Testing with the expression above clearly shows that it is not searching for ONLY the characters in the set.

Furthermore, if I filter for the following regular expression:

[áàâ]

The filtered results include the word "êyeco"! That word doesn't have any kind of "a" at all, so clearly there is something weird going on here.

Can one of the developers give us an explanation or report the bug?

Jeff Heath

Sep 24, 2018, 6:27:23 AM
to FLEx list
I just had a thought about my final regular expression, [áàâ]. If FLEx is in fact decomposing everything, then the expression in square brackets would actually be:

[a\u0301a\u0300a\u0302]

In other words, the set includes "a" 3 times, and the three combining diacritics, one of which is \u0302, the combining circumflex accent. Which is why it is finding "êyeco".
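The effect can be reproduced in Python as a rough illustration. The sketch assumes that both the pattern and the text are decomposed to NFD before matching, which is my guess at what FLEx does, not something confirmed here.

```python
import re
import unicodedata

# [áàâ] as typed: three precomposed characters inside the brackets.
composed_set = "[\u00e1\u00e0\u00e2]"

# Assumption: the search tool decomposes the whole pattern string to NFD.
decomposed_set = unicodedata.normalize("NFD", composed_set)

# List what now sits between the brackets.
members = decomposed_set[1:-1]
print([f"U+{ord(ch):04X}" for ch in members])

# êyeco, decomposed the same way: e + U+0302 + y + e + c + o.
word = unicodedata.normalize("NFD", "\u00eayeco")

# The decomposed set matches, because U+0302 alone is a member of it.
print(bool(re.search(decomposed_set, word)))

# The composed set against the composed word does not match.
print(bool(re.search(composed_set, "\u00eayeco")))
```

The set ends up holding six code points, of which only four are distinct: "a" and the three combining accents.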

So this leads me to a solution that actually works, although it is a bit longer:

(á|à|â|é|è|ê|ì|í|î|ò|ó|ô)\w*co\b

This regex searches for a number of sequences of characters, separated by vertical bars (ORs), not for characters in a set (which messes up if those characters get decomposed in the process).
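Here is a minimal Python sketch of why the alternation survives decomposition, again assuming NFD on both pattern and text. One difference to be aware of: Python's `\w` does not match combining marks, so the sketch uses `[\w\u0300-\u036f]*` where the FLEx expression has `\w*`. That substitution is mine, not part of the original regex.

```python
import re
import unicodedata

WORDS = ["holat\u00e1co", "itimico", "t\u00f2omaco", "toomaco",
         "t\u00f3manco", "\u00eayeco", "eyema", "ey\u00e9ma"]

# á à â é è ê ì í î ò ó ô as precomposed characters.
ACCENTED = ("\u00e1\u00e0\u00e2\u00e9\u00e8\u00ea"
            "\u00ec\u00ed\u00ee\u00f2\u00f3\u00f4")

def nfd(text):
    """Fully decompose a string (assumed to mirror FLEx's behaviour)."""
    return unicodedata.normalize("NFD", text)

# Each alternative becomes a two-character sequence: base vowel + mark.
alternation = "(" + "|".join(nfd(ch) for ch in ACCENTED) + ")"

# Allow letters and combining marks between the accented vowel and -co.
pattern = re.compile(alternation + r"[\w\u0300-\u036f]*co\b")

matches = [w for w in WORDS if pattern.search(nfd(w))]
print(matches)
```

Because each alternative is matched as a whole sequence, a bare "a" or a bare accent no longer satisfies it, and vowels with a tilde are not picked up either.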

So I think that probably explains FLEx's strange behavior, and gives a work-around, but it seems to me like needing to know that FLEx decomposes characters "under the hood" in order to write a regex that works is a bit unfortunate. Should FLEx use fully composed forms when trying to do searches, especially with regexes?

Robert Hedinger

Sep 24, 2018, 6:37:51 AM
to flex...@googlegroups.com
What about [\u00301\u00300\u00302]\w*co\s and any other codes for diacritics.
Robert

--
You are subscribed to the publicly accessible group "FLEx list".
Only members can post but anyone can view messages on the website.
To change your status, please write to flex_d...@sil.org.
You can join this group by going to http://groups.google.com/group/flex-list.
---
You received this message because you are subscribed to the Google Groups "FLEx list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flex-list+...@googlegroups.com.
To post to this group, send email to flex...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/flex-list/5ce64fbb-6c51-4152-b883-d8c31f0df788%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jeff Heath

Sep 24, 2018, 7:53:00 AM
to FLEx list
On Monday, September 24, 2018 at 11:37:51 AM UTC+1, Robert Hedinger wrote:
What about [\u00301\u00300\u00302]\w*co\s and any other codes for diacritics.
Robert

Yes, that would work well in this case. Note that Robert has an extra "0" in the codepoints; I believe they should be [\u0301\u0300\u0302]\w*co\b
Also note that I changed the \s at the end to \b (as Aaron did in his second attempt) - you really want to find all words that end this way (\b), even if they aren't followed by some sort of space character (\s). Using \s would not match these words in the form of a single Headword, for example.

But I would recommend remembering the sequences of characters solution and keeping it in your "toolbox", because it is easier to fine-tune. For example, if I only wanted to find the words that have "a" with those diacritics, a set of diacritics in square brackets would not do that on its own. You would have to return to the other solution: (á|à|â)\w*co\b
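The two approaches can be compared side by side in Python. This is only a sketch: it assumes the text is searched in NFD form, and it replaces `\w*` with `[\w\u0300-\u036f]*` because Python's `\w` does not match combining marks.

```python
import re
import unicodedata

WORDS = ["holat\u00e1co", "itimico", "t\u00f2omaco", "toomaco",
         "t\u00f3manco", "\u00eayeco", "eyema", "ey\u00e9ma"]

def nfd(text):
    """Fully decompose a string (assumed to mirror FLEx's behaviour)."""
    return unicodedata.normalize("NFD", text)

# Set of combining marks: acute, grave, circumflex on any base letter.
any_accent = re.compile(r"[\u0301\u0300\u0302][\w\u0300-\u036f]*co\b")

# Sequences: only "a" followed by one of those marks (á, à, â decomposed).
a_alternatives = "|".join(nfd(ch) for ch in "\u00e1\u00e0\u00e2")
only_a = re.compile("(" + a_alternatives + r")[\w\u0300-\u036f]*co\b")

any_hits = [w for w in WORDS if any_accent.search(nfd(w))]
a_hits = [w for w in WORDS if only_a.search(nfd(w))]
print(any_hits)
print(a_hits)
```

On the sample list, the set of marks returns all four accented -co words, while the sequence version returns only holatáco.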

Paul Nelson

Sep 24, 2018, 8:17:33 AM
to flex...@googlegroups.com
Would I be correct to assume that this is a regular expression issue and not unique to FieldWorks?

Paul

Jeff Heath

Sep 24, 2018, 12:18:49 PM
to FLEx list
>Would I be correct to assume that this is a regular expression issue and not unique to FieldWorks?

No, I think it's worse in FieldWorks because FieldWorks takes the extra step of converting text to its fully decomposed form. If I try the same regex:

\b\w*[áàâéèêìíîòóô]\w*co\b

on the original posted list of words in EditPad Pro 7, it correctly selects the right words. (Note that I added "\b\w*" to the beginning of the regex so that it selects the entire word - that shouldn't make a difference if the regex is just used for filtering, but it helps if you are searching for entire words...)

All of the letters in that set (between the square brackets) are composed characters, i.e. individual units, in fact these Unicode characters:

\u00E1\u00E0\u00E2\u00E9\u00E8\u00EA\u00EC\u00ED\u00EE\u00F2\u00F3\u00F4

When you put them between square brackets, you're asking the regex to "match any one character from the set in brackets" (the definition from FLEx help). As long as they are composed characters, there's no problem - the regex engine looks for one of the following: a-acute, a-grave, a-circumflex, e-acute, etc. and will try to match any one of those characters. That’s exactly what it does in EditPad Pro, in a Python script, in LibreOffice Writer (with Regular expressions and Diacritic-sensitive selected), etc.
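For reference, a short Python script along these lines shows the set behaving as expected when nothing is decomposed. The words and the set are written with escapes so that every accented vowel is guaranteed to be a single precomposed code point.

```python
import re

# Sample words with precomposed (NFC) accented vowels.
WORDS = ["holat\u00e1co", "itimico", "t\u00f2omaco", "toomaco",
         "t\u00f3manco", "\u00eayeco", "eyema", "ey\u00e9ma"]

# \b\w*[áàâéèêìíîòóô]\w*co\b with the set spelled out as code points.
pattern = re.compile(
    r"\b\w*[\u00e1\u00e0\u00e2\u00e9\u00e8\u00ea"
    r"\u00ec\u00ed\u00ee\u00f2\u00f3\u00f4]\w*co\b"
)

selected = []
for word in WORDS:
    m = pattern.search(word)
    if m:
        # group(0) is the whole word thanks to the leading \b\w*.
        selected.append(m.group(0))

print(selected)
```

Here each member of the set is one character, so the engine matches exactly the twelve accented vowels and nothing else.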

But I believe in the FLEx regex search/filtering, it first fully decomposes those characters, so the string between the square brackets gets changed to:

a\u0301a\u0300a\u0302e\u0301e\u0300e\u0302...

Now when the regex engine goes to "match any one character from the set in brackets", it says, OK, I want to match any one of these characters: an "a", an acute accent, another "a", a grave accent, another "a"...  In other words, it’s not matching one of the composed characters that the user entered in the search field, but any of the individual component parts of those characters when they are completely decomposed.

That’s why, in the example I gave, when you search for [áàâ], it still finds "êyeco" - because the circumflex is one of the "characters" in the set once you completely decompose the string [áàâ] into [a\u0301a\u0300a\u0302] (a set of 6 characters).

That's why I wondered in my earlier post "Should FLEx use fully composed forms when trying to do searches, especially with regexes?" If you use the fully decomposed forms, it certainly leads to some surprising results, at least as we've seen in sets that have composed characters.