POSIX in morphological filters

AT

Oct 19, 2020, 8:02:24 AM
to Unitex-GramLab
Dear all,

I have a question about the use of the POSIX standard in a morphological filter.

Context

This question arose during a simple pattern-matching query in Unitex: find all verbs in lowercase. It seems to me that only the following regex can be used: <V><<^[a-zà-ÿ]+$>>_f_ (find all verbs, apply a morphological filter for lowercase, and force case sensitivity). However, I think the following regex is a better option: <V><<^[[:lower:]]+$>>_f_.
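
A note on why the explicit range is fragile: in Unicode, à is U+00E0 and ÿ is U+00FF, so the range à-ÿ also covers the division sign ÷ (U+00F7) and leaves out lowercase letters encoded elsewhere, such as œ (U+0153). The minimal Python sketch below only illustrates that point. Python's re module has no [[:lower:]] class, so str.islower() stands in for it here as an approximation; the helper names are made up for this example, and none of this is Unitex or TRE code.

```python
import re

# The explicit range from the filter above, written as a Python regex.
RANGE_FILTER = re.compile(r"[a-zà-ÿ]+")

def matches_range(word: str) -> bool:
    """True if the whole word consists of characters in a-z or à-ÿ."""
    return RANGE_FILTER.fullmatch(word) is not None

def is_lowercase(word: str) -> bool:
    """Rough stand-in for [[:lower:]]+ : every character is a lowercase letter."""
    return len(word) > 0 and all(ch.islower() for ch in word)

# Print, for each word: the word, the range result, the lowercase-class result.
for word in ["été", "÷", "œuvre"]:
    print(word, matches_range(word), is_lowercase(word))
```

The two checks agree on "été" but disagree on "÷" (accepted by the range, not a lowercase letter) and on "œuvre" (lowercase, but œ lies outside the range).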

Now, it seems that this syntax is not recognized by Unitex (at least with my installation). Yet the manual specifies that the POSIX standard is used (p. 86 in the v3.2 manual for French). Moreover, on the GitHub page of the TRE library (https://github.com/laurikari/tre/) used by Unitex, I read that the POSIX classes should be recognized ("Note that [^[:class:]] works already"). Consequently, since the POSIX character classes should be recognized by the library that Unitex uses, I assume there must be an error somewhere.

So, my question is: Is this a bug (in Unitex/in my local installation)? Or is this to be expected?

Steps to reproduce

OS = macOS Catalina (10.15.7)
Unitex version = 3.1 / 3.2
Language = French
Text = 80jours.txt (default)
Processing = 80jours.snt (default)
Locate pattern > Regular expression: <<^[[:lower:]]+$>>
Morphological filter ‘^[[:lLoOwWeErR:]]+$’ : Syntax error : Unknown character class name
Cannot compile filter(s)

Locate pattern > Regular expression: <<^[[:lower:]]+$>>_f_
Morphological filter ‘^[[:lower:]]+$’ : Syntax error : Unknown character class name
Cannot compile filter(s)


I thank you in advance for your reply and help.

KR,
Anaïs Tack

Cristian Martinez

Oct 19, 2020, 1:57:03 PM
to Unitex-GramLab
Hi Anaïs,

Note that by default, morphological filters are case insensitive. Thus, the filter <<^[[:lower:]]+$>> is internally converted to <<^[[:lLoOwWeErR:]]+$>>, and that expression contains, as expected, an unknown character class name: lLoOwWeErR.
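
The effect can be mimicked with a minimal Python sketch. It assumes a naive strategy that lists both cases of every letter without treating class names specially; expand_case_insensitive is a hypothetical helper written for this illustration and is not taken from the Unitex sources.

```python
def expand_case_insensitive(pattern: str) -> str:
    """Naively make a regex case-insensitive by listing both cases of each letter.

    Hypothetical helper for illustration only; escapes and class names
    such as [:lower:] are deliberately not handled.
    """
    out = []
    depth = 0  # nesting level of '[' ... ']'
    for ch in pattern:
        if ch == "[":
            depth += 1
            out.append(ch)
        elif ch == "]" and depth > 0:
            depth -= 1
            out.append(ch)
        elif (ch.isalpha() and ch.lower() != ch.upper()
              and len(ch.lower()) == 1 and len(ch.upper()) == 1):
            pair = ch.lower() + ch.upper()
            # Inside brackets, add both cases in place; outside, wrap them in a set.
            out.append(pair if depth > 0 else "[" + pair + "]")
        else:
            out.append(ch)
    return "".join(out)

print(expand_case_insensitive("^[[:lower:]]+$"))  # prints ^[[:lLoOwWeErR:]]+$
```

Because the letters of the class name are expanded like any other letters, the regex engine is handed a class called lLoOwWeErR, which it does not know.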

To force the matcher to respect case, and thus to interpret the class name correctly, you must append _f_ to your filter, i.e.: <V><<^[[:lower:]]+$>>_f_

I just checked this expression (with versions 3.1, 3.2 and with the upcoming 4.0-alpha) and it works without any issue:

[Image: lowercaseverbs-regex.png]

If you are still having issues with the filter <V><<^[[:lower:]]+$>>_f_, I suggest creating a .ulp log file as described in the User's Manual, section 13, and sharing with us the last log file (unitex_log_0000000X.ulp) created after clicking the "Search" button. To produce a smaller log, please check that no morphological dictionaries are activated (under Info > Preferences, the Morphological-mode dictionaries list must be empty).

Finally, if you are later going to develop a graph to find all verbs in lowercase, you may want to use the following alternative:

[Image: lowercase-verbs.png]

I hope this will be helpful.

Best regards,

--
Cristian Martinez

Anaïs Tack

Oct 19, 2020, 2:42:07 PM
to Unitex-GramLab, Cristian Martinez
Dear Cristian,

Thank you for your prompt reply. As I expected, it should be possible to use [:lower:] in Unitex.

I followed the instructions to create a log and added the last log file in attachment. I guess I will have to dig further to understand what caused the error on my machine. Maybe something related to issues in macOS Catalina?

Thank you as well for your suggestion. I will definitely try the alternative in FSGraph mode. Very helpful.

Best,

Anaïs Tack
unitex_log_00000003.ulp

Cristian Martinez

Oct 19, 2020, 3:54:12 PM
to Unitex-GramLab
Anaïs,

I have not found major problems running the log.

I propose to do the following:

- Extract the executable: /App/platform/osx/UnitexToolLogger
- On a console type: ./UnitexToolLogger RunLog lowercase-verbs.ulp
- Finally type: echo $?

The RunLog exit code (echo'ed with $?) should be 0 if no errors were found.
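
The same check can be scripted. The following is a minimal Python sketch, assuming only that the tool reports success through a zero exit code as described above; the function names are made up for this example, and the self-check at the end runs the Python interpreter rather than UnitexToolLogger.

```python
import subprocess
import sys
from typing import Sequence

def run_and_get_exit_code(command: Sequence[str]) -> int:
    """Run a command, let its output pass through, and return its exit code."""
    completed = subprocess.run(list(command))
    return completed.returncode

def describe(exit_code: int) -> str:
    """Turn an exit code into a short human-readable verdict."""
    if exit_code == 0:
        return "no errors reported"
    return f"failed with exit code {exit_code}"

# Self-check with the Python interpreter itself, which is always available here.
code = run_and_get_exit_code([sys.executable, "-c", "import sys; sys.exit(0)"])
print(describe(code))
```

To apply it to the log, one would call run_and_get_exit_code(["./UnitexToolLogger", "RunLog", "lowercase-verbs.ulp"]) from the directory that contains the extracted executable.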

You will find the lowercase-verbs.ulp as an attachment.

Best,
lowercase-verbs.ulp

AT

Oct 20, 2020, 9:59:15 AM
to Unitex-GramLab
The program returned the following exit code: 
80 (Accessing a corrupted shared library)

I tried to do a clean install of Unitex, but the error remains.

FYI, I have the latest version of TRE (v0.8.0, installed via Homebrew), which works fine.
For example, I can match lowercase words with the agrep command (with UTF-8 encoding). 
$ agrep -c -w -e '[[:lower:]]+' /Users/anais/workspace/Unitex-GramLab/Unitex/French/Corpus/80jours.txt
1932

I guess that means there might be an issue with Unitex trying to access the compiled TRE library on Catalina? Would it be better if I opened an issue on GitHub?

Thank you for your help.

Best,
Anaïs Tack

Cristian Martinez

Oct 20, 2020, 11:42:01 AM
to Unitex-GramLab
Hi,

This seems to be directly related to the TRE shared library installed on your system.

If possible, I can suggest that you uninstall the Homebrew package, and compile Unitex using the shipped TRE library.

1. brew remove tre
2. unzip Unitex-GramLab-3.2-application.zip
3. cd App/platform/osx
4. rm UnitexToolLogger (this must be done to force the setup script below to recompile UnitexToolLogger)
5. cd App/install
6. sh setup (this will compile UnitexToolLogger)

If the compilation step succeeds, you will have a new UnitexToolLogger under App/platform/osx.

Then, if you want to check which TRE library is being used by the executable, you can type:

otool -L App/platform/osx/UnitexToolLogger
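
If the list is long, the TRE entry can be filtered out of it. Below is a minimal Python sketch that parses text in the layout otool -L prints (a first line naming the file, then one indented library path per line followed by version information in parentheses). The sample text is invented for illustration and is not output from this thread, and the function names are hypothetical. A statically linked library would not show up in this list at all.

```python
from typing import List

def linked_libraries(otool_output: str) -> List[str]:
    """Extract library paths from text in the layout printed by `otool -L`."""
    libs = []
    for line in otool_output.splitlines()[1:]:  # the first line names the inspected file
        line = line.strip()
        if not line:
            continue
        # Each entry looks like: /path/lib.dylib (compatibility version ..., current version ...)
        libs.append(line.split(" (", 1)[0])
    return libs

def find_tre(libs: List[str]) -> List[str]:
    """Keep only the paths whose file name starts with 'libtre'."""
    return [lib for lib in libs if lib.rsplit("/", 1)[-1].lower().startswith("libtre")]

# Made-up sample in otool's layout, NOT real output from the thread.
SAMPLE = """App/platform/osx/UnitexToolLogger:
\t/usr/local/opt/tre/lib/libtre.5.dylib (compatibility version 6.0.0, current version 6.0.0)
\t/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.100.1)
"""

print(find_tre(linked_libraries(SAMPLE)))
```

A path under a Homebrew prefix in the result would suggest the executable is picking up the system-wide TRE rather than the one shipped with Unitex.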

I hope this will be helpful in overcoming the issue.

Best,

--
C.

AT

Oct 21, 2020, 3:59:46 AM
to Unitex-GramLab
The application.zip does not have a Src/ folder, so I downloaded the source-distribution.zip instead. 

With sh setup, it was impossible to compile UnitexToolLogger:

g++ -o ../bin/UnitexToolLogger [...]
Undefined symbols for architecture x86_64:
  "_libintl_gettext", referenced from:
      _tre_regerror in libtre.a(regerror.o)
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [../bin/UnitexToolLogger] Error 1

So, I figured out the errors did not stem from the TRE installation, but from the fact that libintl was not properly linked during compilation.

When adding the library to the setup command

LDFLAGS=-lintl sh setup

the compilation completed successfully.

================================================================================
Info: Unitex/GramLab compilation completed successfully
All binaries were copied to platform/osx-x86_64
================================================================================
Unitex/GramLab installation has successfully been completed!
 [OK] Java bin directory     : /usr/local/Cellar/openjdk/14.0.1/libexec/openjdk.jdk/Contents/Home/bin
 [OK] Installation directory : /Users/anais/Unitex-GramLab-3.2
 [OK] Platform binaries      : /Users/anais/Unitex-GramLab-3.2/App/platform/osx-x86_64
 [OK] Unitex Workspace       : /Users/anais/workspace/Unitex-GramLab/Unitex
 [OK] GramLab Workspace      : /Users/anais/workspace/Unitex-GramLab/GramLab
================================================================================

Now, the <<^[[:lower:]]+$>>_f_ regex works perfectly.

OS = macOS Catalina (10.15.7)
Unitex version = 3.2

Language = French
Text = 80jours.txt (default)
Processing = 80jours.snt (default)
Locate pattern > Regular expression: <<^[[:lower:]]+$>>
Morphological filter ‘^[[:lLoOwWeErR:]]+$’ : Syntax error : Unknown character class name
Cannot compile filter(s)

Locate pattern > Regular expression: <<^[[:lower:]]+$>>_f_
53171 matches
53171 recognized units
(32.179% of the text is covered)

Thanks for your help!

Anaïs