GMX counts...


jim

Feb 15, 2021, 8:12:04 PM
to Group: okapi-devel
Trying to follow the GMX spec makes our word counting much more
complicated than it needs to be.

Adding this as a topic for Thursday's meeting: remove GMX support?

Jim

Yves Savourel

Feb 15, 2021, 11:57:20 PM
to okapi...@googlegroups.com
Hi Jim,

Not an easy question.
Argos doesn't specifically use GMX, so from that viewpoint I wouldn't be against it, if it makes things a lot simpler.
But it's an effort toward standards and it would be a step back to drop support. I also believe some users do use it, so it would impact them.
I'll put the item on the agenda (and people should voice their thoughts here if they have a strong opinion on this).

-ys

jim

Feb 16, 2021, 10:50:26 AM
to okapi...@googlegroups.com, Yves Savourel
Thank you, Yves. I don't have any strong opinions (other than a desire to simplify the code base). But I have noticed that GMX goes against de facto word-counting standards in a few cases. For example:

  1. Numbers are counted as words.
  2. Words with apostrophes are broken up and counted as multiple words (French).
  3. Hyphenated words are counted as a single word.
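
To make the three cases concrete, here is a minimal Python sketch. It is a toy illustration only, not the GMX-V algorithm and not Okapi's tokenizer: the function names, the regex, and the "alternative" rules (which simply take the opposite choice in each of the three cases) are all assumptions made for the example.

```python
import re

# Hypothetical toy tokenizer: runs of letters/digits, optionally joined
# by apostrophes or hyphens. Illustration only.
TOKEN = re.compile(r"[^\W_]+(?:['’\-][^\W_]+)*")


def count_gmx_like(text):
    """Count words using the three behaviours listed above."""
    count = 0
    for tok in TOKEN.findall(text):
        # Apostrophes split a token into several words; hyphens do not.
        # Numbers are counted like any other token.
        count += len(re.split(r"['’]", tok))
    return count


def count_alternative(text):
    """Take the opposite choice in each case (an assumed contrast)."""
    count = 0
    for tok in TOKEN.findall(text):
        # Hyphens split a token; apostrophe forms stay whole.
        for part in tok.split("-"):
            # Purely numeric parts are not counted as words.
            if not part.isdigit():
                count += 1
    return count


sample = "L'état a 3 porte-avions"
print(count_gmx_like(sample))     # 5
print(count_alternative(sample))  # 4
```

With these toy rules the sample counts as 5 words the first way (L, état, a, 3, porte-avions) and 4 the second way (L'état, a, porte, avions), so even a five-token phrase can differ by one word.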

Jim

Chase Tingley

Feb 16, 2021, 5:34:34 PM
to okapi...@googlegroups.com, Yves Savourel
I agree that GMX isn't really in line with common practices, but just arbitrarily removing support for it makes me a little uneasy. 

jim

Feb 16, 2021, 7:00:21 PM
to okapi...@googlegroups.com, Chase Tingley, Yves Savourel
No worries. I've only got 6 GMX unit tests to fix after my tokenization refactor (which untangles lib-extra and the various tokenization "engines").

Jim

Stephen Holmes

Feb 17, 2021, 9:23:29 AM
to 'GitLab' via okapi-devel, Chase Tingley, Yves Savourel

I’d be super interested to learn what multilingual word counting “common practices” are.  If I may, I’d like to add a comment on GMX-V in general…

…unless there is another “open source” counting algorithm, I would be against seeing this removed. GMX-V, although it may have flaws, is really the only solution in play where one can stand over generated metrics from a consistency perspective. Is it (or was it) not an unspoken goal of the Okapi suite to be OAXAL compliant (http://docs.oasis-open.org/oaxal/V1.0/cd02/oaxal-v1.0-cd02.html)? If so, would GMX-V not fall into Okapi’s remit?

Word count arguments still persist in the industry, and it’s no wonder when performing an analysis of a basic Microsoft Word document in two versions of a very popular CAT tool yields different results, even over minor version increments. And comparing word counts across various other tools can show wild variances of over 15%! With GMX-V, it’s nice to be able to say that the manner in which our metrics are generated has an open algorithmic basis for inspection!

Just my 1 Euro’s worth!
Stephen
jim

Feb 18, 2021, 4:39:19 PM
to okapi...@googlegroups.com, Stephen Holmes, Chase Tingley, Yves Savourel
I wanted to post this to assure everyone that Okapi will continue to support the GMX spec, even in the cases where either we or other tools disagree with the GMX rules. Maybe this will prompt discussion and the creation of a dot release of the GMX spec, and encourage more vendors to use it.

The coming changes to the tokenizer should give us more accurate counts, across more languages, with new token types (emojis).

Sorry if this caused anyone alarm :-) We always check with the community when we have questions on features and really appreciate the feedback!

Jim Hargrave