srx-repository info

Aaron Madlon-Kay

Mar 10, 2021, 10:36:57 PM
to okapi-users
Hi all.

I just learned about the srx-repository:
https://bitbucket.org/okapiframework/srx-repository/src/master/

I recently implemented an SRX engine for Ruby
(https://github.com/amake/srx-ruby) and was looking for good "canned"
rules to offer as defaults.

I'm wondering:

- Is there any more information published about the repository
somewhere? Like intended use case, etc. I think such a document would
probably answer my remaining questions...

- What is the license on the SRX data itself? I see that some files
are noted as coming from LanguageTool and mention the LGPL; does LGPL
meaningfully apply to data files? What about the files that don't
mention a license? I don't see any overall license for the repo.

- I see that the SRX data is split into languageMap and languageRule
text files. Is there a way provided to build a fully formed SRX
document, or is that munging left to the user?

- Are the rules taken from LanguageTool kept up to date with upstream?
(I see a lot of things last modified in 2015 so I guess not.)

Thanks,
Aaron

Marc

Mar 11, 2021, 5:12:31 AM
to okapi-users
Interesting questions; I would like to know the answers too.

When we merged everything into one SRX file for Okapi, we assumed
that where no license can be found, the general Okapi license
applies, since the files are in the Okapi repo.

We also note in comments in the SRX file which parts of the rules are
under which license.

It would be interesting to know whether the rules are kept in sync with
upstream LanguageTool. Does anyone know?

Best,

Marc
--
Marc Mittag
MittagQI - Quality Informatics

jim

Mar 11, 2021, 11:19:44 AM
to Aaron Madlon-Kay, okapi-users
Hi Aaron

I created this repository as a place to store example SRX files and segmentation code, in order to build a test suite to compare
our production segmenter against. As you know, testing SRX rules in a consistent way is very difficult. My solution was to create
a JUnit test suite against known, difficult cases. It also pulls in third-party libraries for Chinese.
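
To illustrate the idea of testing against known, difficult cases, here is a minimal sketch in Python rather than JUnit. The `naive_segment` function and the three cases are hypothetical stand-ins invented for this example; they are not taken from the repository.

```python
import re

def naive_segment(text):
    """Placeholder segmenter (an assumption): break after . ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Hypothetical difficult cases: (input, expected segments).
CASES = [
    ("Hello world. How are you?", ["Hello world.", "How are you?"]),
    ("Dr. Smith arrived. He was late.", ["Dr. Smith arrived.", "He was late."]),
    ("It costs 3.50 dollars. Cheap!", ["It costs 3.50 dollars.", "Cheap!"]),
]

def run_cases(segment, cases):
    """Return (input, expected, actual) for every case the segmenter gets wrong."""
    failures = []
    for text, expected in cases:
        actual = segment(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

failures = run_cases(naive_segment, CASES)
for text, expected, actual in failures:
    print(f"FAIL {text!r}: expected {expected}, got {actual}")
```

Any segmenter with the same signature can be passed to `run_cases`, so the same table of cases can be used to compare a production segmenter against a set of SRX rules.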

answers below...


On 3/10/21 8:36 PM, Aaron Madlon-Kay wrote:
> Hi all.
>
> I just learned about the srx-repository:
> https://bitbucket.org/okapiframework/srx-repository/src/master/
>
> I recently implemented an SRX engine for Ruby
> (https://github.com/amake/srx-ruby) and was looking for good "canned"
> rules to offer as defaults.

Have you seen: https://rubygems.org/gems/pragmatic_segmenter

It's algorithmic, but I found it's a good way to test SRX rules.


> I'm wondering:
>
> - Is there any more information published about the repository
> somewhere? Like intended use case, etc. I think such a document would
> probably answer my remaining questions...

We only have the Readme.md, and it needs to be updated.



> - What is the license on the SRX data itself? I see that some files
> are noted as coming from LanguageTool and mention the LGPL; does LGPL
> meaningfully apply to data files? What about the files that don't
> mention a license? I don't see any overall license for the repo.

The default license would be Apache, like the main Okapi project, but the repository was created when we were still using LGPL.

Interesting question on data files and LGPL. I don't know the answer to that. One reason we pulled language-tool into another Okapi project
(okapi-linguistic-tools) was that some of the data files had questionable licenses and we didn't want there to be any doubts about the main Okapi
project.


> - I see that the SRX data is split into languageMap and languageRule
> text files. Is there a way provided to build a fully formed SRX
> document, or is that munging left to the user?

Yes, those rules are there as examples, as is the list of abbreviations. My best attempt to consolidate some of these is found in the Okapi file
okapi_default_icu4j.srx. That is the de facto standard that the unit tests test against.
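
For anyone who does want to assemble a complete document by hand, the following is a rough sketch of wrapping rule and map fragments in an SRX 2.0 skeleton. It assumes each fragment is a well-formed `<languagerule>` or `<languagemap>` element held in a string; the actual layout of the text files in the repository may differ. The `build_srx` helper, the header values, and the sample rules are all made up for illustration.

```python
import xml.etree.ElementTree as ET

SRX_NS = "http://www.lisa.org/srx20"  # SRX 2.0 namespace

def build_srx(language_rules, language_maps, cascade="no"):
    """Wrap <languagerule> and <languagemap> fragments in an SRX 2.0 skeleton.

    Hypothetical helper: fragments are assumed to be XML strings without
    a namespace declaration of their own.
    """
    root = ET.Element("srx", {"xmlns": SRX_NS, "version": "2.0"})
    # Header attribute values here are placeholders, not Okapi's settings.
    ET.SubElement(root, "header", {"segmentsubflows": "yes", "cascade": cascade})
    body = ET.SubElement(root, "body")
    rules_el = ET.SubElement(body, "languagerules")
    for fragment in language_rules:
        rules_el.append(ET.fromstring(fragment))
    maps_el = ET.SubElement(body, "maprules")
    for fragment in language_maps:
        maps_el.append(ET.fromstring(fragment))
    return ET.tostring(root, encoding="unicode")

# Illustrative fragments only.
RULES = [r"""<languagerule languagerulename="Default">
  <rule break="no"><beforebreak>\bDr\.</beforebreak><afterbreak>\s</afterbreak></rule>
  <rule break="yes"><beforebreak>[.?!]+</beforebreak><afterbreak>\s</afterbreak></rule>
</languagerule>"""]
MAPS = ['<languagemap languagepattern=".*" languagerulename="Default"/>']

srx_xml = build_srx(RULES, MAPS)
print(srx_xml)
```

The order of the `<languagemap>` entries matters in SRX, so a real merge script would need to preserve or deliberately choose that order.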

That file enables ICU4J segmentation by default, then applies SRX rules as exceptions (IMHO the only accurate way to segment is algorithmic + SRX).
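
The general shape of "algorithmic first, SRX as exceptions" can be sketched as follows. This is not how Okapi or ICU4J implements it: a simple regex stands in for the algorithmic breaker, and the two no-break rules are invented for the example, written as SRX-style (beforebreak, afterbreak) pairs.

```python
import re

# Stand-in for an algorithmic breaker (an assumption, not ICU4J):
# propose a break right after ., ! or ? when whitespace follows.
CANDIDATE = re.compile(r"[.!?]+(?=\s)")

# SRX-style break="no" exceptions as (beforebreak, afterbreak) pairs.
# beforebreak must match text ending at the break; afterbreak must match
# text starting at the break. Both rules are illustrative only.
NO_BREAK = [
    (re.compile(r"\b(?:Dr|Mr|Mrs)\.$"), re.compile(r"\s")),
    (re.compile(r"\be\.g\.$"), re.compile(r"\s")),
]

def segment(text):
    """Split at candidate breaks unless a no-break exception vetoes them."""
    segments, start = [], 0
    for match in CANDIDATE.finditer(text):
        pos = match.end()
        vetoed = any(
            before.search(text[:pos]) and after.match(text[pos:])
            for before, after in NO_BREAK
        )
        if not vetoed:
            segments.append(text[start:pos].strip())
            start = pos
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

print(segment("Dr. Smith arrived. He was late."))
print(segment("See e.g. the docs. Done."))
```

The exceptions only ever remove breaks proposed by the first stage, which keeps the rule list short compared with describing every break position in SRX alone.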


> - Are the rules taken from LanguageTool kept up to date with upstream?
> (I see a lot of things last modified in 2015 so I guess not.)

No. Those haven't been updated in a while. But any updated or new files would be appreciated.

> Thanks,
> Aaron


Aaron Madlon-Kay

Mar 11, 2021, 6:31:17 PM
to okapi-users
Hi Jim. Thanks very much for the info! That’s very helpful.

-Aaron