Question Regarding Segmentation

Anthony Demetri Kostacos

Dec 8, 2023, 2:41:26 PM
to unim...@googlegroups.com
Hello,

My name is Anthony Kostacos, and I'm a member of Dr. Barbara Landau's linguistics and cognitive science lab. I'm working with the Hopkins Human Language Technology Center of Excellence on a project in which we intend to use UniMorph to analyze sentences in bitext corpora. Specifically, we intend to use the segmentation described in the UniMorph 4.0 paper. Is it available for use? If so, how can we use it? Thank you for your time.

Sincerely,

Anthony

Kat Vylomova

Dec 8, 2023, 6:35:27 PM
to Anthony Demetri Kostacos, unim...@googlegroups.com, Khuyagbaatar Batsuren
Dear Anthony, 

Segmentations should be provided for the languages listed in Table 7 of the UniMorph 4.0 paper (e.g. https://github.com/unimorph/fin), although data quality may vary. Huygaa (CC'ed) is leading this part and might be able to help if you have any questions.
In 2022, we organised a shared task on segmentation; you may want to look at the results (word- and sentence-level): https://aclanthology.org/2022.sigmorphon-1.11/

Hope it helps!

Warm regards,
Kat

Peter Viechnicki

Jan 14, 2024, 1:52:20 PM
to unimorph
Hi, I'm a colleague of Anthony's, and we appreciate your trying to help us.

We have unimorph 0.0.4 installed via pip.  When I list its functions, I don't see anything related to segmentation.

FUNCTIONS
    analyze_word(word: str, *, lang: str)
   
    download_unimorph(lang: str)
   
    get_list_of_datasets() -> List[str]
   
    inflect_word(word: str, *, lang: str, features=None)
   
    is_empty(dir: pathlib.Path) -> bool
   
    load_dataset(lang: str, specific_file=None)
   
    main() -> None
   
    not_loaded(lang: str) -> bool
   
    parse_args()

Can you help us find the 'Segment' function which is described in the paper you reference?  We're eager to use this for the 16 languages provided.  Do we need a different version of unimorph?

Thanks much,
-Peter Viechnicki
JHU

Khuyagbaatar Batsuren

Jan 14, 2024, 3:03:54 PM
to Peter Viechnicki, unimorph
Hi Peter,

You are right. At the moment, the UniMorph library tool doesn't have any segmentation function.

For Finnish, we have two types of segmentation from the UniMorph 4.0 update:
Huygaa

Peter Viechnicki

Jan 18, 2024, 10:18:29 AM
to unimorph
Thanks, Huygaa.  Are there any plans to release a new version of Unimorph which does include the segmentation capability?  I'm looking at paragraph 3.3 of the recent Unimorph paper (https://aclanthology.org/2022.lrec-1.89.pdf) which I interpreted to mean that the current release would support segmentation for 16 languages.

Thanks for helping us set expectations on our side.  If this capability is coming soon, we're eager to use it for a project related to spatial languages.

Sincerely,
-Peter Viechnicki, JHU

Khuyagbaatar Batsuren

Jan 18, 2024, 8:11:26 PM
to Peter Viechnicki, unimorph
Hi Peter,

Thank you so much for your interest. You can find the segmentation resources mentioned in Section 3.3 and Table 7 at:
www.github.com/ *language_name* / *language_name*.segmentations (for some languages these files are large, so they are zipped or split into parts)

You can also train a morpheme segmentation tool on these resources, using the participating systems from the "SIGMORPHON Shared Task 2022 on Morpheme Segmentation", and use it to segment words that are not in the resource into morphemes. https://aclanthology.org/2022.sigmorphon-1.11/

Best regards,
Huygaa

Peter Viechnicki

Jan 26, 2024, 12:14:47 PM
to unimorph
Thanks, Huygaa.  That's very helpful.

We will try to use those segmentation files to accomplish our goal.

Am I correct that we could just use the segmentation files as reference tables without invoking unimorph? In other words, we would use them as a lookup where we take a form and search for it in the first column of the appropriate language, and then return the last column as the segmentation?  Or would this gloss over additional functionality of Unimorph?
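
As a rough illustration of that lookup idea, here is a minimal Python sketch. It assumes the segmentation files are tab-separated, with the word form in the first column and the segmentation in the last; that is only the layout described above and has not been checked against the actual files. The function names and the sample rows are hypothetical, and nothing here uses the unimorph package.

```python
import csv
from typing import Dict, Iterable, List


def build_segmentation_lookup(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each form (first column) to its segmentations (last column).

    Assumption: rows are tab-separated. A form may occur on several rows,
    so every distinct segmentation is kept rather than only the first.
    """
    lookup: Dict[str, List[str]] = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        form, segmentation = row[0], row[-1]
        entries = lookup.setdefault(form, [])
        if segmentation not in entries:
            entries.append(segmentation)
    return lookup


def load_segmentation_file(path: str) -> Dict[str, List[str]]:
    """Read a segmentation file from disk (path supplied by the caller)."""
    with open(path, encoding="utf-8", newline="") as handle:
        return build_segmentation_lookup(handle)


# Made-up rows for illustration only; real files may differ in layout.
SAMPLE_ROWS = [
    "cats\tcat\tN;PL\tcat|s\n",
    "\n",
    "walked\twalk\tV;PST\twalk|ed\n",
]

table = build_segmentation_lookup(SAMPLE_ROWS)
print(table.get("cats", []))   # segmentations recorded for a known form
print(table.get("dogs", []))   # empty list for a form not in the table
```

A plain dictionary like this only covers forms that appear in the file; words outside the resource would still need a trained segmenter.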

We very much appreciate your help using this tool.

Thanks,
-Peter