Hi All,
I’m a sophomore undergraduate at Beijing University of Posts and Telecommunications. I’m interested in NLP and would like to contribute to Unitex/GramLab through GSoC. To check that we are a good fit for working together, I’d like to introduce myself and share my initial ideas about the individual tasks.
A brief introduction: NLP is the field I’m most interested in, and I have already read most of Speech and Language Processing (2nd edition) in my spare time. I used to be a competitive programmer and have a good command of data structures and algorithms. I’ve been coding in C for 8 years and have written tens of thousands of lines of code in it. I also take part in open source activities and ranked in the top 20 in Google Code-in 2012. My GitHub account is hghwng, where you can see my contributions to darkhttpd, oh-my-zsh and other projects. I have also tried to start a FOSS project, called yet-another-graphics-engine.
For this year’s GSoC tasks, my more detailed ideas are listed below:
Build an annotation comparison module:
In my understanding, this task consists of the following stages:
Is that correct? Could you please provide a simple example explaining the “match” stage?
Support Locate Pattern function on treebanks:
Could you please explain the details of matching, such as support for mixing PoS tags and words, or for matching only part of a sentence? Is there any other need for fuzzy matching?
(S
  (NP-SBJ The/DT flight/NN )
  (VP should/MD
    (VP arrive/VB
      (PP-TMP at/IN
        (NP eleven/CD a.m/RB ))
      (NP-TMP tomorrow/NN ))))
Fig. 1: Example tree ("The flight should arrive at eleven a.m tomorrow")
<S> -DT-> The -NN-> flight -MD-> should -VB-> arrive -IN-> at -CD-> eleven -RB-> a.m -NN-> tomorrow -> <E>
Constituent arcs that bypass the word path above (source state -label-> target state):
<S> -NP-SBJ-> flight
flight -VP-> tomorrow
should -VP-> tomorrow
arrive -PP-TMP-> a.m
at -NP-> a.m
a.m -NP-TMP-> tomorrow
Fig. 2: Example automaton
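One way to read Fig. 1 and Fig. 2 together: every word contributes an arc labelled with its PoS tag, and every bracketed constituent contributes one extra arc from the state where it opens to the state where it closes. Below is a minimal C++ sketch of that conversion, for illustration only. The names Arc, TreeAutomaton and build_automaton are hypothetical and are not part of Unitex/GramLab. The sketch also emits an arc for the root S and leaves out the <E> end state, both of which differ from Fig. 2.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One transition: a PoS tag for word arcs, a phrase label for constituent arcs.
struct Arc {
    int from;
    int to;
    std::string label;
};

struct TreeAutomaton {
    std::vector<std::string> states;  // states[0] is "<S>", then one state per word
    std::vector<Arc> arcs;
};

// Reads characters up to the next whitespace or parenthesis.
static std::string read_symbol(const std::string& s, std::size_t& i) {
    const std::size_t start = i;
    while (i < s.size() && !std::isspace(static_cast<unsigned char>(s[i])) &&
           s[i] != '(' && s[i] != ')') {
        ++i;
    }
    return s.substr(start, i - start);
}

// Hypothetical helper: turns a bracketed tree with word/TAG leaves into arcs.
TreeAutomaton build_automaton(const std::string& tree) {
    TreeAutomaton a;
    a.states.push_back("<S>");
    std::vector<std::pair<std::string, int>> open;  // (phrase label, start state)
    std::size_t i = 0;
    while (i < tree.size()) {
        const char c = tree[i];
        const int current = static_cast<int>(a.states.size()) - 1;
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;
        } else if (c == '(') {
            ++i;  // a constituent opens at the current state
            const std::string label = read_symbol(tree, i);
            open.emplace_back(label, current);
        } else if (c == ')') {
            ++i;
            if (open.empty()) continue;  // tolerate a stray ')'
            // The constituent closes here: add one arc spanning all its words.
            a.arcs.push_back({open.back().second, current, open.back().first});
            open.pop_back();
        } else {
            // A leaf such as "flight/NN": new state named by the word, arc by the tag.
            const std::string token = read_symbol(tree, i);
            const std::size_t slash = token.rfind('/');
            const bool tagged = slash != std::string::npos;
            a.states.push_back(tagged ? token.substr(0, slash) : token);
            a.arcs.push_back({current, current + 1,
                              tagged ? token.substr(slash + 1) : std::string("?")});
        }
    }
    return a;
}
```

Calling build_automaton on the tree in Fig. 1 gives nine states (<S> plus eight words) and, among others, the arcs <S> -NP-SBJ-> flight and arrive -PP-TMP-> a.m. A pattern that mixes words, PoS tags and phrase labels could then, in principle, be matched as a path through these arcs.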
Enhance MultiFlex: a module for automatic inflection of multi-word units:
I guess the task is to port MultiFlex to Unitex/GramLab, but I can’t find the source. Given more time to digest the paper, it’s very likely that I’ll be able to implement it from scratch.
Although I’m mainly interested in the linguistics-related tasks, I’m also confident about the following ones:
Improve the grammar compiler by creating a bytecode:
save_Fst2() in Fst2.cpp serializes the Fst2 object into plain text. From the task description, I guess the goal is to design a binary format for efficient serialization of the Fst2 object.
However, that doesn’t make sense to me, since it isn’t about “bytecode” and only speeds up deserialization. So maybe the goal is to speed up the execution of the finite state transducer: compile and optimize the original fst2 file, and replace locate_pattern() in LocatePattern.cpp for better performance.
Could you please explain the task in detail?
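For the first reading above (a binary format for the Fst2 object), the idea can be sketched as follows. This is only an illustration under assumptions: ToyState and ToyTransition are hypothetical, simplified stand-ins and do not reflect the real Fst2 structures or the .fst2 file layout, and the length-prefixed little-endian layout is just one possible design.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; NOT the real Fst2 structures of Unitex.
struct ToyTransition {
    std::uint32_t tag;     // index into a tag table
    std::uint32_t target;  // index of the destination state
};

struct ToyState {
    bool is_final = false;
    std::vector<ToyTransition> transitions;
};

// Writes a 32-bit value in fixed little-endian order, independent of the host.
static void write_u32(std::ostream& out, std::uint32_t v) {
    char b[4];
    for (int i = 0; i < 4; ++i) b[i] = static_cast<char>((v >> (8 * i)) & 0xFF);
    out.write(b, 4);
}

static std::uint32_t read_u32(std::istream& in) {
    unsigned char b[4];
    if (!in.read(reinterpret_cast<char*>(b), 4)) {
        throw std::runtime_error("truncated input");
    }
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i) v |= static_cast<std::uint32_t>(b[i]) << (8 * i);
    return v;
}

// Layout: state count, then per state: final flag, transition count, (tag, target) pairs.
void save_binary(std::ostream& out, const std::vector<ToyState>& states) {
    write_u32(out, static_cast<std::uint32_t>(states.size()));
    for (const ToyState& s : states) {
        write_u32(out, s.is_final ? 1u : 0u);
        write_u32(out, static_cast<std::uint32_t>(s.transitions.size()));
        for (const ToyTransition& t : s.transitions) {
            write_u32(out, t.tag);
            write_u32(out, t.target);
        }
    }
}

std::vector<ToyState> load_binary(std::istream& in) {
    std::vector<ToyState> states(read_u32(in));
    for (ToyState& s : states) {
        s.is_final = read_u32(in) != 0;
        s.transitions.resize(read_u32(in));
        for (ToyTransition& t : s.transitions) {
            t.tag = read_u32(in);
            t.target = read_u32(in);
        }
    }
    return states;
}

// Convenience for trying the format in memory instead of on disk.
std::string to_bytes(const std::vector<ToyState>& states) {
    std::ostringstream out;
    save_binary(out, states);
    return out.str();
}
```

Loading such a file needs no text parsing, only fixed-width reads, which is where the deserialization speed-up would come from. It does nothing for matching speed, which is the concern of the second reading.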
Create a Package Manager for Linguistic Resources:
In my understanding, this package manager is designed to replace the unitex-packaging-* system, making things easier for both developers and users.
vinber-backend
BTW, while playing around with the source and looking into the implementation details, I found that Unitex/GramLab could be improved in several ways:
Working together is harder for us, a first-time GSoC participant and a first-time GSoC organization. I hope my feedback is helpful to you, and I’m looking forward to hearing your ideas!
Regards,
Hugh
Hello, you wrote:
“Move the core function to a shared library. In this way, executables only contain unique code, and duplicated core functions are eliminated”
I don’t think we can say there is duplicated code. There are just two ways to compile the Unitex C++ Core code:
1) Build an executable, UnitexToolLogger, which needs no other file to run. The whole Unitex engine is in this executable, without duplicated code.
2) Build a shared library (or a JNI library, which is just a shared library with additional exported functions) and a very small Test_lib. We can use Test_lib + the shared library like UnitexToolLogger (run ./Test_Lib Locate -h, for example).
I think building small utility executables the “old way” (as before Unitex 2.1, with Locate and Normalize as standalone utilities) is historical and can be removed.
UnitexToolLogger was created six years ago to replace 40 executables with a single one.
And we will delete all Main_*.cpp (except Main_UnitexTool*.cpp).
Hi Gilles,
“Move the core function to a shared library. In this way, executables only contain unique code, and duplicated core functions are eliminated”
I don’t think we can say there is duplicated code.
I think building small utility executables the “old way” (as before Unitex 2.1, with Locate and Normalize as standalone utilities) is historical and can be removed.
UnitexToolLogger was created six years ago to replace 40 executables with a single one.
And we will delete all Main_*.cpp (except Main_UnitexTool*.cpp).
I agree with you. It makes sense to remove the small utilities.
BTW, I haven’t received any follow-up on the task details I asked about previously. I’m not familiar with the community culture here, though. Is it better to post to the developer mailing list, or to mail each developer individually?
Regards,
Hugh
Of course, this is just an additional option. By default, nothing is changed!
From: Denis Maurel [mailto:mau...@univ-tours.fr]
Sent: Monday, October 8, 2018 10:06
To: Gilles Vollant
Cc: Unitex list
Subject: Re: [Unitex-GramLab] Re: [GSoC] Ideas for improving the project and my interested tasks
Hello Gilles,
Thank you for keeping fst2 all the same... But I suppose we will have the choice when compiling? .fst2 and .bin?
Regards,
Denis Maurel
____________________________________
Professor Denis Maurel
Université de Tours