[GSoC] Ideas for improving the project and my interested tasks

Hugh Wang

Mar 2, 2016, 10:14:54 AM

to Unitex-GramLab

Hi All,

I’m a sophomore undergraduate at Beijing University of Posts and Telecommunications. I’m interested in NLP and would like to contribute to Unitex/GramLab in GSoC. To make sure we are a good fit for working together, I’d like to introduce myself and share my initial ideas about the individual tasks.

To introduce myself briefly: NLP is what interests me most, and I have already read most of Speech and Language Processing 2/e in my spare time. I used to be a competitive programmer and have a good command of data structures and algorithms. I’ve been coding in C for 8 years and have written tens of thousands of lines of code in it. I also take part in open source activities and ranked in the top 20 in Google Code-in 2012. My GitHub account is hghwng; you can see there that I have contributed to darkhttpd, oh-my-zsh and other projects. I also tried to start a FOSS project of my own, called yet-another-graphics-engine.

For this year’s GSoC tasks, my more detailed ideas are listed below:

Build an annotation comparison module:

As I understand it, this task consists of the following stages:

implement the core algorithm in C++
- given two annotations of the same text, match the annotations with the Viterbi algorithm
- calculate statistics on the degree of match
- export the matches and statistics to a file
develop the corresponding Java GUI code for the IDE

Is that correct? Could you please provide a simple example explaining the “match” stage?
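
To make the "match" and "stats" stages concrete, here is a minimal sketch of how I picture them. It assumes each annotation is a (start, end, label) span over the same text and that two annotations match only when they are identical. That exact-match rule is my own simplification and stands in for the Viterbi alignment; the names Span and compare_annotations are hypothetical, not part of Unitex/GramLab.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # character offset where the annotation begins
    end: int     # character offset just past the annotation
    label: str   # annotation label, e.g. "PERSON"


def compare_annotations(reference, candidate):
    """Match two annotation sets by exact (start, end, label) equality
    and return the matches together with precision, recall and F1."""
    ref, cand = set(reference), set(candidate)
    matched = ref & cand
    precision = len(matched) / len(cand) if cand else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"matched": sorted(matched), "precision": precision,
            "recall": recall, "f1": f1}


# Two hypothetical annotations of the same text.
reference = [Span(0, 4, "PERSON"), Span(14, 19, "CITY")]
candidate = [Span(0, 4, "PERSON"), Span(14, 19, "COUNTRY"), Span(25, 30, "DATE")]
stats = compare_annotations(reference, candidate)
```

Here only the PERSON span is shared, so one of the three candidate spans and one of the two reference spans count as matched. A real module would presumably need partial-overlap matching as well, which is where an alignment algorithm would come in.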

Support Locate Pattern function on treebanks:

Convert a treebank (such as BNC XML) to an automaton (see Fig. 1 and Fig. 2)
match the automaton against the given input

Could you please explain the details of the matching, such as support for mixing PoS tags and words, or for matching only part of a sentence? Is there also a need for fuzzy matching?

  (S
   (NP-SBJ The/DT flight/NN )
   (VP should/MD
    (VP arrive/VB
     (PP-TMP at/IN
      (NP eleven/CD a.m/RB ))
     (NP-TMP tomorrow/NN ))))

Fig. 1: Example tree ("The flight should arrive at eleven a.m tomorrow")

  <S> -DT-> The -NN-> flight -MD-> should -VB-> arrive -IN-> at -CD-> eleven -RB-> a.m -NN-> tomorrow -> <E>
      |                 ^ |          |            |             |                  ^ |                    ^
      --------NP-SBJ----- |          |            |             ---------NP--------^ ------NP-TMP---------^
                          |          |            -------------PP-TMP--------------^                      ^
                          |          ----------------------------VP---------------------------------------^
                          ---------------------------------------VP---------------------------------------^

Fig. 2: Example automaton
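
The conversion I have in mind can be sketched as follows. This is only an illustration under my own assumptions: states are the positions between tokens, each word becomes an arc labelled with its PoS tag, and each constituent becomes an arc spanning its words. The function names and the "<TAG>" pattern syntax are hypothetical, and the input is the tree of Fig. 1 written on one line with balanced brackets.

```python
import re


def tree_to_arcs(bracketed):
    """Turn a bracketed parse into arcs over token positions.

    Returns (word_arcs, phrase_arcs):
      word_arcs   -- (i, i + 1, pos_tag, word) for each token
      phrase_arcs -- (start, end, label) for each constituent
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    word_arcs, phrase_arcs = [], []
    stack = []      # (label, start position) of the open constituents
    position = 0    # number of words consumed so far
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            stack.append((tokens[i + 1], position))  # the label follows "("
            i += 2
        elif tok == ")":
            if not stack:
                raise ValueError("unbalanced brackets")
            label, start = stack.pop()
            phrase_arcs.append((start, position, label))
            i += 1
        else:
            word, _, tag = tok.rpartition("/")
            word_arcs.append((position, position + 1, tag, word))
            position += 1
            i += 1
    if stack:
        raise ValueError("unbalanced brackets")
    return word_arcs, phrase_arcs


def find_matches(word_arcs, pattern):
    """Return the start positions where pattern matches. An item written
    as "<TAG>" matches a PoS tag; any other item matches the word itself."""
    def item_matches(item, tag, word):
        if item.startswith("<") and item.endswith(">"):
            return item[1:-1] == tag
        return item == word

    hits = []
    for start in range(len(word_arcs) - len(pattern) + 1):
        window = word_arcs[start:start + len(pattern)]
        if all(item_matches(item, tag, word)
               for item, (_, _, tag, word) in zip(pattern, window)):
            hits.append(start)
    return hits


TREE = ("(S (NP-SBJ The/DT flight/NN ) (VP should/MD (VP arrive/VB "
        "(PP-TMP at/IN (NP eleven/CD a.m/RB )) (NP-TMP tomorrow/NN ))))")
word_arcs, phrase_arcs = tree_to_arcs(TREE)
```

With this representation, a pattern such as ["at", "<CD>"] mixes a literal word with a PoS tag and can match anywhere in the sentence. Matching on constituent labels would use phrase_arcs in the same way.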

Enhance MultiFlex: a module for automatic inflection of multi-word units:

I guess the task is to port Multiflex to Unitex/GramLab, but I can’t find the source. Given more time to digest the paper, I will very likely be able to implement it from scratch.

Although I’m mainly interested in the linguistics-related tasks, I’m also confident that I could handle the following ones:

Improve the grammar compiler by creating a bytecode:

save_Fst2() in Fst2.cpp serializes the Fst2 object into plain text. From the task description, I guess the goal is to design a binary format for efficient serialization of the Fst2 object.

However, that doesn’t quite make sense: a binary format is not “bytecode”, and it only speeds up deserialization. So perhaps the goal is to speed up the execution of the finite-state transducer: compile and optimize the original fst2 file, and replace locate_pattern() in LocatePattern.cpp with a faster implementation.

Could you please explain the task in detail?
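
For the first reading of the task (a binary format for serialization), a minimal sketch could look like this. It uses a toy transducer, not the real Fst2 structure: the "TFST" signature, the record layout and the function names are all assumptions of mine for illustration.

```python
import struct

MAGIC = b"TFST"  # hypothetical 4-byte signature, not the real Unitex format


def dump_transducer(states):
    """Serialize states to bytes.

    states is a list of (is_final, transitions); transitions is a list of
    (tag_id, target_state) pairs. Counts and ids are little-endian uint32;
    the final flag is a single byte.
    """
    out = [MAGIC, struct.pack("<I", len(states))]
    for is_final, transitions in states:
        out.append(struct.pack("<BI", int(is_final), len(transitions)))
        for tag_id, target in transitions:
            out.append(struct.pack("<II", tag_id, target))
    return b"".join(out)


def load_transducer(data):
    """Inverse of dump_transducer: rebuild the list of states from bytes."""
    if data[:4] != MAGIC:
        raise ValueError("bad signature")
    offset = 4
    (n_states,) = struct.unpack_from("<I", data, offset)
    offset += 4
    states = []
    for _ in range(n_states):
        is_final, n_trans = struct.unpack_from("<BI", data, offset)
        offset += 5
        transitions = []
        for _ in range(n_trans):
            transitions.append(struct.unpack_from("<II", data, offset))
            offset += 8
        states.append((bool(is_final), transitions))
    return states


# A two-state toy transducer: state 0 --tag 1--> state 1 (final).
toy = [(False, [(1, 1)]), (True, [])]
blob = dump_transducer(toy)
restored = load_transducer(blob)
```

Loading such a file needs no text parsing, which is the only gain of this reading. It does nothing to make the matching itself faster, hence my question.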

Create a Package Manager for Linguistic Resources:

As I understand it, this package manager is meant to replace the unitex-packaging-* system, making things easier for both developers and users.

release system
- integrate with vinber-backend
- package the program and data files
- manage metadata
  - per-package
    - size / hash of files
    - version / date / architecture / install path
  - metadata for packages
    - package list
    - dependency (?)
package manager
- fetch and update metadata
- download the contents from release site or mirror
- install / upgrade
- CLI
- … with Java GUI support (?)
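
The per-package metadata in the outline above could be sketched like this. The field names, the JSON-compatible layout, the choice of SHA-256 and the package name "lingua-fr" are all assumptions of mine, not an existing Unitex/GramLab format.

```python
import hashlib
import json


def file_entry(path, content):
    """Metadata for one packaged file: its path, size and SHA-256 hash."""
    return {"path": path, "size": len(content),
            "sha256": hashlib.sha256(content).hexdigest()}


def build_manifest(name, version, files, depends=()):
    """Build the metadata of one package; files maps install paths to bytes."""
    return {
        "name": name,
        "version": version,
        "depends": list(depends),
        "files": [file_entry(p, c) for p, c in sorted(files.items())],
    }


def verify(manifest, files):
    """True when every listed file is present with matching size and hash."""
    return all(
        entry["path"] in files
        and file_entry(entry["path"], files[entry["path"]]) == entry
        for entry in manifest["files"]
    )


# Hypothetical package contents, kept in memory for the example.
files = {"Dela/dict.bin": b"\x00\x01", "Graphs/date.fst2": b"example"}
manifest = build_manifest("lingua-fr", "1.0.0", files)
manifest_text = json.dumps(manifest, indent=2)  # what a release site would publish
```

The package manager would fetch the manifest first, download the listed files, and call something like verify() before installing, so that a corrupted or outdated mirror is detected.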

BTW, while playing around with the source and looking into the implementation details, I found that Unitex/GramLab could be improved in several ways:

Open up the mailing list and set up a forum, so that other GSoC students and potential developers are easier to reach
Clarify the tasks. The task descriptions are abstract and ambiguous IMHO; examples would help
Clean up the code base
- Use a better build system, such as CMake, to automate the dependency handling of the current makefile, among other things
- Move the core function to a shared library. In this way, executables only contain unique code, and duplicated core functions are eliminated
- Set up coding standards. From a software engineering perspective, the coding style needs improving

Working together is harder when it is the first GSoC for both of us, participant and organization alike. I hope my feedback is helpful, and I’m looking forward to hearing your ideas!

Regards,
Hugh

Gilles Vollant

Mar 2, 2016, 3:44:48 PM

to Hugh Wang, Unitex-GramLab

Hello, you wrote:

“Move the core function to a shared library. In this way, executables only contain unique code, and duplicated core functions are eliminated”

I don’t think we can say there is duplicated code. There are just two ways to compile the Unitex C++ core code:

1) Build an executable, UnitexToolLogger, which needs no other file to run. The whole Unitex engine is in this executable, without duplicated code.

2) Build a shared library (or a JNI library, which is just a shared library with additional exported functions) and a very small Test_lib. Test_lib plus the shared library can be used like UnitexToolLogger (for example, run ./Test_Lib Locate -h).

I think that building the small utility executables the “old way” (as before Unitex 2.1, with Locate and Normalize as standalone utilities) is a historical leftover and can be removed.

UnitexToolLogger was created six years ago to replace 40 executables with a single one.

And we will delete all Main_*.cpp files (except Main_UnitexTool*.cpp).

Hugh Wang

Mar 3, 2016, 8:03:48 PM

to Unitex-GramLab, hgh...@gmail.com

Hi Gilles,

>> Move the core function to a shared library. In this way, executables only contain unique code, and duplicated core functions are eliminated
>
> I don’t think we can say there is duplicated code.
>
> I think that building the small utility executables the “old way” (as before Unitex 2.1, with Locate and Normalize as standalone utilities) is a historical leftover and can be removed.
>
> UnitexToolLogger was created six years ago to replace 40 executables with a single one.
>
> And we will delete all Main_*.cpp files (except Main_UnitexTool*.cpp).

I agree with you. It makes sense if small utilities are removed.

BTW, I haven’t received any follow-up on the task details I asked about earlier. I’m not familiar with the community culture here: would it be better to post to the developer mailing list, or to mail each developer individually?

Regards,
Hugh

Gilles Vollant

Oct 8, 2018, 3:03:29 AM

to Unitex-GramLab

I added support for a binary FST2 file format.

I wrote this code 8 years ago for private use and have now integrated it into the open-source Unitex:

https://github.com/UnitexGramLab/unitex-core/commit/9115056b5ca82536db4b23a644596e1b23d72b52

Gilles Vollant

Oct 8, 2018, 4:09:39 AM

to denis....@univ-tours.fr, Liste Unitex

Of course, this is just an additional option. By default, nothing changes!

From: Denis Maurel [mailto:mau...@univ-tours.fr]
Sent: Monday, October 8, 2018 10:06
To: Gilles Vollant
Cc: Liste Unitex
Subject: Re: [Unitex-GramLab] Re: [GSoC] Ideas for improving the project and my interested tasks

Hello Gilles,

Please do keep the fst2 format all the same... But I suppose we will have the choice when compiling? .fst2 and .bin?

Best regards,

Denis Maurel

____________________________________
Professor Denis Maurel
Université de Tours

eric.laporte

Oct 9, 2018, 4:26:36 AM

to Unitex-GramLab

Hi Hugh,

You're welcome to contribute to the code of Unitex/GramLab! Thanks for your suggestions. In addition to Gilles' posting, here is some information that might be useful to you.

Google Summer of Code: we are not yet certain to participate next year; it depends on the number of available supervisors.
Annotation comparison module: Philippe Gambette (UPEM) supervised an internship on this project this year; you should talk with him.
Coding standards: starting with Unitex/GramLab v3.2, we will follow the Google C++ Style Guide (https://google.github.io/styleguide/cppguide.html). For now, this only applies to new code files (i.e. not to legacy sources predating v3.2-alpha), cf. Cristian Martinez's comment at https://github.com/UnitexGramLab/unitex-core/pull/13
Support Locate Pattern function on treebanks: the contact for this project is Matthieu Constant (Univ. Lorraine); you should talk with him.
Enhance MultiFlex: a module for automatic inflection of multi-word units: the contact for this project is the author of Multiflex, Agata Savary (Univ. Tours). There are two implemented versions of Multiflex: one of them is already in Unitex/GramLab and works quite correctly, and most of the changes to the other version are probably worth adopting. You should talk to her to get the source.