Unit testing of grammars

Javier Miguel Sastre Martínez

Dec 14, 2017, 5:05:42 AM

to Unitex-GramLab

Hi all,

I would like to use Unitex for building grammars whose purpose is to detect certain sentences or expressions of interest (e.g. cardinal numbers) for a chatbot that provides services. The construction of cardinal numbers is quite regular, with a few exceptions (e.g. 100 is written "cien" in Spanish, while the numbers from 101 to 199 are written "ciento uno" to "ciento noventa y nueve"). Is there a way of developing grammars using an annotated corpus as unit test cases? For instance, I would develop an annotated corpus with a few examples of each case, such as:

uno/1

dos/2

diez/10

cien/100

ciento uno/101

mil/1000

etc.

Then it would be nice to have in Unitex some button that would apply the grammar to all the unit test cases and list those where the grammar failed to properly translate the input into the output.

Now extrapolate from cardinal numbers to a big grammar detecting the different kinds of requests a user may make to the chatbot. As the number of services provided by the chatbot increases and the grammar is adapted to cover the new cases, the probability of breaking something that previously worked increases, and the only way of keeping some control is to validate the grammar against an annotated corpus. Is there something like this in Unitex? Thank you.

Regards,

Javier
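A minimal sketch of the check described in this post, assuming the "input/expected output" line format from the examples. It is written in Python for brevity. The names `parse_corpus`, `failing_cases`, `TOY_TABLE` and `toy_translate` are hypothetical, and `toy_translate` is only a stand-in for applying a real grammar; nothing here calls Unitex.

```python
from typing import Callable, List, Tuple

# Annotated corpus in the "input/expected output" form used in the post.
CORPUS = """\
uno/1
dos/2
diez/10
cien/100
ciento uno/101
mil/1000
"""


def parse_corpus(text: str) -> List[Tuple[str, str]]:
    """Split each non-empty line into (input, expected output) at the last '/'."""
    cases = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            source, expected = line.rsplit("/", 1)
            cases.append((source, expected))
    return cases


def failing_cases(
    translate: Callable[[str], str], cases: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every case the translator gets wrong."""
    failures = []
    for source, expected in cases:
        actual = translate(source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures


# Hypothetical stand-in for applying the grammar: a lookup table that
# deliberately lacks the irregular form "cien".
TOY_TABLE = {"uno": "1", "dos": "2", "diez": "10", "ciento uno": "101", "mil": "1000"}


def toy_translate(text: str) -> str:
    """Return the table entry for the input, or an empty string if unknown."""
    return TOY_TABLE.get(text, "")


if __name__ == "__main__":
    for source, expected, actual in failing_cases(toy_translate, parse_corpus(CORPUS)):
        print(f"FAIL: {source!r}: expected {expected!r}, got {actual!r}")
```

Run as written, the only case reported is "cien", because the toy table omits it. Passing the translator in as a function keeps the comparison loop independent of how the grammar is actually applied.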

eric.laporte

Dec 14, 2017, 10:49:56 AM

to Unitex-GramLab

Hi Javier,

There is no such tool for benchmarking a local grammar. Note that the result of a grammar depends not only on the grammar, but also on the preprocessing, so the tool should include a specification of the preprocessing, a bit like in GramLab. But thanks for the good idea.

Eric

Javier Miguel Sastre Martínez

Dec 14, 2017, 11:04:35 AM

to Unitex-GramLab

Hi Éric,

If I'm going to develop a chatbot providing commercial services, this would be a must. I have already implemented a command-line tool that applies a grammar to a set of tagged sentences and lists the sentences that are not recognized/translated by the grammar as expected, but it would be nice to have this feature integrated in Unitex, sort of like how it is currently done in IntelliJ IDEA for unit testing Java code (see for instance https://i.ytimg.com/vi/5P_bCJMFxhI/maxresdefault.jpg; the frame at the bottom left corner indicates which tests have passed and which have not).

Regards,

Javier

eric.laporte

Dec 14, 2017, 11:34:54 AM

to Unitex-GramLab

Javier,

You are welcome to integrate your tool into Unitex if it satisfies some requirements, for example being multiplatform (command-line tools tend not to be, which is why contributions are in C/C++ or Java) and being adapted to general needs, not only to a specific project (if your tool uses the '/' character to separate input from output in the annotated corpus, it might be a good idea to make the choice of the character parameterizable, since '/' is not rare in texts, and other users may be interested in inputs with '/' characters inside). If you integrate a feature, software-related issues (e.g. libraries if needed) will be discussed by the community of developers on the GitHub platform, and other developers will review your code. User-related issues (e.g. is it important to parameterize the choice of the delimiter?) can be discussed on this forum.

Best,

Eric
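As an illustration of the delimiter point raised in this reply, here is a minimal Python sketch of a line parser whose separator is a parameter. The function name `split_case`, its `delimiter` argument, and the tab-separated sample line with the label `SERVICE_HOURS` are all hypothetical and are not part of Unitex or of Javier's tool.

```python
from typing import Tuple


def split_case(line: str, delimiter: str = "/") -> Tuple[str, str]:
    """Split one annotated line into (input, expected output).

    The split happens at the LAST occurrence of the delimiter, so the input
    part may itself contain the delimiter; the expected output may not.
    """
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    source, sep, expected = line.rpartition(delimiter)
    if not sep:
        raise ValueError(f"no {delimiter!r} delimiter in line: {line!r}")
    return source.strip(), expected.strip()


if __name__ == "__main__":
    # Default delimiter, as in the corpus examples of the first post.
    print(split_case("ciento uno/101"))
    # An input that contains '/' itself, so a tab is chosen as delimiter.
    print(split_case("24/7 support\tSERVICE_HOURS", delimiter="\t"))
```

Splitting at the last occurrence is one possible design choice: it tolerates the delimiter inside the input, but it still fails when the expected output contains the delimiter, which is one argument for a structured format instead.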

Javier Miguel Sastre Martínez

Dec 14, 2017, 11:57:36 AM

to Unitex-GramLab

I guess it would first be necessary to have some graphical interface in Unitex for managing the annotated dataset, which would be a list of pairs of text sequences (the input and the expected output). The file format for an annotated corpus could simply be a JSON file such as

[
  {
    "in": "mil quinientos veinticuatro",
    "out": "1524"
  },
  {
    "in": "dos mil ciento uno",
    "out": "2101"
  }
]

There wouldn't be a separating character, and a JSON library (e.g. Jackson) could be used for taking care of serialization/deserialization.

The tool I mentioned is certainly tied to a specific project, though a general version would simply be a matter of launching the Locate Pattern tool in Unitex for each in/out pair of utterances, comparing the actual and expected outputs, and storing the array indexes of the offending pairs in order to list only those afterwards.

Regards,

Javier
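The loop outlined in this post can be sketched as follows, here in Python with the standard `json` module rather than Jackson. The names `offending_indexes` and `fake_apply_grammar` are hypothetical, and `fake_apply_grammar` is only a placeholder for launching Locate Pattern on one utterance; how that call would actually be made is not specified in the thread and is not assumed here.

```python
import json
from typing import Callable, List

# The annotated corpus from the post, in the proposed JSON format.
CORPUS_JSON = """
[
  {"in": "mil quinientos veinticuatro", "out": "1524"},
  {"in": "dos mil ciento uno", "out": "2101"}
]
"""


def offending_indexes(corpus: List[dict], apply_grammar: Callable[[str], str]) -> List[int]:
    """Return the array indexes of pairs whose actual output differs from "out"."""
    return [
        index
        for index, pair in enumerate(corpus)
        if apply_grammar(pair["in"]) != pair["out"]
    ]


def fake_apply_grammar(utterance: str) -> str:
    """Hypothetical stand-in for the grammar; it knows only the first example."""
    known = {"mil quinientos veinticuatro": "1524"}
    return known.get(utterance, "")


if __name__ == "__main__":
    corpus = json.loads(CORPUS_JSON)
    for index in offending_indexes(corpus, fake_apply_grammar):
        pair = corpus[index]
        print(f"[{index}] {pair['in']!r}: expected {pair['out']!r}")
```

Keeping only the indexes, as suggested in the post, means the report can be rebuilt from the original array without copying the utterances.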

eric.laporte

Jan 10, 2018, 5:07:37 AM

to Unitex-GramLab

Dear Javier,

The next step to make it real is to open an 'issue' on the GitHub platform, describe what is missing in Unitex/GramLab, and outline a solution, so that other developers have a chance to discuss it.