How to get offsets from original text after multiple cascade application?

Jean-Christophe Pautre

Nov 25, 2013, 5:55:29 AM

to unitex-...@googlegroups.com

Dear all,

I am venturing into the Casen cascades, both analysis and synthesis, and I am trying to get offsets that correspond to the original text. I do not really know whether I am going about it the right way, but it seems that I can get offsets after applying the first cascade (analysis). My workflow calls Normalize, then fst2txt for sentence and replace, then tokenize, propagating offsets with the --input_offsets and --output_offsets parameters. That way I seem to get a proper TokenOffset.txt after normalization.

To sum up the workflow:

normalize file.txt --output_offsets=norm_offsets.txt ...
fst2txt sentence --input_offsets=norm_offsets.txt --output_offsets=after_sentence_offsets.txt ...
fst2txt replace --input_offsets=after_sentence_offsets.txt --output_offsets=after_replace_offsets.txt ...
tokenize file.snt --input_offsets=after_replace_offsets.txt --output_offsets=tokenOffsets.txt ...
cassys ...

I do not know what kind of information is propagated through --input_offsets and --output_offsets, since the format is a bit cryptic. So I am somewhat stuck when it comes to applying the second cascade (synthesis).

Then:

normalize file_csc.txt --input_offsets=after_replace_offsets.txt --output_offsets=csc_norm_offsets.txt ...
tokenize file_csc.snt --input_offsets=csc_norm_offsets.txt --output_offsets=csc_tokenOffsets.txt ...
cassys ...

I tried using only normalization (fed with the last after_replace_offsets.txt, produced by the fst2txt replace step) followed by tokenization with the resulting offsets, before applying the synthesis cascade. But I cannot manage to get a csc_TokenOffset.txt relative to the original text; it is only relative to the csc.snt file from the second normalization. So it looks like I cannot propagate the offset information from the analysis cascade through the application of the second cascade.

Am I right or did I miss the point? What could be the proper way to propagate offsets?

Thanks in advance,
Jean-Christophe

Nathalie Friburger

Jan 23, 2014, 8:39:49 AM

to unitex-...@googlegroups.com

Dear Jean-Christophe,

I forgot to answer your question. Sorry for the delay.

You are trying to get offsets that correspond to the original text.
The problem is that Cassys does not have the --input_offsets / --output_offsets options. Cassys works on the files token.txt and concord.ind.
After the preprocessing steps (such as normalize and tokenize), Cassys is launched: it builds its own concord.ind (with the offsets of the initial text after preprocessing, of course). This concord.ind contains all the patterns found by the cascade of graphs, with their offsets relative to the initial file.
The concord.ind file created by Cassys provides correct offsets for further processing, such as building the resulting file or building a concordance in Unitex.

Best regards,
Nathalie
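
For readers wondering what "propagating offsets" through several preprocessing steps amounts to, here is a minimal Python sketch of the idea. It is not Unitex code and it does not read the Unitex offsets files: the record layout (old_start, old_end, new_start, new_end, with half-open spans) and the helper names to_old and to_original are assumptions made for illustration only. Each stage is described by the spans it rewrote, and a position in the final text is mapped back by walking the stages in reverse order.

```python
# Hypothetical record layout (an assumption, not the documented Unitex format):
# (old_start, old_end, new_start, new_end), half-open spans, sorted by position.
# Text outside the listed spans is assumed to be copied unchanged by the stage.

def to_old(pos, records, is_end=False):
    """Map a position in a stage's output back to that stage's input."""
    shift = 0  # (old - new) distance for unchanged text after the last record passed
    for old_start, old_end, new_start, new_end in records:
        if pos < new_start or (is_end and pos == new_start):
            break
        if pos < new_end:
            # Inside a rewritten span: snap to the edge of the source span.
            return old_end if is_end else old_start
        shift = old_end - new_end
    return pos + shift

def to_original(start, end, stages):
    """Map a half-open span in the final text back through all stages."""
    for records in reversed(stages):
        start = to_old(start, records)
        end = to_old(end, records, is_end=True)
    return start, end

# Toy example: two stages applied one after the other.
original = "Hello  world. Bye."      # note the double space
# Stage 1 collapses the double space: "Hello world. Bye."
stage1 = [(5, 7, 5, 6)]
# Stage 2 inserts a sentence mark:    "Hello world.{S} Bye."
stage2 = [(12, 12, 12, 15)]
final = "Hello world.{S} Bye."

span = to_original(16, 19, [stage1, stage2])  # "Bye" in the final text
print(span, original[span[0]:span[1]])        # (14, 17) Bye
```

The point of the sketch is that every stage must either receive the table of the previous stage or have its own table kept, otherwise the chain back to the original text is broken. That matches the situation described above: a step that neither reads nor writes an offsets table yields positions relative only to its own input.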
