Dear all,
I am venturing into the Casen cascades, both analysis and synthesis, and I am trying to get offsets that correspond to the original text. I do not really know whether I am going about it the right way, but it seems that I can get offsets after applying the first cascade (analysis). My workflow calls Normalize, fst2txt for sentence and replace, then tokenize, propagating offsets with the --input_offsets and --output_offsets parameters, so that I get what looks like a proper TokenOffset.txt after normalization.
To sum up, the workflow looks like this:
- normalize file.txt --output_offsets=norm_offsets.txt ...
- fst2txt sentence --input_offsets=norm_offsets.txt --output_offsets=after_sentence_offsets.txt ...
- fst2txt replace --input_offsets=after_sentence_offsets.txt --output_offsets=after_replace_offsets.txt ...
- tokenize file.snt --input_offsets=after_replace_offsets.txt --output_offsets=tokenOffsets.txt ...
- cassys ...
I do not know what kind of information is propagated by --input_offsets and --output_offsets, since the format is a bit cryptic, so I am a bit stuck when it comes to applying the second cascade (synthesis).
Then:
- normalize file_csc.txt --input_offsets=after_replace_offsets.txt --output_offsets=csc_norm_offsets.txt ...
- tokenize file_csc.snt --input_offsets=csc_norm_offsets.txt --output_offsets=csc_tokenOffsets.txt ...
- cassys ...
I tried using only normalization (taking as input the last after_replace_offsets.txt, produced by the fst2txt/replace step) followed by tokenization with the resulting norm_offsets, before applying the synthesis cascade. However, I cannot manage to get a csc_TokenOffset.txt that is relative to the original text; it is only relative to the csc.snt file from the second normalization. So it looks like I cannot propagate the offset information from the analysis cascade through the application of the second cascade.
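To make concrete what "relative to the original text" would mean here, below is a minimal sketch of the mapping, not of what Unitex actually does. It assumes that each stage's offsets can be read as a sorted list of differences (old_start, old_end, new_start, new_end) with half-open character ranges; that layout is an assumption and should be checked against the Unitex manual. The names Diff, to_previous and to_original are hypothetical helpers, not part of any Unitex tool.

```python
from typing import List, Tuple

# One rewritten region between two versions of a text, as half-open ranges:
# (old_start, old_end, new_start, new_end).
# ASSUMPTION: this layout is illustrative, not a verified Unitex file format.
Diff = Tuple[int, int, int, int]


def to_previous(pos: int, diffs: List[Diff]) -> int:
    """Map a position in the newer text to the older text for one stage."""
    shift = 0  # (old - new) displacement of unchanged text after the last diff
    for old_start, old_end, new_start, new_end in diffs:
        if pos < new_start:
            break  # pos lies in unchanged text before this diff
        if pos < new_end:
            return old_start  # inside a rewritten region: snap to its start
        shift = old_end - new_end
    return pos + shift


def to_original(pos: int, stages: List[List[Diff]]) -> int:
    """Walk back through every stage, last stage first."""
    for diffs in reversed(stages):
        pos = to_previous(pos, diffs)
    return pos


# Toy data: "a  b" -> "a b" (normalization), then "a b" -> "a {S}b" (sentence).
stage1 = [(1, 3, 1, 2)]
stage2 = [(2, 2, 2, 5)]
print(to_original(5, [stage1, stage2]))  # 'b' in the final text; prints 3
```

In this sketch, the second cascade would only need to add its own stages to the end of the same list for the final token offsets to map back to the original file.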
Am I right or did I miss the point? What could be the proper way to propagate offsets?
Thanks in advance,
Jean-Christophe