Parallel corpus with many corpora


mansayk

Dec 3, 2019, 3:50:07 AM
to NoSketch Engine
Hello!

I roughly understand how to create a map file for a parallel corpus consisting of two corpora. Could you share your experience of the steps needed to build a parallel corpus consisting of 5-10 or more corpora? As I understand it, making cross map files for every language pair would be a mess in this case. Or am I looking in the wrong direction?

With best wishes,
Mansur

Sketch Engine Support

Dec 5, 2019, 3:27:13 AM
to NoSketch Engine
Hi Mansur,
yes, you need to create map files for all language pairs. If, say, your corp_lang1_lang2.align file contains numbers from 0 to the last sentence_id in both columns (i.e. all sentence IDs from both corpora are covered), you can create corp_lang2_lang1.align by simply swapping the columns.

More info here: https://www.sketchengine.eu/guide/setting-up-parallel-corpora/#tab-id-3
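
As a rough illustration of the two points above, here is a minimal Python sketch. It assumes the alignment file has two whitespace-separated columns per line and follows the corp_lang1_lang2.align naming used in this thread; the function names and the language codes in the usage note are hypothetical and not part of any Sketch Engine tooling.

```python
from itertools import permutations


def align_file_names(prefix, languages):
    """List one alignment file name per ordered language pair.

    Assumption: files are named <prefix>_<lang1>_<lang2>.align,
    mirroring the corp_lang1_lang2.align example in this thread.
    """
    return [f"{prefix}_{a}_{b}.align" for a, b in permutations(languages, 2)]


def swap_alignment_columns(lines):
    """Yield each alignment line with its two columns swapped.

    Assumption: each non-empty line holds exactly two
    whitespace-separated fields; output is tab-separated.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        left, right = fields  # raises ValueError if not exactly two columns
        yield f"{right}\t{left}"
```

For N corpora this gives N*(N-1) files, e.g. 20 for five languages. Only half of them need to be produced from scratch; the other half can be derived with swap_alignment_columns, provided all sentence IDs are covered as described above.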

Best regards,
Vít Baisa

Sketch Engine Team

mansur

Dec 5, 2019, 3:39:21 AM
to NoSketch Engine
Hi, Vit!

OK, thank you very much for clarifying. I will write a script to generate all those mappings. There are two ways to do it:
- Prepare the mappings in XML format, then use your scripts (calign.py...) to convert them into the noske mapping format (TSV).
- Prepare the mappings directly in the noske format (TSV).

Do your scripts perform any additional checking or validation? Which way would you recommend?

Best,
Mansur

Vít Baisa

Dec 10, 2019, 3:47:05 AM
to mansur, NoSketch Engine
Hi Mansur,
we don't have a script for validation or checking. We usually run:
prepare_alignment_with_your_script | ./fixgaps.py | ./compressrng.py > OUT
where fixgaps.py adds missing IDs and compressrng.py merges consecutive -1 alignments:

x -1
x+1 -1
x+2 -1
...
x+n -1

into

x,x+n -1

You can also compress consecutive 1:1 alignments with the ":" notation to save space:

x y
x+1 y+1
x+2 y+2
...
x+n y+n

into

x:x+n y:y+n
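
The two compression rules above can be sketched in Python as follows. This is not the actual compressrng.py, only an illustration of the notation described in this message. It assumes the input is a list of (source_id, target_id) integer pairs sorted by source ID, with -1 as the target for unaligned sentences, and that the output columns are tab-separated; the function name is hypothetical.

```python
def compress_alignment(pairs):
    """Compress runs of alignment pairs into range notation.

    pairs: list of (source_id, target_id) tuples sorted by source_id;
    target_id == -1 marks a source sentence with no counterpart.
    Returns a list of tab-separated output lines.
    """
    out = []
    i, n = 0, len(pairs)
    while i < n:
        src, tgt = pairs[i]
        j = i
        if src < 0:
            # A -1 on the source side is passed through unchanged.
            out.append(f"{src}\t{tgt}")
        elif tgt == -1:
            # Extend over consecutive source IDs that are all unaligned.
            while j + 1 < n and pairs[j + 1] == (pairs[j][0] + 1, -1):
                j += 1
            left = f"{src},{pairs[j][0]}" if j > i else str(src)
            out.append(f"{left}\t-1")
        else:
            # Extend over consecutive 1:1 alignments (both IDs grow by one).
            while j + 1 < n and pairs[j + 1] == (pairs[j][0] + 1, pairs[j][1] + 1):
                j += 1
            if j > i:
                out.append(f"{src}:{pairs[j][0]}\t{tgt}:{pairs[j][1]}")
            else:
                out.append(f"{src}\t{tgt}")
        i = j + 1
    return out
```

For example, the pairs (0,0), (1,1), (2,2), (3,-1), (4,-1), (5,3) come out as three lines: "0:2 0:2", "3,4 -1" and "5 3".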

Cheers,
Vítek B.


mansur

Dec 14, 2019, 1:57:18 AM
to Vít Baisa, NoSketch Engine
Hi, Vit!

I did everything you said and it works, thank you! But I get a strange message during compilation of both corpora:
ERROR: aligned corpus ... not aligned back.

I created the mappings and reversed them to get the backward alignment. I tested the corpus and found no problems; it works correctly both ways. Could you tell me how to fix this error?

Thank you!

Best,
Mansur

Vít Baisa

Dec 16, 2019, 8:42:00 AM
to mansur, NoSketch Engine
Hi,
it is a warning from the corpcheck script, which runs after the compilation. It means that the configuration of corpus_a has ALIGNED "...,corpus_b,..." but the ALIGNED value in corpus_b does not contain "corpus_a". (I am looking into the code.)
All parallel corpora must be aligned to each other.
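
The reciprocity rule described here can be checked before compiling with a small Python sketch like the one below. It is not the corpcheck script, and it rests on assumptions: each corpus configuration is available as plain text, the ALIGNED value is a double-quoted, comma-separated list on one line, and the names in that list are the same keys used to look up the other configurations. The function names are hypothetical.

```python
import re


def parse_aligned(config_text):
    """Return the set of corpus names listed in the ALIGNED attribute."""
    match = re.search(r'^\s*ALIGNED\s+"([^"]*)"', config_text, flags=re.MULTILINE)
    if not match:
        return set()
    return {name.strip() for name in match.group(1).split(",") if name.strip()}


def missing_back_alignments(configs):
    """Find (corpus, target) pairs where target does not align back.

    configs: dict mapping corpus name -> configuration file text.
    """
    aligned = {name: parse_aligned(text) for name, text in configs.items()}
    missing = []
    for name, targets in aligned.items():
        for target in sorted(targets):
            if name not in aligned.get(target, set()):
                missing.append((name, target))
    return missing
```

An empty result means every ALIGNED entry has a matching entry in the other corpus; each returned pair names a corpus and the target that does not list it back.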

It is also advisable to compile the corpora first with --no-align and then again without this parameter.

V.

mansur

Dec 17, 2019, 9:45:22 AM
to Vít Baisa, NoSketch Engine
Hi, Vit!

Yes, I thought so too, and I double-checked everything. The configuration matches your example and it works well both ways. Despite that, I still get those errors. Maybe there are some other nuances I don't know about yet?

And could you please show me an example of how to use authentication in noske? I know it somehow supports Apache's basic authentication, but I don't know how to use it.

With best regards,
M.

Vít Baisa

Dec 17, 2019, 10:21:36 AM
to mansur, NoSketch Engine
Basic auth is independent of noske. You need to configure Apache, e.g. add the following to your site configuration file:

...
<Directory "/var/www/bonito/or/similar">
    AuthType Basic
    AuthName "Auth prompt title"
    AuthUserFile /path/to/your/htpasswd/file
    Require valid-user
    ....
</Directory>
...

Then create the htpasswd file with the htpasswd command, which manages users and passwords.

Check permissions, reload Apache, etc. Then it should work. You will find plenty of related resources online.

V.
