Asymmetrical parallel corpora


Slavomír Čéplö

Oct 26, 2021, 12:29:34 PM
to NoSketch Engine
Dear all,

I'm not sure whether I already asked this; if I did, I can't find the
question, so mea maxima culpa.

I have a parallel corpus with some 12 sentence-split texts, each in a
number of translations; the corpora are aligned by sentence (or, well,
whatever passes for a sentence in these trying times).

The problem is that while the languages are always the same (GRE, SYR,
ARA), some texts have one translation into SYR, some have two, some
have three. I ended up with six parallel corpora in which, for every
missing translation, I added a dummy text with the same number of dummy
sentences as a real translation would have had.
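
For illustration, the padding workaround described above could be sketched roughly as follows. This is only a minimal sketch of the idea, not the actual setup: the data layout (`n_sentences`, `translations`), the slot names such as `SYR2`, and the `<dummy>` placeholder are all assumptions made for the example, and nothing here touches NoSketch Engine itself.

```python
from typing import Dict, List

DUMMY = "<dummy>"  # hypothetical placeholder sentence, not a NoSketch Engine convention


def pad_to_slots(texts: List[dict], slots: Dict[str, int]) -> Dict[str, List[str]]:
    """Build one sentence column per (language, slot), padding missing versions.

    `slots` maps a language to the maximum number of translations any text
    has in it, e.g. {"GRE": 1, "SYR": 3}.
    """
    columns: Dict[str, List[str]] = {
        f"{lang}{i + 1}": [] for lang, n in slots.items() for i in range(n)
    }
    for text in texts:
        for lang, n_slots in slots.items():
            versions = text["translations"].get(lang, [])
            for i in range(n_slots):
                if i < len(versions):
                    filler = versions[i]  # a real translation
                else:
                    # no translation for this slot: as many dummies as sentences
                    filler = [DUMMY] * text["n_sentences"]
                columns[f"{lang}{i + 1}"].extend(filler)
    return columns


def dummy_share(column: List[str]) -> float:
    """Fraction of a column that consists of dummy sentences."""
    return sum(s == DUMMY for s in column) / len(column) if column else 0.0
```

Every column ends up with the same number of sentences, which keeps the sentence alignment trivial, and `dummy_share` shows how much of a given column is padding.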

Let's put aside the issue that I will end up with three corpora which
are at least 60% full of these dummy elements. Is there a simpler way
to do something like this, something where I don't have to create all
the dummy sentences?
Thank you,

Slavo

Miloš Jakubíček

Nov 1, 2021, 5:52:02 AM
to Slavomír Čéplö, NoSketch Engine
Hi Slavo,

just to clarify: you've got sentences in language A, some of which have no translation into language B, some of which have one, and some of which have several (and likewise for languages C, D, E, ...)?

Best
Milos

Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton, UK



Slavomír Čéplö

Nov 1, 2021, 6:00:26 AM
to Miloš Jakubíček, NoSketch Engine
Hi Miloš,

yes, that is the situation, although I like to think of it in terms of
docs; the overview here https://bulbul.sk/H/help.html might help.

Best,

Slavo

Ruprecht von Waldenfels

Nov 1, 2021, 6:56:08 AM
to no...@sketchengine.co.uk
Dear All,
what I have been doing is defining each text as its own corpus, each
with its own set of languages.
Have a look:
www.parasolcorpus.org/Ursynow

Best,
R


Slavomír Čéplö

Nov 1, 2021, 6:59:05 AM
to Ruprecht von Waldenfels, NoSketch Engine
Dear Ruprecht,

this is cool, and I was thinking about it, but with three languages it
seems a bit of overkill for my use case.
Best,

Slavo

Ruprecht von Waldenfels

Nov 1, 2021, 7:10:03 AM
to Slavomír Čéplö, NoSketch Engine
Dear Slavomír,
I guess there are no really good solutions.

In theory, it depends on what you want to do with it. You could have
THREE different corpora (SYR, GRE, ARA), each of which would contain ALL
the available text in that language and, in addition, the other
translations (either incomplete or with dummy sentences).

This would give you correct counts and easy searches in the primary
language. So, if you search for VerbX in all the texts available in
lang1, you search corpusLang1, and you get exactly what's there.
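
As a rough sketch of why a per-language primary corpus gives correct counts: every available version in the chosen language is collected exactly once, with no padding on that side. The data layout (`id`, `translations`) and both function names are hypothetical and chosen only for this example; how the aligned sides would then be attached in NoSketch Engine is not modelled here.

```python
from typing import Dict, List


def build_primary(texts: List[dict], lang: str) -> List[Dict[str, object]]:
    """Collect every available version in `lang` exactly once, with no dummies."""
    docs = []
    for text in texts:
        versions = text["translations"].get(lang, [])
        for i, sentences in enumerate(versions, start=1):
            docs.append({"doc_id": f'{text["id"]}.{lang}{i}', "sentences": sentences})
    return docs


def count_token(docs: List[Dict[str, object]], token: str) -> int:
    """Count whitespace-separated occurrences of `token` across all documents."""
    return sum(
        sentence.split().count(token)
        for doc in docs
        for sentence in doc["sentences"]
    )
```

Because the primary side contains only real text, a count over it reflects exactly what exists in that language; any padding would be confined to the aligned sides.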

I think it will be difficult to get any better without complicated
dynamic measures.


Best,
Ruprecht




Miloš Jakubíček

Nov 1, 2021, 7:10:34 AM
to Slavomír Čéplö, Ruprecht von Waldenfels, NoSketch Engine
I can't really think of any better solution than the one suggested by Ruprecht, i.e. having separate corpora for separate translations.
You could also take the multiple translations as a single alignment chunk if they are consecutive text in the corpus, but I guess that's not what you want.
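
The single-chunk idea might look roughly like the sketch below: for each source sentence, the corresponding sentence from every available translation is placed consecutively in the target, and the whole run is treated as one alignment chunk. The function name and the `(source_index, start, end)` tuples are assumptions made for illustration; this is a plain in-memory representation, not the alignment file format that NoSketch Engine expects.

```python
from typing import List, Tuple


def chunk_alignment(
    source: List[str], versions: List[List[str]]
) -> Tuple[List[str], List[Tuple[int, int, int]]]:
    """Interleave all translations so each source sentence maps to one chunk.

    Returns the target sentence list and, per source sentence, a tuple
    (source_index, start, end) giving a half-open range into the target.
    The range is empty when there is no translation at all.
    """
    target: List[str] = []
    alignment: List[Tuple[int, int, int]] = []
    for k in range(len(source)):
        start = len(target)
        for version in versions:
            target.append(version[k])  # k-th sentence of each translation
        alignment.append((k, start, len(target)))
    return target, alignment
```

With this layout no dummy sentences are needed, but the translations of a text are no longer separable as individual corpora, which is probably the drawback hinted at above.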

Best
Milos

