How to export Tatoeba in a simple format


gleki

Mar 6, 2012, 9:19:39 AM
to loj...@googlegroups.com
I want to export the Tatoeba database into a simple spreadsheet with two columns:
one for English and another for Lojban.

Does anyone know how to do that?

ianek

Mar 6, 2012, 4:17:22 PM
to lojban
http://tatoeba.org/pol/download_tatoeba_example_sentences
http://tatoeba.org/files/downloads/sentences.csv

There are actually three columns: id, language, sentence, but with
some database-fu or script-fu or maybe even spreadsheet-fu you can get
what you want. Or maybe I'll hack it together in a while.
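
As a rough illustration of the "script-fu" route, here is a minimal Python sketch that keeps only the rows of one language. It assumes sentences.csv is tab-separated with exactly the three columns mentioned above (id, language, sentence) and that Lojban rows carry the code jbo. The function name and the sample rows are made up for the example.

```python
import csv
from typing import Iterable, Iterator, Tuple


def sentences_in_language(rows: Iterable[str], lang: str) -> Iterator[Tuple[int, str]]:
    """Yield (id, text) for each tab-separated row whose language column is `lang`."""
    # QUOTE_NONE keeps any double quotes inside a sentence as literal characters.
    reader = csv.reader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed rows instead of guessing their layout
        sentence_id, row_lang, text = row
        if row_lang == lang:
            yield int(sentence_id), text


# Made-up sample rows in the assumed id<TAB>language<TAB>sentence layout.
sample = ["1\teng\tHello.\n", "2\tjbo\tcoi\n", "3\tpol\tCześć.\n"]
jbo_sentences = list(sentences_in_language(sample, "jbo"))
print(jbo_sentences)
```

For the real file, an open file handle can be passed instead of the list, for example open("sentences.csv", encoding="utf-8", newline="").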

mu'o mi'e ianek

ianek

Mar 6, 2012, 5:47:17 PM
to lojban
I've created the list for you, but it was an ugly hack in bash. A
better way would be to create a database and import sentences.csv and
links.csv to it, and then write a very simple program instead of
hacking around with grep etc. But it would be more work of course. And
maybe not faster, considering that import would take time.
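
The database route described above could look roughly like the following sketch, which uses Python's built-in sqlite3 module. It is not the bash hack itself. It assumes links.csv has two columns (a sentence id and the id of a translation) and that every link is stored in both directions; the table names, column names and sample ids are hypothetical.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_db(
    sentences: Iterable[Tuple[int, str, str]],
    links: Iterable[Tuple[int, int]],
) -> sqlite3.Connection:
    """Load (id, lang, text) rows and (src, dst) link rows into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT, text TEXT)")
    conn.execute("CREATE TABLE links (src INTEGER, dst INTEGER)")
    conn.executemany("INSERT INTO sentences VALUES (?, ?, ?)", sentences)
    conn.executemany("INSERT INTO links VALUES (?, ?)", links)
    # The index keeps the join fast once the tables hold the full export.
    conn.execute("CREATE INDEX links_src ON links (src)")
    return conn


def direct_pairs(conn: sqlite3.Connection, src_lang: str, dst_lang: str) -> List[Tuple[str, str]]:
    """Return (source text, translation text) for directly linked sentence pairs."""
    query = """
        SELECT a.text, b.text
        FROM sentences AS a
        JOIN links AS l ON l.src = a.id
        JOIN sentences AS b ON b.id = l.dst
        WHERE a.lang = ? AND b.lang = ?
        ORDER BY a.id, b.id
    """
    return conn.execute(query, (src_lang, dst_lang)).fetchall()


# Made-up sample data; the ids are hypothetical.
conn = build_db(
    [(1, "eng", "Hello."), (2, "jbo", "coi"), (3, "pol", "Cześć.")],
    [(1, 2), (2, 1), (1, 3), (3, 1)],
)
print(direct_pairs(conn, "jbo", "eng"))
```

The import is the slow part, as noted above, but it only has to be repeated when the CSV files are replaced.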

Here you go: http://dl.dropbox.com/u/17805197/jbo-eng.csv
It's a tab-separated list; any spreadsheet program should read it.

As a by-product, I am able to produce such a list for any other
language available in tatoeba instantly, if anyone's interested.

mu'o mi'e ianek

On 6 Mar, 22:17, ianek <jane...@gmail.com> wrote:
> http://tatoeba.org/pol/download_tatoeba_example_sentences
> http://tatoeba.org/files/downloads/sentences.csv

gleki

Mar 7, 2012, 5:44:16 AM
to loj...@googlegroups.com
I'm interested, and actually in doing it periodically myself rather than by request,
because the database is live and is being updated by us.

Of course I know about those three files.

For now, I'd prefer such an export for several directions at once (a multilingual spreadsheet).
I want all sentences for which we have Lojban translations, i.e.
first column: Lojban
second column: English
then I need:
japanese
chinese
russian
arabic
spanish
polish
french
german

To repeat: an automated script for doing this would be awesome.
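
To make the requested layout concrete, here is a minimal Python sketch of such a multilingual table, with one row per Lojban sentence and one column per target language. The language codes (jbo, eng, jpn, cmn, rus, ara, spa, pol, fra, deu) are my assumption about how Tatoeba labels these languages, links are assumed to be pairs of sentence ids, and joining several translations with " / " is just one possible choice. All function names and sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Set, Tuple

# Assumed language codes, in the column order requested above.
LANGS = ["jbo", "eng", "jpn", "cmn", "rus", "ara", "spa", "pol", "fra", "deu"]


def multilingual_rows(
    sentences: Iterable[Tuple[int, str, str]],
    links: Iterable[Tuple[int, int]],
    langs: Sequence[str] = LANGS,
) -> List[List[str]]:
    """Build one row per sentence in langs[0], with direct translations per column."""
    lang: Dict[int, str] = {}
    text: Dict[int, str] = {}
    for sentence_id, code, body in sentences:
        lang[sentence_id] = code
        text[sentence_id] = body
    neighbours: Dict[int, Set[int]] = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    rows = []
    for sentence_id in sorted(lang):
        if lang[sentence_id] != langs[0]:
            continue
        by_lang: Dict[str, List[str]] = defaultdict(list)
        for other in sorted(neighbours.get(sentence_id, ())):
            if other in lang:
                by_lang[lang[other]].append(text[other])
        # Columns without a translation stay empty.
        rows.append([text[sentence_id]] + [" / ".join(by_lang[code]) for code in langs[1:]])
    return rows


def to_tsv(rows: List[List[str]], langs: Sequence[str]) -> str:
    """Serialise the rows as tab-separated text with a header line of language codes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(langs)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data with hypothetical ids, restricted to three columns.
demo_langs = ["jbo", "eng", "rus"]
demo_rows = multilingual_rows(
    [(1, "jbo", "coi"), (2, "eng", "Hello."), (3, "eng", "Hi.")],
    [(1, 2), (2, 1), (1, 3), (3, 1)],
    demo_langs,
)
print(to_tsv(demo_rows, demo_langs))
```

This version uses direct links only, so a column stays empty when a language is linked to the Lojban sentence only through another translation.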

ianek

Mar 7, 2012, 9:51:44 AM
to lojban
What platform? Is Linux ok?

gleki

Mar 7, 2012, 11:47:52 AM
to loj...@googlegroups.com

On Wednesday, March 7, 2012 6:51:44 PM UTC+4, ianek wrote:
> What platform? Is Linux ok?

For me? Well, yes. As for others interested in Lojban, we can periodically take snapshots of the database ourselves.

gleki

Mar 7, 2012, 11:51:20 AM
to loj...@googlegroups.com
And if the script is not yet ready for publishing,
please compile a jbo-rus.csv file.
Our Russian Lojbanists will greatly appreciate it.
Your name will be next to the spreadsheet for sure.

mu'o

ianek

Mar 7, 2012, 1:32:58 PM
to lojban
I've just found out that links.csv is not complete, i.e. it doesn't
cover all the pairs. For example, we have a Lojban sentence "lo purci
ka'e te djuno gi'e na ka'e se galfi .i lo balvi ka'e se galfi gi'e na
ka'e te djuno" and a Polish sentence "Przeszłość może być tylko
poznana, nie zmieniona. Przyszłość może być tylko zmieniona, nie
poznana." and they're not linked to each other, but they both are
linked to "The past can only be known, not changed. The future can
only be changed, not known.". I wonder if there's a rule that such
sentences always have a "common relative"; it would certainly make
things easier. But I think that using a database (maybe sqlite3)
would now be necessary.
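
The "common relative" idea can be sketched in a few lines of Python: two sentences are paired when both are linked to the same third sentence. This is only an illustration under the assumption that links are pairs of sentence ids; the ids below are hypothetical, while the three sentences are the ones quoted above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def pairs_via_common_relative(
    sentences: Iterable[Tuple[int, str, str]],
    links: Iterable[Tuple[int, int]],
    src_lang: str,
    dst_lang: str,
) -> List[Tuple[str, str]]:
    """Pair src_lang and dst_lang sentences that share a linked third sentence."""
    lang: Dict[int, str] = {}
    text: Dict[int, str] = {}
    for sentence_id, code, body in sentences:
        lang[sentence_id] = code
        text[sentence_id] = body
    # Treat links as undirected so a missing reverse entry does not matter.
    neighbours: Dict[int, Set[int]] = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    pairs = set()
    for sentence_id, code in lang.items():
        if code != src_lang:
            continue
        for pivot in neighbours.get(sentence_id, ()):
            for other in neighbours.get(pivot, ()):
                if other != sentence_id and lang.get(other) == dst_lang:
                    pairs.add((text[sentence_id], text[other]))
    return sorted(pairs)


JBO = ("lo purci ka'e te djuno gi'e na ka'e se galfi "
       ".i lo balvi ka'e se galfi gi'e na ka'e te djuno")
POL = ("Przeszłość może być tylko poznana, nie zmieniona. "
       "Przyszłość może być tylko zmieniona, nie poznana.")
ENG = ("The past can only be known, not changed. "
       "The future can only be changed, not known.")

# Hypothetical ids; jbo and pol are each linked to eng but not to each other.
example_sentences = [(1, "jbo", JBO), (2, "pol", POL), (3, "eng", ENG)]
example_links = [(1, 3), (2, 3)]
print(pairs_via_common_relative(example_sentences, example_links, "jbo", "pol"))
```

Note that this function returns only the two-step pairs; directly linked pairs would have to be collected separately and merged in.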

mu'o mi'e ianek

ianek

Mar 7, 2012, 1:36:44 PM
to lojban
http://dl.dropbox.com/u/17805197/jbo-rus.csv

But it's probably not complete, for the reason I mentioned.

ianek

Mar 21, 2012, 12:52:12 PM
to loj...@googlegroups.com
OK, I've made it. http://dl.dropbox.com/u/17805197/parse-tatoeba.tar.gz
Unpack it to a directory with links.csv and sentences.csv from Tatoeba.
Run ./prepare-links.sh once. (You'll have to do it again only if you replace links.csv or sentences.csv with newer files.)
Then run ./make-pairs.sh [language-code] > [some filename].csv
For example ./make-pairs.sh eng > jbo-eng.csv

I've made it so that it gathers all of the interlinked sentences. This has some drawbacks. Do you know the "phone game"? If you do, you know what I'm saying. If you don't, you will know when you look at some pairs...

mu'o mi'e ianek

la gleki

Mar 2, 2013, 3:37:52 AM
to loj...@googlegroups.com, evar...@gmail.com


On Wednesday, January 2, 2013 11:00:34 PM UTC+4, evar...@gmail.com wrote:
> Hi,
> I'm a German teacher at a Spanish university and I've tried to adapt your script to download a bilingual CSV (German-Spanish) from Tatoeba. The problem is that I have absolutely no programming or Linux knowledge and I can't figure out why this doesn't work. It would be very nice if you could give me a hint on how to do that.
> Thank you!

I suggest that you replace the sequence "jbo" in all files of the script with the sequence "deu" (the list of all language codes can be seen here).
Also open all the files of the script and replace "jbo" with "deu" there.

Then add the downloaded database to the folder and run the script (on Windows you can use Cygwin).
Note that the script is rather slow; it might take several hours to complete.

la gleki

May 20, 2013, 10:27:00 AM
to loj...@googlegroups.com


On Wednesday, March 21, 2012 8:52:12 PM UTC+4, ianek wrote:
> OK, I've made it. http://dl.dropbox.com/u/17805197/parse-tatoeba.tar.gz
> Unpack it to a directory with links.csv and sentences.csv from Tatoeba.
> Run ./prepare-links.sh once. (You'll have to do it again only if you replace links/sentences with newer files).
> Then run ./make-pairs.sh [language-code] > [some filename].csv
> For example ./make-pairs.sh eng > jbo-eng.csv
>
> I've made it so that it gathers all of the interlinked sentences. This has some drawbacks. Do you know the "phone game"? If you do, you know what I'm saying. If you don't, you will know when you look at some pairs...


This is a great script. But can we have another one with only direct translations, to remove that "broken telephone" effect?
Also, can we have a script that links indirect translations only up to a given depth (e.g. 1 level)?
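
One way to get both behaviours from a single function is a breadth-first search over the links with a depth limit: depth 1 gives direct translations only, and larger depths allow more intermediate sentences. The Python sketch below is not the existing script; it assumes links are pairs of sentence ids, and the function name and sample ids are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def translations_within(
    links: Iterable[Tuple[int, int]], start: int, max_depth: int
) -> Dict[int, int]:
    """Map each sentence reachable from `start` in at most `max_depth` links to its hop count."""
    neighbours: Dict[int, Set[int]] = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if hops[current] == max_depth:
            continue  # do not expand beyond the requested depth
        for nxt in neighbours.get(current, ()):
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)
    del hops[start]  # the starting sentence is not its own translation
    return hops


# Hypothetical chain of links: 1 - 2 - 3 - 4.
chain = [(1, 2), (2, 3), (3, 4)]
print(translations_within(chain, 1, 1))
print(translations_within(chain, 1, 2))
```

The hop count could also be written out as an extra column, so that pairs found through intermediates can be reviewed or filtered later.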