How to create parallel texts for language learning

Vera Surkova

Feb 9, 2011, 11:24:19 AM
to listenin...@googlegroups.com
Below you will find instructions for creating parallel texts. For now they are only in English. I have asked the author for permission to translate and publish a Russian version and am waiting for a reply. On the whole there is no "rocket science" in it; I often process information in roughly this way myself :)

Original: http://languagefixation.wordpress.com/2011/02/09/how-to-create-parallel-texts-for-language-learning-part-1/

I’d like to say a bit about ways to make parallel texts. I consider parallel texts to be a very valuable learning resource, as I’ve mentioned in the past. They enable you to learn a language much faster than from textbooks, because they make an enormous amount of content instantly comprehensible.

Unfortunately, it’s nearly impossible to find parallel texts. The most common commercially available ones seem to be books of poetry and “classic” works of literature. Call me uncultured, but I usually get easily bored by books from the 1800s. I want something with an *interesting* plot, and I’ve been known to read a lot of fantasy and sci-fi, for which there are basically zero parallel texts commercially available. Also, the commercial ones are not usually sentence-aligned or even paragraph-aligned…at best they’re page-aligned, if that. For easy learning, you want all the little translated bits right beside each other for easy comparison.

So, for that reason, it’s more realistic to assume that you’re going to have to either make your parallel texts yourself, or get someone else to make them for you. To this end, I’ll give you a bit of info about how I do it, so that you can perhaps give it a try.

Ok, first the basics. What you’re going to start with is two ebooks. I don’t care where you get them, that’s not my problem. You might find public domain works at Project Gutenberg, or maybe you buy modern ebooks from online booksellers (for example, I found some Danish ebooks and mp3 audiobooks for sale online). Or maybe you borrow them from a friend. Ideally you want a place that doesn’t sell crippled files, like the bastards at audible.com. I really really want to buy a lot of their audiobooks, but I just can’t play them on my operating system due to their crippling DRM. Some places sell ebooks with DRM as well, which makes them viewable only on certain devices and prevents you from sharing them with your neighbour. This is bad…you should help your neighbour :)

Anyway, back to ebooks. So you need an ebook in your target language, and another one in a language that you understand really well (hopefully your native language, if such a translation exists). The next step is that you probably want a text format version of these ebooks, since that’s much easier to process than things like PDF and EPUB. There are some software programs that will convert between several different ebook formats, but I just use a document viewer called Okular, which is able to view a PDF or EPUB and then “export to text” to give me a clean file.

Next, you need a way to align these texts. What this means is that you’re going to create a file in which the equivalent paragraphs or sentences match up with each other. For example, the one I’m currently reading has individual Dutch sentences in the left-hand column, and each sentence is matched with its English translation in the right-hand column. There are two main ways to achieve this. One is more time-consuming but technically very simple, and the other involves a bit of computer know-how but is much more time-efficient.

In this article, I’ll be describing the “easy” way, and then my next article will be for the people who know what I mean when I say things like “emacs”, “regular expressions”, and “Makefile”. You know who you are. For those who don’t recognize the software terms, but are still keen to put your growing computer skills to the test, be sure to take a look at that article when I publish it in a few days. That method requires much less manual repetitive work. But for now, the less-technical way!

First you change all empty lines (i.e. [ENTER][ENTER]) in the book to something unique (a weird character like Ĉ that doesn’t exist in that language) in order to save the paragraph breaks. Then you remove all remaining [ENTER]s from the document so it’s all one line. Now you go back and restore the paragraph breaks by changing Ĉ to [ENTER], which means each paragraph is now on a separate line. Do this for both copies of the book. This step gets rid of a bunch of [ENTER]s that were just breaking individual sentences into pieces unnecessarily. In this method, you only want the paragraphs to be divided.
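For anyone who would rather script this step than do it with search-and-replace in an editor, here is a minimal Python sketch of the same three replacements. It is only an illustration, not the author's own tooling: the function name `one_paragraph_per_line` is made up, and it makes one assumption beyond the description above, namely that each removed [ENTER] is replaced by a single space so that words on neighbouring lines do not get glued together.

```python
import re

# The placeholder character from the article; any character that never
# occurs in the book will do.
PLACEHOLDER = "\u0108"  # Ĉ


def one_paragraph_per_line(text, placeholder=PLACEHOLDER):
    """Return text with each paragraph on a single line (hypothetical helper)."""
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text; pick another one")
    # Normalise Windows/old-Mac line endings and trim the ends of the book.
    text = text.replace("\r\n", "\n").replace("\r", "\n").strip()
    # Step 1: turn every run of empty lines into the placeholder.
    text = re.sub(r"\n(?:[ \t]*\n)+", placeholder, text)
    # Step 2: remove the remaining line breaks (assumption: replace each
    # one with a single space instead of deleting it outright).
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)
    # Step 3: restore the paragraph breaks.
    text = text.replace(placeholder, "\n")
    return "\n".join(line.strip() for line in text.split("\n"))


if __name__ == "__main__":
    sample = "The quick brown\nfox jumps.\n\nA second\nparagraph here.\n"
    print(one_paragraph_per_line(sample))
```

Run it once on each copy of the book. The check at the top is there because the trick only works if the placeholder really is absent from the text.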

Now that you have a collection of separate paragraphs, you open up a spreadsheet program (such as openoffice.org, gnumeric, koffice, or maybe that famous one from Microsoft, if you’re desperate), and you create a table with two columns and one row. Paste one language into the left-hand cell, and the other language into the right-hand cell. Now you just have to make sure that each paragraph lines up with its appropriate neighbour by adding extra [ENTER]s to make them even out. Sometimes you may have to split one paragraph into smaller ones to do that.

At the end, once you know that they all line up, you remove the excess lines by changing [ENTER][ENTER] into just [ENTER] (repeating this if necessary), and now you have one paragraph per line. Now you copy-paste as table cells with one line per row, so the whole text of each language is still in one column. Since each paragraph now has its own row, the matching paragraphs show up beside each other!
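If the copy-paste into table cells is fiddly in your spreadsheet program, the last step can also be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not part of the method as described: it assumes both texts have already been lined up by hand, so that the nth non-empty line on one side matches the nth non-empty line on the other. The function names `paired_rows` and `write_table` and the sample sentences are made up. The output is a CSV file, which the spreadsheet programs mentioned above can open.

```python
import csv
import io
from itertools import zip_longest


def paired_rows(left_text, right_text):
    """Pair up the non-empty lines of two already-aligned texts."""
    left = [line for line in left_text.split("\n") if line.strip()]
    right = [line for line in right_text.split("\n") if line.strip()]
    # If one side is longer, the other side is padded with empty cells,
    # so a leftover misalignment shows up at the bottom of the table.
    return [list(pair) for pair in zip_longest(left, right, fillvalue="")]


def write_table(rows, out):
    """Write the rows as CSV to an open text file or buffer."""
    csv.writer(out).writerows(rows)


if __name__ == "__main__":
    # Made-up sample paragraphs, one per line.
    dutch = "Hallo.\n\nTot ziens."
    english = "Hello.\nGoodbye.\n"
    buffer = io.StringIO()
    write_table(paired_rows(dutch, english), buffer)
    print(buffer.getvalue())
```

To write a real file, open it with `open("book.csv", "w", newline="", encoding="utf-8")` and pass that to `write_table`. The script does no aligning of its own, so the manual checking described above is still needed first.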

Now, just as a disclaimer, I’ve never actually done this method myself, so you might have to experiment a bit if you get stuck. I just wanted to mention a method that doesn’t require tons of in-depth computer hackery. I heard about this method from people who have used it successfully many times, and I’ve seen the result of their work (such as a paragraph-aligned Chinese / English Harry Potter, for example), so I know it can work well for some people.

Next time I’ll elaborate on my more automated process, but first I want to try and automate it a bit more. I think I can save a couple of steps in the sentence-dividing stage by using another little script, so then I might be able to automate the whole thing from start to finish. Hopefully this will also make it a bit more accessible to others.

Until then, keep reading!

Vera Surkova

Feb 9, 2011, 11:39:40 AM
to listenin...@googlegroups.com
Permission received! I'll translate it soon :)