Giga French-English parallel corpus updated


Chris Callison-Burch

Dec 1, 2008, 4:55:23 PM
to Fourth Workshop on Statistical Machine Translation (WMT09)
I have released an updated version of my large French-English corpus.
The new version is nearly twice as large as the pre-release. Here are
the stats according to wc:

wc -lw giga-fren.release1.* giga-fren.prerelease.*

17671660 416922880 giga-fren.release1.en
17671660 488019651 giga-fren.release1.fr

8600471 190274564 giga-fren.prerelease.en
8600471 224287326 giga-fren.prerelease.fr

I've done a slightly better job at de-duplication, so the new part of
the corpus starts at line 7474145.
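
If you've already run alignment on the pre-release and only want the new
material, something along these lines should pull it out (a quick Python
sketch; the filenames and starting line number are the ones quoted above,
and the output filenames are made up):

# extract_new.py -- copy everything from the start of the new material
# (line 7474145, per the note above) into separate files.
START_LINE = 7474145  # from the release note above

for lang in ("en", "fr"):
    src = "giga-fren.release1." + lang
    dst = "giga-fren.release1.new." + lang   # hypothetical output name
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for lineno, line in enumerate(fin, start=1):
            if lineno >= START_LINE:
                fout.write(line)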

There is a reasonable chance that I'll be able to release an
additional 150 million words' worth of text by the end of the week, but
I wanted to get this out to give people enough time to train up Giza
before the test data is released next week.

--Chris

Yves Scherrer

Dec 2, 2008, 5:25:10 AM
to Fourth Workshop on Statistical Machine Translation (WMT09)
Unfortunately, the corpus is not accessible...

Forbidden
You don't have permission to access /wmt09/training-giga-fren.release1.tar
on this server.

Chris Callison-Burch

Dec 3, 2008, 9:29:47 PM
to Fourth Workshop on Statistical Machine Translation (WMT09)
Permissions have been updated.

--C

Chris Callison-Burch

Dec 5, 2008, 10:45:48 PM
to Fourth Workshop on Statistical Machine Translation (WMT09)
I've updated the large French-English parallel corpus again. It's got
an additional 150 million words. Getting pretty close to my goal of a
10^9 word parallel corpus.

wc -lw giga-fren.release2.*

22520400 575758246 giga-fren.release2.en
22520400 672373978 giga-fren.release2.fr

The new part of the corpus starts around line 17355680.

--Chris

joshi

Dec 7, 2008, 3:48:24 AM
to Fourth Workshop on Statistical Machine Translation (WMT09)
May I access this corpus?

allauzen

Dec 18, 2008, 11:18:16 AM
to Fourth Workshop on Statistical Machine Translation (WMT09)
Is there a description of how the "large French-English parallel
corpus" was built (sources, websites, algorithms, ...)?

Chris Callison-Burch

Dec 19, 2008, 1:20:12 PM
to WM...@googlegroups.com
I will be writing a more detailed description of how I created the
corpus in our overview paper, but here's a short description:

(1) I crawled a variety of Canadian, European and international web
sites, gathering somewhere on the order of 40 million files consisting
of more than 1TB of data.
(2) I converted pdf, doc, html, asp, php, etc. files into text,
preserving the directory structure of the web crawl.
(3) I wrote a set of simple heuristics to transform French URLs into
English URLs (replacing "fr" with "en", plus about 40 other hand-written
rules), and assumed that the documents at each resulting pair of URLs
are translations of each other (see the sketch after this list).
(4) This yielded 2.9 million French files paired with their English
equivalents.
(5) I split each of these files into sentences, and put <P> markers
between paragraphs.
(6) I used Bob Moore's sentence aligner to align the files in batches
of 10,000 files.
(7) I de-duplicated the corpus, removing all sentence pairs that occur
more than once in the parallel corpus. A lot of the documents are
duplicates or near duplicates, and a lot of the text is repeated (for
instance web site navigation). I used a Bloom filter to do the
de-duplication, so I might have thrown out more than I needed to (see
the sketch after this list).
(8) I further cleaned the corpus by eliminating sentence pairs that
were mainly numbers, or which differed from previous sentences only in
their numbers.
(9) I deleted sentence pairs where the "French" and "English"
sentences are identical. Sometimes one or the other of the documents
wasn't actually translated. This is an easy way of handling many of
the untranslated documents without having to do language identification.
(10) Finally, I concatenated all of the cleaned sentence alignments
together.
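
To make step (3) concrete, here is a rough sketch of the URL-mapping
idea in Python. The two rules below are invented examples, not the actual
rule set, which was hand-written and contained around 40 rules:

# url_pairing.py -- illustrate the French-to-English URL mapping idea
# from step (3). These rules are examples only.
import re

RULES = [
    (re.compile(r"/fr/"), "/en/"),            # directory-level language code
    (re.compile(r"_fr\.html$"), "_en.html"),  # filename-level language code
]

def english_candidate(french_url):
    """Return a candidate English URL for a French one, or None."""
    for pattern, replacement in RULES:
        if pattern.search(french_url):
            return pattern.sub(replacement, french_url)
    return None

print(english_candidate("http://example.gc.ca/fr/rapport_2008.html"))
# -> http://example.gc.ca/en/rapport_2008.html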
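
Step (7) can also be sketched. This is not the code that was actually
used; it's a minimal hand-rolled Bloom filter that shows why the approach
can discard more than necessary: a false positive makes a never-seen pair
look like a repeat, so it gets dropped.

# dedup_sketch.py -- approximate de-duplication of sentence pairs with
# a tiny hand-rolled Bloom filter (illustration only).
import hashlib

NUM_BITS = 1 << 27   # filter size; choose according to corpus size
NUM_HASHES = 4

bits = bytearray(NUM_BITS // 8)

def positions(key):
    # derive NUM_HASHES bit positions from one SHA-1 digest
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    for i in range(NUM_HASHES):
        chunk = digest[4 * i: 4 * i + 4]
        yield int.from_bytes(chunk, "big") % NUM_BITS

def seen_before(key):
    """Test membership, then add the key. May return True for a key
    never seen (false positive), which is how extra pairs get thrown
    out; it never returns False for a key already added."""
    hit = True
    for pos in positions(key):
        byte, bit = divmod(pos, 8)
        if not bits[byte] & (1 << bit):
            hit = False
        bits[byte] |= 1 << bit
    return hit

def deduplicate(pairs):
    # keep only the first occurrence of each (French, English) pair
    for fr, en in pairs:
        if not seen_before(fr + "\t" + en):
            yield fr, en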

--Chris