I will be writing a more detailed description of how I created the
corpus in our overview paper, but here's a short version:
(1) I crawled a variety of Canadian, European, and international web
sites, gathering on the order of 40 million files totaling more than
1 TB of data.
(2) I converted pdf, doc, html, asp, php, etc. files into plain text,
preserving the directory structure of the web crawl (there are rough
sketches of several of these steps after the list).
(3) I wrote a set of simple heuristics to transform French URLs into
English URLs (e.g. replacing "fr" with "en", plus about 40 other
hand-written rules; sketch below), and assumed that the paired
documents are translations of each other.
(4) This yielded 2.9 million French files paired with their English
equivalents.
(5) I split each file into sentences and put <P> markers between
paragraphs (sketch below).
(6) I used Bob Moore's sentence aligner to align the files in batches
of 10,000 (sketch below).
(7) I de-duplicated the corpus, removing the repeated copies of any
sentence pair that occurred more than once in the parallel corpus. A
lot of the documents are duplicates or near duplicates, and a lot of
the text is repeated (for instance, web site navigation). I used a
Bloom filter to do the de-duplication (sketch below); since Bloom
filters give false positives, I might have thrown out more than I
needed to.
(8) I further cleaned the corpus by eliminating sentence pairs that
were mainly numbers, or that differed from previous sentences only in
their numbers (a combined sketch of this step and the next is below).
(9) I deleted sentence pairs where the "French" and "English"
sentences are identical. Sometimes one or the other of the documents
wasn't actually translated. This is an easy way of handling many of
the untranslated documents without having to do language identification.
(10) Finally, I concatenated all of the cleaned sentence alignments
into a single corpus.
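
Here are a few rough sketches of the steps above, in case they're
useful. For step (2), the conversion amounts to walking the crawl
tree and dispatching on file extension. This is a minimal sketch,
not the actual code: pdftotext and antiword are stand-ins (I haven't
said above which converters were used), and the HTML extraction is
the simplest thing that works.

    import os
    import subprocess
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect the visible text of an html/asp/php page."""
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            self.chunks.append(data)

    def convert(src, dst):
        ext = os.path.splitext(src)[1].lower()
        if ext == ".pdf":
            # pdftotext writes the extracted text to its second argument
            subprocess.run(["pdftotext", src, dst], check=False)
        elif ext == ".doc":
            # antiword prints the document text to stdout
            with open(dst, "w", encoding="utf-8") as out:
                subprocess.run(["antiword", src], stdout=out, check=False)
        elif ext in (".html", ".htm", ".asp", ".php"):
            parser = TextExtractor()
            with open(src, encoding="utf-8", errors="replace") as f:
                parser.feed(f.read())
            with open(dst, "w", encoding="utf-8") as out:
                out.write(" ".join(parser.chunks))

    def convert_tree(crawl_root, text_root):
        # Mirror the crawl's directory structure under text_root
        for dirpath, _, filenames in os.walk(crawl_root):
            rel = os.path.relpath(dirpath, crawl_root)
            os.makedirs(os.path.join(text_root, rel), exist_ok=True)
            for name in filenames:
                convert(os.path.join(dirpath, name),
                        os.path.join(text_root, rel, name + ".txt"))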
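
For step (3), the URL heuristics are just string rewrites. Only the
"fr" -> "en" substitution below comes from the description above; the
other rules are hypothetical examples of what the ~40 hand-written
rules might look like.

    # Each rule maps a French URL fragment to its English counterpart.
    RULES = [
        ("/fr/", "/en/"),
        ("_fr.", "_en."),         # hypothetical
        ("-fra.", "-eng."),       # hypothetical
        ("lang=fr", "lang=en"),   # hypothetical
        ("francais", "english"),  # hypothetical
    ]

    def english_candidates(french_url):
        """Yield every English URL that is one rewrite away."""
        for fr, en in RULES:
            if fr in french_url:
                yield french_url.replace(fr, en)

    # crawled_urls is a hypothetical set of every URL in the crawl; a
    # candidate that actually exists is assumed to be a translation.
    def find_pairs(french_urls, crawled_urls):
        for url in french_urls:
            for candidate in english_candidates(url):
                if candidate in crawled_urls:
                    yield url, candidate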
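
For step (5), the output format is one sentence per line with <P>
between paragraphs. The splitter below is deliberately naive (it
breaks on sentence-final punctuation followed by whitespace); a real
pipeline would use a proper sentence splitter, but the shape of the
output is the point here.

    import re

    SENT_END = re.compile(r"(?<=[.!?])\s+")

    def split_into_sentences(text):
        lines = []
        # Paragraphs are assumed to be separated by blank lines
        for para in re.split(r"\n\s*\n", text.strip()):
            flat = " ".join(para.split())
            lines.extend(s for s in SENT_END.split(flat) if s)
            lines.append("<P>")
        return lines[:-1]  # drop the trailing <P>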
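
For step (6), the batching is the only part worth sketching, and I
won't reproduce the aligner's actual command line here. file_pairs
and align_batch are hypothetical placeholders.

    from itertools import islice

    def batches(items, size=10000):
        """Yield successive lists of at most `size` items."""
        it = iter(items)
        while True:
            batch = list(islice(it, size))
            if not batch:
                return
            yield batch

    # for batch in batches(file_pairs):   # file_pairs: hypothetical
    #     align_batch(batch)              # wraps a call to Moore's aligner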
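
For step (7), a Bloom filter keeps the seen-set in a fixed amount of
memory at the cost of false positives, which is exactly why some
non-duplicate pairs may have been discarded. A minimal sketch (the
bit-array size, hash count, and file names are all arbitrary stand-ins):

    import hashlib

    class BloomFilter:
        def __init__(self, n_bits=1 << 27, n_hashes=5):
            self.n_bits = n_bits
            self.n_hashes = n_hashes
            self.bits = bytearray(n_bits // 8)

        def _positions(self, item):
            # Slice one SHA-1 digest (20 bytes) into five 4-byte hashes
            digest = hashlib.sha1(item.encode("utf-8")).digest()
            for i in range(self.n_hashes):
                chunk = digest[4 * i:4 * i + 4]
                yield int.from_bytes(chunk, "big") % self.n_bits

        def add(self, item):
            """Set the item's bits; return True if all were already set."""
            seen = True
            for pos in self._positions(item):
                byte, bit = divmod(pos, 8)
                if not self.bits[byte] & (1 << bit):
                    seen = False
                    self.bits[byte] |= 1 << bit
            return seen

    bloom = BloomFilter()
    with open("aligned.fr-en") as pairs, open("deduped.fr-en", "w") as out:
        for line in pairs:
            if not bloom.add(line):  # keep only the first (probable) copy
                out.write(line)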
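
For steps (8) and (9), the cleaning filters are simple predicates over
each pair (and the pair before it). The 50% digit threshold is an
arbitrary stand-in; the description above doesn't give the exact
criterion.

    def digit_ratio(s):
        chars = [c for c in s if not c.isspace()]
        return sum(c.isdigit() for c in chars) / max(len(chars), 1)

    def strip_digits(s):
        return "".join(c for c in s if not c.isdigit())

    def keep(fr, en, prev_fr="", prev_en=""):
        # (8) drop pairs that are mainly numbers ...
        if digit_ratio(fr) > 0.5 or digit_ratio(en) > 0.5:
            return False
        # ... or that differ from the previous pair only in numbers
        if (strip_digits(fr) == strip_digits(prev_fr)
                and strip_digits(en) == strip_digits(prev_en)):
            return False
        # (9) drop pairs where the two sides are identical, i.e. one
        # side of the document was never actually translated
        if fr == en:
            return False
        return True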
--Chris