We are acquiring and annotating texts to create multilingual corpora
for linguistic research, particularly computational linguistics.
Two forms of data are currently available:
The Bible in CES Format
Religious texts such as the Bible are widely available,
carefully translated, and published in a huge variety of languages.
We provide versions of the Bible consistently annotated according
to the Corpus Encoding Standard (CES).
The STRAND Bilingual Databases
Parallel translations automatically mined from the Web.
These vary in quality but provide a dynamic, broad sample
of language use.