University of Maryland Parallel Corpus Project: Bible

This project is no longer active, though I am always happy to receive feedback or pointers to useful resources.

Investigators

Summary

We are engaged in a project to acquire and annotate texts in order to create multilingual corpora for linguistic research, particularly computational linguistics. Religious texts such as the Bible are widely available, are carefully translated, and appear in a huge variety of languages. We provide versions of the Bible consistently annotated according to the Corpus Encoding Standard (CES). Resnik et al. (1999) discuss the project in detail, including a study of the vocabulary coverage of Biblical text with respect to dictionary and corpus resources, which demonstrates the surprising extent to which Biblical text is relevant for research on everyday language.

Publications

Philip Resnik, Mari Broman Olsen, and Mona Diab, "The Bible as a Parallel Corpus: Annotating the 'Book of 2000 Tongues'", Computers and the Humanities, 33(1-2), pp. 129-153, 1999.
Tapas Kanungo and Philip Resnik, "The Bible, Truth, and Multilingual OCR Evaluation", in Proc. of SPIE Conference on Document Recognition and Retrieval (VI) (to appear), San Jose, CA, 27-28 January 1999.
Philip Resnik, Mari Broman Olsen, and Mona Diab, "Creating a Parallel Corpus from the Book of 2000 Tongues", Text Encoding Initiative 10th Anniversary User Conference (TEI-10), Providence, November 1997.

Contact

Philip Resnik (resnik@umiacs.umd.edu)

Available Versions of the Bible

Biblical text is available for the following languages, annotated in conformance with the Corpus Encoding Standard. Note that as of this date, the seg tag has not yet been added to the official CES DTD; we are told by the CES coordinator that this will happen by August 1999. See header information in each file for pointers to the source for that version. Encoded versions for other languages will be added as they become available to us in forms that can be redistributed without violation of copyright. Please write to us with any errors you discover and any pointers to on-line biblical text for other languages that is available for redistribution.
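
Because each version is annotated the same way, verses can in principle be paired across languages by their identifiers. The following is a minimal sketch of that idea, not a description of the project's own tools. It assumes that each verse is wrapped in a seg element carrying an id attribute of the form BOOK.chapter.verse (for example GEN.1.1); the element layout, the attribute names, the GEN book code, and the sample text are illustrative assumptions, so check the header and markup of the actual files before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature documents. The real files are much larger, and
# their exact element and attribute names should be checked in each file.
ENGLISH = """<text><body>
<div type="book" id="GEN">
<seg type="verse" id="GEN.1.1">In the beginning God created the heaven and the earth.</seg>
<seg type="verse" id="GEN.1.3">And God said, Let there be light: and there was light.</seg>
</div>
</body></text>"""

SPANISH = """<text><body>
<div type="book" id="GEN">
<seg type="verse" id="GEN.1.1">En el principio creó Dios los cielos y la tierra.</seg>
</div>
</body></text>"""


def verses_by_id(xml_text):
    """Map each seg id to its whitespace-normalized text."""
    root = ET.fromstring(xml_text)
    verses = {}
    for seg in root.iter("seg"):
        seg_id = seg.get("id")
        if seg_id is not None:
            verses[seg_id] = " ".join("".join(seg.itertext()).split())
    return verses


def align(source_xml, target_xml):
    """Return (id, source_text, target_text) for ids found in both documents."""
    source = verses_by_id(source_xml)
    target = verses_by_id(target_xml)
    return [(sid, text, target[sid]) for sid, text in source.items() if sid in target]


pairs = align(ENGLISH, SPANISH)
for seg_id, english_text, spanish_text in pairs:
    print(seg_id, "|", english_text, "|", spanish_text)
```

Verses present in only one version (GEN.1.3 in this sample) are simply skipped, which is one simple way to handle translations that omit or merge verses.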

Here is a key to the book codes we used (e.g. 1KI for "1 Kings", etc.).

Some files below may be temporarily unavailable.

Cebuano
Chinese
Danish
English
WARNING: This may not actually be a modern English version of the Bible, owing to some confusion about files. We are working to resolve this as quickly as possible. Meanwhile, use at your own risk, or, better yet, go to http://ebible.org/bible/WEB and scroll down for downloads to their public domain modern translation. Apologies for any inconvenience.
Finnish
French
Greek
Lefteris Avramidis has kindly pointed out the following characteristics of the Greek version of the Bible corpus:

It is encoded using ISO-8859-1 (instead of ISO-8859-7, as suggested by CES), with every Greek character mapped to a Latin one. The mapping seems to be one-to-one, but some Latin characters look nothing like their Greek equivalents, which makes the text hard to read.
All accents (breathings and tone marks) of Ancient Greek have been removed.
It is not Modern Greek either; I assume it is the so-called "Translation of the Seventy", from around 300 BC, officially used by the Church in Greece.
Indonesian
Latin
Spanish
Swahili
Swedish
Erratum: A user reports that in Proverbs 21:23 the word besvarar should be bevarar.
Vietnamese

Pointers to Related Projects

NWU Bible corpus, North-West University, Potchefstroom, South Africa. A trilingual parallel corpus consisting of the 1983 version of the Afrikaans Bible, the Dutch Statenvertaling Bible, and the World English Bible. The corpus is fully aligned at the sentence and word level, and also contains some part-of-speech and syntactic annotations.
