University of Maryland Parallel Corpus Project: Bible
This project is no longer active, though I am always happy to
receive feedback or pointers to useful resources.
Investigators
Summary
We are engaged in a project to acquire and annotate texts in order to
create multilingual corpora for linguistic research, particularly
computational linguistics.
Religious texts such as the Bible are widely available, carefully
translated, and appear in a huge variety of languages. We provide
versions of the Bible consistently annotated according to the Corpus Encoding Standard.
Resnik et al. (1999) discuss the project in detail, including a
study on the vocabulary coverage of Biblical text with respect to
dictionary and corpus resources, demonstrating the surprising extent
to which it is relevant for research on everyday language.
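As a rough illustration of what a vocabulary coverage figure can mean, the sketch below computes the fraction of entries in a word list that occur at least once in a text. This is only one plausible, type-level measure with no lemmatization; it is not necessarily the measure used in Resnik et al. (1999). The function names, the sample sentence, and the word list are all hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def coverage(corpus_text, word_list):
    """Fraction of entries in word_list that occur at least once in corpus_text.

    Assumption: a simple type-level measure, chosen for illustration only.
    """
    vocabulary = set(tokens(corpus_text))
    entries = {word.lower() for word in word_list}
    if not entries:
        return 0.0
    return len(entries & vocabulary) / len(entries)


# Hypothetical sample data, not taken from the project files.
sample_text = "In the beginning God created the heaven and the earth."
sample_words = ["earth", "heaven", "computer", "beginning"]
print(coverage(sample_text, sample_words))
```

Here three of the four listed words appear in the sample sentence, so the call reports a coverage of 0.75.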
Publications
- Philip Resnik, Mari Broman Olsen, and Mona Diab, "The Bible as
a Parallel Corpus: Annotating the 'Book of 2000 Tongues'",
Computers and the Humanities, 33(1-2), pp. 129-153, 1999.
- Tapas Kanungo and Philip Resnik, "The Bible, Truth, and Multilingual
OCR Evaluation", In Proc. of SPIE Conference on Document
Recognition and Retrieval (VI) (to appear), San Jose, CA, 27-28
January, 1999.
- Philip Resnik, Mari Broman Olsen, and Mona Diab, "Creating a
Parallel Corpus from the Book of 2000 Tongues", Text Encoding
Initiative 10th Anniversary User Conference (TEI-10), Providence,
November 1997.
Contact
Philip Resnik
(resnik@umiacs.umd.edu)
Available Versions of the Bible
Biblical text is available for the following languages, annotated in
conformance with the Corpus Encoding Standard (CES). Note that as of
this date, the <seg> tag has not yet been added to the official CES
DTD; we are told by the CES coordinator that this will happen by
August 1999. See header information in each file for pointers to the
source for that version. Encoded versions for other languages will be
added as they become available to us in forms that can be
redistributed without violation of copyright. Please write to us with
any errors you discover and any pointers to on-line biblical text for
other languages that is available for redistribution.
Here is a key to the book codes we used
(e.g. 1KI for "1 Kings").
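Because each version is segmented in the same way, verses in two languages can be paired by a shared identifier. The sketch below shows one way this might be done in Python. The element and attribute layout (nested div elements, seg elements with type="verse", and ids such as b.GEN.1.1 that embed the book code) is an assumption made for illustration, as are the two miniature samples and all function names; check the actual files and their headers before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature samples. Real files carry a full CES header,
# and their markup may differ from what is assumed here.
ENGLISH = """<text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">In the beginning God created the heaven and the earth.</seg>
<seg id="b.GEN.1.2" type="verse">And the earth was without form, and void.</seg>
</div></div></body></text>"""

LATIN = """<text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">In principio creavit Deus caelum et terram.</seg>
</div></div></body></text>"""


def verses(xml_text):
    """Map each verse id to its text for every seg element of type 'verse'."""
    root = ET.fromstring(xml_text)
    return {
        seg.get("id"): "".join(seg.itertext()).strip()
        for seg in root.iter("seg")
        if seg.get("type") == "verse"
    }


def align(source, target):
    """Pair verses that share an id, in source document order."""
    return [(vid, text, target[vid]) for vid, text in source.items() if vid in target]


def book_code(verse_id):
    """Extract the book code, assuming ids of the form b.BOOK.chapter.verse."""
    return verse_id.split(".")[1]


for vid, english, latin in align(verses(ENGLISH), verses(LATIN)):
    print(book_code(vid), vid, english, "|", latin)
```

Verses present in only one version (the second English verse in this sample) are simply skipped, which is a reasonable default when versions differ in coverage.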
Some files below may be temporarily unavailable.
- Cebuano
- Chinese
- Danish
- English
WARNING: This may not actually be a modern English version of the Bible,
owing to some confusion about files. We are working to resolve this as quickly as possible.
Meanwhile, use at your own risk, or, better yet, go to http://ebible.org/bible/WEB and scroll down
for downloads to their public domain modern translation. Apologies for any inconvenience.
- Finnish
- French
- Greek
Lefteris Avramidis has kindly pointed out the following characteristics of the Greek version of the Bible corpus:
- It is encoded using ISO-8859-1 (instead of ISO-8859-7, as suggested
by CES), with every Greek character mapped to a Latin one. The
mapping seems to be one-to-one, but some Latin characters look
nothing like their Greek equivalents, which makes the text hard
to read.
- All diacritics (breathings and accents) of ancient Greek have been removed.
- The text is not modern Greek either; he assumes that it is the
so-called "Translation of the Seventy" (the Septuagint), dating from
around 300 BC, which is officially used by the Church in Greece.
- Indonesian
- Latin
- Spanish
- Swahili
- Swedish
Erratum: A user reports that in Proverbs 21:23 the word
besvarar should be bevarar.
- Vietnamese
Pointers to Related Projects
- NWU Bible corpus, North-West University, Potchefstroom, South
Africa. A trilingual parallel corpus consisting of the 1983
version of the Afrikaans Bible, the Dutch Statenvertaling Bible,
and the World English Bible. The corpus is fully aligned at the
sentence and word level, and also contains some part-of-speech and
syntactic annotations.