ERIC Number: ED413744
Record Type: Non-Journal
Publication Date: 1995
Pages: 15
Abstractor: N/A
Reference Count: N/A
Encoding Standards for Linguistic Corpora.
Ide, Nancy
The demand for extensive reusability of large language text collections for natural language processing research requires the development of standardized encoding formats. Such formats must be application-independent and capable of representing different kinds of information across the spectrum of text types and languages, as well as different levels of information, both descriptive and analytical. In 1988, the Text Encoding Initiative (TEI) was established as an international cooperative research project to develop a general and flexible set of guidelines for the preparation and interchange of electronic texts. In 1994, TEI issued its standardized encoding conventions for both written and spoken text of any date and in any genre or text type. The guidelines conform to international encoding standards and are based on the assumption that there is a common core of textual features, beyond which many different elements can be encoded. Eight distinct base tagsets are proposed: prose; verse; drama; transcribed speech; letters or memos; dictionary entries; terminological entries; and language corpora and collections. Additional tagsets will be developed. Each base tagset determines the basic structure of all the documents with which it is to be used, defining the components of text elements and features. Sources for the guidelines are included. Contains 10 references. (MSE)
Publication Type: Reports - Descriptive; Speeches/Meeting Papers
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A