ERIC Number: ED111204
Record Type: Non-Journal
Publication Date: 1975
Reference Count: N/A
Problems of Assembling, Describing, and Computerizing Corpora. Research Techniques and Prospects. Papers in Southwest English, No. 1.
Francis, W. N.
The paper investigates the problems of assembling, describing, and computerizing corpora, defined as collections of "texts assumed to be representative of a given language, dialect or other subset of a language, to be used for linguistic analysis." Specific reference is made to the formation of the Brown Standard Corpus. Forming a corpus is justified on two grounds: it saves effort, and it provides a compilation of data that can serve as a research tool in comparative studies. Important questions in the process concern the body of language from which the sample will be drawn, the size of the sample, and its structure. These, in turn, depend on the purpose for which the corpus is assembled: graphic analysis, for example, requires a different corpus than phonological or grammatical analysis does, with grammatical analysis presenting the most problems. Practical constraints on the size of the corpus, including time, energy, and money, are mentioned. The organization of the corpus is discussed, with emphasis on such factors as the size of the base units, the mode of selection and collection, the assembly of the corpus, and computerization. The question of how much additional explanatory material should accompany the corpus is raised, with particular reference to lexical and semantic analyses. (CLK)
Descriptors: Comparative Analysis, Computational Linguistics, Contrastive Linguistics, Data Collection, Descriptive Linguistics, Language Research, Linguistic Competence, Linguistic Performance, Research Tools, Semantics, Word Frequency, Word Lists
Trinity University, San Antonio, Texas 78222 ($2.00)
Publication Type: Reports - Research
Education Level: N/A
Authoring Institution: Trinity Univ., San Antonio, TX.