

Abstract

This dissertation provides a detailed description of the construction and analysis of the University of Georgia Tobacco Documents Corpus, a representative corpus of tobacco-industry documents designed to serve as a norm of written tobacco-industry discourse for the University of Georgia Tobacco-Documents Project (2001–2004). The Tobacco Documents Corpus was constructed as part of the National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services (NIH-NCI) grant 1 RO1 CA87490-01, Linguistic Analyses of Tobacco Industry Documents. This description is provided primarily as a means of demonstrating the viability of the given premise, that it is possible to manage and describe large document sets, apart from extensive review of individual texts, by using a combination of Corpus Linguistics, Humanities Computing, and Statistics methods. Secondarily, it provides the specifics of the project necessary to 1) properly implement the resultant corpus as a norm for comparison studies and interpret related data, and 2) use the Tobacco Documents Corpus as a model for similar projects. In particular, this work presents the underlying theory, implementation, and results of each step in the process of corpus creation and description, from the initial sampling and conversion of documents, through the statistical description and analysis of the resultant corpus, and ultimately (although in a limited form) to the distribution of the corpus and associated analyses via Compact Disc and the Internet (http://www.tobaccodocs.uga.edu/TDC). Subtopics addressed include category theory (categorization and classification), statistical sampling, text markup using Extensible Markup Language (XML), text extraction using Extensible Stylesheet Language (XSL) and XSL transformations (XSLT), tokenizing, parsing, count methods, and proportions analysis.
To a limited extent, this work addresses scripting using the Python programming language as a tool for corpus construction and analysis, and the Internet as a means for displaying corpus data and analyses. Based on the overall success of the Tobacco Documents Corpus, it is believed that this process description will be a contribution to the developing field of Corpus Linguistics, particularly in the area of large-scale document analysis and text-mining.
