Construction and analysis of the University of Georgia Tobacco Documents Corpus

Darwin, Clayton Martin

2008

Abstract

This dissertation provides a detailed description of the construction and analysis of the University of Georgia Tobacco Documents Corpus, a representative corpus of tobacco-industry documents designed to serve as a norm of written tobacco-industry discourse for the University of Georgia Tobacco-Documents Project (2001–2004). The Tobacco Documents Corpus was constructed as part of the National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services (NIH-NCI) grant 1 RO1 CA87490-01, Linguistic Analyses of Tobacco Industry Documents. This description is provided primarily as a means of demonstrating the viability of the given premise, that it is possible to manage and describe large document sets, apart from extensive review of individual texts, by using a combination of Corpus Linguistics, Humanities Computing, and Statistics methods. Secondarily, it provides the specifics of the project necessary to 1) properly implement the resultant corpus as a norm for comparison studies and interpret related data, and 2) use the Tobacco Documents Corpus as a model for similar projects. In particular, this work presents the underlying theory, implementation, and results of each step in the process of corpus creation and description, from the initial sampling and conversion of documents, through the statistical description and analysis of the resultant corpus, and ultimately (although in a limited form) to the distribution of the corpus and associated analyses via Compact Disc and the Internet (http://www.tobaccodocs.uga.edu/TDC). Subtopics addressed include category theory (categorization and classification), statistical sampling, text markup using Extensible Markup Language (XML), text extraction using Extensible Stylesheet Language (XSL) and XSL transformations (XSLT), tokenizing, parsing, count methods, and proportions analysis. To a limited extent, this work addresses scripting using the Python programming language as a tool for corpus construction and analysis, and the Internet as a means for displaying corpus data and analyses. Based on the overall success of the Tobacco Documents Corpus, it is believed that this process description will be a contribution to the developing field of Corpus Linguistics, particularly in the area of large-scale document analysis and text-mining.

Details

Record ID

11367

Record Created

2024-12-05

Title

Construction and analysis of the University of Georgia Tobacco Documents Corpus

Author

Darwin, Clayton Martin

Contributor

Kretzschmar, William A. Advisor
Baptista, Marlyse Committee Member
Covington, Michael A. Committee Member
Rubin, Donald L. Committee Member

College or School

Franklin College of Arts and Sciences

Department

Linguistics

Date

2008

Publisher

University of Georgia

Content Type

Dissertation

Language

English

Dissertation/Thesis Note

Doctoral

Degree Type

Doctor of Philosophy (PHD)

Name of Granting Institution

University of Georgia, Spring 2008

Year Degree Granted

2008

Keywords

Corpus Linguistics; Humanities Computing; Markup schema; Statistical sampling; Text mining; Tobacco documents; Dissertations (academic)

System Control Number

9949334414502959
