Building an efficient, scalable, and trainable probability-and-rule-based part-of-speech tagger of high accuracy

Han, Jiayun

2009

Abstract

This project aims to build an efficient, scalable, portable, and trainable part-of-speech tagger. Using 98% of Penn Treebank-3 as training data, it first builds a raw tagger based on Bayes' theorem, a hidden Markov model, and the Viterbi algorithm. A reinforcement machine learning algorithm and contextual transformation rules are then applied to increase the tagger's accuracy. The tagger's final accuracy on the testing data is 96.51%, and its speed is about 251,000 words per second on a computer with two gigabytes of random access memory and two 3.00 GHz Pentium duo processors. The tagger's portability and trainability are demonstrated by the tagger-maker's success in building a new tagger from a corpus annotated with a tagset different from that of Penn Treebank.
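
As a rough illustration of the raw-tagger stage described in the abstract, the following Python sketch trains a bigram hidden Markov model from tagged sentences and decodes with the Viterbi algorithm. It is not the thesis's implementation: the toy corpus, the function names (`train`, `log_prob`, `viterbi`), the `<s>` start symbol, and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; the thesis trains on Penn Treebank-3 instead.
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

def train(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sent in sentences:
        prev = "<s>"  # assumed sentence-start symbol
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumed smoothing choice)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit, tags, vocab):
    """Return the most probable tag sequence for words under the bigram HMM."""
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words
    # best[i][t]: log probability of the best path ending in tag t at position i
    best = [{t: log_prob(trans["<s>"], t, n_tags)
                + log_prob(emit[t], words[0], n_words) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + log_prob(trans[p], t, n_tags)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + log_prob(emit[t], words[i], n_words)
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

trans, emit, tags, vocab = train(TRAIN)
print(viterbi(["the", "dog", "sleeps"], trans, emit, tags, vocab))
```

Working in log space avoids numeric underflow on long sentences, and the dynamic program visits each position once per pair of tags, so decoding time grows linearly with sentence length.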

Details

Record ID

8235

Record Created

2024-12-05

Title

Building an efficient, scalable, and trainable probability-and-rule-based part-of-speech tagger of high accuracy

Author

Han, Jiayun

Contributor

Covington, Michael (Advisor)
Schwanenflugel, Paula (Committee Member)
Williams, Alexander (Committee Member)

College or School

Franklin College of Arts and Sciences

Department

Institute for Artificial Intelligence

Date

2009

Publisher

University of Georgia

Content Type

Thesis

Language

English

Dissertation/Thesis Note

Graduate

Degree Type

Master of Science (MS)

Name of Granting Institution

University of Georgia, Spring 2009

Year Degree Granted

2009

Keywords

Part-of-Speech; Tagging; Markov Model; The Viterbi Algorithm; The Bayes' Theorem; Machine Learning; Contextual Rules; Natural Language Processing

Record Appears in

College, School, or Unit > Franklin College of Arts and Sciences
Electronic Theses and Dissertations > Graduate Thesis
All Resources

System Control Number

9949334848602959
