Abstract

This project aims to build an efficient, scalable, portable, and trainable part-of-speech tagger. Using 98% of Penn Treebank-3 as training data, it first builds a raw tagger based on Bayes' theorem, a hidden Markov model, and the Viterbi algorithm. A reinforcement machine learning algorithm and contextual transformation rules are then applied to increase the tagger's accuracy. The tagger's final accuracy on the testing data is 96.51%, and its speed is about 251,000 words per second on a computer with two gigabytes of random access memory and two 3.00 GHz Pentium duo processors. The tagger's portability and trainability are demonstrated by the tagger-maker's success in building a new tagger from a corpus annotated with a tagset different from that of Penn Treebank.
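The raw-tagger stage described above can be illustrated with a minimal sketch of a bigram hidden Markov model decoded with the Viterbi algorithm. This is not the project's implementation: the function names (`train`, `smoothed_log_prob`, `viterbi`), the add-one smoothing, the `<s>` start marker, and the three-sentence toy corpus are all assumptions made for illustration, and the reinforcement learning and transformation-rule stages are not shown.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical sentence-start marker (an assumption)

def train(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab

def smoothed_log_prob(counts, given, outcome, n_outcomes):
    """log P(outcome | given) with add-one smoothing (an assumption)."""
    row = counts.get(given, {})
    return math.log((row.get(outcome, 0) + 1) / (sum(row.values()) + n_outcomes))

def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words

    # best[i][t] = (log score of the best path ending in tag t at i, previous tag)
    best = [{}]
    for t in tags:
        score = (smoothed_log_prob(trans, START, t, n_tags)
                 + smoothed_log_prob(emit, t, words[0], n_words))
        best[0][t] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + smoothed_log_prob(trans, p, t, n_tags), p)
                for p in tags
            )
            best[i][t] = (score + smoothed_log_prob(emit, t, words[i], n_words), prev)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy corpus using Penn Treebank-style tags (illustrative data only).
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
trans, emit, vocab = train(corpus)
print(viterbi(["the", "cat", "barks"], trans, emit, vocab))
```

Working in log space avoids numeric underflow on long sentences, and the dynamic program costs time proportional to the sentence length times the square of the tagset size, which is one reason HMM taggers can reach high throughput.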
