Don Blaheta's thesis proposal, ``Function tagging''.


Parsing sentences using statistical information gathered from a treebank was first examined more than a decade ago and is by now a fairly well-studied problem; it has been attacked with a wide variety of tools, including decision trees, probabilistic context-free grammars, maximum entropy, and data-oriented parsing. Nearly all this work, however, has focused entirely on low-level syntactic structure: a bracketing (or dependency labelling) with simple constituent labels like NP, VP, or SBAR. The Penn Treebank contains a great deal of additional syntactic and semantic information from which to gather statistics; reproducing more of this information automatically is a goal which has so far been mostly ignored. I am interested in making it possible to recover some of this information---the function tags---automatically.

In this work I will present a system that utilises a maximum-entropy-inspired algorithmic framework along with a number of commonly used features (label, syntactic head, etc.) to predict function tags with relatively high accuracy. I will then present two other algorithmic frameworks and a number of new features to be used with them. I propose to use these expanded systems to improve performance on the function tagging task and, having done so, to analyse the results to determine which features were most helpful in the task as a whole and in its various subtasks.


Please do not cite this work. This is my thesis proposal and, as such, not a final product; I provide it here for informational purposes only. If there is some portion you wish to cite, please contact me; parts of it have been or will be published separately, and of course all of it will be in the thesis itself, forthcoming sometime in 2003.

Full text

Talk slides

Don Blaheta