Enhancing Speech Corpus Resources with Multiple Lexical Tag Layers

We describe a general two-stage procedure for re-using a custom corpus in spoken language system development. The first stage transforms character-based markup into XML; the second uses DSSSL stylesheets to enhance the XML markup with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised "tagging on demand" filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types).
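
As an illustration only, the two stages might be sketched as follows. This is not the authors' implementation: Python stands in for the DSSSL stylesheets, and the input format (`SPEAKER: word word ...`), the element names (`dialogue`, `turn`, `w`), the toy `LEXICON`, and the function names `to_xml` and `tag_on_demand` are all hypothetical. Lexical layers are attached as attributes rather than as tag trees, to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon: lower-cased word form -> {layer name: value}.
LEXICON = {
    "guten": {"pos": "ADJ", "lemma": "gut"},
    "morgen": {"pos": "N", "lemma": "Morgen"},
}


def to_xml(lines):
    """Stage 1: turn 'SPEAKER: word word ...' lines (assumed format) into XML."""
    root = ET.Element("dialogue")
    for line in lines:
        speaker, _, text = line.partition(":")
        turn = ET.SubElement(root, "turn", speaker=speaker.strip())
        for token in text.split():
            ET.SubElement(turn, "w").text = token
    return root


def tag_on_demand(root, layers):
    """Stage 2: attach only the requested lexical layers to each <w> element."""
    for w in root.iter("w"):
        entry = LEXICON.get(w.text.lower(), {})
        for layer in layers:
            if layer in entry:
                w.set(layer, entry[layer])
    return root


corpus = to_xml(["A: Guten Morgen"])
tagged = tag_on_demand(corpus, ["pos"])  # request the part-of-speech layer only
print(ET.tostring(tagged, encoding="unicode"))
```

Passing the layer list as a parameter is what makes the second stage a filter: a fully tagged corpus corresponds to requesting every layer once, while a consumer that needs a single layer pays only for that lookup.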
Published in 2000