An Annotated German-Language Medical Text Corpus as Language Resource

We describe the structure of a German-language corpus that covers a variety of medical text genres. Clinical documents (discharge summaries and pathology, histology and surgery reports) are distinguished from non-clinical ones (textbook articles and consumer health care documents from a Web portal). After introducing a medical extension of the general-language STTS tagset, which accounts for features specific to the medical sublanguage encountered in these documents, we discuss some of the quantitative properties of the annotations (e.g., distribution patterns of part-of-speech tags).
Published in 2004
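
As an illustration of what a distribution pattern of part-of-speech tags looks like in practice, the following minimal sketch computes relative tag frequencies per genre group. It is not the procedure used for the corpus: the sentences are invented toy data, the genre keys and the function name `tag_distribution` are hypothetical, and only standard STTS tags are used because the medical extension tags are not listed here.

```python
from collections import Counter

# Toy (token, tag) pairs with standard STTS tags. These sentences are invented
# for illustration and are NOT taken from the corpus described above.
tagged_docs = {
    "clinical": [
        ("Der", "ART"), ("Patient", "NN"), ("wurde", "VAFIN"),
        ("entlassen", "VVPP"), (".", "$."),
    ],
    "non_clinical": [
        ("Die", "ART"), ("Leber", "NN"), ("ist", "VAFIN"),
        ("ein", "ART"), ("Organ", "NN"), (".", "$."),
    ],
}


def tag_distribution(tagged_tokens):
    """Return the relative frequency of each tag in a list of (token, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# One distribution per genre group, so the groups can be compared side by side.
distributions = {
    genre: tag_distribution(tokens) for genre, tokens in tagged_docs.items()
}

for genre, dist in distributions.items():
    print(genre)
    # Most frequent tags first; ties are broken alphabetically by tag name.
    for tag, share in sorted(dist.items(), key=lambda kv: (-kv[1], kv[0])):
        print(f"  {tag:6s} {share:.2f}")
```

Comparing relative rather than absolute frequencies keeps the genre groups comparable even when they differ in size.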