A Class Library for the Integration of NLP Tools: Definition and Implementation of an Abstract Data Type Collection for the Manipulation of SGML Documents in a Context of Stand-off Linguistic Annotation

In this paper we present a program library conceived and implemented to represent and manipulate the information exchanged when integrating NLP tools. It is currently used to integrate the tools for Basque processing developed in our research group over the last ten years. In our opinion, the library is general enough to be used in similar integration efforts or in the design of new applications built on such tools. It is a class library that provides programmers with the elements they need to manipulate SGML documents in a context of stand-off linguistic annotation, where the linguistic analyses obtained at different phases (morphology, lemmatization, processing of multiword lexical units, surface syntax, and so on) are represented by well-defined typed feature structures. Because the information exchanged among the tools is complex, feature structures (FSs) are used to represent it; they provide a well-formalized basis for exchanging linguistic information among the different text analysis tools. Feature structures are coded in SGML following the TEI's DTD for FSs, and Feature-System Declarations (FSDs) have been thoroughly specified. TEI-P3 conformant feature structures thus constitute the representation schema for the documents that convey information from one linguistic tool to the next in the language processing chain. The tools integrated so far are a lexical database, a tokenizer, a wide-coverage morphosyntactic analyzer, a general-purpose tagger/lemmatizer and a shallow syntactic parser. The type of information contained in the documents exchanged among these tools has been analyzed and characterized using a set of Abstract Data Types.
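As an illustration of the kind of object such a library manipulates, the following is a minimal sketch in Python of a typed feature structure that can be written to, and read back from, TEI-style `<fs>`/`<f>`/`<sym>` markup. It is not the library described here: the class name `FeatureStructure`, the feature names `lemma` and `pos`, the type names, and the token identifier `w1` are all assumptions made for the example, and the markup is written as well-formed XML (rather than SGML) so that a standard parser can read it.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class FeatureStructure:
    """A typed feature structure: a type name plus named feature values.

    Hypothetical class for illustration; a value is either an atomic
    symbol (stored as a string) or a nested FeatureStructure.
    """
    fs_type: str
    features: dict = field(default_factory=dict)

    def to_element(self) -> ET.Element:
        # Serialize as <fs type="..."><f name="..."><sym value="..."/></f></fs>
        elem = ET.Element("fs", {"type": self.fs_type})
        for name, value in self.features.items():
            f = ET.SubElement(elem, "f", {"name": name})
            if isinstance(value, FeatureStructure):
                f.append(value.to_element())
            else:
                ET.SubElement(f, "sym", {"value": str(value)})
        return elem

    @classmethod
    def from_element(cls, elem: ET.Element) -> "FeatureStructure":
        # Rebuild the structure from an <fs> element, recursing into nested <fs>
        fs = cls(elem.get("type"))
        for f in elem.findall("f"):
            child = f[0]
            if child.tag == "fs":
                fs.features[f.get("name")] = cls.from_element(child)
            else:
                fs.features[f.get("name")] = child.get("value")
        return fs


# Stand-off annotation: analyses are kept apart from the text and refer to
# tokens by identifier (the id "w1" and the analysis itself are made up).
analysis = FeatureStructure(
    "morph",
    {"lemma": "etxe", "pos": FeatureStructure("category", {"major": "noun"})},
)
annotations = {"w1": analysis}

xml_text = ET.tostring(annotations["w1"].to_element(), encoding="unicode")
restored = FeatureStructure.from_element(ET.fromstring(xml_text))
```

In this sketch the serialized form is what one tool would hand to the next, while `from_element` plays the role of the reading side of the chain; a real implementation would additionally validate each structure against its Feature-System Declaration.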
Published in 2002