Language Resource Creation and Distribution at the Linguistic Data Consortium: a Progress Report

Changes in the supply of and demand for language resources continue to affect the role of large data centers such as the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA) within the research communities they serve. The past few years have seen increased demand for intensively multi-modal resources; larger data sets in high-density languages and new data in low-density languages; and standards and tools for corpus development and reusable resources. The next few years will bring demand for extensive batteries of coordinated language resources with sophisticated annotation in several major languages. The DARPA program in Translingual Information Detection, Extraction and Summarization (TIDES) has already undertaken such resource development; programs with similarly broad scope addressing other technologies will surely follow. Data centers will be well placed to address these needs if they integrate new resource development with the distribution of existing resources, filling known gaps by creating or assisting in the creation of new data. LDC has ongoing projects that address all of these issues. This paper provides an overview of LDC activity in corpus creation, annotation and distribution, and describes new efforts to bring together communities of researchers, identify best practices and develop tools of general use.
Published in 2002