Hybrid Citation Extraction from Patents

The Quaero project organized a set of evaluations of named entity recognition systems in 2009. One of the sub-tasks consisted of extracting citations, i.e. references to other documents (either other patents or general literature), from English-language patents. In this paper we present the participation of LIMSI in this evaluation, with a complete system description and the evaluation results. The corpus showed that patent and non-patent citations are very different in nature. We therefore handled references to other patents and references to general literature separately and built a hybrid system. For patent citations, the system uses rule-based expert knowledge in the form of regular expressions. The system for detecting non-patent citations, on the other hand, is purely stochastic (machine learning with CRF++). The outputs of the two approaches are then combined into a single output. Four teams participated in this task and our system obtained the best results of the evaluation campaign, although the difference between the top two systems is only weakly significant.
Published in 2010
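
The hybrid design described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and none of it is the authors' implementation: the pattern `PATENT_RE` is a hypothetical rule covering only a few U.S. patent number forms, the non-patent spans are passed in as if they came from a sequence labeller such as CRF++, and the overlap policy in `merge_spans` (prefer the rule-based span) is a guess, since the abstract does not say how the two outputs are combined.

```python
import re

# Hypothetical pattern, not the authors' rules: matches forms such as
# "U.S. Pat. No. 5,123,456" or "US Patent 4,567,890".
PATENT_RE = re.compile(
    r"\bU\.?S\.?\s+Pat(?:ent|\.)?\s+(?:Nos?\.?\s+)?\d{1,2}(?:,\d{3}){2}"
)


def find_patent_citations(text):
    """Return (start, end, 'PATENT') spans found by the regular expression."""
    return [(m.start(), m.end(), "PATENT") for m in PATENT_RE.finditer(text)]


def merge_spans(patent_spans, non_patent_spans):
    """Combine both outputs into one list sorted by position.

    Assumed policy: when a non-patent span overlaps a patent span,
    the rule-based patent span is kept and the other is dropped.
    """
    merged = list(patent_spans)
    for start, end, label in non_patent_spans:
        overlaps = any(
            start < p_end and p_start < end
            for p_start, p_end, _ in patent_spans
        )
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)
```

In use, `find_patent_citations` would run over the patent text, the stochastic component would supply its own `(start, end, label)` spans, and `merge_spans` would produce the single annotated output.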