Saffron extracts knowledge from text using Natural Language Processing techniques. It is a highly configurable open source tool made up of self-contained, independent modules, each providing a distinct text analysis capability. It can be used through either the command line or the Web interface. The fully automatic system extracts terms and builds a taxonomy of them, then displays the results in the Web interface as a graph, along with the list of extracted terms, which can be explored further.

How does it work?

Saffron builds on a multi-step architecture. Its algorithms cover term extraction (also referred to as ‘topic extraction’), author connections, author expertise extraction, and taxonomy generation, and each step is highly configurable in the advanced mode.

Saffron takes a text collection as its input dataset. From there it detects and extracts the domain-specific concepts (terms) that are prominent in the dataset, ranks authors according to these terms, and uses statistical algorithms to link terms and organize them hierarchically, creating a knowledge taxonomy.

For each phase of the analysis, Saffron implements a range of algorithms to choose from, each following a different approach (all described in the Documentation), and a range of parameters to tune to your needs (terms that are more or less specific, use of external resources, etc.).

NLP analysis steps

Term Extraction

A term is a single- or multi-word expression (phrase), often representing a domain-specific concept. Term extraction is a subtask of information extraction, the objective of which is to automatically identify terms relevant to a given collection of texts. 

A term can be more or less specific. Saffron provides default settings for extracting terms, but also lets users choose the degree of specificity they need by configuring the linguistic patterns that define a term, or the allowed length of a term, on the principle that the longer a term is, the more specific it is likely to be (e.g. “heart” vs. “target heart rate”).
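The effect of the term-length setting can be illustrated with a minimal sketch. This is not Saffron's implementation: the function `candidate_terms`, its `min_len` and `max_len` parameters, and the small stop-word list are all hypothetical, and a plain word n-gram count stands in for the linguistic patterns mentioned above.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption for this sketch); a real system
# would rely on linguistic patterns rather than a fixed word list.
STOP_WORDS = {"a", "an", "the", "of", "is", "and", "to", "in", "on", "your", "for"}


def candidate_terms(texts, min_len=1, max_len=3):
    """Count word n-grams of min_len..max_len words.

    N-grams that start or end with a stop word are discarded.
    """
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for n in range(min_len, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts


docs = [
    "Your target heart rate depends on your age.",
    "Measure the heart rate after exercise.",
]

# Short terms only: generic candidates such as "heart" and "rate".
general = candidate_terms(docs, min_len=1, max_len=1)
# Three-word terms only: more specific candidates such as "target heart rate".
specific = candidate_terms(docs, min_len=3, max_len=3)
```

Raising the minimum length narrows the candidates to longer, more specific phrases, at the cost of fewer occurrences per candidate.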

Taxonomy Extraction

A taxonomy of terms is a structure in which terms are linked together and organized as a hierarchy, i.e. a tree, where the root is the most generic term of the domain and the leaves are the most specific ones. Each node is linked to its parent by a parent-child relation (also called hypernym-hyponym). The whole taxonomy, which is made up of the terms extracted from the given dataset, therefore gives a representation of the knowledge domain.
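The tree structure described above can be sketched as a child-to-parent mapping. The example below is an illustration only: it swaps in a simple word-suffix heuristic (a term's parent is the longest shorter term it ends with) in place of Saffron's statistical algorithms, and the names `build_taxonomy` and `path_to_root`, as well as the sample terms, are hypothetical.

```python
def build_taxonomy(terms, root):
    """Return a child -> parent mapping that forms a tree rooted at `root`.

    Assumed heuristic for this sketch: the parent of a term is the longest
    other term that matches its trailing words; otherwise the root.
    """
    parent = {}
    for term in terms:
        if term == root:
            continue
        words = term.split()
        # Try progressively shorter suffixes:
        # "target heart rate" -> "heart rate" -> "rate".
        for start in range(1, len(words)):
            suffix = " ".join(words[start:])
            if suffix in terms:
                parent[term] = suffix
                break
        else:
            # No more generic term found; attach directly to the root.
            parent[term] = root
    return parent


def path_to_root(term, parent):
    """Follow parent links from a term up to the root of the taxonomy."""
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


terms = {
    "health", "rate", "heart rate",
    "target heart rate", "resting heart rate", "exercise",
}
taxonomy = build_taxonomy(terms, root="health")
lineage = path_to_root("target heart rate", taxonomy)
```

Here `lineage` runs from the most specific term up to the root, which mirrors the leaf-to-root reading of the hierarchy.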

Expert Finding and Community Identification 

In Saffron, information about the author of each text in the dataset can be included as metadata. The terms found in each text are then linked to its author, which makes it possible not only to identify the subjects of expertise of a person (referred to as an expert), but also to link experts together through common subjects of expertise. This enables tasks such as expert finding and community identification.
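The linking of terms to authors, and of authors to each other, can be sketched as follows. This is a simplified illustration under stated assumptions, not Saffron's method: the document records, the author names, and the functions `expertise_by_author`, `find_experts`, and `author_links` are all hypothetical, and shared terms are compared with plain set intersection rather than any ranking.

```python
from collections import defaultdict
from itertools import combinations


def expertise_by_author(documents):
    """Map each author to the set of terms found in their texts."""
    expertise = defaultdict(set)
    for doc in documents:
        expertise[doc["author"]].update(doc["terms"])
    return dict(expertise)


def find_experts(expertise, term):
    """Expert finding: list the authors linked to a given term."""
    return sorted(a for a, terms in expertise.items() if term in terms)


def author_links(expertise, min_shared=1):
    """Link pairs of authors who share at least `min_shared` terms."""
    links = {}
    for a, b in combinations(sorted(expertise), 2):
        shared = expertise[a] & expertise[b]
        if len(shared) >= min_shared:
            links[(a, b)] = shared
    return links


# Hypothetical input: each text carries its author as metadata,
# together with the terms extracted from it.
documents = [
    {"author": "Ada", "terms": {"heart rate", "exercise"}},
    {"author": "Ben", "terms": {"heart rate", "blood pressure"}},
    {"author": "Ada", "terms": {"blood pressure"}},
    {"author": "Cem", "terms": {"term extraction"}},
]

expertise = expertise_by_author(documents)
experts = find_experts(expertise, "heart rate")
links = author_links(expertise)
```

Authors connected through `links` form the starting point for community identification; an author with no shared terms, such as the third one here, stays unlinked.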