Software

On this page, you can find the different software modules developed by the NewsReader project. The easiest setup is provided by the virtual machine package that contains the complete pipelines. For those interested in trying out different parts of the pipelines, all separate modules are listed below as well. Please note that the pipelines take NAF files as input, for which we have made available Java and Python libraries.

With each module, we specify who developed it. The quickest way to get help with a module is to contact that person. If a publication is associated with a module, it will be specified on the module’s page.

 

‘Black box’ setup

For each of the processing pipelines (English, Spanish, Italian and Dutch), we have a downloadable virtual machine package that sets up the pipeline with the default settings as described in Deliverable D4.2.2 Event Detection v2.

  • English Virtual Machine: Instructions for downloading the virtual machine (VM) with all the modules and NLP processors required to run the English event extraction pipeline developed within the NewsReader project are available at this page.
  • Spanish Virtual Machine: Instructions for downloading the virtual machine (VM) with all the modules and NLP processors required to run the Spanish event extraction pipeline developed within the NewsReader project are available at this page.
  • Italian Virtual Machine: Instructions for downloading the virtual machine (VM) with all the modules and NLP processors required to run the Italian event extraction pipeline developed within the NewsReader project are available at this page.
  • VMC from scratch: Instructions for automatically building the distributed NLP processing pipeline can also be downloaded from this page.

Hadoop package for batch processing (by SURFsara)

This package contains all modules for English and the overall setup needed to run the pipeline on a Hadoop cluster.

direct download (±5GB)

 

KnowledgeStore

  • KnowledgeStore: A scalable, fault-tolerant, and Semantic Web grounded storage system to jointly store, manage, retrieve, and semantically query both structured and unstructured data.
  • RDFpro: An extensible tool for building stream-oriented RDF processing pipelines. It grew out of the need for a tool that supports typical Linked Data integration tasks on datasets of up to a few billion triples.

Simple API

  • NewsReader Simple API: an API that wraps a set of parameterised SPARQL queries to access the KnowledgeStore RDF structured content, and calls to the KnowledgeStore CRUD endpoint to retrieve unstructured resources.
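
The idea of a parameterised SPARQL query can be illustrated with a minimal sketch. The query text, the `build_query` helper and the example URI below are hypothetical and written only for illustration; they are not taken from the Simple API itself. The sketch assumes events are typed with the SEM vocabulary (`sem:Event`, `sem:hasActor`).

```python
from string import Template

# Hypothetical query template: the real Simple API queries are not shown on
# this page. The SEM namespace is used as an assumed event vocabulary.
EVENTS_FOR_ACTOR = Template("""\
PREFIX sem: <http://semanticweb.cs.vu.nl/2009/11/sem/>
SELECT ?event WHERE {
  ?event a sem:Event ;
         sem:hasActor <$actor_uri> .
}
LIMIT $limit
""")


def build_query(actor_uri: str, limit: int = 10) -> str:
    """Fill the template after basic validation of both parameters."""
    # Reject characters that are not allowed inside a SPARQL IRI reference,
    # so a parameter cannot break out of the <...> delimiters.
    if any(ch in actor_uri for ch in '<>" {}|\\^`'):
        raise ValueError("actor_uri contains characters not allowed in an IRI")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return EVENTS_FOR_ACTOR.substitute(actor_uri=actor_uri, limit=limit)


query = build_query("http://dbpedia.org/resource/Volkswagen", limit=5)
print(query)
```

The resulting string could then be sent to any SPARQL endpoint; validating the parameters before substitution keeps callers from injecting arbitrary query text.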

Converting NAF to RDF

  • vua-naf2sem: Set of functions within the EventCoreference module that read NAF files and create RDF-TRiG output according to the GAF-SEM model, in which events, entities and relations are represented as unique instances with pointers to their mentions in text. It also creates the RDF-TRiG for the GRASP perspective model. The output can be loaded into the KnowledgeStore.

NAF and KAF parsers

  • KafSaxParser: Java library that reads and writes NAF and has an internal data structure for all NAF layers in memory.
  • pynaf: Python library that reads and writes NAF and has an internal data structure for all NAF layers in memory.
  • kaflib: Java library that reads and writes NAF and has an internal data structure for all NAF layers in memory.
  • KafNafParser: Python module that produces and interprets NAF and KAF and can convert between the two.
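
For orientation, the layered structure that these libraries keep in memory can be sketched with Python's standard library alone. This is not the API of any of the libraries above. The NAF fragment is a toy example written for this sketch, and it assumes the usual layout of a `text` layer with `wf` elements and a `terms` layer whose `term` elements point back to word forms through `span`/`target` references.

```python
import xml.etree.ElementTree as ET

# Toy NAF fragment written for this example, not output of a real pipeline.
NAF_SAMPLE = """<NAF xml:lang="en" version="v3">
  <text>
    <wf id="w1" sent="1" offset="0" length="5">Volvo</wf>
    <wf id="w2" sent="1" offset="6" length="4">buys</wf>
    <wf id="w3" sent="1" offset="11" length="4">Saab</wf>
  </text>
  <terms>
    <term id="t1" lemma="Volvo" pos="R"><span><target id="w1"/></span></term>
    <term id="t2" lemma="buy" pos="V"><span><target id="w2"/></span></term>
  </terms>
</NAF>"""


def read_tokens(root):
    """Map each word-form id in the text layer to its surface string."""
    return {wf.get("id"): wf.text for wf in root.iter("wf")}


def read_terms(root):
    """Return (term id, lemma, surface text) for every term in the term layer."""
    tokens = read_tokens(root)
    result = []
    for term in root.iter("term"):
        # Resolve the span targets back to the word forms they point at.
        words = [tokens[target.get("id")] for target in term.iter("target")]
        result.append((term.get("id"), term.get("lemma"), " ".join(words)))
    return result


root = ET.fromstring(NAF_SAMPLE)
terms = read_terms(root)
for term_id, lemma, surface in terms:
    print(term_id, lemma, surface)
```

The dedicated parsers listed above cover all NAF layers and also write NAF; the sketch only shows how one layer refers to another by identifier.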

Individual pipeline modules and libraries

English and multilingual modules

  • newsparser: tool to extract metadata from large numbers of news articles and store them in a compressed archive.
  • ixa-pipe-tok: A multilingual rule-based tokenizer for English, Spanish and Dutch compliant with Penn Treebank and Ancora Corpus tokenization.
  • ixa-pipe-pos: English/Spanish POS tagging with Perceptron models (Collins 2002) as implemented by Apache OpenNLP using the WSJ and Ancora corpus respectively.
  • ixa-pipe-parse: English/Spanish Constituent Parsing with Maximum Entropy models (Ratnaparkhi 1999) as implemented by Apache OpenNLP using the Penn and Ancora Treebanks respectively.
  • ixa-pipe-nerc: English/Spanish/Dutch Named Entity Recognition with Perceptron models (Collins 2002) as implemented by Apache OpenNLP on CoNLL datasets for NER.
  • ixa-pipe-topic: a module to extract a set of topics based on the Multilingual Eurovoc thesaurus descriptors and the JRC Eurovoc Indexer JEX. It is used for English and Spanish.
  • vua-svm-wsd: This program svm_wsd implements a machine learning Word Sense Disambiguation system based on Support Vector Machines. It is used in the Dutch pipeline.
  • wsd-ukb: This program applies the so-called Personalized PageRank on a Lexical Knowledge Base (LKB) to rank the vertices of the LKB for word sense disambiguation for English and Spanish.
  • MATE-based Parser and SRL: a tool providing lemmatization, POS-tagging, dependencies and semantic roles for English and Spanish based on the MATE-tools (Björkelund et al., 2010).
  • CorefGraph: a Python reimplementation of the coreference resolution tool proposed by the Stanford NLP group (Lee et al., 2013) for English and Spanish.
  • vua-ontotagger: module that adds ontological labels to the WordNet synsets associated with terms, or directly to the lemmas of the terms, based on the external resources provided. It is typically used to assign Predicate Matrix mappings to synsets.
  • vua-factualityr: module that indicates the certainty (certain/probable/possible) of an event, whether the event is confirmed or denied (pos/neg) and whether it is in the future (future/non-future).
  • vua-eventcoreference: set of functions in the EventCoreference module that read NAF files of English, Spanish and Dutch text and determine intra-document event coreference based on lemmas and WordNet synsets.
  • TimePro: English module to recognize temporal expressions (part of the TextPro tool).
  • HeidelTime NAF-wrapper: NAF-wrapper around Strötgen (2013)’s HeidelTime that can be used for recognizing time expressions in Dutch and English. Note that the Heideltime NAF-adaptation is more robust than this wrapper. The NAF-adaptation works for Dutch and Spanish and can easily be adapted to work for English.
  • TempRelPro: English module to recognize temporal relations (part of TextPro tool).
  • CausalRelPro: English module to recognize causal relations (part of the TextPro tool).
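
The Personalized PageRank idea mentioned for wsd-ukb above can be illustrated with a small, self-contained sketch. This is not the UKB implementation: the graph, the sense labels and the `personalized_pagerank` function are hypothetical, and the sketch simply concentrates the teleport probability on the senses of context words and then compares the scores of the candidate senses of the target word.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with teleportation restricted to `seeds`.

    `graph` maps each node to the list of its neighbours; every seed must be
    a node of the graph.
    """
    nodes = sorted(graph)
    # Teleport distribution: uniform over the seed nodes, zero elsewhere.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            neighbours = graph[node]
            if not neighbours:
                # Dangling node: hand its mass back to the seed nodes.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * teleport[n]
                continue
            share = damping * rank[node] / len(neighbours)
            for neighbour in neighbours:
                new_rank[neighbour] += share
        rank = new_rank
    return rank


# Toy lexical knowledge base (hypothetical): undirected relations between senses.
LKB = {
    "bank#finance": ["money#1", "loan#1"],
    "bank#river": ["water#1"],
    "money#1": ["bank#finance", "loan#1"],
    "loan#1": ["bank#finance", "money#1"],
    "water#1": ["bank#river"],
}

# Context of the ambiguous word "bank": the sentence also mentions "money".
scores = personalized_pagerank(LKB, seeds={"money#1"})
best = max(["bank#finance", "bank#river"], key=scores.get)
print(best)
```

In this toy graph the financial sense of "bank" is connected to the context sense and therefore receives the higher score, while the river sense is unreachable from the seed.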

Dutch modules

  • ixa-pipe-tok: A multilingual rule-based tokenizer for English, Spanish and Dutch compliant with Penn Treebank and Ancora Corpus tokenization.
  • Alpino_naf_wrapper: NAF-wrapper around the Alpino parser for Dutch (Bouma et al., 2001), which provides morphological (lemmas and POS tags) and syntactic information (constituents and dependencies).
  • ixa-pipe-nerc: English/Spanish/Dutch Named Entity Recognition with Perceptron models (Collins 2002) as implemented by Apache OpenNLP on CoNLL datasets for NER.
  • vua-svm-wsd: This program svm_wsd implements a machine learning Word Sense Disambiguation system based on Support Vector Machines. It is used in the Dutch pipeline.
  • HeidelTime NAF-wrapper: NAF-wrapper around Strötgen (2013)’s HeidelTime that can be used for recognizing time expressions in Dutch and English. This version uses Treetagger and only needs the NAF token-layer. The ixa-pipe-time wrapper (Heideltime NAF-adaptation) is more robust and recommended instead of this NAF-wrapper.
  • Heideltime NAF-adaptation: An integrated NAF-wrapper around Strötgen (2013)’s HeidelTime that can be used for recognizing time expressions in Spanish and Dutch. The wrapper works on the NAF term layer and can easily be adapted to English.
  • vua-ontotagger: module that adds ontological labels to the WordNet synsets associated with terms, or directly to the lemmas of the terms, based on the external resources provided. It is typically used to assign Predicate Matrix mappings to synsets.
  • SONAR SRL: NAF-compliant reimplementation of De Clercq et al. (2012)’s SoNaR SRL module.
  • vua-framenet-classifier: module that reads NAF files of English, Spanish and Dutch text and applies FrameNet frames and roles to the SRL layer using the PredicateMatrix for the respective language.
  • nominal-event-detection: module that reads NAF files of Dutch text and identifies which nouns refer to an event. This is another class within the vua-ontotagger package, and it assumes that the terms have been typed with class information by the ontotagger.
  • nominal-predicate-srl: a basic module that reads NAF files of Dutch and checks whether nominal events are modified by one or more prepositional phrases. These phrases are labelled as Arg1 or ArgM.
  • vua-eventcoreference: set of functions in the EventCoreference module that read NAF files of English, Spanish and Dutch text and determine intra-document event coreference based on lemmas and WordNet synsets.

Italian modules

  • fbk-tagpro: A POS-tagger for Italian that uses a Conditional Random Field algorithm and assigns a subset of the ELRA tagset.
  • fbk-lemmapro: A lemmatizer for Italian that disambiguates the output of MorphoPro using TagPro.
  • fbk-entitypro: A named entity recognition and classification system for Italian trained on ICAB (Magnini et al. 2006).
  • fbk-chunkpro: Module that provides chunks for Italian, outputting constituents of two categories: B-NP and B-VX.
  • fbk-depparserpro: Dependency parser for Italian based on the Malt Parser (Lavelli et al. 2013).
  • fbk-eventpro: module that identifies events and classifies them according to TimeML using an SVM trained on the EVENTI-EVALITA2014 data.
  • fbk-factpro: module that determines factuality values according to the NewsReader annotation guidelines, trained on the Fact-Ita Bank corpus.
  • fbk-timepro: Italian module to identify and normalize time expressions. It uses an SVM trained on the EVENTI-EVALITA2014 data.
  • fbk-temprelpro: Italian module to extract temporal relations between events and time expressions. It uses an SVM trained on the EVENTI-EVALITA2014 data.
  • fbk-srl: SRL system for Italian based on dependency relations.
  • fbk-timeanchor: module that extracts relations between predicates and time anchors.

Spanish modules

  • ixa-pipe-tok: A multilingual rule-based tokenizer for English, Spanish and Dutch compliant with Penn Treebank and Ancora Corpus tokenization.
  • ixa-pipe-pos: English/Spanish POS tagging with Perceptron models (Collins 2002) as implemented by Apache OpenNLP using the WSJ and Ancora corpus respectively.
  • ixa-pipe-parse: English/Spanish Constituent Parsing with Maximum Entropy models (Ratnaparkhi 1999) as implemented by Apache OpenNLP using the Penn and Ancora Treebanks respectively.
  • ixa-pipe-nerc: English/Spanish/Dutch Named Entity Recognition with Perceptron models (Collins 2002) as implemented by Apache OpenNLP on CoNLL datasets for NER.
  • ixa-pipe-topic: a module to extract a set of topics based on the Multilingual Eurovoc thesaurus descriptors and the JRC Eurovoc Indexer JEX. It is used for English and Spanish.
  • wsd-ukb: This program applies the so-called Personalized PageRank on a Lexical Knowledge Base (LKB) to rank the vertices of the LKB for word sense disambiguation for English and Spanish.
  • MATE-based Parser and SRL: a tool providing lemmatization, POS-tagging, dependencies and semantic roles for English and Spanish based on the MATE-tools (Björkelund et al., 2010).
  • CorefGraph: a Python reimplementation of the coreference resolution tool proposed by the Stanford NLP group (Lee et al., 2013) for English and Spanish.
  • Heideltime NAF-adaptation: An integrated NAF-wrapper around Strötgen (2013)’s HeidelTime that can be used for recognizing time expressions in Spanish and Dutch. The wrapper works on the NAF term layer and can easily be adapted to English.

Evaluation modules

  • Evaluation package: Modules for the evaluation of the event detection pipeline against the MEANTIME corpus.

Additional modules and libraries

  • vua-multiwordtagger: Java module that reads NAF with a term layer and uses a wordnet in WordNet-LMF format to detect multiword phrases. The output is NAF.
  • naf_ukb: script that adds sense information to NAF input, producing new NAF output.
  • vua-wordnettools: Java module that reads any wordnet in WordNet-LMF format and carries out similarity measurements. This tool is used with the EventCoreference module.
  • NAF: NewsReader Annotation Format documentation

All our software modules can be found on GitHub.