Unlocking the Corpus

Enriching Metadata with State-of-the-Art NLP Methodology and Linked Data

Authors

  • Jennifer Ecker, Leibniz Institute for the German Language (IDS), Mannheim, Germany
  • Stefan Fischer, Saarland University, Saarbrücken, Germany
  • Pia Schwarz, Leibniz Institute for the German Language (IDS), Mannheim, Germany
  • Thorsten Trippel, Leibniz Institute for the German Language (IDS), Mannheim, Germany
  • Antonina Werthmann, Leibniz Institute for the German Language (IDS), Mannheim, Germany
  • Rebecca Wilm, Leibniz Institute for the German Language (IDS), Mannheim, Germany

DOI:

https://doi.org/10.3384/ecp216.11

Keywords:

German Reference Corpus, DeReKo, Keyword Extraction, Named Entity Recognition, NER, Topic Modeling, Knowledge Base, Semantic Metadata Enrichment

Abstract

In research data management, metadata are indispensable for describing data and are a key element in preparing data according to the FAIR principles. Metadata in catalogues and registries are usually recorded by archivists or by subject matter experts, i.e. researchers involved in creating or assembling the data, or they are provided as part of the data preparation workflow. Extracting metadata from textual research data is currently not part of most metadata workflows, particularly when a research data set can be subdivided into smaller parts, such as a newspaper corpus containing multiple newspaper articles. For a large newspaper corpus, the basic descriptive metadata may consist of information such as the title or the year of publication. Our approach is to add semantic metadata at the text level to facilitate searching the data. We show how to enrich metadata with three methods: named entity recognition, keyword extraction, and topic modeling. The goal is to make it possible to search for texts that are about certain topics or are described by certain keywords, or to identify people, places, and organisations mentioned in texts, without actually having to read them.
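
To make the idea of text-level semantic metadata more concrete, the following is a minimal sketch of one of the three methods, keyword extraction, applied per article. It is not the authors' pipeline: the TF-IDF scoring, the function names `tokenize`, `extract_keywords`, and `enrich_metadata`, and the record layout with a `keywords` field are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed scheme)."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def extract_keywords(articles, top_k=3):
    """Rank the tokens of each article by TF-IDF and keep the top_k.

    `articles` maps a hypothetical article ID to its plain text.
    """
    tokens = {aid: tokenize(text) for aid, text in articles.items()}

    # Document frequency: in how many articles does each token occur?
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))

    n_docs = len(articles)
    keywords = {}
    for aid, toks in tokens.items():
        term_freq = Counter(toks)
        scores = {
            tok: (count / len(toks)) * math.log(n_docs / doc_freq[tok])
            for tok, count in term_freq.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        keywords[aid] = ranked[:top_k]
    return keywords


def enrich_metadata(records, articles, top_k=3):
    """Return copies of the metadata records with an added 'keywords' field."""
    keywords = extract_keywords(articles, top_k=top_k)
    return {
        aid: {**record, "keywords": keywords.get(aid, [])}
        for aid, record in records.items()
    }
```

With records such as `{"a1": {"title": "...", "year": 2020}}`, `enrich_metadata` leaves the basic descriptive fields untouched and adds the extracted keywords alongside them, so a search can match on article content without opening the text. Named entities or topic labels could be attached as further fields in the same way.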

Published

2025-08-25