Unlocking the Corpus: Enriching Metadata with State-of-the-Art NLP Methodology and Linked Data

Jennifer Ecker; Stefan Fischer; Pia Schwarz; Thorsten Trippel; Antonina Werthmann; Rebecca Wilm

doi:10.3384/ecp216.11

Authors

Jennifer Ecker, Leibniz Institute for the German Language (IDS), Mannheim, Germany
Stefan Fischer, Saarland University, Saarbrücken, Germany
Pia Schwarz, Leibniz Institute for the German Language (IDS), Mannheim, Germany
Thorsten Trippel, Leibniz Institute for the German Language (IDS), Mannheim, Germany
Antonina Werthmann, Leibniz Institute for the German Language (IDS), Mannheim, Germany
Rebecca Wilm, Leibniz Institute for the German Language (IDS), Mannheim, Germany

Keywords:

German Reference Corpus, DeReKo, Keyword Extraction, Named Entity Recognition, NER, Topic Modeling, Knowledge Base, Semantic Metadata Enrichment

Abstract

In research data management, metadata are indispensable for describing data and are a key element in preparing data according to the FAIR principles. Metadata in catalogues and registries are usually recorded by archivists or by subject matter experts, i.e. researchers involved in creating or assembling the data, or they are provided as part of the data preparation workflow. Extracting metadata from textual research data is currently not part of most metadata workflows, especially when a research data set can be subdivided into smaller parts, such as a newspaper corpus containing multiple newspaper articles. For a large newspaper corpus, the basic descriptive metadata may consist of information such as the title or the year of publication. Our approach is to add semantic metadata at the text level to facilitate searching the data. We show how to enrich metadata with three methods: named entity recognition, keyword extraction, and topic modeling. The goal is to make it possible to search for texts that are about certain topics or are described by certain keywords, or to identify people, places, and organisations mentioned in texts, without actually having to read them.
