Unlocking the Corpus
Enriching Metadata with State-of-the-Art NLP Methodology and Linked Data
DOI:
https://doi.org/10.3384/ecp216.11
Keywords:
German Reference Corpus, DeReKo, Keyword Extraction, Named Entity Recognition, NER, Topic Modeling, Knowledge Base, Semantic Metadata Enrichment
Abstract
In research data management, metadata are indispensable for describing data and are a key element in preparing data according to the FAIR principles. Metadata in catalogues and registries are usually recorded by archivists or by subject matter experts, i.e. researchers involved in creating or assembling the data, or they are provided in the data preparation workflow. Extracting metadata from textual research data is currently not part of most metadata workflows, even less so if a research data set can be subdivided into smaller parts, such as a newspaper corpus containing multiple newspaper articles. For a large newspaper corpus, the basic descriptive metadata may consist of information such as the title or year of publication. Our approach is to add semantic metadata on the text level to facilitate search over the data. We show how to enrich metadata with three methods: named entity recognition, keyword extraction, and topic modeling. The goal is to make it possible to search for texts that are about certain topics or are described by certain keywords, or to identify people, places, and organisations mentioned in texts without actually having to read them.
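
To make the idea of text-level semantic metadata concrete, the following is a minimal sketch of one of the three enrichment steps, keyword extraction. The abstract does not state which extraction method is used, so plain TF-IDF scoring stands in here as an assumption. The function names (`tokenize`, `top_keywords`), the article ids, and the sample texts are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def top_keywords(articles, k=3):
    """Return the k highest-scoring TF-IDF terms for each article id.

    Assumption: TF-IDF is a stand-in for whatever keyword extraction
    method the paper actually uses.
    """
    tokens = {aid: tokenize(text) for aid, text in articles.items()}
    n_docs = len(tokens)

    # Document frequency: in how many articles does each term occur?
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))

    result = {}
    for aid, words in tokens.items():
        term_freq = Counter(words)
        scores = {
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in term_freq.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[aid] = ranked[:k]
    return result


# Hypothetical two-article corpus.
articles = {
    "a1": "The election results surprised the parliament. The election was close.",
    "a2": "The football match ended in a draw. The football fans were loud.",
}

# Attach the extracted keywords as a per-article metadata field.
metadata = {
    aid: {"keywords": kws}
    for aid, kws in top_keywords(articles, k=1).items()
}
```

In this sketch each article receives its own `keywords` field, which a catalogue could index alongside basic fields such as title and year, so that a search for a keyword retrieves individual articles rather than the corpus as a whole.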