A Pipeline for Manual Annotations of Risk Factor Mentions in the COVID-19 Open Research Dataset
Keywords: topic modelling, active learning, annotation tools, pre-annotation, COVID-19
We demonstrate how a set of tools that are maintained and further developed within the Språkbanken Sam and SWE-CLARIN infrastructures can be employed to create manually labelled training data in a low-resource setting. As example text, we used the “COVID-19 Open Research Dataset”, and created manually annotated training data for its associated Kaggle task, “What do we know about COVID-19 risk factors?”. We first used our topic modelling tool to i) select a text set for manual annotation, ii) classify the texts into preliminary classification categories, and iii) analyse the texts in search of potential refinements of the annotation categories. We then annotated the text set at a more granular level by labelling the token sequences that indicated the presence of the refined categories in the text. Finally, we used the granularly annotated text set as a seed set, and applied our active learning tool to select additional texts for annotation. For the token-sequence annotations, we used our text annotation tool, which includes support for incorporating automatic pre-annotations.
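The active selection step can be illustrated with a minimal sketch of uncertainty sampling, a common active learning strategy. The abstract does not state which selection strategy the active learning tool uses, so the strategy is an assumption made for illustration only. The names `select_for_annotation` and `toy_predict_proba`, and the example sentences, are likewise hypothetical and are not taken from the tools described above.

```python
from typing import Callable, List, Sequence


def select_for_annotation(
    unlabelled: Sequence[str],
    predict_proba: Callable[[str], float],
    batch_size: int,
) -> List[str]:
    """Return the texts whose predicted probability is closest to 0.5.

    Assumption: a binary classifier trained on the seed set gives, for each
    text, the probability that it mentions a risk factor. Texts near 0.5 are
    those the classifier is least certain about.
    """
    ranked = sorted(unlabelled, key=lambda text: abs(predict_proba(text) - 0.5))
    return ranked[:batch_size]


def toy_predict_proba(text: str) -> float:
    """Hypothetical stand-in for a classifier trained on the seed set."""
    cues = ("risk", "comorbidity", "smoking")
    hits = sum(cue in text.lower() for cue in cues)
    return min(1.0, 0.1 + 0.4 * hits)


# Invented example sentences, not drawn from the dataset.
pool = [
    "Smoking and comorbidity were associated with higher risk.",
    "Samples were stored at 4 degrees.",
    "Age may be a risk factor.",
]

# Pick the single text the toy classifier is least certain about.
selected = select_for_annotation(pool, toy_predict_proba, batch_size=1)
print(selected)
```

In a loop of this kind, the selected texts would be annotated manually, added to the training data, and the classifier retrained before the next selection round.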
Copyright (c) 2021 Maria Skeppstedt, Magnus Ahltorp, Gunnar Eriksson and Rickard Domeij
This work is licensed under a Creative Commons Attribution 4.0 International License.