Pipeline for a Data-driven Network of Linguistic Terms
DOI:
https://doi.org/10.3384/ecp184176
Keywords:
terminology extraction, automated domain ontology, linguistic terminology
Abstract
The present work aims at (1) developing a search engine adapted to the large DReaM corpus of descriptive linguistic literature and (2) gaining insight into how a data-driven ontology of linguistic terminology might be built. Starting from close to 20,000 text documents from the literature of language descriptions, either born digital or scanned and OCR’d, we extract keywords and pass them through a pruning pipeline that retains mainly those keywords that can be considered linguistic terminology. Subsequently, we quantify relations among those terms using Normalized Pointwise Mutual Information (NPMI) and use the resulting measures, in conjunction with Google PageRank (GPR), to build networks of linguistic terms.
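The two measures named in the abstract can be illustrated with a minimal sketch. It is not the pipeline of the paper: the abstract does not say how co-occurrence is counted or how NPMI and PageRank are combined, so the sketch assumes document-level co-occurrence, keeps only term pairs with positive NPMI as weighted undirected edges, and ranks terms with a plain power-iteration PageRank. The function names `npmi_scores` and `pagerank` and the toy documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_scores(docs):
    """Return {(a, b): NPMI} for term pairs that co-occur in at least one document.

    Assumption: probabilities are estimated from document frequencies,
    i.e. p(a) is the share of documents containing term a.
    """
    n = len(docs)
    term_df = Counter()
    pair_df = Counter()
    for terms in docs:
        term_df.update(terms)
        pair_df.update(combinations(sorted(terms), 2))
    scores = {}
    for (a, b), count in pair_df.items():
        p_ab = count / n
        if p_ab == 1.0:
            # Both terms occur in every document: NPMI is undefined (0/0).
            continue
        pmi = math.log(p_ab / ((term_df[a] / n) * (term_df[b] / n)))
        scores[(a, b)] = pmi / -math.log(p_ab)
    return scores


def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on an undirected graph.

    weights maps (a, b) pairs to strictly positive edge weights.
    """
    neighbors = {}
    for (a, b), w in weights.items():
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w
    nodes = sorted(neighbors)
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / len(nodes) for v in nodes}
        for v in nodes:
            total = sum(neighbors[v].values())
            for u, w in neighbors[v].items():
                # Each node passes its rank on in proportion to edge weight.
                new_rank[u] += damping * rank[v] * w / total
        rank = new_rank
    return rank


# Toy documents (hypothetical), each reduced to its set of extracted terms.
docs = [
    {"ergative", "absolutive", "case"},
    {"ergative", "absolutive"},
    {"case", "noun"},
    {"noun", "verb"},
]
scores = npmi_scores(docs)
# Assumption: only positively associated pairs become edges of the network.
edges = {pair: s for pair, s in scores.items() if s > 0}
ranks = pagerank(edges)
```

NPMI ranges from -1 (the terms never co-occur) through 0 (independence) to 1 (the terms only ever occur together), which makes it convenient as an edge weight once a threshold is chosen. The threshold of 0 used here is an illustrative choice only.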
Published
2021-08-12
Issue
Section
Contents
License
Copyright (c) 2021 Søren Wichmann
This work is licensed under a Creative Commons Attribution 4.0 International License.