The LiRI Corpus Platform

Authors

  • Johannes Graën Linguistic Research Infrastructure, University of Zurich, Switzerland
  • Jonathan Schaber Linguistic Research Infrastructure, University of Zurich, Switzerland
  • Daniel McDonald Linguistic Research Infrastructure, University of Zurich, Switzerland
  • Igor Mustač Linguistic Research Infrastructure, University of Zurich, Switzerland
  • Nikolina Rajović Linguistic Research Infrastructure, University of Zurich, Switzerland
  • Gerold Schneider Linguistic Research Infrastructure, University of Zurich, Switzerland
  • Teodora Vuković Linguistic Research Infrastructure, University of Zurich, Switzerland
  • Jeremy Zehr Linguistic Research Infrastructure, University of Zurich, Switzerland
  • Noah Bubenhofer Linguistic Research Infrastructure, University of Zurich, Switzerland

DOI:

https://doi.org/10.3384/ecp210010

Abstract

We present the LiRI Corpus Platform (LCP), a software system and infrastructure for querying a wide range of corpora of different kinds. It relies heavily on the PostgreSQL relational database management system and employs state-of-the-art data representation and indexing techniques, which lead to significant performance gains when querying, even for structurally complex queries involving nested logical operations and quantifiers. In this work, we describe the requirements that led to the development of this novel system, discuss methods from corpus linguistics and beyond that we considered key for such a system, and provide details on a number of technological features that we take advantage of. Our platform also comes with its own query language, tailored both to users' information needs and to our philosophy of how to define corpora in an abstract way.

Published

2024-07-09