CLARIN Annual Conference
https://ecp.ep.liu.se/index.php/clarin
<p>CLARIN, the Common Language Resources and Technology Infrastructure, is a virtual platform for everyone interested in language. CLARIN offers access to language resources, technology, and knowledge, and enables cross-country collaboration among academia, industry, policy-makers, cultural institutions, and the general public. Researchers, students, and citizens are offered access to digital language resources and technology services to deploy, connect, analyse and sustain such resources. In line with the Open Science agenda, CLARIN enables scholars from the Social Sciences and Humanities (SSH) and beyond to engage in and contribute to cutting-edge research driven by language data.</p>
Linköping University Electronic Press
en-US
CLARIN Annual Conference
1650-3686
Constructing SABeD: A Spoken Academic Belgian Dutch Corpus
https://ecp.ep.liu.se/index.php/clarin/article/view/1016
<p>We present the Spoken Academic Belgian Dutch (SABeD) corpus and a description of its construction. It was compiled from selected first bachelor academic lectures in higher education institutions in Flanders, as students indicate that the language used in such lectures is one of the hurdles for comprehension and academic success. We first applied speech recognition to these lectures and then applied manual utterance segmentation and manual correction of the automated transcription. A filtered version of the resulting transcriptions was automatically punctuated and linguistically annotated with CLARIN tools and is currently available for search in the Autosearch online corpus query environment. The manual transcriptions and the ELAN files with the final annotation will soon be made available to the research community for download in the CLARIN infrastructure at <a href="http://hdl.handle.net/10032/tm-a2-w4">http://hdl.handle.net/10032/tm-a2-w4</a>.</p>Jolien Mathysen, Vincent Vandeghinste, Elke Peters, Patrick Wambacq
Copyright (c) 2024 Jolien Mathysen, Vincent Vandeghinste, Elke Peters, Patrick Wambacq
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210001
XSL-HoReCo and GoSt-ParC-Sign: Two New Signed Language - Written Language Parallel Corpora
https://ecp.ep.liu.se/index.php/clarin/article/view/1017
<p>Developments in language technology targeting signed languages lag behind the advances made for so-called spoken languages. This is partly due to the scarcity of good-quality signed language data, including good-quality parallel corpora of signed and spoken languages. This paper introduces two parallel corpora which aim at reducing the gap between signed and spoken-only language technology: the XSL Hotel Review Corpus (XSL-HoReCo) and the Gold Standard Parallel Corpus of Signed and Spoken Language (GoSt-ParC-Sign). Both corpora are available through the CLARIN infrastructure.</p>Mirella De Sisto, Vincent Vandeghinste, Caro Brosens, Myriam Vermeerbergen, Dimitar Shterionov
Copyright (c) 2024 Mirella De Sisto, Vincent Vandeghinste, Caro Brosens, Myriam Vermeerbergen, Dimitar Shterionov
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210002
Teaching Syntax with CLARIN Corpora and Resources
https://ecp.ep.liu.se/index.php/clarin/article/view/1018
<p>The recent COVID-19 pandemic has brought online learning to the forefront for learners and teachers. As a consequence, the demand for self-paced and adaptive learning resources has reached unprecedented levels. Fortunately, universities had already been using e-learning platforms such as Moodle, or other SCORM-compliant LMSs, which helped them make the transition from on-site to online learning. However, teachers still had to design and implement assessment activities in the form of self-correcting activities (true/false, multiple answer questions, mark the words, fill in the blanks questions, etc.). This step has proved to be a major hurdle in the on-site to online learning transition, since designing and, most of all, manually editing formative and evaluative assessment activities is a very labour-intensive task. In this article, we present a framework that takes advantage of the corpora and resources available from the LINDAT / CLARIAH-CZ Data & Tools platform in order to generate quizzes and other activities related to syntax. After some background on using NLP for teaching grammar, we present our corpus-to-quiz processing chain, and we outline preliminary results on deploying automatically generated French syntax quizzes in the classroom.</p>Antonio Balvet
Copyright (c) 2024 Antonio Balvet
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210003
Standards Information System for CLARIN centres and beyond
https://ecp.ep.liu.se/index.php/clarin/article/view/1020
<p>The present contribution describes features of the CLARIN Standards Information System that have been designed to assist data deposition centres in CLARIN. We also show what is needed and what has been done in order to go beyond the originally designated target, so as to provide service to sibling and descendant research infrastructures, of which DARIAH and Text+ are taken as examples. This paper is aimed primarily at representatives of research infrastructure nodes, responsible for preparing and sharing data deposition information about their centres or repositories. It assumes a degree of technical knowledge or experience in using the XML format and tools, the REST API, and version control systems.</p>Piotr Bański, Eliza Margaretha Illig
Copyright (c) 2024 Piotr Bański, Eliza Margaretha Illig
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210004
The CLARIN:EL infrastructure: Platform, Portal, K-Centre
https://ecp.ep.liu.se/index.php/clarin/article/view/1021
<p>This paper presents the CLARIN:EL infrastructure, which comprises three pillars: the language resources and technologies Platform, the Portal and the Knowledge Centre. It serves as a comprehensive and interoperable environment that supports language-related research in the fields of language technology, language studies, digital humanities, and political and social sciences. The Platform facilitates deposition, curation and sharing of digital language resources (catering for providers’ needs), and access to and automatic processing of these resources (catering for consumers’ needs). The Portal offers informative material about CLARIN:EL and support services to the community, including dissemination, awareness raising and training activities. The Knowledge Centre promotes digital literacy in the scientific domains served, by providing information on studies, educational and training material and publications. This paper discusses the CLARIN:EL pillars, the technical architecture, its design and implementation principles, the functionalities offered to the users, the support activities provided, usage analytics and future steps.</p>Maria Gavriilidou, Stelios Piperidis, Dimitrios Galanis, Kanella Pouli, Penny Labropoulou, Juli Bakagianni, Iro Tsiouli, Miltos Deligiannis, Athanasia Kolovou, Dimitris Gkoumas, Leon Voukoutis, Katerina Gkirtzou
Copyright (c) 2024 Maria Gavriilidou, Stelios Piperidis, Dimitrios Galanis, Kanella Pouli, Penny Labropoulou, Juli Bakagianni, Iro Tsiouli, Miltos Deligiannis, Athanasia Kolovou, Dimitris Gkoumas, Leon Voukoutis, Katerina Gkirtzou
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210005
The SSH Open Marketplace and CLARIN
https://ecp.ep.liu.se/index.php/clarin/article/view/1022
<p>This paper showcases the SSH Open Marketplace, a discovery portal that pools and contextualises resources for Social Sciences and Humanities research communities, and its tight connections with the CLARIN infrastructure. The paper presents how the SSH Open Marketplace can provide insights into the use of tools, methods and standards in the Social Sciences and Humanities communities in general, and for the CLARIN community in particular. The paper also describes how the SSH Open Marketplace can increase serendipity in the discovery of new methods and standards, by interlinking the resources and describing workflows. As contextualisation is provided between the items of the catalogue, it is easy to understand and assess the usefulness of a resource.</p>Alexander König, Laure Barbot, Cristina Grisot, Michael Kurzmeier, Edward J. Gray
Copyright (c) 2024 Alexander König, Laure Barbot, Cristina Grisot, Michael Kurzmeier, Edward J. Gray
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210006
Domain-Specific Languages for Epigraphy: the Case of ItAnt
https://ecp.ep.liu.se/index.php/clarin/article/view/1023
<p>This contribution illustrates how the definition of a Domain-Specific Language can support the activities of epigraphists and historical linguists. It presents and discusses a method and technological solution, based on Domain-Specific Languages, for helping scholars digitally represent the available knowledge of archaic languages and cultures. This is achieved by increasing the human readability of the encoded data without sacrificing compliance with standard models and formats. The work is framed within the context of an Italian national collaborative research project devoted to the study of the languages and cultures of ancient Italy. The platform developed within this project offers an interesting use case and motivation for experimenting with Domain-Specific Languages for the creation of necessary digital critical editions of the inscriptions relevant for these languages. After explaining the definition process of the DSL grammar, we test its applicability to five example inscriptions in the Faliscan language.</p>Federico Boschetti, Luca Rigobianco, Valeria Quochi
Copyright (c) 2024 Federico Boschetti, Luca Rigobianco, Valeria Quochi
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210007
Mind the Ownership Gap? Copyright in AI-generated Language Data
https://ecp.ep.liu.se/index.php/clarin/article/view/1024
<p>For language scientists, a <em>prima facie</em> advantage of AI-generated data over human-created content is that AI outputs are generally regarded as free from copyright. This contribution addresses this issue in some detail.</p>Pawel Kamocki, Toby Bond, Krister Lindén, Thomas Margoni, Aleksei Kelli, Andrius Puksas
Copyright (c) 2024 Pawel Kamocki, Toby Bond, Krister Lindén, Thomas Margoni, Aleksei Kelli, Andrius Puksas
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210008
MWE-Finder: An evaluation through three case studies
https://ecp.ep.liu.se/index.php/clarin/article/view/1026
<p>In this paper we showcase and evaluate MWE-Finder, a system that allows users to search for occurrences of an MWE in a large Dutch text corpus. To this end, we conduct three small case studies, and discuss the results in detail. We make use of the MWEs <em>0geen *+haan zal naar iets kraaien</em> ‘no one will say anything about something’, <em>iemand zal 0dat *+varken wassen</em> ‘someone will deal with that problem’ and <em>iemand zal iemand het hemd van het lijf vragen</em> ‘someone will want to know all the ins and outs of something from someone’, which are all in canonical form following Odijk (2023) and Odijk and Kroon (2024).</p> <p>The results show that MWE-Finder is very accurate in retrieving the target MWEs, reaching an accuracy of 93.7%, and an F<sub>1</sub>-score of 95.2%. The case studies additionally lay bare points of improvement of MWE-Finder, specifically concerning the enrichment of syntactic parses by making the object relation explicit in certain constructions.</p>Martin Kroon, Jan Odijk
Copyright (c) 2024 Martin Kroon, Jan Odijk
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210009
The LiRI Corpus Platform
https://ecp.ep.liu.se/index.php/clarin/article/view/1027
<p>We present the LiRI Corpus Platform (LCP), a software system and infrastructure for querying a vast array of corpora of different kinds. It heavily relies on the PostgreSQL relational database management system, employing state-of-the-art data representation and indexing techniques, which lead to significant performance gains when querying, even for structurally complex queries involving nested logical operations and quantifiers. In this work, we describe the requirements that led to the development of this novel system, discuss methods from corpus linguistics and beyond that we considered key for such a system, and provide details on a number of technological features that we take advantage of. Our platform also comes with its own query language tailored both to the requirements in terms of information need and our philosophy of how to define corpora in an abstract way.</p>Johannes Graën, Jonathan Schaber, Daniel McDonald, Igor Mustač, Nikolina Rajović, Gerold Schneider, Teodora Vuković, Jeremy Zehr, Noah Bubenhofer
Copyright (c) 2024 Johannes Graën, Jonathan Schaber, Daniel McDonald, Igor Mustač, Nikolina Rajović, Gerold Schneider, Teodora Vuković, Jeremy Zehr, Noah Bubenhofer
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210010
CLARIN in Training and Education
https://ecp.ep.liu.se/index.php/clarin/article/view/1028
<p>To help realise its potential as the research infrastructure for language as social and cultural data, CLARIN is supporting the training of students and scholars in using its language data, tools and services. Lecturers and teachers in the CLARIN network have integrated CLARIN language resources into higher education programmes and other training activities. This paper showcases some recent courses and training initiatives, along with inventories and new learning materials, partly developed in EU-funded projects, which are accessible through the CLARIN Learning Hub. Each section briefly describes the motivation behind the initiative, the authors’ experience, related efforts in the field, and future perspectives.</p>Koenraad De Smedt, Iulianna van der Lek, Henk van den Heuvel, Antonio Balvet, Maarten Janssen, Silvie Cinková, Amelia Sanz, Stavros Assimakopoulos, Louis ten Bosch
Copyright (c) 2024 Koenraad De Smedt, Iulianna van der Lek, Henk van den Heuvel, Antonio Balvet, Maarten Janssen, Silvie Cinková, Amelia Sanz, Stavros Assimakopoulos, Louis ten Bosch
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210011
Analyses of information security standards on data crawled from company web sites using SweClarin resources
https://ecp.ep.liu.se/index.php/clarin/article/view/1029
<p>With the purpose of analysing Swedish companies’ adherence to and adoption of the information security standard ISO 27001 and to examine the communicative constitution of preventive innovation in organisations, we have created a corpus of corporate texts from Swedish company websites. The corpus was analysed from multiple interdisciplinary perspectives in close cooperation with management researchers and SweClarin researchers using SweClarin tools and resources as well as standard language technology tools. Some analyses require deep reading, which was performed by management researchers, often guided by results from language analyses. Initial results have been presented at a management studies conference. In this paper, we focus on presenting the research issues, the methods used in the project, the results, and the experience of SweClarin researchers supporting researchers in social sciences. Our contribution is to show how it is possible, through the integration of human insights and digital methods, to increase the credibility and validity of a digitally acquired data set and subsequent research findings. In our view, a combination of human deep reading (management researchers), contextual lexical verification (management studies) and language technology (content and sentiment analysis) can help to sensitise computational text analysis for medium-sized data sets.</p>Arne Jönsson, Subhomoy Bandyopadhyay, Svjetlana Pantic Dragisic, Andrea Fried
Copyright (c) 2024 Arne Jönsson, Subhomoy Bandyopadhyay, Svjetlana Pantic Dragisic, Andrea Fried
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210012
Protective Measures for Sharing the Finnish DarkWeb Marketplace Corpus (FINDarC)
https://ecp.ep.liu.se/index.php/clarin/article/view/1030
<p>We discuss the archiving procedure of a corpus comprising posts submitted to Torilauta, a Finnish dark web marketplace website. The site was active from 2017 to 2021 and during this time was one of the most prominent online illegal narcotics markets in Finland. A reduced version of the corpus, Finnish Dark Web Marketplace Corpus (FINDarC), has been archived in the Language Bank of Finland. In the current work, we focus on the protective measures for storing the data and how researchers can apply for access rights to the corpus under the CLARIN RES licence.</p>Krister Lindén, Teemu Ruokolainen, Lasse Hämäläinen, J. Tuomas Harviainen, Martin Matthiesen, Mietta Lennes
Copyright (c) 2024 Krister Lindén, Teemu Ruokolainen, Lasse Hämäläinen, J. Tuomas Harviainen, Martin Matthiesen, Mietta Lennes
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210013
“Hier in diesem Hause sitzen keine Idioten!” - Emotion and Concreteness in Austrian Parliamentary Discourse
https://ecp.ep.liu.se/index.php/clarin/article/view/1031
<p>This study examines Austrian parliamentary discourse styles by combining utterances from the Corpus of Austrian Parliamentary Records (ParlAT; Wissik & Pirker, 2018) with a large dataset of affective norms for German (Köper & Schulte im Walde, 2016). The results suggest that parliamentary discourse styles differ significantly depending on gender, party affiliation and utterance type (regular speech vs. unauthorized utterances). The findings are discussed within the context of gendered language usage and the literature on political speech in general. In particular, we find evidence for a characteristically male right-wing populist mode of parliamentary discourse marked by negative and concrete language use and a penchant for heckling. It is also shown that discourse styles can vary over time, specifically when the parties in power change from one period to the next (e.g. a center-left/center-right coalition government following a center-right/right one).</p>Klaus Hofmann, Tanja Wissik
Copyright (c) 2024 Klaus Hofmann, Tanja Wissik
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210014
Topics in Periodicals from the Swedish Diabetes Association 1949 – 1990: Extending the Topic Modelling Tool Topics2Themes with a Timeline Visualisation
https://ecp.ep.liu.se/index.php/clarin/article/view/1032
<p>Existing methods for visualising temporal topic models typically present the information in an aggregated form, and do not offer any possibility to track the specific texts responsible for the change in topic prevalence over time. We present a new type of topic modelling-based timeline visualisation. It still provides an overview with aggregated topic information suitable for distant reading, while also allowing the user to gradually zoom into the image for more detail. At the most detailed level, the individual texts can be reached, which makes it possible to switch to close reading. The timeline visualisation was implemented as an extension of the topic modelling tool Topics2Themes, but this visualisation technique can be adapted to other topic modelling tools and algorithms. We showcase the timeline visualisation on a corpus of periodicals from the Swedish Diabetes Association, which is one of the patient organisation corpora studied within the interdisciplinary project ActDisease. One timeline visualisation was generated for the entire corpus. Additionally, we generated a timeline focusing on the texts that contain the word “dietitian”. The two timelines, including the functionality to zoom into the graphs and reach the texts, were used to analyse the topics and how they vary. It could be concluded that some of the topics and topic timelines were predictable, while others revealed content that might be less expected. These results indicate validity of the method applied, and they also show that this visualisation technique could help us learn something new.</p>Maria Skeppstedt, Gijs Aangenendt, Vera Danilova, Ylva Söderfeldt
Copyright (c) 2024 Maria Skeppstedt, Gijs Aangenendt, Vera Danilova, Ylva Söderfeldt
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210015
Re-Reading Lists in Historical Newspapers: Digital Insights into an Overlooked Text Type
https://ecp.ep.liu.se/index.php/clarin/article/view/1033
<p>The paper presents an ongoing doctoral project dedicated to periodically published lists in historical newspapers between 1600 and 1850. By employing approaches from Corpus Linguistics and Digital Humanities, the project aims to locate the studied ‘small’ texts within existing digital resources, analyse them with regard to their textual characteristics and evaluate their potentials and challenges for automated information extraction. The article primarily focuses on two key aspects: firstly, on search strategies for locating lists in digital newspaper corpora and collections, and secondly, on a case study into lists of arriving persons published in the <em>Wien[n]erisches Diarium</em> between 1703 and 1725. These empirical investigations reveal that periodically published lists form a central and frequent component of early modern newspapers and offer numerous potentials for Digital Humanities research due to their textual features, such as periodicity, repetitiveness or inherent (semi-)structuredness. In this regard, the paper identifies the overlooked newspaper text type as a data treasure awaiting discovery and underscores the need to investigate ‘small’ newspaper texts on a large scale.</p>Nina C. Rastinger
Copyright (c) 2024 Nina C. Rastinger
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210016
Adding political orientation metadata to ParlaMint corpora
https://ecp.ep.liu.se/index.php/clarin/article/view/1034
<p>Parliamentary debates are an important source for political discourse research as well as research in other disciplines. The ParlaMint project aims to create comparable corpora of parliamentary debates which, through unified encoding, provide a comprehensible resource to support such research. Within these corpora, speeches are attributed to speakers, and speaker metadata are included, such as temporal affiliations with different organizations, e.g. parliamentary groups and political parties. This paper discusses the addition of metadata on the political orientation of parties and parliamentary groups to the ParlaMint corpora. The paper explains our two sources for this information, namely the Chapel Hill Expert Survey Dataset and Wikipedia, the process of data collection and its subsequent encoding in the corpora. Furthermore, the paper presents an analysis of the extent of the added metadata, along with an example of exploratory data analysis. It also outlines the distribution of utterances across political orientation categories within ParlaMint, offering a comprehensive overview of the diverse perspectives and ideologies within the corpora. The inclusion of this supplementary metadata could prove valuable for parliamentary data research, while the methodology developed could be used to add further metadata to the ParlaMint corpora.</p>Katja Meden, Jure Skubic, Tomaž Erjavec
Copyright (c) 2024 Katja Meden, Jure Skubic, Tomaž Erjavec
https://creativecommons.org/licenses/by/4.0
2024-07-09 2024-07-09 10.3384/ecp210017