https://ecp.ep.liu.se/index.php/clarin/issue/feed CLARIN Annual Conference 2023-06-09T13:00:19+02:00 CLARIN clarin@clarin.eu Open Journal Systems <p>CLARIN, the Common Language Resources and Technology Infrastructure, is a virtual platform for everyone interested in language. CLARIN offers access to language resources, technology, and knowledge, and enables cross-country collaboration among academia, industry, policy-makers, cultural institutions, and the general public. Researchers, students, and citizens are offered access to digital language resources and technology services to deploy, connect, analyse and sustain such resources. In line with the Open Science agenda, CLARIN enables scholars from the Social Sciences and Humanities (SSH) and beyond to engage in and contribute to cutting-edge research driven by language data.</p> https://ecp.ep.liu.se/index.php/clarin/article/view/723 Analysing Changes in Official Use of the Design Concept Using SweCLARIN Resources 2023-06-09T13:00:12+02:00 Lars Ahrenberg Daniel Holmer Stefan Holmlid Arne Jönsson We investigate changes in the use of four Swedish words from the fields of design and architecture. It has been suggested that their meanings have been blurred, especially in governmental reports and policy documents, so that distinctions between them that are important to stakeholders in the respective fields are lost. Specifically, we compare usage in two governmental public reports on design, one from 1999 and the other from 2015, and additionally in opinion responses to the 2015 report. Our approach is to contextualise occurrences of the words in different representations of the texts using word embeddings, topic modelling and sentiment analysis. Tools and language resources developed within the SweClarin infrastructure have been crucial for the implementation of the study.
2023-06-09T00:00:00+02:00 Copyright (c) 2023 Lars Ahrenberg, Daniel Holmer, Stefan Holmlid, Arne Jönsson https://ecp.ep.liu.se/index.php/clarin/article/view/724 The CLaDA-BG Dictionary Creation System: Specifics and Perspectives 2023-06-09T13:00:12+02:00 Zhivko Angelov Kiril Simov Petya Osenova Zara Kancheva The paper reports on the current status of a system for creating dictionaries within the CLaDA-BG infrastructure. The system is called CLaDA-BG-Dict. At the heart of the system lies the lexical thesaurus BTB-Wordnet around which all other language resources for Bulgarian are organized. These are various types of dictionaries (morphological, explanatory, terminological, etc.), ontologies (such as DBpedia), corpora (in-house and external). The specific features and functionalities of the system are discussed with respect to the language resource integrity. Also, the rationale behind the construction of such a system is given together with an outline of its utility for a number of NLP tasks and for various types of users. 2023-06-09T00:00:00+02:00 Copyright (c) 2023 Zhivko Angelov, Kiril Simov, Petya Osenova, Zara Kancheva https://ecp.ep.liu.se/index.php/clarin/article/view/725 Linguistic Autobiographies. Towards the Creation of a Multilingual Resource Family 2023-06-09T13:00:13+02:00 Silvia Calamai Rosalba Nodari Claudia Soria Alessandro Carlucci This paper describes a project aimed at creating a new resource family of multilingual and multimodal resources centered around the concept of “Linguistics of self”, that is, personal reflections on the role of languages in shaping one’s identity. Language portrait silhouettes, drawing bilingualism, and linguistic autobiographies are different types of resources that share this common feature. We describe the resources and the criteria for their metadata annotation, focusing in particular on linguistic autobiographies, where the writer explicitly reflects on the relationship between him/herself and language.
These genres are fruitfully used in different educational settings, and research has shown that they help to uncover the social, affective, and psychological dimensions of language learning. The potential of a multilingual and multimodal collection is discussed starting from data collected in Italy and Norway. 2023-06-09T00:00:00+02:00 Copyright (c) 2023 Silvia Calamai, Rosalba Nodari, Claudia Soria, Alessandro Carlucci https://ecp.ep.liu.se/index.php/clarin/article/view/726 The Pipeline for Publishing Resources in the Language Bank of Finland 2023-06-09T13:00:13+02:00 Ute Dieckmann Mietta Lennes Jussi Piitulainen Jyrki Niemi Erik Axelson Tommi Jauhiainen Krister Lindén We present the process of publishing resources in Kielipankki, the Language Bank of Finland. Our pipeline includes all the steps that are needed to publish a resource: from finding and receiving the original data to making the data available via different platforms, e.g., the Korp concordance tool or the download service. Our goal is to standardize the publishing process by creating an ordered checklist of tasks with the corresponding documentation and by developing conversion scripts and processing tools that can be shared and applied to different resources. 2023-06-09T00:00:00+02:00 Copyright (c) 2023 Ute Dieckmann, Mietta Lennes, Jussi Piitulainen, Jyrki Niemi, Erik Axelson, Tommi Jauhiainen, Krister Lindén https://ecp.ep.liu.se/index.php/clarin/article/view/727 TEI and Git in ParlaMint: Collaborative Development of Language Resources 2023-06-09T13:00:14+02:00 Tomaž Erjavec Matyáš Kopp Katja Meden This paper discusses the encoding, validation and development of language resources of the completed ParlaMint I and on-going ParlaMint II CLARIN projects, which centre on the collaborative development of a large set of interoperable corpora of parliamentary proceedings. It focuses on the ParlaMint encoding and the GitHub development platform and the evaluation of their use by project partners.
We introduce the use of TEI ODD for the encoding guidelines and validation schemas. We motivate and explain using Git to develop and maintain the encoding schemas, validation and conversion scripts and valid samples of the corpora. Apart from revision control, issues, and publishing documentation, the GitHub platform also supports integrating code execution with pull requests, which significantly automates the data submission process. The paper also presents the results of a survey on the use of TEI and Git in the ParlaMint projects among the project participants. Overall, participants were mostly positive about their experience with TEI and Git, although some difficulties were reported. Some partners also expressed doubts about whether the current scheme is flexible enough to support encoding the unique features of sometimes drastically different parliamentary systems. On the other hand, reactions to Git were very positive in terms of information and feedback received via GitHub Issues, effectiveness in the communication process, and plans to use Git in the future. However, there was less agreement on whether the requirements and workflows were adequately explained. The reported difficulties will serve as a basis for further Git workflow optimisation in ParlaMint. 2023-06-09T00:00:00+02:00 Copyright (c) 2023 Tomaž Erjavec, Matyáš Kopp, Katja Meden https://ecp.ep.liu.se/index.php/clarin/article/view/728 EU Data Governance Act: Outlining a Potential Role for CLARIN 2023-06-09T13:00:15+02:00 Paweł Kamocki Krister Linden Andrius Puksas Aleksei Kelli The Data Governance Act was proposed in late 2020 as part of the European Strategy for Data, and adopted on 30 May 2022 (as Regulation 2022/868). It will enter into application on 24 September 2023. The Data Governance Act is a major development in the legal framework affecting CLARIN and the whole language community.
With its new rules on the re-use of data held by public sector bodies and on the provision of data sharing services, and especially its encouragement of data altruism, the Data Governance Act creates new opportunities and new challenges for CLARIN ERIC. This paper analyses the provisions of the Data Governance Act, and aims at initiating the debate on how they will impact CLARIN and the whole language community. 2023-06-09T00:00:00+02:00 Copyright (c) 2023 Paweł Kamocki, Krister Linden, Andrius Puksas, Aleksei Kelli https://ecp.ep.liu.se/index.php/clarin/article/view/729 Semantic Classification of Prepositions in BulTreeBank WordNet 2023-06-09T13:00:15+02:00 Zara Kancheva The paper presents the work in progress for a PhD thesis about preposition incorporation in the Bulgarian BulTreeBank WordNet. Being one of the most polysemous parts of speech, prepositions are still relatively challenging for NLP and are usually missing in wordnets. A preposition semantic classification, a model for preposition synsets and synset relations are proposed. The planned applications of the prepositions and the directions for future processing are introduced. 2023-06-09T00:00:00+02:00 Copyright (c) 2023 Zara Kancheva https://ecp.ep.liu.se/index.php/clarin/article/view/730 Neural Metaphor Detection for Slovene 2023-06-09T13:00:16+02:00 Matej Klemen Marko Robnik-Šikonja Metaphors are linguistic expressions that use comparison with another concept to potentially improve the expressivity of language. Due to relevant downstream applications, metaphor detection is an active topic of research. Most of the research is focused on English, while other languages are less covered. In our work, we focus on Slovene, presenting the first word-level metaphor detection experiments. We apply multiple transformer-based large language models to four versions of two publicly available Slovene corpora: KOMET and G-KOMET.
We perform monolingual, multilingual, and cross-lingual experiments, using the VU Amsterdam metaphor corpus as an additional source of metaphor knowledge. We evaluate the models quantitatively using word-level $F_1$ score and find that (1) the most consistently well-performing model is the trilingual CroSloEngual BERT model, (2) the addition of English data in multilingual experiments does not improve the performance significantly, and (3) the cross-lingual models achieve significantly worse results than their monolingual and multilingual counterparts. 2023-06-09T00:00:00+02:00 Copyright (c) 2023 Matej Klemen, Marko Robnik-Šikonja https://ecp.ep.liu.se/index.php/clarin/article/view/731 Evaluation of the Archivio Vi.Vo Architecture: A Case Study on the Reuse of Legacy Data for Linguistic Purposes 2023-06-09T13:00:16+02:00 Roberta Bianca Luzietti The object of this paper is the validation of the Archivio Vi.Vo. architecture, developed within the CLARIN-IT consortium for the preservation and accessible consultation of historical oral archives. Following the first case study employing the Caterina Bueno archive, the goal is now to show how this innovative architecture is also suitable for conducting research investigations on the archival data and hosting different types of archives. The real use case study presented in this contribution aims at employing the Angela Spinelli archive for conducting a sociophonetic investigation on Tuscan vernacular. 2023-06-09T00:00:00+02:00 Copyright (c) 2023 Roberta Bianca Luzietti https://ecp.ep.liu.se/index.php/clarin/article/view/732 It-Sr-NER: CLARIN Compatible NER and Geoparsing Web Services for Italian and Serbian Parallel Text 2023-06-09T13:00:17+02:00 Olja Perišić Ranka Stanković Milica Ikonić Nešić Mihailo Škorić The paper showcases the outcomes of the “It-Sr-NER: Web services for named entities recognition, linking and mapping” project for the Serbian and Italian languages.
The project was a collaboration between the University of Turin and the Society for Language Resources and Technologies JeRTeh, with the goal of creating the It-Sr-NER web service. This service is designed to annotate named entities such as people, places, organizations, ethnicities, events, and works of art in text and display them on a map. 2023-06-09T00:00:00+02:00 Copyright (c) 2023 Olja Perišić, Ranka Stanković, Milica Ikonić Nešić, Mihailo Škorić https://ecp.ep.liu.se/index.php/clarin/article/view/733 Lemmatizing and POS-tagging Akkadian with BabyLemmatizer and Dictionary-Based Post-Correction 2023-06-09T13:00:17+02:00 Aleksi Sahala Tero Alstola Jonathan Valk Krister Lindén We present BabyLemmatizer, a hybrid lemmatizer and POS-tagger for Akkadian, the language of the ancient Assyrians and Babylonians, documented from 2350 BCE to 100 CE. In our approach the text is first POS-tagged and lemmatized with TurkuNLP trained with human-verified labels, and then post-corrected with dictionary-based methods to improve the lemmatization quality. The post-correction also assigns labels with confidence scores to flag the most suspicious lemmatizations for manual validation. We demonstrate that the presented tool achieves a Lemma+POS labeling accuracy of 94%, and a lemmatization accuracy of 95% on a held-out test set. We also apply the lemmatizer to a previously unlemmatized text corpus to test it in practice. 2023-06-09T00:00:00+02:00 Copyright (c) 2023 Aleksi Sahala, Tero Alstola, Jonathan Valk, Krister Lindén https://ecp.ep.liu.se/index.php/clarin/article/view/734 Developing Resources for Measuring Text Readability in Sesotho 2023-06-09T13:00:18+02:00 Johannes Sibeko This article presents a work-in-progress doctoral project that explores measuring text readability in Sesotho, a Bantu language spoken by more than 10 million speakers across Southern Africa. The main project adopts a classical readability formulas approach to text readability analysis.
We aim to adapt nine existing readability metrics into Sesotho using English as a higher-resourced helper language. So far, five resources have been developed as part of the study. The rule-based and the TeX-based syllabification systems, the syllable annotated word list, and the grade 12 exam reading comprehension and summary writing corpus have been published on the South African Centre for Digital Language Resources' (SADiLaR) online repository. The machine-translated corpus is still under development. This article describes the progress of the PhD project by overviewing the basic digital language resources developed for the project. The metrics under consideration for adaptation into Sesotho are also briefly discussed. 2023-06-09T00:00:00+02:00 Copyright (c) 2023 Johannes Sibeko https://ecp.ep.liu.se/index.php/clarin/article/view/735 WebLicht-Batch -- A Web-Based Interface for Batch Processing Large Input with the WebLicht Workflow Engine 2023-06-09T13:00:19+02:00 Claus Zinn Ben Campbell WebLicht is a workflow engine that gives researchers access to a well-inhabited space of natural language processing tools that can be combined into tool chains to perform complex natural language analyses. In this paper, we present WebLicht-Batch, a web-based interface to WebLicht's chainer back-end. WebLicht-Batch helps users to automatically feed large input data, or input data consisting of multiple files, into WebLicht. It disassembles large input into smaller, more digestible sizes, feeds the resulting parts into WebLicht's pipelining and execution engine, and then assembles the results of such processing into files that preserve the usual input-output dichotomy. 2023-06-09T00:00:00+02:00 Copyright (c) 2023 Claus Zinn, Ben Campbell
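
The split, process, and reassemble pattern described in the WebLicht-Batch abstract can be illustrated with a minimal Python sketch. This is not WebLicht's API or the authors' implementation: the function names `split_into_chunks` and `process_in_batches`, the `max_chars` limit, the choice of blank lines as split points, and the `process` callable (standing in for a tool chain) are all assumptions made for illustration.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Group paragraphs (separated by blank lines) into chunks.

    A chunk is closed as soon as adding the next paragraph would exceed
    max_chars. A single paragraph longer than max_chars is kept whole,
    so a chunk can be oversized in that case (hypothetical policy).
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for paragraph in text.split("\n\n"):
        # Account for the separator when the chunk already has content.
        extra = len(paragraph) + (2 if current else 0)
        if current and size + extra > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [paragraph], len(paragraph)
        else:
            current.append(paragraph)
            size += extra
    chunks.append("\n\n".join(current))
    return chunks


def process_in_batches(
    text: str, process: Callable[[str], str], max_chars: int
) -> str:
    """Run `process` on each chunk in order and rejoin the results."""
    return "\n\n".join(process(c) for c in split_into_chunks(text, max_chars))


if __name__ == "__main__":
    sample = "aaa\n\nbbb\n\nccc"
    # str.upper is only a placeholder for a real processing step.
    print(process_in_batches(sample, str.upper, max_chars=8))
```

Splitting only at paragraph boundaries keeps each part self-contained for sentence-level tools, and joining the parts with the same separator restores the original layout when the processing step leaves paragraph breaks intact.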