XSL-HoReCo and GoSt-ParC-Sign: Two New Signed Language - Written Language Parallel Corpora

Developments in language technology targeting signed languages lag behind the advances available for so-called spoken languages. This is partly due to the scarcity of good quality signed language data, including good quality parallel corpora of signed and spoken languages. This paper introduces two parallel corpora which aim at reducing the gap between signed and spoken-only language technology: the XSL Hotel Review Corpus (XSL-HoReCo) and the Gold Standard Parallel Corpus of Signed and Spoken Language (GoSt-ParC-Sign). Both corpora are available through the CLARIN infrastructure.


Introduction
In Europe, about half a million people have a sign language as their main or preferred means of communication (Pasikowska-Schnass, 2018). Nevertheless, sign language technology lags far behind the tools available for spoken languages (Vandeghinste et al., 2023). One of the reasons is the scarcity of data. This is partially due to the fact that sign languages do not have a widely used written form; hence, collecting written sign language data is not an option (in contrast to many spoken languages).
Data collection and data storage also face a number of challenges, such as GDPR restrictions and difficulties in recruiting participants. Many short videos are scattered across different platforms and websites, which makes it difficult and time-consuming to track them down and obtain the informed consent of the signers (Vandeghinste et al., forthcoming).
The majority of sign language data comes in the form of videos. To date, there is no automatic tool able to annotate or translate sign language videos (Morgan et al., 2022; Vandeghinste et al., 2023), which means that these processes rely on very time-consuming manual work; consequently, available annotations and translations are scarce.
In addition, the quality of the data that are available is often problematic (Vandeghinste et al., forthcoming). Most of the sign language datasets readable for machine learning consist of television broadcasts in which a spoken-language source is interpreted into a sign language by a hearing interpreter, such as those of Camgöz et al. (2021) and Koller et al. (2015). In those cases sign language is the target language of interpreting, which often occurs simultaneously and in real time, and might be influenced by the source language as well as affected by the interpreting process. Most hearing interpreters do not use a sign language as their main or preferred means of communication (the exceptions being interpreters who are CODAs, i.e. Children of Deaf Adults, and some other specific cases); consequently, they are considered L2 signers.
Additionally, the length of, for example, news broadcasts, the range of specific topics with their associated lexicon, the speed at which information is disseminated, and the number of names that need to be fingerspelled all weigh heavily on the quality of the result. Interpreters are usually required to take a break every 15-20 minutes when they interpret simultaneously, to keep the quality up and avoid cognitive overload, while news broadcasts are often longer and the pace of information is very high. Unlike in face-to-face situations, the interpreter cannot ask the newsreader to repeat themselves or slow down, so in order to keep up with the pace, interpreters might lean more towards the source text than is ideal.
Within the two projects we present in this paper, we take this into account.
Both data sets are openly distributed, making them available to the wider research community, and are of high quality (professional translations, involvement of native signers for translation and validation, etc.), covering different, identifiable domains. They have also been collected in a way that suits their use in Machine Learning (ML) applications, and thus have the potential to stimulate advances in the field of signed language technology, both as high-quality data for training models and as a gold standard for testing.
After presenting related work in Section 2, we present two recent projects that each aim to address the lack of good quality data by providing parallel signed and spoken language data:
• the XSL Hotel Review Corpus (XSL-HoReCo) consists of a parallel dataset of hotel reviews in written English (the source language), videos in Sign Language of the Netherlands (Nederlandse Gebarentaal, NGT), Flemish Sign Language (Vlaamse Gebarentaal, VGT) and Spanish Sign Language (Lengua de Signos Española, LSE), and written Dutch, Spanish and Irish. This dataset is described in Section 3.
• the Gold Standard Parallel Corpus of Signed and Spoken Language (GoSt-ParC-Sign) is a gold standard dataset of semi-spontaneous Flemish Sign Language (Vlaamse Gebarentaal, VGT) videos translated into written Dutch. This dataset is described in Section 4.

Related work
Although sign languages are low-resource languages, there have been some data collection efforts in the past. Kopf et al. (2021) provide a comprehensive list of available corpora for sign languages, but limit it to those cases where sign language is the source. The associated Sign Language Compendium (Kopf et al., 2022) requires, as criteria for inclusion, that a corpus contain (semi-)spontaneous signing, provide transcriptions or translations for at least some of its content, and contain at least 10 hours of sign language recordings. Various sign language datasets have been collected over the years, e.g. CorpusNGT (Crasborn et al., 2008) or DGSKorpus (Prillwitz et al., 2008). However, such datasets are not particularly suited for machine learning or deep learning applications, and require substantial processing prior to building language technology for signed languages (De Sisto et al., 2022; Vandeghinste et al., forthcoming).
For the signed languages addressed in XSL-HoReCo and GoSt-ParC-Sign the following data are available. For NGT, existing datasets with authentic signers are the Corpus NGT (Crasborn et al., 2020) and part of the ECHO corpus (Nonhebel et al., 2004). For VGT, this is limited to the Corpus VGT (Van Herreweghe et al., 2013). For LSE, there is the Corpus de la Lengua de Signos Española (CORLSE) and the small corpus iSignos.
As already mentioned, some sign language datasets that are regularly used for sign language recognition or translation contain non-authentic sign language. In these cases we cannot assume that the signers belong to the sign language community of the language they sign, as most often they are hearing sign language interpreters. There is still debate in the sign language technology research community about whether the cost of using lower quality data can be offset by its quantity, since such data are much more abundantly available.
Such data for VGT are available in the Content4All corpus (Camgöz et al., 2021). To alleviate this, in the BeCoS data (Vandeghinste et al., 2022) the interpreters are deaf signers re-interpreting hearing signers (who are not on the video), so the resulting sign language, although still being the target language, can be considered authentic. More data have been collected, such as additional television broadcasts with sign language interpretation in VGT and videos of the plenary sessions of the Belgian Federal Parliament, with live interpretation into VGT and French Belgian Sign Language (Langue des signes de Belgique francophone, LSFB), but these have not yet been processed or released, partly due to legal constraints (for the broadcasts).
LSE data are more scattered and not easily gathered. For instance, the corpus created by Porta (2014) contains Spanish texts from different domains which were translated by an interpreter into LSE. However, the video data are not publicly available. Another example of data with limited availability in which LSE is the target language is the material produced by the Fundación CNSE (State Confederation of Deaf People), such as an online driving license manual platform. The signed videos are accessible online, but the source texts are only visible in the images displaying street signs; hence, the data are neither easy to compile nor ML-usable. In most cases, few metadata are provided concerning the source of the video material. Therefore, it is not immediately possible to evaluate the quality and the authenticity of the signing.

The XSL Hotel Review Corpus
The XSL Hotel Review Corpus is a multilingual parallel corpus of Sign Language of the Netherlands (Nederlandse Gebarentaal, NGT), Flemish Sign Language (Vlaamse Gebarentaal, VGT), Spanish Sign Language (Lengua de Signos Española, LSE), and written English, Dutch, Spanish and Irish.
The focus on a restricted domain ensures recurrence of similar constructions and terms and facilitates the mapping of words or messages to different realisations of signs or signed utterances.
The choice of the hospitality domain was motivated by the results of co-creation events of the SignON project: during these events, deaf individuals identified the set of circumstances connected to hotels, restaurants, etc., as an appropriate environment in which sign language technology tools would be useful and acceptable. This relates to the concern of some members of the deaf community about the use of these technologies in sensitive or critical situations in which the presence of a human interpreter is preferred.

Written text
The English source text was taken from the Hotel Reviews dataset publicly available on Kaggle. The original dataset contains a list of 1,000 hotels and their reviews provided by Datafiniti's Business Database. The dataset includes hotel location, name, rating, review data, title, username, and more.
For XSL-HoReCo, we only used a selection of the actual hotel reviews. 300 reviews were selected according to the following criteria:
• The review is in English;
• The text is grammatically complete and correct;
• The text does not contain uncommon abbreviations (e.g. mntns for 'mountains');
• In the case in which the review contains incomplete sentences, the removal of these does not affect the meaning of the whole text (an example is provided in Table 1).

Original text: The Southside Motel and Marina is a diamond in the rough. My room had a comfortable king size bed, nice size fridge, microwave and coffee pot. The room was clean and the staff went out of their way to make sure I always had clean towels, the room and was clean and that I had coffee supplies. The motel owners...More
Text without incomplete sentence: The Southside Motel and Marina is a diamond in the rough. My room had a comfortable king size bed, nice size fridge, microwave and coffee pot. The room was clean and the staff went out of their way to make sure I always had clean towels, the room and was clean and that I had coffee supplies.
Table 1: Example of reviews with incomplete sentences whose removal does not affect the meaning of the text as a whole.
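The clean-up illustrated in Table 1 can be approximated programmatically. The following is a minimal sketch, not the procedure used for the corpus: it assumes that truncated reviews end in a "...More" marker, as in the example above, and the function name drop_truncated_tail is hypothetical.

```python
import re

# Assumption: a truncated review ends with "...More", as in the Table 1 example.
TRUNCATION_MARKER = re.compile(r"\.\.\.\s*More\s*$")


def drop_truncated_tail(review: str) -> str:
    """Remove a trailing incomplete sentence that ends in the truncation marker."""
    if not TRUNCATION_MARKER.search(review):
        return review.strip()
    # Cut the marker, then keep the text up to the last sentence-final punctuation.
    body = TRUNCATION_MARKER.sub("", review)
    last_end = max(body.rfind("."), body.rfind("!"), body.rfind("?"))
    return body[: last_end + 1].strip() if last_end != -1 else ""


example = "I had coffee supplies. The motel owners...More"
print(drop_truncated_tail(example))  # I had coffee supplies.
```

This heuristic only handles the mechanical part of the criterion; whether the removal leaves the meaning of the review intact still requires human judgement.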
Within the XSL-HoReCo project, the selected reviews were translated into different languages. The translations from English into Dutch and into Spanish were performed by professional translation companies, which used automatic translation (generated by DeepL) followed by in-depth human post-editing. The Irish translation was performed manually by a professional translation company.
XSL-HoReCo consists of 297 hotel reviews, corresponding to 21,464 words in the English source, 22,274 words in Dutch, and 26,469 words in Irish. Only 283 reviews were translated into Spanish, amounting to 20,470 words. The distribution of the length of these reviews is presented in Figure 1 and shows that most reviews have a length between 100 and 350 characters.
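Statistics of this kind are straightforward to recompute from the text files. The sketch below is illustrative only: it assumes whitespace tokenisation for word counts (the tokenisation behind the reported figures is not stated) and bins of 50 characters as in Figure 1; both function names are hypothetical.

```python
from collections import Counter


def length_histogram(reviews, bin_width=50):
    """Count reviews per character-length bin; keys are the lower bin edges."""
    counts = Counter((len(text) // bin_width) * bin_width for text in reviews)
    return dict(sorted(counts.items()))


def word_count(reviews):
    """Total number of whitespace-delimited words (an assumed tokenisation)."""
    return sum(len(text.split()) for text in reviews)


# Toy inputs, not corpus data.
sample = ["Nice room.", "x" * 120, "y" * 149, "z" * 150]
print(length_histogram(sample))  # {0: 1, 100: 2, 150: 1}
print(word_count(["Nice room.", "Clean and quiet hotel."]))  # 6
```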

Translation into signed languages
The translations into the three signed languages (i.e. NGT, VGT and LSE) were produced according to the same guidelines (see 'Translation specifications' below), concerning the translators, the types of videos produced, and the availability of the data. All translations were made by deaf translators, which reduced the interference of the source language as much as possible. Note that NGT and VGT were translated from the manually post-edited Dutch, while LSE was translated directly from English.
The Dutch text was translated into NGT and VGT by six deaf professional translators each. For the translations into NGT, reviews were shared among translators as shown in Table 2. Four of the translators for NGT-HoReCo were women; in total, 167 reviews were produced by female signers.

For VGT, five of the translators were recent graduates from KU Leuven's training program for deaf translators and interpreters. Translations were divided among translators by assigning to each of them a (close to) equal number of words. Four translators were female and two were male.
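How the word-balanced division was carried out is not specified. One simple way to obtain a close-to-equal split is a greedy heuristic that always hands the next-longest review to the translator with the fewest words so far. The sketch below is an assumption for illustration, not the procedure used in the project, and assign_by_word_count is a hypothetical name.

```python
import heapq


def assign_by_word_count(reviews, n_translators):
    """Greedy split: give the next-longest review to the least-loaded translator."""
    # Heap entries are (total_words, translator_index); ties resolve by index.
    heap = [(0, i) for i in range(n_translators)]
    heapq.heapify(heap)
    assignment = {i: [] for i in range(n_translators)}
    by_length = sorted(reviews.items(), key=lambda kv: -len(kv[1].split()))
    for review_id, text in by_length:
        load, translator = heapq.heappop(heap)
        assignment[translator].append(review_id)
        heapq.heappush(heap, (load + len(text.split()), translator))
    return assignment


# Toy reviews of 4, 3, 2 and 1 words.
toy = {"r1": "a b c d", "r2": "a b c", "r3": "a b", "r4": "a"}
print(assign_by_word_count(toy, 2))  # {0: ['r1', 'r4'], 1: ['r2', 'r3']}
```

In the toy example both translators end up with five words each.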
Translations into LSE were produced by a single translator, due to the very limited availability of deaf professional LSE translators.
Figure 2 shows the distribution of the lengths of the videos for NGT, VGT and LSE, which are mostly between 10 and 60 seconds. Figure 3 shows an example of the videos and texts of XSL-HoReCo.

Translation specifications
Translators were asked to make the recordings in an everyday-life, quiet environment, with a high quality camera. Each video contains one signer translating one review. Each review has been translated once. A possible future expansion of the corpus could include more translations of the same review to better account for inter-signer variation. Nevertheless, given that the corpus focuses on a single domain, i.e. hospitality, a certain recurrence of topics and signs in different possible combinations is already attested; therefore, even if to a limited extent, it allows inter- and intra-signer variation to be accounted for.
No time constraint was set for the preparation of a translation before video recording. This was done to ensure the quality of the translation and avoid a "simultaneous-interpretation effect": during simultaneous interpretation, interpreters are under time pressure and often need to prioritize efficiency over preserving the complete content of the original message. By having the possibility of preparing the translation beforehand, XSL-HoReCo translators could make sure that the content of the reviews would be preserved as much as possible during the translation process.

The Gold Standard Parallel Corpus of Signed and Spoken Language
The Gold Standard Parallel Corpus of Signed and Spoken Language focuses on spontaneous and semi-spontaneous VGT and its translation into written Dutch.
The GoSt-ParC-Sign project was developed in three phases: data gathering, manual translation, and quality control. All phases were coordinated and overseen by the Vlaams GebarentaalCentrum (Flemish Sign Language Centre).

Phase 1: Data collection
During the first phase, roughly ten hours of publicly available semi-spontaneous VGT videos were initially identified. All VGT material contained in this corpus was produced by deaf authentic signers for a signing audience. Therefore, the quality of the signing is as close as it could possibly be to real-life signing. Written consent was gathered from the authors and signers of these videos to ensure that we would be allowed to redistribute the material.
The final content of the corpus is presented in Table 3 and amounts to just about 10 hours of footage.
Corpus Part                                                     Duration
Spontaneous conversation from the VGT corpus                    3:11:05
Talkshow "Dagelijks Doof"                                       2:24:24
Vlog regarding typical language use in VGT                      1:15:24
Game show "wie wordt miljonair"                                 1:07:00
Various research reports professionally translated into VGT     1:46:06
Opinion pieces in VGT                                           0:13:18
Total                                                           9:57:07
Table 3: Description of the sources of the different parts in the GoSt-ParC-Sign corpus
The footage contains 43 different signers of different ages and different regions, as presented in Table 4. Age groups are presented in Table 5. This information is relevant for future work on this corpus because considerable variation is attested across different regions and age groups. Since the corpus only contains already existing data, we had no real control over the distribution of these sociolinguistic factors.
The data in the two languages are aligned at the sentence (or message) level, since there is no one-to-one correspondence between VGT signs and Dutch words.

Phase 2: Manual translation
The second phase focused on the manual translation task from VGT into Dutch text. Translations were performed by mixed teams of deaf and hearing translators; in total, four deaf and six hearing translators were involved. Having a mixed team had a double purpose: i. ensuring that the original meaning of the signed message was preserved, through the deaf translator; ii. providing a good quality Dutch text, through the native Dutch translator.
Translations were organised in ELAN (Sloetjes & Wittenburg, 2008), which allows multiple annotation tiers to be synchronised with the video timeline. In the ELAN Annotation Format (EAF) file of each video, a 'Translation' tier was created for each of the participants in the video to contain the written Dutch translation (an example of the format is provided in Figure 4). The image shows one tier for each of the four participants in the talkshow; this way, even overlapping utterances could be correctly captured (as seen in the image). Having files in EAF can serve linguistic research; in addition, this format can be easily converted into an ML-suited format with the framework proposed in De Sisto et al. (2022).
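As an illustration of how such per-participant tiers can be read, the sketch below parses a small hand-written EAF fragment with Python's standard XML library. It is not the framework of De Sisto et al. (2022). The tier names (Translation_S1, Translation_S2), the two Dutch sentences and the function read_translations are invented for the example; the element and attribute names follow the EAF XML format, and real files may contain time slots without a TIME_VALUE, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; tier IDs and sentences are invented for illustration.
EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="2600"/>
  </TIME_ORDER>
  <TIER TIER_ID="Translation_S1">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>Goedemorgen.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="Translation_S2">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>Welkom.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_translations(eaf_xml, tier_prefix="Translation"):
    """Return (start_ms, end_ms, tier_id, text) tuples sorted by start time."""
    root = ET.fromstring(eaf_xml)
    # Map each time slot ID to its position on the video timeline (milliseconds).
    times = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")}
    rows = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID", "")
        if not tier_id.startswith(tier_prefix):
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            text = ann.findtext("ANNOTATION_VALUE", default="").strip()
            rows.append((times[ann.get("TIME_SLOT_REF1")],
                         times[ann.get("TIME_SLOT_REF2")], tier_id, text))
    return sorted(rows)


for row in read_translations(EAF):
    print(row)
```

Because each participant has a separate tier, the two annotations in the example can overlap in time (1200 ms starts before 1500 ms ends) without conflict.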
Our initial estimation was that 133 hours of translation work would lead to approximately 9-10 hours of videos being translated. This estimation was based on a consultation with professional signed language to spoken language translators, from which we concluded that 15 minutes of translation work would correspond roughly to one minute of translated video. Unfortunately, during the translation phase we realised that, given the breadth of the topics covered in the videos and the spontaneity of the signing, more time was needed for translating them. In total, between 180 and 220 hours of translation work led to 8 hours of videos being translated. Consequently, in order to reach the target of 10 hours of videos, we opted for including in the corpus additional publicly available footage with subtitles.

Phase 3: Quality control
In the third phase, the quality control of the translations was performed by a professional editor who was not part of the initial translation team, to ensure that the produced translations correctly reported the original message of the VGT videos and that the Dutch texts were of high quality.
This corpus is made available under a CC BY license, at the Instituut voor de Nederlandse Taal (INT) at http://hdl.handle.net/10032/tm-a2-x9 and will soon be made available on the European Language Grid.
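The effort figures reported for the translation phase can be checked with simple arithmetic. The short sketch below uses only the numbers given in the text (133 planned hours at 15 minutes of work per minute of video; 180 to 220 actual hours for 8 hours of video); the function names are hypothetical.

```python
def video_hours_covered(work_hours, minutes_of_work_per_video_minute):
    """Hours of video that a work budget covers at a given work-to-video ratio."""
    return work_hours / minutes_of_work_per_video_minute


def observed_ratio(work_hours, video_hours):
    """Minutes of translation work spent per minute of translated video."""
    return work_hours / video_hours


# Planned: 133 hours of work at 15 minutes per video minute.
print(round(video_hours_covered(133, 15), 2))  # 8.87

# Observed: 180 to 220 hours of work for 8 hours of video.
print(observed_ratio(180, 8), observed_ratio(220, 8))  # 22.5 27.5
```

Under these figures, the observed ratio of roughly 22 to 28 minutes of work per minute of video is about 1.5 to 1.8 times the planned ratio of 15.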

Conclusion
In this paper we have introduced two signed language data collection projects which aim at supporting advances in more inclusive language technology that also targets signed languages. The XSL-HoReCo project led to the creation of a multilingual parallel corpus of NGT, VGT, LSE, English (source text), and written Spanish, Irish and Dutch. The very recently concluded GoSt-ParC-Sign project produced a parallel corpus of authentic VGT videos and their translation into written Dutch. The creation of similar parallel data is fundamental for supporting research and development in fields such as signed language translation, recognition and processing.
In addition, another important outcome of these data collection projects is the set of lessons learned throughout the process and from the challenges encountered, which can be useful for future high quality signed language data collection projects:
• Guidelines for recording a video need to be extremely clear and specific; potential vagueness might lead to differences in the quality, style and type of the recording.
• It is quite difficult, if not impossible, to have an exact estimation of the ratio of translation time needed per hour of signed language videos. Many factors affect the translation process, such as the topics discussed, the spontaneity of the signing, monologue vs. group conversation, potential peculiarities of individual signing styles, etc. In the GoSt-ParC-Sign project, our initial estimation turned out to be dramatically lower than the actual time needed by translators.
• The translation of spontaneous signing can be particularly challenging. Just as with spontaneous speech, unplanned signing might contain unnecessary repetitions, unclear articulations and false starts; in some circumstances, identifying what is being signed can be challenging even for an expert authentic user of a signed language, regardless of whether the user is deaf, hard of hearing or hearing.
• Signers remain at all times the owners of the data they produce.

Figure 1: Distribution of the length (in characters) of the different reviews (in bins of 50 characters).

Figure 2: Distribution of lengths of videos (in seconds) in NGT, VGT and LSE

Table 2: Distribution of videos across NGT-HoReCo signers

Table 5: Age groups (at the time of recording)