Evaluation of the Archivio Vi.Vo Architecture: A Case Study on the Reuse of Legacy Data for Linguistic Purposes

The object of this paper is the evaluation of the Archivio Vi.Vo. architecture, developed within the CLARIN-IT consortium for the preservation and accessible consultation of historical oral archives. Following the first case study employing the Caterina Bueno archive, the goal is now to show that this innovative architecture is also suitable for conducting research on archival data and for hosting a different type of archive. The real use case presented in this contribution employs the Angela Spinelli archive to conduct a sociophonetic investigation of Tuscan vernacular.


Introduction
Thanks to infrastructures such as CLARIN (Common Language Resource and Technology Infrastructure) ERIC, different kinds of data and metadata can be safely stored in federated access repositories to become searchable, accessible in a digital format, standard compliant, interoperable with different tools and software and, most importantly, reusable (Krauwer and Hinrichs, 2014). To reach these goals, the CLARIN consortia are committed to making linguistic resources and tools available for social science research. In this regard, the Italian node CLARIN-IT recently developed the innovative Archivio Vi.Vo. architecture (Calamai et al., 2022) for the preservation, access, dissemination and reuse of historical audio archives. The architecture was conceived as a model for safeguarding the digitized content of analog carriers (preservation copies) and for making use of their recorded material (archival units). The internal structure of Archivio Vi.Vo. was specifically designed to make archival data interoperable, compliant with the CLARIN-IT infrastructure and safely deposited in the CLARIN repository. Along with reuse and dissemination, another reason for developing Archivio Vi.Vo. was the need to ensure the accountability of previous projects that dealt with the digitization and restoration of historical archives, such as Gra.fo (Calamai et al., 2013; Calamai and Biliotti, 2017), and whose results risk becoming inaccessible once their web portals are no longer maintained. Gra.fo was a two-year project jointly conducted by the Scuola Normale di Pisa and the University of Siena (Regione Toscana PAR FAS 2007-2013) with the purpose of discovering, digitizing, cataloguing and partially transcribing oral documents (e.g., oral biographies, ethno-texts, linguistic questionnaires, oral literature, etc.) collected within the Tuscan territory. The aim was to provide first-hand documentation of Tuscan speech varieties and Tuscan oral documents from the 1960s to the present.
In the end, the project digitized nearly 3,000 hours of speech recordings stemming from around 30 oral archives collected by scholars and amateurs in the Tuscan territory.
The first Archivio Vi.Vo. case study was conducted on the Caterina Bueno archive, which is characterized by a complex archival history and contains highly heterogeneous audio material such as ethnomusicological data (Calamai et al., 2022). The idea is now to validate the platform for hosting different types of archives, such as oral history archives, and to see whether an archive collected within another field of study can still fit inside Archivio Vi.Vo. This paper presents a new case study involving the reuse of the Angela Spinelli archive, first including the data in Archivio Vi.Vo. and then optimizing the exploration of its content for conducting a sociophonetic investigation of a residual phenomenon in Tuscan vernacular, directly from the platform interface. The structure of the paper is as follows. Section 2 describes the Angela Spinelli archive. Section 3 illustrates the steps for including a new archive within the Archivio Vi.Vo. architecture. Section 4 presents the benefits that come from using the platform for research purposes. Finally, section 5 presents the conclusions and future perspectives.

The Angela Spinelli Archive
The archive was collected by the historian Angela Spinelli at the beginning of the 1980s. The researcher conducted a series of face-to-face interviews with the inhabitants of small towns in the rural area of Prato along the upper Bisenzio river valley (e.g., Cantagallo, Vaiano, Montemurlo) to ensure the preservation of memories about the historical events of the post Second World War period (Spinelli, 1981). The archive is made up of 59 audio cassettes (corresponding to more than 120 hours of recording) and a large set of accompanying handwritten material consisting of important research protocol information and annotations made by the researcher during her work (Figure 1). The archive was legally acquired from Angela Spinelli and digitized in 2011 within the Gra.fo project (Calamai and Biliotti, 2017). Since the archive is unfortunately currently unavailable for online consultation from the project website, it was recently stored in the Archivio Vi.Vo. architecture. Spinelli, 1998). Her method of investigation belongs to the field of oral history and consisted of eliciting unedited life histories of rural people born between 1880 and 1930 and belonging to different economic groups (e.g., small owners, sharecroppers, tenant farmers, charcoal burners, coal merchants, shepherds). Her research interest was the memory of everything pertaining to rural, material and associative life between the two world wars. Furthermore, she wanted to verify the essence of the long-term peasant mentality, which at that time was being challenged by the development model of small and medium-sized industries. To do so, she analyzed the periods of crisis (war) and transition (post-war) in which the old (rural) mentality was being tested against the new (industrial) one, characterized by better job opportunities (in terms of both salary and physical effort) (Spinelli, 1988).
Another interesting aspect of the work of Angela Spinelli is the use of keywords when interacting with the witnesses, to elicit their memories of the British Second World War refugees and other historical events.
The speakers who participated in the investigation were selected according to their age, origin and other speakers' recommendations. The interview protocol consisted of three days of interviews and one day of pause in which Angela Spinelli would "close the ring": listen to the recordings, recapitulate her work and decide with whom and where to move forward with the interviews (Andreini and Clemente, 2007). The selected subjects could be interviewed either individually or in pairs (e.g., mother and son, husband and wife). The interviews started with the collection of personal details and information on the family and social network of the speakers as a way to put them at ease. From that point, the researcher would invite the speakers to discuss topics such as literacy, their relationship with British/American/South African allies, religion (ceremonies and pilgrimages), fascism, the female condition (working as servants, nurses, nuns), farming (the work cycles of coal, wheat, chestnuts, grape harvesting, work tools), diseases, popular culture (proverbial system, legends, superstitions), dowry and female hereditary succession, emigration, the economy, the family structure, life experiences during the First and Second World Wars, marriage systems, land ownership, the resistance period after the Second World War, and the "inurbation" phenomenon, characterised by people abandoning rural life for a better job and lifestyle in the textile factories and railway construction sites of the nearby city of Prato. As shown in this section, the Spinelli archive is remarkably rich in both the number of hours of recording and the quality of the oral history interviews, and it is therefore well suited for research reuse.
From Preservation Copy to Archival Unit in Archivio Vi.Vo.
The preservation process of a historical archive in the Archivio Vi.Vo. architecture begins with the upload of the digitized archival data to the storage server, followed by the creation of preservation copies within the platform (Calamai et al., 2022). Although the procedure for including archives in Archivio Vi.Vo. is still under development, it can already be anticipated that single sign-on (SSO) access is currently being set up to provide different types of access to the archives inside the platform, according to the user and license. The creation of the preservation copies consists of adding the metadata about the digitized copy of the carriers in Archivio Vi.Vo. The mandatory information is the denomination of the preservation copy and the digitization date. The creation date is automatically detected by the system, whereas additional information in this section can include the names of the people responsible for digitization and of the preservation copy supervisor(s). The required technical information (regarding the original carrier) is: the signature(s) in the original archive, the owner, the container brand and model, the flange brand and model, the tape brand and model, the tape width, the carrier type, the carrier conditions before digitization, the restoration operations conducted before digitization, and additional miscellaneous information (Figure 3). At the end of this process, it is also possible to add photographic material about the carrier (Figure 4). The status of completion of the preservation copies is indicated by a yellow (in progress) or green (completed) dot next to each file. The creation process can be paused and resumed until the final confirmation, after which the preservation copy is no longer editable. Most importantly, the deletion of a preservation copy created in Archivio Vi.Vo. does not entail the deletion of the file from the storage server.
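The metadata workflow described above can be pictured as a small record type. The following Python sketch is purely illustrative and is not the Archivio Vi.Vo. implementation: the class name, the field names and the `update`/`confirm` methods are assumptions introduced here, and only a subset of the technical fields is shown.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import Optional


@dataclass
class PreservationCopy:
    """Hypothetical record for one preservation copy (names are assumptions)."""

    # Mandatory information according to the text.
    denomination: str
    digitization_date: date
    # The platform detects the creation date itself; here it defaults to today.
    creation_date: date = field(default_factory=date.today)
    # Additional and technical metadata (illustrative subset only).
    digitized_by: list[str] = field(default_factory=list)
    supervisors: list[str] = field(default_factory=list)
    original_signatures: list[str] = field(default_factory=list)
    carrier_type: Optional[str] = None
    tape_brand_model: Optional[str] = None
    confirmed: bool = False

    @property
    def status(self) -> str:
        # Yellow dot: in progress. Green dot: completed.
        return "green" if self.confirmed else "yellow"

    def update(self, **changes) -> None:
        """Edit metadata; refused once the copy has been confirmed."""
        if self.confirmed:
            raise PermissionError("a confirmed preservation copy is not editable")
        editable = {f.name for f in fields(self)} - {"confirmed", "creation_date"}
        for key, value in changes.items():
            if key not in editable:
                raise AttributeError(f"unknown or read-only field: {key}")
            setattr(self, key, value)

    def confirm(self) -> None:
        """Final confirmation: the record becomes read-only."""
        self.confirmed = True
```

In this sketch a record can be edited any number of times while its status is "yellow"; after `confirm()` the status turns "green" and further calls to `update` raise an error, mirroring the pause, resume and final confirmation behaviour described in the text.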
Between the creation of the preservation copies and the conversion of the audio files into the high-quality compressed FLAC format, the architecture will soon provide access to the restoration interface (Calamai et al., 2022). The Archivio Vi.Vo. section that follows is called description and is dedicated to accessing the content of the preservation copies and creating the archival units. This step is necessary to make the contents of the archive searchable and reusable. As explained in Calamai et al. (2022), selecting clips from the audio files is needed because multiple communicative events (such as conferences, interviews, concerts, etc.) might have been recorded on a single carrier. Conversely, a single communicative event might have been recorded on more than one carrier. The communicative events are the various possible contents that can be found inside the recordings. In the description interface, the audio can be played in the order in which it was recorded. Clips are selected by marking the beginning and end boundaries of a communicative event, either using the slider or manually entering the time values. Once the clips are correctly segmented, descriptive annotations can be added to create the regesto (archival record) in alignment with the audio. For example, the annotations can correspond to the various topics covered during an interview (Figure 5). The final result is a data structure in which the various contents of each preservation copy are neatly segmented and described. These segmented and described contents, known as archival units, can finally be safely employed for different purposes.
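The relation between preservation copies, clips and archival units can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the platform's internal data model: the names `Clip`, `RegestoEntry`, `ArchivalUnit` and `to_seconds` are hypothetical, and time is simplified to whole seconds.

```python
from dataclasses import dataclass, field


def to_seconds(timestamp: str) -> int:
    """Convert a manually entered 'hh:mm:ss' or 'mm:ss' value into seconds."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


@dataclass
class Clip:
    copy_id: str  # preservation copy the clip is cut from
    start: int    # seconds from the beginning of that file
    end: int

    def __post_init__(self):
        if not 0 <= self.start < self.end:
            raise ValueError("clip boundaries must satisfy 0 <= start < end")


@dataclass
class RegestoEntry:
    offset: int   # seconds from the start of the archival unit
    note: str     # descriptive annotation, e.g. the topic being discussed


@dataclass
class ArchivalUnit:
    """One communicative event, possibly spread over several carriers."""

    title: str
    clips: list[Clip] = field(default_factory=list)
    regesto: list[RegestoEntry] = field(default_factory=list)

    @property
    def duration(self) -> int:
        return sum(clip.end - clip.start for clip in self.clips)

    def annotate(self, offset: int, note: str) -> None:
        """Add an annotation aligned with the audio, keeping time order."""
        if not 0 <= offset < self.duration:
            raise ValueError("annotation falls outside the archival unit")
        self.regesto.append(RegestoEntry(offset, note))
        self.regesto.sort(key=lambda entry: entry.offset)
```

Because an archival unit holds a list of clips, each pointing to its own preservation copy, the same structure covers both cases mentioned in the text: several events on one carrier (several units referring to the same copy) and one event spread over several carriers (one unit with clips from different copies).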

The Case Study
There are many benefits to reusing data, such as replicating studies, validating research claims, and achieving new discoveries, even across different disciplines. In the case of legacy data, many scholars in the social sciences have shown how reusing (oral) historical archives can help to validate research claims and to investigate past, rare or disappeared phenomena that can no longer be elicited. Some examples of linguistic investigations reusing historical archives that were collected for different research purposes are De Fina (2000), Bornat (2003), Van De Mieroop (2009), Schiffrin (2009), Roller (2015), Braber and Davies (2016), Renwick and Olsen (2017) and Nodari and Calamai (2021).
As mentioned in section 1, employing the Archivio Vi.Vo. architecture can benefit research projects on archival data. In this case, the aim was to explore the content of the Angela Spinelli archive to investigate the presence of a residual phonetic phenomenon in Tuscan vernacular speech and, similarly to Nodari and Calamai (2021), to find correlations with the social variables characterizing the speakers. The phenomenon under consideration is the phonetic reduction of the double (geminate) consonant /r:/ (rr) into a single /r/ in intervocalic contexts. The analysis is carried out on a set of words containing /r:/ in intervocalic position, which include (but are not limited to) the keywords and topics addressed by the researcher during the interviews (e.g., terra (land) > tera, guerra (war) > guera). In past research on the description of Italian dialects along the La Spezia-Rimini isogloss, some authors reported the presence of rhotic degemination phenomena in the nearby territories investigated by Angela Spinelli (Giannelli, 1976; Rohlfs, 1966). However, until recently the phenomenon has been under-investigated, especially from a (socio)phonetic perspective, in the Italian and Tuscan scientific literature (Celata et al., 2019). Finding evidence of rhotic degemination in a phonetic laboratory would now be impossible, since this phenomenon is the residual vestige of a past tendency challenged by the diffusion of more standard-like pronunciations that maintain the singleton-geminate contrast. Hence the need to look for degemination phenomena within oral history archives. As in Nodari and Calamai (2021), the assumption is that past oral data collected within the area, in which participants narrate emotional events such as war or life-threatening situations and during which their speech is potentially less controlled and more spontaneous, could favor the presence of past non-standard forms.
This is because during such interview settings the speakers focus more on what is being said rather than how (Labov, 1963).
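As an illustration of how candidate words for this analysis could be listed from any orthographic transcript or word list, the short Python sketch below selects tokens containing <rr> between two vowels. It is an assumption-laden aid and not part of the study's protocol: the function name and the vowel inventory are chosen here for illustration, and the pattern only identifies sites where degemination could occur. Whether a token was actually degeminated can only be established by listening and by acoustic measurement.

```python
import re

# Orthographic vowels of Italian, including accented forms (assumed inventory).
VOWELS = "aeiouàèéìíòóùú"
INTERVOCALIC_RR = re.compile(rf"[{VOWELS}]rr[{VOWELS}]", re.IGNORECASE)


def candidate_words(tokens):
    """Return tokens with <rr> between vowels, in order, without duplicates."""
    seen = set()
    result = []
    for token in tokens:
        word = token.lower()
        if word not in seen and INTERVOCALIC_RR.search(word):
            seen.add(word)
            result.append(word)
    return result
```

For instance, applied to the tokens `["la", "guerra", "e", "la", "terra"]`, the function keeps only guerra and terra, the two keywords cited in the text.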
Once the data had been secured and the recordings identified and described according to the various topics with the help of the accompanying material (Monachini et al., 2021), the consultation of the interviews through the regesto was rapid and accurate. For example, to look for keywords such as terra (land) and guerra (war), it was decided to trace back the parts of the interviews in which the speakers discussed war events and land cultivation, where the keywords had a higher chance of appearing. Then, to verify the presence of the keywords within the selected interview portions, the corresponding audio was listened to while noting whether or not the keywords were present. Subsequently, the parts of the interviews containing the keywords required transcription and phonetic annotation, the only two operations that could not be carried out from the platform. For the purposes of the research, the transcription was carried out manually and only for the relevant discourse segments, whereas the data imported into PRAAT (Boersma, 2001) were manually traced back and extracted from the original preservation copy files. Correlations between the phenomenon and the social variables of the speakers were, once again, traced by listening to the audio and consulting the regesto. In this case, the sections to look for were those in which the speakers, for example, introduced themselves and talked about their family. The main advantage Archivio Vi.Vo. brought to this case study was the optimization of the research protocol through a reduction in the time required to consult the data. Without the support of the architecture, the whole process, involving listening to the recordings, transcribing them and only then working on the data, would have been much more time-consuming and fragmented.
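The topic-driven consultation described above amounts to a simple lookup over time-aligned annotations. The Python sketch below shows the idea under stated assumptions: the regesto is represented as a list of `(offset_in_seconds, note)` pairs sorted by offset, and the function name `find_segments` is hypothetical. It does not reproduce the search functions of the platform interface.

```python
def find_segments(regesto, duration, topics):
    """Return (start, end, note) for every annotation mentioning a topic.

    regesto:  list of (offset_in_seconds, note) pairs, sorted by offset
    duration: total length of the archival unit in seconds
    topics:   search terms matched case-insensitively against the notes
    """
    wanted = [topic.lower() for topic in topics]
    hits = []
    for index, (offset, note) in enumerate(regesto):
        # A segment runs until the next annotation, or to the end of the unit.
        if index + 1 < len(regesto):
            end = regesto[index + 1][0]
        else:
            end = duration
        if any(topic in note.lower() for topic in wanted):
            hits.append((offset, end, note))
    return hits
```

The returned time spans indicate which portions of the audio are worth listening to for a given keyword, so that only those portions need to be checked, transcribed and exported for phonetic analysis.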
The study is still ongoing, and comprehensive results are not yet available. However, a preliminary analysis of part of the occurrences found in the Spinelli archive seems to indicate the presence of degemination in the Prato area and also suggests correlations with social and demographic variables. These preliminary findings indicate that historical oral archives can offer useful evidence for sociophonetic research, and they call for further investigation to fully understand the dynamics of degemination in this context and its relation to social and historical factors.

Discussion and Conclusion
The object of this article was to validate the suitability of the Archivio Vi.Vo. architecture for preserving different types of archives as well as for favoring their reuse for different purposes. After the first Archivio Vi.Vo. case study involving the Caterina Bueno archive, the architecture required validation for hosting different types of archives and supporting research projects. This paper presented a second case study involving the reuse of the Angela Spinelli archive, first including the data in Archivio Vi.Vo. and then optimizing the exploration of its content for conducting a sociophonetic investigation of a residual phenomenon in Tuscan vernacular, directly from the platform interface. In addition, the data are finally available for consultation, which allows for the replication of the analysis, something that is exceedingly rare in phonetics research.
What this case study shows is that the platform is indeed suitable for hosting a different type of (oral) archive and the related accompanying material, as well as for helping to optimize research protocols as far as data consultation is concerned. The main feature that emerged in this work, and that makes the platform a valuable research tool, is its ability to provide a single environment for all the relevant archive material. This reduces the risk of material dispersion and facilitates the search for correlations between oral data and accompanying material. In the case study presented here, both of these aspects allowed for a quick and accurate retracing of the various details of different oral narratives, such as the reconstruction of the social network of different speakers. The system does not (yet) provide the ability to extract statistical information about the data from the platform, but this will be taken into consideration for further developments. However, the choice of which statistical variables to collect from the platform for quantitative analysis strongly depends on the research goals. In this case, one of the variables of interest was the duration (in milliseconds) of the consonants and preceding vowels, which could not be extracted and measured directly from the architecture and required the use of PRAAT. A first step towards linking external tools to Archivio Vi.Vo. could be the integration of already existing resources within CLARIN, such as the Transcription Portal for Interview Data (Draxler et al., 2020). Given the benefits obtained in terms of saving time and resources with Archivio Vi.Vo., it is possible to conclude that the platform is suitable for supporting research investigations dealing with oral archives, without excluding the possibility of implementing new functions in the future.
Selected papers from the CLARIN Annual Conference 2022
As more (and different types of) archives are uploaded to the platform, more validation examples will become available, and it is expected that this will lead to the implementation of additional features that further expand the use of Archivio Vi.Vo. as a research tool.
Future perspectives regarding Archivio Vi.Vo. concern: i) incorporating SSO federated access into the platform, ii) completing the restoration interface, iii) implementing the possibility of exporting audio data directly from the platform into software for data transcription and for phonetic analysis and annotation, and iv) promoting the use of the platform within the training activities of the CLARIN and CLARIN-IT consortia.