Developing Resources for Measuring Text Readability in Sesotho

This article presents a work-in-progress doctoral project that explores measuring text readability in Sesotho, a Bantu language spoken by more than 10 million speakers across Southern Africa. The main project adopts a classical readability formulas approach to text readability analysis. We aim to adapt nine existing readability metrics into Sesotho using English as a higher-resourced helper language. So far, five resources have been developed as part of the study. The rule-based and the TeX-based syllabification systems, the syllable-annotated wordlist, and the grade 12 exam reading comprehension and summary writing corpus have been published on the South African Centre for Digital Language Resources’ (SADiLaR) online repository. The machine-translated corpus is still under development. This article describes the progress of the PhD project by overviewing the basic digital language resources developed for the project. The metrics under consideration for adaptation into Sesotho are also briefly discussed.


Introduction
Automated text readability evaluation has been applied in different application domains such as finding educational materials (Collins-Thompson, 2014). The scholarship of text readability has continued for over a century (Collins-Thompson, 2014; De Clercq and Hoste, 2016). To date, more than 200 metrics have been developed (DuBay, 2004). However, indigenous African languages have been neglected in this area (Sibeko and Van Zaanen, 2021). As a result, in languages where no metrics are available, matching texts such as books for learning and teaching to readers at the right level of readability depends on each assessor's intuition. Intuition-based choices are likely to be flawed, inconsistent and influenced by the content of the text. Although textbooks are the most important teaching material in language teaching discourse, inappropriate textbook use can deskill both language teachers and language learners (Mohammed et al., 2022). Deskilling can be propelled by a mismatch between textbooks and intended readers (Zamanian and Heydari, 2012). Such dissonances can result from incorrect levels of readability, which can be expected when texts are chosen based on intuition. To reduce the chances of deskilling the language reader, it is essential to have a system for objectively estimating the readability of reading texts of varied lengths such as textbooks and comprehension texts.
This article reports on a work-in-progress doctoral project on measuring text readability in Sesotho using classical readability metrics. A brief overview of writing in Sesotho is presented (Section 2), followed by a brief synopsis of the classical readability metrics that are adapted in the main project (Section 3). Then the relevance of the project to CLARIN via SADiLaR is briefly highlighted (Section 4), followed by a discussion of the resources produced as part of the project (Section 5). The article concludes with a discussion of future work (Section 6).

Contextualising Sesotho
Sesotho is used by more than ten million speakers in a few countries in Southern Africa including South Africa, Lesotho, and Zimbabwe (Marupi and Charamba, 2022; Sibeko and Setaka, 2022). In fact, it is one of the official languages in Lesotho, Zimbabwe, and South Africa. It is used as a language for learning and teaching in these countries as either a mother tongue, second language, or marginalised language. It is also used for media, political, religious, and other uses. Even so, a recent investigation of the Sesotho Basic Language Resource Kit (BLARK) content has revealed that there is a severe shortage of digital language resources available for Sesotho (Sibeko and Setaka, 2022). As such, Sesotho remains a low-resourced language (Roux and Bosch, 2019; Sibeko and Setaka, 2022). Consequently, automating the process of objectively investigating text readability in Sesotho using classical readability metrics requires the development of a few basic language resources.
In addition to the lack of necessary resources which hinders the development of objective automated metrics for measuring text readability in Sesotho, there are two widely recognised orthographies for Sesotho. The two orthographies are differentiated by the two countries with the most speakers of Sesotho. They are therefore labelled accordingly as the South African Sesotho (SAS) orthography and the Lesotho Sesotho (LS) orthography. The main differences between the two orthographies include the use of w and y in the SAS orthography as opposed to the use of o and e in the LS orthography for representing semivowels. This is exemplified in example 1 below.
(1) Ke
The LS orthography also uses l in place of d, and c in place of the digraph tj. The differences are exemplified in example 2 below.
(2)
Although differences such as preferences for certain single letters do not affect the results of the metrics adapted in the main project, the different representations of the semivowels will affect syllable identification. Furthermore, the use of single letters in one orthography and the use of digraphs in another orthography may affect average word lengths. Beyond orthography, there may be region-based vocabulary variations in Sesotho. For instance, see Lemeko (2018) for a discussion of region-based variations of SAS. Nonetheless, these variations are not within the scope of the current project. The research focus is limited to the effects of orthographic conventions.
Selected papers from the CLARIN Annual Conference 2022

Classical Readability Metrics
The doctoral project described in this article focuses on how text readability could be measured in Sesotho. An automated process for measuring text readability in Sesotho is desired. We believe that classical readability metrics are a good place to start. We identified a total of nine classical readability formulas for adaptation into Sesotho. These metrics are used in the Python 3.2 readability package. All nine metrics have been used in previous research on South African educational texts, for instance, see Sibeko and Van Zaanen (2021). The readability metrics that we hope to include in our web-based platform for measuring Sesotho text readability are briefly described below. In-depth discussions of these metrics are provided elsewhere, for instance, see Heydari (2012) and Zamanian and Heydari (2012).

Syllable-Based Metrics
Four of the readability metrics identified are based on syllable information. The metrics are described below.

Flesch-Kincaid Grade Level (FKGL)
The FKGL uses US grades for labelling readability levels. For example, a score of 10 corresponds to the tenth grade (Toyama et al., 2017). The FKGL metric uses the following formula:

$$\mathrm{FKGL} = 0.39\left(\frac{\#\text{tokens}}{\#\text{sentences}}\right) + 11.8\left(\frac{\#\text{syllables}}{\#\text{tokens}}\right) - 15.59$$

The process for calculating readability follows four steps. First, the total number of words is divided by the total number of sentences and multiplied by the weight given to sentence difficulty, i.e., 0.39. Second, the total number of syllables is divided by the total number of words and multiplied by 11.8, which is the weight given to average word difficulty, that is, the average number of syllables per word. In the third step, the resulting numbers from the first and the second steps are added together. Finally, 15.59 is subtracted from the result of step 3 (Boles et al., 2016).
The FKGL metric was developed for the United States Navy. However, it is suitable for use in multiple contexts including educational contexts (Zhang et al., 2019).
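The four steps can be sketched in Python as follows. This is a minimal illustration that assumes the token, sentence, and syllable counts have already been obtained; the function name `fkgl` and its parameters are hypothetical and not taken from any package, and the weights are the English ones given above.

```python
def fkgl(n_tokens: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts, using the English weights."""
    if n_tokens <= 0 or n_sentences <= 0:
        raise ValueError("token and sentence counts must be positive")
    # Step 1: average sentence length, weighted by 0.39.
    sentence_term = 0.39 * (n_tokens / n_sentences)
    # Step 2: average syllables per token, weighted by 11.8.
    word_term = 11.8 * (n_syllables / n_tokens)
    # Steps 3 and 4: add the two terms and subtract the constant.
    return sentence_term + word_term - 15.59
```

For a text with 100 tokens, 5 sentences, and 150 syllables, `fkgl(100, 5, 150)` evaluates to roughly 9.91, that is, just below the tenth grade.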

Flesch Reading Ease (FRE)
Flesch's (1948) FRE is calculated using the following formula:

$$\mathrm{FRE} = 206.835 - 1.015\left(\frac{\#\text{tokens}}{\#\text{sentences}}\right) - 84.6\left(\frac{\#\text{syllables}}{\#\text{tokens}}\right)$$

The FRE formula outputs scores between zero and 100 (Flesch, 1948). While a text with a score of 100 should be easily readable to a language learner with a fourth-grade education, a text with a score of 0 requires at least a college graduate level for reading with ease.
FRE is one of the most used classical readability formulas. In fact, when combined, the FKGL and FRE can be used for both first and second-language texts (Greenfield, 2004). Both FRE and FKGL are integrated into Microsoft Office (Bendová, 2021). As a result, they can be easily used by anyone who uses Microsoft Office products. Furthermore, the FRE metric is the most adapted to other languages (Bendová and Cinková, 2021). For instance, it has been adapted to Italian, French, Spanish, German, Russian, Danish, Bangla, Hindi, and Japanese.
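As an illustration, FRE can be computed from the same counts as FKGL. The sketch below assumes the standard published English form of the formula, in which both the sentence-length term and the syllable term are subtracted from 206.835; the function name `fre` is hypothetical.

```python
def fre(n_tokens: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch Reading Ease from raw counts, using the English weights."""
    if n_tokens <= 0 or n_sentences <= 0:
        raise ValueError("token and sentence counts must be positive")
    avg_sentence_length = n_tokens / n_sentences
    avg_syllables_per_token = n_syllables / n_tokens
    # Longer sentences and longer words both lower the ease score.
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_token
```

With 100 tokens, 5 sentences, and 150 syllables, `fre(100, 5, 150)` evaluates to roughly 59.6. Because Sesotho words may carry more syllables on average than English words, unadapted weights could push scores towards the low end of the scale, which is one motivation for adapting the weights.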

Gunning Fog Index (GFI)
The GFI identifies foggy words, which are words comprised of more than two syllables (Zhang et al., 2019; Gunning, 1952; Gunning, 1969; Gunning, 2003). The GFI follows four steps. First, the number of words used per sentence is averaged. Second, the number of foggy words is counted. Third, the percentage of foggy words in the sample is calculated. Finally, the totals are added and multiplied by 0.4 (Eleyan et al., 2020). The following equation is used in the GFI:

$$\mathrm{GFI} = 0.4\left[\left(\frac{\#\text{tokens}}{\#\text{sentences}}\right) + 100\left(\frac{\#\text{foggy words}}{\#\text{tokens}}\right)\right]$$

The readability score generated by the English equation above typically falls within the range of 6 to 20. A score of 6 indicates that the text is suitable for a sixth-grade reading level, while a score of 20 or higher suggests that the text is appropriate for advanced readers, such as those in university postgraduate programs.
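The four GFI steps can be sketched as below. The sketch assumes that per-word syllable counts are already available (for instance from a syllabification system) and uses the commonly cited English form of the index, 0.4 times the sum of the average sentence length and the percentage of foggy words. The names `count_foggy` and `gfi` are hypothetical.

```python
def count_foggy(syllables_per_word: list[int]) -> int:
    """Count words with more than two syllables."""
    return sum(1 for n in syllables_per_word if n > 2)


def gfi(n_tokens: int, n_sentences: int, n_foggy: int) -> float:
    """Gunning Fog Index from raw counts, using the English weight of 0.4."""
    if n_tokens <= 0 or n_sentences <= 0:
        raise ValueError("token and sentence counts must be positive")
    avg_sentence_length = n_tokens / n_sentences  # step 1
    percent_foggy = 100 * n_foggy / n_tokens      # steps 2 and 3
    return 0.4 * (avg_sentence_length + percent_foggy)  # step 4
```

For 100 tokens in 5 sentences with 10 foggy words, `gfi(100, 5, 10)` evaluates to 12.0.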

Simple Measure of Gobbledygook (SMOG)
When calculating SMOG for long texts, three samples are used, one from the beginning of the text, one from the middle, and one from the end of the text (McLaughlin, 1969; Zhou et al., 2017). Each sample comprises ten sentences. The samples are used to calculate SMOG using the following formula:

$$\mathrm{SMOG} = 1.043\sqrt{\#\text{polysyllabic words} \times \frac{30}{\#\text{sentences}}} + 3.1291$$

Polysyllabic words as indicated in the formula refer to words with more than two syllables (Kasabwala et al., 2012). The SMOG formula also outputs US grade levels.
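The sampling procedure and the calculation can be sketched as follows. The sketch assumes the text is already split into sentences, and it uses the commonly cited form of SMOG with the constants 1.043 and 3.1291; those constants and the names `sample_sentences` and `smog` are assumptions for illustration rather than details confirmed by the project.

```python
import math


def sample_sentences(sentences: list, size: int = 10) -> list:
    """Take `size` sentences each from the start, middle, and end of a text."""
    n = len(sentences)
    if n < 3 * size:
        return list(sentences)  # too short to sample: use the whole text
    middle_start = (n - size) // 2
    return (
        list(sentences[:size])
        + list(sentences[middle_start:middle_start + size])
        + list(sentences[-size:])
    )


def smog(n_polysyllabic: int, n_sentences: int) -> float:
    """SMOG grade, normalising the polysyllable count to 30 sentences."""
    if n_sentences <= 0:
        raise ValueError("sentence count must be positive")
    return 1.043 * math.sqrt(n_polysyllabic * 30 / n_sentences) + 3.1291
```

Normalising to 30 sentences keeps the score comparable when a short text yields fewer than three full samples.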

Word-Length-Based Metrics
Four of the selected classical readability metrics are based on word lengths. The four metrics are described below.

Läsbarhetsindex (Lix) and Rate Index (Rix)
Lix was originally developed for Swedish (Björnsson, 1983). For accessibility, we use the English version as a point of reference. It is suggested that ten samples comprising ten sentences each be analysed when estimating both Lix and Rix (Anderson, 1983). The Lix and the Rix formulas pay special attention to 'long words,' that is, words that have more than six characters. The Lix formula is presented in table 1. The Lix formula outputs numbers that are then converted to grade levels. Anderson (1983) states that fractions can be ignored. This is particularly important in instances where rounding the scores changes the predicted grade. For instance, when a Lix score of 47.99 is rounded to 48.0, the predicted grade level changes from the 11th to the 12th grade.
The Rix metric is an adaptation of the Lix metric (Courtis, 1987;Anderson, 1983). It considers the ratio of long words to the number of sampled sentences. While shorter texts may be considered as a whole, longer texts may use sentence sampling methods. The Rix metric assigns grade levels through the formula presented in table 1.
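Both metrics can be sketched from token lists and counts. The sketch assumes the commonly cited definitions, namely Lix as the average sentence length plus the percentage of long words, and Rix as the number of long words per sentence; it is assumed, not confirmed, that these match the formulas in table 1. The function names are hypothetical.

```python
def count_long_words(tokens: list[str], threshold: int = 6) -> int:
    """Count tokens with more than `threshold` characters."""
    return sum(1 for token in tokens if len(token) > threshold)


def lix(n_tokens: int, n_sentences: int, n_long: int) -> float:
    """Lix: average sentence length plus percentage of long words."""
    if n_tokens <= 0 or n_sentences <= 0:
        raise ValueError("token and sentence counts must be positive")
    return n_tokens / n_sentences + 100 * n_long / n_tokens


def rix(n_long: int, n_sentences: int) -> float:
    """Rix: number of long words per sentence."""
    if n_sentences <= 0:
        raise ValueError("sentence count must be positive")
    return n_long / n_sentences
```

For example, of the tokens `ntate`, `tshwela`, and `mangmang`, only the last two exceed six characters. Note that the six-character threshold is sensitive to orthography, since a digraph in one orthography may correspond to a single letter in the other.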

Coleman-Liau Index (CLI)
The CLI metric also uses a sampling method (Coleman and Liau, 1975). First, the text is divided into shorter samples of 100 words each. Second, the samples are counted. Third, the number of characters in each word from the samples is calculated. Fourth, the number of characters per word is divided by the number of samples. Fifth, the number of sentences is counted. Sixth, the number of sentences is divided by the number of samples. Finally, the results are applied to the following formula:

$$\mathrm{CLI} = 0.0588\left(\frac{\#\text{letters}}{\#\text{samples}}\right) - 0.296\left(\frac{\#\text{sentences}}{\#\text{samples}}\right) - 15.8$$

According to Coleman and Liau (1975), samples should end with complete sentences. As a result, CLI samples may contain a little less or more than 100 words depending on the last complete sentence sampled.
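A minimal sketch of the final step, assuming the letters and sentences have already been counted across all 100-word samples. The function name `cli` and its parameters are hypothetical.

```python
def cli(n_letters: int, n_sentences: int, n_samples: int) -> float:
    """Coleman-Liau Index from totals counted over 100-word samples."""
    if n_samples <= 0:
        raise ValueError("at least one sample is required")
    letters_per_sample = n_letters / n_samples      # letters per 100 words
    sentences_per_sample = n_sentences / n_samples  # sentences per 100 words
    return 0.0588 * letters_per_sample - 0.296 * sentences_per_sample - 15.8
```

For a single sample with 450 letters and 5 sentences, `cli(450, 5, 1)` evaluates to roughly 9.18.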

Automated Readability Index (ARI)
ARI is derived from fractions representing predictions of word and sentence difficulty (Kaur et al., 2018; Smith and Senter, 1967). The process follows a few steps. First, sentence lengths are averaged and multiplied by 0.5. Second, word lengths are averaged and multiplied by 4.7. Third, the totals are combined and 21.43 is deducted. The grade level is assigned through the following formula:

$$\mathrm{ARI} = 4.7\left(\frac{\#\text{letters}}{\#\text{words}}\right) + 0.5\left(\frac{\#\text{words}}{\#\text{sentences}}\right) - 21.43$$

Letters, as used in CLI and ARI, refer to all letters and numbers that build words (Zhang et al., 2019). Thomas et al. (1975) describe it as strokes representing each word.
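The ARI steps can be sketched as follows. The sketch counts letters and digits with Python's `str.isalnum`, which is one possible reading of "all letters and numbers that build words"; the names `count_letters` and `ari` are hypothetical.

```python
def count_letters(word: str) -> int:
    """Count the letters and digits in a word, ignoring punctuation."""
    return sum(1 for ch in word if ch.isalnum())


def ari(n_letters: int, n_words: int, n_sentences: int) -> float:
    """Automated Readability Index from raw counts (English weights)."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    avg_word_length = n_letters / n_words
    avg_sentence_length = n_words / n_sentences
    return 4.7 * avg_word_length + 0.5 * avg_sentence_length - 21.43
```

For 450 letters in 100 words and 5 sentences, `ari(450, 100, 5)` evaluates to roughly 9.72.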

Frequency-List-Based Metric
One frequency-list-based metric was identified from Python 3.2's readability package. The metric is described below.

Dale-Chall Index (DCI)
The DCI metric uses a frequency list (Dale and Chall, 1948). The frequency list is based on a list of 3000 words that a grade 4 learner is expected to be familiar with (Stocker, 1971). Difficult words are considered as those that do not appear in the list. Variations include words in plural forms, verbs that end in -s, -ed, -ing, and -ied, adverbs that end in -ly, names of both people and organisations [note that organisation names are counted only two times per 100-word sample], abbreviations, and compound words (Barry and Stevenson, 1975). DCI is computed from the percentage of difficult words and the average sentence length. It is advisable to use the whole text when texts are too short for sampling. For longer texts, one may sample four sets of 100 words per 2000 words (Barry, 1980; Dale and Chall, 1948).
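The frequency-list lookup at the heart of the DCI can be sketched as below. The sketch covers only the identification of difficult words; it does not implement the handling of variations such as plurals or names, and it deliberately leaves out the DCI weights. The function name `percent_difficult` and the tiny word list in the usage note are hypothetical.

```python
def percent_difficult(tokens: list[str], familiar_words: set[str]) -> float:
    """Percentage of tokens that do not appear in the familiar-word list."""
    if not tokens:
        raise ValueError("at least one token is required")
    # Compare case-insensitively; the list is assumed to be lower-case.
    difficult = sum(1 for token in tokens if token.lower() not in familiar_words)
    return 100 * difficult / len(tokens)
```

For instance, with the familiar words `{"the", "cat", "sat"}`, the tokens `["The", "cat", "sat", "quixotically"]` yield 25.0 per cent difficult words. An equivalent lookup for Sesotho would require a list of frequently used Sesotho words.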

Summary
Two important things can be noted from the nine classical readability metrics briefly overviewed in this article. First, the metrics use specific processes for estimating appropriate readability and grade levels. It is important to consider these processes. For instance, this is useful when considering the minimal corpus size necessary for adapting the metrics to Sesotho. Second, the metrics use specific weights that may need to be adapted to Sesotho. For instance, syllable lengths may have minimal effect on the level of readability and therefore need to carry minimal weighting. Additionally, it is not possible to investigate the effect of syllable lengths on Sesotho texts without a system for identifying the syllables. For this reason, we need the resources that are developed in the doctoral project discussed in this article.
Moreover, it is important to consider the expected outputs from the formulas. This is most important for formulas that do not output grade levels, for instance, the FRE, Lix and Rix. For these metrics, we may need to redefine the conversions to suit the context of Sesotho and to reflect South Africa's grade levels. In spite of being used in multiple contexts, the classical readability metrics have been criticised for failing to measure comprehension (Tanprasert and Kauchak, 2021). Furthermore, the use of frequency lists, such as in the DCI list of common words, has been criticised for failing to account for specialised meanings (Yan et al., 2006). Even so, the classical readability metrics remain relevant to our project since our focus is not on meaning or comprehension but on the ease with which the text can be read.

Relevance to SADiLaR
This PhD project is conducted at North-West University, which hosts the South African Centre for Digital Language Resources (SADiLaR). SADiLaR is an observer at the CLARIN European Research Infrastructure Consortium. North-West University functions as a hub of a network of linked nodes for SADiLaR. There are currently six nodes including four universities and two independent research entities. SADiLaR is a national centre supported by the South African Department of Science and Innovation as part of the South African Research Infrastructure Roadmap (Wilken et al., 2018; Roux and Bosch, 2019). It has an enabling function with a focus on all official languages of South Africa (Roux and Ndinga-Koumba-Binza, 2019). It supports research and development in language technologies and language-related studies in the humanities and social sciences. The centre impacts three domains, namely, (i) humanities and social sciences, (ii) language technology, and (iii) socio-economic domains. This doctoral project benefits from SADiLaR's humanities and social sciences domain, which focuses on building research capacity. For instance, several capacity-building training opportunities were freely provided by SADiLaR. The project also benefits from the language technology domain, which focuses on the development of high-level resources and NLP tools for use in applications. For instance, some of the resources discussed in this article resulted from collaborative work with experts from SADiLaR. Furthermore, the project has contributed to the development of digital language resources as part of the main project in adapting classical readability metrics to Sesotho.

Developing Resources for Sesotho
This section describes five resources developed for the project. Two syllabification systems are described, followed by three annotated datasets.

Syllabification Systems
A survey of Sesotho digital language resources listed on SADiLaR's repository web interface indicated the absence of syllabification systems for Sesotho (Sibeko and Setaka, 2022). For this reason, previous assessments of the readability of Sesotho texts using classical readability metrics relied on the manual extraction of textual properties such as syllable information. Sadly, using annotators to manually extract Sesotho syllable information from written texts is laborious (Krige and Reid, 2017). Additionally, reliance on such manual methods for extracting textual properties would not suffice for the envisaged automated tool. For this reason, two syllabification systems were developed. The systems are briefly described below.
As a tonal language, Sesotho carries tone on vowels and nasal consonants (Guma, 1982; Sekere, 2004; Mohasi et al., 2011). According to Guma (1982), nasal consonants, that is, two simple nasal consonants n and m and two complex nasal consonants ŋ and ñ, and the lateral consonant l can occur as syllables. Furthermore, vowels [ɑ, e, i, o, u, ɪ, ɛ, ɔ, and ʊ] can function as syllables (V). The vowel-only syllables can occur at word-initial, word-medial, and word-final positions. Nasal consonant-only (C) syllables can also occur in these positions. However, only the complex nasal consonant ŋ can occur at word-final position (Demuth, 2007). Finally, syllables can be composed of consonants and vowels (CV). Table 2 presents the syllable types, subtypes, and examples for each subtype. The syllable boundaries are indicated by the use of dashes (-).
We based our syllabification rules on Guma's (1982) syllable types. For testing the system, we extracted syllabification information from Chitja's (2010) dictionary. The process for extracting syllable information from the dictionary and creating a wordlist is described in section 5.2.1. The wordlist represented all syllabification types presented in Table 2. The rule-based system achieved an accuracy rate of 99.69%. We also experimented with a TeX-based approach. We used the wordlist [see section 5.2.1] for training and testing the machine learning system. This system achieved an accuracy rate of 78.92%. The lower accuracy rate of the TeX-based system is attributed to two unavoidable shortcomings. First, we noticed that there was some human oversight while manually cleaning the training corpus. Second, the TeX-based system cannot handle single-letter syllables at the beginning or end of words. Both systems are publicly available on SADiLaR's repository (see Sibeko and Van Zaanen (2022a)).

Table 2: Syllable types, subtypes, and examples
Type | Subtype | Word | Syllables | Gloss
 | | shwang | shwa-ng | dying
 | three consonants + one vowel | tlhase | tlha-se | spark
 | three consonants + semi-vowel + one vowel | tshwela | tshwe-la | spit
C | nasal consonant n, m + non-nasal consonant | ntate | n-ta-te | father
C | nasal consonant n, m + nasal consonant | mme | m-me | mother
C | nasal consonant n + complex nasal consonant | nnyatsa | n-nya-tsa | disrespects me
C | complex nasal consonant ŋ + vowel | ngala | nga-la | abandon
C | complex nasal consonant ŋ + non-nasal consonant | mangmang | ma-ng-ma-ng | so so
C | word-ending complex nasal consonant ŋ | hang | ha-ng | once
C | consecutive lateral consonants l | llela | l-le-la | weep for
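To illustrate how rules of this kind can be encoded, the sketch below implements a small rule-based syllabifier for words in the SAS orthography. It is not the published system: it is a simplified illustration written only to reproduce the example words from Table 2, and the function name `syllabify` and the exact rule ordering are assumptions. It treats every vowel as closing a syllable, and treats n, m, ng, and the first of two consecutive l's as syllabic when they directly precede another consonant or, for ng, end the word.

```python
VOWELS = set("aeiou")
SEMIVOWELS = set("wy")


def syllabify(word: str) -> list[str]:
    """Split a word in SAS orthography into syllables (illustrative sketch)."""
    word = word.lower()
    syllables: list[str] = []
    onset = ""  # consonants collected since the last syllable boundary
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in VOWELS:
            # Every vowel closes a syllable (V or CV).
            syllables.append(onset + ch)
            onset = ""
        elif not onset and word.startswith("ng", i) and (
            i + 2 == len(word)
            or (word[i + 2] not in VOWELS and word[i + 2] not in SEMIVOWELS)
        ):
            # Syllabic "ng": word-final or followed by a consonant.
            syllables.append("ng")
            i += 1  # skip the "g"
        elif (
            not onset
            and ch in "nm"
            and nxt
            and nxt not in VOWELS
            and nxt not in SEMIVOWELS
            and not (ch == "n" and nxt == "g")
        ):
            # Syllabic n or m before another consonant (n-ta-te, m-me).
            syllables.append(ch)
        elif not onset and ch == "l" and nxt == "l":
            # The first of two consecutive l's is syllabic (l-le-la).
            syllables.append(ch)
        else:
            onset += ch
        i += 1
    if onset:
        # Trailing consonants: attach to the last syllable if there is one.
        if syllables:
            syllables[-1] += onset
        else:
            syllables.append(onset)
    return syllables
```

For example, `"-".join(syllabify("mangmang"))` gives `ma-ng-ma-ng`. The LS orthography, which writes semivowels as o and e, would need additional rules that this sketch does not attempt.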

Syllabified Wordlist
As part of developing the syllabification systems, we developed a gold-standard corpus annotated with syllable information. We extracted dictionary entries and syllable information from Bukantswe ya Machaba ya Sesotho 'The international dictionary of Sesotho' (Chitja, 2010). Each dictionary entry contains valuable pieces of information. See, for instance, example 3 below. In example 3, the dictionary entry indicates the Sesotho word, Diepollo 'exhumation', then provides pronunciation information in brackets (di-e-pu-l-law), the English translation 'exhumations', the definition 'Ketso tsa ho epolla kapa ho ntsha ntho tse epetsoeng tlasa mobu. Ketso ya ho ntsha bafu mabitleng.' which translates to '[T]he act of unearthing things buried under the soil. Acts of digging up deceased people from their tombs.' Finally, a similar word is provided, that is 'bon. kepollo' which in this case is the singular form: 'exhumation'. For our project, we extracted the dictionary entries, followed by the pronunciation information as in the example below.

(4) Diepollo (di-e-pu-l-law) -SAS
    Exhumations (ex-hu-ma-tions)

As illustrated in example 4, pronunciation information was not always consistent with orthographic conventions. In instances of such inconsistencies, the wordlist was manually cleaned on a word-for-word basis to ensure consistent orthography. For instance, we altered the pronunciation information in example 4 above. That is, we adjusted the third syllable, which illustrated a high tone o by using the letter 'u', as in pu, and the fifth syllable, which indicated a lower tone o by using the digraph 'aw'. The modified syllables are presented in example 5 below:

(5) Diepollo (di-e-po-l-lo) -SAS
    'Exhumations'

Some pronunciation information included words ending in non-syllabic consonants, others changed the spelling such as in example 4, and others had incorrectly placed syllable boundaries. All of these issues were manually checked and fixed. After manual cleaning and fixing orthographic inconsistencies, we obtained a total of 13 551 words. The cleaned wordlist was also uploaded onto SADiLaR's repository (see Sibeko and Van Zaanen (2022b)).

Reading Comprehension and Summary Writing Texts
We were granted access to grade twelve exam question papers by the South African National Department of Basic Education (DBE). Grade twelve is the high school exit grade in South Africa. We have since extracted reading comprehension and summary writing texts from the exam question papers. We did this for all eleven official languages of South Africa. The texts in our corpus are split into two categories, that is, the home language (HL) and the first additional language (FAL). Previous research indicated that the English exam texts show consistently lower readability levels for the HL texts as opposed to the FAL texts (Sibeko, 2021; Sibeko and Van Zaanen, 2021). The lengths of texts in the collection vary according to the orthographies such as disjunctive and conjunctive, and text types such as reading comprehension and summary writing. Consistently, the lengths of summary texts are about a third of the reading comprehension texts in all eleven languages. The corpus has been uploaded to SADiLaR's repository (see Sibeko and Van Zaanen (2022c)). We hope that the differences in text readability and linguistic complexity are uniform throughout the different languages.

Machine-Translated Corpus
Previous studies evaluating the text readability of Sesotho texts, for instance, Krige and Reid (2017) and Reid et al. (2019), assumed that classical readability metrics that are based on syllable information and word-length-based textual properties can be directly used in Sesotho without taking the differences between the superficial textual features of English and Sesotho into consideration. However, it is evident from other studies adapting the weights of these textual features to language-specific conventions that the metrics cannot be applied to new languages without adjustments. The syllabification systems described in this article enable the automatic identification and counting of syllables. This is already different from previous research on Sesotho text readability. However, we still aim to adapt the metrics to the specific context of Sesotho. Such an adaptation is important for accounting for differences in superficial text properties between Sesotho and English. A gold-standard corpus with clear levels of text difficulty is needed to develop an automated readability model (Van Oosten et al., 2010; François and Fairon, 2012). Unfortunately, Sesotho, like other low-resource languages (LRLs), does not have corpora readily annotated with levels of difficulty (Filho et al., 2016). The use of translated texts may provide a solution to this lack of levelled texts. For instance, the texts can be easily levelled according to grades in English.
When Oborneva's (2006) readability model, which was trained on fiction texts, was evaluated on non-fiction texts, exaggerated readability levels were observed (Solovyev et al., 2018). Since we hope that education stakeholders such as teachers, parents, learners, textbook authors, and examiners can use our envisaged automated tool for analysing Sesotho text readability, we are training our models on educational texts. We are relying on texts collected as part of Sibeko and Van Zaanen's (2022c) corpus described in section 5.2.2 above. We identified texts from grade 12 Sesotho HL and FAL examinations from the collection. As a rule of thumb, we followed Zamanian and Heydari's (2012) guideline that a text should have at least 200 words for metrics like FRE and FKGL to be applied successfully. As a result, we identified texts of no less than 200 words each. In the end, we could only use longer reading comprehension and summary texts.
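The length criterion can be applied with a simple filter. The sketch below assumes whitespace tokenisation, which may differ from the tokenisation used when building the corpus; the function names are hypothetical.

```python
def has_minimum_length(text: str, minimum_words: int = 200) -> bool:
    """Check whether a text has at least `minimum_words` whitespace-separated tokens."""
    return len(text.split()) >= minimum_words


def select_texts(texts: list[str], minimum_words: int = 200) -> list[str]:
    """Keep only the texts that meet the minimum length."""
    return [text for text in texts if has_minimum_length(text, minimum_words)]
```

Because Sesotho is written disjunctively, word counts for the same content may differ from counts in conjunctively written languages, so the 200-word threshold may itself need revisiting during adaptation.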
For an illustration of the texts collected, see table 3. In this version of the text, we have removed the sentence markers <utt> that were inserted during Sibeko and Van Zaanen's (2022c) tokenization and sentence segmentation process. The text contains examples of figurative language. The Google Translate machine translation in table 3 indicates that at least all the words are successfully translated. Even so, instances of figurative language used in the Sesotho source text were translated out of context and new meanings were created.

Table 3: Google Translate machine translation
It is important to have good relations with your neighbors. You live together and should help each other in times of trouble. However, relationships have challenges when the neighbor likes the news. Meet your neighbor often and share ideas to improve your lives. By sharing information, your neighbor will realize that you are working hard to get what you have. Whether he wants to add to the house or buy something, advise him where to find things at a low price. This way she will see that you care about her and support her. When you share dreams and aspirations, make sure that your words do not give him the suspicion that you are proud, otherwise he will look for a way to distract you. Some of the things should be your secrets. When he asks you questions about something, give a longer answer that answers his questions. If there is something that does not satisfy you, speak clearly and humbly, solve the problem in a way that will not cause conflict. If your neighbor thinks that you are a high-class person, don't do things that confirm that idea. Stay humble at all times, don't pretend to be better than him. If this neighbor lies about you, ignore it and wait for someone else to point it out and take the step to talk about it.
The machine translations were post-edited to enable checking whether meaning influenced the readability of texts. Our translation corpus contains the original Sesotho texts, the original machine translations, and the human post-edited versions. The post-editing brief indicated that texts should not be changed unless meaning had been lost. As a result, in the human post-edited versions, machine translations such as liking the news were adapted to meaning-appropriate constructions such as nosy neighbours.

Conclusion
This article reported a work-in-progress PhD project. A survey of methods used for measuring text readability in low-resource languages indicated a prevalence of adapting classical readability metrics from high-resourced languages such as English. One of the common methods for adapting classical readability metrics was the use of translated texts between higher-resourced languages as helper languages and lower-resourced languages. Classical readability metrics use shallow textual features such as (i) the number of words, and (ii) the lengths of sentences, both of which can easily be counted, (iii) syllabic information for which we had to develop systems, and (iv) a frequency wordlist. Four resources were created and made available on the SADiLaR repository. One more resource is still under development. The resources include both gold-standard corpora and basic digital language resources such as syllabification systems. Finally, we identified the metrics we hope to adapt to Sesotho while also indicating the textual properties considered by each metric. At this point, we have put together most of the necessary tools for the identification and assessment of surface-level textual properties used in the nine readability metrics chosen. The main aim of the bigger study is to develop a platform for automated measurement of the readability of Sesotho texts. To this end, future work includes the development of a list of frequently used words in Sesotho, which will then enable the adaptation of the Dale-Chall index to Sesotho. Furthermore, all nine metrics will be adapted and a web-based platform will be developed and made publicly accessible.