Historical Language Models in Cryptanalysis: Case Studies on English and German

In this paper, we study the impact of language models (LMs) on the decipherment of historical homophonic substitution ciphers. In particular, we investigate whether decipherment using hill climbing and simulated annealing can benefit from LMs generated from historical texts in general and century-specific texts in particular. We carry out experiments on homophonic substitution ciphers with English and German as plaintext languages. We take into account ciphertext length as well as the n-gram size of the LMs. We compare the results of decipherment based on historical LMs with large LMs generated from modern texts. The results show that using historical LMs in the decipherment of homophonic substitution ciphers leads to significantly better performance on ciphertext produced in the 17th century or earlier, and that century-specific language models yield better results on longer and older ciphertexts.


Introduction
One of the main components in the cryptanalysis of historical ciphertexts is a language model of the underlying plaintext. Oftentimes the choice of texts on which the language model is based is opportunistic, i.e. the cryptanalysts choose whatever texts or models are available. Since collections of (more or less) contemporary texts, such as the collection of the Gutenberg project or Wikipedia articles, are accessible and freely available for many languages, these are often used by cryptanalysts for the generation of LMs, e.g. by Lasry (2018) and Bean (2020). However, using language models derived from contemporary language might not be optimal for the decipherment of historical ciphertexts, since language changes over time.
Before spelling and grammar were normalized for many European written languages in the course of the 18th and 19th centuries, we find large variation in spelling in historical texts. The variation can be found not only across regions and writers; the same author could also spell the same word differently within the same document. The use of punctuation marks such as dots and commas was rarer than in modern texts. Further, writers used abbreviations to save space on expensive paper or parchment. Another important aspect when dealing with historical texts is the fact that language changes over time; new words enter the language while others disappear. Not only words, but also the grammatical structure of the language changes, with respect to word order and the internal structure of words (so-called morphology). Given the above-mentioned reasons, we can expect that the use of LMs generated from contemporary or historical texts has an impact on decipherment accuracy.
In this paper we aim to investigate the role of historical LMs in the decipherment process of historical ciphertexts. In particular, we are interested in answering the following research questions:
• Do historical LMs have a positive impact on the decipherment of historical ciphers?
• Do historical LMs created from the same century as the cipher originates from lead to an increase in decipherment performance?
• Does increasing the n-gram size of a model result in an increase of accuracy in decipherment of historical manuscripts?
We present a pilot study on English and German ciphertexts and LMs of various n-gram sizes generated from texts originating from the 14th to the 20th centuries. We focus on homophonic substitution ciphers of various lengths. As historical homophonic substitution ciphertexts have been successfully deciphered using hill climbing and simulated annealing (see, e.g., Lasry et al. (2020)), we chose to apply the same decipherment method in our experiments.
In Section 2, we present previous studies on using historical LMs in decipherment. In Section 3, we give an overview of the data sets used for the experiments on English and German. In Section 4, we describe the method with the experimental setup. In Section 5, the results from the various experiments are presented, and they are discussed in Section 6. Finally, in Section 7, we conclude our paper.

Background
The usage of language models generated from the underlying plaintext language of the cipher is indispensable for successful decipherment. Already in the 9th century, the Arab philosopher and mathematician Al-Kindi described the value of frequency analysis of the plaintext and the ciphertext in cryptanalysis. Since then, cryptographers have used LMs with information about the frequencies of letters and letter co-occurrences to create models of the underlying plaintext. One of the simplest, most efficient, and most commonly occurring language models are n-grams, which use a statistical approach to predict the probability of a word or character given its context. N-gram models are built by analyzing sequences of n words or characters in a text corpus and building a probability distribution of the next word or character based on the occurrence of each n-gram in the training data. The distribution of various n-gram orders (e.g. unigrams, bigrams, trigrams) is also reflected in the ciphertext and can thus give more clues into how the text was encrypted (Kahn, 1996; Dooley, 2018).
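As a concrete illustration, such a character-based n-gram model can be sketched in a few lines of Python (a minimal sketch, not the implementation used in the experiments; the function names and the floor penalty for unseen n-grams are our own choices):

```python
import math
from collections import Counter

def train_char_ngrams(text: str, n: int) -> dict:
    """Count character n-grams and store their log relative frequencies."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: math.log10(c / total) for gram, c in counts.items()}

def score(text: str, model: dict, n: int, floor: float = -10.0) -> float:
    """Sum the log frequencies of a candidate text's n-grams;
    unseen n-grams receive a fixed floor penalty."""
    return sum(model.get(text[i:i + n], floor)
               for i in range(len(text) - n + 1))
```

A cryptanalytic search can then compare candidate decipherments simply by their score: text that is more plausible under the model scores higher.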
In order to train or generate LMs for decipherment purposes, a wide range of text collections are used. These might include translations of the Universal Declaration of Human Rights, texts from Wikipedia, or Google Books. When dealing with historical ciphers, the most commonly used corpus is the Gutenberg collection from Project Gutenberg (www.gutenberg.org), a digital library of free e-books, collected since 1971. The collection consists of more than 60,000 books in a large number of languages for which U.S. copyright has expired, the great majority originating from the 19th century. The collection is publicly available and copyright-free, which explains its popularity for building historical language models.
Another collection of historical texts is the HistCorp corpus (Pettersson and Megyesi, 2018), which contains sixteen European languages: Czech, Dutch, English, French, German, Greek, Hungarian, Icelandic, Italian, Latin, Polish, Portuguese, Russian, Slovene, Spanish, and Swedish. The transcriptions of the original manuscripts are diplomatic editions, i.e. the orthography of the original text is kept and mistakes in the original are preserved. The texts are released in a uniform format. Noteworthy is that the number of texts and the data size for the various languages vary greatly in the collection, depending on what kind of historical text corpora are available for the particular language.
To generate language models, character-based as well as word-based models are common: unigram, bigram, and trigram up to sixgram models are used for the purpose of decipherment. Naturally, the higher the order of the n-grams, the more text is needed to generate suitable models, and the bigger the models become.
Surprisingly, even though historical cryptologists agree on the importance of LMs in cryptanalysis, the role of historical LMs has not been studied before, neither extensively nor systematically.
A few studies report, however, on the evaluation of various n-gram sizes in decipherment using modern texts. Ravi and Knight (2008) present a method for solving substitution ciphers using low-order character-based n-gram models and show how decipherment accuracy varies as a function of cipher length and n-gram order.
The same authors (Ravi and Knight, 2011) present a Bayesian approach for deciphering complex substitution ciphers and evaluate different setups of LMs, including character-based bigrams and trigrams as well as word-based trigrams. They conclude that the best decipherment results are achieved with trigram models and a word list. Nuhn et al. (2014) apply a method for solving substitution ciphers, including the Zodiac-408 cipher, and evaluate n-gram models of orders four, five, and six, where the sixgram models performed best with the lowest error rate. Hauer et al. (2014) present an approach for deciphering monoalphabetic substitution ciphers that combines character-level and word-level LMs. In the above-mentioned studies the language models were generated from contemporary texts and evaluated in the light of specific approaches.
Indications for using historical sources instead of contemporary texts for the decipherment of historical ciphertexts have been given in the study by Pettersson and Megyesi (2019) on the automatic language identification task for three types of historical ciphertexts. The results showed that historical LMs perform considerably better on the tested languages (German and Italian) and that using models based on historical texts makes it possible to capture old word forms that are not present in modern corpora, despite the larger size of the latter.
Historical LMs were also used to identify cleartext and its language in historical ciphertexts (Gambardella et al., 2022). The authors conducted a series of experiments on 214 documents in 8 languages, including Dutch, French, Hungarian, Italian, Latin, Portuguese, and Spanish, and tested the ability of the models in various n-gram settings, both character- and word-based, trained on historical corpora from the HistCorp collection.
More extensive studies on the impact of historical LMs in decipherment are based on two thesis works: a bachelor's thesis in linguistics carried out by Fornmark (2022) on English, and a master's thesis in language technology by Sikora (2022) on German. The two theses are based on the same method, designed by the authors of this paper. In this study, we build upon the two theses and compare the similarities and differences of the decipherments using modern vs. historical LMs for the two languages.

Text Collections
To carry out experiments on the impact of historical texts in decipherment for English and German, we use the HistCorp collection for the generation of LMs from historical texts, and contemporary texts from the Gutenberg project.
For English, the earliest texts are mostly Biblical sources, for example from The EDGeS Diachronic Bible Corpus (Bouma et al., 2020), whereas the later texts are of different genres and styles, including sources from the Corpus of Late Modern English Texts (De Smet, 2006) and the Lampeter Corpus of Early Modern English Tracts (Schmied et al., 1999).
The texts from all corpora for both languages were then sorted into centuries to create subdomains per time interval of 100 years to serve as basis for the creation of century-specific LMs.
For English, texts from the 11th to 13th centuries as well as the 15th century are missing from the source material and could thus not be included. For German, only one time period, namely the 20th century, could not be represented, due to data sparseness. More detailed information about the data sources and corpus creation is given by Fornmark (2022) for English and by Sikora (2022) for German.

Method
To investigate whether LMs generated from historical corpora can lead to better performance in the decipherment of historical ciphers, compared to LMs generated from large modern corpora, we chose to experiment with English and German as the underlying plaintext languages. We use plaintexts from the 11th to the 20th centuries. We first describe the features that might have an effect on the decipherment results. Then, we present the experimental design with a walk-through of the various stages.

Features
For historical ciphers, we select one of the most commonly occurring types, namely homophonic substitution ciphers with a varying number of homophones per plaintext alphabet letter. The reason behind varying the number of homophones for each plaintext letter is to create more authentic ciphers, similar to the original ciphers retrieved from European archives and libraries. Previous studies showed that in homophonic substitution ciphers we find a variable number of code elements for different plaintext letters, depending on their frequency in the particular language; frequently occurring plaintext letters usually receive two or three code elements, while less frequent letters are assigned one code element (Kahn, 1996; Megyesi et al., 2022).
To measure the correlation between decipherment accuracy and the order of the LMs with respect to ciphertext length, we use ciphertexts consisting of 200 and 500 characters. This decision was guided by the assumption that generating shorter ciphers would significantly increase the difficulty of decipherment. For the investigation of the impact of various n-gram sizes on decipherment, we experiment with trigram, fourgram, and fivegram character-based LMs. The experimental setup of the features and their values that our study is built upon is summarized in Table 1.

Experiment design
Our point of departure is a ciphertext and its corresponding plaintext for evaluation, without any access to the cipher key. First, we collect and preprocess the plaintexts in English and German to generate the plaintext alphabets for the various centuries for both languages. Since we do not have access to original ciphertexts (with their corresponding plaintexts), we need to generate them to be able to evaluate the decipherment results. We then create the LMs of various orders given the plaintexts for the various centuries and languages. Finally, we run cryptanalysis using the various LMs and evaluate the output. The entire process is illustrated in Figure 1.

Alphabet generation
To create century-specific LMs, we generate plaintext alphabets for the specific time periods by counting unigram frequencies on the basis of texts in the HistCorp collection for both languages. It turns out that the size of the alphabet varies greatly across centuries, which causes memory problems in the generation of LMs. Therefore, we set several threshold values for letter frequencies (0.001, 0.01, and 0.05) and linguistically analyze the resulting alphabets. We decided to use characters more frequent than 0.01% for English and 0.05% for German.
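The thresholding step can be sketched as follows (our own illustration; the threshold is a relative unigram frequency, e.g. 0.0001 for 0.01%):

```python
from collections import Counter

def derive_alphabet(text: str, min_rel_freq: float) -> list:
    """Return the characters whose relative unigram frequency in the
    corpus exceeds the given threshold, sorted for a stable alphabet."""
    counts = Counter(text)
    total = sum(counts.values())
    return sorted(ch for ch, c in counts.items() if c / total > min_rel_freq)
```

Raising the threshold shrinks the alphabet, dropping rare transcription noise while keeping genuine historical letters.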
We then normalize the alphabets: all letters are upper-cased, with the exception of the German letter ß, while punctuation marks, digits, and white space are removed. Furthermore, we expand abbreviations and transform characters consisting of a regular letter and a superscript letter into two separate characters.

Key, ciphertext and plaintext generation
To be able to test the impact of historical and contemporary LMs on decipherment, the keys, ciphertexts, and their corresponding plaintexts are automatically generated. To generate historically typical cipher keys, we studied original cipher keys with certain characteristics collected from the DECODE database (Megyesi et al., 2019). We decided to derive homophonic substitution keys from keys with English or German plaintext or cleartext languages and available transcriptions. Since language identification in cipher keys might be challenging, we studied the words in the nomenclature lists more carefully to decide whether a cipher key was created for English or German plaintexts.
To generate plaintext, we use the HistCorp collection. For each language, we split the language-specific data set in HistCorp into a training set and a test set in proportions of 80% and 20%, respectively. Table 2 presents the number of texts in the training and test sets for English and German. The training set serves for the generation of LMs in the subsequent step, while the test set provides the gold-standard decrypted text used for evaluation. To be able to make automatic comparisons between the automatically generated decrypted texts and their corresponding gold standard, the test set is preprocessed: all characters are upper-cased, and double spaces, punctuation, and other non-letter characters are removed.
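This preprocessing can be sketched as follows (a simplified sketch restricted to the ASCII letter range; the actual alphabets are century- and language-specific, as described above):

```python
import re

def normalize(text: str) -> str:
    """Upper-case the text and remove everything that is not a letter
    (spaces, punctuation, digits). Restricted to A-Z for this sketch;
    the real alphabets include century-specific characters."""
    return re.sub(r"[^A-Z]", "", text.upper())
```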
The text files for each century are selected at random. Given the keys and the plaintexts, we then generate ciphertexts. For each time period, ciphertexts of both length 200 and length 500 are created.
Furthermore, for each ciphertext of a given length and time period, homophonic ciphers with a mixed number of homophones are built. The keys use the character-based unigram frequencies measured above. Between one and five homophones are assigned to each plaintext letter: between three and five homophones for the 14 most common letters, one homophone for the least common letters, and between two and three homophones for the letters in between. Null characters and nomenclature elements are not considered.
In the experiments, the keys and ciphertexts are all numeric and use a fixed, uniform length for each ciphertext letter. If the key needs fewer than 100 homophones, all plaintext letters map to two-digit numbers (00, 01, . . ., 99) in the ciphertext. Otherwise, the plaintext letters map to three-digit numbers (000, 001, . . ., 999) in the ciphertext. An example of a generated cipher key is illustrated in Figure 2.
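The key and ciphertext generation described above can be sketched as follows (a simplified sketch under our own assumptions: the exact boundary of the middle letter group and the random choices are illustrative, not the exact procedure used in the experiments):

```python
import random

def generate_key(freqs: dict, seed: int = 0) -> dict:
    """Assign homophones by frequency rank: 3-5 code elements for the 14
    most common letters, 2-3 for a middle group, 1 for the rest; then map
    them to fixed-width numeric code elements (two digits if fewer than
    100 homophones are needed, otherwise three)."""
    rng = random.Random(seed)
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    n_homophones = {}
    for rank, letter in enumerate(ranked):
        if rank < 14:
            n_homophones[letter] = rng.randint(3, 5)
        elif rank < 20:  # middle-group boundary is our assumption
            n_homophones[letter] = rng.randint(2, 3)
        else:
            n_homophones[letter] = 1
    total = sum(n_homophones.values())
    width = 2 if total < 100 else 3
    codes = (f"{i:0{width}d}" for i in range(10 ** width))
    return {letter: [next(codes) for _ in range(n_homophones[letter])]
            for letter in ranked}

def encrypt(plaintext: str, key: dict, seed: int = 0) -> str:
    """Replace each plaintext letter by a randomly chosen homophone."""
    rng = random.Random(seed)
    return "".join(rng.choice(key[ch]) for ch in plaintext)
```

For a 26-letter alphabet this rule needs at most 94 homophones, so all code elements are two digits wide, as in Figure 2.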

Language model generation
Before generating the models themselves, duplicated texts (the same texts appearing in several centuries) are removed. Then, character-based n-gram models of order 3, 4, and 5 are created from the training set for each century, as well as a more generic model from all texts available for each language in the Gutenberg collection.
The model format consists of a data and a metadata section. The data section is an array with n-grams and their respective frequencies. The number of occurrences of any particular n-gram is stored as the logarithm of the frequency of that n-gram relative to the full body of text data. The metadata section, which is located at the beginning of the LM file, starts with the file identifier "CTLS" (CrypTool Language Statistics), followed by the language code ("EN" for English and "DE" for German), an integer describing the "n" value of the gram, and the model alphabet.
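The layout can be illustrated with a simplified plain-text analogue (the actual CT2 model files are not written this way; this sketch only mirrors the fields described above):

```python
import math
from collections import Counter

def write_model(path: str, text: str, n: int, lang: str, alphabet: str) -> None:
    """Write a simplified text analogue of the described layout: a metadata
    header ("CTLS", language code, n, alphabet) followed by a data section
    of n-grams with their log relative frequencies."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"CTLS\t{lang}\t{n}\t{alphabet}\n")   # metadata section
        for gram, c in sorted(counts.items()):        # data section
            f.write(f"{gram}\t{math.log10(c / total):.6f}\n")
```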
From the training sets, character-based models of the three n-gram sizes are created for the different centuries, and combined models are generated from all texts. Lastly, word-based LMs are generated from the texts. The data is cleaned up by removing residual punctuation for both languages. For English, diacritics were also removed, while for German they were kept. The resulting format is a single word per line, listing the words occurring in the source material. A combined dictionary from the included English and German HistCorp material is also created, along with general German and English dictionaries generated from the Gutenberg data.

Decipherment
The cryptanalysis is performed using CrypTool 2 (CT2), a freely available open-source tool which allows the automatic decipherment of historical and modern ciphers (Kopal, 2018). CT2 contains a component for the cryptanalysis of homophonic substitution ciphers, the so-called Homophonic Substitution Analyzer, see Figure 3. To ease the use of the cryptanalytic algorithm implemented in the component, we extracted the core cryptanalysis algorithm. Thus, it could be used without the need to start a full-blown CT2 instance. This furthermore sped up the cryptanalysis and allowed us to perform the several hundred cryptanalysis runs per model needed for our evaluations.
To decipher a given ciphertext, CT2's Homophonic Substitution Analyzer component, which implements hill climbing with simulated annealing (Kopal, 2019), was used. Additionally, a dictionary of common words was given to the algorithm. During the cryptanalysis process, the dictionary is used to "lock" partially correct decipherments early, to improve and speed up the further analysis process.
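The core idea of the analyzer can be sketched as follows (our own simplified reimplementation, not CT2's code: the state maps each ciphertext symbol to a plaintext letter, and the cooling schedule, parameter values, and scoring interface are assumptions):

```python
import math
import random

def anneal(cipher_syms: list, alphabet: str, score_fn,
           steps: int = 20000, t0: float = 5.0, cooling: float = 0.9995,
           seed: int = 0):
    """Hill climbing with simulated annealing over homophone assignments:
    propose a random reassignment of one ciphertext symbol; accept it if
    the score improves, or with probability exp(delta / T) otherwise."""
    rng = random.Random(seed)
    symbols = sorted(set(cipher_syms))
    key = {s: rng.choice(alphabet) for s in symbols}
    decrypt = lambda k: "".join(k[s] for s in cipher_syms)
    current = best = score_fn(decrypt(key))
    best_key, t = dict(key), t0
    for _ in range(steps):
        s = rng.choice(symbols)
        old = key[s]
        key[s] = rng.choice(alphabet)
        candidate = score_fn(decrypt(key))
        delta = candidate - current
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current = candidate
            if candidate > best:
                best, best_key = candidate, dict(key)
        else:
            key[s] = old  # reject: revert the move
        t *= cooling      # geometric cooling schedule
    return best_key, best
```

In practice the scoring function is a character n-gram LM of the kind described in Section 2, and the dictionary lock fixes symbol assignments confirmed by recognized words.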

Evaluation
To evaluate the effect the LMs have on decipherment, we calculate decipherment accuracy as the percentage of correctly deciphered letters of a deciphered ciphertext:

Accuracy(D, P) = 100 · |{ i : D_i = P_i, 1 ≤ i ≤ Length }| / Length

where Length is the length of the ciphertext, D is the deciphered ciphertext, and P is the real plaintext.
Finally, an LM receives a "point" if it was able to decipher a given ciphertext with Accuracy ≥ 80%. Only the best of all models receives the point.
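The measure can be sketched directly from the definitions above:

```python
def decipherment_accuracy(deciphered: str, plaintext: str) -> float:
    """Percentage of positions at which the deciphered ciphertext D
    matches the real plaintext P, out of the ciphertext length."""
    assert len(deciphered) == len(plaintext)
    correct = sum(d == p for d, p in zip(deciphered, plaintext))
    return 100.0 * correct / len(plaintext)
```

A model then earns a point whenever this value reaches at least 80.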

Results
We provide the results for English in Figures 4 and 5 and for German in Figure 6.
For the English texts, we found that texts composed in a certain time period were, in general, best analyzed by the model trained on texts from the same century. This trend was especially clear for the earliest texts, i.e. texts from the 14th, 16th, and 17th centuries. For later texts it became less clear which model performed best, but there was a marked drop in the performance of the early models. These results can be linked to the development of English orthography, which became more standardized with the spread of dictionaries in the 18th century.
Regarding the choice of n-gram order, the results showed that 5-gram models achieved the highest decipherment accuracy on more modern texts, while 4-gram models led to the highest performance on older ones, produced in the 17th century or earlier.
Not surprisingly, the longer ciphertexts (500 characters) achieved higher decipherment accuracy in general.
The order of the LMs also interacts with ciphertext length. We found that shorter texts require higher-order n-grams; the 5-gram models performed better than the 3-gram and 4-gram models on the 200-character texts.
Interestingly, comparing the Gutenberg model with the model generated from the combination of all historical text material (1350-1999), the results are mixed. While the Gutenberg 4- and 5-gram models perform best on longer and more modern texts (18th-20th centuries), the merged historical 4- and 5-gram models yield the best results for shorter texts. Noteworthy, however, is that among the 3-gram models, the merged historical model always leads to the best decipherment performance.
For German, the results show similar trends, albeit not as clear-cut as for English. As for English, historical century-specific LMs achieve the best decipherment performance, but for German this applies to all historical texts produced earlier than the 19th century. The 16th-century model yielded the most impressive results for ciphertexts produced up to the 18th century.
Interestingly, the Gutenberg model outperformed a merged model generated from all historical texts in the HistCorp collection (1000-1899), with the exception of a few cases: the 4-gram and 5-gram models on 500-character texts from 1200-1299 and 1400-1499, and on 200-character ciphertexts from 1700-1799.
As for English, the longer the ciphertext, the higher the decipherment accuracy. 5-gram models fit best for shorter as well as longer ciphertexts if they are century-specific. However, if no century-specific data is available for LM generation, 4-gram models seem to be optimal for longer texts.

Discussion
The overall best results for English and German are achieved by applying 4- and 5-gram models: historical 4-grams for texts produced in the 17th century or earlier for English, and in the 19th century or earlier for German, while 5-gram models are preferable for longer and more modern texts. The results are not surprising. Shorter ciphertexts require more reliable LMs for successful decipherment, which in turn require larger amounts of more specific input data to achieve cryptanalysis with higher performance. Thus, the results are highly dependent on the amount of available training data, since the larger models tend to perform better in these cases. The conclusions should therefore be considered carefully and be regarded as potential trends; further validation with different models, source data, and languages is needed.
Noteworthy is also that the accuracies reported in this study might have been higher with more restarts and other parameter settings in the decipherment process. The goal of this study, however, is not to reach the highest decipherment accuracy, but to evaluate the general performance of the models given language data from various time periods.
Considering previous studies discussed in Section 2, such as Nuhn and Knight (2014) and Bean (2020), in which better results are reported using higher-order n-gram models, the data sizes required are significantly larger than for most of the models used in our study. The ciphers themselves are, furthermore, different from the ones analysed here, which is why a full comparison of the results is not possible. Our next step is to carry out an evaluation of the language models on various cipher types with underlying plaintexts in many European languages. While we only investigated two languages, both belonging to the Germanic language family with rather similar structure, we would like to include other, more dissimilar languages of different types, such as the Indo-European Romance and Slavic languages, as well as Finno-Ugric languages such as Finnish and Hungarian.
Another plan is to investigate the size and coverage of LMs and their impact on decipherment. The texts from which the models were generated cannot be said to be balanced, nor are they of the same size across centuries. In addition, for comparisons between historical LMs and those generated from other sources, several language models could be used: from unigram up to 6-gram character-based as well as word-based models. Apart from the Gutenberg collection, we aim to use the recently released Google n-gram models for a wide range of languages, with character-based n-grams of orders 1 to 5, generated from printed books of different genres and time periods (Google, 2022).
Conclusion

In this paper, we investigated the influence of language models on the decipherment of historical ciphertexts. We conducted experiments on English and German to find out whether language models generated from historical texts or from modern texts fit best for decipherment. We ran experiments on texts from the 11th to the 19th centuries. We investigated character-based n-gram models of orders 3, 4, and 5. We focused on homophonic substitution ciphers of various lengths. For ciphertext length, we experimented with ciphertext messages of 200 and 500 characters. We conducted experiments on homophonic substitution ciphers with a mixture of 1 to 5 code elements per plaintext letter to imitate the nature of original ciphers from early modern times. For comparison, we generated LMs from contemporary texts derived from the Gutenberg project, and a merged model of all historical century-specific texts.
The experiments clearly indicate that the age and the length of the ciphertext have a great influence on the results, and that ciphertext characteristics should be taken into account when choosing suitable language models for cryptanalysis. Likewise, the amount of plaintext data available for the generation of the language models should also be considered when choosing a suitable n-gram order.
The results show that decipherment by hill climbing and simulated annealing using historical n-gram models performs better on ciphertext produced in the 17th century or earlier for English and in the 19th century or earlier for German, and that century-specific language models perform better on longer, older, and less complex ciphertexts. The larger LMs generated from the Gutenberg collection are preferable for ciphers from the 18th-19th centuries. Further, the experiments on n-gram size show that LMs based on 4-grams and 5-grams achieve the highest performance for both English and German: 5-gram models for longer and more modern texts, and 4-gram models for historical ones.

Figure 2: A generated key; homophones per plaintext letter based on unigram frequencies.

Table 1: Experimental setup with features and their values.

Table 2: Number of documents in training and test sets in English and German per century.