How Does the Language of ‘Threat’ Vary Across News Domains? A Semi-Supervised Pipeline for Understanding Narrative Components in News Contexts

By identifying and characterising the narratives told in news media, we can better understand political and societal processes. The problem is challenging from the perspective of natural language processing because it requires a combination of quantitative and qualitative methods. This paper reports on work in progress, which aims to build a human-in-the-loop pipeline for analysing the variation of narrative themes across different domains, based on topic modelling and word embeddings. As an illustration, we study the language associated with the threat narrative in British news media.


I. INTRODUCTION
Due to the drastic changes in news distribution over the past decades, considerable attention has been given to how ongoing events are framed in news reporting. In the realm of digital news media, concerns include increasing polarisation and a decrease in the relative share of political reporting [1]. News reporting has a direct effect on the political landscape because alternative news framing translates into competing public discourses and, by extension, electoral results [2]. Studying the framings and narratives in the media is, therefore, vital for understanding political processes. Extensive qualitative analysis of large-scale news corpora is, however, expensive and hardly feasible. This motivates applying natural language processing both to facilitate qualitative research at scale and to enable quantitative approaches to narrative understanding. In this paper, we propose a pipeline for descriptive multi-domain analysis of narrative subcomponents in the news media.
Advancements in natural language processing (NLP) methods provide a variety of tools for political communication analysis. Some of these are found in applications within or adjacent to the area of narrative understanding, such as stance detection [3] and sentiment analysis [4]. Algorithms based on neural networks have been shown to discern the difference in published texts by different partisan actors [5] and predict the ideological alignment of social media posters [6]. The majority of these methods can be described as classification algorithms, relying on supervised learning on either narrow domain-specific annotated datasets or fine-tuning large language models (LLMs) which have been trained on huge general-domain corpora. They are primarily used for quantitative studies and practical applications where the target phenomena are well-defined and the domain shift is limited. The methods, however, are not always suitable to assist qualitative and descriptive research. The low level of interpretability of the machine learning methods in general, and of the LLMs in particular, is also a factor.
We are studying the usefulness of NLP for fine-grained narrative structure analysis, e.g. extracting narrative substructures and revealing context-specific language. The proposed pipeline, which is still a work in progress, serves to describe the contextual use of narrative themes within different overarching topics. As an example, we study the language used by news publishers to express the notion of threat and risk. We choose this type of semantic relation because its presence in a news article almost guarantees a degree of partiality which affects the readers' perception of the issue.
The pipeline consists of the following steps:
• Applying semi-supervised or unsupervised topic modelling to find latent topics in a text corpus
• Training contextual embeddings for each discovered topic
• Computing the closest terms in the embedding space to describe the notion of threat in each topic
In the system presented here, the embeddings are produced by Word2Vec, and topics are derived through Correlation Explanation (CorEx), where clusters are shaped around user-provided anchor words [7]. Depending on the available domain knowledge and discoveries from unsupervised clustering, the anchor words can guide the model to find crisper topics in a semi-supervised fashion. The output of the pipeline is a collection of descriptions of a selected concept (in our case, threat) for each of the generated topics.

II. BACKGROUND
In the context of NLP applications, interpretations and definitions of narrative structures and elements can vary greatly, depending on specific tasks and domains. Here we relate our problem to several of these approaches. While our goal of investigating the language of abstract notions in different contexts (here exemplified by threat) does not match them exactly, it shares many similarities with, e.g. stance detection and narrative discovery.

A. Opinion mining and stance detection
As demonstrated in a number of studies, certain narrative-like notions can be captured by large language models trained on the document level. The documents or sentences are labelled by human annotators as containing such notions, and the definition of the notion is left to expert judgement. The assumption is that the representation of the document is rich enough that it captures narrative elements regardless of form. This is very prominent in, e.g. hate speech detection, where the key challenge comes from the fact that the hateful intent can take misleading forms and does not rely on any specific device to be conveyed [8], [9]. Similarly, it can be the case for the stance detection task, where the stance towards an issue cannot always be determined by positive or negative vocabulary and sentiment analysis is not enough to draw meaningful conclusions [3].
The downside is that the model trained to classify entire documents would be able to identify, e.g. a stance towards a specific political issue, but not necessarily what constitutes the narrative within the text. For example, an article can be shown to include the notion of 'threat', but explaining what constitutes 'threat' beyond the label becomes problematic. Since the algorithmic decision applies to an entire document, for narrative detection purposes, these models are more applicable to shorter messages and higher-level narratives ('pro-abortion' as opposed to 'threat' or 'success').

B. Narrative extraction
In the field of computational narratology, a common approach to narratives involves determining key entities and their relations. The inspiration for such methods comes from the structural interpretations of stories in formalist folklore studies [10]. Character or role detection can take various forms but often includes assigning a fixed set of archetypal roles or narrative frames ('villain', 'protagonist', 'victim', etc.) to specific entities in the story. In the news article domain, abstract role detection has been realised, for example, through the combination of entity extraction and sentiment analysis [11]. There has been, for example, some success in applying these methods in computational studies of conspiracy theories classifying entities within the publication as 'insiders' or 'outsiders' [12].
A more complex approach involves constructing the relations of the extracted entities, where the resulting narrative representation usually takes the form of a graph. This method has been applied, for instance, to conspiracy theory discovery. Relating this directly to our research, the authors used the presence of the notion of 'threat' encoded as subject-verb-copula triplets as the main criterion to detect specific theory elements [13]. We, however, are interested in how the narrative component of 'threat' differs between news contexts and not in which contexts it defines.

C. Perspective extraction
Recently, Minnema et al. [14] put forth a framework based on Frame Semantics. They apply a FrameNet [15] parser LOME [16] to analyse perspectives in news media event description. Instead of building a graph, the focus is on analysing linguistic frames invoked by specific texts. While the purpose of the model is similar to our task, it is focused on the analysis of the specific events or topics (e.g. femicide reporting in Italy [17]) rather than comparing the contexts.

III. PRE-STUDY
A. Semi-supervised topic modelling for contextual 'threat' understanding
In our goal to keep the pipeline as robust and explainable as possible, we investigated the possibility of extracting contextual descriptions of 'threat' purely by applying semi-supervised topic modelling, which has the advantage of being interpretable in terms of probabilities, unlike Word2Vec word embeddings. Semi-supervised topic modelling has been used, for example, to investigate the presence of gendered latent topics in different contexts [18]. We explored the option of taking a similar approach, presupposing that there exists a specific cluster of news articles or their fragments centred around the target concept of threat. The entire pre-study was performed on the same dataset as the rest of the paper and on two of its thematic subsets: sports and politics. If each subset contained a threat-related cluster of articles, we would be able to compare their content and, therefore, the definitions of threat in these contexts.

B. Experiments with pSSLDA and CorEx
In the first series of experiments, we applied Latent Dirichlet Allocation (LDA) with z-labels (pSSLDA) [19]. At the initialisation step, it assigns additional weight to predefined seed words for specific topics. After initialisation, the algorithm proceeds in an unsupervised fashion. In our experiments, we initialised clusters with various threat-related word combinations and also tested other similar notions, such as 'success'. The resulting topical distribution remained near-identical to the output of the unsupervised LDA model. Moreover, the topic order remained unstable even with the seeding, somewhat counterintuitively: one could have expected the military conflict-related news topic to be consistently initiated by seed words, such as 'danger'.
The second series of preliminary experiments, using the more restrictive CorEx, also produced negative results: the topics initialised with the threat-related anchor (seed) words were not readily interpretable by humans and had significant overlaps with other clusters. Our interpretation is that in a news dataset, the event-specific language dominates all other vocabulary particularities, making event-based topics very easily separable. So even if clusters of text corresponding to notions such as 'threat' exist, they remain statistically insignificant in comparison. While this may not be the case for other abstract topics, it is reasonable to expect a co-occurrence-based method to find clusters based on topic-specific terms rather than the presence of a higher-level semantic construct. Thus, we rejected using purely semi-supervised topic modelling for our task.

A. Topic Modelling
To demonstrate the chosen approach, we apply it to study the language of threat in news media. The initial topic modelling is done with the help of CorEx, and in our analysis, we equate these topics with 'contexts'. Since the purpose of the pipeline is to perform exploratory analysis and assist in qualitative studies, it is vital to have some control over topic distribution based on domain knowledge. CorEx is based on mutual information between words and topics and offers more restrictive and flexible semi-supervised functionality compared to, e.g. Latent Dirichlet Allocation with z-labels [19]. It is also valuable for corpus exploration because it does not require transfer learning: Topic modelling based on neural networks can perform better on specific tasks [20], but also introduces additional bias from out-of-context corpora. Due to the nature of the models, this bias cannot be easily separated from the properties of the target datasets, which is a disadvantage in exploratory analysis.
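As a brief reminder of the quantity CorEx works with, the block below writes out total correlation and the part of it explained by a latent topic variable, following the standard formulation in the CorEx literature. The symbols X_i (binary word-occurrence variables) and Y (a latent topic) are notation introduced here for illustration and are not taken from the text above. Anchoring, as described in [7], increases the weight given to the mutual information between chosen anchor words and a given topic.

```latex
% Total correlation of the word variables X = (X_1, ..., X_n)
TC(X) = \sum_{i=1}^{n} H(X_i) - H(X)

% Correlation explained by a latent topic variable Y
TC(X; Y) = TC(X) - TC(X \mid Y) = \sum_{i=1}^{n} I(X_i; Y) - I(X; Y)
```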

B. Word Embeddings
After the topics have been established, we train Word2Vec embeddings on texts in each individual topic. Word embeddings preserve some degree of semantic relations from natural language [21] and have been used for investigating the definitions of concepts, as well as the narratives surrounding them. For example, Papasavva et al. [22] apply Word2Vec to find words associated with QAnon. It has also been used to compare semantic contexts: e.g. the language use of parliamentary motions of the opposing Swedish political parties [23]. While our goal is somewhat different, the approach is similar. We use a set of keywords as a representation for the target concept: in this case, threat.
Since our principal interest is investigating concrete media contexts, we avoid language models that require pre-training. Such models would inject implicit out-of-context bias, making comparisons harder. Additionally, the document cluster for each topic is relatively small, and it has been shown that Word2Vec can outperform transformer-based language models on smaller datasets when trained from scratch [24].

A. Dataset
To evaluate the pipeline (outlined in Figure 1), we experiment with a collection of news articles from mainstream free-access British news media. The dataset was collected between May and early August 2022 and contains 57,996 unique articles out of 100,000 in total. The five most frequent news sources are The Sun, The Independent, The Daily Express, The Daily Mirror and The Daily Star, which together constitute approximately 23% of the articles in the dataset.
For each article, the following information was collected: the title (headline), the preamble, the body, the URL to the article, and the publication date and time. For the purposes of the experiments, the headline, the preamble and the body are concatenated into single text entries. The articles are tokenised and processed into the matrix of token counts with the Scikit-learn library.
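The preprocessing described above could be sketched as follows. This is a minimal illustration, not the authors' code: the dictionary keys (`headline`, `preamble`, `body`), the function names, and the `CountVectorizer` settings (lowercasing, English stop-word removal) are assumptions, since the text only states that Scikit-learn is used to produce a matrix of token counts.

```python
from typing import Dict, List


def article_to_text(article: Dict[str, str]) -> str:
    """Concatenate headline, preamble and body into a single text entry.

    The field names are hypothetical; missing or empty fields are skipped.
    """
    parts = [article.get(key, "") for key in ("headline", "preamble", "body")]
    return " ".join(part.strip() for part in parts if part and part.strip())


def build_count_matrix(texts: List[str]):
    """Return a sparse document-term count matrix and its vocabulary."""
    # Imported here so that article_to_text can be used without scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer

    # Assumed settings: the source does not specify the vectoriser options.
    vectorizer = CountVectorizer(lowercase=True, stop_words="english")
    doc_word = vectorizer.fit_transform(texts)
    words = list(vectorizer.get_feature_names_out())
    return doc_word, words
```

The returned `doc_word` matrix and `words` list are the two inputs a topic model over token counts would need.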

B. Threat Definition
In this experiment, our goal is to define the notion of threat through the common language terms that encapsulate the relation of one entity presenting a threat to another. In that, our definition is similar to the definition of the frames in FrameNet: a frame is "a script-like conceptual structure that describes a particular type of situation, object, or event along with its participants and props" [15]. The semantically closest frame in FrameNet is 'Risky_situation', defined as "A particular Situation is likely (or unlikely) to result in a harmful event befalling an Asset." We therefore use the nouns in the FrameNet ontology that evoke the 'Risky_situation' frame (threat, danger, risk) as a triple of keywords.

VI. EXPERIMENTS
A. Unsupervised Topic Modelling
We assume minimal domain knowledge and run CorEx without anchor words to detect latent topics in the dataset. The number of topics was initially chosen to be below 10 to limit the scope of further analysis and, after preliminary experimentation, set to 5. We evaluated the top-20 most relevant terms for each cluster, and in four cases out of five, the topics seem to have a clear focus (top-3 terms listed in parentheses):
• T0: (league, season, premier)
• T1: (government, cost, crisis)
• T2: (police, court, officers)
• T3: (love, instagram, star)
• T4: (like, just, think)
The final topic T4 has the lowest topic correlation value. It is based on non-topic-specific words and seems to include articles not matching the rest of the clusters.

B. Semi-supervised Topic Modelling
Based on the initial result above, we can expect that the four topics (loosely defined as 'Sports', 'Costs crisis', 'Police', and 'TV and celebrities') are present in the data, but it would be unreasonable to assume that all (or even most) articles would belong to one of them. Gallagher et al. [7], when investigating the performance of CorEx, set the number of clusters as high as 50 for a news dataset while initialising some of them with anchors to create crisp topics. Even with minimal domain knowledge, we can also expect significant coverage of the Russia-Ukraine war in the summer of 2022, which is a potentially valuable context for threat interpretation. To isolate the five chosen topics (the four above plus the war), we restrict them with three anchor words each and set the total number of topics to 20.
The anchor words and the resulting topics are shown in Table I. Each of them is described by the top 10 most relevant terms. The list includes anchor words used for initialisation (in bold). CorEx is a discriminative model and allows articles to belong to several topics.
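An anchored CorEx run of this kind could be sketched with the `corextopic` package as shown below. This is an illustration under assumptions rather than the configuration used in the experiments: the anchor strength, the random seed and the helper names are hypothetical, and the actual anchor words are the ones listed in Table I, not reproduced here. Note that dropping an anchor group that has no words in the vocabulary shifts the position of the groups after it.

```python
from typing import List, Optional, Sequence


def keep_known_anchors(
    anchors: Sequence[Sequence[str]], words: Sequence[str]
) -> List[List[str]]:
    """Drop anchor words missing from the vocabulary and groups left empty."""
    vocabulary = set(words)
    kept = [[word for word in group if word in vocabulary] for group in anchors]
    return [group for group in kept if group]


def fit_corex(
    doc_word,
    words: Sequence[str],
    anchors: Optional[Sequence[Sequence[str]]] = None,
    n_topics: int = 20,
    anchor_strength: float = 3,  # assumed value, not stated in the text
    seed: int = 0,               # assumed value, for repeatability only
):
    """Fit CorEx; with anchors=None the run is fully unsupervised."""
    # Imported here so that keep_known_anchors works without corextopic.
    from corextopic import corextopic as ct

    model = ct.Corex(n_hidden=n_topics, seed=seed)
    usable = keep_known_anchors(anchors, words) if anchors else None
    model.fit(
        doc_word,
        words=list(words),
        anchors=usable,
        anchor_strength=anchor_strength,
    )
    return model


def top_terms(model, n_words: int = 10) -> List[List[str]]:
    """List the most relevant terms for every topic of a fitted model."""
    return [
        [entry[0] for entry in model.get_topics(topic=i, n_words=n_words)]
        for i in range(len(model.tcs))
    ]
```

Because CorEx lets an article belong to several topics, the boolean document-topic matrix in `model.labels` can be used to select the (possibly overlapping) article subsets for each context.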

C. Word Embeddings
For the articles in each topic, we train a Word2Vec model and extract the terms in the embedding space that are the closest to the three keywords: threat, danger and risk. We use the Gensim implementation of Word2Vec with the following parameters for all individual models: a minimum word frequency of 2 (words occurring only once are ignored), a window size of 10, and a vector size of 100. The results are presented in Table II. For each context-keyword pair, we show seven words with the closest vector representation, measured by cosine distance. Words unique to the contexts are highlighted in bold. As we can see, these neighbourhoods vary greatly between the contexts.
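The per-topic embedding step could be sketched as follows, using the parameters stated above (minimum frequency 2, window 10, vector size 100, seven neighbours). The function names, the `workers` and `seed` settings, and the assumption that each topic's articles are already tokenised into lists of strings are illustrative choices, not details from the experiments.

```python
from typing import Dict, List, Sequence, Set

KEYWORDS = ("threat", "danger", "risk")


def nearest_terms(
    tokenised_docs: List[List[str]],
    keywords: Sequence[str] = KEYWORDS,
    topn: int = 7,
    seed: int = 0,  # assumed, for repeatability only
) -> Dict[str, List[str]]:
    """Train Word2Vec on one topic's articles and return keyword neighbours."""
    # Imported here so that unique_to_context can be used without Gensim.
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=tokenised_docs,  # must be re-iterable, e.g. a list
        vector_size=100,
        window=10,
        min_count=2,
        workers=1,
        seed=seed,
    )
    # most_similar ranks vocabulary terms by cosine similarity.
    return {
        keyword: [term for term, _ in model.wv.most_similar(keyword, topn=topn)]
        for keyword in keywords
        if keyword in model.wv.key_to_index
    }


def unique_to_context(neighbours: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """For one keyword, keep the terms found in exactly one context's list."""
    unique: Dict[str, Set[str]] = {}
    for context, terms in neighbours.items():
        others: Set[str] = set()
        for other_context, other_terms in neighbours.items():
            if other_context != context:
                others.update(other_terms)
        unique[context] = set(terms) - others
    return unique
```

Calling `unique_to_context` with a mapping from context name to that context's neighbour list for a single keyword returns the terms that would be set in bold in a table of this kind.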

D. Analysis
At this step in the pipeline, the automatic analysis could be complemented with expert knowledge to add a qualitative element to the study. However, even by analysing the output of the models superficially and without specialised knowledge, we notice certain peculiarities. One is that the language of the 'cost crisis' topic is as strong as, if not stronger than, the language of the 'war' topic. Another observation is that the word 'fetus' is likely present together with 'court' because the Roe v. Wade ruling was overturned by the US Supreme Court within the time frame. With greater domain knowledge, one could choose better anchors and obtain crisper clusters. It is, for example, likely that the term 'TV' caused the last topic to skew towards war instead of the cultural sphere as was intended.

A. Limitations
One immediate limitation of the pipeline is the need for human guidance in the clustering process. While annotated data is not necessary, clustering does benefit greatly from domain knowledge, as the unsupervised version is unlikely to produce meaningful results. Another related disadvantage is the lack of 'ground truth' knowledge to test the results with the problem framed as it is. We, however, work on the assumption that a media researcher would not look for purely quantitative output in their application, using this framework as a facilitator for qualitative research instead.
Another potential weakness is the need to define any potential target concept with the keywords. While it is not in itself problematic with a domain expert's input, we so far lack the evidence to judge what kind of concept definition is preferable in a general case.
In our ongoing project, we have set ourselves several goals that would expand on existing experiments:
• We plan to study how the frequency and distribution of keywords or their combinations affect the results. As we propose to use the pipeline to study relatively high-level and not explicitly domain-specific concepts, we would like to at least outline how unspecific the keywords need to be. Based on this, we hope to provide a high-level recommendation on how to define the concepts. Similarly, we plan to formulate a recommendation on how to guide topic modelling with anchor words.
• Then, we aim to repeat the experimental scenario for other abstract narrative concepts, such as 'success' or 'failure', and compare the model's performance.
• The next step is to extend the experiments to an analogous dataset of the Swedish media. While it is reasonable to expect topic modelling and Word2Vec to work similarly well in another Germanic language, the news media culture is different, which is likely to affect the use of language.
Moving further, we can see this pipeline being used in comparative studies of news publishing language in different contexts, not only limiting them to event-based topics. The contexts can include, e.g. different types of publications (mainstream vs tabloids vs new media) or political alignments ('left wing' vs 'right wing'). Another potential use case is comparing the language of the same publication over a time period to investigate how language shifts within particular news contexts. Finally, an even more challenging task requiring more topical expertise would be drawing comparisons between the same concepts in the news media of different countries in their respective languages.

C. Conclusion
We have implemented a semi-supervised pipeline to analyse the expression of narrative themes in different media contexts. Previous studies have used word embeddings to describe terms and stances, and we extend this to a more abstract notion and produce a complete pipeline to perform comparative analysis. Our next steps include applying the pipeline to other languages and comparing the performance and results to English-language media. We are also reaching out to media researchers to identify other relevant applications. We believe that when there is sufficient domain knowledge to guide topic formation, this mixed-method approach can be an effective tool for narrative analysis.