Private Sensitive Content on Social Media: An Analysis and Automated Detection for Norwegian

Haldis Borgen; Oline Zachariassen; Pelin Mise; Ahmet Yildiz; Özlem Özgöbek

doi:10.3384/ecp208001

Abstract

This study addresses the gap in research on detecting private-sensitive content in Norwegian social media by creating and annotating a dataset tailored to the linguistic and cultural nuances of Norwegian social media discourse. Using Reddit as the primary data source, entries were compiled and cleaned, resulting in a dataset of 4482 rows. We evaluated a range of computational models, including machine learning, deep learning, and transformer models, to assess their effectiveness in identifying sensitive content. Among these, the NB BERT-based classifier emerged as the most proficient, achieving an accuracy of 82.75% and an F1-score of 82.39%. This work paves the way for improved privacy-sensitive content detection in Norwegian social media and sets a precedent for future research in the domain, emphasizing the role of tailored datasets in advancing the field.
