Private Sensitive Content on Social Media: An Analysis and Automated Detection for Norwegian
DOI: https://doi.org/10.3384/ecp208001
Abstract
This study addresses the notable gap in research on detecting privacy-sensitive content in Norwegian social media by creating and annotating a dataset tailored to the linguistic and cultural nuances of Norwegian social media discourse. Using Reddit as the primary data source, entries were compiled and cleaned, resulting in a dataset of 4482 rows. We evaluated a variety of computational models, including machine learning, deep learning, and transformer models, to assess their effectiveness in identifying sensitive content. Among these, the NB BERT-based classifier performed best, achieving an accuracy of 82.75% and an F1-score of 82.39%. This work paves the way for improved privacy-sensitive content detection in Norwegian social media and sets a precedent for future research in the domain, emphasizing the critical role of tailored datasets in advancing the field.
License
Copyright (c) 2024 Haldis Borgen, Oline Zachariassen, Pelin Mise, Ahmet Yildiz, Özlem Özgöbek
This work is licensed under a Creative Commons Attribution 4.0 International License.