SA-Comments

This is the repository for the 2024 course "Obrada prirodnog jezika" (Natural Language Processing).

Description

Croatian, a variety of the Bosnian-Croatian-Montenegrin-Serbian language (BCMS), comprises three dialect families: Štokavian, Kajkavian, and Čakavian. They are named after their respective words for "what": “što”, “kaj”, and “ča”. The Štokavian dialect forms the basis of the standard language, which was revitalised after the Yugoslav wars of the 1990s. Subsequently, Croatia implemented an extensive language policy [1] aimed at distancing the standard as far as possible from the Serbian standard language. A side effect of this is the declining number of speakers of both Čakavian and Kajkavian. One possible explanation for this development is a newly developed language attitude among speakers of Croatian, who tend to perceive the standard language as more correct than the dialectal varieties [2].

The concept of “language attitudes” refers to individuals' reactions toward language varieties and their users, shaped by psychological and sociocultural factors. Language attitudes are an interplay of cognitive elements involving beliefs and stereotypes, affective aspects that pertain to evaluations, and behavioural dimensions that reflect observable actions and responses [3]. Several methods have been developed to research language attitudes, one of which is the societal treatment approach. Garrett [3] describes it as a broad research category that analyses public domain sources such as government discourse, media, and literature. This approach focuses on the socio-cultural and political context of attitudes, aiming to provide a comprehensive understanding of the interplay between societal discourses, political events, and individual attitudes. There have been computational approaches to studying societal treatment in the past, e.g. Durham’s analysis of tweets discussing the Welsh English accent [4]. While the data were collected automatically, their categorisation and analysis were done manually, which is time-consuming and resource-intensive.

A possible solution to this problem is sentiment analysis, an NLP technique that provides high-level sentiment classification at the level of a document, a sentence or a word [5]. As sentiment analysis has already been applied to other use cases for Croatian [7][8] and Serbian [9][10], existing resources can be reused and fine-tuned for the task. As a first step, we annotate 5000 sentences from the MetaLangNEWS-COMMENTS-Hr corpus [11]. The corpus is a collection of user comments from major Croatian news sources published between January 1, 2015, and January 1, 2020, focusing on language-related topics. It was tagged using CLASSLA-StanfordNLP models for morphosyntactic annotation and lemmatisation, and is available as plain text, as XML with metadata, and in tagged CoNLL-U format. After annotation, we analyse the results from different perspectives. [tbd] Finally, we fine-tune a transformer model for sentiment analysis on language-related texts. Two transformer models have been created that cover Croatian, namely BERTić [12] and CroSloEngual BERT [13].

References

[1] Mønnesland, S. (1997, October). Emerging literary standards and nationalism. The disintegration of Serbo-Croatian. In Actas do I Simposio Internacional sobre o Bilingüismo (pp. 1103-1113).

[2] Kalogjera, D. (2001). On attitudes toward Croatian dialects and on their changing status. International Journal of the Sociology of Language, 2001(147), 91-100. https://doi.org/10.1515/ijsl.2001.009

[3] Garrett, P. (2006). Language attitudes. In The Routledge companion to sociolinguistics (pp. 136-141). Routledge.

[4] Durham, M. (2016). Changing attitudes towards the Welsh English accent: A view from Twitter. Sociolinguistics in Wales, 181-205.

[5] Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1-135.

[6] Zhang, W., Li, X., Deng, Y., Bing, L., & Lam, W. (2022). A survey on aspect-based sentiment analysis: Tasks, methods, and challenges. IEEE Transactions on Knowledge and Data Engineering.

[7] Pelicon, A., Pranjić, M., Miljković, D., Škrlj, B., & Pollak, S. (2020). Zero-shot learning for cross-lingual news sentiment classification. Applied Sciences, 10(17), 5993.

[8] Thakkar, G., Preradovic, N. M., & Tadic, M. (2022). Multi-task learning for cross-lingual sentiment analysis. arXiv preprint arXiv:2212.07160.

[9] Draskovic, D., Zecevic, D., & Nikolic, B. (2022). Development of a multilingual model for machine sentiment analysis in the Serbian language. Mathematics, 10(18), 3236.

[10] Nikolić, N., Grljević, O., & Kovačević, A. (2020). Aspect-based sentiment analysis of reviews in the domain of higher education. The Electronic Library, 38(1), 44-64.

[11] Bogetić, K., & Batanović, V. (2020). Annotated corpus of Croatian language-related news comments MetaLangNEWS-COMMENTS-Hr. [Ljubljana]: ZRC SAZU; [Beograd]: Regional Linguistic Data Initiative Centre, 2020. 1 online resource. CLARIN.SI data & tools. https://www.clarin.si/repository/xmlui/handle/11356/1370. [COBISS.SI-ID 35287299]

[12] Ljubešić, N., & Lauc, D. (2021, April). BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing (pp. 37-42).

[13] Ulčar, M., & Robnik-Šikonja, M. (2020). FinEst BERT and CroSloEngual BERT: less is more in multilingual models. In Text, Speech, and Dialogue: 23rd International Conference, TSD 2020, Brno, Czech Republic, September 8–11, 2020, Proceedings 23 (pp. 104-111). Springer International Publishing.

Members

  • Anja Brnjakovic
  • Petra Mazavac
  • Petra Kokalovic
  • Barbara Kovacic

Next Steps

Previous Steps

  1. Create local GitHub branch
  2. Push original dataset
  3. Write a Python script that extracts the sentences and metadata from the XML file
  4. Extract the data and create a template annotation sheet as CSV
  5. Write a short explanation of the annotation environment (Excel) and how to save it (as CSV)
  6. Write a description
  7. Annotate the first 150 sentences
  8. Create inter-annotator agreement dashboard
  9. Data Analysis of Original Dataset

Next Meeting

Data Catalogue

  • global-id: Global document ID, in the form 'source-id'-'local-id'
  • source-name: Full name of the source website
  • article-title: Article title in its original script
  • article-time: Date on which the article was published
  • article-author: Name or initials of the article author
  • article-text: Main text of the article in its original script
  • sentence-id: ID of the sentence in the CoNLL-U file
  • sentence: Target sentence
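The extraction step that produces these columns could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the XML layout (a `document` element carrying the metadata as attributes, with `s` child elements for sentences), the sample content, and the helper names `extract_rows` and `write_template` are all assumptions made for the example. Only the column names come from the data catalogue and the annotation scheme.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Column names from the data catalogue, plus the two empty annotation columns.
ARTICLE_FIELDS = [
    "global-id", "source-name", "article-title",
    "article-time", "article-author", "article-text",
]
SENTENCE_FIELDS = ["sentence-id", "sentence"]
ANNOTATION_FIELDS = ["sentiment", "target"]
ALL_FIELDS = ARTICLE_FIELDS + SENTENCE_FIELDS + ANNOTATION_FIELDS

# ASSUMPTION: hypothetical XML layout and made-up content, for illustration only.
# The real corpus schema may use different element and attribute names.
SAMPLE_XML = """<corpus>
  <document global-id="src1-001" source-name="Example News"
            article-title="Naslov" article-time="2019-05-01"
            article-author="A. B." article-text="Tekst.">
    <s id="src1-001.1">Prva rečenica.</s>
    <s id="src1-001.2">Druga rečenica.</s>
  </document>
</corpus>"""


def extract_rows(xml_text):
    """Yield one dict per sentence, repeating the article metadata on each row."""
    root = ET.fromstring(xml_text)
    for doc in root.iter("document"):
        meta = {field: doc.get(field, "") for field in ARTICLE_FIELDS}
        for sent in doc.iter("s"):
            row = dict(meta)
            row["sentence-id"] = sent.get("id", "")
            row["sentence"] = (sent.text or "").strip()
            # Annotation columns start empty; annotators fill them in later.
            for field in ANNOTATION_FIELDS:
                row[field] = ""
            yield row


def write_template(rows, handle):
    """Write the annotation template as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=ALL_FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = list(extract_rows(SAMPLE_XML))
buffer = io.StringIO()
write_template(rows, buffer)
print(buffer.getvalue())
```

Repeating the article metadata on every sentence row keeps the sheet flat, so each row can be annotated without looking up its article elsewhere.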

Annotation Scheme

  • Sentiment: sentiment of the whole sentence
  • Target: target of the sentiment in the sentence

Annotation Environment

We use the Google Drive ecosystem for annotation. We created a folder on Google Drive in which each annotator has their own Google spreadsheet. The spreadsheet contains the dataset, which you can find in the folder "02 Raw data", plus two additional columns, "sentiment" and "target". The "sentiment" column has a drop-down menu from which the annotator chooses one of the tags "positive", "neutral" and "negative". This is intended to prevent spelling mistakes in the annotation layer.
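If a sheet is ever edited outside Google Sheets (for example after a CSV export), the same restriction can be checked in code. The following is a small sketch under assumptions: the function name `find_invalid_labels`, the sample rows, and the choice to skip empty cells as "not yet annotated" are illustrative and not part of the project.

```python
import csv
import io

# The three tags offered by the drop-down menu in the "sentiment" column.
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}


def find_invalid_labels(handle, column="sentiment", allow_empty=True):
    """Return (row_number, value) pairs whose label is not an allowed tag.

    Row numbers follow spreadsheet convention: row 1 is the header.
    Empty cells are skipped unless allow_empty is False.
    """
    invalid = []
    reader = csv.DictReader(handle)
    for row_number, row in enumerate(reader, start=2):
        value = (row.get(column) or "").strip()
        if not value and allow_empty:
            continue
        if value not in ALLOWED_SENTIMENTS:
            invalid.append((row_number, value))
    return invalid


# Made-up sample export with one misspelt tag and one unannotated row.
sample = io.StringIO(
    "sentence-id,sentence,sentiment,target\n"
    "s1,Example one,positive,dialect\n"
    "s2,Example two,negativ,standard\n"
    "s3,Example three,,\n"
)
problems = find_invalid_labels(sample)
print(problems)
```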

The aim is to fully automate the calculation of the inter-annotator agreement, measured with Fleiss' kappa. To this end, we created a Google Sheets dashboard that pulls the values from each annotation spreadsheet and colour-codes differences in the labels: red for total disagreement, yellow where two annotators agree, and green for total agreement. This overview is then used to discuss the cases where annotations differ and to decide on the final annotation.
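For reference, the same two computations can be sketched outside the dashboard. The code below follows the standard definition of Fleiss' kappa and is not a transcription of the dashboard's formulas, which are not shown here. The function names, the sample labels, and the assumption that every sentence is labelled by the same number of annotators (three in the example) are illustrative.

```python
from collections import Counter

LABELS = ("positive", "neutral", "negative")


def fleiss_kappa(annotations, labels=LABELS):
    """Fleiss' kappa for a list of items, each a list with one label per annotator.

    ASSUMPTION: every item is labelled by the same number of annotators (>= 2).
    """
    n_items = len(annotations)
    n_raters = len(annotations[0])
    if n_raters < 2 or any(len(item) != n_raters for item in annotations):
        raise ValueError("each item needs the same number (>= 2) of labels")
    if any(label not in labels for item in annotations for label in item):
        raise ValueError("unknown label in annotations")

    counts = [Counter(item) for item in annotations]
    # Observed agreement per item: share of annotator pairs that agree.
    p_items = [
        (sum(c[label] ** 2 for label in labels) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items
    # Expected agreement from the overall proportion of each label.
    proportions = [
        sum(c[label] for c in counts) / (n_items * n_raters) for label in labels
    ]
    p_expected = sum(p ** 2 for p in proportions)
    if p_expected == 1.0:
        return float("nan")  # only one label was ever used; kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)


def agreement_colour(item):
    """Colour code for one sentence, mirroring the dashboard's three levels."""
    most_common_count = Counter(item).most_common(1)[0][1]
    if most_common_count == len(item):
        return "green"   # total agreement
    if most_common_count >= 2:
        return "yellow"  # two annotators agree
    return "red"         # total disagreement


# Made-up labels from three annotators for four sentences.
annotations = [
    ["positive", "positive", "positive"],
    ["negative", "negative", "neutral"],
    ["positive", "neutral", "negative"],
    ["neutral", "neutral", "neutral"],
]
kappa = fleiss_kappa(annotations)
colours = [agreement_colour(item) for item in annotations]
print(round(kappa, 4), colours)
```

Kappa corrects the raw agreement for the agreement expected by chance given the overall label distribution, which is why it can be low even when many sentences are coloured green.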

Annotation Sheet of Anja Brnjakovic: https://docs.google.com/spreadsheets/d/1JYaB7dj9b7yw52g_t3VjzGaVVqwy68s3TSamU9YXoss/edit?usp=sharing

Annotation Sheet of Petra Kokalovic: https://docs.google.com/spreadsheets/d/1sUAEd9g_6BaeY-QbHPIXhE0kIuvZt2qmHooD9Jtwalo/edit?usp=sharing

Annotation Sheet of Petra Mazavac: https://docs.google.com/spreadsheets/d/1JktG2oO_ZFSQCk7bMRF6XhpS1lozFZagy8g-o9oxkzI/edit?usp=sharing

Dashboard: https://docs.google.com/spreadsheets/d/1V7ZnyfZKyHmSvaHSO9VYoqivQh-i9791aJZaYWR7JSg/edit?usp=sharing

Results

The analysis of the results can be seen here: https://docs.google.com/spreadsheets/d/19YdR9KLlv6nReZZ3_5qsJNDe6XS706GQ29NzxDo3v3g/edit?usp=sharing