The promises and pitfalls of Twitter corpora and neural word embeddings: Modeling fine-grained sociolinguistic variation

Miletić, Filip; Przewozny-Desriaux, Anne; Tanguy, Ludovic

doi:10.1075/scl.118.09mil

Part of

Challenges in Corpus Linguistics: Rethinking corpus compilation and analysis
Edited by Mark Kaunisto and Marco Schilk
[Studies in Corpus Linguistics 118] 2024
pp. 142–170

Modeling fine-grained sociolinguistic variation

The promises and pitfalls of Twitter corpora and neural word embeddings

Filip Miletić | CLLE, CNRS & Université Toulouse - Jean Jaurès | IMS, Universität Stuttgart

Anne Przewozny-Desriaux | CLLE, CNRS & Université Toulouse - Jean Jaurès

Ludovic Tanguy | CLLE, CNRS & Université Toulouse - Jean Jaurès

This chapter examines the use of recent data sources and computational methods to study fine-grained sociolinguistic phenomena. We deploy a custom-built corpus of tweets (Miletić et al. 2020) and neural word embeddings to investigate the use of contact-induced semantic shifts in Quebec English. Drawing on an analysis of 40 lexical items, we show that our approach is beneficial in facilitating manual inspection of vast amounts of data and establishing fine-grained patterns of language variation. While it is affected by a range of noise-related issues, which we describe in detail, coarse-grained annotation provides an efficient way of circumventing them. We use the results filtered in this way to conduct a quantitative analysis of sociolinguistic constraints on contact-induced semantic shifts, further confirming the relevance of our approach.

Keywords: semantic shifts, language contact, Twitter corpora, word embeddings, large language models, Quebec English
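
As a minimal illustration of what identifying regionally specific clusters could involve, the Python sketch below assumes that each occurrence of a lexical item has already been assigned a cluster label (for instance by clustering its contextual embeddings) and a region label. The function names, the toy data, the region names, and the 0.8 threshold are all hypothetical and are not taken from the chapter; the authors' actual procedure may differ.

```python
from collections import Counter, defaultdict


def regional_shares(cluster_ids, regions):
    """Return {cluster: {region: share}} for two parallel lists of labels."""
    counts = defaultdict(Counter)
    for cluster, region in zip(cluster_ids, regions):
        counts[cluster][region] += 1
    return {
        cluster: {region: n / sum(c.values()) for region, n in c.items()}
        for cluster, c in counts.items()
    }


def flag_regional_clusters(cluster_ids, regions, target, threshold=0.8):
    """Return the sorted ids of clusters whose share of `target` meets the threshold.

    The threshold is an arbitrary illustrative value, not one from the chapter.
    """
    shares = regional_shares(cluster_ids, regions)
    return sorted(
        cluster
        for cluster, by_region in shares.items()
        if by_region.get(target, 0.0) >= threshold
    )


# Toy data (hypothetical): one cluster id and one region per occurrence.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
regions = [
    "Montreal", "Toronto", "Vancouver", "Montreal",              # cluster 0
    "Montreal", "Montreal", "Montreal", "Montreal", "Montreal",  # cluster 1
    "Toronto",                                                   # cluster 1
]

flagged = flag_regional_clusters(cluster_ids, regions, target="Montreal")
print(flagged)
```

With the toy data, only cluster 1 is flagged, since five of its six occurrences come from the target region; a cluster of this kind would then be a candidate for manual inspection and coarse-grained annotation.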

Article outline

1. Introduction
2. Theoretical and methodological background
- 2.1 Semantic shifts in Quebec English: The need for corpus studies
- 2.2 Twitter-based corpora for language variation
- 2.3 Vector space models for lexical semantic variation
3. Data and method
- 3.1 A corpus of tweets
- 3.2 A set of semantic shifts in Quebec English
- 3.3 Neural word embeddings
- 3.4 Clustering and annotating the uses of a lexical item
4. Results
- 4.1 An overview of regionally specific clusters
- 4.2 Types of variation captured by the analysis
  - 4.2.1 True positives
    - A clear-cut distinction
    - A subtler distinction
  - 4.2.2 False positives
    - Cultural effects
    - Proper names
    - French homographs in codeswitched tweets
    - Structural patterns affecting model performance
- 4.3 Deploying coarsely annotated data for linguistic description
5. Discussion and conclusion
Notes
References

This content is being prepared for publication; it may be subject to changes.

References (60)

Bamman, David, Eisenstein, Jacob & Schnoebelen, Tyler. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics 18(2): 135–160.

Barber, Katherine (ed.). 2004. Canadian Oxford dictionary. Oxford: OUP.

Bird, Steven, Loper, Edward & Klein, Ewan. 2009. Natural Language Processing with Python. Sebastopol CA: O’Reilly Media.

Boberg, Charles. 2005. The North American Regional Vocabulary Survey: New variables and methods in the study of North American English. American Speech 80(1): 22–60.

Boberg, Charles. 2010. The English Language in Canada: Status, History and Comparative Analysis. Cambridge: CUP.

Boberg, Charles. 2012. English as a minority language in Quebec. World Englishes 31(4): 493–502.

Boberg, Charles & Hotton, Jenna. 2015. English in the Gaspé region of Quebec. English World-Wide 36(3): 277–314.

Boleda, Gemma. 2020. Distributional semantics and linguistic theory. Annual Review of Linguistics 6: 213–234.

Cajolet-Laganière, Hélène, Martel, Pierre, Masson, Chantal-Édith & Mercier, Louis. 2014. Usito. <[URL]> (20 May 2024).

Chambers, J. K. & Heisler, Troy. 1999. Dialect topography of Québec City English. Canadian Journal of Linguistics/Revue Canadienne de Linguistique 44(1): 23–48.

De Pascale, Stefano. 2019. Token-based Vector Space Models as Semantic Control in Lexical Lectometry. PhD dissertation, KU Leuven.

Del Tredici, Marco & Fernández, Raquel. 2017. Semantic variation in online communities of practice. In IWCS 2017 – 12th International Conference on Computational Semantics – Long papers. <[URL]> (20 May 2024).

Dendien, Jacques & Pierrel, Jean-Marie. 2003. Le trésor de la langue française informatisé. Un exemple d’informatisation d’un dictionnaire de langue de référence. Traitement Automatique des Langues 44(2): 11–37.

Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton & Toutanova, Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis MN: Association for Computational Linguistics.

Dollinger, Stefan. 2015. The Written Questionnaire in Social Dialectology: History, Theory, Practice. Amsterdam: John Benjamins.

Dollinger, Stefan & Fee, Margery. 2017. DCHP-2: The Dictionary of Canadianisms on Historical Principles, 2nd edn. <[URL]> (20 May 2024).

Donoso, Gonzalo & Sánchez, David. 2017. Dialectometric analysis of language variation in Twitter. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 16–25. Valencia: Association for Computational Linguistics.

Durkin, Philip. 2012. Variation in the lexicon: The ‘Cinderella’ of sociolinguistics? Why does variation in word forms and word meanings present such challenges for empirical research? English Today 28(4): 3–9.

Fee, Margery. 1991. Frenglish in Quebec English newspapers. In Papers of the Fifteenth Annual Meeting of the Atlantic Provinces Linguistic Association, 12–23. New Brunswick: Atlantic Provinces Linguistic Association.

Fee, Margery. 2008. French borrowing in Quebec English. Anglistik: International Journal of English Studies 19(2): 173–188.

Firth, John R. 1957. A synopsis of linguistic theory, 1930–1955. In Studies in Linguistic Analysis, 1–32. Oxford: Blackwell.

Gimpel, Kevin, Schneider, Nathan, O’Connor, Brendan, Das, Dipanjan, Mills, Daniel, Eisenstein, Jacob, Heilman, Michael, Yogatama, Dani, Flanigan, Jeffrey & Smith, Noah A. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 42–47. Portland OR: Association for Computational Linguistics.

Giulianelli, Mario, Del Tredici, Marco & Fernández, Raquel. 2020. Analysing lexical semantic change with contextualised word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3960–3973. Stroudsburg PA: Association for Computational Linguistics.

Grant, Pamela. 2010a. English usage in contemporary Quebec: Reflections of the local. In Canadian English: A Linguistic Reader [Strathy Occasional Papers on Canadian English 6], Elaine Gold & Janice McAlpine (eds), 177–197. Kingston ON: Queen’s University.

Grant, Pamela. 2010b. Is Quebec English distinct? English usage in contemporary Quebec [lecture slides]. <[URL]> (20 May 2024).

Grant-Russell, Pamela. 1999. The influence of French on Quebec English: Motivation for lexical borrowing and integration of loanwords. In LACUS Forum 26, Shin Ja J. Hwang & Arle R. Lommel (eds), 473–486. Fullerton CA: The Linguistic Association of Canada and the United States.

Grieve, Jack, Montgomery, Chris, Nini, Andrea, Murakami, Akira & Guo, Diansheng. 2019. Mapping lexical dialect variation in British English using Twitter. Frontiers in Artificial Intelligence 2: 11.

Harris, Zellig S. 1954. Distributional structure. Word 10(2–3): 146–162.

Hengchen, Simon, Tahmasebi, Nina, Schlechtweg, Dominik & Dubossarsky, Haim. 2021. Challenges for computational lexical semantic change. In Computational Approaches to Semantic Change, Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang Xu & Simon Hengchen (eds), 341–372. Berlin: Language Science Press.

Jones, Taylor. 2015. Toward a description of African American Vernacular English dialect regions using “Black Twitter.” American Speech 90(4): 403–440.

Josselin, Amélie. 2001. L’emprunt lexical en France et au Canada: Le cas particulier des anglicismes et des gallicismes et leur traitement lexicographique. DEA thesis, Université de Lyon II.

Labov, William. 1972. Sociolinguistic Patterns. Philadelphia PA: University of Pennsylvania Press.

Laicher, Severin, Kurtyigit, Sinan, Schlechtweg, Dominik, Kuhn, Jonas & Schulte im Walde, Sabine. 2021. Explaining and improving BERT performance on lexical semantic change detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, 192–202. Stroudsburg PA: Association for Computational Linguistics.

Martinc, Matej, Montariol, Syrielle, Zosa, Elaine & Pivovarova, Lidia. 2020. Capturing evolution in word usage: Just add more clusters? In Companion Proceedings of the Web Conference 2020 (WWW ’20), 343–349. New York NY: Association for Computing Machinery.

McArthur, Tom. 1989. The English Language as Used in Quebec: A Survey [Strathy Occasional Papers on Canadian English 3]. Kingston ON: Queen’s University.

Miletić, Filip. 2019. Contact-induced lexical variation in Quebec English: An accountable description. In RJC2019 – 22èmes rencontres des jeunes chercheurs en sciences du langage, Paris, France. <[URL]>

Miletić, Filip, Przewozny-Desriaux, Anne & Tanguy, Ludovic. 2020. Collecting tweets to investigate regional variation in Canadian English. In Proceedings of the 12th Language Resources and Evaluation Conference, 6255–6264. Marseille: European Language Resources Association.

Miletić, Filip, Przewozny-Desriaux, Anne & Tanguy, Ludovic. 2021. Detecting contact-induced semantic shifts: What can embedding-based methods do in practice? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 10852–10865. Punta Cana, Dominican Republic: Association for Computational Linguistics.

Miletić, Filip, Przewozny-Desriaux, Anne & Tanguy, Ludovic. 2023. Understanding computational models of semantic change: New insights from the speech community. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 9209–9220. Singapore: Association for Computational Linguistics.

Montariol, Syrielle, Martinc, Matej & Pivovarova, Lidia. 2021. Scalable and interpretable semantic change detection. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4642–4652. Stroudsburg PA: Association for Computational Linguistics.

Nguyen, Dong. 2021. Dialect variation on social media. In Similar Languages, Varieties, and Dialects. A Computational Perspective, Marcos Zampieri & Preslav Nakov (eds), 204–218. Cambridge: CUP.

Nguyen, Dat Quoc, Vu, Thanh & Tuan Nguyen, Anh. 2020. BERTweet: A pre-trained language model for English Tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 9–14. Stroudsburg PA: Association for Computational Linguistics.

Owoputi, Olutobi, O’Connor, Brendan, Dyer, Chris, Gimpel, Kevin, Schneider, Nathan & Smith, Noah A. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 380–390. Atlanta GA: Association for Computational Linguistics.

Pavalanathan, Umashanthi & Eisenstein, Jacob. 2015. Confounds and consequences in geotagged Twitter data. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2138–2148. Lisbon: Association for Computational Linguistics.

Pedregosa, Fabian, Varoquaux, Gaël, Gramfort, Alexandre, Michel, Vincent, Thirion, Bertrand, Grisel, Olivier & Blondel, Mathieu et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12: 2825–2830.

Poplack, Shana, Walker, James A. & Malcolmson, Rebecca. 2006. An English ‘like no other’? Language contact and change in Quebec. Canadian Journal of Linguistics/Revue Canadienne de Linguistique 51(2–3): 185–213.

Rodda, Martina A., Lenci, Alessandro & Senaldi, Marco S. G. 2017. Panta rei: Tracking semantic change with distributional semantics in Ancient Greek. Italian Journal of Computational Linguistics 3(1): 11–24.

Rouaud, Julie. 2019. Lexical and Phonological Integration of French Loanwords into Varieties of Canadian English Since the Seventeenth Century. PhD dissertation, Université Toulouse – Jean Jaurès.

Schlechtweg, Dominik, Hätty, Anna, Del Tredici, Marco & Schulte im Walde, Sabine. 2019. A wind of change: Detecting and evaluating lexical semantic change across times and domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732–746. Florence: Association for Computational Linguistics.

Schlechtweg, Dominik, McGillivray, Barbara, Hengchen, Simon, Dubossarsky, Haim & Tahmasebi, Nina. 2020. SemEval-2020 task 1: Unsupervised lexical semantic change detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, A. Herbelot, X. Zhu, A. Palmer, N. Schneider, J. May & E. Shutova (eds), 1–23. Barcelona: International Committee for Computational Linguistics.

Shoemark, Philippa, Sur, Debnil, Shrimpton, Luke, Murray, Iain & Goldwater, Sharon. 2017. Aye or naw, whit dae ye hink? Scottish independence and linguistic identity on social media. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 1: Long Papers, 1239–1248. Valencia: Association for Computational Linguistics.

Statistics Canada. 2022. Table 98-10-0218-01. Mother tongue by age: Canada, provinces and territories. <[URL]> (20 May 2024).

Tagliamonte, Sali A. 2002. Comparative sociolinguistics. In The Handbook of Language Variation and Change, Jack K. Chambers, Peter Trudgill & Natalie Schilling-Estes (eds), 729–763. Malden MA: Blackwell.

Tagliamonte, Sali A. 2006. Analysing Sociolinguistic Variation. Cambridge: CUP.

Tahmasebi, Nina, Borin, Lars & Jatowt, Adam. 2021. Survey of computational approaches to lexical semantic change. In Computational Approaches to Semantic Change, Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang Xu & Simon Hengchen (eds), 1–91. Berlin: Language Science Press.

Takamura, Hiroya, Nagata, Ryo & Kawasaki, Yoshifumi. 2017. Analyzing semantic change in Japanese loanwords. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 1: Long Papers, 1195–1204. Valencia: Association for Computational Linguistics.

Turney, Peter D. & Pantel, Patrick. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37: 141–188.

Uban, Ana, Ciobanu, Alina Maria & Dinu, Liviu P. 2019. Studying laws of semantic divergence across languages using cognate sets. In Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change, 161–166. Florence: Association for Computational Linguistics.

Wolf, Thomas, Debut, Lysandre, Sanh, Victor, Chaumond, Julien, Delangue, Clement, Moi, Anthony & Cistac, Pierric et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. Stroudsburg PA: Association for Computational Linguistics.

Xu, Yang & Kemp, Charles. 2015. A computational evaluation of two laws of semantic change. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society, 2703–2708. Austin TX: Cognitive Science Society.