Modeling fine-grained sociolinguistic variation
The promises and pitfalls of Twitter corpora and neural word embeddings
This chapter examines the use of recent data sources and
computational methods to study fine-grained sociolinguistic phenomena. We
deploy a custom-built corpus of tweets (Miletić et al. 2020) and neural word embeddings to investigate
the use of contact-induced semantic shifts in Quebec English. Drawing on an
analysis of 40 lexical items, we show that our approach facilitates
manual inspection of large amounts of data and helps establish
fine-grained patterns of language variation. Although the approach is affected
by a range of noise-related issues, which we describe in detail, coarse-grained
annotation provides an efficient way of circumventing them. We use the
results filtered in this way to conduct a quantitative analysis of
sociolinguistic constraints on contact-induced semantic shifts, further
confirming the relevance of our approach.
Article outline
- 1. Introduction
- 2. Theoretical and methodological background
  - 2.1 Semantic shifts in Quebec English: The need for corpus studies
  - 2.2 Twitter-based corpora for language variation
  - 2.3 Vector space models for lexical semantic variation
- 3. Data and method
  - 3.1 A corpus of tweets
  - 3.2 A set of semantic shifts in Quebec English
  - 3.3 Neural word embeddings
  - 3.4 Clustering and annotating the uses of a lexical item
- 4. Results
  - 4.1 An overview of regionally specific clusters
  - 4.2 Types of variation captured by the analysis
    - 4.2.1 True positives
      - A clear-cut distinction
      - A subtler distinction
    - 4.2.2 False positives
      - Cultural effects
      - Proper names
      - French homographs in codeswitched tweets
      - Structural patterns affecting model performance
  - 4.3 Deploying coarsely annotated data for linguistic description
- 5. Discussion and conclusion
- Notes
- References
This content is being prepared for publication; it may be subject to changes.