Text length and short texts
An overview of the problem
Variation in text length is an unavoidable confounder in quantitative
corpus-linguistic studies. Texts of different lengths can be difficult to
compare, particularly if many of them are short, because it is hard to
calculate meaningful frequencies for the lexical items and linguistic
features of interest. Traditionally, this has been less
of an issue, since texts in many of the genres typically studied in
linguistics have been relatively long. However, the rise of social media has
brought the issue to the forefront. In this chapter, I describe the problem
of text length and short texts together with a number of solutions and
workarounds to this and related problems.
Article outline
- 1. Introduction
- 2. Background
- 2.1 Text length, corpora, and social media
- 2.2 The importance of text length
- 3. Solutions and workarounds
- 3.1 Manipulation of the data
- 3.1.1 Exclusion
- 3.1.2 Combining
- 3.1.3 Chunking
- 3.2 Computational and statistical approaches
- 3.2.1 Lengthwise analysis
- 3.2.2 Multiple Correspondence Analysis
- 3.2.3 Resampling methods
- 3.3 A related problem: Lexical diversity
- 4. Conclusion
- Notes
- References
This content is being prepared for publication; it may be subject to changes.