Text length and short texts: An overview of the problem

doi:10.1075/scl.118.07lii

Part of

Challenges in Corpus Linguistics: Rethinking corpus compilation and analysis
Edited by Mark Kaunisto and Marco Schilk
[Studies in Corpus Linguistics 118] 2024
pp. 106–125

Aatu Liimatta | University of Helsinki

Variation in text length is an unavoidable confounder in quantitative text-analytic corpus-linguistic studies. Texts can be difficult to compare across text lengths, particularly if many of them are short, due to the difficulty of calculating meaningful frequencies for the lexical items and linguistic features of interest. Traditionally, this has been less of an issue, since texts in many of the genres typically studied in linguistics have been relatively long. However, the rise of social media has brought the issue to the forefront. In this chapter, I describe the problem of text length and short texts together with a number of solutions and workarounds to this and related problems.
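To illustrate why frequencies are hard to interpret in short texts, the following is a minimal sketch of per-1,000-token normalization. The function name `per_thousand` and the example counts and lengths are hypothetical, chosen only for illustration; they are not taken from the chapter.

```python
def per_thousand(count: int, text_length: int) -> float:
    """Normalized frequency: occurrences per 1,000 tokens."""
    if text_length <= 0:
        raise ValueError("text_length must be positive")
    return count / text_length * 1000


# Hypothetical texts: (occurrences of some feature, length in tokens).
texts = {"short post": (1, 12), "long article": (25, 5000)}

# Normalized rate of the feature in each text.
rates = {name: per_thousand(c, n) for name, (c, n) in texts.items()}

# How much the rate changes when a single occurrence is added or removed.
steps = {name: per_thousand(1, n) for name, (_, n) in texts.items()}

for name in texts:
    print(f"{name}: {rates[name]:.1f} per 1,000 tokens, "
          f"one occurrence = {steps[name]:.1f}")
```

In the short text, one occurrence more or less moves the normalized rate by roughly 83 per 1,000 tokens, whereas in the long text it moves it by 0.2, so the two rates are not equally reliable even though they are on the same scale.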

Keywords: text length, normalization, lexical diversity, lengthwise analysis

Article outline

1. Introduction
2. Background
- 2.1 Text length, corpora, and social media
- 2.2 The importance of text length
3. Solutions and workarounds
- 3.1 Manipulation of the data
  - 3.1.1 Exclusion
  - 3.1.2 Combining
  - 3.1.3 Chunking
- 3.2 Computational and statistical approaches
  - 3.2.1 Lengthwise analysis
  - 3.2.2 Multiple Correspondence Analysis
  - 3.2.3 Resampling methods
- 3.3 A related problem: Lexical diversity
4. Conclusion
Notes
References

This content is being prepared for publication; it may be subject to changes.
