366029858
03
01
01
JB
John Benjamins Publishing Company
01
JB code
SCL 118 Eb
15
9789027246530
06
10.1075/scl.118
13
2024029609
DG
002
02
01
SCL
02
1388-0373
Studies in Corpus Linguistics
118
01
Challenges in Corpus Linguistics
Rethinking corpus compilation and analysis
01
scl.118
01
https://benjamins.com
02
https://benjamins.com/catalog/scl.118
1
B01
Mark Kaunisto
Kaunisto, Mark
Mark
Kaunisto
Tampere University
2
B01
Marco Schilk
Schilk, Marco
Marco
Schilk
University of Hildesheim
01
eng
180
vii
172
LAN009000
v.2006
CFX
2
24
JB Subject Scheme
LIN.APPL
Applied linguistics
24
JB Subject Scheme
LIN.COMPUT
Computational & corpus linguistics
24
JB Subject Scheme
LIN.CORP
Corpus linguistics
24
JB Subject Scheme
LIN.THEOR
Theoretical linguistics
06
01
This book contributes to the discussion of challenges faced in different areas of corpus linguistics, namely the compilation, annotation, and analysis of linguistic corpora. In a field of growing corpus sizes and expanding possibilities of gathering data, some old issues persist, while at the same time new problems have emerged. As the compilation and study of language corpora become increasingly sophisticated and complex, continuous attention to ways of dealing with the data in question and to challenges in text selection and interpretation is needed. The contributions to this volume address problems relating to a variety of areas in corpus linguistic study, including corpus annotation, data variability, learner language, social media texts, and database utilization. The authors provide critical overviews and research-based analyses, discuss the nature of some of the common pitfalls, and offer solutions to existing problems.
04
09
01
https://benjamins.com/covers/475/scl.118.png
04
03
01
https://benjamins.com/covers/475_jpg/9789027215888.jpg
04
03
01
https://benjamins.com/covers/475_tif/9789027215888.tif
06
09
01
https://benjamins.com/covers/1200_front/scl.118.hb.png
07
09
01
https://benjamins.com/covers/125/scl.118.png
25
09
01
https://benjamins.com/covers/1200_back/scl.118.hb.png
27
09
01
https://benjamins.com/covers/3d_web/scl.118.hb.png
10
01
JB code
scl.118.toc
v
vi
2
Miscellaneous
1
01
Table of contents
10
01
JB code
scl.118.ack
vii
viii
2
Miscellaneous
2
01
Acknowledgements
10
01
JB code
scl.118.01kau
1
8
8
Chapter
3
01
From fallacies and pitfalls to solutions and future directions
Navigating the evolving terrain of corpus linguistics
1
A01
Mark Kaunisto
Kaunisto, Mark
Mark
Kaunisto
Tampere University
10
01
JB code
scl.118.02var
9
34
26
Chapter
4
01
Engaging with bad (meta)data in historical corpus linguistics
1
A01
Turo Vartiainen
Vartiainen, Turo
Turo
Vartiainen
University of Helsinki
2
A01
Tanja Säily
Säily, Tanja
Tanja
Säily
University of Helsinki
20
big data
20
corpus compilation
20
historical corpus linguistics
20
metadata
20
part-of-speech annotation
20
sampling
01
In this chapter, we discuss some common pitfalls related to historical data and its use in linguistic analysis. We argue that the “philologist’s dilemma”, as originally proposed by Rissanen (1989), should be reconceptualized to meet the needs of the fast-evolving field of corpus linguistics, where scholars make increasing use of big-data resources and sophisticated statistical modelling. By providing examples of errors and uncertainties related to, for example, corpus metadata, sampling, balance, and OCR accuracy, we argue that corpus linguists should pay increasingly close attention to the sampling and annotation principles employed in the compilation of historical corpora as well as to the quality of the linguistic data. We propose that the principle of “knowing one’s corpus” in terms of its compilation principles has become all the more important in the age of big-data corpora, where it is not feasible for individual researchers, or corpus compilers, to validate their data manually.
10
01
JB code
scl.118.03kau
35
54
20
Chapter
5
01
Named entities as potentially problematic items in corpora
1
A01
Mark Kaunisto
Kaunisto, Mark
Mark
Kaunisto
Tampere University
20
annotation
20
corpus linguistics
20
named entities
20
proper names
01
This chapter discusses problems in the interpretation of corpus data arising from insufficiencies in the annotation of named entities. Many corpora still do not adequately enable users to set up queries that exclude items appearing in names when this is needed to improve the precision of searches. Through an examination of case studies in major English language corpora, the chapter highlights the need to carefully post-process the search results, as irrelevant occurrences of named entities may pose challenges in the analyses of word frequencies and their collocational behaviour. The chapter calls for more detailed annotation of named entities in already available large linguistic corpora and reminds readers of the importance of close inspection of the search hits.
10
01
JB code
scl.118.04cal
55
67
13
Chapter
6
01
Challenges in the compilation, annotation, and analysis of learner corpus data
1
A01
Marcus Callies
Callies, Marcus
Marcus
Callies
University of Bremen
20
analysis
20
annotation
20
compilation
20
discourse of deficit
20
learner corpus
20
lexical bias
20
lexical innovation
20
multilingualism
20
task instruction
20
writing prompt
01
This chapter highlights and discusses the special characteristics of learner corpus data and the challenges they may present for corpus compilation, annotation, and analysis. Because learner corpus and SLA researchers use their data to study L2 production and development, it is of utmost importance that the data are valid, that is, that they represent “authentic” L2 production, which means that the data must stem from the studied learners’ own language production. I discuss challenges in three areas: (1) multilingual practices and metalinguistic language use, (2) lexical and constructional bias, often brought about by the wording of task instructions or writing prompts that learners are asked to respond to, and (3) learner corpus annotation in view of the “discourse of deficit” in SLA. Solutions are offered for each of these challenges.
10
01
JB code
scl.118.05hil
68
88
21
Chapter
7
01
Early newspapers as data for corpus linguistics (and Digital Humanities)
Issues in using the <i>British Library Newspapers</i> database as a corpus
1
A01
Turo Hiltunen
Hiltunen, Turo
Turo
Hiltunen
University of Helsinki
20
corpus compilation
20
Digital Humanities
20
register
20
representativeness
20
sampling
01
The availability of large digital archives has great potential for corpus linguistic research, but their use is not without problems. These problems can often be traced to fundamentally different ideas of what might constitute “good data” in Digital Humanities and in corpus linguistics, leading to different expectations regarding how the data is made available to researchers. This chapter discusses the specific challenges involved in using the <i>British Library Newspapers</i> database for corpus linguistics and considers potential solutions for them. It is argued that, to take full advantage of the database, it is necessary to adopt a flexible approach enabling a critical reflection on the digital materials, how they have been collected, processed, and made available.
10
01
JB code
scl.118.06har
89
105
17
Chapter
8
01
Open Corpus Linguistics – or How to overcome common problems in dealing with corpus data by adopting open research practices
1
A01
Stefan Hartmann
Hartmann, Stefan
Stefan
Hartmann
Heinrich Heine University Düsseldorf
20
accessibility
20
open research
20
replicability
20
representativeness
20
transparency
01
In recent years, many researchers have called attention to the fact that research results very often cannot be replicated – a phenomenon that has been called the <i>replication crisis</i>. The replication crisis in linguistics is highly relevant to corpus-based research: Many corpus studies are not directly replicable as the data on which they are based are not readily available. Especially in English linguistics, the full versions of many widely used corpora are still behind paywalls, which means that they are not accessible to parts of the global research community, and even when parts of the data are freely accessible, this presents problems for state-of-the-art methods of data analysis. In this paper, I discuss the challenges that have led to this situation and address some possible solutions. In particular, I argue for using smaller but openly available corpora whenever possible and for adopting open research practices as far as possible even when using commercial corpora.
10
01
JB code
scl.118.07lii
106
125
20
Chapter
9
01
Text length and short texts
An overview of the problem
1
A01
Aatu Liimatta
Liimatta, Aatu
Aatu
Liimatta
University of Helsinki
20
lengthwise analysis
20
lexical diversity
20
normalization
20
text length
01
Variation in text length is an unavoidable confounder in quantitative text-analytic corpus-linguistic studies. Texts can be difficult to compare across text lengths, particularly if many of them are short, due to the difficulty of calculating meaningful frequencies for the lexical items and linguistic features of interest. Traditionally, this has been less of an issue, since texts in many of the genres typically studied in linguistics have been relatively long. However, the rise of social media has brought the issue to the forefront. In this chapter, I describe the problem of text length and short texts together with a number of solutions and workarounds to this and related problems.
10
01
JB code
scl.118.08ihr
126
141
16
Chapter
10
01
Corpus genre categories
Issues at the intersection of linguistics and literature
1
A01
Daniel Ocic Ihrmark
Ihrmark, Daniel Ocic
Daniel Ocic
Ihrmark
Linnaeus University
20
corpus linguistics
20
genre
20
literature
20
special corpora
20
stylistics
01
This chapter highlights genre categorizations as a pitfall at the intersection of corpus linguistics and literature and problematizes the use of the genre category from the perspectives afforded by both fields. The chapter argues for more explicit communication of our genre categorization practices and, in doing so, suggests ways of avoiding the miscommunication and confusion that arise when the genre term is understood differently within different disciplines and backgrounds. The conclusion is that the wider categorizations used, such as <i>novel</i> or <i>short story</i>, are likely to be the most practical, and that studies wanting to sub-categorize further using the genre term should instead apply it according to their specific needs, accompanied by explicit discussion of the implementation.
10
01
JB code
scl.118.09mil
142
170
29
Chapter
11
01
Modeling fine-grained sociolinguistic variation
The promises and pitfalls of Twitter corpora and neural word embeddings
1
A01
Filip Miletic
Miletic, Filip
Filip
Miletic
CLLE, CNRS & University of Toulouse | IMS, University of Stuttgart
2
A01
Anne Przewozny-Desriaux
Przewozny-Desriaux, Anne
Anne
Przewozny-Desriaux
CLLE, CNRS & University of Toulouse
3
A01
Ludovic Tanguy
Tanguy, Ludovic
Ludovic
Tanguy
CLLE, CNRS & University of Toulouse
20
language contact
20
large language models
20
Quebec English
20
semantic shifts
20
Twitter corpora
20
word embeddings
01
This chapter examines the use of recent data sources and computational methods to study fine-grained sociolinguistic phenomena. We deploy a custom-built corpus of tweets (Miletić et al. 2020) and neural word embeddings to investigate the use of contact-induced semantic shifts in Quebec English. Drawing on an analysis of 40 lexical items, we show that our approach is beneficial in facilitating manual inspection of vast amounts of data and establishing fine-grained patterns of language variation. While it is affected by a range of noise-related issues, which we describe in detail, coarse-grained annotation provides an efficient way of circumventing them. We use the results filtered in this way to conduct a quantitative analysis of sociolinguistic constraints on contact-induced semantic shifts, further confirming the relevance of our approach.
10
01
JB code
scl.118.si
171
172
2
Miscellaneous
12
01
Subject index
02
JBENJAMINS
John Benjamins Publishing Company
01
John Benjamins Publishing Company
Amsterdam/Philadelphia
NL
02
September 2024
20240915
2024
John Benjamins B.V.
02
WORLD
13
15
9789027215888
01
JB
3
John Benjamins e-Platform
03
jbe-platform.com
09
WORLD
21
20240915
01
00
115.00
EUR
R
01
00
97.00
GBP
Z
01
gen
00
149.00
USD
S
906029857
03
01
01
JB
John Benjamins Publishing Company
01
JB code
SCL 118 Hb
15
9789027215888
13
2024029608
BB
01
SCL
02
1388-0373
Studies in Corpus Linguistics
118
01
Challenges in Corpus Linguistics
Rethinking corpus compilation and analysis
01
scl.118
01
https://benjamins.com
02
https://benjamins.com/catalog/scl.118
1
B01
Mark Kaunisto
Kaunisto, Mark
Mark
Kaunisto
Tampere University
2
B01
Marco Schilk
Schilk, Marco
Marco
Schilk
University of Hildesheim
01
eng
180
vii
172
LAN009000
v.2006
CFX
2
24
JB Subject Scheme
LIN.APPL
Applied linguistics
24
JB Subject Scheme
LIN.COMPUT
Computational & corpus linguistics
24
JB Subject Scheme
LIN.CORP
Corpus linguistics
24
JB Subject Scheme
LIN.THEOR
Theoretical linguistics
06
01
This book contributes to the discussion of challenges faced in different areas of corpus linguistics, namely the compilation, annotation, and analysis of linguistic corpora. In a field of growing corpus sizes and expanding possibilities of gathering data, some old issues persist, while at the same time new problems have emerged. As the compilation and study of language corpora become increasingly sophisticated and complex, continuous attention to ways of dealing with the data in question and to challenges in text selection and interpretation is needed. The contributions to this volume address problems relating to a variety of areas in corpus linguistic study, including corpus annotation, data variability, learner language, social media texts, and database utilization. The authors provide critical overviews and research-based analyses, discuss the nature of some of the common pitfalls, and offer solutions to existing problems.
04
09
01
https://benjamins.com/covers/475/scl.118.png
04
03
01
https://benjamins.com/covers/475_jpg/9789027215888.jpg
04
03
01
https://benjamins.com/covers/475_tif/9789027215888.tif
06
09
01
https://benjamins.com/covers/1200_front/scl.118.hb.png
07
09
01
https://benjamins.com/covers/125/scl.118.png
25
09
01
https://benjamins.com/covers/1200_back/scl.118.hb.png
27
09
01
https://benjamins.com/covers/3d_web/scl.118.hb.png
10
01
JB code
scl.118.toc
v
vi
2
Miscellaneous
1
01
Table of contents
10
01
JB code
scl.118.ack
vii
viii
2
Miscellaneous
2
01
Acknowledgements
10
01
JB code
scl.118.01kau
1
8
8
Chapter
3
01
From fallacies and pitfalls to solutions and future directions
Navigating the evolving terrain of corpus linguistics
1
A01
Mark Kaunisto
Kaunisto, Mark
Mark
Kaunisto
Tampere University
10
01
JB code
scl.118.02var
9
34
26
Chapter
4
01
Engaging with bad (meta)data in historical corpus linguistics
1
A01
Turo Vartiainen
Vartiainen, Turo
Turo
Vartiainen
University of Helsinki
2
A01
Tanja Säily
Säily, Tanja
Tanja
Säily
University of Helsinki
20
big data
20
corpus compilation
20
historical corpus linguistics
20
metadata
20
part-of-speech annotation
20
sampling
01
In this chapter, we discuss some common pitfalls related to historical data and its use in linguistic analysis. We argue that the “philologist’s dilemma”, as originally proposed by Rissanen (1989), should be reconceptualized to meet the needs of the fast-evolving field of corpus linguistics, where scholars make increasing use of big-data resources and sophisticated statistical modelling. By providing examples of errors and uncertainties related to, for example, corpus metadata, sampling, balance, and OCR accuracy, we argue that corpus linguists should pay increasingly close attention to the sampling and annotation principles employed in the compilation of historical corpora as well as to the quality of the linguistic data. We propose that the principle of “knowing one’s corpus” in terms of its compilation principles has become all the more important in the age of big-data corpora, where it is not feasible for individual researchers, or corpus compilers, to validate their data manually.
10
01
JB code
scl.118.03kau
35
54
20
Chapter
5
01
Named entities as potentially problematic items in corpora
1
A01
Mark Kaunisto
Kaunisto, Mark
Mark
Kaunisto
Tampere University
20
annotation
20
corpus linguistics
20
named entities
20
proper names
01
This chapter discusses problems in the interpretation of corpus data arising from insufficiencies in the annotation of named entities. Many corpora still do not adequately enable users to set up queries that exclude items appearing in names when this is needed to improve the precision of searches. Through an examination of case studies in major English language corpora, the chapter highlights the need to carefully post-process the search results, as irrelevant occurrences of named entities may pose challenges in the analyses of word frequencies and their collocational behaviour. The chapter calls for more detailed annotation of named entities in already available large linguistic corpora and reminds readers of the importance of close inspection of the search hits.
10
01
JB code
scl.118.04cal
55
67
13
Chapter
6
01
Challenges in the compilation, annotation, and analysis of learner corpus data
1
A01
Marcus Callies
Callies, Marcus
Marcus
Callies
University of Bremen
20
analysis
20
annotation
20
compilation
20
discourse of deficit
20
learner corpus
20
lexical bias
20
lexical innovation
20
multilingualism
20
task instruction
20
writing prompt
01
This chapter highlights and discusses the special characteristics of learner corpus data and the challenges they may present for corpus compilation, annotation, and analysis. Because learner corpus and SLA researchers use their data to study L2 production and development, it is of utmost importance that the data are valid, that is, that they represent “authentic” L2 production, which means that the data must stem from the studied learners’ own language production. I discuss challenges in three areas: (1) multilingual practices and metalinguistic language use, (2) lexical and constructional bias, often brought about by the wording of task instructions or writing prompts that learners are asked to respond to, and (3) learner corpus annotation in view of the “discourse of deficit” in SLA. Solutions are offered for each of these challenges.
10
01
JB code
scl.118.05hil
68
88
21
Chapter
7
01
Early newspapers as data for corpus linguistics (and Digital Humanities)
Issues in using the <i>British Library Newspapers</i> database as a corpus
1
A01
Turo Hiltunen
Hiltunen, Turo
Turo
Hiltunen
University of Helsinki
20
corpus compilation
20
Digital Humanities
20
register
20
representativeness
20
sampling
01
The availability of large digital archives has great potential for corpus linguistic research, but their use is not without problems. These problems can often be traced to fundamentally different ideas of what might constitute “good data” in Digital Humanities and in corpus linguistics, leading to different expectations regarding how the data is made available to researchers. This chapter discusses the specific challenges involved in using the <i>British Library Newspapers</i> database for corpus linguistics and considers potential solutions for them. It is argued that, to take full advantage of the database, it is necessary to adopt a flexible approach enabling a critical reflection on the digital materials, how they have been collected, processed, and made available.
10
01
JB code
scl.118.06har
89
105
17
Chapter
8
01
Open Corpus Linguistics – or How to overcome common problems in dealing with corpus data by adopting open research practices
1
A01
Stefan Hartmann
Hartmann, Stefan
Stefan
Hartmann
Heinrich Heine University Düsseldorf
20
accessibility
20
open research
20
replicability
20
representativeness
20
transparency
01
In recent years, many researchers have called attention to the fact that research results very often cannot be replicated – a phenomenon that has been called the <i>replication crisis</i>. The replication crisis in linguistics is highly relevant to corpus-based research: Many corpus studies are not directly replicable as the data on which they are based are not readily available. Especially in English linguistics, the full versions of many widely used corpora are still behind paywalls, which means that they are not accessible to parts of the global research community, and even when parts of the data are freely accessible, this presents problems for state-of-the-art methods of data analysis. In this paper, I discuss the challenges that have led to this situation and address some possible solutions. In particular, I argue for using smaller but openly available corpora whenever possible and for adopting open research practices as far as possible even when using commercial corpora.
10
01
JB code
scl.118.07lii
106
125
20
Chapter
9
01
Text length and short texts
An overview of the problem
1
A01
Aatu Liimatta
Liimatta, Aatu
Aatu
Liimatta
University of Helsinki
20
lengthwise analysis
20
lexical diversity
20
normalization
20
text length
01
Variation in text length is an unavoidable confounder in quantitative text-analytic corpus-linguistic studies. Texts can be difficult to compare across text lengths, particularly if many of them are short, due to the difficulty of calculating meaningful frequencies for the lexical items and linguistic features of interest. Traditionally, this has been less of an issue, since texts in many of the genres typically studied in linguistics have been relatively long. However, the rise of social media has brought the issue to the forefront. In this chapter, I describe the problem of text length and short texts together with a number of solutions and workarounds to this and related problems.
10
01
JB code
scl.118.08ihr
126
141
16
Chapter
10
01
Corpus genre categories
Issues at the intersection of linguistics and literature
1
A01
Daniel Ocic Ihrmark
Ihrmark, Daniel Ocic
Daniel Ocic
Ihrmark
Linnaeus University
20
corpus linguistics
20
genre
20
literature
20
special corpora
20
stylistics
01
This chapter highlights genre categorizations as a pitfall at the intersection of corpus linguistics and literature and problematizes the use of the genre category from the perspectives afforded by both fields. The chapter argues for more explicit communication of our genre categorization practices and, in doing so, suggests ways of avoiding the miscommunication and confusion that arise when the genre term is understood differently within different disciplines and backgrounds. The conclusion is that the wider categorizations used, such as <i>novel</i> or <i>short story</i>, are likely to be the most practical, and that studies wanting to sub-categorize further using the genre term should instead apply it according to their specific needs, accompanied by explicit discussion of the implementation.
10
01
JB code
scl.118.09mil
142
170
29
Chapter
11
01
Modeling fine-grained sociolinguistic variation
The promises and pitfalls of Twitter corpora and neural word embeddings
1
A01
Filip Miletic
Miletic, Filip
Filip
Miletic
CLLE, CNRS & University of Toulouse | IMS, University of Stuttgart
2
A01
Anne Przewozny-Desriaux
Przewozny-Desriaux, Anne
Anne
Przewozny-Desriaux
CLLE, CNRS & University of Toulouse
3
A01
Ludovic Tanguy
Tanguy, Ludovic
Ludovic
Tanguy
CLLE, CNRS & University of Toulouse
20
language contact
20
large language models
20
Quebec English
20
semantic shifts
20
Twitter corpora
20
word embeddings
01
This chapter examines the use of recent data sources and computational methods to study fine-grained sociolinguistic phenomena. We deploy a custom-built corpus of tweets (Miletić et al. 2020) and neural word embeddings to investigate the use of contact-induced semantic shifts in Quebec English. Drawing on an analysis of 40 lexical items, we show that our approach is beneficial in facilitating manual inspection of vast amounts of data and establishing fine-grained patterns of language variation. While it is affected by a range of noise-related issues, which we describe in detail, coarse-grained annotation provides an efficient way of circumventing them. We use the results filtered in this way to conduct a quantitative analysis of sociolinguistic constraints on contact-induced semantic shifts, further confirming the relevance of our approach.
10
01
JB code
scl.118.si
171
172
2
Miscellaneous
12
01
Subject index
02
JBENJAMINS
John Benjamins Publishing Company
01
John Benjamins Publishing Company
Amsterdam/Philadelphia
NL
02
September 2024
20240915
2024
John Benjamins B.V.
02
WORLD
01
JB
1
John Benjamins Publishing Company
+31 20 6304747
+31 20 6739773
bookorder@benjamins.nl
01
https://benjamins.com
01
WORLD
US CA MX
10
20240915
01
02
JB
1
00
115.00
EUR
R
02
02
JB
1
00
121.90
EUR
R
01
JB
10
bebc
+44 1202 712 934
+44 1202 712 913
sales@bebc.co.uk
03
GB
10
20240915
02
02
JB
1
00
97.00
GBP
Z
01
JB
2
John Benjamins North America
+1 800 562-5666
+1 703 661-1501
benjamins@presswarehouse.com
01
https://benjamins.com
01
US CA MX
10
20240915
01
gen
02
JB
1
00
149.00
USD