59029685
03
01
01
JB
John Benjamins Publishing Company
01
JB code
CILT 366 Eb
15
9789027246387
06
10.1075/cilt.366
13
2024034107
DG
002
02
01
CILT
02
0304-0763
Current Issues in Linguistic Theory
366
01
Recent Advances in Multiword Units in Machine Translation and Translation Technology
01
cilt.366
01
https://benjamins.com
02
https://benjamins.com/catalog/cilt.366
1
B01
Johanna Monti
Monti, Johanna
Johanna
Monti
University of Naples “L’Orientale”
2
B01
Gloria Corpas Pastor
Corpas Pastor, Gloria
Gloria
Corpas Pastor
University of Malaga
3
B01
Ruslan Mitkov
Mitkov, Ruslan
Ruslan
Mitkov
Lancaster University
4
B01
Carlos Manuel Hidalgo-Ternero
Hidalgo-Ternero, Carlos Manuel
Carlos Manuel
Hidalgo-Ternero
University of Malaga
01
eng
278
ix
262
+ index
LAN009060
v.2006
CFK
2
24
JB Subject Scheme
LIN.COMPUT
Computational & corpus linguistics
24
JB Subject Scheme
LIN.SYNTAX
Syntax
24
JB Subject Scheme
LIN.THEOR
Theoretical linguistics
24
JB Subject Scheme
TRAN.TRANSL
Translation Studies
06
01
The investigation of phraseology through corpus-based and computational approaches holds significant relevance for various professionals, including translators, interpreters, terminologists, lexicographers, language instructors, and learners. Computational Phraseology, and in particular the computational analysis of multiword expressions (also known as multiword units), has gained prominence in recent years and is essential for a number of Natural Language Processing and Translation Technology applications. The failure to detect these units automatically could result in incorrect and problematic automatic translations and could hinder the performance of applications such as text summarisation and web search. Against this background, the volume offers 13 articles carefully selected and organised into two parts: ‘Computational treatment of multiword units’ and ‘Corpus-based and linguistic studies in phraseology’. The contributions not only highlight the latest advancements in computational and corpus-based phraseology but also reiterate its vital role in all areas of language technologies, including basic and applied research.
04
09
01
https://benjamins.com/covers/475/cilt.366.png
04
03
01
https://benjamins.com/covers/475_jpg/9789027217905.jpg
04
03
01
https://benjamins.com/covers/475_tif/9789027217905.tif
06
09
01
https://benjamins.com/covers/1200_front/cilt.366.hb.png
07
09
01
https://benjamins.com/covers/125/cilt.366.png
25
09
01
https://benjamins.com/covers/1200_back/cilt.366.hb.png
27
09
01
https://benjamins.com/covers/3d_web/cilt.366.hb.png
10
01
JB code
cilt.366.toc
v
vi
2
Table of contents
1
01
Table of contents
10
01
JB code
cilt.366.preface
vii
x
4
Preface
2
01
Preface
1
A01
Johanna Monti
Monti, Johanna
Johanna
Monti
2
A01
Gloria Corpas Pastor
Corpas Pastor, Gloria
Gloria
Corpas Pastor
3
A01
Ruslan Mitkov
Mitkov, Ruslan
Ruslan
Mitkov
4
A01
Carlos Manuel Hidalgo-Ternero
Hidalgo-Ternero, Carlos Manuel
Carlos Manuel
Hidalgo-Ternero
10
01
JB code
cilt.366.s1
11
1
Section header
3
01
Section 1. Computational treatment of multiword units
10
01
JB code
cilt.366.01col
2
17
16
Chapter
4
01
Chapter 1. Multi-word units in neural machine translation
Why the tip of the iceberg remains problematic
1
A01
Jean-Pierre Colson
Colson, Jean-Pierre
Jean-Pierre
Colson
University of Louvain
20
deep learning
20
idioms
20
neural machine translation
20
phraseology
20
transformer architecture
01
Neural machine translation (NMT) has recently made significant progress in improving the quality of the texts it produces. New features of NMT include the fluidity of translations and the successful handling of multi-word units. In this paper we first report the results of an automated evaluation of the percentage of phraseology in the translations produced by Google Translate and DeepL. A corpus-based approach makes it possible to estimate that both NMT systems succeed in producing an average percentage of phraseology that is quite reasonable and sometimes even higher than in natural language production by native speakers. However, a closer look at some problematic cases shows that the ability of NMT systems to treat phraseological units can be deceptive, as they are often unable to cope with contextual complexity and low-frequency idioms.
10
01
JB code
cilt.366.02hid
18
39
22
Chapter
5
01
Chapter 2. ReGap
A text-preprocessing algorithm to enhance MWE-aware neural machine translation systems
1
A01
Carlos Manuel Hidalgo-Ternero
Hidalgo-Ternero, Carlos Manuel
Carlos Manuel
Hidalgo-Ternero
Universidad de Málaga
2
A01
Gloria Corpas Pastor
Corpas Pastor, Gloria
Gloria
Corpas Pastor
Universidad de Málaga
20
DeepL
20
discontinuity
20
Neural Machine Translation (NMT)
20
Text-preprocessing algorithm
20
Token-based MWE identification
20
Verb-noun idiomatic constructions (VNICs)
01
This research presents ReGap, a text-preprocessing algorithm for the automatic token-based identification and conversion of discontinuous multiword expressions (MWEs) into their canonical state, i.e., their continuous form, as a means to optimise neural machine translation (NMT) systems. To this end, an experiment with flexible verb-noun idiomatic constructions (VNICs) is conducted in order to assess to what extent ReGap can enhance the performance of the most robust NMT system to date, DeepL, under the challenge of MWE discontinuity in the Spanish-into-English and the Spanish-into-German directionalities. In this regard, the promising results yielded for VNICs will shed some light on new avenues for enhancing MWE-aware NMT systems.
10
01
JB code
cilt.366.03spe
40
56
17
Chapter
6
01
Chapter 3. Evaluating the Italian-English machine translation quality of MWUs in the domain of archaeology
1
A01
Giulia Speranza
Speranza, Giulia
Giulia
Speranza
University of Naples “L’Orientale”, UNIOR NLP Research Group
2
A01
Johanna Monti
Monti, Johanna
Johanna
Monti
University of Naples “L’Orientale”, UNIOR NLP Research Group
20
archaeology
20
error analysis
20
evaluation
20
machine translation
20
multiword units
20
terminology
01
Multiword units (MWUs) represent a challenging and problematic linguistic issue in the field of Natural Language Processing (NLP) due to their idiosyncratic nature. This paper investigates the quality of Neural Machine Translation (NMT) outputs when dealing with MWUs in the domain of archaeology. As a case study, a dataset of 100 MWUs is used as a Gold Standard to evaluate out-of-context and in-context translation outputs from three state-of-the-art NMT systems for the Italian-English language pair: Google Translate, DeepL, and Microsoft Bing Translator. MT outputs are manually evaluated with reference to the Gold Standard, namely out-of-context and in-context human English translations of the selected 100 MWUs. Results show that terminology is still a problematic category for MT quality and that MWU translation may vary, and sometimes even improve, when further context is provided.
10
01
JB code
cilt.366.04kub
57
78
22
Chapter
7
01
Chapter 4. Post-editing neural machine translation in specialised languages
The role of corpora in the translation of phraseological structures
1
A01
Natalie Kübler
Kübler, Natalie
Natalie
Kübler
Université Paris Cité, CLILLAC-ARP
2
A01
Hanna Martikainen
Martikainen, Hanna
Hanna
Martikainen
ESIT / Université Sorbonne Nouvelle, CLESTHIA
3
A01
Alexandra Mestivier
Mestivier, Alexandra
Alexandra
Mestivier
Université Paris Cité, CLILLAC-ARP
4
A01
Mojca Pecman
Pecman, Mojca
Mojca
Pecman
Université Paris Cité, CLILLAC-ARP
20
corpus-based methodology
20
errors
20
neural machine translation
20
phraseology
20
post-editing
20
specialized texts
01
This study focuses on phraseology in specialised texts and on students’ difficulties pertaining to phraseology in post-editing neural machine translation output. It is undertaken within the corpus-based methodological framework that we have developed for several purposes, one of which is to assess the impact of corpus use on translation and post-editing. The objective of the study is to propose a descriptive analysis of typical student errors related to phraseology in order to design tailored pedagogical materials. We aim to show that, with consistent training in querying corpora and in interpreting results in an appropriate manner, students can manage to improve their productions when translating specialised texts or when post-editing machine translation output.
10
01
JB code
cilt.366.05leo
79
102
24
Chapter
8
01
Chapter 5. Evaluating a bracketing protocol for multiword terms
1
A01
Pilar León-Araúz
León-Araúz, Pilar
Pilar
León-Araúz
University of Granada
2
A01
Melania Cabezas-García
Cabezas-García, Melania
Melania
Cabezas-García
University of Granada
20
bracketing
20
corpus
20
multiword term
20
structural disambiguation
20
terminology
01
Multiword terms (MWTs) are frequently used to encapsulate and convey meaning in scientific and technical texts. However, they can also make these texts difficult to understand because the relations between constituents are not transparent. When MWTs have more than two constituents, a dependency analysis (bracketing) is often necessary to facilitate their interpretation. NLP has proposed various models to automatize bracketing operations, but none has been entirely satisfactory. This paper presents a protocol that combines various models and applies it to a set of three-constituent MWTs in order to: (i) sort rules by their disambiguation potential, based on their likelihood of retrieving results from any corpus and their ability to solve bracketing; and (ii) ascertain the influence of corpus size and type on the results obtained.
10
01
JB code
cilt.366.s2
103
1
Section header
9
01
Section 2. Corpus-based and linguistic studies in phraseology
10
01
JB code
cilt.366.06fan
104
123
20
Chapter
10
01
Chapter 6. Suggestions for a new model of functional phraseme categorization for applied purposes
1
A01
Anna Fankhauser
Fankhauser, Anna
Anna
Fankhauser
Universität Osnabrück
20
(bilingual) learner lexicography
20
corpus-based phraseology
20
corpus-derived phraseme list
20
foreign language teaching
20
phraseme categorization
20
phraseological core
20
translation
01
Although the significance of phraseology in various fields of applied linguistics such as translation, language teaching, and (bilingual) learner lexicography is generally agreed upon, existing models of phraseme categorization largely fail to account for the needs of language practitioners and learners. Yet, a classification model for applied purposes is required, for example, to provide language practitioners and learners with a systematic list of useful phraseological items that can be applied to individual situations of language production and reception. The model suggested in the present paper consistently applies functional classification criteria and is derived from an extensive corpus study of spoken British and American English. It is hoped that the empirical approach and the focus on functional properties of phrasemes will ensure that the model is of maximum relevance for applied purposes.
10
01
JB code
cilt.366.07jim
124
141
18
Chapter
11
01
Chapter 7. Verb collocations and their semantics in the specialized language of science
1
A01
Eva Lucía Jiménez-Navarro
Jiménez-Navarro, Eva Lucía
Eva Lucía
Jiménez-Navarro
Universidad de Córdoba
20
method
20
noun collocation
20
research article
20
semantic frame
20
specialized corpus
20
specialized language of science
20
verb collocation
01
This chapter is concerned with verb collocations in the specialized language of science. I pay attention to their semanticity, since my main objective is to discover the topics evoked in terms of their integrating elements. The methodology applied is as follows: first, I compile a specialized corpus of research articles; second, I automatically extract a list of collocation candidates, which is manually revised; third, the selected collocations are semantically classified. As this work is motivated by a previous study on noun collocations, I perform a comparative analysis. The findings are that noun and verb collocations share similar semantic frames, although the method employed in each study yields different, but complementary, results.
10
01
JB code
cilt.366.08bre
142
156
15
Chapter
12
01
Chapter 8. Negative–positive adjective pairing in travel journalism in English, Italian, and Polish
1
A01
David Brett
Brett, David
David
Brett
Università degli Studi di Sassari
2
A01
Antonio Pinna
Pinna, Antonio
Antonio
Pinna
Università degli Studi di Sassari
3
A01
Barbara Loranc
Loranc, Barbara
Barbara
Loranc
University of Bielsko-Biala
20
ADJ+but+ADJ pattern
20
adjectives
20
English
20
Italian
20
negative – positive adjective pairing
20
Polish
20
travel journalism
01
Adjectives play a particular role in the language of tourism and often contribute to the formation of recurrent phraseologies (Manca, 2008). The combination of adjectives bearing negative and positive connotation is widely reported in the literature (Dann, 1996; Edo Marzá, 2011, 2012). Durán-Muñoz (2019) focuses on the ADJ+but+ADJ pattern in an English language corpus of Adventure Tourism texts. This contribution examines the same pattern in 1M word corpora of Travel journalism, examining examples in English, but also extending the analysis to Italian and Polish, to determine whether the pattern is limited to one language, or whether it is widely used as a discourse strategy within the same register, regardless of the code adopted.
10
01
JB code
cilt.366.09gut
157
173
17
Chapter
13
01
Chapter 9. The middle construction and some machine translation issues
Exploring the process of compositional cospecification in quality-oriented middles
1
A01
Macarena Palma Gutiérrez
Palma Gutiérrez, Macarena
Macarena
Palma Gutiérrez
Universidad de Córdoba
20
Adverb + Verb collocation
20
colloconstructional analysis of MWU
20
compositional cospecification
20
inanimate entities
20
machine translation
20
middle construction
20
prototype effects
20
quality-oriented middles
01
This paper aims at exploring the colloconstructional analysis of multiword units (MWU) in machine translation regarding the middle construction in terms of compositional cospecification (Yoshimura, 1998; Yoshimura & Taylor, 2004). For that purpose, I examine the Adverb + Verb collocation focusing on the predicates <i>cut</i> and <i>drive</i>, collocated with quality-oriented adjuncts (Heyvaert, 2003) and incorporating Inanimate Subject entities (Patients, Enablers and Instruments). The number of instances examined is 500+. The data analysed reveals that apart from the generalised <i>Qt-Qc</i> pattern in shift of semantic importance, other patterns (namely, <i>Qc-Qc</i>) can be found by virtue of the prototype effects of the construction (cf. Taylor, 1995), thus providing a potential source of disambiguation in the computational treatment of MWU in machine translation.
10
01
JB code
cilt.366.10roj
174
197
24
Chapter
14
01
Chapter 10. Semantic annotation of named rivers and its application for the prediction of multiword-term bracketing
1
A01
Juan Rojas Garcia
Rojas Garcia, Juan
Juan
Rojas Garcia
University of Granada
20
bracketing prediction
20
frame-based terminology
20
named river
20
predicate-argument structure analysis
20
semantic annotation
20
terminological knowledge base
20
three-component multi-word term
01
The acquisition of knowledge is essential for specialized translation, hence the representation of specialized phraseology in terminological knowledge bases is part of this process. The aim of this study was thus two-fold. Firstly, it describes how the semantic annotation of predicate-argument structure of sentences mentioning named rivers can be addressed from the perspective of Frame-based Terminology. The results showed that this approach provides valuable insights into the knowledge structures underlying the usage of named rivers in specialized texts. Secondly, this study explores whether the bracketing of a three-component multi-word term can be predicted from the semantic information encoded in the sentence where the ternary compound and a named river are used as arguments. The semantic annotations permitted construction of two machine-learning models capable of accurately predicting ternary-compound bracketing.
10
01
JB code
cilt.366.11mar
198
218
21
Chapter
15
01
Chapter 11. Irony in American-English tweets
A cognitive and phraseological analysis
1
A01
Beatriz Martín Gascón
Martín Gascón, Beatriz
Beatriz
Martín Gascón
Universidad Complutense de Madrid
20
American English
20
big data
20
cognitive linguistics
20
contextual ironic markers
20
echoic account
20
intercultural awareness
20
ironic phraseology
20
Spanish
20
twitter
20
verbal irony
01
The present study examines verbal irony from a cognitive linguistics perspective, based on Ruiz de Mendoza’s (2017) development of the echoic account and on big data. Built on previous research on the detection of Spanish ironic utterances on Twitter (Martín-Gascón, 2019), this investigation aims to analyze how American-English speakers conceptualize and express irony and compares findings to the Spanish ones. The dataset, initially consisting of 1,157,773,379 tweets from 248 countries and 66 languages, was first reduced to 27,517 tweets from English-speaking users in the United States using the words “irony”, “ironies”, and “ironic”, then to 605 containing the words as hashtags and finally to 495 tweets evincing implicit and explicit-echoic irony. An in-depth cognitive and qualitative analysis of the sample revealed the complexities of perceiving irony in written discourse and, therefore, the relevance of adding contextual ironic markers, such as hashtags, emojis, interjections, laughter typing and ironic phraseology, among others. In line with Martín-Gascón’s (2019) study, findings showed a higher use of positive and explicit-echoic irony to the detriment of implicit and negative irony. By drawing attention to the similarities and differences in the expression of irony, we expect to offer preliminary informed options for the design of pedagogical proposals that enhance not only the learners’ linguistic and ironic competencies, but also their intercultural awareness.
10
01
JB code
cilt.366.12tak
219
243
25
Chapter
16
01
Chapter 12. A comprehensive Japanese MWE lexicon
JMWEL
1
A01
Masahito Takahashi
Takahashi, Masahito
Masahito
Takahashi
Kurume Institute of Technology, emeritus, Japan
2
A01
Toshifumi Tanabe
Tanabe, Toshifumi
Toshifumi
Tanabe
Fukuoka University, Japan
3
A01
Jack Halpern
Halpern, Jack
Jack
Halpern
CJKI Co., Japan
4
A01
Kosho Shudo
Shudo, Kosho
Kosho
Shudo
Fukuoka University, emeritus, Japan
20
construction grammar
20
formulaic language
20
lexical bundles
20
multiword expression (MWE)
20
neural machine translation (NMT)
20
phrase-based machine translation (PBMT)
20
phrase-based NLP
20
phrase-based statistical machine translation (PBSMT)
20
phraseology
01
JMWEL (Japanese MWE Lexicon) is a comprehensive lexicon of Japanese Multiword Expressions (MWEs) with a rich set of grammatical attributes fine-tuned for phrase-based processing of a wide range of Japanese documents. It has about 160,000 MWE lemmas covering almost every kind of linguistically idiosyncratic but commonly used Japanese phrases, e.g., idioms, quasi-idioms, collocations, quasi-collocations, clichés, quasi-clichés, institutionalized phrases, proverbs, and old sayings, excepting technical terms in specialized fields or named entities. JMWEL consists of sixteen sub-lexicons reflecting their distinctive features. The comprehensiveness of the collected MWEs and the detailed morpho-syntactic information given to each MWE, which may include internal modifiers, are notable features of JMWEL. In this paper, we introduce the newest version of JMWEL.
10
01
JB code
cilt.366.13dib
244
262
19
Chapter
17
01
Chapter 13. Ontology-based formalisation of Italian clitic verbal MWEs
An approach for supporting machine translation
1
A01
Maria Pia di Buono
di Buono, Maria Pia
Maria Pia
di Buono
University of Naples “L’Orientale”
2
A01
Johanna Monti
Monti, Johanna
Johanna
Monti
University of Naples “L’Orientale”
3
A01
Valeria Caruso
Caruso, Valeria
Valeria
Caruso
University of Naples “L’Orientale”
20
Italian clitic verbs
20
lexicographic resources
20
linguistic linked open data
20
Neural Machine Translation
20
OntoLex-Lemon
20
verbal multiword expressions
01
In this paper we present the development of an ontology-based bilingual (IT-EN) lexicographic resource of Italian clitic Verbal MultiWord Expressions (VMWEs) to support machine translation. Starting from an analysis of these units and their linguistic features, we examine how Neural Machine Translation (NMT) handles complex VMWEs and the related translation issues. Finally, we propose a bilingual resource, formalised by means of the OntoLex-Lemon model, which accounts for morphological, syntactic, and semantic features of Italian clitic verbs, in order to enhance automatic translation of VMWEs.
02
JBENJAMINS
John Benjamins Publishing Company
01
John Benjamins Publishing Company
Amsterdam/Philadelphia
NL
02
December 2024
20241215
2024
John Benjamins B.V.
02
WORLD
13
15
9789027217905
01
JB
3
John Benjamins e-Platform
03
jbe-platform.com
09
WORLD
10
20241215
01
00
125.00
EUR
R
01
00
105.00
GBP
Z
01
gen
00
163.00
USD
S
775029684
03
01
01
JB
John Benjamins Publishing Company
01
JB code
CILT 366 Hb
15
9789027217905
13
2024034106
BB
01
CILT
02
0304-0763
Current Issues in Linguistic Theory
366
01
Recent Advances in Multiword Units in Machine Translation and Translation Technology
01
cilt.366
01
https://benjamins.com
02
https://benjamins.com/catalog/cilt.366
1
B01
Johanna Monti
Monti, Johanna
Johanna
Monti
University of Naples “L’Orientale”
2
B01
Gloria Corpas Pastor
Corpas Pastor, Gloria
Gloria
Corpas Pastor
University of Malaga
3
B01
Ruslan Mitkov
Mitkov, Ruslan
Ruslan
Mitkov
Lancaster University
4
B01
Carlos Manuel Hidalgo-Ternero
Hidalgo-Ternero, Carlos Manuel
Carlos Manuel
Hidalgo-Ternero
University of Malaga
01
eng
278
ix
262
+ index
LAN009060
v.2006
CFK
2
24
JB Subject Scheme
LIN.COMPUT
Computational & corpus linguistics
24
JB Subject Scheme
LIN.SYNTAX
Syntax
24
JB Subject Scheme
LIN.THEOR
Theoretical linguistics
24
JB Subject Scheme
TRAN.TRANSL
Translation Studies
06
01
The investigation of phraseology through corpus-based and computational approaches holds significant relevance for various professionals, including translators, interpreters, terminologists, lexicographers, language instructors, and learners. Computational Phraseology, and in particular the computational analysis of multiword expressions (also known as multiword units), has gained prominence in recent years and is essential for a number of Natural Language Processing and Translation Technology applications. The failure to detect these units automatically could result in incorrect and problematic automatic translations and could hinder the performance of applications such as text summarisation and web search. Against this background, the volume offers 13 articles carefully selected and organised into two parts: ‘Computational treatment of multiword units’ and ‘Corpus-based and linguistic studies in phraseology’. The contributions not only highlight the latest advancements in computational and corpus-based phraseology but also reiterate its vital role in all areas of language technologies, including basic and applied research.
04
09
01
https://benjamins.com/covers/475/cilt.366.png
04
03
01
https://benjamins.com/covers/475_jpg/9789027217905.jpg
04
03
01
https://benjamins.com/covers/475_tif/9789027217905.tif
06
09
01
https://benjamins.com/covers/1200_front/cilt.366.hb.png
07
09
01
https://benjamins.com/covers/125/cilt.366.png
25
09
01
https://benjamins.com/covers/1200_back/cilt.366.hb.png
27
09
01
https://benjamins.com/covers/3d_web/cilt.366.hb.png
10
01
JB code
cilt.366.toc
v
vi
2
Table of contents
1
01
Table of contents
10
01
JB code
cilt.366.preface
vii
x
4
Preface
2
01
Preface
1
A01
Johanna Monti
Monti, Johanna
Johanna
Monti
2
A01
Gloria Corpas Pastor
Corpas Pastor, Gloria
Gloria
Corpas Pastor
3
A01
Ruslan Mitkov
Mitkov, Ruslan
Ruslan
Mitkov
4
A01
Carlos Manuel Hidalgo-Ternero
Hidalgo-Ternero, Carlos Manuel
Carlos Manuel
Hidalgo-Ternero
10
01
JB code
cilt.366.s1
11
1
Section header
3
01
Section 1. Computational treatment of multiword units
10
01
JB code
cilt.366.01col
2
17
16
Chapter
4
01
Chapter 1. Multi-word units in neural machine translation
Why the tip of the iceberg remains problematic
1
A01
Jean-Pierre Colson
Colson, Jean-Pierre
Jean-Pierre
Colson
University of Louvain
20
deep learning
20
idioms
20
neural machine translation
20
phraseology
20
transformer architecture
01
Neural machine translation (NMT) has recently made significant progress in improving the quality of the texts it produces. New features of NMT include the fluidity of translations and the successful handling of multi-word units. In this paper we first report the results of an automated evaluation of the percentage of phraseology in the translations produced by Google Translate and DeepL. A corpus-based approach makes it possible to estimate that both NMT systems succeed in producing an average percentage of phraseology that is quite reasonable and sometimes even higher than in natural language production by native speakers. However, a closer look at some problematic cases shows that the ability of NMT systems to treat phraseological units can be deceptive, as they are often unable to cope with contextual complexity and low-frequency idioms.
10
01
JB code
cilt.366.02hid
18
39
22
Chapter
5
01
Chapter 2. ReGap
A text-preprocessing algorithm to enhance MWE-aware neural machine translation systems
1
A01
Carlos Manuel Hidalgo-Ternero
Hidalgo-Ternero, Carlos Manuel
Carlos Manuel
Hidalgo-Ternero
Universidad de Málaga
2
A01
Gloria Corpas Pastor
Corpas Pastor, Gloria
Gloria
Corpas Pastor
Universidad de Málaga
20
DeepL
20
discontinuity
20
Neural Machine Translation (NMT)
20
Text-preprocessing algorithm
20
Token-based MWE identification
20
Verb-noun idiomatic constructions (VNICs)
01
This research presents ReGap, a text-preprocessing algorithm for the automatic token-based identification and conversion of discontinuous multiword expressions (MWEs) into their canonical state, i.e., their continuous form, as a means to optimise neural machine translation (NMT) systems. To this end, an experiment with flexible verb-noun idiomatic constructions (VNICs) is conducted in order to assess to what extent ReGap can enhance the performance of the most robust NMT system to date, DeepL, under the challenge of MWE discontinuity in the Spanish-into-English and the Spanish-into-German directionalities. In this regard, the promising results yielded for VNICs will shed some light on new avenues for enhancing MWE-aware NMT systems.
10
01
JB code
cilt.366.03spe
40
56
17
Chapter
6
01
Chapter 3. Evaluating the Italian-English machine translation quality of MWUs in the domain of archaeology
1
A01
Giulia Speranza
Speranza, Giulia
Giulia
Speranza
University of Naples “L’Orientale”, UNIOR NLP Research Group
2
A01
Johanna Monti
Monti, Johanna
Johanna
Monti
University of Naples “L’Orientale”, UNIOR NLP Research Group
20
archaeology
20
error analysis
20
evaluation
20
machine translation
20
multiword units
20
terminology
01
Multiword units (MWUs) represent a challenging and problematic linguistic issue in the field of Natural Language Processing (NLP) due to their idiosyncratic nature. This paper investigates the quality of Neural Machine Translation (NMT) outputs when dealing with MWUs in the domain of archaeology. As a case study, a dataset of 100 MWUs is used as a Gold Standard to evaluate out-of-context and in-context translation outputs from three state-of-the-art NMT systems for the Italian-English language pair: Google Translate, DeepL, and Microsoft Bing Translator. MT outputs are manually evaluated with reference to the Gold Standard, namely out-of-context and in-context human English translations of the selected 100 MWUs. Results show that terminology is still a problematic category for MT quality and that MWU translation may vary, and sometimes even improve, when further context is provided.
10
01
JB code
cilt.366.04kub
57
78
22
Chapter
7
01
Chapter 4. Post-editing neural machine translation in specialised languages
The role of corpora in the translation of phraseological structures
1
A01
Natalie Kübler
Kübler, Natalie
Natalie
Kübler
Université Paris Cité, CLILLAC-ARP
2
A01
Hanna Martikainen
Martikainen, Hanna
Hanna
Martikainen
ESIT / Université Sorbonne Nouvelle, CLESTHIA
3
A01
Alexandra Mestivier
Mestivier, Alexandra
Alexandra
Mestivier
Université Paris Cité, CLILLAC-ARP
4
A01
Mojca Pecman
Pecman, Mojca
Mojca
Pecman
Université Paris Cité, CLILLAC-ARP
20
corpus-based methodology
20
errors
20
neural machine translation
20
phraseology
20
post-editing
20
specialized texts
01
This study focuses on phraseology in specialised texts and on students’ difficulties pertaining to phraseology in post-editing neural machine translation output. It is undertaken within the corpus-based methodological framework that we have developed for several purposes, one of which is to assess the impact of corpus use on translation and post-editing. The objective of the study is to propose a descriptive analysis of typical student errors related to phraseology in order to design tailored pedagogical materials. We aim to show that, with consistent training in querying corpora and in interpreting results in an appropriate manner, students can manage to improve their productions when translating specialised texts or when post-editing machine translation output.
10
01
JB code
cilt.366.05leo
79
102
24
Chapter
8
01
Chapter 5. Evaluating a bracketing protocol for multiword terms
1
A01
Pilar León-Araúz
León-Araúz, Pilar
Pilar
León-Araúz
University of Granada
2
A01
Melania Cabezas-García
Cabezas-García, Melania
Melania
Cabezas-García
University of Granada
20
bracketing
20
corpus
20
multiword term
20
structural disambiguation
20
terminology
01
Multiword terms (MWTs) are frequently used to encapsulate and convey meaning in scientific and technical texts. However, they can also make these texts difficult to understand because the relations between constituents are not transparent. When MWTs have more than two constituents, a dependency analysis (bracketing) is often necessary to facilitate their interpretation. NLP has proposed various models to automatize bracketing operations, but none has been entirely satisfactory. This paper presents a protocol that combines various models and applies it to a set of three-constituent MWTs in order to: (i) sort rules by their disambiguation potential, based on their likelihood of retrieving results from any corpus and their ability to solve bracketing; and (ii) ascertain the influence of corpus size and type on the results obtained.
10
01
JB code
cilt.366.s2
103
1
Section header
9
01
Section 2. Corpus-based and linguistic studies in phraseology
10
01
JB code
cilt.366.06fan
104
123
20
Chapter
10
01
Chapter 6. Suggestions for a new model of functional phraseme categorization for applied purposes
1
A01
Anna Fankhauser
Fankhauser, Anna
Anna
Fankhauser
Universität Osnabrück
20
(bilingual) learner lexicography
20
corpus-based phraseology
20
corpus-derived phraseme list
20
foreign language teaching
20
phraseme categorization
20
phraseological core
20
translation
01
Although the significance of phraseology in various fields of applied linguistics such as translation, language teaching, and (bilingual) learner lexicography is generally agreed upon, existing models of phraseme categorization largely fail to account for the needs of language practitioners and learners. Yet, a classification model for applied purposes is required, for example, to provide language practitioners and learners with a systematic list of useful phraseological items that can be applied to individual situations of language production and reception. The model suggested in the present paper consistently applies functional classification criteria and is derived from an extensive corpus study of spoken British and American English. It is hoped that the empirical approach and the focus on functional properties of phrasemes will ensure that the model is of maximum relevance for applied purposes.
10
01
JB code
cilt.366.07jim
124
141
18
Chapter
11
01
Chapter 7. Verb collocations and their semantics in the specialized language of science
1
A01
Eva Lucía Jiménez-Navarro
Jiménez-Navarro, Eva Lucía
Eva Lucía
Jiménez-Navarro
Universidad de Córdoba
20
method
20
noun collocation
20
research article
20
semantic frame
20
specialized corpus
20
specialized language of science
20
verb collocation
01
This chapter is concerned with verb collocations in the specialized language of science. I pay attention to their semanticity, since my main objective is to discover the topics evoked in terms of their integrating elements. The methodology applied is as follows: first, I compile a specialized corpus of research articles; second, I automatically extract a list of collocation candidates, which is manually revised; third, the selected collocations are semantically classified. As this work is motivated by a previous study on noun collocations, I perform a comparative analysis. The findings are that noun and verb collocations share similar semantic frames, although the method employed in each study yields different, but complementary, results.
10
01
JB code
cilt.366.08bre
142
156
15
Chapter
12
01
Chapter 8. Negative–positive adjective pairing in travel journalism in English, Italian, and Polish
1
A01
David Brett
Brett, David
David
Brett
Università degli Studi di Sassari
2
A01
Antonio Pinna
Pinna, Antonio
Antonio
Pinna
Università degli Studi di Sassari
3
A01
Barbara Loranc
Loranc, Barbara
Barbara
Loranc
University of Bielsko-Biala
20
ADJ+but+ADJ pattern
20
adjectives
20
English
20
Italian
20
negative – positive adjective pairing
20
Polish
20
travel journalism
01
Adjectives play a particular role in the language of tourism and often contribute to the formation of recurrent phraseologies (Manca, 2008). The combination of adjectives bearing negative and positive connotations is widely reported in the literature (Dann, 1996; Edo Marzá, 2011, 2012). Durán-Muñoz (2019) focuses on the ADJ+but+ADJ pattern in an English language corpus of Adventure Tourism texts. This contribution examines the same pattern in 1M-word corpora of Travel journalism, examining examples in English, but also extending the analysis to Italian and Polish, to determine whether the pattern is limited to one language, or whether it is widely used as a discourse strategy within the same register, regardless of the code adopted.
10
01
JB code
cilt.366.09gut
157
173
17
Chapter
13
01
Chapter 9. The middle construction and some machine translation issues
Exploring the process of compositional cospecification in quality-oriented middles
1
A01
Macarena Palma Gutiérrez
Palma Gutiérrez, Macarena
Macarena
Palma Gutiérrez
Universidad de Córdoba
20
Adverb + Verb collocation
20
colloconstructional analysis of MWU
20
compositional cospecification
20
inanimate entities
20
machine translation
20
middle construction
20
prototype effects
20
quality-oriented middles
01
This paper aims at exploring the colloconstructional analysis of multiword units (MWU) in machine translation regarding the middle construction in terms of compositional cospecification (Yoshimura, 1998; Yoshimura & Taylor, 2004). For that purpose, I examine the Adverb + Verb collocation focusing on the predicates <i>cut</i> and <i>drive</i>, collocated with quality-oriented adjuncts (Heyvaert, 2003) and incorporating Inanimate Subject entities (Patients, Enablers and Instruments). The number of instances examined is 500+. The data analysed reveals that apart from the generalised <i>Qt-Qc</i> pattern in shift of semantic importance, other patterns (namely, <i>Qc-Qc</i>) can be found by virtue of the prototype effects of the construction (cf. Taylor, 1995), thus providing a potential source of disambiguation in the computational treatment of MWU in machine translation.
10
01
JB code
cilt.366.10roj
174
197
24
Chapter
14
01
Chapter 10. Semantic annotation of named rivers and its application for the prediction of multiword-term bracketing
1
A01
Juan Rojas Garcia
Rojas Garcia, Juan
Juan
Rojas Garcia
University of Granada
20
bracketing prediction
20
frame-based terminology
20
named river
20
predicate-argument structure analysis
20
semantic annotation
20
terminological knowledge base
20
three-component multi-word term
01
The acquisition of knowledge is essential for specialized translation, hence the representation of specialized phraseology in terminological knowledge bases is part of this process. The aim of this study was thus two-fold. Firstly, it describes how the semantic annotation of predicate-argument structure of sentences mentioning named rivers can be addressed from the perspective of Frame-based Terminology. The results showed that this approach provides valuable insights into the knowledge structures underlying the usage of named rivers in specialized texts. Secondly, this study explores whether the bracketing of a three-component multi-word term can be predicted from the semantic information encoded in the sentence where the ternary compound and a named river are used as arguments. The semantic annotations permitted the construction of two machine-learning models capable of accurately predicting ternary-compound bracketing.
10
01
JB code
cilt.366.11mar
198
218
21
Chapter
15
01
Chapter 11. Irony in American-English tweets
A cognitive and phraseological analysis
1
A01
Beatriz Martín Gascón
Martín Gascón, Beatriz
Beatriz
Martín Gascón
Universidad Complutense de Madrid
20
American English
20
big data
20
cognitive linguistics
20
contextual ironic markers
20
echoic account
20
intercultural awareness
20
ironic phraseology
20
Spanish
20
Twitter
20
verbal irony
01
The present study examines verbal irony from a cognitive linguistics perspective, based on Ruiz de Mendoza’s (2017) development of the echoic account and on big data. Building on previous research on the detection of Spanish ironic utterances on Twitter (Martín-Gascón, 2019), this investigation aims to analyze how American-English speakers conceptualize and express irony and to compare the findings with the Spanish ones. The dataset, initially consisting of 1,157,773,379 tweets from 248 countries and 66 languages, was first reduced to 27,517 tweets from English-speaking users in the United States using the words “irony”, “ironies”, and “ironic”, then to 605 containing the words as hashtags and finally to 495 tweets evincing implicit and explicit-echoic irony. An in-depth cognitive and qualitative analysis of the sample revealed the complexities of perceiving irony in written discourse and, therefore, the relevance of adding contextual ironic markers, such as hashtags, emojis, interjections, laughter typing and ironic phraseology, among others. In line with Martín-Gascón’s (2019) study, findings showed a higher use of positive and explicit-echoic irony to the detriment of implicit and negative irony. By drawing attention to the similarities and differences in the expression of irony, we expect to offer preliminary informed options for the design of pedagogical proposals that enhance not only the learners’ linguistic and ironic competencies, but also their intercultural awareness.
10
01
JB code
cilt.366.12tak
219
243
25
Chapter
16
01
Chapter 12. A comprehensive Japanese MWE lexicon
JMWEL
1
A01
Masahito Takahashi
Takahashi, Masahito
Masahito
Takahashi
Kurume Institute of Technology, emeritus, Japan
2
A01
Toshifumi Tanabe
Tanabe, Toshifumi
Toshifumi
Tanabe
Fukuoka University, Japan
3
A01
Jack Halpern
Halpern, Jack
Jack
Halpern
CJKI Co., Japan
4
A01
Kosho Shudo
Shudo, Kosho
Kosho
Shudo
Fukuoka University, emeritus, Japan
20
construction grammar
20
formulaic language
20
lexical bundles
20
multiword expression (MWE)
20
neural machine translation (NMT)
20
phrase-based machine translation (PBMT)
20
phrase-based NLP
20
phrase-based statistical machine translation (PBSMT)
20
phraseology
01
JMWEL (Japanese MWE Lexicon) is a comprehensive lexicon of Japanese Multiword Expressions (MWEs) with a rich set of grammatical attributes fine-tuned for phrase-based processing of a wide range of Japanese documents. It has about 160,000 MWE lemmas covering almost every kind of linguistically idiosyncratic but commonly used Japanese phrases, e.g., idioms, quasi-idioms, collocations, quasi-collocations, clichés, quasi-clichés, institutionalized phrases, proverbs, and old sayings, excepting technical terms in specialized fields or named entities. JMWEL consists of sixteen sub-lexicons reflecting their distinctive features. The comprehensiveness of the collected MWEs and the detailed morpho-syntactic information given to each MWE, which may include internal modifiers, are notable features of JMWEL. In this paper, we introduce the newest version of JMWEL.
10
01
JB code
cilt.366.13dib
244
262
19
Chapter
17
01
Chapter 13. Ontology-based formalisation of Italian clitic verbal MWEs
An approach for supporting machine translation
1
A01
Maria Pia di Buono
di Buono, Maria Pia
Maria Pia
di Buono
University of Naples “L’Orientale”
2
A01
Johanna Monti
Monti, Johanna
Johanna
Monti
University of Naples “L’Orientale”
3
A01
Valeria Caruso
Caruso, Valeria
Valeria
Caruso
University of Naples “L’Orientale”
20
Italian clitic verbs
20
lexicographic resources
20
linguistic linked open data
20
Neural Machine Translation
20
OntoLex-Lemon
20
verbal multiword expressions
01
In this paper we present the development of an ontology-based bilingual (IT-EN) lexicographic resource of Italian clitic Verbal MultiWord Expressions (VMWEs) to support machine translation. Starting from an analysis of these units and their linguistic features, we examine how Neural Machine Translation (NMT) handles complex VMWEs and the related translation issues. Finally, we propose a bilingual resource, formalised by means of the OntoLex-Lemon model, which accounts for morphological, syntactic, and semantic features of Italian clitic verbs, in order to enhance automatic translation of VMWEs.
02
JBENJAMINS
John Benjamins Publishing Company
01
John Benjamins Publishing Company
Amsterdam/Philadelphia
NL
02
December 2024
20241215
2024
John Benjamins B.V.
02
WORLD
01
JB
1
John Benjamins Publishing Company
+31 20 6304747
+31 20 6739773
bookorder@benjamins.nl
01
https://benjamins.com
01
WORLD
US CA MX
10
20241215
01
02
JB
1
00
125.00
EUR
R
02
02
JB
1
00
132.50
EUR
R
01
JB
10
bebc
+44 1202 712 934
+44 1202 712 913
sales@bebc.co.uk
03
GB
10
20241215
02
02
JB
1
00
105.00
GBP
Z
01
JB
2
John Benjamins North America
+1 800 562-5666
+1 703 661-1501
benjamins@presswarehouse.com
01
https://benjamins.com
01
US CA MX
10
20241215
01
gen
02
JB
1
00
163.00
USD