<TitleType>01</TitleType> <TitleText textformat="02">Recent Advances in Multiword Units in Machine Translation and Translation Technology</TitleText>

59029685 03 01 01 JB John Benjamins Publishing Company 01 JB code CILT 366 Eb 15 9789027246387 06 10.1075/cilt.366 13 2024034107 DG 002 02 01 CILT 02 0304-0763 Current Issues in Linguistic Theory 366 <TitleType>01</TitleType> <TitleText textformat="02">Recent Advances in Multiword Units in Machine Translation and Translation Technology</TitleText> 01 cilt.366 01 https://benjamins.com 02 https://benjamins.com/catalog/cilt.366 1 B01 Johanna Monti Monti, Johanna Johanna Monti University of Naples “L’Orientale” 2 B01 Gloria Corpas Pastor Corpas Pastor, Gloria Gloria Corpas Pastor University of Malaga 3 B01 Ruslan Mitkov Mitkov, Ruslan Ruslan Mitkov Lancaster University 4 B01 Carlos Manuel Hidalgo-Ternero Hidalgo-Ternero, Carlos Manuel Carlos Manuel Hidalgo-Ternero University of Malaga 01 eng 278 ix 262 + index LAN009060 v.2006 CFK 2 24 JB Subject Scheme LIN.COMPUT Computational & corpus linguistics 24 JB Subject Scheme LIN.SYNTAX Syntax 24 JB Subject Scheme LIN.THEOR Theoretical linguistics 24 JB Subject Scheme TRAN.TRANSL Translation Studies 06 01 The investigation of phraseology through corpus-based and computational approaches holds significant relevance for various professionals, including translators, interpreters, terminologists, lexicographers, language instructors, and learners. Computational Phraseology, and in particular the computational analysis of multiword expressions (also known as multiword units), has gained prominence in recent years and is essential for a number of Natural Language Processing and Translation Technology applications. The failure to detect these units automatically could result in incorrect and problematic automatic translations and could hinder the performance of applications such as text summarisation and web search. Against this background, the volume offers 13 articles carefully selected and organised into two parts: ‘Computational treatment of multiword units’ and ‘Corpus-based and linguistic studies in phraseology’.
The contributions not only highlight the latest advancements in computational and corpus-based phraseology but also reiterate its vital role in all areas of language technologies, including basic and applied research. 04 09 01 https://benjamins.com/covers/475/cilt.366.png 04 03 01 https://benjamins.com/covers/475_jpg/9789027217905.jpg 04 03 01 https://benjamins.com/covers/475_tif/9789027217905.tif 06 09 01 https://benjamins.com/covers/1200_front/cilt.366.hb.png 07 09 01 https://benjamins.com/covers/125/cilt.366.png 25 09 01 https://benjamins.com/covers/1200_back/cilt.366.hb.png 27 09 01 https://benjamins.com/covers/3d_web/cilt.366.hb.png 10 01 JB code cilt.366.toc v vi 2 Table of contents 1 <TitleType>01</TitleType> <TitleText textformat="02">Table of contents</TitleText> 10 01 JB code cilt.366.preface vii x 4 Preface 2 <TitleType>01</TitleType> <TitleText textformat="02">Preface</TitleText> 1 A01 Johanna Monti Monti, Johanna Johanna Monti 2 A01 Gloria Corpas Pastor Corpas Pastor, Gloria Gloria Corpas Pastor 3 A01 Ruslan Mitkov Mitkov, Ruslan Ruslan Mitkov 4 A01 Carlos Manuel Hidalgo-Ternero Hidalgo-Ternero, Carlos Manuel Carlos Manuel Hidalgo-Ternero 10 01 JB code cilt.366.s1 11 1 Section header 3 <TitleType>01</TitleType> <TitleText textformat="02">Section 1. Computational treatment of multiword units</TitleText> 10 01 JB code cilt.366.01col 2 17 16 Chapter 4 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 1. Multi-word units in neural machine translation</TitleText> <Subtitle textformat="02">Why the tip of the iceberg remains problematic</Subtitle> 1 A01 Jean-Pierre Colson Colson, Jean-Pierre Jean-Pierre Colson University of Louvain 20 deep learning 20 idioms 20 neural machine translation 20 phraseology 20 transformer architecture 01 Neural machine translation (NMT) has recently made significant progress in improving the quality of the texts it produces. 
New features of NMT include the fluidity of translations and the successful handling of multi-word units. In this paper we first report the results of an automated evaluation of the percentage of phraseology in the translations produced by Google Translate and DeepL. A corpus-based approach makes it possible to estimate that both NMT systems succeed in producing an average percentage of phraseology that is quite reasonable and sometimes even higher than in natural language production by native speakers. However, a closer look at some problematic cases shows that the ability of NMT systems to treat phraseological units can be deceptive, as they are often unable to cope with contextual complexity and low-frequency idioms. 10 01 JB code cilt.366.02hid 18 39 22 Chapter 5 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 2. ReGap</TitleText> <Subtitle textformat="02">A text-preprocessing algorithm to enhance MWE-aware neural machine translation systems</Subtitle> 1 A01 Carlos Manuel Hidalgo-Ternero Hidalgo-Ternero, Carlos Manuel Carlos Manuel Hidalgo-Ternero Universidad de Málaga 2 A01 Gloria Corpas Pastor Corpas Pastor, Gloria Gloria Corpas Pastor Universidad de Málaga 20 DeepL 20 discontinuity 20 Neural Machine Translation (NMT) 20 Text-preprocessing algorithm 20 Token-based MWE identification 20 Verb-noun idiomatic constructions (VNICs) 01 This research presents ReGap, a text-preprocessing algorithm for the automatic token-based identification and conversion of discontinuous multiword expressions (MWEs) into their canonical state, i.e., their continuous form, as a means to optimise neural machine translation (NMT) systems. To this end, an experiment with flexible verb-noun idiomatic constructions (VNICs) is conducted in order to assess to what extent ReGap can enhance the performance of the most robust NMT system to date, DeepL, under the challenge of MWE discontinuity in the Spanish-into-English and the Spanish-into-German directionalities. 
In this regard, the promising results yielded for VNICs will shed some light on new avenues for enhancing MWE-aware NMT systems. 10 01 JB code cilt.366.03spe 40 56 17 Chapter 6 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 3. Evaluating the Italian-English machine translation quality of MWUs in the domain of archaeology</TitleText> 1 A01 Giulia Speranza Speranza, Giulia Giulia Speranza University of Naples “L’Orientale”, UNIOR NLP Research Group 2 A01 Johanna Monti Monti, Johanna Johanna Monti University of Naples “L’Orientale”, UNIOR NLP Research Group 20 archaeology 20 error analysis 20 evaluation 20 machine translation 20 multiword units 20 terminology 01 Multiword units (MWUs) represent a challenging and problematic linguistic issue in the field of Natural Language Processing (NLP) due to their idiosyncratic nature. This paper investigates the quality of Neural Machine Translation (NMT) outputs when dealing with MWUs in the domain of archaeology. As a case study, a dataset of 100 MWUs is used as a Gold Standard to evaluate out-of-context and in-context translation outputs from three state-of-the-art NMT systems for the Italian-English language pair: Google Translate, DeepL, and Microsoft Bing Translator. MT outputs are manually evaluated with reference to the Gold Standard, namely out-of-context and in-context human English translations of the selected 100 MWUs. Results show that terminology is still a problematic category for MT quality and that MWU translation may vary, and sometimes even improve, when further context is provided. 10 01 JB code cilt.366.04kub 57 78 22 Chapter 7 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 4. 
Post-editing neural machine translation in specialised languages</TitleText> <Subtitle textformat="02">The role of corpora in the translation of phraseological structures</Subtitle> 1 A01 Natalie Kübler Kübler, Natalie Natalie Kübler Université Paris Cité, CLILLAC-ARP 2 A01 Hanna Martikainen Martikainen, Hanna Hanna Martikainen ESIT / Université Sorbonne Nouvelle, CLESTHIA 3 A01 Alexandra Mestivier Mestivier, Alexandra Alexandra Mestivier Université Paris Cité, CLILLAC-ARP 4 A01 Mojca Pecman Pecman, Mojca Mojca Pecman Université Paris Cité, CLILLAC-ARP 20 corpus-based methodology 20 errors 20 neural machine translation 20 phraseology 20 post-editing 20 specialized texts 01 This study focuses on phraseology in specialised texts and on students’ difficulties pertaining to phraseology in post-editing neural machine translation output. It is undertaken within the corpus-based methodological framework that we have developed for several purposes, one of which is to assess the impact of corpus use on translation and post-editing. The objective of the study is to propose a descriptive analysis of typical student errors related to phraseology in order to design tailored pedagogical materials. We aim to show that, with consistent training in querying corpora and in interpreting results in an appropriate manner, students can manage to improve their productions when translating specialised texts or when post-editing machine translation output. 10 01 JB code cilt.366.05leo 79 102 24 Chapter 8 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 5. 
Evaluating a bracketing protocol for multiword terms</TitleText> 1 A01 Pilar León-Araúz León-Araúz, Pilar Pilar León-Araúz University of Granada 2 A01 Melania Cabezas-García Cabezas-García, Melania Melania Cabezas-García University of Granada 20 bracketing 20 corpus 20 multiword term 20 structural disambiguation 20 terminology 01 Multiword terms (MWTs) are frequently used to encapsulate and convey meaning in scientific and technical texts. However, they can also make these texts difficult to understand because the relations between constituents are not transparent. When MWTs have more than two constituents, a dependency analysis (bracketing) is often necessary to facilitate their interpretation. NLP has proposed various models to automatize bracketing operations, but none has been entirely satisfactory. This paper presents a protocol that combines various models and applies it to a set of three-constituent MWTs in order to: (i) sort rules by their disambiguation potential, based on their likelihood of retrieving results from any corpus and their ability to solve bracketing; and (ii) ascertain the influence of corpus size and type in the results obtained. 10 01 JB code cilt.366.s2 103 1 Section header 9 <TitleType>01</TitleType> <TitleText textformat="02">Section 2. Corpus-based and linguistic studies in phraseology</TitleText> 10 01 JB code cilt.366.06fan 104 123 20 Chapter 10 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 6. 
Suggestions for a new model of functional phraseme categorization for applied purposes</TitleText> 1 A01 Anna Fankhauser Fankhauser, Anna Anna Fankhauser Universität Osnabrück 20 (bilingual) learner lexicography 20 corpus-based phraseology 20 corpus-derived phraseme list 20 foreign language teaching 20 phraseme categorization 20 phraseological core 20 translation 01 Although the significance of phraseology in various fields of applied linguistics such as translation, language teaching, and (bilingual) learner lexicography is generally agreed upon, existing models of phraseme categorization largely fail to account for the needs of language practitioners and learners. Yet, a classification model for applied purposes is required, for example, to provide language practitioners and learners with a systematic list of useful phraseological items that can be applied to individual situations of language production and reception. The model suggested in the present paper consistently applies functional classification criteria and is derived from an extensive corpus study of spoken British and American English. It is hoped that the empirical approach and the focus on functional properties of phrasemes will ensure that the model is of maximum relevance for applied purposes. 10 01 JB code cilt.366.07jim 124 141 18 Chapter 11 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 7. Verb collocations and their semantics in the specialized language of science</TitleText> 1 A01 Eva Lucía Jiménez-Navarro Jiménez-Navarro, Eva Lucía Eva Lucía Jiménez-Navarro Universidad de Córdoba 20 method 20 noun collocation 20 research article 20 semantic frame 20 specialized corpus 20 specialized language of science 20 verb collocation 01 This chapter is concerned with verb collocations in the specialized language of science. I pay attention to their semanticity, since my main objective is to discover the topics evoked in terms of their integrating elements. 
The methodology applied is as follows: first, I compile a specialized corpus of research articles; second, I automatically extract a list of collocation candidates, which is manually revised; third, the selected collocations are semantically classified. As this work is motivated by a previous study on noun collocations, I perform a comparative analysis. The findings are that noun and verb collocations share similar semantic frames, although the method employed in each study yields different, but complementary, results. 10 01 JB code cilt.366.08bre 142 156 15 Chapter 12 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 8. Negative–positive adjective pairing in travel journalism in English, Italian, and Polish</TitleText> 1 A01 David Brett Brett, David David Brett Università degli Studi di Sassari 2 A01 Antonio Pinna Pinna, Antonio Antonio Pinna Università degli Studi di Sassari 3 A01 Barbara Loranc Loranc, Barbara Barbara Loranc University of Bielsko-Biala 20 ADJ+but+ADJ pattern 20 adjectives 20 English 20 Italian 20 negative – positive adjective pairing 20 Polish 20 travel journalism 01 Adjectives play a particular role in the language of tourism and often contribute to the formation of recurrent phraseologies (Manca, 2008). The combination of adjectives bearing negative and positive connotation is widely reported in the literature (Dann, 1996; Edo Marzá, 2011, 2012). Durán-Muñoz (2019) focuses on the ADJ+but+ADJ pattern in an English language corpus of Adventure Tourism texts. This contribution examines the same pattern in 1M word corpora of Travel journalism, examining examples in English, but also extending the analysis to Italian and Polish, to determine whether the pattern is limited to one language, or whether it is widely used as a discourse strategy within the same register, regardless of the code adopted. 10 01 JB code cilt.366.09gut 157 173 17 Chapter 13 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 9. 
The middle construction and some machine translation issues</TitleText> <Subtitle textformat="02">Exploring the process of compositional cospecification in quality-oriented middles</Subtitle> 1 A01 Macarena Palma Gutiérrez Palma Gutiérrez, Macarena Macarena Palma Gutiérrez Universidad de Córdoba 20 Adverb + Verb collocation 20 colloconstructional analysis of MWU 20 compositional cospecification 20 inanimate entities 20 machine translation 20 middle construction 20 prototype effects 20 quality-oriented middles 01 This paper aims at exploring the colloconstructional analysis of multiword units (MWU) in machine translation regarding the middle construction in terms of compositional cospecification (Yoshimura, 1998; Yoshimura & Taylor, 2004). For that purpose, I examine the Adverb + Verb collocation focusing on the predicates cut and drive, collocated with quality-oriented adjuncts (Heyvaert, 2003) and incorporating Inanimate Subject entities (Patients, Enablers and Instruments). The number of instances examined is 500+. The data analysed reveals that apart from the generalised Qt-Qc pattern in shift of semantic importance, other patterns (namely, Qc-Qc) can be found by virtue of the prototype effects of the construction (cf. Taylor, 1995), thus providing a potential source of disambiguation in the computational treatment of MWU in machine translation. 10 01 JB code cilt.366.10roj 174 197 24 Chapter 14 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 10. 
Semantic annotation of named rivers and its application for the prediction of multiword-term bracketing</TitleText> 1 A01 Juan Rojas Garcia Rojas Garcia, Juan Juan Rojas Garcia University of Granada 20 bracketing prediction 20 frame-based terminology 20 named river 20 predicate-argument structure analysis 20 semantic annotation 20 terminological knowledge base 20 three-component multi-word term 01 The acquisition of knowledge is essential for specialized translation, hence the representation of specialized phraseology in terminological knowledge bases is part of this process. The aim of this study was thus two-fold. Firstly, it describes how the semantic annotation of predicate-argument structure of sentences mentioning named rivers can be addressed from the perspective of Frame-based Terminology. The results showed that this approach provides valuable insights into the knowledge structures underlying the usage of named rivers in specialized texts. Secondly, this study explores whether the bracketing of a three-component multi-word term can be predicted from the semantic information encoded in the sentence where the ternary compound and a named river are used as arguments. The semantic annotations permitted construction of two machine-learning models capable of accurately predicting ternary-compound bracketing. 10 01 JB code cilt.366.11mar 198 218 21 Chapter 15 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 11. 
Irony in American-English tweets</TitleText> <Subtitle textformat="02">A cognitive and phraseological analysis</Subtitle> 1 A01 Beatriz Martín Gascón Martín Gascón, Beatriz Beatriz Martín Gascón Universidad Complutense de Madrid 20 American English 20 big data 20 cognitive linguistics 20 contextual ironic markers 20 echoic account 20 intercultural awareness 20 ironic phraseology 20 Spanish 20 twitter 20 verbal irony 01 The present study examines verbal irony from a cognitive linguistics perspective, based on Ruiz de Mendoza’s (2017) development of the echoic account and on big data. Built on previous research on the detection of Spanish ironic utterances in Twitter (Martín-Gascón, 2019), this investigation aims to analyze how American-English speakers conceptualize and express irony and compares findings to the Spanish ones. The dataset, initially consisting of 1,157,773,379 tweets from 248 countries and 66 languages, was first reduced to 27,517 tweets from English-speaking users in the United States using the words “irony”, “ironies”, and “ironic”, then to 605 containing the words as hashtag and finally to 495 tweets evincing implicit and explicit-echoic irony. An in-depth cognitive and qualitative analysis of the sample revealed the complexities of perceiving irony in written discourse and, therefore, the relevance of adding contextual ironic markers, such as hashtags, emojis, interjections, laughter typing and ironic phraseology, among others. In line with Martín-Gascón’s (2019) study, findings showed a higher use of positive and explicit-echoic irony to the detriment of implicit and negative irony. By drawing attention to the similarities and differences in the expression of irony, we expect to offer preliminary informed options for the design of pedagogical proposals that enhance not only the learners’ linguistic and ironic competencies, but also their intercultural awareness. 
10 01 JB code cilt.366.12tak 219 243 25 Chapter 16 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 12. A comprehensive Japanese MWE lexicon</TitleText> <Subtitle textformat="02">JMWEL</Subtitle> 1 A01 Masahito Takahashi Takahashi, Masahito Masahito Takahashi Kurume Institute of Technology, emeritus, Japan 2 A01 Toshifumi Tanabe Tanabe, Toshifumi Toshifumi Tanabe Fukuoka University, Japan 3 A01 Jack Halpern Halpern, Jack Jack Halpern CJKI Co., Japan 4 A01 Kosho Shudo Shudo, Kosho Kosho Shudo Fukuoka University, emeritus, Japan 20 construction grammar 20 formulaic language 20 lexical bundles 20 multiword expression (MWE) 20 neural machine translation (NMT) 20 phrase-based machine translation (PBMT) 20 phrase-based NLP 20 phrase-based statistical machine translation (PBSMT) 20 phraseology 01 JMWEL (Japanese MWE Lexicon) is a comprehensive lexicon of Japanese Multiword Expressions (MWEs) with a rich set of grammatical attributes fine-tuned for phrase-based processing of a wide range of Japanese documents. It has about 160,000 MWE lemmas covering almost every kind of linguistically idiosyncratic but commonly used Japanese phrases, e.g., idioms, quasi-idioms, collocations, quasi-collocations, clichés, quasi-clichés, institutionalized phrases, proverbs, and old sayings, excepting technical terms in specialized fields or named entities. JMWEL consists of sixteen sub-lexicons reflecting their distinctive features. The comprehensiveness of the collected MWEs and the detailed morpho-syntactic information given to each MWE, which may include internal modifiers, are notable features of JMWEL. In this paper, we introduce the newest version of JMWEL. 10 01 JB code cilt.366.13dib 244 262 19 Chapter 17 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 13. 
Ontology-based formalisation of Italian clitic verbal MWEs</TitleText> <Subtitle textformat="02">An approach for supporting machine translation</Subtitle> 1 A01 Maria Pia di Buono di Buono, Maria Pia Maria Pia di Buono University of Naples “L’Orientale” 2 A01 Johanna Monti Monti, Johanna Johanna Monti University of Naples “L’Orientale” 3 A01 Valeria Caruso Caruso, Valeria Valeria Caruso University of Naples “L’Orientale” 20 Italian clitic verbs 20 lexicographic resources 20 linguistic linked open data 20 Neural Machine Translation 20 OntoLex-Lemon 20 verbal multiword expressions 01 In this paper we present the development of an ontology-based bilingual (IT-EN) lexicographic resource of Italian clitic Verbal MultiWord Expressions (VMWEs) to support machine translation. Starting from an analysis of these units and their linguistic features, we examine how Neural Machine Translation (NMT) handles complex VMWEs and the related translation issues. Finally, we propose a bilingual resource, formalised by means of the OntoLex-Lemon model, which accounts for morphological, syntactic, and semantic features of Italian clitic verbs, in order to enhance automatic translation of VMWEs. 02 JBENJAMINS John Benjamins Publishing Company 01 John Benjamins Publishing Company Amsterdam/Philadelphia NL 02 December 2024 20241215 2024 John Benjamins B.V. 
02 WORLD 13 15 9789027217905 01 JB 3 John Benjamins e-Platform 03 jbe-platform.com 09 WORLD 10 20241215 01 00 125.00 EUR R 01 00 105.00 GBP Z 01 gen 00 163.00 USD S 775029684 03 01 01 JB John Benjamins Publishing Company 01 JB code CILT 366 Hb 15 9789027217905 13 2024034106 BB 01 CILT 02 0304-0763 Current Issues in Linguistic Theory 366 <TitleType>01</TitleType> <TitleText textformat="02">Recent Advances in Multiword Units in Machine Translation and Translation Technology</TitleText> 01 cilt.366 01 https://benjamins.com 02 https://benjamins.com/catalog/cilt.366 1 B01 Johanna Monti Monti, Johanna Johanna Monti University of Naples “L’Orientale” 2 B01 Gloria Corpas Pastor Corpas Pastor, Gloria Gloria Corpas Pastor University of Malaga 3 B01 Ruslan Mitkov Mitkov, Ruslan Ruslan Mitkov Lancaster University 4 B01 Carlos Manuel Hidalgo-Ternero Hidalgo-Ternero, Carlos Manuel Carlos Manuel Hidalgo-Ternero University of Malaga 01 eng 278 ix 262 + index LAN009060 v.2006 CFK 2 24 JB Subject Scheme LIN.COMPUT Computational & corpus linguistics 24 JB Subject Scheme LIN.SYNTAX Syntax 24 JB Subject Scheme LIN.THEOR Theoretical linguistics 24 JB Subject Scheme TRAN.TRANSL Translation Studies 06 01 The investigation of phraseology through corpus-based and computational approaches holds significant relevance for various professionals, including translators, interpreters, terminologists, lexicographers, language instructors, and learners. Computational Phraseology, and in particular the computational analysis of multiword expressions (also known as multiword units), has gained prominence in recent years and is essential for a number of Natural Language Processing and Translation Technology applications. The failure to detect these units automatically could result in incorrect and problematic automatic translations and could hinder the performance of applications such as text summarisation and web search. 
Against this background, the volume offers 13 articles carefully selected and organised into two parts: ‘Computational treatment of multiword units’ and ‘Corpus-based and linguistic studies in phraseology’. The contributions not only highlight the latest advancements in computational and corpus-based phraseology but also reiterate its vital role in all areas of language technologies, including basic and applied research. 04 09 01 https://benjamins.com/covers/475/cilt.366.png 04 03 01 https://benjamins.com/covers/475_jpg/9789027217905.jpg 04 03 01 https://benjamins.com/covers/475_tif/9789027217905.tif 06 09 01 https://benjamins.com/covers/1200_front/cilt.366.hb.png 07 09 01 https://benjamins.com/covers/125/cilt.366.png 25 09 01 https://benjamins.com/covers/1200_back/cilt.366.hb.png 27 09 01 https://benjamins.com/covers/3d_web/cilt.366.hb.png 10 01 JB code cilt.366.toc v vi 2 Table of contents 1 <TitleType>01</TitleType> <TitleText textformat="02">Table of contents</TitleText> 10 01 JB code cilt.366.preface vii x 4 Preface 2 <TitleType>01</TitleType> <TitleText textformat="02">Preface</TitleText> 1 A01 Johanna Monti Monti, Johanna Johanna Monti 2 A01 Gloria Corpas Pastor Corpas Pastor, Gloria Gloria Corpas Pastor 3 A01 Ruslan Mitkov Mitkov, Ruslan Ruslan Mitkov 4 A01 Carlos Manuel Hidalgo-Ternero Hidalgo-Ternero, Carlos Manuel Carlos Manuel Hidalgo-Ternero 10 01 JB code cilt.366.s1 11 1 Section header 3 <TitleType>01</TitleType> <TitleText textformat="02">Section 1. Computational treatment of multiword units</TitleText> 10 01 JB code cilt.366.01col 2 17 16 Chapter 4 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 1. 
Multi-word units in neural machine translation</TitleText> <Subtitle textformat="02">Why the tip of the iceberg remains problematic</Subtitle> 1 A01 Jean-Pierre Colson Colson, Jean-Pierre Jean-Pierre Colson University of Louvain 20 deep learning 20 idioms 20 neural machine translation 20 phraseology 20 transformer architecture 01 Neural machine translation (NMT) has recently made significant progress in improving the quality of the texts it produces. New features of NMT include the fluidity of translations and the successful handling of multi-word units. In this paper we first report the results of an automated evaluation of the percentage of phraseology in the translations produced by Google Translate and DeepL. A corpus-based approach makes it possible to estimate that both NMT systems succeed in producing an average percentage of phraseology that is quite reasonable and sometimes even higher than in natural language production by native speakers. However, a closer look at some problematic cases shows that the ability of NMT systems to treat phraseological units can be deceptive, as they are often unable to cope with contextual complexity and low-frequency idioms. 10 01 JB code cilt.366.02hid 18 39 22 Chapter 5 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 2. 
ReGap</TitleText> <Subtitle textformat="02">A text-preprocessing algorithm to enhance MWE-aware neural machine translation systems</Subtitle> 1 A01 Carlos Manuel Hidalgo-Ternero Hidalgo-Ternero, Carlos Manuel Carlos Manuel Hidalgo-Ternero Universidad de Málaga 2 A01 Gloria Corpas Pastor Corpas Pastor, Gloria Gloria Corpas Pastor Universidad de Málaga 20 DeepL 20 discontinuity 20 Neural Machine Translation (NMT) 20 Text-preprocessing algorithm 20 Token-based MWE identification 20 Verb-noun idiomatic constructions (VNICs) 01 This research presents ReGap, a text-preprocessing algorithm for the automatic token-based identification and conversion of discontinuous multiword expressions (MWEs) into their canonical state, i.e., their continuous form, as a means to optimise neural machine translation (NMT) systems. To this end, an experiment with flexible verb-noun idiomatic constructions (VNICs) is conducted in order to assess to what extent ReGap can enhance the performance of the most robust NMT system to date, DeepL, under the challenge of MWE discontinuity in the Spanish-into-English and the Spanish-into-German directionalities. In this regard, the promising results yielded for VNICs will shed some light on new avenues for enhancing MWE-aware NMT systems. 10 01 JB code cilt.366.03spe 40 56 17 Chapter 6 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 3. Evaluating the Italian-English machine translation quality of MWUs in the domain of archaeology</TitleText> 1 A01 Giulia Speranza Speranza, Giulia Giulia Speranza University of Naples “L’Orientale”, UNIOR NLP Research Group 2 A01 Johanna Monti Monti, Johanna Johanna Monti University of Naples “L’Orientale”, UNIOR NLP Research Group 20 archaeology 20 error analysis 20 evaluation 20 machine translation 20 multiword units 20 terminology 01 Multiword units (MWUs) represent a challenging and problematic linguistic issue in the field of Natural Language Processing (NLP) due to their idiosyncratic nature. 
This paper investigates the quality of Neural Machine Translation (NMT) outputs when dealing with MWUs in the domain of archaeology. As a case study, a dataset of 100 MWUs is used as a Gold Standard to evaluate out-of-context and in-context translation outputs from three state-of-the-art NMT systems for the Italian-English language pair: Google Translate, DeepL, and Microsoft Bing Translator. MT outputs are manually evaluated with reference to the Gold Standard, namely out-of-context and in-context human English translations of the selected 100 MWUs. Results show that terminology is still a problematic category for MT quality and that MWU translation may vary, and sometimes even improve, when further context is provided. 10 01 JB code cilt.366.04kub 57 78 22 Chapter 7 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 4. Post-editing neural machine translation in specialised languages</TitleText> <Subtitle textformat="02">The role of corpora in the translation of phraseological structures</Subtitle> 1 A01 Natalie Kübler Kübler, Natalie Natalie Kübler Université Paris Cité, CLILLAC-ARP 2 A01 Hanna Martikainen Martikainen, Hanna Hanna Martikainen ESIT / Université Sorbonne Nouvelle, CLESTHIA 3 A01 Alexandra Mestivier Mestivier, Alexandra Alexandra Mestivier Université Paris Cité, CLILLAC-ARP 4 A01 Mojca Pecman Pecman, Mojca Mojca Pecman Université Paris Cité, CLILLAC-ARP 20 corpus-based methodology 20 errors 20 neural machine translation 20 phraseology 20 post-editing 20 specialized texts 01 This study focuses on phraseology in specialised texts and on students’ difficulties pertaining to phraseology in post-editing neural machine translation output. It is undertaken within the corpus-based methodological framework that we have developed for several purposes, one of which is to assess the impact of corpus use on translation and post-editing. 
The objective of the study is to propose a descriptive analysis of typical student errors related to phraseology in order to design tailored pedagogical materials. We aim to show that, with consistent training in querying corpora and in interpreting results in an appropriate manner, students can manage to improve their productions when translating specialised texts or when post-editing machine translation output. 10 01 JB code cilt.366.05leo 79 102 24 Chapter 8 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 5. Evaluating a bracketing protocol for multiword terms</TitleText> 1 A01 Pilar León-Araúz León-Araúz, Pilar Pilar León-Araúz University of Granada 2 A01 Melania Cabezas-García Cabezas-García, Melania Melania Cabezas-García University of Granada 20 bracketing 20 corpus 20 multiword term 20 structural disambiguation 20 terminology 01 Multiword terms (MWTs) are frequently used to encapsulate and convey meaning in scientific and technical texts. However, they can also make these texts difficult to understand because the relations between constituents are not transparent. When MWTs have more than two constituents, a dependency analysis (bracketing) is often necessary to facilitate their interpretation. NLP has proposed various models to automatize bracketing operations, but none has been entirely satisfactory. This paper presents a protocol that combines various models and applies it to a set of three-constituent MWTs in order to: (i) sort rules by their disambiguation potential, based on their likelihood of retrieving results from any corpus and their ability to solve bracketing; and (ii) ascertain the influence of corpus size and type in the results obtained. 10 01 JB code cilt.366.s2 103 1 Section header 9 <TitleType>01</TitleType> <TitleText textformat="02">Section 2. Corpus-based and linguistic studies in phraseology</TitleText> 10 01 JB code cilt.366.06fan 104 123 20 Chapter 10 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 6. 
Suggestions for a new model of functional phraseme categorization for applied purposes</TitleText> 1 A01 Anna Fankhauser Fankhauser, Anna Anna Fankhauser Universität Osnabrück 20 (bilingual) learner lexicography 20 corpus-based phraseology 20 corpus-derived phraseme list 20 foreign language teaching 20 phraseme categorization 20 phraseological core 20 translation 01 Although the significance of phraseology in various fields of applied linguistics such as translation, language teaching, and (bilingual) learner lexicography is generally agreed upon, existing models of phraseme categorization largely fail to account for the needs of language practitioners and learners. Yet, a classification model for applied purposes is required, for example, to provide language practitioners and learners with a systematic list of useful phraseological items that can be applied to individual situations of language production and reception. The model suggested in the present paper consistently applies functional classification criteria and is derived from an extensive corpus study of spoken British and American English. It is hoped that the empirical approach and the focus on functional properties of phrasemes will ensure that the model is of maximum relevance for applied purposes. 10 01 JB code cilt.366.07jim 124 141 18 Chapter 11 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 7. Verb collocations and their semantics in the specialized language of science</TitleText> 1 A01 Eva Lucía Jiménez-Navarro Jiménez-Navarro, Eva Lucía Eva Lucía Jiménez-Navarro Universidad de Córdoba 20 method 20 noun collocation 20 research article 20 semantic frame 20 specialized corpus 20 specialized language of science 20 verb collocation 01 This chapter is concerned with verb collocations in the specialized language of science. I pay attention to their semanticity, since my main objective is to discover the topics evoked in terms of their integrating elements. 
The methodology applied is as follows: first, I compile a specialized corpus of research articles; second, I automatically extract a list of collocation candidates, which is manually revised; third, the selected collocations are semantically classified. As this work is motivated by a previous study on noun collocations, I perform a comparative analysis. The findings are that noun and verb collocations share similar semantic frames, although the method employed in each study yields different, but complementary, results. 10 01 JB code cilt.366.08bre 142 156 15 Chapter 12 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 8. Negative–positive adjective pairing in travel journalism in English, Italian, and Polish</TitleText> 1 A01 David Brett Brett, David David Brett Università degli Studi di Sassari 2 A01 Antonio Pinna Pinna, Antonio Antonio Pinna Università degli Studi di Sassari 3 A01 Barbara Loranc Loranc, Barbara Barbara Loranc University of Bielsko-Biala 20 ADJ+but+ADJ pattern 20 adjectives 20 English 20 Italian 20 negative – positive adjective pairing 20 Polish 20 travel journalism 01 Adjectives play a particular role in the language of tourism and often contribute to the formation of recurrent phraseologies (Manca, 2008). The combination of adjectives bearing negative and positive connotation is widely reported in the literature (Dann, 1996; Edo Marzá, 2011, 2012). Durán-Muñoz (2019) focuses on the ADJ+but+ADJ pattern in an English language corpus of Adventure Tourism texts. This contribution examines the same pattern in 1M word corpora of Travel journalism, examining examples in English, but also extending the analysis to Italian and Polish, to determine whether the pattern is limited to one language, or whether it is widely used as a discourse strategy within the same register, regardless of the code adopted. 10 01 JB code cilt.366.09gut 157 173 17 Chapter 13 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 9. 
The middle construction and some machine translation issues</TitleText> <Subtitle textformat="02">Exploring the process of compositional cospecification in quality-oriented middles</Subtitle> 1 A01 Macarena Palma Gutiérrez Palma Gutiérrez, Macarena Macarena Palma Gutiérrez Universidad de Córdoba 20 Adverb + Verb collocation 20 colloconstructional analysis of MWU 20 compositional cospecification 20 inanimate entities 20 machine translation 20 middle construction 20 prototype effects 20 quality-oriented middles 01 This paper aims to explore the colloconstructional analysis of multiword units (MWU) in machine translation regarding the middle construction in terms of compositional cospecification (Yoshimura, 1998; Yoshimura & Taylor, 2004). For that purpose, I examine the Adverb + Verb collocation focusing on the predicates cut and drive, collocated with quality-oriented adjuncts (Heyvaert, 2003) and incorporating Inanimate Subject entities (Patients, Enablers and Instruments). More than 500 instances are examined. The data analysed reveal that, apart from the generalised Qt-Qc pattern in shift of semantic importance, other patterns (namely, Qc-Qc) can be found by virtue of the prototype effects of the construction (cf. Taylor, 1995), thus providing a potential source of disambiguation in the computational treatment of MWU in machine translation. 10 01 JB code cilt.366.10roj 174 197 24 Chapter 14 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 10. 
Semantic annotation of named rivers and its application for the prediction of multiword-term bracketing</TitleText> 1 A01 Juan Rojas Garcia Rojas Garcia, Juan Juan Rojas Garcia University of Granada 20 bracketing prediction 20 frame-based terminology 20 named river 20 predicate-argument structure analysis 20 semantic annotation 20 terminological knowledge base 20 three-component multi-word term 01 The acquisition of knowledge is essential for specialized translation, hence the representation of specialized phraseology in terminological knowledge bases is part of this process. The aim of this study was thus two-fold. Firstly, it describes how the semantic annotation of predicate-argument structure of sentences mentioning named rivers can be addressed from the perspective of Frame-based Terminology. The results showed that this approach provides valuable insights into the knowledge structures underlying the usage of named rivers in specialized texts. Secondly, this study explores whether the bracketing of a three-component multi-word term can be predicted from the semantic information encoded in the sentence where the ternary compound and a named river are used as arguments. The semantic annotations permitted construction of two machine-learning models capable of accurately predicting ternary-compound bracketing. 10 01 JB code cilt.366.11mar 198 218 21 Chapter 15 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 11. 
Irony in American-English tweets</TitleText> <Subtitle textformat="02">A cognitive and phraseological analysis</Subtitle> 1 A01 Beatriz Martín Gascón Martín Gascón, Beatriz Beatriz Martín Gascón Universidad Complutense de Madrid 20 American English 20 big data 20 cognitive linguistics 20 contextual ironic markers 20 echoic account 20 intercultural awareness 20 ironic phraseology 20 Spanish 20 twitter 20 verbal irony 01 The present study examines verbal irony from a cognitive linguistics perspective, based on Ruiz de Mendoza’s (2017) development of the echoic account and on big data. Building on previous research on the detection of Spanish ironic utterances on Twitter (Martín-Gascón, 2019), this investigation aims to analyze how American-English speakers conceptualize and express irony and to compare the findings with those for Spanish. The dataset, initially consisting of 1,157,773,379 tweets from 248 countries and 66 languages, was first reduced to 27,517 tweets from English-speaking users in the United States using the words “irony”, “ironies”, and “ironic”, then to 605 containing the words as hashtags, and finally to 495 tweets evincing implicit and explicit-echoic irony. An in-depth cognitive and qualitative analysis of the sample revealed the complexities of perceiving irony in written discourse and, therefore, the relevance of adding contextual ironic markers, such as hashtags, emojis, interjections, laughter typing and ironic phraseology, among others. In line with Martín-Gascón’s (2019) study, findings showed a higher use of positive and explicit-echoic irony to the detriment of implicit and negative irony. By drawing attention to the similarities and differences in the expression of irony, we expect to offer preliminary informed options for the design of pedagogical proposals that enhance not only the learners’ linguistic and ironic competencies, but also their intercultural awareness. 
10 01 JB code cilt.366.12tak 219 243 25 Chapter 16 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 12. A comprehensive Japanese MWE lexicon</TitleText> <Subtitle textformat="02">JMWEL</Subtitle> 1 A01 Masahito Takahashi Takahashi, Masahito Masahito Takahashi Kurume Institute of Technology, emeritus, Japan 2 A01 Toshifumi Tanabe Tanabe, Toshifumi Toshifumi Tanabe Fukuoka University, Japan 3 A01 Jack Halpern Halpern, Jack Jack Halpern CJKI Co., Japan 4 A01 Kosho Shudo Shudo, Kosho Kosho Shudo Fukuoka University, emeritus, Japan 20 construction grammar 20 formulaic language 20 lexical bundles 20 multiword expression (MWE) 20 neural machine translation (NMT) 20 phrase-based machine translation (PBMT) 20 phrase-based NLP 20 phrase-based statistical machine translation (PBSMT) 20 phraseology 01 JMWEL (Japanese MWE Lexicon) is a comprehensive lexicon of Japanese Multiword Expressions (MWEs) with a rich set of grammatical attributes fine-tuned for phrase-based processing of a wide range of Japanese documents. It has about 160,000 MWE lemmas covering almost every kind of linguistically idiosyncratic but commonly used Japanese phrases, e.g., idioms, quasi-idioms, collocations, quasi-collocations, clichés, quasi-clichés, institutionalized phrases, proverbs, and old sayings, excepting technical terms in specialized fields or named entities. JMWEL consists of sixteen sub-lexicons reflecting their distinctive features. The comprehensiveness of the collected MWEs and the detailed morpho-syntactic information given to each MWE, which may include internal modifiers, are notable features of JMWEL. In this paper, we introduce the newest version of JMWEL. 10 01 JB code cilt.366.13dib 244 262 19 Chapter 17 <TitleType>01</TitleType> <TitleText textformat="02">Chapter 13. 
Ontology-based formalisation of Italian clitic verbal MWEs</TitleText> <Subtitle textformat="02">An approach for supporting machine translation</Subtitle> 1 A01 Maria Pia di Buono di Buono, Maria Pia Maria Pia di Buono University of Naples “L’Orientale” 2 A01 Johanna Monti Monti, Johanna Johanna Monti University of Naples “L’Orientale” 3 A01 Valeria Caruso Caruso, Valeria Valeria Caruso University of Naples “L’Orientale” 20 Italian clitic verbs 20 lexicographic resources 20 linguistic linked open data 20 Neural Machine Translation 20 OntoLex-Lemon 20 verbal multiword expressions 01 In this paper we present the development of an ontology-based bilingual (IT-EN) lexicographic resource of Italian clitic Verbal MultiWord Expressions (VMWEs) to support machine translation. Starting from an analysis of these units and their linguistic features, we examine how Neural Machine Translation (NMT) handles complex VMWEs and the related translation issues. Finally, we propose a bilingual resource, formalised by means of the OntoLex-Lemon model, which accounts for morphological, syntactic, and semantic features of Italian clitic verbs, in order to enhance automatic translation of VMWEs. 02 JBENJAMINS John Benjamins Publishing Company 01 John Benjamins Publishing Company Amsterdam/Philadelphia NL 02 December 2024 20241215 2024 John Benjamins B.V. 02 WORLD 01 JB 1 John Benjamins Publishing Company +31 20 6304747 +31 20 6739773 bookorder@benjamins.nl 01 https://benjamins.com 01 WORLD US CA MX 10 20241215 01 02 JB 1 00 125.00 EUR R 02 02 JB 1 00 132.50 EUR R 01 JB 10 bebc +44 1202 712 934 +44 1202 712 913 sales@bebc.co.uk 03 GB 10 20241215 02 02 JB 1 00 105.00 GBP Z 01 JB 2 John Benjamins North America +1 800 562-5666 +1 703 661-1501 benjamins@presswarehouse.com 01 https://benjamins.com 01 US CA MX 10 20241215 01 gen 02 JB 1 00 163.00 USD