Chapter 1
Multi-word units in neural machine translation
Why the tip of the iceberg remains problematic
Neural machine translation (NMT) has recently
made significant progress in improving the quality of the texts it
produces. Notable features of NMT output include fluent translations
and the successful handling of multi-word units. In this paper we
first report the results of an automated evaluation of the
percentage of phraseology in the translations produced by Google
Translate and DeepL. A corpus-based estimate indicates that both
NMT systems produce an average percentage of phraseology that is
reasonable, and sometimes even higher than that found in natural
language production by native speakers.
However, a closer look at some problematic cases shows that the
ability of NMT systems to handle phraseological units can be
deceptive, as they are often unable to cope with contextual
complexity and low-frequency idioms.
Article outline
- 1. Introduction: Lingering doubts about neural machine translation
- 2. Are texts produced by NMT rich in phraseology? An experiment
- 3. Looking closer at problematic examples for NMT
- 4. Fine-tuning NMT for phraseology: An experiment
- 5. Conclusion
- Notes
- References
- Appendix
This content is being prepared for publication; it may be subject to changes.