A text-preprocessing algorithm to enhance MWE-aware neural machine translation systems: Chapter 2. ReGap

Hidalgo-Ternero, Carlos Manuel; Pastor, Gloria Corpas

Part of

Recent Advances in Multiword Units in Machine Translation and Translation Technology
Edited by Johanna Monti, Gloria Corpas Pastor, Ruslan Mitkov and Carlos Manuel Hidalgo-Ternero
[Current Issues in Linguistic Theory 366] 2024
pp. 18–39

Chapter 2
ReGap

A text-preprocessing algorithm to enhance MWE-aware neural machine translation systems

Carlos Manuel Hidalgo-Ternero | Universidad de Málaga

Gloria Corpas Pastor | Universidad de Málaga

This research presents ReGap, a text-preprocessing algorithm for the automatic token-based identification of discontinuous multiword expressions (MWEs) and their conversion into their canonical state, i.e., their continuous form, as a means to optimise neural machine translation (NMT) systems. To this end, an experiment with flexible verb-noun idiomatic constructions (VNICs) is conducted to assess to what extent ReGap can enhance the performance of the most robust NMT system to date, DeepL, under the challenge of MWE discontinuity in the Spanish-into-English and Spanish-into-German directions. The promising results obtained for VNICs shed light on new avenues for enhancing MWE-aware NMT systems.
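
To make the idea of converting a discontinuous MWE into its continuous form more concrete, the following is a minimal illustrative sketch in Python. It is not the published ReGap implementation: the two-entry lexicon `VNICS`, the `MAX_GAP` limit, the function name `regap`, and the decision to relocate the intervening tokens after the noun phrase are all assumptions made for this example, and simple regular expressions stand in for whatever linguistic analysis the actual algorithm performs.

```python
import re

# Hypothetical toy lexicon (an assumption of this sketch): each entry pairs a
# pattern for inflected verb forms with the fixed noun phrase of a Spanish VNIC.
VNICS = [
    (r"arrim\w+", "el hombro"),                # arrimar el hombro
    (r"(?:pon\w+|pus\w+)", "los cuernos"),     # poner los cuernos
]

MAX_GAP = 4  # assumed upper bound on the number of intervening tokens


def regap(sentence: str) -> str:
    """Rewrite 'verb + gap + noun phrase' as 'verb + noun phrase + gap'.

    Sentences in which the idiom is already continuous are left unchanged,
    because the gap group must contain at least one token.
    """
    for verb, noun in VNICS:
        pattern = re.compile(
            rf"\b(?P<verb>{verb})\s+"
            rf"(?P<gap>(?:\S+\s+){{1,{MAX_GAP}}}?)"
            rf"(?P<noun>{noun})\b",
            flags=re.IGNORECASE,
        )
        # Move the intervening tokens to the position right after the noun phrase.
        sentence = pattern.sub(
            lambda m: f"{m['verb']} {m['noun']} {m['gap'].strip()}",
            sentence,
        )
    return sentence


if __name__ == "__main__":
    print(regap("Todos arrimaron enseguida el hombro."))
    print(regap("Le puso varias veces los cuernos."))
```

In a pipeline of the kind the abstract describes, the output of such a preprocessing step would be sent to the NMT system in place of the original sentence. A purely surface-level pattern like this one cannot tell idiomatic from literal uses, which is one reason a token-based identification step is needed in practice.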

Keywords: text-preprocessing algorithm, neural machine translation (NMT), DeepL, token-based MWE identification, verb-noun idiomatic constructions (VNICs), discontinuity

Article outline

1. Introduction
2. The MWEs under study
3. Related work
4. Methodology
5. Results
6. Discussion
7. Conclusion
Notes
References

This content is being prepared for publication; it may be subject to changes.

