Initial Parallel Corpus Creation and Statistical Machine Translation Experiments for Spanish Guarani pair of Languages
Parallel corpus, Bilingual corpus, Statistical machine translation, Guaraní
Abstract
This paper presents work on collecting Spanish and Guaraní sentences to create a bilingual corpus. The corpus may serve as a baseline for developing language technology for this language pair. The focus of this article is machine translation from Spanish to Guaraní. Guaraní is an under-resourced language, and the scarcity of digital resources hinders technology development for it. To build the bilingual corpus, digital resources available online were used. In addition, a web platform called Guampa was employed to generate new phrases collaboratively. Corpus statistics are presented along with initial Statistical Machine Translation (SMT) experiments using the Moses toolkit. The results serve as a starting point for future research in the area.
References
Apertium/apertium-grn. (2020). [Python]. Apertium. (Original work published 2018)
Gasser, M. (2006). Machine translation and the future of indigenous languages. I Congreso Internacional de Lenguas y Literaturas Indoamericanas.
Gasser, M. (2018). Mainumby: Un Ayudante para la Traducción Castellano-Guaraní. CoRR, abs/1810.08603.
Guarani Language and the Guarani Indian Tribe (Avañe’e, Jopará, Chiriguano, Mbyá). (n.d.). Retrieved March 3, 2020, from
Hltdi/Bitext. (n.d.). GitHub. Retrieved December 1, 2020, from
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, 177–180.
Maldonado, D. M., Villalba Barrientos, R., & Pinto-Roa, D. P. (2016, November 22). Eñe’ẽ: Sistema de reconocimiento automático del habla en Guaraní. Simposio Argentino de Inteligencia Artificial (ASAI 2016) - JAIIO 45 (Tres de Febrero, 2016).
Milagros, M. P., Abdelali, A., Cowie, J., Helmreich, S., Jin, W., Ogden, B., Rad, H., & Zacharski, R. (2006). Guarani: A Case Study in Resource Development for Quick Ramp-Up MT.
morfo: Análisis y generación morfológica. (n.d.). Retrieved February 10, 2021, from
Moses—Main/HomePage. (n.d.). Retrieved May 12, 2020, from
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.
Rudnick, A., Skidmore, T., Samaniego, A., & Gasser, M. (2014). Guampa: A Toolkit for Collaborative Translation. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 1659–1663.

Copyright (c) 2023 Aldo Andrés Álvarez López

This work is licensed under a Creative Commons Attribution 4.0 International License.