Initial Parallel Corpus Creation and Statistical Machine Translation Experiments for Spanish Guarani pair of Languages

DOI:

https://doi.org/10.70833/rseisa17item342

Keywords:

Parallel corpus, Bilingual corpus, Statistical machine translation, Guaraní

Abstract

This paper describes the collection of Spanish and Guaraní sentences to create a bilingual corpus. The corpus is intended as a baseline for developing language technology for this language pair; the focus here is on machine translation from Spanish to Guaraní. Guaraní is an under-resourced language, and the scarcity of digital resources hinders the development of technology for it. The corpus was built from digital resources available online, and a web platform called Guampa was used to produce new phrases collaboratively. Corpus statistics are presented along with initial Statistical Machine Translation (SMT) experiments using the Moses toolkit. The results serve as a starting point for future research in the area.
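As a rough illustration of the kind of corpus statistics mentioned in the abstract, the following minimal sketch counts sentence pairs, tokens, and vocabulary size for each side of a parallel corpus. It is not the procedure used in the paper: the function name `corpus_stats`, the whitespace tokenization, and the lowercasing are all assumptions made for this example, and the sample pairs are placeholders rather than real Spanish or Guaraní data.

```python
from collections import Counter


def corpus_stats(pairs):
    """Return basic statistics for a list of (spanish, guarani) sentence pairs.

    Hypothetical helper: tokens are obtained by lowercasing and splitting on
    whitespace, which is an assumption and not the paper's tokenization.
    """
    es_tokens = Counter()
    gn_tokens = Counter()
    for es, gn in pairs:
        es_tokens.update(es.lower().split())
        gn_tokens.update(gn.lower().split())
    return {
        "sentence_pairs": len(pairs),
        "es_tokens": sum(es_tokens.values()),   # running tokens, Spanish side
        "gn_tokens": sum(gn_tokens.values()),   # running tokens, Guaraní side
        "es_vocabulary": len(es_tokens),        # distinct token types
        "gn_vocabulary": len(gn_tokens),
    }


# Placeholder pairs standing in for aligned sentences (not real language data).
pairs = [("a b c", "x y"), ("a b", "x z w")]
stats = corpus_stats(pairs)
for name, value in stats.items():
    print(f"{name}: {value}")
```

In practice the pairs would be read from two line-aligned text files, one sentence per line, which is the plain-text layout that SMT toolkits such as Moses typically take as training input.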

References

Apertium/apertium-grn. (2020). [Python]. Apertium. https://github.com/apertium/apertium-grn (Original work published 2018)

Gasser, M. (2006). Machine translation and the future of indigenous languages. I Congreso Internacional de Lenguas y Literaturas Indoamericanas.

Gasser, M. (2018). Mainumby: Un Ayudante para la Traducción Castellano-Guaraní. CoRR, abs/1810.08603. http://arxiv.org/abs/1810.08603

Guarani Language and the Guarani Indian Tribe (Avañe’e, Jopará, Chiriguano, Mbyá). (n.d.). Retrieved March 3, 2020, from http://www.native-languages.org/guarani.htm

Hltdi/Bitext. (n.d.). GitHub. Retrieved December 1, 2020, from https://github.com/hltdi/Bitext

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, 177–180. https://www.aclweb.org/anthology/P07-2045

Maldonado, D. M., Villalba Barrientos, R., & Pinto-Roa, D. P. (2016, November 22). Eñe’ẽ: Sistema de reconocimiento automático del habla en Guaraní. Simposio Argentino de Inteligencia Artificial (ASAI 2016) - JAIIO 45 (Tres de Febrero, 2016). http://sedici.unlp.edu.ar/handle/10915/56979

Milagros, M. P., Abdelali, A., Cowie, J., Helmreich, S., Jin, W., Ogden, B., Rad, H., & Zacharski, R. (2006). Guarani: A Case Study in Resource Development for Quick Ramp-Up MT.

morfo: Análisis y generación morfológica. (n.d.). Retrieved February 10, 2021, from http://plogs.soic.indiana.edu/morfo/

Moses—Main/HomePage. (n.d.). Retrieved May 12, 2020, from http://www.statmt.org/moses/

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. https://doi.org/10.3115/1073083.1073135

Rudnick, A., Skidmore, T., Samaniego, A., & Gasser, M. (2014). Guampa: A Toolkit for Collaborative Translation. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 1659–1663. http://www.lrec-conf.org/proceedings/lrec2014/pdf/151_Paper.pdf

Published

2022-12-27

How to Cite

Álvarez López, A. A. (2022). Initial Parallel Corpus Creation and Statistical Machine Translation Experiments for Spanish Guarani pair of Languages. Journal on Studies and Research of Academic Knowledge, (17), e2023003. https://doi.org/10.70833/rseisa17item342

Section

Research Articles
