Rosetta Balcanica: Deriving a “Gold Standard” Neural Machine Translation (NMT) Parallel Dataset for Western Balkan Languages

Edmon Begoli, Maria Mahbub, Sudarshan Srinivasan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The Rosetta Balcanica is an ongoing effort to expand the resources for low-resource western Balkan languages. This effort focuses on discovering accurately translated, officially mapped, and curated parallel language resources and on preparing and using them as neural machine translation (NMT) datasets. Some of the guiding principles, practices, and methods employed by Rosetta Balcanica are generalizable and could apply to other low-resource language resource expansion efforts. With this goal in mind, we first present our rationale and approach to discovering and using meticulously translated and officially curated low-resource language resources and our use of these resources to develop a parallel “gold standard” translation training resource. Second, we describe our specific methodology for NMT dataset development from these resources and its publication to a widely used and accessible repository for natural language processing (Hugging Face Hub). Finally, we discuss the trade-offs and limitations of our current approach and the roadmap for future development and expansion of the current Rosetta Balcanica language resource.

Original language: English
Title of host publication: LoResMT 2024 - 7th Workshop on Technologies for Machine Translation of Low-Resource Languages, Proceedings of the Workshop
Editors: Atul Kr. Ojha, Chao-hong Liu, Ekaterina Vylomova, Flammie Pirinen, Jade Abbott, Jonathan Washington, Nathaniel Oco, Valentin Malykh, Varvara Skolkovo Logacheva, Xiaobing Zhao
Publisher: Association for Computational Linguistics (ACL)
Pages: 74-84
Number of pages: 11
ISBN (Electronic): 9798891761490
State: Published - 2024
Event: 7th Workshop on Technologies for Machine Translation of Low-Resource Languages, LoResMT 2024 at ACL 2024 - Bangkok, Thailand
Duration: Aug 15 2024 → …

Publication series

Name: LoResMT 2024 - 7th Workshop on Technologies for Machine Translation of Low-Resource Languages, Proceedings of the Workshop

Conference

Conference: 7th Workshop on Technologies for Machine Translation of Low-Resource Languages, LoResMT 2024 at ACL 2024
Country/Territory: Thailand
City: Bangkok
Period: 08/15/24 → …
