TY - GEN
T1 - Rosetta Balcanica
T2 - 7th Workshop on Technologies for Machine Translation of Low-Resource Languages, LoResMT 2024 at ACL 2024
AU - Begoli, Edmon
AU - Mahbub, Maria
AU - Srinivasan, Sudarshan
N1 - Publisher Copyright: © 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - The Rosetta Balcanica is an ongoing effort to expand the resources for low-resource western Balkan languages. This effort focuses on discovering and using accurately translated, officially mapped, and curated parallel language resources and their preparation and use as neural machine translation (NMT) datasets. Some of the guiding principles, practices, and methods employed by Rosetta Balcanica are generalizable and could apply to other low-resource language resource expansion efforts. With this goal in mind, we first present our rationale and approach to discovering and using meticulously translated and officially curated low-resource language resources and our use of these resources to develop a parallel “gold standard” translation training resource. Second, we describe our specific methodology for NMT dataset development from these resources and its publication to a widely used and accessible repository for natural language processing (Hugging Face Hub). Finally, we discuss the trade-offs and limitations of our current approach and the roadmap for future development and expansion of the current Rosetta Balcanica language resource.
UR - http://www.scopus.com/inward/record.url?scp=85204880654&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85204880654
T3 - LoResMT 2024 - 7th Workshop on Technologies for Machine Translation of Low-Resource Languages, Proceedings of the Workshop
SP - 74
EP - 84
BT - LoResMT 2024 - 7th Workshop on Technologies for Machine Translation of Low-Resource Languages, Proceedings of the Workshop
A2 - Ojha, Atul Kr.
A2 - Liu, Chao-hong
A2 - Vylomova, Ekaterina
A2 - Pirinen, Flammie
A2 - Abbott, Jade
A2 - Washington, Jonathan
A2 - Oco, Nathaniel
A2 - Malykh, Valentin
A2 - Logacheva, Varvara
A2 - Zhao, Xiaobing
PB - Association for Computational Linguistics (ACL)
Y2 - 15 August 2024
ER -