Impact & Opinions | Tionchar & Tuairimí

How Computers Can Future-Proof Minority Languages

Internationalisation

29 July 2021 | 11 min read

Dr. Theodorus Fransen & Dr. John McCrae explore how digital language tools can potentially resolve the underrepresentation of minority languages in terms of digital technology and the Web.

There are estimated to be about 7,000 languages spoken in the world. Globally, 26 languages are dying every year and with them the knowledge and stories of their cultures.

Digital language tools currently only support a small fraction of these languages. While languages in Africa, the Americas or the Pacific may spring to mind first, we need not look further than Europe: a recent report on European languages classified all but 2 EU languages as ‘severely’ under-resourced [1].

Irish is one of those languages. Like any language, it is a vital part of its country’s heritage and culture, and building digital resources for it helps ensure its survival in the 21st century.

We still have a long way to go to resolve the underrepresentation of most languages in terms of digital technology and the Web. Even huge companies, such as Google, don’t have a clear plan of how to scale beyond 100 languages, as most language technologies, such as automatic translation, can only be developed with experts who speak the language.

[Image] Prof Ciarán Ó hÓgartaigh, NUI Galway President & Dr John McCrae, Data Science Institute, NUI Galway

The good news is that the scientific community is well-aware of the resource gap. The last decade or so has seen a global surge in interest in research dedicated to language technology dealing with low-resource languages. For example, within the framework of the 2019 International Year of Indigenous Languages, UNESCO organised an international conference that was envisaged to “contribute towards the promotion of human rights and fundamental freedoms of all language users to access information and knowledge in languages that are best understood” [2].

A crucial development in the language technology sector is the rise and advance of deep learning, a subfield of artificial intelligence and machine learning that tries to emulate how the brain works and how humans learn. Machine-learning techniques are becoming ubiquitous in our increasingly data-driven society.

In the field of Natural Language Processing—a subfield of computer science that aims to make computers understand human language—deep learning models have shown that we can extract linguistic information from texts without having to teach the model anything about the language or languages in question. This has led to impressive results, with machine translation improving in quality year on year and even becoming able to translate between pairs of languages it has never seen before.

There is one caveat, however; since deep learning models basically learn by example, they need a lot of data. This means that such models are unsuitable for under-resourced languages for which much less linguistic data is available—at least when simply applying machine-learning techniques in the same way as we would do for better-resourced languages.

Here is where the project Comparative deep models for minority and historical languages, Cardamom for short, comes in [3]. The aim of this Irish Research Council-funded project, hosted at the Data Science Institute, NUI Galway, is to use insights from linguistics and data gathered from the Web to bolster natural language processing techniques and applications that benefit low-resource languages, which have been largely ignored by current approaches.

The project’s methodology involves two major parallel but complementary strategies. Firstly, we aim to significantly enlarge datasets for minority languages, focusing on European and Indian languages, by gathering as much text from the Web as possible. Secondly, we will develop models of language, based on deep learning, that learn features of low-resource languages from closely-related, better-resourced languages, thus reducing the need for large datasets in minority-language and other low-resource scenarios.
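The first strategy, gathering text from the Web, raises an immediate practical question: which language is each crawled snippet actually written in? One classic, lightweight answer is to compare character trigram profiles (in the style of Cavnar and Trenkle). The sketch below is purely illustrative, using toy reference texts as stand-ins for real per-language corpora; the Cardamom project's own pipeline is not described in this article and may work quite differently.

```python
from collections import Counter

def trigram_profile(text, top_n=300):
    """Build a frequency-ranked character trigram profile for a text."""
    text = " ".join(text.lower().split())
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [t for t, _ in trigrams.most_common(top_n)]

def out_of_place_distance(profile_a, profile_b):
    """Sum of rank differences between two profiles; unseen trigrams
    get the maximum penalty."""
    positions_b = {t: i for i, t in enumerate(profile_b)}
    max_penalty = len(profile_b)
    return sum(abs(i - positions_b.get(t, max_penalty))
               for i, t in enumerate(profile_a))

def identify_language(snippet, reference_profiles):
    """Assign a crawled snippet to the closest reference language."""
    snippet_profile = trigram_profile(snippet)
    return min(reference_profiles,
               key=lambda lang: out_of_place_distance(
                   snippet_profile, reference_profiles[lang]))

# Toy reference texts standing in for real per-language corpora.
references = {
    "ga": trigram_profile("Tá an teanga seo á labhairt in Éirinn le fada an lá."),
    "en": trigram_profile("This language has been spoken in Ireland for a long time."),
}
print(identify_language("Tá Gaeilge á labhairt sa cheantar seo", references))  # → ga
```

In a real crawl the reference profiles would be built from substantial corpora, and a distance threshold would be used to discard snippets that match no known language well.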

[Image] Dr. Theodorus Fransen

Speakers of minority languages are among the fastest growing communities on the Web and meeting their needs is of major societal and commercial importance. An example of this is a recent collaboration between NUI Galway and Translators without Borders to develop resources for aid workers to use with refugees forced out of Burma by the ongoing genocide.

The project involved creating translations not only in Bengali, the sixth most-spoken language in the world (yet one with less text available on the Web than even Irish), but also in languages like Chittagonian and Rohingya, which had next to no resources. For those languages, we developed one of the first digital corpora, providing an important step towards the development of much-needed language technology.
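Assembling a first digital corpus from raw crawled pages typically begins with at least sentence segmentation and deduplication. The sketch below is a deliberately minimal illustration in plain Python; the splitting rule is naive, and a real pipeline for a language like Rohingya would need language-aware tokenisation and cleaning.

```python
import re

def build_corpus(raw_pages):
    """Split crawled pages into sentences and deduplicate them:
    a first step when assembling a corpus for an unresourced language."""
    seen, corpus = set(), []
    for page in raw_pages:
        # Naive split on ., ! and ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", page.strip()):
            normalised = " ".join(sentence.split())
            if normalised and normalised not in seen:
                seen.add(normalised)
                corpus.append(normalised)
    return corpus

pages = [
    "First sentence. Second sentence!",
    "Second sentence! Third sentence?",  # overlaps with the first page
]
print(build_corpus(pages))  # → ['First sentence.', 'Second sentence!', 'Third sentence?']
```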

Our focus in the Cardamom project is not only on contemporary minority languages, but also on historical varieties. From a methodological and linguistic data point of view, this makes sense: languages like Old English or Old Irish are characterised by scanty textual evidence, a scenario not unlike the Indian subcontinent, where many languages have little presence on the Web (although many are becoming increasingly important in the rapidly developing and globalising world).

Just as we can use linguistic information from a well-resourced language like Hindi to understand any of the hundreds of closely-related Indo-European languages spoken in India, we can use features from, say, Modern Irish, to gain more insight into the historical stages of the Irish language.
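One simple way to see how a modern language can shed light on its historical stages is string similarity: Old Irish "fer", for instance, is the ancestor of Modern Irish "fear" ("man"). The sketch below links a historical form to its closest modern counterpart using plain Levenshtein distance; the project's deep learning models are far more sophisticated, so this is only an illustration of the underlying idea.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def closest_modern_form(historical_word, modern_lexicon):
    """Link a historical word to its most similar modern counterpart."""
    return min(modern_lexicon, key=lambda w: edit_distance(historical_word, w))

# Old Irish 'fer' ("man") should pair with Modern Irish 'fear'.
modern_lexicon = ["fear", "bean", "teach", "capall"]
print(closest_modern_form("fer", modern_lexicon))  # → fear
```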

By incorporating historical language data in our models, we aspire to facilitate the growing demand for text analysis in digital humanities, whereby access to large corpora in languages such as Sanskrit or Old Irish can enable new insights in the study of history and literature. Our research will therefore contribute to digitally safeguarding and future-proofing linguistic heritage, not only in a contemporary, but also in a historical sense.
