A crucial development in the language technology sector is the rise of deep learning, a subfield of artificial intelligence and machine learning loosely inspired by how the brain works and how humans learn. Machine-learning techniques are becoming ubiquitous in our increasingly data-driven society.
In the field of Natural Language Processing—a subfield of computer science that aims to make computers understand human language—deep learning models have shown that we can extract linguistic information from texts without teaching the model anything about the language or languages in question. This has led to impressive results: machine translation improves in quality year on year and can even translate between pairs of languages it has never seen before.
There is one caveat, however: since deep learning models essentially learn by example, they need a lot of data. This makes such models unsuitable for under-resourced languages, for which much less linguistic data is available—at least when machine-learning techniques are applied in the same way as for better-resourced languages.
This is where the project Comparative Deep Models for Minority and Historical Languages, Cardamom for short, comes in [3]. The aim of this Irish Research Council-funded project, hosted at the Data Science Institute, NUI Galway, is to use insights from linguistics, together with data gathered from the Web, to bolster natural language processing techniques and applications for low-resource languages, which current approaches have largely ignored.
The project’s methodology involves two major parallel but complementary strategies. First, we aim to significantly enlarge datasets for minority languages, focusing on European and Indian languages, by gathering as much text from the Web as possible. Second, we will develop deep learning models that learn features of low-resource languages from closely related, better-resourced languages, thus reducing the need for large datasets in minority-language and other low-resource scenarios.
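The second strategy, borrowing statistical knowledge from a related, better-resourced language, can be illustrated with a deliberately tiny sketch. The snippet below uses a simple character-bigram model rather than the project's actual deep models, and all corpora and names in it are synthetic examples invented for illustration. The point it demonstrates is the general one: adding related-language text to a tiny target-language sample lowers perplexity on held-out target text.

```python
import math
from collections import Counter

def train_bigrams(text, alpha=0.5):
    """Character-bigram model with add-alpha smoothing.
    Returns a function mapping a character pair to its probability."""
    vocab_size = len(set(text))
    pair_counts = Counter(zip(text, text[1:]))
    char_counts = Counter(text[:-1])
    def prob(a, b):
        return (pair_counts[(a, b)] + alpha) / (char_counts[a] + alpha * vocab_size)
    return prob

def perplexity(prob, text):
    """Per-bigram perplexity of `text` under the model (lower is better)."""
    log_p = sum(math.log(prob(a, b)) for a, b in zip(text, text[1:]))
    return math.exp(-log_p / (len(text) - 1))

# Synthetic toy corpora: a scarce target-language sample plus a larger
# corpus from a "related language" with similar character statistics.
related = "madra madra madra"   # plentiful related-language text
target_train = "madra"          # scarce target-language text
target_test = "madra madra"     # held-out target-language text

baseline = train_bigrams(target_train)                  # target data alone
transfer = train_bigrams(related + " " + target_train)  # related data added

# The transfer model assigns higher probability to the held-out text,
# i.e. achieves lower perplexity than the target-only baseline.
print(perplexity(transfer, target_test) < perplexity(baseline, target_test))  # True
```

Real systems transfer far richer representations (shared subword vocabularies, pretrained neural encoders), but the underlying idea is the same: a related language supplies evidence the small target corpus cannot.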