The ongoing pandemic has reinforced the importance of data in making informed decisions. Data, however, comes in many forms. Very broadly, we distinguish between structured and unstructured data. Structured data comprises numerical data, for instance operational figures in hospitals, weather-related measurements, financial reports, or any other setting where numbers play a central role. Unstructured data comprises data derived from text, speech, images, or any other form of human communication.
Increased interest in Artificial Intelligence (AI) derives largely from the success achieved over recent decades in automatically analysing such unstructured data with machine learning techniques. The field of Natural Language Processing (NLP) has developed increasingly powerful methods to automatically analyse large collections of unstructured textual data originating from a wide variety of sources, such as news reports, patient health records, scientific articles, enterprise documents, emails, and social media.
NLP is, in fact, one of the oldest branches of AI research; it started back in the 1950s, first in the USA, with a specific interest in automatic methods for translating Russian documents into English. The first international scientific meetings of NLP researchers were held in the early 1960s. Since then, NLP has grown into a broad field of research that has contributed to everyday household and business tools such as search engines, chatbots, machine translation services, and sentiment analysis and opinion mining, among many others.
At the Data Science Institute, we are developing innovative approaches to improve such applications even further, as well as developing completely new ones. For instance, in the context of the EU-funded Pandem-2 project, we are developing methods for the automatic extraction of suggestions from social media to improve two-way communication between government agencies and the public.
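One common way to frame suggestion extraction is as a sentence-level classification task: does a given post contain an actionable suggestion or not? The sketch below illustrates that framing with a simple, transparent baseline; the tiny dataset is invented for illustration, and the TF-IDF-plus-logistic-regression setup is an assumption for the example, not the project's actual pipeline.

```python
# A minimal sketch: suggestion mining framed as binary sentence
# classification. The toy posts below are invented for illustration;
# a real system would be trained on annotated social media data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "Please add more vaccination centres in the city centre.",  # suggestion
    "The helpline should stay open at weekends.",               # suggestion
    "I waited two hours at the test centre today.",             # not a suggestion
    "Case numbers are rising again this week.",                 # not a suggestion
]
labels = [1, 1, 0, 0]  # 1 = contains a suggestion, 0 = does not

# TF-IDF features feeding a linear classifier: a simple, transparent baseline.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(posts, labels)

print(model.predict(["Testing sites should open earlier in the morning."]))
```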
In the context of the Insight SFI Research Centre for Data Analytics, we develop algorithms for the detection of offensive content in social media. Here, we focus specifically on memes, which combine image and text and therefore require sophisticated multimodal methods that take both modalities into account. Also in the Insight context, we develop methods for the automatic extraction of knowledge graphs, semantically structured representations of the significant concepts and relations in a given domain of discourse; the resulting knowledge graphs are used in chatbot development for these domains. In the context of the SFI Centre for Research Training in Artificial Intelligence, we are developing approaches to automatically reason over text, for instance to infer causality between different statements, as well as methods for the automatic generation of textual summaries from tables of numerical data, such as in sports, finance, and healthcare.
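To make "multimodal" concrete: a typical late-fusion architecture encodes the image and the text separately and lets a classifier operate on the combined representation. The sketch below is a minimal illustration of that idea, with random vectors standing in for the embeddings that pretrained vision and language models would normally produce; it is an illustrative assumption, not the specific architecture used in our research.

```python
# A minimal sketch of late fusion for multimodal meme classification:
# image and text embeddings are concatenated and passed to a classifier.
# The embeddings here are random stand-ins; a real system would obtain
# them from pretrained vision and language encoders.
import torch
import torch.nn as nn

class MemeClassifier(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, hidden=256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # offensive vs. not offensive
        )

    def forward(self, img_emb, txt_emb):
        # Concatenating both modalities lets the classifier pick up cues
        # that only emerge from the image-text combination.
        return self.fusion(torch.cat([img_emb, txt_emb], dim=-1))

model = MemeClassifier()
img_emb = torch.randn(1, 512)  # placeholder for an image embedding
txt_emb = torch.randn(1, 512)  # placeholder for a text embedding
print(model(img_emb, txt_emb).softmax(dim=-1))
```

The design point is that neither modality alone suffices: a harmless image paired with harmless text can still form an offensive meme, which is why the classifier must see both representations together.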
However, with success comes responsibility. The increasing uptake of NLP technology in everyday tools used across society demands increased scrutiny of the inner workings of those tools. Any product or service should be trusted to do reliably what it is supposed to do: if you buy a new car, you can reasonably assume it will drive you safely from A to B. For AI tools, that is unfortunately not yet the case, despite many concerted efforts between the scientific community, industry, and governments. Interdisciplinary research areas such as Trustworthy AI and Explainable AI, involving researchers from data and computer science alongside social scientists and legal experts, focus on developing new methods for making AI systems more transparent. Governments are working together with industry and the scientific community to develop new regulations for AI applications. Not surprisingly, such regulation has developed furthest in AI for healthcare, but transparency and corresponding enforceable regulation will be required for all AI systems.
So, what can go wrong with NLP technology? There have, in fact, been several notorious examples of things going wrong. In 2015, Amazon discovered that its newly developed automated recruitment tool unintentionally discriminated against women by favouring masculine language when ranking CVs. In 2016, Microsoft released a chatbot that, soon after its release, started to generate racist and otherwise offensive language, likely a consequence of unintentionally learning such language through interaction with human users, or of unintended biases introduced during its initial development. These and other big tech companies have learned from these mistakes and have set up working groups across the industry to address bias in AI.
Unintended bias introduced by NLP and other AI systems will remain a growing concern for society, business, and individuals if it is not addressed from the start and followed through in all its potential consequences. Amazon's recruitment tool was trained on CV data from previous years, without considering that this data comprised the CVs of a predominantly male workforce, reflecting past recruitment. Microsoft's chatbot was intended for informal and general use, but consider a future scenario in which chatbots are increasingly used in education. Machine translation is used more and more as an everyday tool, but given differences in conceptualization between languages, unintended bias may be introduced; for instance, translating from a language with gender-neutral pronouns, such as Turkish, can force a stereotyped choice between "he" and "she". The outcomes of automated sentiment analysis and other social media analysis tools influence decisions by industry and governments that may have far-reaching effects in society.
In all such cases, it is no longer only about ever faster and more efficient data processing, but increasingly about selecting the right data and using it in a transparent and responsible way.
Profiles
Dr. Paul Buitelaar is Senior Lecturer in Computer Science, Interim Director of the Data Science Institute at NUI Galway, Interim Site Director of the Insight SFI Research Centre for Data Analytics, and Co-Director of the SFI Centre for Research Training in Artificial Intelligence.