My first academic job was as a Research Assistant in the Department of Computer Science at the University of Sheffield. The project was about developing automated methods to identify and measure text reuse within the British newspaper industry. Since then, I have been interested in Natural Language Processing (NLP), following progress within the field and using NLP methods in my work. In this post I will introduce the field of NLP, the typical approaches for processing language, and some example applications and use cases. Future blog posts will delve deeper into specific aspects of AI and NLP.
What is NLP?
Not to be confused with Neuro-Linguistic Programming, the NLP discussed here is a branch of Computer Science and Artificial Intelligence (AI) concerned with developing methods to analyse, model, process and understand human language. NLP can be further divided into two subfields: Natural Language Understanding (NLU), which deals with building representations of language that machines can work with and reason over, and Natural Language Generation (NLG), which is about generating word sequences.
Why is processing natural language hard?
As humans we develop language abilities from a young age. However, processing and understanding language, especially using machines, is hard. Why? It’s because natural language can be full of ambiguity, often requiring context to interpret and disambiguate its meaning (e.g., think river bank vs. financial bank).
To make things harder, people also use their own language and idiosyncrasies. For example, social media has spellings and slang you won’t find in any dictionary, whilst reports and papers can be full of jargon and industry-specific terminology. In addition, correctly interpreting the meaning of language is often only possible with some working model of the world, context and common sense.
In many ways, the holy grail for NLP is to develop methods that enable machines to deeply understand the meaning of language and converse in naturalistic ways. However, interpreting meaning can be hard. Just consider the following examples of real newspaper headlines: “Drunk gets nine years in violin case” and “Blind Bishop appointed To See”.
What approaches can be used to tackle NLP?
Approaches to tackling NLP problems typically fall into two main categories: rule/heuristics-based and data-driven, which in many ways reflect the more theoretical and more empirical perspectives on NLP. The heuristics-based approaches use rules created and programmed into machines (e.g., using templates, grammars or regular expressions). These are often a good place to start in tackling NLP problems and can be surprisingly effective, especially when the problem follows a regular structure. For example, identifying UK postcodes can be achieved with high accuracy using pattern matching based on regular expressions. The open-source spaCy library would be a good way to implement this.
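To make the rule-based idea concrete, here is a minimal sketch of postcode matching. It uses Python's built-in re module rather than spaCy, and the pattern is a deliberately simplified assumption about the postcode format (one or two letters, a digit, an optional letter or digit, then a digit and two letters). It does not check that a match is a real postcode, and the function name find_postcodes is made up for this example.

```python
import re

# Simplified, illustrative pattern: outward code (e.g. "S1", "SW1A"),
# an optional space, then inward code (e.g. "4DP"). This is an
# assumption for the sketch, not the full official postcode rules.
POSTCODE_PATTERN = re.compile(
    r"\b[A-Z]{1,2}[0-9][A-Z0-9]?\s?[0-9][A-Z]{2}\b",
    re.IGNORECASE,
)


def find_postcodes(text):
    """Return every postcode-like string in text, upper-cased."""
    return [match.group(0).upper() for match in POSTCODE_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(find_postcodes("Our office is at S1 4DP, and the old one was sw1a 1aa."))
```

A single rule like this is easy to read and test, which is part of the appeal of starting with heuristics.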
Rule-based approaches can be brittle, though, and become difficult to manage with more complex problems. The data-driven approaches model language and solve tasks using statistical methods or machine learning. Here the rules are not provided but learned from large samples of language and labelled training data. These ‘classical NLP’ approaches require human input to specify how to represent language, possibly along with additional derived attributes (in both cases referred to as features). A key challenge for data-driven methods is representing language, because computers can only deal with numbers. Commonly, we do this by recording word occurrences (e.g., a Bag-of-Words model) or word contexts (e.g., using word embeddings) as vectors of numbers.
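As a rough illustration of turning text into numbers, the sketch below builds a tiny Bag-of-Words representation using only the Python standard library. The tokeniser is a naive assumption (lower-cased runs of letters), and the helper names tokenise and bag_of_words are invented for this example. In practice a library would normally handle this step.

```python
import re
from collections import Counter


def tokenise(text):
    """Naive tokeniser: lower-case the text and keep runs of letters."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors), one count vector per document."""
    tokenised = [tokenise(doc) for doc in documents]
    # The vocabulary is every distinct token, sorted for a stable column order.
    vocabulary = sorted({token for doc in tokenised for token in doc})
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors


if __name__ == "__main__":
    vocab, vecs = bag_of_words(["the cat sat", "the cat saw the dog"])
    print(vocab)
    for vec in vecs:
        print(vec)
```

Each document becomes a vector with one position per vocabulary word. Word order is discarded, which is exactly the simplification the name Bag-of-Words refers to.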
More recently, approaches that use deep learning to learn representations or models of language from large samples of text, without human guidance, have dominated the headlines (e.g., GPT-3). In this approach the machine extracts its own features and learns rules automatically. It’s probably fair to say that the world of NLP has been transformed (literally) by the use of large language models based on deep learning. Transformers are a game changer, and thousands of pre-trained models for language understanding and generation, as well as for computer vision and audio tasks, are available to use. They are typically applied by taking a pre-trained language model and then fine-tuning it to a specific domain or task. This ‘transfers’ patterns learned during language model pre-training to domain-specific problems, reducing the need for domain-specific training data that is expensive to create.
Pipelines and common NLP tasks
Because of the complexity of processing natural language, it is common to break down NLP into a series of simpler tasks. For example:
- identifying individual words within a text (tokenisation)
- categorising word usage (e.g. identifying their parts-of-speech or entities)
- identifying and representing how words are arranged in sentences (syntax)
- identifying the meaning of words in context (semantics)
- understanding how sentences can form narratives or dialogues (pragmatics)
These tasks typically fall into levels of difficulty, with the earlier tasks being easier than the later ones. They are commonly arranged as a sequence of steps in a pipeline, which can be developed to create applications for broader tasks such as (adapted from Practical Natural Language Processing):
- Language modelling – predicting the next word(s) in a sequence
- Text classification – putting text into predefined categories
- Information extraction – extracting key entities, events, relations from text
- Information retrieval – finding documents relevant to a user query
- Conversational agents – building dialogue systems
- Text summarisation – summarising key aspects of a text
- Question answering – automatically answering questions posed in natural language
- Machine translation – translating text from one language to another
- Topic modelling – automatically identifying themes or topics in text
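The pipeline idea described above can be sketched as a chain of small functions, here a tokenisation step feeding a very crude text classification step. This is a toy, rule-based illustration only: the categories and keyword lists are hypothetical, invented for the example, and a real classifier would usually be learned from labelled data.

```python
import re

# Hypothetical categories and keywords, invented purely for illustration.
CATEGORY_KEYWORDS = {
    "outage": {"outage", "blackout", "fault"},
    "billing": {"bill", "invoice", "payment"},
}


def tokenise(text):
    """Step 1: split text into lower-cased word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(tokens):
    """Step 2: pick the category whose keywords appear most often."""
    scores = {
        category: sum(token in keywords for token in tokens)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to "other" when no keyword matched at all.
    return best if scores[best] > 0 else "other"


def pipeline(text):
    """Run the steps in order: tokenise, then classify."""
    return classify(tokenise(text))


if __name__ == "__main__":
    print(pipeline("Power outage reported after a fault on the line"))
```

Because each step only depends on the output of the previous one, a step can be swapped for a better component without rewriting the rest.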
The AI demo site from JISC is an interactive website for experimenting with AI-powered vision and language applications, showing the diversity of what is currently possible.
Use cases for NLP in practice
This all sounds great, but in reality what can we use NLP for? The impressive stunts performed by the likes of the OpenAI tools, from writing newspaper articles to generating program code from a description in natural language, are all well and good, but what can NLP really do for organisations?
Well, consider central government and the wealth of unstructured data available from healthcare records, energy and environmental reports, to citizen surveys and social media. These are all ripe for applying NLP methods, for example chatbots to improve citizen engagement, improving public services by mining citizen feedback, improving predictions to aid decision making, or enhancing policy analysis.
Further examples I have worked on include analysing text from real-time incident reporting for a utilities company. The insights gained from the unstructured data enabled more accurate predictions of power outages. One reason for this was that certain language recorded in the transcripts was associated with the severity and type of faults. In another project, for a UK charity, topic modelling was used to identify underlying themes in descriptions of projects and enhance reporting capabilities. Finally, search and NLP methods were used to provide unified access to multiple collections managed by a local authority. Enrichments that identify entities, which were then mapped to controlled vocabularies, were implemented to enable richer search capabilities.
There is no doubt that NLP methods and technologies are rapidly changing how we model language and build applications. However, along with the many benefits come challenges. One such challenge is that many are questioning the sustainability and environmental impact of training large language models. Others are uncovering the biases that data-driven approaches, especially language models, can learn and replicate, such as racial and gender biases. Will machines ever really ‘understand’ language or just seem to exhibit intelligent traits or appear sentient? It’s difficult to say, but what is certain is that there are many areas and tasks where NLP is successfully helping and supporting individuals and organisations.