Domain-specific NLP pipelines

Just as you have general-purpose clothes for everyday use, there is also a wide variety of specialised clothing that fits certain situations better. The same is true for Natural Language Processing (NLP) pipelines, which is what this blog post is about.

Let’s start with having a look at the architecture of an NLP pipeline and the building blocks that can be used in them.

Solution architecture

The overall architecture of an NLP pipeline consists of a number of layers: a user interface; one or several NLP models, depending on the use case; a Natural Language Understanding layer that describes the meaning of words and sentences; a preprocessing layer; microservices that link the components together; and, of course, the infrastructure that hosts the whole solution.

Any NLP solution consists of a number of reusable components. For tailoring models to a specific domain, it is mainly the natural language understanding (NLU) layer that needs to be retrained. This layer is responsible for mapping input tokens (words, sentences, paragraphs or documents) to their respective meanings.

The other components, such as tokenisers, lemmatisers and system entity parsers, are common to all NLP models. The surrounding back-end systems that deal with API connectivity, scalability, model versioning and security can also be reused. Although the models themselves have to be retrained for every use case, the architecture of the models stays the same, so they are considered to be “90% reusable”.
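To make that reusability more concrete, the sketch below shows how shared preprocessing components can be composed with a swappable, domain-specific NLU layer. All names are hypothetical and do not describe Faktion's actual API:

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class NlpPipeline:
        tokenise: Callable[[str], List[str]]              # reusable across domains
        lemmatise: Callable[[List[str]], List[str]]       # reusable across domains
        embed: Callable[[List[str]], List[List[float]]]   # domain-specific NLU layer

        def run(self, text: str) -> List[List[float]]:
            tokens = self.tokenise(text)
            lemmas = self.lemmatise(tokens)
            return self.embed(lemmas)

    # Tailoring the pipeline to a new domain means swapping only the embedding
    # component, for example:
    # finance_pipeline = NlpPipeline(tokenise, lemmatise, finance_embeddings)
    # legal_pipeline   = NlpPipeline(tokenise, lemmatise, legal_embeddings)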

The following sections briefly describe every layer of the NLP pipeline.

User interaction (web application)

Machine Learning models only add value when they are actively used. A good user interface has two functions: making sure the NLP models are used and gathering extra training data.

Use cases

The components of the custom AI platform are reused across multiple use cases. The initial investment in training domain-specific components pays itself back through this reusability across a wide variety of use cases.

Faktion continuously researches novel deep learning architectures for most Machine Learning tasks. We have implemented, benchmarked and compared dozens of different types of models for each task. These models are ready for reuse and, in most cases, do not require customisation.

Models

Almost every NLP use case falls into one of five categories: classification, entity or information extraction, document similarity matching, summarisation or natural language generation. The output of the underlying NLU components is used as input for these models.

Natural Language Understanding layer

The models described in the previous paragraph can only work if they understand the context and capture the meaning of specific words, sentences and documents. This is achieved by using word vectors, sentence vectors, document vectors or a combination of these, which serve as input for the NLP models.

Since the meaning and scope of words are strongly domain-dependent, this is one of the layers where a custom implementation has the most impact.
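As an illustration, the snippet below turns a document into a single vector that downstream models can consume, using spaCy's pretrained English vectors (it assumes the en_core_web_md model is installed); a domain-specific pipeline would swap in custom-trained embeddings instead:

    import spacy

    # Requires: python -m spacy download en_core_web_md
    nlp = spacy.load("en_core_web_md")

    def document_vector(text: str):
        """Return a single dense vector for the whole document."""
        doc = nlp(text)
        return doc.vector  # average of the token vectors

    vec = document_vector("The interest rate on the loan was raised again.")
    print(vec.shape)  # (300,) for the medium English model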

Pre-processing

Faktion has built robust, high-quality components for Flemish, Dutch, French, English and German. These components form the backbone of the solution and are crucial for getting good results quickly. Our preprocessing components include:

  • tokenisation (splitting up a text into words)
  • parsing (for instance to transform all date formats to one single format)
  • OCR (optical character recognition to transform PDF documents and images into raw text)
  • language detection
  • text to speech (TTS)
  • speech to text (STT)

We also use existing Microsoft solutions for several of these components. This is the case for TTS, STT and OCR. For the Flemish variant of Dutch, we have in-house TTS and STT models that can be customised if needed.
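For illustration, the sketch below implements two of the preprocessing steps listed above with open-source libraries (spaCy for tokenisation and langdetect for language detection); these are stand-ins, not Faktion's in-house components:

    import spacy
    from langdetect import detect

    nlp = spacy.blank("nl")  # blank Dutch pipeline: tokeniser only

    def preprocess(text: str):
        language = detect(text)                       # e.g. "nl", "fr", "en"
        tokens = [token.text for token in nlp(text)]  # split the text into words
        return language, tokens

    print(preprocess("Dit is een voorbeeldzin over jaarverslagen."))
    # e.g. ('nl', ['Dit', 'is', 'een', 'voorbeeldzin', 'over', 'jaarverslagen', '.'])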

Microservices backend

Linking all components together requires a robust and scalable back-end system. This back end takes care of scheduling training jobs, handling API requests, integration, building, testing, security and authentication.

Infrastructure

Our solution is packaged as Helm charts, which can easily be deployed on Azure Kubernetes Service, for example, or on-premise on OpenShift.

When does a domain-specific NLP pipeline make sense?

General-purpose NLP pipelines are used successfully in a wide variety of applications. The need for a domain-specific NLP pipeline arises when the data to be processed differs considerably from standard, everyday language.

Jargon-filled language

General-purpose models are trained on common language (books, articles, forums, Wikipedia, news articles). As such, they are well suited to applications in the domains they were trained on: language and context that can be expected to be understood by someone with a standard high school education or by the average newspaper reader.

Some document types, however, contain specialised language and an industry-specific vocabulary, giving rise to the need to train custom word embeddings. Highly specialised terminology is not frequently used, so the chances that these words have an accurate representation in pre-trained word embeddings are small. Pre-trained word embeddings are usually trained on big corpora such as Wikipedia, and even though Wikipedia contains quite a bit of technical jargon, it does not contain all technical words, or it contains them only at a very low frequency.

On top of that, words may have an entirely different meaning in a specific domain than they have in everyday language. Think for instance of the word “interest”. In everyday language, “interest” usually means “the feeling of wanting to know or learn about something or someone”. When the word “interest” is used in financial documents, however, it much more likely designates “money paid regularly at a particular rate for the use of money lent, or for delaying the repayment of a debt”. For an NLP pipeline for financial documents to give results that are as accurate as possible, it is crucial that the specialised meaning of the words and terminology occurring in them is properly represented in the NLU layer. This is the main reason for choosing to create a domain-specific NLP pipeline.
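As an illustration of what training such domain-specific embeddings can look like, here is a minimal sketch using gensim's Word2Vec (the corpus file name and the hyperparameter values are placeholders, not a description of Faktion's actual setup):

    from gensim.models import Word2Vec

    # Hypothetical corpus: one tokenised, lowercased sentence per line.
    with open("financial_reports.txt", encoding="utf-8") as f:
        sentences = [line.split() for line in f]

    model = Word2Vec(
        sentences,
        vector_size=300,  # dimensionality of the embeddings
        window=5,         # context window size
        min_count=5,      # ignore very rare tokens
        workers=4,
    )

    # In a domain-specific model, the nearest neighbours of "interest" reflect
    # the financial sense of the word rather than the everyday one.
    print(model.wv.most_similar("interest", topn=5))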

Some domains in which the ubiquity of specialised terms may require training custom NLU models:

  • Financial documents like annual reports, shareholder letters, …
  • Legal documents like contracts, legislation, …
  • Technical manuals
  • R&D Lab reports

Non-standard language use

Sometimes general-purpose language models are not enough because the text data belongs to a register with particular linguistic characteristics. Think for instance of radio communication in air traffic control centres, or between police officers.

Dispatcher: Adam Twelve code five.
Adam Twelve: Twelve, code five, go ahead.
Dispatcher: I’m showing a warrant on your party, Doe, John Q., date of birth three five of sixty, showing physical as white male, six foot, two-eighty, blond and blue, break–

Even though the conversation above contains common, everyday words, to outsiders it may seem like a chain of disconnected words. Training custom word embeddings that capture and properly represent the particular distribution of words in radio communication will increase the accuracy of the NLP models that use them.

The same is valid for text message communication. A message like “lol gr8 2 c u 2 but omg g2g ttyl!” contains everyday language (“Laughing out loud, was great to see you too but oh my god got to go talk to you later!”), but it is not written in the standard way. The chances that the words used in such a text message have a proper representation in pre-trained word embeddings, or any representation at all, are pretty low. Training word embeddings on social media posts and text messages ensures that an accurate representation is available for this type of text.

Specific input features

For some use cases, the classic NLP modelling techniques based on an NLU layer won’t do; a different type of model and input is needed. Take for instance dementia detection. Dementia has an impact on language capacity: patients use an increasingly simplified vocabulary, shorter sentences, words with the wrong meaning, and so on. To detect such patterns, it does not suffice to take an abstract vectorial representation of their writings and run it through a classification model that determines whether an email was written by a dementia patient or not. Since it is not necessarily what dementia patients say but rather how they say it, word embeddings or other meaning representations are not the right type of input features. Features such as the ratio between the number of unique words and the total number of words, or the length of sentences, have to be extracted to serve as input to the classification models.

Additionally, what matters in this case is a decline over time, so it is not possible to simply classify a single text as being written by a dementia patient or not. A person might be writing to her six-year-old grandchild and choose to use a simplified vocabulary, and this writing might wrongly be classified as belonging to a dementia patient if the model does not take the evolution over time into account. If a person formerly used a rich vocabulary and long, complex sentences, and this richness declines gradually but consistently over time, it might be an indication of declining cognitive capacities.
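A minimal sketch of the kind of feature engineering described above (the features shown and the example texts are purely illustrative):

    import re

    def stylistic_features(text: str) -> dict:
        """Extract style features rather than meaning representations."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"\w+", text.lower())
        return {
            "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
            "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        }

    # Tracking these features over time makes it possible to model a gradual
    # decline instead of classifying a single text in isolation.
    features_per_year = {
        2018: stylistic_features("The quarterly projections comfortably exceeded our initial expectations."),
        2023: stylistic_features("It was nice. It was a nice day. We had a nice day."),
    }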

Pitching and writing coaches also require a different type of input features; word embeddings won’t do the job. For pitching coaches it might, for instance, be useful to include word rate as a feature, while for writing coaches the use of textual connectors such as ‘nevertheless’ and a measure of the complexity of the sentences might be relevant.

For the type of use cases mentioned in this section, it is crucial to understand what characterises the target classes and to apply proper feature engineering, so that the models can correctly identify which patterns are associated with which class.
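As a final illustrative step, engineered features like these can be fed into a standard classifier, for example scikit-learn’s logistic regression (the feature values and labels below are placeholders):

    from sklearn.linear_model import LogisticRegression

    # Each row: [type_token_ratio, avg_sentence_length] for one document.
    X = [
        [0.72, 14.3],
        [0.35, 6.1],
        [0.68, 12.9],
        [0.31, 5.4],
    ]
    y = [0, 1, 0, 1]  # hypothetical target classes

    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[0.40, 7.0]]))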

Accuracy is paramount

Custom models increase accuracy. If the business case is large enough and every increase in accuracy leads to a high enough return on investment, it makes sense to squeeze out a couple of additional percentage points by creating a custom model. Faktion has the necessary expertise to determine whether your specific use case needs customisation.

Typical project plan for a domain-specific NLP pipeline

A typical project takes about three months. At the end of it, a self-learning algorithm is in place that ensures the models keep improving over time.

Are you interested in building your own custom NLP pipeline on our NLP framework? Contact us to learn more.