Domain Specific NLP Pipelines
The overall architecture of an NLP pipeline consists of several layers: a user interface; one or several NLP models, depending on the use case; a Natural Language Understanding layer to describe the meaning of words and sentences; a preprocessing layer; microservices for linking the components together and of course the infrastructure which hosts the complete solution.
Any NLP solution consists of several reusable components. For tailoring models to a specific domain, it is mainly the natural language understanding (NLU) layer that needs to be retrained. This layer is responsible for mapping input tokens (words, sentences, paragraphs, or documents) to their respective meanings.
The other components, like tokenizers, lemmatizers, and system entity parsers, are common to all NLP models. The surrounding back-end systems that deal with API connectivity, scalability, model versioning, and security can also be reused. Although the models themselves have to be re-trained for every use case, the architecture of the models stays the same, so they are considered to be “90% reusable”.
The following sections briefly describe every layer of the NLP pipeline.
User interaction (web application)
Machine Learning models only make sense when users actively use them. An excellent user interface has two functions: making sure the NLP models are used and gathering extra training data.
The components of the custom AI platform are reused in multiple use cases. The initial investment for training domain-specific components recompenses by virtue of the reusability in a wide variety of use cases.
Faktion does continuous research on novel deep learning architectures for most of the Machine Learning tasks. We have implemented, benchmarked, and compared tens of different types of models for each job. These models are ready to reuse and – in most cases – do not require customization.
Almost every use case regarding the use of NLP falls into one of five categories: classification, entity or information extraction, document similarity matching, summarisation, or natural language generation. The output of the underlying NLU components is used as input for these models.
Natural Language Understanding layer
To enable the models described in the previous paragraph, it is crucial that they can understand the context and capture the meaning of specific words, sentences, and documents. This happens by using word vectors, sentence vectors, document vectors, or a combination of these. These vectors are used as input for the NLP models.
Since the meaning and scope of the words is strongly domain-dependent, this is one of the layers where the most impact can be made with a custom implementation.
Faktion has built robust and high-quality components for Flemish, Dutch, French, English, and German. These components form the backbone of the solution and are crucial in getting good results quickly. Our preprocessing components include
- tokenization (splitting up a text into words)
- parsing (for instance to transform all date formats to one single format)
- OCR (optical character recognition to convert pdf documents and images to raw text)
- language detection
- text to speech (TTS)
- speech to text (STT)
We also use existing Microsoft solutions for several of these components. This is the case for TTS, STT, and OCR. For the Flemish variant of Dutch, we have in-house TTS and STT models that can be customized if needed.
For linking all components together, you need a robust and scalable backend system. This back-end system takes care of scheduling training jobs, API-requests, integration, building, testing, security, and authentication.
Our solution is deployed using Helm templates, which can easily be deployed, for example on Azure Kubernetes Service or on-premise on OpenShift.
When does a domain-specific NLP pipeline make sense?
General-purpose NLP pipelines are used successfully in a wide variety of applications. The need for a domain-specific NLP pipeline arises when the data to be processed, differs considerably from standard, everyday language.
General-purpose models are trained in common language (books, articles, forums, Wikipedia, news articles). As such, they are very suitable for applications in the same domains as they are trained on language and context that can be expected to be understood by people with a regular high school education or the average reader of a newspaper.
Some document types however, contain specialized language and an industry-specific vocabulary, giving rise to the need for training custom word embeddings. Highly specialized terminology is not frequently used, and the chances that these words have an accurate representation in pre-trained wordembeddings are small. Pre-trained wordembeddings are usually trained on big corpora such as Wikipedia, and even though Wikipedia does contain quite some technical jargon, it does not include all technical words, or it contains them in a shallow frequency. On top of that, terms may have an entirely different meaning in a specific domain than they have in everyday language. Think, for instance, of the word “interest.” In everyday language, “interest” usually means “the feeling of wanting to know or learn about something or someone.” However, when the word “interest” is used in financial documents, it much more likely designates “money paid regularly at a particular rate for the use of money lent, or for delaying the repayment of a debt.” For an NLP pipeline for financial documents to give as accurate results as possible, it is crucial that the specialized meaning of the words and terminology occurring in them is properly represented in the NLU layer. This is the main reason for choosing to create a domain-specific NLP pipeline.
Some domains in which the ubiquity of specialized terms may require training custom NLU models:
- Financial documents like annual reports, shareholder letters,…
- Legal documents like contracts, legislation,…
- Technical manuals
- R&D Lab reports
Non-standard language use
Sometimes the general-purpose language models are not enough because the text data pertains to a register with particular linguistic characteristics. Think, for instance, of radio communication in air traffic control centers, or between police officers.
Dispatcher: Adam Twelve code five. Adam Twelve: Twelve, code five, go ahead. Dispatcher: I’m showing a warrant on your party, Doe, John Q., date of birth three five of sixty, showing physical as white male, six foot, two-eighty, blond and blue, break–
Even though the conversation above contains common, everyday words, to outsiders, the conversation may seem to consist of a chain of disconnected words. Training custom word embeddings to capture and adequately represent the particular distribution of words in radio communication will increase the accuracy of the NLP models making use of the embeddings.
The same is valid for text message communication:
The message above contains everyday language (“Laughing out loud, was great to see you too, but oh my god got to go talk to you later!”), but it is not written in the standard way. The chances that the words used in the text message have a proper representation in pre-trained wordembeddings, or a representation altogether, are pretty low. Training word embeddings on social media posts and text messages ensure the availability of accurate representation for this type of text.
Specific input features
For some use cases, the classic NLP modeling techniques, based on an NLU layer, won’t do, a different type of model and input is needed. Take for instance dementia detection. Dementia has an impact on language capacity in the sense that patients will use an increasingly simplified vocabulary, shorter sentences, words with the wrong meaning, etc. To detect such patterns, it doesn’t merely suffice to take an abstract vectorial representation of their writings and run these representations through a classification model determining whether a dementia patient wrote an email or not. Since it is not necessarily what dementia patients say but instead how they say it, word embeddings or other meaning representations are not the right types of input features. Features such as the ratio between the number of unique words and the total number of words, or the length of sentences, have to be extracted to serve as an input to the classification models.
Additionally, what matters, in this case, is a decline over time, it is not possible to simply classify a text as being written by a dementia patient or not. A person might be writing to her 6-year-old grandchild and choose to use a simplified vocabulary. This writing might wrongly be classified as belonging to a dementia patient if the model does not take into account the evolution over time. If a person formerly used a rich vocabulary and long, complex sentences, and this richness declines gradually but consistently over time, it might be an indication of declining cognitive capacities.
Also pitching and writing coaches require a different type of input features, word embeddings won’t do the job. For pitching coaches, it might, for instance be useful to include word rate as a feature, and for writing coaches, the use of textual connectors such as ‘nevertheless’ and a measure for the complexity of the sentences used might be relevant.
For the type of use cases mentioned in this section, it is crucial to understand what characterizes the target classes, and apply proper feature engineering to the data which will allow the models to identify which patterns are associated with which class correctly.
Accuracy is paramount
Custom models increase accuracy. If the business case is large enough and every increase of accuracy leads to a high enough return on investment, it makes sense to squeeze out a couple of additional percentages by creating a custom model. Faktion has the necessary expertise to determine whether your specific use case needs customization.
A typical project plan for a domain-specific NLP pipeline
A typical project takes about three months. At the end of this, a self-learning algorithm is in place that will make sure these models keep on improving over time.
Are you interested in building your custom NLP pipeline on our NLP framework? Contact us to learn more.
Data Scientist with a passion for NLP and a background as a researcher in Theoretical Linguistics. Specialized in formal approaches to typical aspects of linguistic interaction, such as question answering, implicit information, and discourse organization. Aleksandra loves tackling complex problems and finding patterns in unstructured data. A growing fascination for data science and its applications in Language Technology and AI led me to participate in a data science boot camp. I am an advocate of making science accessible to a broad audience and strongly believe that crowdsourcing is a crucial component of innovation.