Faced with inconsistent employee data from legacy systems, our client partnered with Faktion to deploy an AI-driven classification solution. Leveraging LLMs and advanced classification ML methods, the solution automated job-title standardisation, enhancing data quality, operational efficiency, and customer insights.
Our client, a Belgian HR specialist, offers a wide range of services related to human resources and social administration for businesses and self-employed individuals. Through an online platform, employers and employees have access to various services such as payroll administration, wellbeing advice, and more. With an extensive active membership base and monthly visitors in the hundreds of thousands, the platform experiences significant traffic.
Platform users expect a frictionless digital journey where they can interact at any time, looking for an integrated, one-stop-shop service experience. As an advisor and service provider, the client is dedicated to delivering accurate and tailored support to its customers. In this context, having reliable and accurate customer data is crucial for extracting relevant insights. This necessity brings high demands in terms of data quality—no small feat for a fast-growing organisation with legacy systems built over the years.
At Faktion, we strongly believe in a data-centric approach to AI, and in using AI to optimise data quality. This viewpoint is also embraced by our client’s leadership, which has made data quality and data governance a strategic priority. Hence, when looking for a partner that could help them take data quality to the next level, our client found a natural ally in Faktion.
Managing a large-scale HR services platform requires substantial effort in user management while also ensuring compliance with data protection and privacy regulations. A large portion of this data comes from older legacy systems, making some records incomplete or incorrect. According to the principle of “garbage in, garbage out,” poor data yields poor insights.
As an advisor, having the correct information about customers is essential to adapting communication and services. For instance, imagine a prevention and wellbeing advisor helping a business owner ensure all legal safety standards are met. Naturally, the advisor needs to know who the employees are and what their job titles or professional activities entail, as safety standards differ by role. In an ideal scenario, multiple activities or job titles link to an overarching reference title, which in turn is tied to specific safety regulations.
Historically, however, new employees’ job titles were added as free-text fields. Originally, the decision to allow for free-text input was meant to simplify the user experience—offering maximum flexibility for individuals when adding their roles, rather than forcing them to scroll through a very long list of job titles. But over time, those free-text entries have created challenges:
Thus, what was once a deliberate UX choice to offer free-text flexibility is now limiting the ability to leverage that data for various customer-focused initiatives. An up-to-date classification system is the logical next step, and modern AI can handle much of that effort automatically.
Essentially, all free-text fields need classification into specific job title categories so that the client can provide more tailored services. While several legacy data management actions have been taken, many proved non-scalable because they were done manually—time-consuming, prone to error, repetitive, and ultimately draining resources in a way that hindered operational efficiency.
The client also needed accurate insights for benchmarking, and it suspected that current technology could offer a more advanced solution for data management. Hence, the goal was to automate classification of free-text job titles into standardised reference categories, enhancing data quality and consistency without requiring additional workforce or compromising customer satisfaction.
Faktion’s “Intelligent Data Quality Optimisation” (IDQO) toolbox—an AI-driven suite of solutions for automating data tasks—provides a robust and scalable way to eliminate manual processes.
For the project’s first phase, the client prioritised employee classification in a specific high-need industry: construction.
Overall, the AI-driven solution comprises three major building blocks:
Initially, we applied an unsupervised clustering approach by running a hierarchical clustering model on the cleaned free-text job descriptions, grouping them by similarity. A single pass of fine-tuning was performed in collaboration with the client’s domain experts.
However, relying on unsupervised clustering alone proved insufficient:
To address these shortcomings, we leveraged LLMs for semantic interpretation, aiming to separate generic words from more specific subcategories. Here is the high-level workflow:
The next step involved building a classification model capable of mapping each free-text job description to the correct label from the reference list. Often, a perfect one-to-one match may exist. If not, the classification model must infer the best possible match.
A key challenge was the absence of a labeled dataset: the client lacked preexisting links between free-text entries and standardised reference titles. To surmount this, we employed a zero-shot learning approach with SetFit.
Finally, a Search API was created to classify both historical data and new entries in real time. Upon data entry:
Because the same API powers both historical data classification and new entry suggestions, it seamlessly keeps all data consistent and up to date.
In the first phase, we delivered:
This approach significantly improves insights for construction-sector customers, ensuring more tailored services. All steps are well documented, paving the way for future rollouts in other industries.
Moreover, the entire solution involved close collaboration among various stakeholders—domain experts, data specialists, IT/architecture specialists, and management—to guarantee a holistic perspective on requirements. This emphasis on knowledge transfer means that the client’s teams can create and train new data pipelines without extensive ML engineering skills.
Next, the approach will be further operationalised and extended to additional sectors. This will continue to elevate data quality across the board. Ultimately, the client will have a robust and well-structured reference list spanning all industries they support—fully integrated into the platform.
Key outcomes include:
By embracing a data-centric approach and leveraging AI-driven classification, the client lays a strong foundation for high-quality data and scalable AI initiatives in the future, transforming data from a liability into a driver of innovation and strategic growth.