AI in Publishing & Pre-Press:

Turning Unstructured Manuscripts into Print-Ready Books and Audiobooks

A blogpost by
Vladimir Dzyuba
06 January 2026

Discover how Faktion and Crius built an evaluation-driven, multimodal AI system that understands author intent and transforms chaotic Word manuscripts into structured, accessible, production-ready XML for publishing workflows.


In publishing, one of the biggest inefficiencies sits right at the start of production: the author’s manuscript.

Manuscripts are delivered as Word files, but every author has their own way of formatting and structuring the content:

  • Bold text might indicate a title or a new chapter.
  • An empty line might mean a new scene.
  • Italics might indicate emphasis, a foreign word, a brand name, or a professional term, depending on the author and the book’s genre.

This variability becomes a serious problem when using LLMs to produce structured, formatted files.

None of this translates into the clean, machine-readable structure required for downstream workflows like e-book formatting, print layout, or audiobook generation.

Earlier this year, Faktion and Crius Group announced a strategic collaboration to bring AI into the heart of publishing production. Together, the teams are advancing the CORE platform, an end-to-end publishing platform that automates everything from manuscript structuring and metadata enrichment to multilingual translation, audiobook generation, and compliance with the EU Accessibility Act.

Crius, a digital publishing technology company, partnered with Faktion to leverage AI and reduce book preparation time and costs.

Together, we set out to build an AI system that could understand the author’s layout and intent, and translate manuscript Word files into clean, structured XML enriched with semantic tags, ready for downstream use.

What followed was a series of technical challenges that shaped the final solution and taught us what it truly means to teach AI to read like an editor.

Identifying Structure Within Inconsistent Document Formatting

Our first obstacle came immediately. Word documents look organised, but under the hood, they’re a mess. There are no true page boundaries in Word files, no concept of a chapter, and no way to tell whether spacing or styling has meaning. Feeding this directly into an LLM produced noisy, inconsistent outputs.

Instead of trying to parse Word files directly, we decided to make the layout visible to the model. So, we built a visual pipeline.

Each manuscript was first converted to PDF, then split into page images. Each image was processed by a multimodal LLM trained to recognise structural elements (chapters, paragraphs, dialogue, scene breaks) and translate them into XML tags.

By giving the model access to the document’s visual context, we allowed it to infer meaning from the document layout as the representation of the author’s vision rather than relying on lossy, inconsistent textual formats.
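The post doesn’t show the exact prompt or API used, but the per-page step can be sketched as building a multimodal request that pairs a page image with structuring instructions. This is a minimal illustration assuming an OpenAI-style chat API with image inputs; the model name and prompt wording are placeholders, not the production values:

```python
import base64

def build_page_request(page_png: bytes, model: str = "gpt-4o") -> dict:
    """Build an OpenAI-style multimodal chat request for one page image.

    The page is sent as a base64 data URI alongside instructions asking
    the model to tag the structural elements it sees.
    """
    data_uri = "data:image/png;base64," + base64.b64encode(page_png).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": ("Transcribe this manuscript page as XML. "
                              "Tag chapters, paragraphs, dialogue, and scene breaks.")},
                    {"type": "image_url", "image_url": {"url": data_uri}},
                ],
            }
        ],
    }
```

One such request is issued per page image, and the returned XML fragments are merged later in the pipeline.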

Teaching the Model to Write Valid XML

Once the model began generating structured XML, another obstacle emerged: syntax. Tags weren’t closed, nesting was broken, and small syntax issues crashed downstream tools. In other words, the model could write some XML, but not long, syntactically and semantically valid XML documents.

So, we built a self-healing mechanism directly into the pipeline.

When a file failed validation, the parser returned the error message to the model, prompting it to fix its own output. This created a feedback loop of generate–validate–repair, turning each failure into training for the next iteration.

Over time, the number of broken XML files dropped sharply, and the system became self-correcting by design.

Optimising the System for Edge Cases

Processing each page independently worked until we noticed that paragraphs were getting split across pages. Paragraphs started mid-sentence on one page and finished on another, but the model treated them as two separate paragraphs. The result looked structured but wasn’t coherent.

We built a stitching algorithm that merged page-level XML outputs. It detected incomplete tags, matched paragraphs spanning page breaks, and preserved metadata throughout. The final XML is seamless, as if the model had processed the whole book at once.
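The stitching idea can be illustrated with a simplified heuristic: if one page ends with a paragraph lacking sentence-final punctuation and the next page opens with a paragraph starting in lower case, the two halves are joined. This is a sketch of the core merge step only; the production stitcher also matches incomplete tags and preserves metadata:

```python
import re

def stitch_pages(pages: list[str]) -> str:
    """Merge per-page XML fragments, joining paragraphs split across pages.

    Heuristic: a trailing <p> without sentence-final punctuation followed
    by a leading <p> starting in lower case is treated as one paragraph.
    """
    merged = pages[0]
    for page in pages[1:]:
        tail = re.search(r"<p>([^<]*)</p>\s*$", merged)
        head = re.match(r"^\s*<p>([^<]*)</p>", page)
        if (tail and head
                and not tail.group(1).rstrip().endswith((".", "!", "?"))
                and head.group(1).lstrip()[:1].islower()):
            joined = tail.group(1).rstrip() + " " + head.group(1).lstrip()
            merged = merged[:tail.start()] + "<p>" + joined + "</p>"
            page = page[head.end():]
        merged += page
    return merged
```

For example, `["<p>He opened the</p>", "<p>door slowly.</p>"]` would be stitched into a single paragraph, while two complete paragraphs are left untouched.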

Measuring How the System Performs

At Faktion, nothing is finished until it’s measurable, but there was no labelled dataset for evaluation, which meant we couldn’t objectively say whether the outputs were correct and reliable.

As a result, we had to create our own evaluation dataset.

We manually aligned a 60-page sample, pairing each original page with reference XML. On top of that, we defined metrics to make evaluation tangible:

  • Character Error Rate (CER) – text extraction accuracy, including special typographic punctuation
  • Word Error Rate (WER) – word extraction accuracy
  • Structural Distance (SD) – XML hierarchy and tag correctness
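CER and WER are both standard metrics built on edit distance; the same computation over characters or over word sequences yields the two rates. A minimal implementation (the post doesn’t specify the exact library used, and Structural Distance would require an additional edit distance over the XML tag hierarchy, not shown here):

```python
def levenshtein(a, b) -> int:
    """Edit distance between two sequences (characters or words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

A perfect extraction scores 0.0 on both metrics; higher values mean more deviation from the reference text.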

These metrics fed into an internal leaderboard, letting us compare models from various providers, such as OpenAI, Anthropic, Google, and Mistral, across speed, cost, and quality. Every iteration produced hard data, not guesses, allowing us to select the model that delivers the highest-quality outputs quickly and cost-efficiently.

This is how evaluation-driven development became the backbone of the project.

Results & Conclusion

After months of iteration, the system achieved measurable, production-level performance:

  • Word Error Rate – essentially 0%, indicating that even generalist LLMs excel at extracting text from clean page images
  • Character Error Rate – ~6%, with errors mostly concerning special punctuation, such as dashes or language-specific quotes, which are often fixed automatically
  • Structural Distance – ~23 edit operations (for reference, a structured page from a novel typically contains around 40 nested XML tags), mainly concentrated around formatting inside paragraphs, think <i> or <b> tags, rather than high-level document structure like chapters, scenes, and paragraphs

In practice, this means the AI can take a raw manuscript and generate an XML file that’s valid, coherent, and ready for print or audio conversion with minimal or no human post-editing.

  • What once required hours of manual editing can now be automated in minutes with measurable quality assurance through Faktion’s evaluation-driven approach.
  • Moreover, the built-in evaluation module will allow Crius to apply a consistent, principled evaluation process to future model releases and thus benefit from AI advances with no more than a configuration change.

This collaboration with Crius shows how evaluation-driven AI can turn unstructured inputs into structured, production-ready outputs.

In this case, by combining multimodal understanding, self-correcting loops, and continuous evaluation, we bridged the gap between human intent and machine precision.

The same principles extend beyond publishing to any domain where information structure matters as much as content.

At its core, this project proves one thing: structure is the foundation of scalability, whether you’re printing a book, generating an audiobook, or building the next generation of AI-driven content workflows.

Vladimir Dzyuba
ML Engineer