When GPT-5 was announced, it was billed as the closest step yet toward AGI: the pinnacle of model development, arriving after months of anticipation. Marketing materials, demos, and second-hand commentary flooded in, bolstered by impressive results on popular maths and coding benchmarks.
These benchmarks may look convincing, but they measure how a model solves clean, well-defined problems with clear right-or-wrong answers, conditions that rarely exist in the real world. In business environments, queries are messy, data is fragmented, and context is everything.
For an enterprise, it doesn’t matter how a model scores on a public leaderboard. What matters is how it performs in your architecture, with your processes, on your data. That’s the only benchmark that matters.
That is why Evaluation-Driven Development (EDD) is essential. EDD ensures that you build your AI system on a foundation of observability agents, automated evaluation tools, and feedback loops, which lets you benchmark AI models under production-like conditions. It gives you empirical, context-specific results, showing whether a new model actually moves the baseline or just looks good on paper.
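To make that concrete, here is a minimal sketch of what such an evaluation loop can look like. The names (`EvalCase`, `evaluate`) and the callable interfaces are illustrative assumptions, not our production tooling:

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class EvalCase:
    query: str       # a real query taken from production traffic
    reference: str   # the approved/expected answer for that query


def evaluate(
    system: Callable[[str], str],         # the AI system under test (model + agent architecture)
    score: Callable[[str, str], float],   # metric in [0, 1], e.g. a query-understanding judge
    cases: list[EvalCase],
    baseline: float,                      # aggregate score of the current production setup
) -> bool:
    """Run the curated dataset through a candidate system and compare it to the baseline."""
    results = [score(system(case.query), case.reference) for case in cases]
    candidate = mean(results)
    print(f"candidate={candidate:.2f} vs baseline={baseline:.2f}")
    # Promote the new model or architecture only if it actually moves the baseline.
    return candidate > baseline
```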
At Faktion, we have multiple agentic & knowledge assistants in production across various industries, built on our own proprietary Evaluation-Driven Development framework. As a result, we could immediately benchmark GPT-5 across these knowledge assistants to evaluate whether it really moves the baseline in real-world use cases, instead of taking GPT-5’s public performance at face value.
The Test Case
Our research question was simple but critical:
Does GPT-5 meaningfully outperform previous-generation models in real-world enterprise use cases, and if not, which combinations of model and agent architecture actually deliver the best results?
This matters because model choice is rarely made in isolation in enterprise AI. A model’s output is shaped by the agent architecture around it: the orchestration layer that decides how queries are processed, how the model is called, and how answers are synthesised. A model that excels in one architecture may underperform in another.
We tested this in the context of a multi-agent knowledge assistant for a large enterprise, where users ask complex, domain-specific questions and the system must interpret intent and deliver complete, context-aware answers. A misinterpretation can cause delays, misinformation, or costly mistakes, which is why we measured performance using the Query Understanding Score. This metric assesses how well a system grasps user intent and returns a fully relevant, accurate answer.
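The exact implementation of the Query Understanding Score is part of our EDD tooling, but an LLM-as-judge rubric gives a rough idea of how such a metric can be computed. Everything below (the prompt wording and the `judge` callable) is an illustrative assumption:

```python
JUDGE_PROMPT = """\
You are grading a knowledge assistant.
Question: {question}
Assistant answer: {answer}
Reference answer: {reference}

On a scale from 0.0 to 1.0, how well does the answer show that the system understood
the user's intent and returned a fully relevant, accurate response?
Reply with the number only."""


def query_understanding_score(question: str, answer: str, reference: str, judge) -> float:
    """Score a single interaction; the dataset-level metric averages this over all queries.

    `judge` is any callable that sends a prompt to an LLM and returns its text reply.
    """
    reply = judge(JUDGE_PROMPT.format(question=question, answer=answer, reference=reference))
    return max(0.0, min(1.0, float(reply.strip())))
```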

Methodology
- Dataset: A curated dataset with real-world queries sourced from production data.
- Models tested:
- GPT-5 (Standard, Mini)
- GPT-4.1 (Standard, Mini, Nano)
- GPT-4o Mini
- Scope: Query Understanding Score only, isolating pure model comprehension. Context retrieval was deliberately excluded from this study, as retrieval quality is a separate variable we evaluate independently.
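Concretely, the benchmark reduces to a full cross of models and agent architectures over the same query set, roughly like this (the identifier strings are indicative, not our exact configuration):

```python
from itertools import product

# Indicative identifiers for the models and orchestration styles under test.
MODELS = ["gpt-5", "gpt-5-mini", "gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano", "gpt-4o-mini"]
AGENTS = ["basic-agent", "research-agent"]

# Every model runs under every agent architecture on the same query dataset,
# so score differences can be attributed to the model–agent pairing.
TEST_MATRIX = list(product(MODELS, AGENTS))
```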
Why two architectures?
We wanted to determine not just which model performs best, but which performs best for a given orchestration style:
- Basic Agent: Single-pass design. Minimal reasoning, retrieves and answers in one go. Optimised for speed.
- Research Agent: Multi-step reasoning. Breaks queries into sub-questions, runs them in parallel, then synthesises results. More methodical but more dependent on the model’s reasoning style.
By running each model under both architectures, we could see how model behaviour interacts with agent design and whether a smaller, cheaper model could outperform a larger one when paired correctly.
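For illustration, the two orchestration styles can be sketched roughly as follows; the prompts and the `LLM` callable are placeholders, not our actual agent implementations:

```python
import asyncio
from typing import Awaitable, Callable

LLM = Callable[[str], Awaitable[str]]  # any async "prompt in, text out" model call


async def basic_agent(llm: LLM, query: str) -> str:
    # Single pass: answer in one go, optimised for speed.
    return await llm(f"Answer the user's question directly:\n{query}")


async def research_agent(llm: LLM, query: str) -> str:
    # Step 1: break the query into sub-questions.
    plan = await llm(f"List the sub-questions needed to answer:\n{query}")
    sub_questions = [line.strip("- ").strip() for line in plan.splitlines() if line.strip()]
    # Step 2: answer the sub-questions in parallel.
    partials = await asyncio.gather(*(llm(q) for q in sub_questions))
    # Step 3: synthesise the partial results into one final answer.
    findings = "\n".join(partials)
    return await llm(f"Question: {query}\nFindings:\n{findings}\nWrite the final, complete answer.")
```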
The Results: Agent architecture is more important than model size
The numbers tell an important story. When we look at best performance for each pairing, certain combinations absolutely shine. GPT-5 Mini with the Basic Agent, for example, hits the highest score of all tests.

Performance is highly dependent on the compatibility between the model and the agent, not on raw compute power or model size.
Top Performers:
- GPT-5 Mini + Basic Agent → 0.96 (highest score overall)
- GPT-5 + Research Agent → 0.91 (Research Agent champion)
- GPT-4o + Basic Agent → 0.86 (solid speed–quality option)
Ones to Avoid:
- GPT-4o Mini + Basic Agent → 0.67 (major compatibility issues)
Agent-Specific Patterns:
- Basic Agent: More volatile (0.67–0.96) but capable of higher peaks when paired well.
- Research Agent: More consistent (0.78–0.95) but slightly lower peaks.
Note that these results are specific to this use case and dataset; you would not necessarily see similar results in another use case.
Key Learnings
- Model size does not predict performance
Smaller models can outperform larger ones. GPT-5 Mini scored 0.96 with the Basic Agent, beating all larger models tested.
- Performance is about architectural harmony
Our findings show that enterprise AI success comes from model–architecture synergy, not raw compute power. The agent’s design has a stronger impact on performance than the model’s size: the Basic Agent produced more volatile results, while the Research Agent was steadier and more consistent. The right architecture can elevate a smaller model above a larger one.
- Bigger isn’t always better
Larger models can introduce complexity that some agent designs can’t fully leverage. Well-defined architectures can make smaller, more predictable models just as effective, if not more so.
- Design and test the “brain” and “workflow” together
The AI model is the brain; the agent is the workflow. They should be optimised as a unit. That means:
- Testing multiple model sizes in the same workflow.
- Adjusting workflows to exploit model strengths.
- Measuring performance in your production environment.
- Speed–accuracy trade-offs are architecture-dependent
- For latency-sensitive tasks, GPT-4.1-nano in Basic Agent delivered a solid 0.83 score at just 13.41s per query.
- For maximum accuracy regardless of speed, GPT-5 Mini leads in both architectures.
- Evaluation-Driven Development is essential
Without continuous evaluation, you’re guessing which combination works best. With it, you can:
- Identify the optimal model–agent pairing.
- Avoid overspending on underperforming models.
- Continuously monitor and improve each system component.
- Cost savings without performance loss
Top performance doesn’t require the largest or most expensive models; optimally paired smaller models can deliver equal or better results.
- The best approach is modular
Route most sub-tasks to smaller models and reserve the larger models for rare, open-ended reasoning: scale out with small experts rather than up with one monolith (sketched below).
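As a rough illustration of that modular routing idea (the heuristics and model names below are placeholders you would tune against your own evaluation data):

```python
def route_model(sub_task: str) -> str:
    """Illustrative router: default to a small expert, escalate only when needed."""
    open_ended = any(k in sub_task.lower() for k in ("why", "compare", "trade-off", "strategy"))
    long_input = len(sub_task.split()) > 200
    # Hypothetical model names and thresholds; in practice both come from your own evaluations.
    return "large-reasoning-model" if (open_ended or long_input) else "small-expert-model"
```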
Closing Thoughts
Chasing the biggest, newest model does not define success in enterprise AI. Adopting an Evaluation-Driven Development approach improves your chances of success.
Why? Because it gives you the ability to match the right model to your context, your processes, your architecture, and your data, and to iterate fast with confidence.
Without it, you’re flying blind: unable to see how each model performs in your context, which components help or hurt, how quality is evolving, or where cost savings are possible.
This is why many companies fail to productise their AI prototypes: they assume they can simply plug in the latest model and get a production-ready system. Our results show that “bigger” and “newer” don’t necessarily mean “better.”
An Evaluation-Driven Development approach changes that. It embeds measurement into every element of your AI system so you can test multiple models and architectures on your own data and understand which combination delivers the best results for your processes and KPIs. This level of observability enables you to measure real progress from day one and turn every signal (feedback, logs, system outputs, edge cases, user behaviour, and more) into actionable insights that directly inform and prioritise improvement plans.

In enterprise AI, success comes from knowing, not guessing, what works. And you only know if you evaluate. If you want to see how EDD can help you cut through the hype and build AI systems that actually perform in production, explore our Evaluation-Driven Development approach.