Far from the hype-driven conversations of past years, AI Engineer Paris 2025 was focused on engineering reality: how to make AI systems reason better, operate efficiently, and evolve autonomously in production.
Arian Pasquali, our AI Engineer & Researcher, attended this event to explore the latest advancements and uncover where the future of AI engineering is heading.
Organised by Koyeb and hosted at Station F, the conference gathered the sharpest AI engineers, researchers, and founders from across the globe to dissect how the next generation of AI systems is actually being built, deployed, and improved in the wild.
The main message was clear: AI engineering has matured. The bottleneck is no longer in model training; it’s in evaluation, reasoning, and operational excellence. Open-source LLMs have caught up fast, observability and cost optimisation are now critical engineering pillars, and prompts themselves are being optimised with the same rigour once reserved for models.
This blog post dives into the highlights of AI Engineer Paris 2025, unpacking how open models, reasoning agents, cost-optimised infrastructure, and prompt learning are shaping the next generation of AI systems.
Key Trends: Observability & Evaluation
God doesn't play dice with agents. - (probably) Einstein
The conversations at AI Engineer Paris 2025 made one thing clear: evaluation is no longer a one-time exercise; it’s a continuous process.
As AI systems become more capable, the challenge is no longer building functional AI systems; it’s ensuring those systems generate outputs that are correct, reliable, and continuously improving. That’s why observability and evaluation have become the new cornerstones of AI engineering.
That’s why they were among the key focus points of the AI Engineer Paris event, and there have been many significant advancements in this field. Let’s go through some of them together:
1. System Prompt Learning
One of the most interesting trends in the context of observability and evaluation was System Prompt Learning, an approach that treats system prompts like models that can be trained and optimised using reinforcement learning from natural language feedback.

Inspired by traditional reinforcement learning, Prompt Learning uses natural language feedback from an online LLM-as-a-Judge to iteratively improve prompts. The system combines user feedback with the judge model's reasoning to suggest targeted changes to the system prompt. This method addresses key limitations of existing approaches:
- Natural Language Feedback: Instead of numeric scores, the error term is in plain English, allowing for direct feedback that explains exactly why an evaluation failed
- Single Example Optimisation: Unlike traditional Reinforcement Learning, which needs thousands of examples, Prompt Learning can make meaningful changes with just one annotation example
- Instruction Management: All instructions are maintained in English, making it possible to manage them with existing tooling and human review processes.
The Optimisation Loop:
The Prompt Learning process works through a continuous feedback loop:
- Evaluation: Run evaluations on your system prompts
- English Critique: Generate natural language explanations of why evaluations passed or failed
- Meta-Prompt Processing: Use a meta-prompt controller to analyse the critique and determine how to modify the instruction section
- Instruction Updates: Apply targeted changes to specific sections of the prompt (not the entire prompt)
- Continuous Improvement: Repeat the process as new failure modes are discovered
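To make the loop concrete, here is a minimal sketch of what prompt learning can look like in code. It is not Arize's actual implementation: the judge prompt, meta-prompt, and the generic `llm()` callable are all illustrative placeholders.

```python
# Minimal sketch of a prompt-learning loop, assuming a generic `llm(prompt)` helper
# that calls your model of choice. Prompt texts and function names are illustrative.

JUDGE_PROMPT = """You are an evaluator. Given a task, the system prompt used, and the
model's output, explain in plain English whether the output is correct and why."""

META_PROMPT = """You are a prompt engineer. Given a system prompt and a critique of a
failure, propose a targeted edit to the relevant instruction section only.
Return the full revised system prompt."""

def evaluate(system_prompt: str, example: dict, llm) -> str:
    """Run the system prompt on one example and return a natural-language critique."""
    output = llm(f"{system_prompt}\n\nUser: {example['input']}")
    return llm(
        f"{JUDGE_PROMPT}\n\nTask: {example['input']}\n"
        f"System prompt: {system_prompt}\nOutput: {output}\n"
        f"Expected: {example.get('expected', 'n/a')}"
    )

def improve(system_prompt: str, critique: str, llm) -> str:
    """Ask a meta-prompt controller to apply a targeted change to the instructions."""
    return llm(f"{META_PROMPT}\n\nCurrent prompt:\n{system_prompt}\n\nCritique:\n{critique}")

def prompt_learning_loop(system_prompt: str, examples: list[dict], llm, rounds: int = 3) -> str:
    for _ in range(rounds):
        for example in examples:
            critique = evaluate(system_prompt, example, llm)
            if "fail" in critique.lower():  # naive pass/fail check, just for the sketch
                system_prompt = improve(system_prompt, critique, llm)
    return system_prompt
```

In practice, the naive pass/fail check would be replaced by a structured verdict from the judge, but the shape of the loop stays the same: evaluate, critique in English, apply a targeted edit, repeat.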
Real-World Example: Claude's leaked 24,000-token system prompt wasn't accidental - it was meticulously engineered through iterative optimisation processes, showcasing the power of systematic prompt engineering.
According to Arize's research, Prompt Learning can achieve significant improvements with only one-tenth or one-hundredth of the number of labelled examples compared to traditional methods. The approach is also 10-100x faster than current prompt optimisation ecosystems.
This represents a shift toward treating prompts as first-class citizens in AI system optimisation, with implications for:
- Production Systems: Continuous self-healing capabilities for deployed AI applications
- Cost Efficiency: Reduced need for large labelled datasets
- Interpretability: Full audit trails in natural language
- Scalability: Online optimisation that can run alongside production workloads
2. Specialised Agents for Observability & Troubleshooting
A major theme at AI Engineer Paris 2025 was the rise of autonomous agents built to observe, diagnose, and optimise other AI systems. These specialised agents mark the next step in AI operations, moving from passive monitoring to active, intelligent observability. They are designed to:
- Analyse traces and performance metrics
- Assist with evaluation metrics
- Explain issues in natural language
- Automate troubleshooting workflows
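As an illustration of what such a troubleshooting agent could look like, here is a minimal sketch that feeds a single trace to an LLM and asks for a plain-English diagnosis. The trace format, prompt, and model name are assumptions for the example; the OpenAI Python SDK is used purely as a familiar client.

```python
import json
from openai import OpenAI  # any chat-completion client would work

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def diagnose_trace(trace: dict, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to explain, in plain English, what went wrong in a single trace.

    `trace` is assumed to be a structured record of one request: inputs, outputs,
    intermediate tool calls, latencies, and any error messages.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": (
                "You are an observability agent. Analyse the trace below, "
                "identify likely failure causes, and suggest a concrete fix."
            )},
            {"role": "user", "content": json.dumps(trace, indent=2)},
        ],
    )
    return response.choices[0].message.content

# Example: only flag slow or failed requests for diagnosis
# for trace in traces:
#     if trace.get("error") or trace.get("latency_ms", 0) > 5000:
#         print(diagnose_trace(trace))
```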
This new generation of observability agents is expected to become a standard feature of modern AI infrastructure, bringing continuous visibility, faster debugging, and self-correcting capabilities to large-scale deployments.
3. Cost Optimisation
As AI systems grow more complex and evaluation workloads expand, cost optimisation emerges as a critical engineering lever.
At AI Engineer Paris 2025, Cast AI showcased an elegant solution: an intelligent routing layer that dynamically selects the most efficient LLM for each request. Key features:
- Supports OpenAI, Anthropic, Mistral, Gemini, and more
- Acts as a proxy endpoint requiring no code changes
- Can prioritise cost or quality, with observed savings of up to 62-98%
Beyond raw optimisation, it also enables teams to evaluate which models perform best for specific prompts before production deployment, turning cost management into a data-driven part of the evaluation process.
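The "proxy endpoint, no code changes" idea is straightforward to picture: keep your existing SDK and point it at the router. The sketch below uses the OpenAI Python client with a placeholder base_url and model alias; neither reflects Cast AI's actual endpoint or configuration.

```python
from openai import OpenAI

# Drop-in routing pattern: same SDK, same call, only the base_url changes.
# The URL, API key, and "auto" model alias below are hypothetical placeholders.
client = OpenAI(
    base_url="https://llm-router.example.com/v1",  # hypothetical routing proxy
    api_key="YOUR_ROUTER_API_KEY",
)

response = client.chat.completions.create(
    model="auto",  # let the router pick the cheapest model that meets quality targets
    messages=[{"role": "user", "content": "Summarise this support ticket in two lines."}],
)
print(response.choices[0].message.content)
```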
4. MCP Infrastructure
The Model Context Protocol (MCP) ecosystem is rapidly maturing. Many startups were present, showcasing tooling that tackles the challenges of deploying MCP solutions and works around the protocol's limitations.
Key Tools:
- Alpic: A "Heroku for MCP servers" that lets you deploy, manage, and scale MCP servers with GitHub integration. It drastically simplifies putting MCP servers online and ships with built-in analytics that show how people are using your MCP tools.
- Apify: A full-stack web scraping and data extraction platform that offers MCP-compatible web crawlers. It can be configured for targeted data extraction without custom scraping code, making it particularly useful for ongoing data collection projects and feasibility studies.
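To show how little code an MCP server needs these days, here is a toy example of the kind of server such platforms host, written with the official Python SDK (the `mcp` package). The tool itself is purely illustrative.

```python
# A toy MCP server (`pip install mcp`). The word_count tool is only an example.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the number of words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```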
5. Observability and Evaluation Best Practices
Throughout the event, many best practices were discussed, among which the following stood out:
- Automate feedback loops: Thumbs-down events, low-confidence responses, or user corrections should automatically trigger re-evaluation datasets.
- Enable continuous prompt learning: Feed real-world usage data into model improvement cycles to ensure the system evolves with user needs.
- Instrument detailed traces: Log every input, output, and reasoning step to make model behaviour transparent and diagnosable (see the sketch after this list).
- Cluster and analyse errors: Group failure cases by type or context to identify systematic weaknesses.
- Integrate evaluation agents: Use AI-powered evaluators to analyse results, explain issues, and recommend improvements.
- Close the loop in production: Treat evaluation as part of the runtime environment, not just offline testing.
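As a small illustration of the tracing and feedback-loop practices above, here is a minimal sketch: a decorator that logs each call's input, output, and latency as a structured trace, plus a hook that drops thumbs-down events into a re-evaluation dataset. File names and trace fields are illustrative; a real setup would write to an observability backend rather than flat files.

```python
import functools, json, time, uuid

TRACE_LOG = "traces.jsonl"           # illustrative sinks; swap for your
REEVAL_QUEUE = "reeval_queue.jsonl"  # observability backend of choice

def traced(fn):
    """Log every input, output, and latency of a call as a structured trace."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        trace = {"id": str(uuid.uuid4()), "fn": fn.__name__,
                 "input": {"args": args, "kwargs": kwargs}, "start": time.time()}
        trace["output"] = fn(*args, **kwargs)
        trace["latency_s"] = time.time() - trace["start"]
        with open(TRACE_LOG, "a") as f:
            f.write(json.dumps(trace, default=str) + "\n")
        return trace["output"]
    return wrapper

def record_feedback(trace_id: str, thumbs_up: bool) -> None:
    """Thumbs-down events automatically land in a re-evaluation dataset."""
    if not thumbs_up:
        with open(REEVAL_QUEUE, "a") as f:
            f.write(json.dumps({"trace_id": trace_id}) + "\n")
```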
The outcome is a shift toward self-observing, self-improving AI system architectures capable of detecting, explaining, and correcting their own errors over time.
6. Build vs. Buy Strategy
As observability and evaluation play a key role in engineering effective AI solutions, teams are faced with a practical question: should they build these capabilities in-house or rely on external platforms?
For early-stage prototyping, teams are encouraged to leverage existing SaaS tools to move fast, reduce DevOps overhead, and validate ideas quickly. As systems mature and observability needs deepen, organisations should re-evaluate whether to build or self-host key components based on:
- Data sensitivity and compliance requirements
- Operational control and customisation needs
- Long-term maintenance and cost efficiency
This strategic balance between speed and sovereignty is becoming a defining capability of mature AI organisations. The most successful teams are those that start lean but scale deliberately, evolving from SaaS-based experimentation to fully integrated, observable AI infrastructures over time.
Key Trends: State of Open LLMs in 2025
One of the standout presentations at AI Engineer Paris was Vaibhav Srivastav's (Head of Developer Experience at Hugging Face) comprehensive overview of the current state and future trends in Open Large Language Models. This presentation provided crucial insights into where the open-source LLM ecosystem stands and where it's heading.

Trend #1: Reasoning
The most exciting development this year is the reasoning revolution in smaller models:
- Knowledge Distillation Success: Chain-of-thought reasoning from larger models can be effectively distilled into smaller, more efficient models
- Democratisation of Reasoning: High-quality reasoning capabilities are now accessible without massive computational resources
- 25x Efficiency: Smaller models can now beat models 25x their size in specific reasoning tasks
Trend #2: Cost Curves & Context Windows
Today, context windows of 1 million tokens are becoming a standard feature, allowing models to process and reason over entire books, repositories, or long-form conversations without losing coherence or context.
Business Impact:
- Reduced API Costs: Longer context windows mean fewer API calls
- Better Context Retention: More comprehensive understanding of long documents and conversations
- Enhanced Multimodal Applications: Larger context enables better integration of text, images, and other modalities
Trend #3: Ecosystem Maturity
Today, the open LLM ecosystem has reached a new level of maturity:
- Standardised chat templates ensure consistent formatting across models, reducing errors and simplifying integrations.
- 8/4-bit quantisation has become mainstream, cutting memory requirements and enabling deployment on smaller, more efficient hardware.
- Performance optimisations have dramatically improved inference speed, making large-scale applications faster and more responsive.
Meanwhile, modernised tooling now allows developers to deploy, monitor, and manage open models with far less complexity, marking a decisive shift from experimental setups to production-grade reliability.
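To illustrate how routine quantised deployment has become, here is a short sketch that loads an open model in 4-bit using transformers and bitsandbytes and prompts it through its standardised chat template. The model name is just an example, and a CUDA GPU is assumed.

```python
# Loading an open model in 4-bit with transformers + bitsandbytes.
# The model ID is only an example; any recent instruct model with a chat template works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

# Standardised chat templates: the same formatting API works across model families
messages = [{"role": "user", "content": "Explain quantisation in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=60)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```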
Where Proprietary Models Still Lead
Despite significant progress, proprietary models maintain advantages in:
- General Reasoning: Proprietary models still lead by a margin in broad reasoning tasks
- End-to-End Multimodal: Superior integration of text, vision, and other modalities
- Safety & Jailbreak Scaffolding: More robust safety mechanisms and alignment
Future Directions: What's Next?
Areas of Excitement:
- Smaller and Domain-Specific LLMs: Specialised models for specific industries and use cases
- Effort-Based Reasoning: Models that can adjust their reasoning effort based on task complexity
- Better Quantisation: Further improvements in model compression without performance loss
- Sparse (Faster) MoE: Mixture of Experts architectures optimised for speed and efficiency
Deployment Options for Open LLMs
Hugging Face outlined three main approaches:
- Serverless API: Similar to OpenAI/Anthropic, with pay-per-use pricing
- Managed Deployments: One-click deploy on scalable infrastructure
- DIY - Deploy It Yourself: Bare metal deployment for maximum control
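As an example of the serverless option, the sketch below calls a hosted open model through huggingface_hub's InferenceClient; the model name and token are placeholders.

```python
# Serverless, pay-per-use access to an open model (`pip install huggingface_hub`).
# The model ID is only an example.
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # your Hugging Face access token

completion = client.chat_completion(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give me three uses for open LLMs."}],
    max_tokens=200,
)
print(completion.choices[0].message.content)
```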
Implications for AI Engineering
For Practitioners:
- Cost Efficiency: Open LLMs now offer competitive performance at significantly lower costs
- Reasoning Capabilities: Access to advanced reasoning without massive infrastructure
- Flexibility: Multiple deployment options to match different use cases and constraints
For Organisations:
- Reduced Vendor Lock-in: Open models provide alternatives to proprietary solutions
- Customisation: Ability to fine-tune and adapt models for specific needs
- Transparency: Open-source models offer better visibility into model behaviour and capabilities
This presentation highlighted that 2025 represents a pivotal year for open-source LLMs, with the ecosystem reaching maturity levels that make them viable alternatives to proprietary solutions for many use cases.
Key Takeaways of AI Engineer Paris 2025
- AI engineering has matured, shifting focus from model building to evaluation, reasoning, and operational excellence.
- Smaller reasoning models now outperform larger ones through knowledge distillation, driving efficiency and accessibility.
- 1M-token context windows enable long-document comprehension and multimodal reasoning at scale.
- The open LLM ecosystem has reached production readiness with quantisation, standardised templates, and faster inference.
- System prompts and evaluation loops are now continuously optimised using natural-language feedback.
- Specialised observability agents are emerging to monitor, explain, and troubleshoot AI systems autonomously.
- Cost optimisation is essential: smart routing can provide significant savings (up to 62-98% with tools like Cast AI).
- The MCP ecosystem is standardising interoperability for agentic deployments.
- Prompt Learning, the ability to optimise prompts with single examples using natural language feedback, represents a fundamental advancement in AI system optimisation.
- The industry is adopting evaluation-driven development, continuous improvement, and build-vs-buy pragmatism as core engineering principles.
Predicting the Future Trends
AI Engineer Paris 2025 demonstrated that the AI engineering field is maturing rapidly, with clear trends toward:
- Prompt optimisation: Prompt Learning represents a fundamental shift from traditional methods, enabling continuous improvement with minimal data requirements and giving LLMs their "scratchpad" for explicit knowledge management
- Human-like learning patterns: AI systems are evolving beyond static model updates. They are beginning to store and reuse reasoning strategies, much like how humans learn from experience, marking a shift from parameter fine-tuning to knowledge-based adaptation.
- Automated optimisation of AI systems: Future AI systems will be self-healing, capable of detecting new failure modes, diagnosing root causes, and adapting their behaviour without human intervention. This will make large-scale AI operations far more robust and sustainable.
- Specialised tools for specific use cases: A new class of purpose-built agents is emerging, designed to monitor, evaluate, and troubleshoot AI systems automatically. These agents will soon become standard components of major observability and MLOps platforms.
- Robust observability and debugging capabilities: Modern observability frameworks are moving toward full auditability and interpretability, allowing engineers to trace model reasoning, reproduce decisions, and continuously improve system reliability.
At Faktion, we are closely watching these trends and will keep you updated.
Conclusion
AI Engineer Paris 2025 marked a decisive moment in the shift from building AI models to engineering intelligent, self-optimising systems that are accurate, reliable, and able to win user adoption. The frontier of AI engineering is now defined by evaluation, observability, and operational excellence.
Models are becoming lighter yet more capable, prompts are being optimised through natural-language feedback, and specialised agents are taking over evaluation and troubleshooting at scale. Meanwhile, the ecosystem around open-source LLMs has matured to a point where they can compete head-to-head with proprietary models at a fraction of the cost.
The next phase of progress will be about closing the loop: building AI systems that continuously learn from their own outputs, detect and fix errors autonomously, and scale efficiently through modular, observable infrastructure.
At Faktion, we see this evolution as the blueprint for the next era of AI product development, one where engineering discipline meets adaptive intelligence, and where systems don’t just run, but continuously get better.