AI Agents in Production: Why a Successful Workflow Can Deliver a Bad Result

Author: Équipe Castelis

Reading time: 10 min

The real issue: AI agents don’t only raise a generation quality problem. They raise an operational control problem: can they produce within the right scope, with the right guardrails, and under proper oversight?
What businesses must understand: an AI agent can produce a technically valid but business-wrong deliverable, and the system will mark it “completed successfully.”
The incident: during an automatic update of clawpilot.eu, the agent team proposed 24 content blocks instead of the 15 planned, with the tone drifting from marketing toward technical documentation. Pipeline “completed successfully,” result unpublishable.
The redesign: from 3 to 8 explicit steps, organized as a DAG, with separation between product analysis, English drafting, market-specific localization, and automated editorial validation.
Three levels of guardrails: content (register, forbidden phrasing, market-specific briefs), action (per-agent limited rights, no automatic publishing), observability (execution report, traceability, human validation).
Scope of application: multilingual product pages (marketing), multi-audience release notes (product/SaaS), knowledge bases (support/KM).
What Castelis brings: designing the target pipeline, identifying business risks, defining guardrails, integrating agents into existing workflows, making the system observable.

AI agents • governance • generative AI • marketing • pipeline • lessons learned (REX)

AI agents don’t only raise a generation quality problem. They raise an operational control problem. As soon as they modify content, trigger workflows, or interact with business systems, the real question is no longer just “can they produce?” but “can they produce within the right scope, with the right guardrails, and under proper oversight?”

This field report, based on ClawPilot, our open-source multi-agent AI orchestration platform, shows concretely what that question covers. For marketing, product, support, and CTO teams, the topic is no longer simply whether an AI agent can write, translate, or summarize. The topic is which architecture to embed it in so that it delivers value without degrading the brand, the product, or the customer relationship.

AI agents are no longer used solely to produce a text, summarize a document, or answer a question. They are beginning to operate within complete operational chains: analyzing product changes, detecting what deserves to be highlighted, producing a content proposal, adapting it across several languages, reviewing it, then submitting it for approval.

ClawPilot, distributed under the MIT license, is used to maintain the clawpilot.eu website. The site presents the platform as a structured landing page: a grid of 15 content blocks, available in 5 languages: English, French, Spanish, German, and Japanese.

The goal appeared simple: when a product change genuinely visible to the user is shipped in ClawPilot, the site should be able to update automatically. Not through a single agent tasked with doing everything. Through a team of specialized agents, each with its own role, its own tools, and its own permissions.

An automatic update reminded us of an essential rule: an AI pipeline can succeed technically and fail functionally.

On April 20, 2026, an automatic update run by the editorial maintenance pipeline ended with a reassuring status: “completed successfully.”

On the surface, everything looks fine. The pipeline analyzed recent product changes, prepared a site update, produced the localized content, and submitted a deliverable for review. The system had therefore fulfilled its apparent mission: transforming product activity into a content proposal.

Except that the deliverable proposed by the agent team contained 24 content blocks, where the page was designed for 15.

The landing page had been designed as a short, readable, balanced grid. With 24 blocks, the site no longer told a clear value proposition. It stacked updates. The page looked more like a product changelog than a page designed to persuade.

More problematic still: some descriptions reproduced development vocabulary almost verbatim. The tone had drifted from product marketing toward technical documentation. Internal elements, understandable to a developer, appeared as if they were user benefits. A corrective improvement had even been presented as a new marketing feature.

The pipeline had not hallucinated. It had not invented a non-existent feature. It had done something more subtle: it had confused a technically real modification with a publishable product promise.

This is where the blind spot of generative agents in production lies.

An agent can follow the instruction, modify the right files, produce a valid format for the CMS, prepare a deliverable for review, and finish without error. That does not mean the result is correct from a business standpoint. “Completed successfully” does not mean “publishable.”

In a traditional organization, this type of gap is often caught by a human step: a marketing manager, a product owner, an editor, or a brand manager. In an agentic pipeline, one must decide where to place this judgment capability: in the instructions, in the permissions, in the tests, in the orchestration, or in a separate validation step.

The short answer: everywhere. But not in the same way.

Key takeaways for businesses

An AI agent can produce a technically valid but business-wrong deliverable, and the system will mark it “completed successfully.”
Prompts are not enough: a control architecture is required (scoping, governance, observability, validation).
The value of an agentic pipeline depends as much on its guardrails as on its generation capabilities.
Concrete risks without proper scoping: erroneous publication, brand drift, multilingual inconsistency, loss of editorial control, exploding review costs.

The Dev / Marketing Gap in an AI Pipeline

This automatic update did not fail because AI agents write poorly. It failed because the same signal had two different meanings depending on which team was reading it.

For a development team, a technical change describes a precise system modification. It must be factual, traceable, sometimes very detailed. It serves to understand what changed in the product, why it changed, and within what scope.

For a marketing or product team, a landing page follows a different logic. It does not describe the implementation. It clarifies the usage value. It selects, simplifies, and prioritizes. It transforms a product change into a benefit understandable to a visitor.

These two worlds are not opposed. But they do not optimize for the same thing.

A technical change might be worded like this:

Before: “The schema adds the instance_shared_files table with FTS5 mirror.”

A marketing formulation should instead become:

After: “Agents share a common file space to collaborate more effectively on the same objective.”

The first formulation is useful for a maintainer. The second is useful for a user. Between the two, translation is not enough. Interpretation is required.

This is precisely what the initial pipeline was doing poorly. It analyzed the product, selected the changes, drafted the content, and prepared the site update in one overly compact chain. It had enough context to act, but not enough constraints to arbitrate.

The problem was therefore not only editorial. It was architectural.

An agent that reads technical changes tends to overestimate the importance of what it sees. An agent that writes marketing copy must on the contrary filter out what doesn’t directly concern the user. An agent that localizes must not translate literally; it must adapt the register, usage conventions, and terminology to the relevant market.

The initial pipeline mixed these responsibilities. The incident made them visible.

The fix consisted of moving from one agent capable of doing everything to a team of agents intentionally incapable of doing everything alone.

The Final Architecture: 8 Steps, 3 Levels of Governance

The redesign of the Daily Web Maintenance pipeline rests on a simple principle: each step must have a single objective, a clear responsibility, and a limited scope of action.

The pipeline runs on the web-maintenance instance of ClawPilot. It is organized as a DAG (directed acyclic graph): steps are linked by dependencies, independent steps can run in parallel, and no step ever loops back to an earlier one. The term is technical, but the idea is simple: each agent intervenes at the right moment, with a precise role, and the pipeline maintains a controlled progression through to final validation.

Figure: the ClawPilot Daily Web Maintenance pipeline as a DAG: analysis, English drafting (ship-draft), four parallel localizations (FR/ES/DE/JA), review and validation, notification.
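To make the DAG idea concrete, here is a minimal sketch using Python's standard `graphlib` module. The step names come from the article; the dictionary layout and the `execution_waves` helper are assumptions made for illustration and do not describe how ClawPilot actually stores or schedules its pipelines.

```python
from graphlib import TopologicalSorter

# Hypothetical encoding: each step maps to the set of steps it depends on.
# Step names are taken from the article; the structure is an assumption.
PIPELINE = {
    "analysis": set(),
    "ship-draft": {"analysis"},
    "localize-fr": {"ship-draft"},
    "localize-es": {"ship-draft"},
    "localize-de": {"ship-draft"},
    "localize-ja": {"ship-draft"},
    "review-and-pr": {"localize-fr", "localize-es", "localize-de", "localize-ja"},
    "notify": {"review-and-pr"},
}


def execution_waves(graph):
    """Group steps into waves; steps in the same wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(PIPELINE), start=1):
        print(number, wave)
```

With these dependencies the eight steps resolve into five waves, the third of which contains the four localization steps running side by side.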

The 8 Pipeline Steps

The first step, analysis, is handled by the repo-analyst agent. It analyzes over 50 recent changes since the last processed state, then classifies them into two categories: candidates for product highlighting and maintenance notes. Only changes genuinely visible to the user can be retained.

The second step, ship-draft, is handled by the site-maintainer agent. Its role is intentionally limited: it edits only the English version of the landing page and prepares a working branch. It does not touch any other language.

The four following steps, localize-fr, localize-es, localize-de, and localize-ja, are executed in parallel. They are assigned to a localizer agent, instantiated with a brief specific to each market. French uses the formal vous form. Spanish favors the informal tú. German applies the formal Sie. Japanese adopts the です・ます register. In each case, the instruction is explicit: localization does not mean literal translation.

The review-and-pr step is assigned to the content-writer agent. Its name is technical, but its role is simple: verify before submission. It does not rewrite the entire site. It validates text parity across the 5 languages, scans for forbidden phrasing, checks editorial constraint compliance, and prepares the deliverable for review.

Finally, the notify step is handled by the reporter agent, which sends a report in the team coordinator’s language.

Step	Agent	Responsibility	Intentional limit
analysis	repo-analyst	Read product changes, classify them, identify user-visible changes	Does not draft any marketing content
ship-draft	site-maintainer	Update site structure and English content	Does not touch other languages
localize-fr	localizer FR	Adapt content for the French market	Only modifies French
localize-es	localizer ES	Adapt content for the Spanish-speaking market	Only modifies Spanish
localize-de	localizer DE	Adapt content for the German-speaking market	Only modifies German
localize-ja	localizer JA	Adapt content for the Japanese market	Only modifies Japanese
review-and-pr	content-writer	Check consistency, content keys, forbidden phrasing, and prepare the deliverable	Does not produce the initial content alone
notify	reporter	Summarize the update result	Modifies neither code nor content
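The parity check performed at the review-and-pr step could look like the sketch below. It assumes, hypothetically, that each locale's texts are stored as a mapping from content key to string and that English is the reference; the article does not describe the actual storage format or the exact rules applied.

```python
LOCALES = ("en", "fr", "es", "de", "ja")  # the five languages named in the article


def parity_issues(content, reference="en"):
    """Return human-readable parity problems across locales.

    `content` maps a locale code to a {content_key: text} dictionary.
    This layout is an assumption made for the example.
    """
    issues = []
    reference_keys = set(content.get(reference, {}))
    for locale in LOCALES:
        texts = content.get(locale)
        if texts is None:
            issues.append(f"{locale}: locale missing")
            continue
        keys = set(texts)
        for key in sorted(reference_keys - keys):
            issues.append(f"{locale}: missing key {key}")
        for key in sorted(keys - reference_keys):
            issues.append(f"{locale}: unexpected key {key}")
        for key in sorted(keys & reference_keys):
            if not texts[key].strip():
                issues.append(f"{locale}: empty text for {key}")
    return issues
```

An empty list means every locale exposes the same keys as the English version with non-empty text. It says nothing about translation quality, which remains the job of the market briefs and of the human reviewer.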

This architecture introduces three levels of governance.

The first level is selection: not every product change becomes marketing content. A fix, an internal optimization, or a technical refactoring are not automatically selling points.

The second level is separation of powers: the agent that analyzes is not the one that localizes; the one that drafts English is not the one that validates; the one that notifies has no write access.

The third level is automated editorial validation: before delivery, the pipeline verifies that form and content constraints are met.

It is this transition from raw autonomy to bounded autonomy that transformed the pipeline.
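The first level, selection, can be pictured with a deliberately simple rule-based stand-in. In the pipeline described here the classification is performed by the repo-analyst agent, not by fixed rules; the `Change` fields, the `kind` values, and the `MAINTENANCE_KINDS` set below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    summary: str
    kind: str           # hypothetical values: "feature", "fix", "refactor", "optimization"
    user_visible: bool  # assumed to be determined upstream by the analysis step


# Assumption: these kinds are never selling points on their own.
MAINTENANCE_KINDS = {"fix", "refactor", "optimization"}


def classify(changes):
    """Split changes into highlight candidates and maintenance notes."""
    candidates, maintenance = [], []
    for change in changes:
        if change.user_visible and change.kind not in MAINTENANCE_KINDS:
            candidates.append(change)
        else:
            maintenance.append(change)
    return candidates, maintenance
```

The point of the sketch is the default: a change is a maintenance note unless it is both visible to the user and more than a fix or an internal improvement.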

Three Levels of Guardrails

After the incident, the fix did not consist of “better prompting” a single agent. It consisted of redesigning the system. The mechanisms put in place fall into three distinct categories, which are worth naming explicitly because they address three different risks: what the agent produces, what it is allowed to do, and what the organization can see of its work.

Content Guardrails

This level governs what the agent can write: expected number of blocks, editorial register, forbidden phrasing, consistency with the marketing promise.

The first mechanism is structural: the landing page remains a 15-block grid. Agents must not indefinitely add new entries. They must merge, replace, or improve existing ones. Before, each significant change could become a block. After, the pipeline must arbitrate: is this a new product promise or an improvement to integrate into an existing one? This guardrail limits the catalogue effect, common in content generated from product activity.
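A structural guardrail of this kind can be expressed as a hard limit rather than as an instruction in a prompt. The sketch below assumes a hypothetical representation of the grid as a mapping from block key to text; only the limit of 15 comes from the article.

```python
MAX_BLOCKS = 15  # grid size stated in the article


def apply_proposal(blocks, key, text):
    """Return a new grid with `key` replaced or added, never exceeding MAX_BLOCKS.

    `blocks` maps a block key to its text; the keys are hypothetical.
    The input grid is left untouched.
    """
    updated = dict(blocks)
    if key not in updated and len(updated) >= MAX_BLOCKS:
        raise ValueError(
            f"grid is full ({MAX_BLOCKS} blocks): merge into or replace an existing block"
        )
    updated[key] = text
    return updated
```

Replacing an existing block always succeeds, while adding a sixteenth one fails loudly, which forces the arbitration described above instead of silently growing the page.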

The second mechanism is a forbidden phrasing list. The pipeline now scans descriptions for signals of overly technical content: internal paths, component names, implementation details, or phrasing directly drawn from development. The goal is not to censor the technical, but to prevent it from appearing in the wrong place.

Before: “The schema adds the instance_shared_files table with FTS5 mirror.”
After: “Agents share a common file space to collaborate more effectively on the same objective.”

The first sentence describes an implementation. The second expresses a product capability. The difference is fundamental.
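A forbidden phrasing scan can be approximated with a few regular expressions. The patterns below are hypothetical examples of "too technical" signals (identifiers with underscores, file paths, implementation terms); the actual list used by the pipeline is not published in this article.

```python
import re

# Hypothetical patterns; the real forbidden phrasing list is not given in the article.
FORBIDDEN_PATTERNS = {
    "identifier with underscores": re.compile(r"\b[a-z]+(?:_[a-z0-9]+)+\b"),
    "file path": re.compile(r"\b[\w.-]+/[\w.-]+\.[a-z]{1,5}\b"),
    "implementation term": re.compile(
        r"\b(?:schema|table|FTS5|refactor\w*|endpoint)\b", re.IGNORECASE
    ),
}


def scan(text):
    """Return the sorted names of the forbidden patterns found in `text`."""
    return sorted(
        name for name, pattern in FORBIDDEN_PATTERNS.items() if pattern.search(text)
    )


before = "The schema adds the instance_shared_files table with FTS5 mirror."
after = (
    "Agents share a common file space to collaborate more effectively "
    "on the same objective."
)
print(scan(before))  # flags the identifier and the implementation terms
print(scan(after))   # flags nothing
```

A scan like this only catches surface signals. It complements the editorial brief rather than replacing it, since a sentence can be free of technical vocabulary and still fail to express a user benefit.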

The third mechanism is market specialization. Each localizer agent has a brief of around 90 lines per language, defining the register, forms of address, calques to avoid, product terminology, and preferred phrasing. This prevents a classic pitfall: producing four grammatically correct translations that are culturally weak. Localizing a landing page is not converting words; it means preserving intent, commercial energy, and product precision in each market.
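As an illustration of market specialization, the sketch below reduces each brief to a locale and a register. The registers are the ones named in the article; the `MarketBrief` structure and the `build_prompt` helper are assumptions, and a real brief of around 90 lines would also carry terminology, calques to avoid, and preferred phrasing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketBrief:
    locale: str
    register: str
    instruction: str = "Adapt for the market; do not translate literally."


# Registers come from the article; everything else is a simplified stand-in.
BRIEFS = {
    "fr": MarketBrief("fr", "formal 'vous'"),
    "es": MarketBrief("es", "informal 'tú'"),
    "de": MarketBrief("de", "formal 'Sie'"),
    "ja": MarketBrief("ja", "です・ます (polite) register"),
}


def build_prompt(locale, english_text):
    """Assemble the instruction a localizer agent might receive (illustrative only)."""
    brief = BRIEFS[locale]
    return (
        f"Target locale: {brief.locale}\n"
        f"Register: {brief.register}\n"
        f"Rule: {brief.instruction}\n"
        f"Source (English): {english_text}"
    )
```

Keeping the brief as data rather than burying it in a shared prompt makes it reviewable and versionable per market, which is what allows it to be adjusted over time.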

Action Guardrails

This level governs what the agent is allowed to do: limited rights per agent, clear scope of intervention, no automatic publishing, separation between production and validation.

The guiding principle is simple: no agent has full authority. The site-maintainer only touches English. The localizer agents only touch their language. The content-writer verifies and prepares the deliverable for review: it does not produce the initial content alone. The reporter informs and writes neither code nor content. This separation strongly reduces the risk that a misunderstanding upstream propagates unchecked all the way to production.
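One way to enforce this separation is to check each agent's changes against an explicit write scope. The sketch below assumes a hypothetical repository layout with one directory per language under `content/`, and lists only the agents whose limits are stated above; any agent not listed is treated as having no write access.

```python
from fnmatch import fnmatch

# Hypothetical path layout and agent identifiers, for illustration only.
WRITE_SCOPES = {
    "site-maintainer": ["content/en/*"],
    "localizer-fr": ["content/fr/*"],
    "localizer-es": ["content/es/*"],
    "localizer-de": ["content/de/*"],
    "localizer-ja": ["content/ja/*"],
    "reporter": [],  # informs only, writes neither code nor content
}


def out_of_scope(agent, changed_paths):
    """Return the paths `agent` modified outside its allowed scope."""
    allowed = WRITE_SCOPES.get(agent, [])  # unknown agents get no write access
    return [
        path
        for path in changed_paths
        if not any(fnmatch(path, pattern) for pattern in allowed)
    ]
```

A non-empty result would be grounds for rejecting the step before its output reaches the next agent, so that an upstream misunderstanding is stopped where it occurs.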

The pipeline does not publish. It prepares a deliverable for review. Final validation remains a human step, which aligns the system with the reality of an enterprise editorial chain: an agent can accelerate production, but cannot substitute for editorial responsibility.
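The "no automatic publishing" rule can be modeled as a gate that only a human approval opens. This is a minimal sketch: the `Deliverable` type, the `agent:` prefix convention for agent identities, and both functions are hypothetical and do not describe ClawPilot's actual review mechanism.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deliverable:
    branch: str
    approved_by: Optional[str] = None  # filled in only by a human reviewer


def approve(deliverable, reviewer):
    """Record a human approval; identities prefixed with 'agent:' are refused."""
    if reviewer.startswith("agent:"):  # hypothetical naming convention
        raise PermissionError("agents cannot approve a deliverable")
    deliverable.approved_by = reviewer


def publish(deliverable):
    """Publish only if a human approval has been recorded."""
    if deliverable.approved_by is None:
        raise PermissionError("human approval is required before publishing")
    return f"published {deliverable.branch} (approved by {deliverable.approved_by})"
```

The design choice is that publishing is impossible by construction without an approval, rather than being discouraged by an instruction the agent could misread.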

Observability Guardrails

This level governs what the organization can see and trace of the agents’ work: end-of-execution report, change traceability, explicitness of arbitrations, human validation before publishing.

Each execution produces a summary report sent to the coordinator by the reporter agent: retained changes, discarded changes, modified blocks, arbitration points. The goal is not only to notify pipeline completion, but to make the agent’s decisions readable by a human, so that a reviewer knows why a product change became a marketing block rather than another.
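The four elements of the report map naturally onto a small data structure. The sketch below is an assumption about shape only: the section names follow the article, while the `ExecutionReport` class and its plain-text rendering are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionReport:
    retained: list = field(default_factory=list)
    discarded: list = field(default_factory=list)
    modified_blocks: list = field(default_factory=list)
    arbitrations: list = field(default_factory=list)

    def render(self):
        """Render the report as plain text, one section per category."""
        sections = [
            ("Retained changes", self.retained),
            ("Discarded changes", self.discarded),
            ("Modified blocks", self.modified_blocks),
            ("Arbitration points", self.arbitrations),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title} ({len(items)})")
            lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)
```

Listing discarded changes alongside retained ones is what lets a reviewer challenge a selection decision, not just its outcome.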

Changes go through a versioned deliverable before publishing. A human can read the diff, compare the versions language by language, and roll back if needed. This traceability is what transforms an autonomous agent into an auditable agent.

Without observability, an agentic pipeline becomes a black box: it can produce a correct result for weeks, then drift without any detectable signal. With observability, each execution leaves a usable trace that serves to review decisions, measure quality over time, and adjust the briefs.

Between the incident and subsequent updates, the pipeline moved from a risky scenario (an automatic update that could degrade the published page) to a governed scenario: a review-ready deliverable, produced by a team of specialized agents, governed by three readable levels of guardrails.

That is an architectural difference, not just a matter of prompt engineering.

This lessons-learned account illustrates an essential point for businesses: deploying AI agents is not just about automating a task. It requires identifying business risks, defining scopes of action, designing the right control points, and making the system observable. This is precisely the type of scoping work on which Castelis supports its clients.

What This Pattern Enables Elsewhere

The clawpilot.eu case is an intentionally concrete example: continuous product activity, a landing page, five languages, 75 short texts (15 blocks in 5 languages) to keep consistent. But the pattern (decompose, restrict, validate, trace) applies to any workflow where a technical signal must become business content without drifting. Here are three examples that transpose these lessons to other contexts.

Marketing Teams: Multilingual Product Pages

Maintaining multilingual product pages without tone, promise, or positioning drift. An agent analyzes catalog changes, another drafts the pivot version, localizer agents adapt by market, a validation agent checks terminological consistency and conformity with validated messaging. Content guardrails protect the brand; action guardrails prevent a product sheet from being published without marketing review; observability allows quality to be measured over time and briefs to be adjusted.

Product Teams / SaaS: Multi-Audience Release Notes

Transforming technical changelogs into release notes adapted for different audiences: customers, support, sales, documentation. The same signal (a commit, a ticket, a merged PR) must feed a developer note, a product announcement, a support message, and a documentation update. Each channel has its own register, level of detail, and constraints. A single agent asked to serve every channel tends to miss the register of at least one of them; a pipeline of agents specialized by channel, governed by content and action guardrails, is better placed to produce the right message in the right place, with a readable trace for the product owner.

Support / Knowledge Management: Knowledge Base

Updating a knowledge base after each product change, with validation before publishing. An agent detects impacted articles, another proposes modifications, a validation agent checks consistency with official documentation and non-regression on critical answers. Observability guardrails are decisive here: a support team relying on an outdated or divergent knowledge base loses customer trust within weeks.

In all cases, the key point remains the same: do not ask a single agent to understand the product, arbitrate business value, write, localize, validate, and publish. Castelis intervenes precisely in this area: designing the target pipeline, identifying business risks, defining guardrails, integrating agents into existing workflows, and making the system observable.

The goal is not to replace teams. It is to transform repetitive, fragile, and multilingual operations into controlled, auditable, business-adapted agentic chains.

Conclusion

The automatic update incident on clawpilot.eu did not show that AI agents were incapable of producing marketing content. It showed that they can do it too quickly, too literally, and without discernment if the architecture does not impose the right guardrails.

For businesses, the question is therefore not to add AI agents everywhere. The question is to identify the workflows where automation brings a real gain, to define acceptable risks, then to design an architecture where each agent acts within a controlled, observable, and validatable scope.

The question is no longer whether your agents can generate content.

The question is whether you know how to prevent them from doing it badly.

This is precisely the role of operational AI scoping: moving from a promising experiment to a system usable in production.