This weekend I worked on something that has been bugging me for a while in my RFx multi-agent orchestration pipeline: prompt injection security.
A risk with any system that ingests uploaded documents and passes extracted content through LLM-driven workflows is that the document itself may contain prompt injection attacks. In other words, the RFx is not always just a source of requirements - it could also be a delivery mechanism for malicious instructions aimed at the downstream agents.
So this weekend I added a dedicated Security Agent into the pipeline.
It sits after requirements extraction into JSON and before the rest of the orchestration continues. Its role is to inspect the extracted structure for signs of prompt injection or other attempts to manipulate agent behaviour. If something looks wrong, the orchestrator stops the workflow.
The design uses a two-phase analysis approach:
Phase 1: Per-node scanning
Each JSON node is scanned individually to catch targeted injection attempts hidden in specific fields.
Phase 2: Full-structure scanning
The complete JSON structure is then analyzed as a whole to catch attacks that are distributed across multiple fields or rely on payload splitting.
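As a rough sketch of the two-phase flow, here is what it could look like in Python. A small marker list stands in for the real LLM-based analysis, and all names are illustrative, not the actual implementation:

```python
# Hypothetical marker list; the real Security Agent uses LLM analysis,
# not simple substring matching.
SUSPICIOUS_MARKERS = ["ignore previous instructions", "system override"]

def iter_string_nodes(node, path="$"):
    """Yield (json_path, text) for every string leaf in the structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_string_nodes(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from iter_string_nodes(value, f"{path}[{i}]")
    elif isinstance(node, str):
        yield path, node

def scan_document(doc):
    findings = []
    # Phase 1: inspect each node in isolation to catch targeted injections.
    for path, text in iter_string_nodes(doc):
        for marker in SUSPICIOUS_MARKERS:
            if marker in text.lower():
                findings.append({"path": path, "marker": marker, "phase": 1})
    # Phase 2: inspect the concatenated structure to catch payloads
    # split across multiple fields.
    full_text = " ".join(text for _, text in iter_string_nodes(doc)).lower()
    for marker in SUSPICIOUS_MARKERS:
        if marker in full_text and not any(f["marker"] == marker for f in findings):
            findings.append({"path": "$", "marker": marker, "phase": 2})
    return findings
```

The point of running both phases: a payload split as "ignore previous" in one field and "instructions" in another passes every per-node check, but surfaces once the fields are analyzed together.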
The agent currently checks across seven threat vectors:
Direct instruction override
Phrases like "ignore previous instructions" or "system override"
Roleplay and virtualisation
Attempts to make the model adopt a persona or simulate a different system context
Obfuscation and smuggling
Base64, hex, Unicode tricks, and other ways of hiding intent
Payload splitting
Malicious instructions broken across multiple JSON nodes
Context window escape
Special characters or delimiters intended to break parsing boundaries
Indirect injection / data poisoning
Harmful instructions hidden in otherwise plausible business content
Many-shot / flooding attacks
Repetitive content intended to overwhelm or steer the model
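Two of these vectors lend themselves to cheap deterministic pre-checks before any LLM pass. A sketch for Base64 smuggling (obfuscation) and flooding detection, with thresholds that are my placeholders rather than the agent's actual tuning:

```python
import base64
import binascii
import re

def decode_if_base64(text):
    """Try to reveal a Base64-smuggled payload (obfuscation vector)."""
    candidate = text.strip()
    # Only consider strings that look like a plausible Base64 blob.
    if not re.fullmatch(r"[A-Za-z0-9+/=]{16,}", candidate):
        return None
    try:
        decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    return decoded  # feed this back through the normal injection checks

def looks_like_flooding(text, max_repeats=10):
    """Flag many-shot / flooding: the same line repeated far beyond normal."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    most_common = max(lines.count(ln) for ln in set(lines))
    return most_common > max_repeats
```

Anything `decode_if_base64` recovers would then be re-scanned like ordinary field text, so an encoded "ignore previous instructions" gets caught by the same direct-override check.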
If malicious content is detected, the orchestrator automatically aborts the workflow and produces a security audit report in JSON, including severity, confidence scores, and flagged paths. That gives both traceability and something concrete to inspect rather than just a pass/fail result.
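For illustration, such a report could be assembled along these lines. Field names and the severity scale are my placeholders, not the exact schema:

```python
import json
from datetime import datetime, timezone

SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}

def build_audit_report(findings):
    """Assemble a security audit report; abort on high-severity findings."""
    worst = max((SEVERITY_RANK[f["severity"]] for f in findings), default=0)
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "decision": "abort" if worst >= SEVERITY_RANK["high"] else "continue",
        "overall_severity": next(
            (name for name, rank in SEVERITY_RANK.items() if rank == worst), "none"
        ),
        "findings": [
            {
                "path": f["path"],              # JSON path of the flagged node
                "vector": f["vector"],          # one of the seven threat vectors
                "severity": f["severity"],
                "confidence": f["confidence"],  # 0.0 - 1.0
            }
            for f in findings
        ],
    }

report = build_audit_report([
    {"path": "$.requirements[3].notes", "vector": "direct_instruction_override",
     "severity": "high", "confidence": 0.92},
])
print(json.dumps(report, indent=2))
```

Keeping the flagged paths in the report is what makes the abort decision reviewable: a human can jump straight to the offending node instead of re-reading the whole document.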
With agentic systems, it is easy to get excited about orchestration, reasoning, tool use, and automation. But once you start building systems that ingest third-party documents and act on them, defensive design has to become part of the architecture itself.
Video of testing attached - 100% local, uses Ollama, IBM Granite4, Microsoft Agent Framework on Amazon Web Services (AWS)
