This weekend I worked on something that has been bugging me for a while in my RFx multi-agent orchestration pipeline: prompt injection security.
A risk with any system that ingests uploaded documents and passes extracted content through LLM-driven workflows is that the document itself may contain prompt injection attacks. In other words, the RFx is not always just a source of requirements - it could also be a delivery mechanism for malicious instructions aimed at the downstream agents.
So this weekend I added a dedicated Security Agent into the pipeline.
It sits after requirements extraction into JSON and before the rest of the orchestration continues. Its role is to inspect the extracted structure for signs of prompt injection or other attempts to manipulate agent behaviour. If something looks wrong, the orchestrator stops the workflow.
The design uses a two-phase analysis approach:
Phase 1: Per-node scanning
Each JSON node is scanned individually to catch targeted injection attempts hidden in specific fields.
Phase 2: Full-structure scanning
The complete JSON structure is then analyzed as a whole to catch attacks that are distributed across multiple fields or rely on payload splitting.
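As a rough sketch of the two-phase flow, here is what it could look like in Python. A small marker list stands in for the real LLM-based analysis, and all names are illustrative, not the actual implementation:

```python
# Hypothetical marker list; the real Security Agent uses LLM analysis,
# not simple substring matching.
SUSPICIOUS_MARKERS = ["ignore previous instructions", "system override"]

def iter_string_nodes(node, path="$"):
    """Yield (json_path, text) for every string leaf in the structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_string_nodes(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from iter_string_nodes(value, f"{path}[{i}]")
    elif isinstance(node, str):
        yield path, node

def scan_document(doc):
    findings = []
    # Phase 1: inspect each node in isolation to catch targeted injections.
    for path, text in iter_string_nodes(doc):
        for marker in SUSPICIOUS_MARKERS:
            if marker in text.lower():
                findings.append({"path": path, "marker": marker, "phase": 1})
    # Phase 2: inspect the concatenated structure to catch payloads
    # split across multiple fields.
    full_text = " ".join(text for _, text in iter_string_nodes(doc)).lower()
    for marker in SUSPICIOUS_MARKERS:
        if marker in full_text and not any(f["marker"] == marker for f in findings):
            findings.append({"path": "$", "marker": marker, "phase": 2})
    return findings
```

The point of running both phases: a payload split as "ignore previous" in one field and "instructions" in another passes every per-node check, but surfaces once the fields are analyzed together.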
The agent currently checks across seven threat vectors:
Direct instruction override
Phrases like "ignore previous instructions" or "system override"
Roleplay and virtualisation
Attempts to make the model adopt a persona or simulate a different system context
Obfuscation and smuggling
Base64, hex, Unicode tricks, and other ways of hiding intent
Payload splitting
Malicious instructions broken across multiple JSON nodes
Context window escape
Special characters or delimiters intended to break parsing boundaries
Indirect injection / data poisoning
Harmful instructions hidden in otherwise plausible business content
Many-shot / flooding attacks
Repetitive content intended to overwhelm or steer the model
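Two of these vectors lend themselves to cheap deterministic pre-checks before any LLM pass. A sketch for Base64 smuggling (obfuscation) and flooding detection, with thresholds that are my placeholders rather than the agent's actual tuning:

```python
import base64
import binascii
import re

def decode_if_base64(text):
    """Try to reveal a Base64-smuggled payload (obfuscation vector)."""
    candidate = text.strip()
    # Only consider strings that look like a plausible Base64 blob.
    if not re.fullmatch(r"[A-Za-z0-9+/=]{16,}", candidate):
        return None
    try:
        decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    return decoded  # feed this back through the normal injection checks

def looks_like_flooding(text, max_repeats=10):
    """Flag many-shot / flooding: the same line repeated far beyond normal."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    most_common = max(lines.count(ln) for ln in set(lines))
    return most_common > max_repeats
```

Anything `decode_if_base64` recovers would then be re-scanned like ordinary field text, so an encoded "ignore previous instructions" gets caught by the same direct-override check.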
If malicious content is detected, the orchestrator automatically aborts the workflow and produces a security audit report in JSON, including severity, confidence scores, and flagged paths. That gives both traceability and something concrete to inspect rather than just a pass/fail result.
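For illustration, such a report could be assembled along these lines. Field names and the severity scale are my placeholders, not the exact schema:

```python
import json
from datetime import datetime, timezone

SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}

def build_audit_report(findings):
    """Assemble a security audit report; abort on high-severity findings."""
    worst = max((SEVERITY_RANK[f["severity"]] for f in findings), default=0)
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "decision": "abort" if worst >= SEVERITY_RANK["high"] else "continue",
        "overall_severity": next(
            (name for name, rank in SEVERITY_RANK.items() if rank == worst), "none"
        ),
        "findings": [
            {
                "path": f["path"],              # JSON path of the flagged node
                "vector": f["vector"],          # one of the seven threat vectors
                "severity": f["severity"],
                "confidence": f["confidence"],  # 0.0 - 1.0
            }
            for f in findings
        ],
    }

report = build_audit_report([
    {"path": "$.requirements[3].notes", "vector": "direct_instruction_override",
     "severity": "high", "confidence": 0.92},
])
print(json.dumps(report, indent=2))
```

Keeping the flagged paths in the report is what makes the abort decision reviewable: a human can jump straight to the offending node instead of re-reading the whole document.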
With agentic systems, it is easy to get excited about orchestration, reasoning, tool use, and automation. But once you start building systems that ingest third-party documents and act on them, defensive design has to become part of the architecture itself.
Video of testing attached - 100% local, uses Ollama, IBM Granite4, Microsoft Agent Framework on Amazon Web Services (AWS)
