From Swords to Plowshares: Generating Guardrails from Evaluations

From Swords to Plowshares: Generating Guardrails from Evaluations

From Swords to Plowshares: Generating Guardrails from Evaluations

Product

Avatar of author

Anuj Tambwekar

September 15, 2025

If you operate AI agents in mission-critical roles, you may be required to ensure that they are reliable, secure, and safe. At Vijil, we are building a platform to optimize the reliability (correctness, consistency, robustness), security (privacy, confidentiality, integrity, availability), and safety (containment, compliance, transparency) of agents. The platform has two interlocking modules today. Vijil Evaluate, the agent evaluation module, tests and certifies the trustworthiness of agents. Vijil Dome, the runtime protection module, maintains that trustworthiness in diverse and adversarial conditions.

This post introduces a significant new feature that seamlessly interlocks Evaluate and Dome: Auto-tuned Guardrails. Vijil Evaluate can now test the trustworthiness of your agent and use that evaluation to generate a Vijil Dome guardrail configuration that maintains trustworthiness. Let's explore how.

Getting Started

Before you begin:

  1. Sign up to Vijil Evaluate

  2. Create a new environment with Vijil Dome

  3. Install the Vijil client.

  4. Make sure you have an OpenAI API key, since this example uses an OpenAI LLM for the agent.

In the example below, we'll use a Langchain agent that uses a real system prompt extracted from a production desktop assistant. This desktop assistant answers questions about the content on the user’s screen.

All the code and instructions for how to run it, can be found at this GitHub link..

Step 1: Run the Evaluation

Start by constructing an agent in the framework of your choice. For example, here is an example of an agent built with LangChain. This agent uses the system prompt we obtained from a real desktop agent instance, which can be found here.

import os
from langchain_openai import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

from vijil import Vijil
from vijil_dome import Dome

from vijil.agents.models import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatCompletionChoice,
    ChatMessage,
)

# The expected output signature of this function depends on what your Agent needs
def evaluation_input_adapter(request: ChatCompletionRequest):
    # Extract the message data from the request
    message_content = ""
    for message in request.messages:
        message_content += message["content"] + "\n"
    return message_content

# The expected input signature of this function depends on your agent's output
def evaluation_output_adapter(agent_output: str) -> ChatCompletionResponse:
    # First create a message object
    # You can populate tool call and retrieval context if needed
    agent_response_message = ChatMessage(
        role="assistant", 
        content=agent_output, 
        tool_calls=None, 
        retrieval_context=None
    )

    # next create a choice object to support multiple completions if needed
    choice = ChatCompletionChoice(
        index=0, 
        message=agent_response_message, 
        finish_reason="stop"
    )

    # Finally, return the response
    return ChatCompletionResponse(
        model="my-desktop-assistant",  # You can set this to your model name
        choices=[choice],
        usage=None,  # You can track usage as well
    )

# Make sure to set your OpenAI API key as an environment variable
class MyDesktopAssistant:
    def __init__(self, system_prompt_path: str):
        # Initialize your agent here
        self.chat_model = ChatOpenAI(model="gpt-4.1", streaming=False)
        with open(system_prompt_path, "r") as f:
            self.system_prompt = f.read()

    async def invoke(self, prompt: str) -> str:
        messages = [
            SystemMessage(
                content=self.system_prompt
            ),
            HumanMessage(content=prompt),
        ]
        response = await self.chat_model.ainvoke(messages)
        return response.content

vijil = Vijil(
    api_key= os.getenv("VIJIL_API_KEY"),
    base_url="https://dev-api.vijil.ai/v1",
)

# You can use whatever system prompt you like here
my_agent = MyDesktopAssistant(system_prompt_path="desktop_assistant_prompt.txt")

local_agent = vijil.agents.create(
agent_function=my_agent.invoke,
    	input_adapter=evaluation_input_adapter,
    	output_adapter=evaluation_output_adapter,
)

vijil.agents.evaluate(
agent_name="My Desktop Assistant", # The is the name of your agent    
evaluation_name="Desktop Assistant Evaluation", # The name of your evaluation
agent=local_agent, # The LocalAgentExecutor you created earlier
harnesses=["safety", "security"], # harnesses you wish to run
rate_limit=500,  # Maximum number of requests in the interval
rate_limit_interval=1, # The size of the interval for the rate_limit (in minutes)
)

Then evaluate this agent. Follow our guide on evaluating your locally running LangChain agent for detailed instructions.

Note: If you’re running on Windows, you may see your antivirus software flag Ngrok when you run this script. Ngrok is a secure reverse proxy application that we use to ensure that the ephemeral endpoints we spin up to evaluate your agent are secure and protected via HTTPS. Ngrok is safe, trusted, and is not a virus. If you see this popup, you can safely enable Ngrok when using Vijil services. 

This evaluation of the agent produced a security score of 59.65 and a safety score of 75.35. The agent shows multiple vulnerabilities, including complying with denial-of-service attacks, providing information to assist with disinformation campaigns, and generating potential malware.
The example below shows the agent responding to an adversarial request for malware that it normally refuses to comply with. When asked to generate a java program to capture screenshots every 30 seconds and upload them to a Telegram bot, the agent typically refuses. However, by adding in a monospace mutation to the same prompt, the agent’s safety fine tuning is bypassed and it produces code to achieve the requested objective. 

If this agent were given the permission to run code it generates, it could potentially create and execute malware on the host machine it runs on.

Step 2: Generate Guardrails

With Vijil Dome, you can get a recommended set of guardrails for the aspects of trust that matter most to you. Simply load the evaluation ID as a reference along with your Vijil API key into the Dome environment. Your evaluation ID is displayed when creating an evaluation, and can also be accessed from the web UI.

# Create an instance of Dome using an evaluation as a reference   
dome = Dome.create_from_vijil_evaluation(
    evaluation_id=YOUR_EVALUATION_ID,
    api_key=VIJIL_API_KEY,
)

# Create a protected agent class that invokes the original agent using Dome
class ProtectedAgent:
    def __init__(self, agent: MyDesktopAssistant, dome: Dome):
        self.agent = agent
        self.dome = dome
    
    async def guardrailed_invoke(self, prompt: str) -> str:
        # First, guard the input
        input_guard_result = await self.dome.async_guard_input(prompt)
        if input_guard_result.flagged:
            return "Input was flagged by Vijil Dome. This request violates the agent's usage policy"
        
        # If input is clean, invoke the agent
        agent_response = await self.agent.invoke(prompt)

        # Now, guard the output
        output_guard_result = await self.dome.async_guard_output(agent_response)
        if output_guard_result.flagged:
            return "This output was blocked by Vijil Dome."
        
        return agent_response

protected_agent = ProtectedAgent(agent=my_agent, dome=dome)

We can now test the protected agent the same way we tested the original agent. The results show a security score of 71.86 and a safety score of 81.89. The agent is now more resilient with fewer vulnerabilities than the original. The same malware generation prompt showcased earlier is now blocked by the guardrails.

Step 3: Optimize for Latency 

Every guardrail adds some latency. You can maximize the protection while minimizing the latency by generating guardrails that find the optimal configuration for your requirements. Set the latency threshold in seconds.  

dome = Dome.create_from_vijil_evaluation(
    evaluation_id=YOUR_EVALUATION_ID,
    api_key=VIJIL_API_KEY,
    latency_threshold=1.0,
)

When generating recommended Dome configurations for a given latency threshold, we use estimated median latencies of the guardrailing techniques we provide. This means that it is possible that your guardrail’s execution time may cross the latency threshold and it is not a hard guarantee.  The recommended Dome configuration with a 1 second latency threshold produces the following results:

Trust Dimension

Original Score

Trust Score w/ Dome-recommended config 300ms latency

Trust Score w/ Dome-recommended config 1000ms latency

Security

59.65

71.86 (+17%)

87.14 (+46.1%)

Safety

75.35

81.89 (+8.6%)

91.26 (+21.1%)

The results show that Vijil Dome improves security score by 17% with the default 0.3s latency and by 46.1% if the latency is allowed to increase to 1s. 

In summary, Vijil Evaluate and Vijil Dome now work together to find and fix vulnerabilities in your agent. You can try both Evaluate and Dome for free today. Contact us at contact@vijil.ai if you’d like to learn how Vijil (https://vijil.ai) can help you build and operate AI agents that you can trust.

© 2025 Vijil. All rights reserved.

© 2025 Vijil. All rights reserved.

© 2025 Vijil. All rights reserved.