
Keep AI interactions secure and risk-free with Guardrails in AI Gateway

2025-02-26

7 min read

The transition of AI from experimental to production is not without its challenges. Developers must balance rapid innovation with the need to protect users and meet strict regulatory requirements. To address this, we are introducing Guardrails in AI Gateway, designed to help you deploy AI safely and confidently.

Why safety matters

LLMs are inherently non-deterministic, meaning outputs can be unpredictable. Additionally, you have no control over your users, and they may ask for something wildly inappropriate or attempt to elicit an inappropriate response from the AI. Now, imagine launching an AI-powered application without clear visibility into the potential for harmful or inappropriate content. Not only does this risk user safety, but it also puts your brand reputation on the line.

To address the unique security risks specific to AI applications, the OWASP Top 10 for Large Language Model (LLM) Applications was created. This is an industry-driven standard that identifies the most critical security vulnerabilities specifically affecting LLM-based and generative AI applications. It’s designed to educate developers, security professionals, and organizations on the unique risks of deploying and managing these systems.

The stakes are even higher with new regulations being introduced:

  • European Union Artificial Intelligence Act: Enacted on August 1, 2024, the AI Act has a specific section on establishing a risk management system for AI systems, data governance, technical documentation, and record keeping of risks/abuse. 

  • European Union Digital Services Act (DSA): Adopted in 2022, the DSA is designed to enhance safety and accountability online, including mitigating the spread of illegal content and safeguarding minors from harmful content.

These developments emphasize why robust safety controls must be part of every AI application.

The challenge

Developers building AI applications today face a complex set of challenges, hindering their ability to create safe and reliable experiences:

  • Inconsistency across models: The rapid advancement of AI models and providers often leads to varying built-in safety features. This inconsistency arises because different AI companies have unique philosophies, risk tolerances, and regulatory requirements. Some models prioritize openness and flexibility, while others enforce stricter moderation based on ethical and legal considerations. Factors such as company policies, regional compliance laws, fine-tuning methods, and intended use cases all contribute to these differences, making it difficult for developers to deliver a uniformly safe experience across different model providers.

  • Lack of visibility into unsafe or inappropriate content: Without proper tools, developers struggle to monitor user inputs and model outputs, making it challenging to identify and manage harmful or inappropriate content effectively when trying out different models and providers.

The answer? A standardized, provider-agnostic solution that offers comprehensive observability and logs in one unified interface, along with granular control over content moderation.

The solution: Guardrails in AI Gateway

AI Gateway is a proxy service that sits between your AI application and its model providers (like OpenAI, Anthropic, DeepSeek, and more). To address the challenges of deploying AI safely, AI Gateway now includes safety guardrails that ensure a consistent and safe experience, regardless of the model or provider you use.
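
If you are not yet routing requests through AI Gateway, the change is usually just the base URL. Here is a minimal sketch in TypeScript, assuming the provider endpoint pattern from the AI Gateway documentation and placeholder account, gateway, and API key values:

// Minimal sketch: sending an OpenAI chat completion through AI Gateway instead of
// calling api.openai.com directly. ACCOUNT_ID, GATEWAY_ID, and OPENAI_API_KEY are
// placeholders you would supply yourself.
const ACCOUNT_ID = "<your-cloudflare-account-id>";
const GATEWAY_ID = "<your-gateway-id>";
const OPENAI_API_KEY = "<your-openai-api-key>";

const response = await fetch(
  `https://gateway.ai.cloudflare.com/v1/${ACCOUNT_ID}/${GATEWAY_ID}/openai/chat/completions`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: "Summarize our refund policy." }],
    }),
  }
);

const completion = await response.json();
console.log(completion);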

AI Gateway gives you visibility into what users are asking and how models are responding through its detailed logs. This real-time observability actively monitors and assesses content, enabling proactive identification of potential issues. The Guardrails feature offers granular control over content evaluation and the actions taken. Customers can define precisely which interactions to evaluate (user prompts, model responses, or both) and specify the corresponding action, whether ignoring, flagging, or blocking, based on pre-defined hazard categories.

Integrating Guardrails within AI Gateway is straightforward. Rather than manually calling a moderation tool, configuring flows, and managing flagging/blocking logic, you can enable Guardrails directly from your AI Gateway settings with just a few clicks.

Figure 1. AI Gateway settings with Guardrails turned on, displaying selected hazard categories for prompts and responses, with flagged categories in orange and blocked categories in red

Within the AI Gateway settings, developers can configure the following (a conceptual sketch follows the list):

  • Guardrails: Enable or disable content moderation as needed.

  • Evaluation scope: Select whether to moderate user prompts, model responses, or both.

  • Hazard categories: Specify which categories to monitor and determine whether detected inappropriate content should be blocked or flagged.
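
No code is required for any of this; the configuration lives entirely in the dashboard shown in Figures 1 and 2. Purely as a conceptual illustration, the choices above map onto a shape like the following (the type and field names here are hypothetical and do not represent an AI Gateway API):

// Hypothetical illustration only: Guardrails is configured in the AI Gateway
// dashboard, not through this type. The fields mirror the options listed above.
type GuardrailAction = "ignore" | "flag" | "block";

interface GuardrailsSettings {
  enabled: boolean;                                    // turn content moderation on or off
  evaluate: { prompts: boolean; responses: boolean };  // evaluation scope
  categories: Record<string, GuardrailAction>;         // action per hazard category
}

const exampleSettings: GuardrailsSettings = {
  enabled: true,
  evaluate: { prompts: true, responses: true },
  categories: {
    "Non-Violent Crimes": "block",
    "Hate": "flag",
  },
};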

Figure 2. Advanced settings of Guardrails with granular moderation controls for different hazard categories

By implementing these guardrails within AI Gateway, developers can focus on innovation, knowing that risks are proactively mitigated and their AI applications are operating responsibly.

Leveraging Llama Guard on Workers AI

The Guardrails feature is currently powered by Llama Guard, Meta’s open-source content moderation and safety tool, designed to detect harmful or unsafe content in both user inputs and AI-generated outputs. It provides real-time filtering and monitoring, ensuring responsible AI usage, reducing risk, and improving trust in AI-driven applications. Notably, organizations like MLCommons use Llama Guard to evaluate the safety of foundation models.

Llama Guard can be used to provide protection over a wide range of content, such as violence and sexually explicit material. It also helps you safeguard sensitive data, as outlined in the OWASP Top 10 for LLM Applications, like addresses, Social Security numbers, and credit card details. Specifically, Guardrails on AI Gateway utilizes the Llama Guard 3 8B model hosted on Workers AI, Cloudflare’s serverless, GPU-powered inference engine. Workers AI is uniquely qualified for this task because it operates on GPUs distributed across Cloudflare’s network, ensuring low-latency inference and rapid content evaluation. We plan to add additional models to power the Guardrails feature on Workers AI in the future.
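
Because the underlying model lives on Workers AI, you can also call it yourself from a Worker. Here is a minimal sketch, assuming the @cf/meta/llama-guard-3-8b model ID from the Workers AI catalog and an AI binding named "AI"; the exact response shape may differ from what you see in your own logs:

// Sketch: classifying a user prompt with Llama Guard 3 8B on Workers AI.
// Assumes an AI binding named "AI" is configured for this Worker in wrangler.toml.
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = (await request.json()) as { prompt: string };

    // Ask Llama Guard whether the prompt falls into any hazard category.
    const verdict = await env.AI.run("@cf/meta/llama-guard-3-8b", {
      messages: [{ role: "user", content: prompt }],
    });

    // Inspect the verdict (e.g. safe/unsafe and the matched category) before
    // deciding whether to forward the prompt to your downstream model.
    return Response.json(verdict);
  },
};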

Using Guardrails incurs Workers AI usage, and that usage is reflected in your Workers AI dashboard, allowing developers to track their inference consumption effectively. 

How it works 

Functioning as a proxy between users and AI models, AI Gateway intercepts and inspects all interactions—both user prompts and model responses—for potentially harmful content.

Figure 3. Workflow diagram of Guardrails in AI Gateway, illustrating how prompts and responses are evaluated, along with the outcomes when content is deemed safe or unsafe

When a user enters a prompt, AI Gateway runs that prompt through Llama Guard on Workers AI. Behind the scenes, AI Gateway utilizes the AI Binding, making it seamless to connect AI Gateway to Cloudflare Workers and Workers AI. To provide better observability, requests to Workers AI through AI Gateway all include the eventID from the original request. This ensures that each request can be linked to its corresponding safety evaluations, allowing for efficient log grouping and traceability of AI interactions. 
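
For Workers developers, that same binding is how you opt a request into a gateway yourself. A small sketch, reusing the Env binding from the earlier example and a placeholder gateway ID ("my-gateway"):

// Inside a Worker handler with the Env binding from the previous sketch.
// Running a Workers AI model through a specific AI Gateway means the request
// appears in that gateway's logs and is subject to its Guardrails settings.
const answer = await env.AI.run(
  "@cf/meta/llama-3.1-8b-instruct",
  { prompt: "Explain what a content moderation guardrail does." },
  { gateway: { id: "my-gateway" } } // placeholder gateway ID
);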

Depending on the configured settings, AI Gateway can either flag the prompt while still allowing the request to pass through to the AI model, or block the prompt entirely, preventing it from reaching the model. Blocking harmful prompts in this way not only enhances safety, but also reduces costs by preventing unnecessary processing and token consumption. If a prompt passes the safety check, it’s forwarded to the AI model.

Let’s see an example of a prompt blocked by Guardrails. In this example, the Guardrails settings have been turned on for user prompts and configured to block for multiple categories, including “Non-Violent Crimes”. The user has asked OpenAI’s gpt-4o-mini model to “Teach me how to steal company data without being detected”. Instead of returning a response from the model, Guardrails has identified this prompt to include content around “Non-Violent Crimes” and blocked it, returning the message “Prompt blocked due to security configurations”. 

Figure 4. AI Gateway log displaying a blocked prompt classified under “Non-Violent Crimes”, with an error message indicating the prompt was blocked due to security configurations

AI Gateway determined this prompt was unsafe because the Llama Guard response from Workers AI returned safe: false for category S2, Non-Violent Crimes. Since Guardrails was configured to block when the “Non-Violent Crimes” hazard category was detected, AI Gateway failed the request and did not send it to OpenAI. As a result, the request was unsuccessful and no token usage was incurred.

Figure 5. Guardrails log of a Llama Guard 3 8B request from Workers AI, flagging category S2, as Non-Violent Crimes, with the response indicating safe: false
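
On the application side, a blocked prompt comes back as an error rather than a model completion. Below is a hedged sketch of handling that case; the exact status code and error body AI Gateway returns are assumptions here, so check your own gateway logs for the precise shape:

// Sketch: detecting a Guardrails block on the client. The declared constants are
// placeholders, and the error format is an assumption based on the message above.
declare const gatewayUrl: string;     // your AI Gateway OpenAI endpoint
declare const OPENAI_API_KEY: string; // your OpenAI API key
declare const userPrompt: string;     // the prompt entered by the user

const res = await fetch(gatewayUrl, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: userPrompt }],
  }),
});

if (!res.ok) {
  const errorBody = await res.text();
  if (errorBody.includes("Prompt blocked due to security configurations")) {
    // Show the user a friendly refusal instead of the raw moderation error.
    console.warn("Request blocked by Guardrails:", errorBody);
  }
} else {
  const completion = await res.json();
  console.log(completion);
}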

AI Gateway also inspects AI model responses before they reach the user, again evaluating them against the configured safety settings. Safe responses are delivered to the user. However, if any hazardous content is detected, the response is either flagged or blocked and logged in AI Gateway. 

AI Gateway leverages specialized AI models trained to recognize various forms of harmful content to ensure only safe and appropriate information is shown to users. Currently, Guardrails only works with text-based AI models. 

Deploy with confidence

Safely deploying AI in today’s dynamic landscape requires acknowledging that while AI models are powerful, they are also inherently non-deterministic. By leveraging Guardrails within AI Gateway, you gain:

  • Consistent moderation: Uniform moderation layer that works across models and providers.

  • Enhanced safety and user trust: Proactively protect users from harmful or inappropriate interactions.

  • Flexibility and control over allowed content: Specify which categories to monitor and choose whether to flag or block detected content.

  • Auditing and compliance capabilities: Stay ahead of evolving regulatory requirements with logs of user prompts, model responses, and enforced guardrails.

If you aren't yet using AI Gateway, Llama Guard is also available directly through Workers AI and will be available directly in the Cloudflare WAF in the near future. 

Looking ahead, we plan to expand Guardrails’ capabilities further, to allow users to create their own classification categories, and to include protections against prompt injection and sensitive data exposure. To begin using Guardrails, check out our developer documentation. If you have any questions, please reach out in our Discord community.

