Summary
Developing responsible and trustworthy AI applications relies on essential safety mechanisms like LLM guardrails. These guardrails, built upon a robust DevSecOps foundation, help detect, mitigate, and prevent undesirable LLM behaviors.
This post was co-authored by Gauri Kholkar, Applied AI/ML Scientist, Office of the CTO, and Dr. Ratinder Paul Singh Ahuja, CTO for Security and GenAI. Dr. Ahuja is a renowned name in the field of security, AI, and networking.
Large language models (LLMs) are transforming industries, unlocking unprecedented capabilities. It’s an exciting time, but harnessing this power responsibly means navigating a complex web of potential risks—from harmful content to data leaks. Just as data needs robust storage and security, AI models need strong guardrails.
But it seems like new guardrail models pop up every other month, making it tough to know which fits your needs or if you even need one yet. Maybe you’ve heard about protecting against SQL injection, but now there’s talk of “prompt injection” and other novel attacks on LLM applications that constantly emerge. Are you worried about sending your proprietary data to third-party LLM APIs? Feeling overwhelmed by AI security?
In this series, we’ll break down the critical layers of protection needed for enterprise AI, covering both the familiar ground of application security best practices (which absolutely still apply) and the unique challenges specific to AI. We’ll also reflect how we approach these challenges at Pure Storage:
- Part 1: What No One Tells You About Securing AI Apps: Demystifying AI Guardrails (You are here!): Understanding how combining DevSecOps with specialized safety models is key to making LLM apps strong, secure, and compliant.
- Part 2: Securing the Data Fueling Your LLMs: Strategies for protecting the sensitive information that is used to train models or interact with AI applications, outlining key aspects of the data security approach of Pure Storage.
- Part 3: Building a Secure Infrastructure Foundation: Ensuring the underlying systems supporting your AI workloads are robust and resilient, creating a fortified, end-to-end security shield that protects every layer, from infrastructure to application, throughout deployment and ongoing monitoring.
Understanding the AI Security Landscape
Today, we dive into the rapidly evolving world of LLM guardrails—the essential safety mechanisms designed to detect, mitigate, and prevent undesirable LLM behaviors. But first, let’s clarify the components involved and where security responsibilities lie. A typical AI application integrates several parts, often including external services.
Figure 1: Reference AI Application.
Let’s break down this flow and relate it to our security discussion:
- AI Application: This encompasses everything before the final call to the core AI model. In Figure 1, this includes:
- User Interface: Where the user initially enters their query.
- Orchestration and Routing: This is part of your application’s business logic. It decides how to handle the user query—does it need information from internal knowledge bases, external web searches, or both? This logic also handles interactions with LLM APIs (via adapters/clients) and any necessary calls to external tools or functions.
- Context Construction: Another key piece of business logic. This component gathers the necessary information (from the knowledge base, web search APIs, tool outputs, etc.) and formats it along with the original user query to create the final prompt (Query + Context) that will be sent to the LLM. This is a critical area for security, as it handles potentially sensitive corporate data and external information.
- AI Infrastructure: This refers to the underlying systems that run the core AI model and manage its operation.
- If using a third-party LLM API: As shown in the diagram, the core intelligence often comes from an external provider (OpenAI, Google, Anthropic, etc.). In this case, your infrastructure responsibility is primarily focused on securely interacting with that API (authentication, network security, managing API keys). The provider manages the actual model serving infrastructure. Your concern about sending proprietary data relates directly to this step—the context you construct might contain sensitive information passed to this external API.
- If self-hosting an LLM: You are responsible for the entire infrastructure stack needed to serve the model (compute resources like GPUs, networking, storage for model weights). This also includes the infrastructure for model training if you are fine-tuning or building custom models.
- General AI infra: Regardless of the hosting model, this layer includes the compute, network, and storage infrastructure where the AI application components (like orchestration, context construction) and potentially the self-hosted model itself are deployed. This could involve cloud services (e.g., AWS Lambda, ECS/Fargate, EC2, S3, Azure Functions), on-premises servers, or a hybrid setup. It also encompasses essential operational components like logging, monitoring, and potentially artifact repositories for model versions. Securing this entire infrastructure stack is crucial.
From DevOps to DevSecOps:
Figure 2: Security isn’t an add-on; it must be end-to-end, from secure infrastructure to secure applications, and from secure deployment to secure operations.
When discussing AI security, the conversation often jumps to cutting-edge defenses: prompt injection defenses, hallucination filters, guardrails against unexpected sentience! While these AI-specific concerns are valid and important (as we’ll discuss), it’s crucial not to overlook the fundamentals.
DevSecOps, the practice of integrating security into every stage of the software development lifecycle, is paramount. We often forget that an AI application is, fundamentally, still an application. Before worrying exclusively about novel AI threats, we must ensure we’re applying the basics of application and infrastructure security correctly. Securing the overall AI system starts with securing your AI Application components and the AI Infrastructure using robust DevSecOps practices. This includes standard secure coding, vulnerability scanning, infrastructure hardening, access controls, and threat modeling. If you can secure a traditional n-tier application and its data, you have a strong foundation.
This commitment to security is foundational at Pure Storage, reflected in our rigorous DevSecOps methodology—the “6-Point Plan” detailed in our product security journey—overseen by our security leadership, ensuring that security is built in, not bolted on, for all our solutions, including those powering demanding AI workloads. This includes leveraging innovative tools and techniques, such as using LLMs to automate and scale security practices like STRIDE threat modeling, making robust security analysis accessible even for rapid development cycles.
You must apply these principles rigorously, especially when handling sensitive data during context construction or managing the AI infrastructure, before layering on AI-specific considerations. Critically, since much of the application code itself might be AI-generated, adhering strictly to a secure software development lifecycle for review, testing, and validation and incorporating static and dynamic code scanning is more important than ever.
What’s Unique about AI Security?
Beyond standard practices, the unique aspects requiring focus are:
Risky inputs (user and context): The data flowing into the LLM requires careful scrutiny.
- User input: The direct query or input from the user can be intentionally malicious (e.g., prompt injection, attempts to reveal sensitive info), factually false, or contain toxic/harmful language. As shown in Figure 3, Scenario 1 below, the user directly inputs a malicious instruction, contaminating the final prompt even if other parts are benign.
- Constructed context: The context assembled by your application (from internal knowledge bases, external web searches, API calls, etc.) can also be malicious (if external sources are compromised or manipulated), contain false or outdated information, or include toxic content retrieved from the web. As illustrated in Figure 3, Scenario 2 below, the application retrieves compromised or bad data for context, tainting the final prompt even if the user query was harmless.
Figure 3: Malicious inputs in AI prompts.
AI security, therefore, involves securing your application’s business logic and the underlying AI infrastructure, plus managing the risks associated with the AI-specific components like the prompt/context interaction. LLM guardrails (discussed next) are a specific tool often implemented within the AI application layer to help manage risks at the boundary before interacting with the core LLM.
Why Guardrails Are Non-negotiable for Enterprise AI
Building secure GenAI applications requires understanding the attack vectors. Here are some of the most significant risks, many of which are categorized and detailed in resources like the OWASP Top 10 for LLM Applications:
Figure 4: Risks to generative AI.
5 Primary Risks to AI Security
- Prompt injection: This is a class of attacks against applications built on top of large language models (LLMs) that work by concatenating untrusted user input with a trusted prompt constructed by the application’s developer. Essentially, it’s like tricking the AI. Attackers manipulate the input prompts given to the LLM to make it behave in unintended ways, potentially including tricking it into revealing its confidential system prompt or executing malicious commands via the application’s capabilities.
- Direct injection: A malicious user directly inputs instructions intended to override the original prompt, potentially causing the AI to reveal its initial instructions or execute harmful commands.
- Indirect injection: Adversarial instructions are hidden within external data sources like websites or documents that the AI processes. When the AI interacts with this tainted content, it can inadvertently execute hidden commands. This risk is significantly amplified when the AI has access to tools or APIs that can interact with sensitive data or perform actions, such as tricking an AI email assistant into forwarding private emails or manipulating connected systems.
- Jailbreaking: This is the class of attacks that attempt to subvert safety filters built into the LLMs themselves. It involves crafting inputs specifically designed to bypass these safety mechanisms. The goal is often to coerce the model into generating harmful, unethical, or restricted content it’s designed to refuse. This can range from generating instructions for dangerous activities to creating embarrassing outputs that damage brand reputation.
- Misinformation: LLMs can sometimes generate incorrect or nonsensical information or hallucinations, unsafe code, or unsupported claims.
- Factual inaccuracies: Models might confidently state incorrect facts, potentially leading users astray.
- Unsupported claims: AI models may generate baseless assertions or “facts” with high confidence. This becomes particularly dangerous when applied in critical fields like law, finance, or healthcare, where decisions based on inaccurate AI-generated information can have serious real-world consequences.
- Unsafe code: AI might suggest insecure code or even reference non-existent software libraries. Attackers can exploit this by creating malicious packages with these commonly hallucinated names, tricking developers into installing them.
- Sensitive information disclosure: Without proper safeguards, LLMs can inadvertently reveal Personally Identifiable Information (PII) or other confidential data. This exposure might happen if the model repeats sensitive details provided during user interactions, accesses restricted information through poorly secured retrieval-augmented generation (RAG) systems or external tools, or, in some cases, recalls sensitive data it was inadvertently trained on. The leaked information could include customer PII, internal financial data, proprietary source code, strategic plans, or health records. Such breaches often lead to severe consequences like privacy violations, regulatory penalties (e.g., under GDPR or CCPA), loss of customer trust, and competitive disadvantage.
- Supply chain and data integrity risks: GenAI applications often rely on pre-trained models, third-party data sets, and external plugins. If any component in this supply chain is compromised (e.g., a vulnerable model or a malicious plugin), it can introduce significant security risks. Furthermore, attackers may intentionally corrupt the data used for training, fine-tuning, or RAG systems (“data poisoning”). This poisoning can introduce hidden vulnerabilities, biases, or backdoors into the model, causing it to behave maliciously or unreliably under specific conditions.
Understanding these risks is the first step toward building defenses, which often involves implementing robust guardrails.
Given these risks, especially those related to malicious or harmful inputs and outputs, where should guardrails be placed? Referring to the application flow diagram in Figure 5 below, there are two critical points for intervention:
- Input Guardrails: Placed after the initial User Query is received but before it (and any constructed context) is sent to the LLM API. This helps detect and block malicious prompts, toxic language, or attempts to inject harmful instructions early.
- Output Guardrails: Placed after receiving the response from the LLM API but before presenting the final User Output. This helps filter out any harmful, biased, toxic, or inappropriate content generated by the LLM, preventing it from reaching the user.
Figure 5: LLM guardrails in an AI application.
Implementing guardrails at both these stages provides layered security. Now, let’s look at the specific guardrail solutions available.
The LLM Guardrail Landscape: Key Players and Capabilities
Several solutions have emerged to address the need for LLM safety, each with different strengths and approaches. Here’s a look at some prominent players and the types of harmful content they aim to filter:
Content Detection Capabilities Comparison
Model Name | Violence & Hate, Offensive Speech | Adult Content | Weapons, Illegal Items & Criminal Planning | Self-harm & Suicide | Intellectual Property | Mis- information & Hallucination | Privacy & PII | Jailbreak Prevention & Prompt Injection |
Llama Guard3 (Meta) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
NeMo Guardrails (Nvidia) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Bedrock Guardrails (Amazon) | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ |
Azure AI Content Safety (Microsoft) | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
Granite Guardian 3.2 5B (IBM) | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ |
Guardrails AI | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
WildGuard (Ai2) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Prompt Guard (Meta) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
InjecGuard | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
Note: Data based on publicly available information at time of publication and vendor documentation. Capabilities may evolve.
Architectural Approaches: How Guardrails Work
The underlying architecture significantly influences a guardrail model’s flexibility, performance, and integration capabilities.
Architectural Comparison
Model Name | Architecture | Key Components | Integration Method | Customization Options | Scalability |
Llama Guard3 | Llama-3.1-8B Transformer | Safety classifier for input/output moderation | Hugging Face endpoint, open source model | Fine-tunable | Designed to support Llama 3.1 capabilities |
NeMo Guardrails | Framework (event-driven rails with text embeddings + Colang) | Event-driven architecture, text embeddings, rules-based filtering | Integrates with multiple LLMs | User-defined rules (Colang) | High, supports multiple models |
Bedrock Guardrails | Rule-based & ML-assisted | Pre-built filters for harmful content & hallucination prevention | AWS API | Configurable filtering thresholds | High, built for enterprise |
Azure AI Content Safety | Rule-based & ML-assisted | Prompt Shields, Groundedness Detection, risk assessments | Azure AI API | Limited tuning via Azure AI Studio | High, cloud-scale AI |
Granite Guardian 3.2 5B | Iterative pruning & healing on 5 B transformer | Pruned/healed risk model, IBM AI Risk Atlas taxonomy | Hugging Face endpoint, IBM toolkit | fine-tunable | Enterprise-grade, requires moderate cost, latency |
Guardrails AI | Library using external APIs/LLMs for validation | Validators (e.g., Regex, Toxicity) with custom prompts | API & open source | Highly customizable validators | Scalable, but computationally expensive |
WildGuard | Fine-tuned Mistral-7B | Unified classification head, trained on WildGuardTrain | Hugging Face endpoint, open source model | Fine-tunable | Requires moderate cost, latency |
Prompt Guard | mDeBERTa-v3-base multilingual classifier | Head for benign/injection/jailbreak, trained on open source, red-teamed & synthetic data | Hugging Face endpoint, open source model | Fine-tunable | Small footprint, CPU-deployable |
InjecGuard | DeBERTAV3-base with MOF over-defense mitigation | Mitigating Over-defense for Free (MOF) training strategy | Hugging Face endpoint, open source model | Fine-tunable | Small footprint, CPU-deployable |
Key Architectural Differences:
- Transformer-based (e.g., LlamaGuard3, WildGuard): Leverage fine-tuned language models specifically trained to classify content safety. Often accurate but can be resource-intensive.
- Framework-based (e.g., NeMo Guardrails, Guardrails AI): Provide flexible toolkits using techniques like text embeddings, rule engines (like Nvidia’s Colang), or even using another LLM to validate outputs. Highly customizable but may require more setup effort.
- Automated reasoning/rule-based (e.g., Amazon Bedrock, Azure AI): Rely on predefined rules, machine learning models, and risk assessment frameworks, often tightly integrated into cloud platforms for ease of use and scalability in enterprise environments. Customization might be more limited compared to frameworks.
Choosing the Right Guardrail
Selecting the appropriate guardrail depends on your specific needs, existing infrastructure, technical expertise, and risk tolerance.
Model Overall Comparison
Model Name | Architecture | Evaluation Data Highlights | Pros | Cons |
Llama Guard3 | Llama-3.1-8B Transformer | Proprietary data sets, XSTest data set | High accuracy, broad coverage, open source model | Requires moderate cost, latency, no jailbreak/prompt injection detection |
NeMo Guardrails | Framework (event-driven rails with text embeddings + Colang) | Anthropic Red Teaming data set, Helpful data set, MS-Marco, NLU Banking | Flexible, integrates with various LLMs, open source, highly programmable | Requires rule creation expertise, performance varies, runtime overhead from rule‐engine |
Bedrock Guardrails | Rule-based & ML-assisted | Proprietary data sets | Works across multiple foundation models, configurable, integrates with AWS | Limited public evaluation details, potential region lock |
Azure AI Content Safety | Rule-based & ML-assisted | Proprietary data sets | Configurable, integrates with Azure AI | Might not be perfectly accurate in detecting inappropriate content, custom categories might need more tuning, limited public evaluation details |
Granite Guardian 3.2 5B | Iterative pruning & healing on 5 B transformer | Aegis AI Content Safety, ToxicChat, HarmBench, True, DICES | High reliability, rigorous safety testing | Requires moderate cost, latency, only trained and tested on English |
Guardrails AI | Library using external APIs/LLMs for validation | Is different for each validator | Supports broad categories, API and open source SDK | Relies on external validators |
WildGuard | Fine-tuned Mistral-7B | WildGuardTrain data set | Open source, detects nuanced refusal/compliance in completions | Requires moderate cost, latency |
PromptGuard | mDeBERTa-v3-base multilingual classifier | Cybersec eval data sets | Lightweight & CPU-deployable, good multilingual injection/jailbreak detection | High false positives, model’s classifications for jailbreak attempts and prompt injection often overlap |
InjecGuard | DeBERTAV3 with MOF over-defense mitigation | BIPIA data set, PINT data set | Lightweight & CPU-deployable, reduced over defense | Only focuses on prompt injection |
Challenges and the Road Ahead
The field of LLM safety is dynamic. Current challenges include:
- Sophisticated evasion: Adversarial attacks are constantly evolving to bypass existing filters, as demonstrated by research such as SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters.
- Performance overhead: Adding safety checks can introduce latency and computational cost.
- Balancing safety and utility: Overly strict guardrails can stifle creativity and usefulness, while overly permissive ones increase risk.
- Contextual nuance: Determining harmfulness often depends heavily on context, which is challenging for automated systems.
Continuous research, development, and community collaboration are essential to stay ahead of emerging threats and refine these critical safety technologies.
Next Steps: Securing Your AI Ecosystem
Understanding the landscape of LLM guardrails is the crucial first step in building responsible and trustworthy AI applications. These tools provide essential checks against harmful outputs, but they’re only one piece of the puzzle.
Stay tuned for Part 2 of this series, where we’ll delve into the critical strategies for securing data flowing through LLMs and the applications built upon them. Following that, Part 3 will focus on establishing a secure and resilient infrastructure—the bedrock upon which reliable AI systems are built.
Building safe, enterprise-grade AI requires a holistic approach, encompassing the model, the data, and the infrastructure. Join us as we continue to explore how to navigate this exciting new frontier securely and effectively.

FlashBlade//EXA
Experience the World’s Most Powerful Data Storage Platform for AI
Power Your AI Success
Learn more about the world’s most powerful data storage platform for AI.