Overview

vLLM is a high-throughput, memory-efficient inference engine for LLMs. PromptGuard integrates with vLLM by proxying requests through its security layer, applying all threat detectors to your vLLM traffic with minimal latency overhead.
PromptGuard adds roughly 30ms of proxy overhead on top of vLLM’s already-fast inference. For most workloads, this is negligible compared to model inference time.

Prerequisites

  1. vLLM server running — Start a vLLM server with your chosen model:
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3-70B-Instruct \
      --port 8000
    
  2. PromptGuard API key — Sign up at app.promptguard.co and create an API key

Quick Start

Route vLLM traffic through PromptGuard using the vllm/ model prefix:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.promptguard.co/api/v1",
    api_key="your-promptguard-key",
)

response = client.chat.completions.create(
    model="vllm/meta-llama/Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize the key points of transformer architecture"}],
)

print(response.choices[0].message.content)

Model Naming

Use the vllm/ prefix followed by the model identifier as loaded in your vLLM server:
| vLLM Model | PromptGuard Model Name |
| --- | --- |
| `meta-llama/Llama-3-70B-Instruct` | `vllm/meta-llama/Llama-3-70B-Instruct` |
| `meta-llama/Llama-3-8B-Instruct` | `vllm/meta-llama/Llama-3-8B-Instruct` |
| `mistralai/Mistral-7B-Instruct-v0.3` | `vllm/mistralai/Mistral-7B-Instruct-v0.3` |
| `Qwen/Qwen2-72B-Instruct` | `vllm/Qwen/Qwen2-72B-Instruct` |
| `microsoft/Phi-3-medium-128k-instruct` | `vllm/microsoft/Phi-3-medium-128k-instruct` |
The model name after vllm/ must match the --model argument used when starting your vLLM server.
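The prefixing rule above is simple enough to capture in a tiny helper. This is an illustrative sketch, not part of any PromptGuard SDK; the function name is made up for this example:

```python
def promptguard_model_name(vllm_model: str) -> str:
    """Prefix a vLLM model identifier with vllm/ for routing
    through PromptGuard; leaves already-prefixed names unchanged."""
    if vllm_model.startswith("vllm/"):
        return vllm_model
    return f"vllm/{vllm_model}"
```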

Environment Variables

Configure your vLLM endpoint and PromptGuard credentials:
# .env
PROMPTGUARD_API_KEY=your-promptguard-key
VLLM_BASE_URL=http://localhost:8000    # Default vLLM endpoint
If vLLM is running on a different host or port, set VLLM_BASE_URL accordingly. PromptGuard reads this variable to route requests to your vLLM server.
# Remote vLLM instance
VLLM_BASE_URL=http://gpu-server.internal:8000

# Multiple GPU nodes behind a load balancer
VLLM_BASE_URL=http://vllm-lb.internal:8000
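In application code, these variables are typically read once at startup, with `VLLM_BASE_URL` falling back to the default local endpoint. A minimal sketch (the `load_config` helper is illustrative, not a PromptGuard API):

```python
import os

def load_config() -> dict:
    """Read PromptGuard credentials and the vLLM endpoint from the
    environment. PROMPTGUARD_API_KEY is required; VLLM_BASE_URL
    defaults to the local vLLM server."""
    return {
        "api_key": os.environ["PROMPTGUARD_API_KEY"],
        "vllm_base_url": os.getenv("VLLM_BASE_URL", "http://localhost:8000"),
    }
```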

Full Integration Example

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.promptguard.co/api/v1",
    api_key=os.getenv("PROMPTGUARD_API_KEY"),
    default_headers={
        "X-VLLM-Base-URL": os.getenv("VLLM_BASE_URL", "http://localhost:8000"),
    },
)

response = client.chat.completions.create(
    model="vllm/meta-llama/Llama-3-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "Explain gradient descent in three sentences."},
    ],
    temperature=0.3,
    max_tokens=512,
)

print(response.choices[0].message.content)

Streaming

PromptGuard supports streaming from vLLM servers:
stream = client.chat.completions.create(
    model="vllm/meta-llama/Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a quick tutorial on Docker"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
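The `if chunk.choices[0].delta.content` guard matters: `delta.content` is `None` on role-only and terminal chunks. If you need the full text rather than incremental printing, a small accumulator (illustrative, not part of the OpenAI SDK) makes that filtering explicit:

```python
def collect_stream(chunks) -> str:
    """Accumulate delta content from a streaming chat completion.
    Skips chunks whose delta.content is None (role-only and
    final chunks) and joins the rest into one string."""
    parts = []
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)
```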

Performance Notes

vLLM is designed for maximum throughput. Here’s how PromptGuard fits into the latency picture:
| Component | Typical Latency |
| --- | --- |
| vLLM inference (70B model) | 200–800ms |
| PromptGuard security scan | ~30ms |
| Network round-trip (proxy) | ~5ms |
| **Total overhead** | **~35ms** |
PromptGuard’s security scanning runs in parallel with request preprocessing, so the effective overhead is often lower than 30ms for longer prompts.
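To verify these numbers against your own deployment, time the same request through the proxied path and directly against vLLM. This timing wrapper is a generic sketch, not a PromptGuard utility:

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_ms). Run it once against
    the PromptGuard endpoint and once against vLLM directly to
    measure the proxy overhead for your workload."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```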

Batch Processing

For high-throughput batch workloads, PromptGuard scans requests concurrently:
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.promptguard.co/api/v1",
    api_key=os.getenv("PROMPTGUARD_API_KEY"),
    default_headers={
        "X-VLLM-Base-URL": os.getenv("VLLM_BASE_URL", "http://localhost:8000"),
    },
)

async def process_batch(prompts: list[str]):
    tasks = [
        client.chat.completions.create(
            model="vllm/meta-llama/Llama-3-70B-Instruct",
            messages=[{"role": "user", "content": p}],
        )
        for p in prompts
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(process_batch([
    "Summarize this document...",
    "Translate to French...",
    "Extract key entities...",
]))
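Note that `asyncio.gather` fires every request at once. For large batches you may want to cap in-flight requests so a single batch cannot saturate the vLLM scheduler. A sketch of bounded concurrency (the limit of 8 is an arbitrary starting point, not a PromptGuard recommendation):

```python
import asyncio

async def bounded_gather(coros, limit: int = 8):
    """Run coroutines with at most `limit` in flight, preserving
    input order in the results. Useful for large batches against
    a vLLM server with finite scheduling capacity."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```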

Security Benefits

Prompt Injection

Protects self-hosted models from jailbreaks and instruction hijacking

PII Detection

Prevents sensitive data from being processed by local inference

Data Exfiltration

Blocks attempts to extract system prompts or training artifacts

Content Safety

Enforces content moderation on unaligned open-weight models
Open-weight models served via vLLM often lack the safety tuning of commercial APIs. PromptGuard provides a critical defense layer for production deployments.

Troubleshooting

Error: “Cannot connect to vLLM”

Verify your vLLM server is running and accessible:
curl http://localhost:8000/v1/models

Error: “Model not found”

Ensure the model name matches your vLLM server’s --model argument:
# Check what model vLLM is serving
curl http://localhost:8000/v1/models | jq '.data[].id'
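The same check can be done programmatically: strip the `vllm/` prefix from your PromptGuard model name and compare it against the IDs returned by `/v1/models`. The helper below is illustrative (it assumes you have already fetched the served IDs):

```python
def model_is_served(requested: str, served_ids: list[str]) -> bool:
    """Return True if the model name after the vllm/ prefix matches
    one of the IDs reported by the vLLM server's /v1/models endpoint."""
    bare = requested.removeprefix("vllm/")
    return bare in served_ids
```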

Error: “No provider found for model”

Use the vllm/ prefix in your model name:
# Wrong
model="meta-llama/Llama-3-70B-Instruct"

# Correct
model="vllm/meta-llama/Llama-3-70B-Instruct"

Next Steps

LLM Providers

See all supported LLM providers

Security Policies

Configure threat detection thresholds

Streaming

Streaming integration details

Monitoring

Track usage and threats in real time