Overview
vLLM is a high-throughput, memory-efficient inference engine for LLMs. PromptGuard integrates with vLLM by proxying requests through its security layer, applying all threat detectors to your vLLM traffic with minimal latency overhead.

PromptGuard adds approximately 30ms of proxy overhead to vLLM’s already-fast inference. For most workloads, this is negligible compared to model inference time.
Prerequisites
- vLLM server running — Start a vLLM server with your chosen model:
- PromptGuard API key — Sign up at app.promptguard.co and create an API key
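The first prerequisite can be sketched as a quick health check, assuming vLLM's default port of 8000; the launch command shown in the comment is vLLM's standard CLI, so adjust the model and flags for your deployment:

```python
# Assumes a vLLM server started with something like:
#   vllm serve meta-llama/Llama-3-8B-Instruct --port 8000
# Check that its OpenAI-compatible API answers before wiring in PromptGuard.
import urllib.error
import urllib.request

def vllm_is_running(base_url: str = "http://localhost:8000") -> bool:
    """Return True if the vLLM server answers on its /v1/models endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```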
Quick Start
Route vLLM traffic through PromptGuard using the vllm/ model prefix:
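A minimal sketch using only the Python standard library. The PromptGuard endpoint URL here is an assumption; substitute the base URL and API key from your PromptGuard dashboard:

```python
# Sketch of a chat completion routed through PromptGuard's OpenAI-compatible
# proxy. PROMPTGUARD_URL is an assumed endpoint, not a documented constant.
import json
import os
import urllib.request

PROMPTGUARD_URL = "https://api.promptguard.co/v1/chat/completions"  # assumed

def build_payload(prompt: str,
                  model: str = "vllm/meta-llama/Llama-3-8B-Instruct") -> dict:
    # The vllm/ prefix tells PromptGuard to route the call to your vLLM server.
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str) -> str:
    req = urllib.request.Request(
        PROMPTGUARD_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['PROMPTGUARD_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```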
Model Naming
Use the vllm/ prefix followed by the model identifier as loaded in your vLLM server:
| vLLM Model | PromptGuard Model Name |
|---|---|
| meta-llama/Llama-3-70B-Instruct | vllm/meta-llama/Llama-3-70B-Instruct |
| meta-llama/Llama-3-8B-Instruct | vllm/meta-llama/Llama-3-8B-Instruct |
| mistralai/Mistral-7B-Instruct-v0.3 | vllm/mistralai/Mistral-7B-Instruct-v0.3 |
| Qwen/Qwen2-72B-Instruct | vllm/Qwen/Qwen2-72B-Instruct |
| microsoft/Phi-3-medium-128k-instruct | vllm/microsoft/Phi-3-medium-128k-instruct |
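The naming rule in the table is a plain string prefix, so it can be expressed as:

```python
def promptguard_model_name(vllm_model: str) -> str:
    """Prefix a vLLM model identifier to form its PromptGuard model name."""
    return f"vllm/{vllm_model}"

# promptguard_model_name("Qwen/Qwen2-72B-Instruct")
# → "vllm/Qwen/Qwen2-72B-Instruct"
```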
The model name after vllm/ must match the --model argument used when starting your vLLM server.

Environment Variables

Configure your vLLM endpoint and PromptGuard credentials. If your vLLM server listens on a non-default host or port, set VLLM_BASE_URL accordingly; PromptGuard reads this variable to route requests to your vLLM server.
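A sketch of reading these settings in application code. VLLM_BASE_URL follows this page; PROMPTGUARD_API_KEY is an assumed variable name, and the default port is vLLM's standard 8000:

```python
import os

# VLLM_BASE_URL: where PromptGuard forwards requests (vLLM default shown).
# PROMPTGUARD_API_KEY: assumed name for the key created at app.promptguard.co.
VLLM_BASE_URL = os.environ.get("VLLM_BASE_URL", "http://localhost:8000")
PROMPTGUARD_API_KEY = os.environ.get("PROMPTGUARD_API_KEY", "")
```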
Full Integration Example
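A fuller sketch that also surfaces HTTP errors, such as a request rejected by a security policy, instead of crashing. The endpoint URL and the error shape are assumptions, not PromptGuard's documented API:

```python
# Route a chat completion through PromptGuard and report HTTP-level failures.
import json
import os
import urllib.error
import urllib.request

PROMPTGUARD_URL = "https://api.promptguard.co/v1/chat/completions"  # assumed

def guarded_chat(messages: list,
                 model: str = "vllm/meta-llama/Llama-3-70B-Instruct") -> dict:
    """Send a chat completion through PromptGuard and return the parsed JSON."""
    req = urllib.request.Request(
        PROMPTGUARD_URL,
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('PROMPTGUARD_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # A 4xx here may mean PromptGuard blocked the request; keep the body
        # so the caller can inspect the reason.
        detail = err.read().decode(errors="replace")
        raise RuntimeError(f"PromptGuard returned {err.code}: {detail}") from err

# Example call (requires a reachable PromptGuard endpoint and a valid key):
# reply = guarded_chat([{"role": "user", "content": "Summarize vLLM in one line."}])
# print(reply["choices"][0]["message"]["content"])
```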
Streaming
PromptGuard supports streaming from vLLM servers:
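A streaming sketch with "stream": true. The endpoint URL is an assumption; the SSE framing (lines prefixed with "data: " and terminated by "data: [DONE]") is the standard OpenAI-compatible format vLLM emits:

```python
# Consume a streamed chat completion routed through PromptGuard.
import json
import os
import urllib.request
from typing import Optional

PROMPTGUARD_URL = "https://api.promptguard.co/v1/chat/completions"  # assumed

def parse_sse_line(line: bytes) -> Optional[str]:
    """Extract the content delta from one SSE line, or None if there is none."""
    if not line.startswith(b"data: ") or line.strip() == b"data: [DONE]":
        return None
    chunk = json.loads(line[len(b"data: "):])
    return chunk["choices"][0]["delta"].get("content")

def stream_chat(prompt: str,
                model: str = "vllm/meta-llama/Llama-3-8B-Instruct") -> None:
    body = {"model": model, "stream": True,
            "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        PROMPTGUARD_URL,
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('PROMPTGUARD_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # HTTPResponse iterates line by line
            delta = parse_sse_line(line)
            if delta:
                print(delta, end="", flush=True)
```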
Performance Notes

vLLM is designed for maximum throughput. Here’s how PromptGuard fits into the latency picture:

| Component | Typical Latency |
|---|---|
| vLLM inference (70B model) | 200–800ms |
| PromptGuard security scan | ~30ms |
| Network round-trip (proxy) | ~5ms |
| Total overhead | ~35ms |
PromptGuard’s security scanning runs in parallel with request preprocessing, so the effective overhead is often lower than 30ms for longer prompts.
Batch Processing
For high-throughput batch workloads, PromptGuard scans requests concurrently:
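On the client side, the same fan-out pattern can be sketched with a thread pool; send_one is a hypothetical stand-in for whatever function posts a single request through PromptGuard:

```python
# Fan out batch prompts concurrently; results keep the input order.
from concurrent.futures import ThreadPoolExecutor

def send_one(prompt: str) -> str:
    # Placeholder: in real use, POST the prompt through PromptGuard here.
    return f"scanned:{prompt}"

def run_batch(prompts: list, max_workers: int = 8) -> list:
    """Process prompts concurrently with a bounded worker pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_one, prompts))
```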
Security Benefits

Prompt Injection
Protects self-hosted models from jailbreaks and instruction hijacking
PII Detection
Prevents sensitive data from being processed by local inference
Data Exfiltration
Blocks attempts to extract system prompts or training artifacts
Content Safety
Enforces content moderation on unaligned open-weight models
Troubleshooting
Error: “Cannot connect to vLLM”
Verify your vLLM server is running and accessible.

Error: “Model not found”

Ensure the model name matches your vLLM server’s --model argument:
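Both checks can be sketched as one diagnostic that lists the model IDs your vLLM server is actually serving, to compare against the name after the vllm/ prefix; the default port is assumed:

```python
# List model IDs from the vLLM server's /v1/models endpoint.
# A connection error here means the server is not reachable at base_url.
import json
import urllib.request

def served_models(base_url: str = "http://localhost:8000") -> list:
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return [m["id"] for m in json.load(resp).get("data", [])]
```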
Error: “No provider found for model”
Use the vllm/ prefix in your model name: for example, vllm/meta-llama/Llama-3-70B-Instruct rather than the bare meta-llama/Llama-3-70B-Instruct.
Next Steps
LLM Providers
See all supported LLM providers
Security Policies
Configure threat detection thresholds
Streaming
Streaming integration details
Monitoring
Track usage and threats in real time