PromptGuard fully supports streaming responses. Security scanning happens on the input before the request is forwarded, so PromptGuard adds no per-token latency: the only streaming overhead is the one-time input scan (~150ms) before the first token.
How Streaming Works
1. Your request is sent to PromptGuard
2. PromptGuard scans the input for threats (~150ms)
3. If the input is safe, the request is forwarded to the LLM provider
4. The LLM provider streams tokens directly back through PromptGuard
5. Tokens arrive in real time as they're generated
Using the OpenAI SDK
The simplest way to stream. It works with your existing OpenAI or Anthropic code; just point the client at the PromptGuard base URL.
Python

from openai import OpenAI

client = OpenAI(
    api_key="your_promptguard_api_key",
    base_url="https://api.promptguard.co/api/v1"
)

stream = client.chat.completions.create(
    model="gpt-5-nano",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
Node.js

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.PROMPTGUARD_API_KEY,
  baseURL: 'https://api.promptguard.co/api/v1'
});

const stream = await openai.chat.completions.create({
  model: 'gpt-5-nano',
  messages: [{ role: 'user', content: 'Explain quantum computing' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
cURL

curl -N https://api.promptguard.co/api/v1/chat/completions \
  -H "X-API-Key: $PROMPTGUARD_API_KEY" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5-nano",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": true
  }'
Using the PromptGuard SDK
Python

from promptguard import PromptGuard

pg = PromptGuard(api_key="pg_xxx")

stream = pg.chat.completions.create(
    model="gpt-5-nano",
    messages=[{"role": "user", "content": "Write a short story"}],
    stream=True
)

for chunk in stream:
    content = chunk.get("choices", [{}])[0].get("delta", {}).get("content")
    if content:
        print(content, end="", flush=True)
Node.js

import PromptGuard from 'promptguard-sdk';

const pg = new PromptGuard({ apiKey: 'pg_xxx' });

const stream = await pg.chat.completions.create({
  model: 'gpt-5-nano',
  messages: [{ role: 'user', content: 'Write a short story' }],
  stream: true
});

// Consume the stream as chunks arrive
for await (const chunk of stream) {
  const content = chunk.choices?.[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
Server-Sent Events (SSE)
When streaming, the API returns Server-Sent Events. Each event contains a JSON chunk:
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
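If you consume the raw stream without an SDK, you parse these data: lines yourself. A minimal sketch using the httpx library; the endpoint and headers follow the cURL example above:

import json
import httpx

payload = {
    "model": "gpt-5-nano",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": True,
}
headers = {
    "X-API-Key": "your_promptguard_api_key",
    "Authorization": "Bearer your_openai_api_key",
}

with httpx.stream(
    "POST",
    "https://api.promptguard.co/api/v1/chat/completions",
    json=payload,
    headers=headers,
    timeout=None,
) as response:
    for line in response.iter_lines():
        # SSE frames arrive as "data: <json>" lines separated by blank lines
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        content = chunk["choices"][0]["delta"].get("content")
        if content:
            print(content, end="", flush=True)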
Framework Integration
FastAPI (Python)
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()

client = OpenAI(
    api_key="your_promptguard_api_key",
    base_url="https://api.promptguard.co/api/v1"
)

@app.post("/chat/stream")
async def stream_chat(message: str):
    def generate():
        stream = client.chat.completions.create(
            model="gpt-5-nano",
            messages=[{"role": "user", "content": message}],
            stream=True
        )
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                # JSON-encode so newlines in content can't break SSE framing
                yield f"data: {json.dumps({'content': content})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
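To exercise the endpoint, a client consumes the same SSE framing. A minimal sketch with httpx, assuming the server runs locally on port 8000 (message travels as a query parameter because of the plain str annotation above):

import json
import httpx

with httpx.stream(
    "POST",
    "http://localhost:8000/chat/stream",
    params={"message": "Explain quantum computing"},
    timeout=None,
) as r:
    for line in r.iter_lines():
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        print(json.loads(data)["content"], end="", flush=True)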
Express (Node.js)
import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json());

const openai = new OpenAI({
  apiKey: process.env.PROMPTGUARD_API_KEY,
  baseURL: 'https://api.promptguard.co/api/v1'
});

app.post('/chat/stream', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const stream = await openai.chat.completions.create({
    model: 'gpt-5-nano',
    messages: [{ role: 'user', content: req.body.message }],
    stream: true
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      res.write(`data: ${JSON.stringify({ content })}\n\n`);
    }
  }

  res.write('data: [DONE]\n\n');
  res.end();
});
Next.js (React)
// app/api/chat/route.ts
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.PROMPTGUARD_API_KEY!,
  baseURL: 'https://api.promptguard.co/api/v1'
});

export async function POST(req: Request) {
  const { message } = await req.json();

  const stream = await openai.chat.completions.create({
    model: 'gpt-5-nano',
    messages: [{ role: 'user', content: message }],
    stream: true
  });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content;
        if (content) {
          controller.enqueue(encoder.encode(`data: ${JSON.stringify({ content })}\n\n`));
        }
      }
      controller.enqueue(encoder.encode('data: [DONE]\n\n'));
      controller.close();
    }
  });

  return new Response(readable, {
    headers: { 'Content-Type': 'text/event-stream' }
  });
}
Error Handling During Streaming
Errors during streaming are delivered as SSE events; the SDKs surface them as exceptions you can catch:
try:
    stream = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="")
except Exception as e:
    if "policy_violation" in str(e):
        print("\nRequest blocked by security policy")
    elif "rate_limit" in str(e):
        print("\nRate limited - retry with backoff")
    else:
        print(f"\nError: {e}")
Security blocks happen before streaming begins (during input scanning). If a request passes the security check, the stream will complete normally. You won’t receive a mid-stream security block.
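Concretely, that guarantee means the policy-violation branch can never fire after partial output has been shown. A sketch reusing the client and error matching from the example above:

rendered = []  # tokens the UI has already shown
try:
    stream = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            rendered.append(content)
            print(content, end="", flush=True)
except Exception as e:
    if "policy_violation" in str(e):
        # Input blocks fire before any tokens are delivered, so nothing
        # has been rendered and there is no partial output to undo.
        assert rendered == []
        print("Request blocked by security policy")
    else:
        raise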
Streaming Output Guardrails
When scan_responses (Python) or scanResponses (Node.js) is enabled with auto-instrumentation, PromptGuard also scans the completed output after streaming finishes. The SDK buffers the full response internally and sends it to the Guard API with direction="output" once the stream ends.
Python

import promptguard
from promptguard import PromptGuardBlockedError

promptguard.init(
    api_key="pg_xxx",
    mode="enforce",
    scan_responses=True,
)

from openai import OpenAI
client = OpenAI()

try:
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Summarize this report"}],
        stream=True,
    )
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
except PromptGuardBlockedError as e:
    print(f"\nOutput blocked: {e.decision.threat_type}")
Node.js

import { init, PromptGuardBlockedError } from 'promptguard-sdk';
import OpenAI from 'openai';

init({
  apiKey: 'pg_xxx',
  mode: 'enforce',
  scanResponses: true,
});

const client = new OpenAI();

try {
  const stream = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Summarize this report' }],
    stream: true,
  });
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
} catch (error) {
  if (error instanceof PromptGuardBlockedError) {
    console.log(`\nOutput blocked: ${error.decision.threatType}`);
  } else {
    throw error;  // don't swallow unrelated errors
  }
}
How it works:
1. Input is scanned before streaming begins (same as without output scanning)
2. Tokens stream to your application in real time as they arrive
3. The SDK accumulates the full response in the background
4. After the stream completes, the full response is sent to the Guard API for output scanning
5. If the output is flagged, a PromptGuardBlockedError is raised after the stream ends
Because output scanning happens after the full stream is received, your application will have already displayed the tokens to the user by the time a block is triggered. Design your UI to handle post-stream blocks gracefully — for example, by clearing the displayed response or showing a warning banner.
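One way to handle this on the server is to append a final SSE event that tells the frontend whether the response survived output scanning, so the client can clear the bubble or show a banner. A sketch building on the FastAPI example above; the blocked payload is an arbitrary convention for your own frontend, not part of the PromptGuard API:

import json
import promptguard
from promptguard import PromptGuardBlockedError
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

promptguard.init(api_key="pg_xxx", mode="enforce", scan_responses=True)

from openai import OpenAI
client = OpenAI()

app = FastAPI()

@app.post("/chat/stream")
async def stream_chat(message: str):
    def generate():
        try:
            stream = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": message}],
                stream=True,
            )
            for chunk in stream:
                content = chunk.choices[0].delta.content
                if content:
                    yield f"data: {json.dumps({'content': content})}\n\n"
        except PromptGuardBlockedError:
            # Raised after the stream ends; tell the client to retract
            # what it has already rendered (e.g. clear the chat bubble).
            yield f"data: {json.dumps({'blocked': True})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")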
| Metric | Value |
|---|---|
| Input scan overhead | ~150ms (one-time, before streaming starts) |
| Per-token overhead | ~0ms (tokens pass through directly) |
| Time to first token | Same as direct provider + ~150ms |
| Output scan overhead | ~150ms (one-time, after stream completes; only when response scanning is enabled) |
Streaming is recommended for all user-facing applications. The perceived latency is significantly lower because users see tokens appear in real-time rather than waiting for the full response.