Can I use vLLM with Bifrost?

Register your vLLM base URL as a custom provider in Bifrost and route via http://localhost:8080.

How to Set Up vLLM

Install vLLM on your own hardware, launch its OpenAI-compatible server, then integrate with Bifrost for virtual keys, budgets, and cost governance. Complete setup in minutes.

Self-hosted, OpenAI-compatible, GPU inference, Open models, Bifrost gateway

vLLM provider summary

Bifrost can front a self-hosted vLLM server so teams share one gateway with budgets and observability.

Property	Details
Description	vLLM is an open-source inference engine for fast LLM serving with an OpenAI-compatible HTTP API.
Provider route on Bifrost	vllm/<model>
Provider doc	vLLM
API endpoint for provider	http://localhost:8000/v1
Supported endpoints	/v1/models, /v1/completions, /v1/chat/completions, /v1/responses, /v1/embeddings, /v1/audio/transcriptions, /v1/rerank
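Because Hugging Face model IDs contain a slash themselves, a route such as vllm/meta-llama/Meta-Llama-3.1-8B-Instruct has to be split on the first slash only. The sketch below illustrates that convention; split_route and to_route are hypothetical helper names, and this is not how Bifrost itself parses routes.

```python
def split_route(route: str) -> tuple[str, str]:
    """Split '<provider>/<model>' on the FIRST slash only.

    Hypothetical helper for illustration; model IDs such as
    'meta-llama/Meta-Llama-3.1-8B-Instruct' keep their own slash.
    """
    provider, sep, model = route.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected '<provider>/<model>', got {route!r}")
    return provider, model


def to_route(provider: str, model: str) -> str:
    """Build the route string used as the `model` field in requests."""
    return f"{provider}/{model}"


if __name__ == "__main__":
    print(split_route("vllm/meta-llama/Meta-Llama-3.1-8B-Instruct"))
```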

Official vLLM Resources

vLLM documentation and GitHub repository.

Prerequisites

Before you begin, you will need:

Linux host with NVIDIA GPU (recommended)
Python 3.9+
CUDA drivers installed

No cloud API key: vLLM runs on your hardware. Authentication on the vLLM server is optional; use Bifrost virtual keys at the gateway layer instead.

[ QUICK START ]

How Do You Set Up vLLM in 4 Steps?

Install vLLM

Use pip on a GPU machine.

On a machine with CUDA, install vLLM: pip install vllm. See the vLLM docs for hardware requirements.

Choose a Hugging Face model

Pick a model ID from Hugging Face (for example meta-llama/Meta-Llama-3.1-8B-Instruct). Ensure you have access tokens if the model is gated.

Launch the OpenAI-compatible server

vLLM listens on port 8000 by default.

Start the server with your model. For local testing, API key checks are usually disabled, so clients can send a placeholder key.

Terminal

$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --port 8000
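Loading model weights can take a while, so it can help to wait until the server answers before sending traffic. The following is a minimal sketch that polls the /v1/models endpoint listed in the provider summary. It assumes the standard OpenAI list response shape with a data array of objects that carry an id. The function names, attempt count, and delay are illustrative choices, not vLLM defaults.

```python
import json
import time
import urllib.request


def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model IDs reported by GET {base_url}/models."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url, attempts=30, delay=2.0,
                     probe=list_models, sleep=time.sleep):
    """Poll the server until it responds, then return its model IDs.

    `probe` and `sleep` are injectable so the loop can be exercised
    without a running server.
    """
    for _ in range(attempts):
        try:
            return probe(base_url)
        except OSError:
            # URLError, ConnectionError and TimeoutError all derive from OSError.
            sleep(delay)
    raise TimeoutError(f"{base_url} did not respond after {attempts} attempts")


if __name__ == "__main__":
    print(wait_until_ready("http://localhost:8000/v1"))
```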

Make your first Chat Completions call

Point clients at localhost:8000/v1.

Call the local OpenAI-compatible endpoint:

Terminal

$ curl http://localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role":"user","content":"Hello from vLLM!"}]
  }'
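The same request can be sent from Python without extra dependencies. This is a minimal sketch using only the standard library. It assumes the server from the previous step is listening on localhost:8000 and returns the usual Chat Completions response shape. build_chat_request and chat are illustrative names.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request equivalent to the curl call above."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("http://localhost:8000/v1",
               "meta-llama/Meta-Llama-3.1-8B-Instruct",
               "Hello from vLLM!"))
```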

[ MODELS ]

Available vLLM Models

Model	API ID	Best for
Llama 3.1 8B Instruct	meta-llama/Meta-Llama-3.1-8B-Instruct	Common starter model for vLLM.
Llama 3.3 70B Instruct	meta-llama/Llama-3.3-70B-Instruct	Larger production deployment.
Mistral 7B Instruct v0.3	mistralai/Mistral-7B-Instruct-v0.3	Efficient 7B instruct serving.
Qwen 2.5 7B Instruct	Qwen/Qwen2.5-7B-Instruct	Strong small model for coding.
Qwen 2.5 72B Instruct	Qwen/Qwen2.5-72B-Instruct	Large Qwen on multi-GPU vLLM.
DeepSeek V2.5	deepseek-ai/DeepSeek-V2.5	MoE model with vLLM support.

Models and availability change over time. See vLLM's supported models documentation for the latest list.

[ TROUBLESHOOTING ]

Troubleshooting Common vLLM Issues

Error	Likely Cause	What to Do
`401 Unauthorized`	Invalid or missing API key.	If the vLLM server was started with an API key, or the request goes through Bifrost, verify that the key or virtual key you send is correct.
`400 Bad Request`	Invalid request format or unsupported model.	Check the request format and confirm the model ID matches the model the server was launched with.
`429 Rate Limited`	Rate limit exceeded.	Implement exponential backoff. If the limit comes from a gateway in front of vLLM, raise it there or use Bifrost for load distribution.
`502/503 Service Error`	vLLM server temporarily unavailable.	Retry after a delay. Check that the vLLM process is running and reachable. Configure failover with Bifrost.
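The retry advice in the table can be sketched as a small wrapper. This is an illustrative example, not part of vLLM or Bifrost. HTTPStatusError is a hypothetical exception that your own send function would raise with the response status, and the delays and attempt count are arbitrary starting points.

```python
import random
import time

# Statuses from the table above that are worth retrying.
RETRYABLE = {429, 502, 503}


class HTTPStatusError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(send, max_attempts=5, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Call `send()` and retry retryable failures with exponential backoff.

    The delay doubles per attempt (capped at `max_delay`) and up to 50%
    random jitter is added so that many clients do not retry in lockstep.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return send()
        except HTTPStatusError as err:
            is_last = attempt == max_attempts - 1
            if err.status not in RETRYABLE or is_last:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

A 400 or 401 is raised immediately, since retrying a malformed or unauthorized request will not help.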

[ PRODUCTION-READY ]

Use vLLM with Bifrost

Bifrost is a drop-in gateway for vLLM: keep your OpenAI-compatible client code and change the base URL to your gateway. Bifrost handles cost tracking, virtual keys, budgets, and failover automatically.

Step 1: Start Bifrost and register vLLM

Run the Bifrost gateway and register your vLLM server's base URL in the Web UI.

Terminal

$ npx -y @maximhq/bifrost

OUTPUT

✓ Bifrost started
├─ HTTP server listening on http://localhost:8080
├─ Web UI available at   http://localhost:8080
└─ Configure providers and virtual keys in the dashboard

Add the vLLM integration in the Web UI. For details, read vLLM on Bifrost.

Step 2: Point your OpenAI SDK client at Bifrost

Update your OpenAI-compatible client to route through the Bifrost gateway.

example.py

from openai import OpenAI

client = OpenAI(
    api_key="sk-bf-your-virtual-key",
    base_url="http://localhost:8080/vllm"
)

response = client.chat.completions.create(
    model="vllm/meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from Bifrost!"}]
)

print(response.choices[0].message.content)

Virtual keys can be sent as x-bf-vk or Authorization: Bearer sk-bf-* per the Bifrost documentation.
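For clients that set headers by hand rather than through an SDK, the two options can be sketched as follows. The header names come from the sentence above; the helper name and its style argument are illustrative, and the key value is a placeholder.

```python
def virtual_key_headers(virtual_key: str, style: str = "x-bf-vk") -> dict[str, str]:
    """Return the HTTP header that carries a Bifrost virtual key.

    style="x-bf-vk" -> dedicated header
    style="bearer"  -> standard Authorization header, as OpenAI SDKs send it
    """
    if style == "x-bf-vk":
        return {"x-bf-vk": virtual_key}
    if style == "bearer":
        return {"Authorization": f"Bearer {virtual_key}"}
    raise ValueError(f"unknown style: {style!r}")


if __name__ == "__main__":
    # Placeholder key, matching the example above.
    print(virtual_key_headers("sk-bf-your-virtual-key", style="bearer"))
```

The OpenAI SDK example above already uses the bearer form, because the SDK sends api_key in the Authorization header.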

[ WHAT'S NEXT ]

Explore Bifrost Resources

Your vLLM server is running behind Bifrost. Add governance, guardrails, and MCP controls for production.

Access Control

Governance

Virtual keys, budgets, rate limits, routing, and enterprise RBAC with SSO.

Security

Guardrails

PII detection, content moderation, prompt injection defense, and compliance.

MCP

MCP Gateway

High-performance tool execution for AI agents with approvals and audit trails.

Ready to Route vLLM Through Bifrost?

Bifrost is open source and production-ready. Get started in minutes with cost tracking, virtual keys, and failover built in.

[ BIFROST FEATURES ]

Open Source & Enterprise

Everything you need to run AI in production, from free open source to enterprise-grade features.

01 Governance

SAML support for SSO, plus role-based access control and policy enforcement for team collaboration.

02 Adaptive Load Balancing

Automatically optimizes traffic distribution across provider keys and models based on real-time performance metrics.

03 Cluster Mode

High availability deployment with automatic failover and load balancing. Peer-to-peer clustering where every instance is equal.

04 Alerts

Real-time notifications for budget limits, failures, and performance issues on Email, Slack, PagerDuty, Teams, Webhook and more.

05 Log Exports

Export and analyze request logs, traces, and telemetry data from Bifrost with enterprise-grade data export capabilities for compliance, monitoring, and analytics.

06 Audit Logs

Comprehensive logging and audit trails for compliance and debugging.

07 Vault Support

Secure API key management with HashiCorp Vault, AWS Secrets Manager, Google Secret Manager, and Azure Key Vault integration.

08 VPC Deployment

Deploy Bifrost within your private cloud infrastructure with VPC isolation, custom networking, and enhanced security controls.

09 Guardrails

Automatically detect and block unsafe model outputs with real-time policy enforcement and content moderation across all agents.

[ SHIP RELIABLE AI ]

Try Bifrost Enterprise with a 14-day Free Trial

[quick setup]

Drop-in replacement for any AI SDK

Change just one line of code. Works with OpenAI, Anthropic, Vercel AI SDK, LangChain, and more.

import os
from anthropic import Anthropic

# Route Anthropic SDK traffic through the Bifrost gateway
anthropic = Anthropic(
    api_key=os.environ.get("ANTHROPIC_API_KEY"),
    base_url="https://<bifrost_url>/anthropic",
)

message = anthropic.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello, Claude"}
    ]
)

Drop in once, run everywhere.

[ FAQ ]

Frequently Asked Questions

vLLM is an open-source library for high-throughput LLM inference with PagedAttention and an OpenAI-compatible server.

Local vLLM often runs without auth. Put authentication and budgets at the Bifrost gateway instead.

An NVIDIA GPU with sufficient VRAM for your model is recommended. CPU-only setups are possible but slower.

vLLM typically serves one model per server process. To serve several models, run multiple vLLM instances or use Bifrost to route across hosts.

Tune tensor parallelism, batch size, and GPU count as described in the vLLM documentation.
