Can I use SGLang with Bifrost?

Register your SGLang base URL in Bifrost and route OpenAI-compatible requests via http://localhost:8080/sgl.

How to Set Up SGLang

Install SGLang on your GPU host, launch an OpenAI-compatible server for structured generation, then integrate with Bifrost for virtual keys, budgets, and multi-provider routing. Complete setup in under 20 minutes.

Self-hosted, Structured outputs, OpenAI compatible, GPU inference, Bifrost gateway

SGLang provider summary

Bifrost can front a self-hosted SGLang server so teams share one gateway with budgets, observability, and structured-output workloads.

Property	Details
Description	SGLang is an open-source serving framework for fast LLM inference with structured output support and an OpenAI-compatible HTTP API.
Provider route on Bifrost	sgl/<model>
Provider doc	SGLang
API endpoint for provider	http://localhost:8000/v1
Supported endpoints	/v1/models, /v1/completions, /v1/chat/completions, /v1/responses, /v1/embeddings

Official SGLang Resources

SGLang documentation, GitHub repository, and model hub references.

Prerequisites

Before you begin, you will need:

Linux or macOS (WSL2 on Windows)
Python 3.8+
NVIDIA GPU with 8GB+ VRAM (recommended)

No cloud API key: SGLang runs on your hardware. Authentication is optional; use Bifrost virtual keys at the gateway layer.

[ QUICK START ]

How Do You Set Up SGLang in 5 Steps?

Step 1: Install SGLang

Use pip on a GPU machine when possible.

On a machine with CUDA, install SGLang with extras. See the SGLang docs for hardware requirements.

Terminal (GPU)

$ pip install "sglang[all]"

For CPU-only hosts:

Terminal (CPU)

$ pip install sglang

Step 2: Choose a Hugging Face model

Pick a model ID from Hugging Face (for example meta-llama/Meta-Llama-3.1-8B-Instruct). Ensure you have access tokens if the model is gated.

Step 3: Launch the SGLang server

SGLang listens on port 30000 by default; this guide passes --port 8000 so the URLs below match.

Start the server with your model path. Wait until the logs show the server is ready.

Terminal

$ python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8000

Server ready: When startup completes, the OpenAI-compatible API is available at http://localhost:8000/v1.
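
Rather than watching the logs, you can poll the endpoint until it answers. The following is a minimal sketch, assuming the server exposes the /v1/models endpoint listed in the provider summary and returns the usual OpenAI-style list body; the helper names wait_until_ready and parse_model_ids are hypothetical, not part of SGLang.

```python
import json
import time
import urllib.error
import urllib.request
from typing import List


def parse_model_ids(body: str) -> List[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url: str = "http://localhost:8000/v1",
                     timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> List[str]:
    """Poll GET {base_url}/models until it answers, then return the model IDs."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                return parse_model_ids(resp.read().decode("utf-8"))
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            # Assumed cause: the server is still loading weights, so retry.
            time.sleep(interval_s)
    raise TimeoutError(f"{base_url}/models did not respond within {timeout_s}s")
```

Calling wait_until_ready() after launching the server blocks until the API responds and returns the served model IDs, which you can then pass as the "model" field in later requests.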

Step 4: Make your first Chat Completions call

Point clients at localhost:8000/v1.

Call the local OpenAI-compatible endpoint:

Terminal

$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role":"user","content":"Hello from SGLang!"}]
  }'
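
The same call can be made from Python without extra packages. This is a sketch using only the standard library, assuming the response follows the standard Chat Completions shape (choices[0].message.content); the function names are hypothetical helpers for illustration.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build (but do not send) a POST request for {base_url}/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body):
    """Pull the assistant text out of a Chat Completions response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(base_url, model, user_message):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return first_reply(resp.read().decode("utf-8"))
```

With the server from the previous step running, chat("http://localhost:8000/v1", "meta-llama/Meta-Llama-3.1-8B-Instruct", "Hello from SGLang!") sends the same request as the curl command above.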

Step 5: Use structured output

Constrain generation with SGLang programs or schema-aware APIs.

SGLang supports constrained decoding for JSON-like outputs. Example using the Python frontend:

example.py

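import json

import sglang as sgl

# Connect the frontend to the server launched earlier (port 8000 in this guide).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:8000"))

@sgl.function
def extract_info(s, text):
    # The regex only constrains the format, so give the model text to extract from.
    s += "Extract the person's name and age as JSON.\n" + text + "\n"
    s += sgl.gen(
        "output",
        max_tokens=100,
        # A tight pattern keeps the result a single flat, parseable JSON object.
        regex=r'\{"name": "[A-Za-z ]+", "age": [0-9]+\}',
    )

state = extract_info.run(text="Alice is 30 years old.")
result = json.loads(state["output"])
print(result)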
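
If you prefer to stay on the OpenAI-compatible HTTP API, a schema can be sent with the request instead. The sketch below assumes your SGLang version accepts the OpenAI-style response_format field with type "json_schema"; check the SGLang docs before relying on it. PERSON_SCHEMA, build_structured_payload, and parse_person are hypothetical names used for illustration.

```python
import json

# Illustrative schema: an object with a string name and an integer age.
PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}


def build_structured_payload(model, text):
    """Build a Chat Completions body that asks for schema-constrained JSON."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": f"Extract the person's name and age as JSON: {text}",
        }],
        # Assumption: the server honors OpenAI-style json_schema response formats.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "person", "schema": PERSON_SCHEMA},
        },
    }


def parse_person(content):
    """Parse the reply and re-check the required fields on the client side."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("name"), str):
        raise ValueError("name must be a string")
    age = data.get("age")
    if not isinstance(age, int) or isinstance(age, bool):
        raise ValueError("age must be an integer")
    return data
```

The client-side check in parse_person is deliberately redundant: it guards against a server that ignores the response_format field and returns free-form text.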

[ MODELS ]

Recommended SGLang Models

Model	API ID	Best for
Llama 3.1 8B Instruct	meta-llama/Meta-Llama-3.1-8B-Instruct	Common starter model for SGLang.
Mistral 7B Instruct v0.3	mistralai/Mistral-7B-Instruct-v0.3	Efficient 7B instruct serving.
Llama 3.3 70B Instruct	meta-llama/Llama-3.3-70B-Instruct	Larger production deployment.
Qwen 2.5 7B Instruct	Qwen/Qwen2.5-7B-Instruct	Strong small model for coding.
Llama 2 7B Chat	meta-llama/Llama-2-7b-chat-hf	Legacy 7B chat baseline (~14GB VRAM).
Code Llama 34B	codellama/CodeLlama-34b-Instruct-hf	Code-heavy workloads (high VRAM).

Models and VRAM requirements vary by quantization and tensor parallelism. See Hugging Face and the SGLang docs for deployment guidance.
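
As a rough worked example of how quantization and tensor parallelism change the numbers, the sketch below estimates memory for the weights alone. It assumes weights are split evenly across GPUs and ignores the KV cache, activations, and runtime overhead, so real usage is higher; weight_memory_gb is a hypothetical helper, not an SGLang function.

```python
def weight_memory_gb(params_billion, bits_per_param=16, tensor_parallel=1):
    """Approximate GB (10^9 bytes) per GPU needed to hold the model weights."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    # Assumption: tensor parallelism shards the weights evenly across GPUs.
    return total_bytes / 1e9 / tensor_parallel


# 7B parameters at 16 bits: about 14 GB, matching the Llama 2 7B row above.
print(weight_memory_gb(7))
# The same model quantized to 4 bits.
print(weight_memory_gb(7, bits_per_param=4))
# 70B parameters at 16 bits, sharded across 4 GPUs.
print(weight_memory_gb(70, tensor_parallel=4))
```

Treat the result as a lower bound when sizing hardware and leave headroom for the KV cache, which grows with context length and concurrency.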

[ TROUBLESHOOTING ]

Troubleshooting Common SGLang Issues

Error	Likely Cause	What to Do
`CUDA OOM`	Model exceeds available GPU memory.	Use a smaller model, enable quantization, or add GPUs per SGLang docs.
`CUDA not found`	NVIDIA drivers or CUDA toolkit missing.	Install drivers and CUDA, or use the CPU-only pip install path.
`Port 8000 in use`	Another process is already bound to port 8000.	Pass `--port 8001` or stop the conflicting process.
`Slow inference`	CPU fallback or undersized GPU for the model.	Confirm GPU is used, reduce model size, or tune batch settings in SGLang.
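
For the port conflict in particular, you can check before launching. This is a small sketch using the standard socket module; it only detects listeners reachable on the given host (127.0.0.1 is assumed here), and the helper names are hypothetical.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def first_free_port(start=8000, attempts=10, host="127.0.0.1"):
    """Return the first port in [start, start + attempts) with no listener."""
    for port in range(start, start + attempts):
        if not port_in_use(port, host):
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")
```

Pass the value returned by first_free_port() to --port when launching the server, and use the same port in your client base URL.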

[ PRODUCTION-READY ]

Use SGLang with Bifrost

Bifrost fronts your SGLang server: keep OpenAI-compatible client code and point the base URL at the gateway. Bifrost handles cost tracking, virtual keys, budgets, and failover automatically.

Step 1: Start Bifrost and register SGLang

Run the Bifrost gateway and add your SGLang base URL in the Web UI.

Terminal

$ npx -y @maximhq/bifrost

OUTPUT

✓ Bifrost started
├─ HTTP server listening on http://localhost:8080
├─ Web UI available at   http://localhost:8080
└─ Configure providers and virtual keys in the dashboard

Add the SGLang integration in the Web UI. For details, read SGLang on Bifrost.

Step 2: Point your OpenAI SDK at Bifrost

Update your client to route through Bifrost's SGLang-compatible gateway instead of localhost:8000 directly.

example.py

from openai import OpenAI

client = OpenAI(
    api_key="sk-bf-your-virtual-key",
    base_url="http://localhost:8080/sgl"
)

response = client.chat.completions.create(
    model="sgl/meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from Bifrost!"}]
)

print(response.choices[0].message.content)

Virtual keys can be sent as x-bf-vk or Authorization: Bearer sk-bf-* per the Bifrost documentation.
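
The two header styles can be sketched as follows. The header names come from the sentence above; the helper virtual_key_headers and its style argument are hypothetical, and the exact behavior should be confirmed against the Bifrost documentation.

```python
def virtual_key_headers(virtual_key, style="bearer"):
    """Return the HTTP header that carries a Bifrost virtual key."""
    if style == "bearer":
        # Sent the same way an OpenAI API key would be.
        return {"Authorization": f"Bearer {virtual_key}"}
    if style == "x-bf-vk":
        # Dedicated Bifrost header, leaving Authorization untouched.
        return {"x-bf-vk": virtual_key}
    raise ValueError(f"unknown style: {style!r}")


print(virtual_key_headers("sk-bf-your-virtual-key"))
print(virtual_key_headers("sk-bf-your-virtual-key", style="x-bf-vk"))
```

The bearer style is what the OpenAI SDK sends when the virtual key is passed as api_key, as in the example above. For the x-bf-vk style, the OpenAI Python client accepts extra headers through its default_headers constructor argument.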

[ WHAT'S NEXT ]

Explore Bifrost Resources

Your SGLang server is now routed through Bifrost. Add governance, guardrails, and MCP controls for production.

Access Control

Governance

Virtual keys, budgets, rate limits, routing, and enterprise RBAC with SSO.

Security

Guardrails

PII detection, content moderation, prompt injection defense, and compliance.

MCP

MCP Gateway

High-performance tool execution for AI agents with approvals and audit trails.

View all resources

Ready to Route SGLang Through Bifrost?

Bifrost is open source and production-ready. Get started in minutes with cost tracking, virtual keys, and failover built in.

[ BIFROST FEATURES ]

Open Source & Enterprise

Everything you need to run AI in production, from free open source to enterprise-grade features.

01 Governance

SAML support for SSO, plus role-based access control and policy enforcement for team collaboration.

02 Adaptive Load Balancing

Automatically optimizes traffic distribution across provider keys and models based on real-time performance metrics.

03 Cluster Mode

High availability deployment with automatic failover and load balancing. Peer-to-peer clustering where every instance is equal.

04 Alerts

Real-time notifications for budget limits, failures, and performance issues on Email, Slack, PagerDuty, Teams, Webhook and more.

05 Log Exports

Export and analyze request logs, traces, and telemetry data from Bifrost with enterprise-grade data export capabilities for compliance, monitoring, and analytics.

06 Audit Logs

Comprehensive logging and audit trails for compliance and debugging.

07 Vault Support

Secure API key management with HashiCorp Vault, AWS Secrets Manager, Google Secret Manager, and Azure Key Vault integration.

08 VPC Deployment

Deploy Bifrost within your private cloud infrastructure with VPC isolation, custom networking, and enhanced security controls.

09 Guardrails

Automatically detect and block unsafe model outputs with real-time policy enforcement and content moderation across all agents.

[ SHIP RELIABLE AI ]

Try Bifrost Enterprise with a 14-day Free Trial

[quick setup]

Drop-in replacement for any AI SDK

Change just one line of code. Works with OpenAI, Anthropic, Vercel AI SDK, LangChain, and more.

import os
from anthropic import Anthropic

# Point the Anthropic SDK at Bifrost instead of the provider's own endpoint.
anthropic = Anthropic(
    api_key=os.environ.get("ANTHROPIC_API_KEY"),
    base_url="https://<bifrost_url>/anthropic",
)

message = anthropic.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello, Claude"}
    ]
)

Drop in once, run everywhere.

[ FAQ ]

Frequently Asked Questions

What is SGLang?

SGLang is an open-source framework for efficient LLM serving with structured output guarantees, including JSON schemas and regex-constrained generation.

Does SGLang require an API key?

Local SGLang often runs without auth. Put authentication and budgets at the Bifrost gateway instead.

What hardware does SGLang need?

An NVIDIA GPU with at least 8GB VRAM is recommended for smaller models. CPU-only installs are possible but slower.

Does SGLang support structured output?

Yes. SGLang uses constrained decoding so outputs can match JSON schemas, regex patterns, and other structured formats.

Which models can SGLang serve?

Any Hugging Face-compatible checkpoint, including Llama, Mistral, Qwen, and other popular open-weight models.
