INFERENCE

Instantly access top open-source LLMs and serve fine-tuned models

W&B Inference powered by CoreWeave provides API and playground access to leading open-source LLMs, so you can develop AI applications and agents without signing up for a hosting provider or deploying models yourself. You can also bring your own trained Low-Rank Adaptation (LoRA) weights to run serverless inference with fine-tuned models.

Available models

NVIDIA Nemotron 3 Super 120B
Text | New | Mar 2026 | $0.20 input | $0.80 output | 262K context
Nemotron 3 is a LatentMoE model designed to deliver strong agentic, reasoning, and conversational capabilities.

MiniMax M2.5
Text | New | Feb 2026 | $0.30 input | $1.20 output | 197K context
MoE model with a highly sparse architecture designed for high throughput and low latency, with strong coding capabilities.

Z.AI GLM 5
Text | New | Feb 2026 | $1.00 input | $3.20 output | 203K context
Mixture-of-Experts model for long-horizon agentic tasks with strong performance on reasoning and coding.

Moonshot AI Kimi K2.5
Text, Vision | Jan 2026 | $0.50 input | $2.85 output | 262K context
Multimodal Mixture-of-Experts language model featuring 32 billion activated parameters and a total of 1 trillion parameters.

DeepSeek V3.1
Text | Aug 2025 | $0.55 input | $1.65 output | 128K context
A large hybrid model that supports both thinking and non-thinking modes via prompt templates.

OpenAI GPT OSS 20B
Text | Aug 2025 | $0.05 input | $0.20 output | 131K context
Lower-latency Mixture-of-Experts model trained on OpenAI’s Harmony response format with reasoning capabilities.

OpenAI GPT OSS 120B
Text | Aug 2025 | $0.15 input | $0.60 output | 131K context
Efficient Mixture-of-Experts model designed for high-reasoning, agentic, and general-purpose use cases.

Qwen3 30B A3B
Text | Jul 2025 | $0.10 input | $0.30 output | 262K context
Qwen3-30B-A3B-Instruct-2507 is a 30.5B MoE instruction-tuned model with enhanced reasoning, coding, and long-context understanding.

Qwen3 235B A22B-2507
Text | Jul 2025 | $0.10 input | $0.10 output | 262K context
Efficient multilingual, Mixture-of-Experts, instruction-tuned model, optimized for logical reasoning.

Qwen3 Coder 480B A35B
Text | Jul 2025 | $1.00 input | $1.50 output | 262K context
Mixture-of-Experts model optimized for agentic coding tasks such as function calling, tool use, and long-context reasoning.

Qwen3 235B A22B Thinking-2507
Text | Jul 2025 | $0.10 input | $0.10 output | 262K context
High-performance Mixture-of-Experts model optimized for structured reasoning, math, and long-form generation.

OpenPipe Qwen3 14B Instruct
Text | Apr 2025 | $0.05 input | $0.22 output | 33K context
An efficient multilingual, dense, instruction-tuned model, optimized by OpenPipe for building agents with fine-tuning.

Meta Llama 4 Scout
Text, Vision | Apr 2025 | $0.17 input | $0.66 output | 64K context
Multimodal model integrating text and image understanding, ideal for visual tasks and combined analysis.

Microsoft Phi 4 Mini 3.8B
Text | Feb 2025 | $0.08 input | $0.35 output | 128K context
Compact, efficient model ideal for fast responses in resource-constrained environments.

Meta Llama 3.3 70B
Text | Dec 2024 | $0.71 input | $0.71 output | 128K context
Multilingual model excelling in conversational tasks, detailed instruction-following, and coding.

Meta Llama 3.1 70B
Text | Jul 2024 | $0.80 input | $0.80 output | 128K context
Efficient conversational model optimized for responsive multilingual chatbot interactions.

Meta Llama 3.1 8B
Text | Jul 2024 | $0.22 input | $0.22 output | 128K context
Efficient conversational model optimized for responsive multilingual chatbot interactions.
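To compare the listed prices for a particular workload, a rough estimate can be computed as in the sketch below. It assumes the input and output prices are US dollars per one million tokens, which the list itself does not state, so confirm the unit on the pricing page before relying on the numbers. The function name `estimate_cost_usd` and the token counts are illustrative, not part of any W&B API.

```python
def estimate_cost_usd(input_tokens, output_tokens, input_price, output_price,
                      tokens_per_unit=1_000_000):
    """Estimate the cost of a workload from token counts and listed prices.

    Assumption: prices are quoted per `tokens_per_unit` tokens (1M by default).
    """
    total = input_tokens * input_price + output_tokens * output_price
    return total / tokens_per_unit


# (input price, output price) copied from the model list;
# the per-1M-token unit is an assumption.
PRICES = {
    "OpenAI GPT OSS 20B": (0.05, 0.20),
    "Meta Llama 3.1 8B": (0.22, 0.22),
    "Z.AI GLM 5": (1.00, 3.20),
}

if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens and 500K output tokens.
    for name, (input_price, output_price) in PRICES.items():
        cost = estimate_cost_usd(2_000_000, 500_000, input_price, output_price)
        print(f"{name}: ${cost:.2f}")
```

Because output tokens are usually priced higher than input tokens, workloads that generate long responses can rank the models differently than prompt-heavy ones.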

				
import openai
import weave

# Weave autopatches OpenAI to log calls to Weave
weave.init("<team>/<project>")
client = openai.OpenAI(
    # The custom base URL points to Inference
    base_url='https://api.inference.wandb.ai/v1',
    # Get your API key from https://wandb.ai/authorize
    # Consider setting it in the environment as OPENAI_API_KEY instead for safety
    api_key="<your-apikey>",
    # Team and project are required for usage tracking
    project="<team>/<project>",
)
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."}
    ],
)
print(response.choices[0].message.content)
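The comment in the snippet above suggests keeping the API key in the `OPENAI_API_KEY` environment variable rather than in source code. One way to do that is sketched below. The helper name `load_api_key` is hypothetical; it only wraps a standard environment lookup so that a missing key fails early with a clear message.

```python
import os


def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set {var_name} to your W&B API key before creating the client."
        )
    return key


# Usage with the client from the snippet above (not executed here):
# client = openai.OpenAI(
#     base_url="https://api.inference.wandb.ai/v1",
#     api_key=load_api_key(),
#     project="<team>/<project>",
# )
```

The OpenAI Python SDK also reads `OPENAI_API_KEY` on its own when `api_key` is omitted, so the helper is mainly useful for producing an explicit error.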

Quickly explore and switch new models

New models with better performance and pricing pop up all the time, but each new model means another provider, another account, and another API key to deal with.

W&B Inference powered by CoreWeave hosts popular open-source models on CoreWeave infrastructure that you can access with your existing Weights & Biases account through the SDK or the UI. Test and switch between models quickly without signing up for additional API keys or hosting models yourself.
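Because every hosted model sits behind the same endpoint, switching models comes down to changing the `model` string. The sketch below sends one prompt to several models through a single client. The helper name `ask_each_model` is hypothetical, and any model ID other than the one shown in the sample code on this page is a placeholder to look up in the model catalog.

```python
def ask_each_model(client, model_ids, prompt):
    """Send the same prompt to each model and collect replies by model ID.

    `client` is any object exposing the OpenAI-compatible
    `chat.completions.create` method, such as `openai.OpenAI`.
    """
    replies = {}
    for model_id in model_ids:
        response = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
        )
        replies[model_id] = response.choices[0].message.content
    return replies


# Usage (not executed here; requires the openai package and a valid key):
# import openai
# client = openai.OpenAI(
#     base_url="https://api.inference.wandb.ai/v1",
#     api_key="<your-apikey>",
#     project="<team>/<project>",
# )
# replies = ask_each_model(
#     client,
#     ["moonshotai/Kimi-K2-Instruct", "<another-model-id>"],
#     "Tell me a joke.",
# )
```

Passing the client in as an argument keeps the helper independent of credentials, so it can be exercised with a stub object.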

Access models in playground with zero configuration

Explore open-source models instantly in the playground. No model endpoints or access keys required.

Skip the hassle of configuring model endpoints and custom providers. Your Weights & Biases account gives you instant access to a wide selection of powerful open-source foundation models, fully hosted on our infrastructure, with zero configuration needed.

				
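from openai import OpenAI

# Placeholders: replace with your own team, project, and API key
WB_TEAM = "<team>"
WB_PROJECT = "<project>"
API_KEY = "<your-apikey>"

# The model name points at a LoRA artifact stored in your W&B project
model_name = f"wandb-artifact:///{WB_TEAM}/{WB_PROJECT}/qwen_lora:latest"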

client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=API_KEY,
    project=f"{WB_TEAM}/{WB_PROJECT}",
)

resp = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Say 'Hello World!'"}],
)
print(resp.choices[0].message.content)

Serverless LoRA inference

The evolving continuous learning paradigm for iterating on agents requires AI engineers to switch frequently between training and inference. In practice, that means building a complex pipeline to fetch the latest weights, hot-swap them for inference, and resume training.

With W&B Inference, we handle that complexity for you. Bring your own LoRA weights to serve fine-tuned models without setting up and scaling serving infrastructure for every LoRA iteration.
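The LoRA sample code on this page selects the fine-tuned model through a `wandb-artifact:///` name built from the team, project, artifact name, and alias. A minimal sketch of a helper that assembles that name is shown below. The function name `lora_model_name` is hypothetical, and the format is taken only from that sample, so treat the docs as the authority on which artifact references are accepted.

```python
def lora_model_name(team, project, artifact, alias="latest"):
    """Build a model name of the form used in the LoRA sample on this page:

        wandb-artifact:///<team>/<project>/<artifact>:<alias>
    """
    for label, value in (("team", team), ("project", project),
                         ("artifact", artifact), ("alias", alias)):
        if not value:
            raise ValueError(f"{label} must be a non-empty string")
    return f"wandb-artifact:///{team}/{project}/{artifact}:{alias}"


# Example: point successive requests at a different LoRA iteration by
# changing only the alias. The names here are placeholders.
latest = lora_model_name("my-team", "my-project", "qwen_lora")
pinned = lora_model_name("my-team", "my-project", "qwen_lora", alias="v3")
```

Pinning an explicit alias instead of `latest` makes it easier to compare two LoRA iterations side by side while training continues.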

Easily iterate on AI applications that use open-source models

LLM-powered apps need observability tools, but open-source model hosting providers don’t offer them, forcing developers to juggle disconnected platforms for hosting and observability.

W&B Inference runs directly on CoreWeave infrastructure with observability built in through W&B Weave, so you can evaluate, monitor, and iterate on AI applications and agents without extra instrumentation, fragmented workflows, or added complexity.

Get started for free

Experimentation can quickly get expensive when every new model you test comes with a separate pricing plan.

We host the latest models, ready for inference within your existing Weights & Biases subscription, so costs stay low and simple under a single plan rather than spread across multiple providers.

See our pricing page for more information.

The Weights & Biases end-to-end AI developer platform

Weave
Traces: Debug agents and AI applications
Evaluations: Rigorous evaluations of agentic AI systems
Playground: Explore prompts and models
Agents: Observability tools for agentic systems
Guardrails: Block prompt attacks and harmful outputs
Monitors: Continuously improve in prod

Models
Experiments: Track and visualize your ML experiments
Sweeps: Optimize your hyperparameters
Tables: Visualize and explore your ML data

Core
Inference: Explore hosted, open-source LLMs
Registry: Publish and share your AI models and datasets
Artifacts: Version and manage your AI pipelines
Reports: Document and share your AI insights
SDK: Log AI experiments and artifacts at scale
Automations: Trigger workflows automatically

The Weights & Biases platform helps you streamline your workflow from end to end
