Deliver AI with confidence

Evaluate, monitor, and iterate on AI applications. Get started with one line of code.
import weave
weave.init("quickstart")

@weave.op()
def llm_app(prompt):
    # Call your LLM of choice here; Weave traces the inputs, outputs, and latency
    ...
Keep an eye on your AI

Improve quality, cost, latency, and safety

Weave works with any LLM and framework, and ships with a broad set of integrations out of the box

Quality

Accuracy, robustness, relevancy

Cost

Token usage and estimated cost

Latency

Track response times and bottlenecks

Safety

Protect your end users using guardrails

Anthropic
Cohere
Groq
EvalForge
LangChain
OpenAI
Together
LlamaIndex
Mistral AI
Crew AI
OpenTelemetry
Evaluations

Measure and iterate

Visual comparisons

Use powerful visualizations for objective, precise comparisons

Automatic versioning

Save versions of your datasets, code, and scorers

import openai, weave
weave.init("weave-intro")

# The weave.op decorator logs each call's inputs, outputs, latency, and token usage
@weave.op
def correct_grammar(user_input):
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="o1-mini",
        messages=[{
            "role": "user", 
            "content": "Correct the grammar:\n\n" + 
            user_input,
        }],
    )
    return response.choices[0].message.content.strip()

result = correct_grammar("That was peace of cake!")
print(result)
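
For example, the correct_grammar op above can be scored against a small dataset with weave.Evaluation. A minimal sketch building on the snippet above: the dataset rows and the exact_match scorer are illustrative, and it assumes scorers receive dataset columns plus the model result as the output argument.

import asyncio

# Illustrative dataset; column names match the op's and the scorer's parameters
dataset = [
    {"user_input": "That was peace of cake!", "expected": "That was a piece of cake!"},
    {"user_input": "She don't like apples.", "expected": "She doesn't like apples."},
]

@weave.op
def exact_match(expected, output):
    return {"correct": output == expected}

evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])
asyncio.run(evaluation.evaluate(correct_grammar))

Each run versions the dataset, the op, and the scorer, which is what keeps the visual comparisons above reproducible.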

Playground

Iterate on prompts in an interactive chat interface with any LLM

Leaderboards

Group evaluations into leaderboards that surface the best performers, and share them across your organization

Tracing and monitoring

Log everything for production monitoring and debugging

Debugging with trace trees

Weave organizes logs into an easy-to-navigate trace tree so you can quickly identify issues

Trace tree
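
The tree comes from nesting weave.op functions: every inner call is logged as a child of its caller. A minimal sketch with placeholder retrieval and generation steps:

import weave

weave.init("quickstart")

@weave.op
def retrieve(query):
    # Placeholder retrieval step
    return ["doc about " + query]

@weave.op
def generate(query, docs):
    # Placeholder generation step
    return f"answer to {query!r} using {len(docs)} docs"

@weave.op
def rag_pipeline(query):
    # retrieve and generate appear as children of rag_pipeline in the trace tree
    docs = retrieve(query)
    return generate(query, docs)

rag_pipeline("weave tracing")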

Multimodality

Track any modality: text, code, documents, images, and audio, with more modalities coming soon

Easily work with long-form text

View large strings like documents, emails, HTML, and code in their original format
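
As a sketch of how multimodal values show up in traces: Weave renders rich types such as PIL images in the trace view, so an op that takes or returns an image is logged with the image itself rather than just a repr. The thumbnail op here is illustrative.

import weave
from PIL import Image

weave.init("quickstart")

@weave.op
def make_thumbnail(image: Image.Image) -> Image.Image:
    # Both the input image and the resized output are viewable in the trace UI
    return image.resize((128, 128))

make_thumbnail(Image.new("RGB", (1024, 1024), color="white"))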

Online evaluations

Score live incoming production traces for monitoring without impacting performance

Online evaluations
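
One way to attach scores to live traffic, sketched under the assumption that an op's .call() returns the result together with its logged Call, and that call.apply_scorer() asynchronously records a scorer's output against that trace; the length check stands in for a real production monitor.

import asyncio
import weave

weave.init("quickstart")

class LengthScorer(weave.Scorer):
    # Stand-in monitor: flag overly long responses
    @weave.op
    def score(self, output: str) -> dict:
        return {"too_long": len(output) > 2000}

@weave.op
def respond(prompt: str) -> str:
    return "..."  # your production LLM call goes here

# .call() returns the result plus the logged Call the score attaches to
result, call = respond.call("hello")
asyncio.run(call.apply_scorer(LengthScorer()))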
Agents

Observability and governance tools for agentic systems

Build state-of-the-art agents

Supercharge your iteration speed and top the charts

SWE-Bench Leaderboard

Agent framework and protocol agnostic

Integrates with leading agent frameworks such as the OpenAI Agents SDK and protocols such as MCP

import asyncio

from pydantic import BaseModel
from agents import Agent, Runner, function_tool, set_trace_processors
import weave
from weave.integrations.openai_agents.openai_agents import WeaveTracingProcessor

weave.init("openai-agents")
# Send all OpenAI Agents SDK traces to Weave
set_trace_processors([WeaveTracingProcessor()])

class Weather(BaseModel):
    city: str
    temperature_range: str
    conditions: str

@function_tool
def get_weather(city: str) -> Weather:
    return Weather(city=city, temperature_range="14-20C", conditions="Sunny with wind.")

agent = Agent(
    name="Hello world",
    instructions="You are a helpful agent.",
    tools=[get_weather]
)

async def main():
    result = await Runner.run(agent, input="What's the weather in Tokyo?")    
    print(result.final_output)

if __name__ == "__main__":
    asyncio.run(main())

Trace trees purpose-built for agentic systems

Easily visualize agent rollouts to pinpoint issues and improvements

Agent trace tree visualization
Scoring

Use our scorers or bring your own

Pre-built scorers

Jumpstart your evals with out-of-the-box scorers built by our experts

Toxicity
Hallucinations
Content Relevance

Write your own scorers

Near-infinite flexibility to build custom scoring functions to suit your business

import weave, openai

llm_client = openai.OpenAI()

@weave.op()
def evaluate_output(generated_text, reference_text):
    """
    Evaluates AI-generated text against a reference answer.
    
    Args:
        generated_text: The text generated by the model
        reference_text: The reference text to compare against
        
    Returns:
        float: A score between 0 and 10
    """
    system_prompt = """You are an expert evaluator of AI outputs.
    Your job is to rate AI-generated text on a scale of 0-10.
    Base your rating on how well the generated text matches 
    the reference text in terms of factual accuracy,
    comprehensiveness, and conciseness."""
    
    user_prompt = f"""Reference: {reference_text}
    
    AI Output: {generated_text}
    
    Rate this output from 0-10:"""
    
    response = llm_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2
    )
    
    # Extract the score from the response
    score_text = response.choices[0].message.content
    # Parse score (assuming it returns a number between 0-10)
    try:
        score = float(score_text.strip())
        return min(max(score, 0), 10)  # Clamp between 0 and 10
    except ValueError:
        # Fallback score if the reply isn't a bare number
        return 5.0

Human feedback

Collect user and expert feedback for real-life testing and evaluation

Human feedback
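
Feedback can also be attached to traces programmatically. A minimal sketch, assuming the call-level feedback API; the call ID is a placeholder you would copy from the Weave UI.

import weave

client = weave.init("quickstart")

# Look up a logged call by ID (placeholder shown)
call = client.get_call("<call_id>")

call.feedback.add_reaction("👍")           # end-user reaction
call.feedback.add_note("Reviewed by SME")  # free-form expert note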

Third-party scorers

Plug and play off-the-shelf scoring functions from other vendors

RAGAS
EvalForge
LangChain
LlamaIndex
HEMM
Inference (preview)

Access popular open-source models

Playground or API access

Access to leading open-source foundation models

Inference playground
OpenAI GPT OSS 120B
OpenAI GPT OSS 20B
Qwen3 235B A22B Thinking-2507
Qwen3 Coder 480B A35B
Qwen3 235B A22B-2507
MoonshotAI Kimi K2
DeepSeek R1-0528
DeepSeek V3-0324
Llama 3.1 8B
Llama 3.3 70B
Llama 4 Scout
Phi 4 Mini
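
The models above are served through the playground and an OpenAI-compatible API. A minimal sketch, assuming the api.inference.wandb.ai/v1 endpoint and an illustrative model ID; check the Inference docs for the exact model names and any required project settings.

import openai

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",  # W&B Inference endpoint (assumed)
    api_key="<your W&B API key>",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model ID
    messages=[{"role": "user", "content": "Say hello from W&B Inference."}],
)
print(response.choices[0].message.content)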