AI agents in healthcare: Enhancing patient outcomes and streamlining operations

AI agents are rapidly transforming the healthcare landscape, ushering in a new era of innovation and efficiency. These intelligent tools, capable of processing vast amounts of medical data and learning from complex patterns, are redefining how care is delivered to patients and managed by providers. By integrating AI agents into healthcare systems, hospitals and clinics can harness advanced decision-making support, enabling clinicians to make faster, more accurate diagnoses and create personalized treatment plans tailored to each patient’s unique needs.

Doctor typing on a virtual terminal.

For patients, AI agents promise a more proactive and connected healthcare experience; reminding them of important checkups, monitoring their well-being in real time, and prompting early interventions before problems escalate. Healthcare providers, on the other hand, benefit from reduced administrative burdens, enhanced data-driven insights, and greater operational efficiency across their practices. From streamlining scheduling and claims processing to flagging potential medical errors and suggesting next steps, AI agents are poised to revolutionize the delivery and management of healthcare. As the adoption of AI becomes more widespread, its intelligent integration within electronic health records and healthcare workflows holds the potential to dramatically improve outcomes, patient satisfaction, and the overall quality of care.

How AI Agents Transform Healthcare: In-Depth Examples and Key Challenges

AI agents are on the verge of fundamentally advancing nearly every aspect of patient care and medical research. While these capabilities are not yet implemented widely, the technology is fast approaching a level where these once aspirational examples become reality.

Catching Doctor Errors and Enhancing Clinical Safety

AI agents can play a crucial role as a safety net for healthcare providers. Imagine a situation where a patient visits their physician with vague symptoms like fatigue and mild shortness of breath. Even the most experienced doctors may occasionally overlook necessary diagnostic steps for less common conditions.

Here, an AI agent, integrated into the EHR, can review the physician’s notes in real time. If key diagnostic tests—such as a ferritin level to assess for anemia or an age-appropriate cancer screening—are missing, the AI flags this immediately and prompts the clinician to consider these additional options.

The following Python code illustrates how an AI assistant can review clinical notes and lab results to identify potentially missed diagnostic steps and generate appropriate alerts. It also shows how the AI can translate complex medical information into plain language for patients, empowering them to understand their health better.

# Required libraries
import litellm
import weave; weave.init(“healthcare_agents”)
 
# Set your model
model = “openai/gpt-4o”
 
# Sample clinical notes and labs
doctor_notes = “”
Patient: Jane Doe, 47yo F. Presents with fatigue and mild shortness of breath.
Exam normal. Labs ordered: CBC, BMP. Awaiting results. No further workup planned at this time.
PMH: None; Meds: None; FamHx: Noncontrib.
“”
 
lab_results = {
“Hemoglobin”: 10.6,
“Ferritin”: None,
“Creatinine”: 1.10,
“eGFR”: 58,
}
 
def generate_clinical_safety_prompt(notes, labs):
return f””“You are an experienced virtual clinical assistant. Review the following scenario.
 
Provider’s note:
{notes}
 
Lab results:
{labs}
 
1. Are important diagnostic steps missing (e.g., tests not ordered that could help explain the symptoms)?
2. What specific recommendation should the physician consider adding based on standard guidelines?
3. Write a concise, polite alert for the clinician.
 
Structure your answer as:
A) Clinical Review (13 sentences)
B) Recommendation(s)
C) Alert to display in EHR system
“”
 
def generate_patient_message(labs):
return f””“You are a virtual assistant helping patients understand lab results in plain English.
 
Given these lab findings:
{labs}
 
Write a brief, reassuring message a provider could send to the patient about their kidney function, in a way that suggests a conversation with their doctor without causing unnecessary alarm.
“”
 
# Logging with Weave
@weave.op()
def review_clinical_safety(notes, labs):
prompt = generate_clinical_safety_prompt(notes, labs)
response = litellm.completion(
model=model,
messages=[
{“role”: “system”, “content”: “You are an expert virtual medical assistant helping reduce diagnostic errors in primary care.”},
{“role”: “user”, “content”: prompt}
],
stream=False,
max_tokens=512
)
return response[‘choices’][0][‘message’][‘content’]
 
@weave.op()
def explain_to_patient(labs):
prompt = generate_patient_message(labs)
response = litellm.completion(
model=model,
messages=[
{“role”: “system”, “content”: “You help explain lab results in simple, reassuring language to patients.”},
{“role”: “user”, “content”: prompt}
],
stream=False,
max_tokens=256
)
return response[‘choices’][0][‘message’][‘content’]
 
print(“***** Clinical Safety Net Demonstration *****”)
print(“\n[LLM is reviewing the physician’s documentation…]\n”)
 
# Run and log through Weave
clinical_out = review_clinical_safety(doctor_notes, lab_results)
print(“AI Assistant (Clinician-facing):”)
print(clinical_out)
 
print(“\n—————————————-\n”)
 
patient_out = explain_to_patient(lab_results)
print(“AI Assistant (Patient-facing message):”)
print(patient_out)
 

After running this script, you can visualize how your model is responding within the Weave dashboard. Weave is a powerful tool that helps developers track, visualize, and evaluate their AI agents and language model applications. It captures many important details about the agent, including the prompts sent to the model, the responses received, and any tools the agent uses. This allows you to inspect the agent’s reasoning, debug issues, and ensure it’s performing as expected, offering crucial transparency for high-stakes applications like healthcare.


Still, integrating such safety measures into daily clinical workflows is not trivial. Excessive alerts can lead to “alarm fatigue,” making clinicians less likely to take warnings seriously. Legal questions about who is ultimately responsible for following up on these recommendations also remain unresolved.

Transforming Diagnostics Through Data-Driven Insights


The shift toward personalized diagnostics is becoming possible thanks to AI’s ability to process diverse streams of health data. Wearable devices, for instance, are now capable of tracking heart rhythms, sleep quality, HRV, and changes in activity. AI agents can analyze this information and identify early red flags, such as arrhythmias or sleep apnea, often before a patient is even aware of a problem. If a smartwatch detects an unusual heart rhythm at night, the AI agent can immediately notify both the patient and their care team, potentially preventing complications like stroke.


Genetics is becoming more precise and actionable thanks to AI. Software like Emedgene analyzes a patient’s genetic sequencing data to quickly identify disease-causing mutations, helping diagnose rare genetic conditions that might otherwise be missed. Another breakthrough is DeepMind’s AlphaFold, which predicts how genetic sequences fold into protein structures – this is crucial because misfolded proteins often cause diseases, and understanding their shape helps scientists develop targeted treatments.

 

These AI tools scan through genetic variants and match them against databases of known mutations, helping doctors flag specific risks and develop more targeted screening and prevention plans. For instance, if the analysis reveals mutations linked to increased cancer risk, doctors can implement personalized monitoring schedules instead of following one-size-fits-all guidelines. The real power of AlphaFold lies in drug development and understanding disease mechanisms – by accurately predicting protein structures, it helps researchers understand how genetic mutations affect protein function and develop more effective treatments


Imaging also has huge potential to benefit from AI. With access to full-body MRI data, the agent can scan for early signs of tumors or other changes. Humans are capable of doing this task, however the reality is that it is prohibitively expensive for many. AI agents can also consider multiple tests over a long time period, and intelligently recommend further investigation based on trends rather than relying solely on isolated test results or subjective human interpretation. The advantage AI brings to imaging is its ability to detect subtle changes that might go unnoticed and to synthesize patterns across time and different modalities. Instead of a doctor looking at a single MRI scan in isolation, an AI system can compare scans from months or years apart, noticing gradual but significant trends.


However, deploying these capabilities presents challenges. Algorithms are only as fair and accurate as the data used to train them, and there is a real risk of AI perpetuating existing health disparities if datasets lack diversity. Integrating information from a variety of devices and health records raises significant questions about data standardization and patient privacy, requiring robust safeguards for personally identifiable information (PII).

Robotics for Research and Diagnostics


Many of the limitations in medical research stem from the sheer number of experiments required to fully understand complex biological processes and validate new treatments. Traditional research workflows can be slow, labor-intensive, and constrained by human time and resources. AI-driven robots are beginning to automate complex tasks in research laboratories and clinical diagnostics. For example, robotic arms, steered by AI, have enabled the rapid testing of thousands of drug combinations, accelerating discovery while minimizing human error. This high-speed, high-precision automation not only increases throughput but also enhances reproducibility.


This concept – turning the physical world into a software development kit– will extend to diagnostics, too. Imagine a laboratory where sample handling, experiment execution, and even interpretation of results are seamlessly automated using a fusion of robotics and AI. This could democratize access to sophisticated testing, especially in remote or resource-limited settings. Currently, doctors have to essentially make decisions while also considering patient costs, which inevitably will lead to situations where tests must be skipped in cases where financial constraints limit diagnostic options.
Overall, these advances still require significant investment, meticulous calibration, and rigorous oversight to ensure quality and safety. A malfunction or systematic error in an automated process could scale rapidly if not properly monitored. Validation, standards-setting, and regular audits become critical when physical processes are handled predominantly by machines.

Navigating the Path Forward: Challenges and Risks


While the potential for AI in healthcare is enormous, it comes with important challenges. Bias remains a serious concern – algorithms trained on incomplete or non-representative data can perpetuate or worsen health inequities. Patient privacy is paramount, especially as more-sensitive data from wearables, genetics, and imaging gets integrated; strong data governance and security protocols are essential.


There are also important questions about trust and accountability. For clinicians and patients to embrace AI-driven recommendations, these systems must be transparent, explainable, and clinically validated. Regulatory hurdles remain, with healthcare leaders and policymakers rightly demanding a high bar for evidence, safety, and ethical safeguards.
Ultimately, the next few years will be pivotal. If these technological and ethical challenges can be resolved, AI agents will not only optimize operational efficiency and safety but truly transform the practice and science of medicine, delivering more precise, responsive, and personalized care to patients everywhere.

Benefits of Integrating AI Agents within EHRs


Integrating AI agents with Electronic Health Records (EHRs) unlocks a powerful new dimension in both individual patient care and population health. Unlike traditional EHRs, which mainly function as digital filing cabinets, EHRs powered by AI become predictive, interactive, and intelligent clinical partners.


A key advantage is the potential for early disease detection and proactive intervention. AI agents, trained on vast amounts of anonymized EHR data spanning tens or hundreds of thousands of patients, can surface subtle health trends and early warning signs that are impossible for humans to detect at scale. For example, AI might analyze vital signs, lab results, and prior visit notes to alert a primary care physician about a patient’s increased risk for heart failure or diabetes, months before symptoms appear.


By reviewing millions of prior outcomes, LLM-powered agents are adept at spotting correlations that can inform current treatment choices. If a patient is eligible for several different medications, an AI agent can highlight which drug, based on real-world historic data within the EHR, has tended to yield the best outcomes for people with similar profiles and co-existing conditions. This means clinicians are empowered to make evidence-based choices in situations where the “best” option is not obvious.


AI-powered EHR integration also shines in closing care gaps. For example, when reviewing chart notes, a well-trained AI might recognize that certain diagnostic tests—such as a screening colonoscopy or HbA1c check for diabetes have not been performed, even though guidelines or comparable patient cases suggest they should be. The agent can flag these missing tests and prompt clinicians to consider them. Similarly, if prior patient records show that a rarely ordered test helped diagnose a hard-to-detect disease, AI can suggest ordering it in similar new cases.


Beyond supporting physicians, AI is positioned to enhance patient engagement and safety. The agent can translate complex test results or trends into actionable, plain-language prompts. For instance, after bloodwork, a patient might receive a message: “Your liver enzymes are higher than last year. Ask your doctor about whether a follow-up test or a medication review is recommended.” This encourages patients to play a more active role in their health and addresses issues promptly.


For healthcare providers, the integration of AI into EHRs ultimately reduces cognitive burden, helps prevent missed diagnoses, and supports more precise, individualized care planning. For patients, it means earlier warnings, clearer communication, and an overall higher standard of care. As these systems continue to mature, both clinicians and patients will benefit from recommendations and insights that draw on the collective intelligence of millions of medical experiences unlocking a new era of data-driven, proactive healthcare.

 

Streamlining Administrative Tasks with AI Agents


One of the biggest challenges facing Western healthcare systems—including those in the United States—is the overwhelming administrative burden that inflates costs and distracts from direct patient care. Much of the healthcare dollar, sometimes as much as 25% in the U.S., goes toward paperwork, billing, insurance claims, appointment management, and other back-office functions. AI agents are positioned to revolutionize this side of medicine, automating repetitive processes, increasing efficiency, and freeing up valuable human resources for what truly matters: patient outcomes.

 

Proactive Appointment Management and Patient Engagement


AI agents can monitor electronic health records and administrative databases to spot patients who might be overdue for important checkups, screenings, or lab tests. Rather than waiting for overworked staff or patients themselves to notice a gap, the AI can send an automated, personalized message. This proactive approach ensures patients receive just-in-time prompts and reassurance, lowering barriers to care and catching health problems earlier.


The Python code below demonstrates how an AI agent can generate a personalized outreach message to a patient. By feeding in a simulated patient profile, the function uses a large language model to craft a friendly, non-judgmental reminder for overdue bloodwork and a follow-up appointment, all in the patient’s preferred language. This automated process ensures consistent, timely communication, freeing up staff and improving patient adherence to preventative care.

# pip install litellm weave
 
import litellm
import weave
 
weave.init(“healthcare_agents”) # Initialize Weave project
 
# Simulated EHR-derived data
patient_profile = {
“name”: “Maria Gomez”,
“last_lab_date”: “2023-04-10”,
“today”: “2024-06-14”,
“age”: 52,
“primary_language”: “English”,
“chronic_conditions”: [“hypertension”],
“preferred_contact”: “SMS”,
}
 
def overdue_prompt(patient):
return (
f””“You are a friendly virtual assistant for a doctor’s office.
Patient: {patient[‘name’]}, {patient[‘age’]} years old. Last labs drawn: {patient[‘last_lab_date’]}.
Patient has hypertension and hasn’t had a recent doctor visit, and may be overdue for a checkup.
 
Write a brief, warm, encouraging reminder message inviting the patient to schedule bloodwork and a followup appointment.
The message should be nonjudgmental, clear, and prompt the patient to reach out to the office with any new or worsening symptoms, concerns, or questions.
Do NOT give specific medical advice.
 
Write this message in {patient[‘primary_language’]}.“”
)
 
@weave.op()
def generate_outreach(patient):
prompt = overdue_prompt(patient)
response = litellm.completion(
model=“openai/gpt-4o”,
messages=[
{
“role”: “system”,
“content”: “You are a helpful virtual assistant supporting patient engagement and appointment scheduling.”
},
{“role”: “user”, “content”: prompt}
],
stream=False,
max_tokens=256
)
return response[‘choices’][0][‘message’][‘content’]
 
outreach_message = generate_outreach(patient_profile)
 
print(“\nAI-generated Patient Outreach Message:”)
print(outreach_message)
print(“\n[If the patient replies, their message is securely flagged for review by clinical team.]”)

Once the message is sent, if a patient replies with worrisome symptoms, such as chest pain, dizziness, or mental health struggles, the agent can immediately flag this for follow up by clinical staff or recommend that the patient seeks urgent care. This demonstrates how AI bridges the gap between administrative efficiency and direct patient safety. 


For many patients, especially those dealing with guilt, uncertainty, financial fear, or anxiety about “bothering” their doctor, AI-powered outreach lowers barriers to care. Instead of waiting until symptoms worsen or risking an ER trip, patients receive just-in-time prompts and reassurance that reaching out is worthwhile and supported. This proactive communication can catch health problems earlier and reduce preventable complications.

 

Automating Claims, Authorizations, and Billing


On the back end, administrative processes like insurance authorizations, billing, and claims management are fraught with inefficiencies and errors. AI agents can process the mountains of paperwork much faster and with fewer mistakes, auto-filling forms, matching procedures with insurance codes, catching discrepancies, and following up on pending claims. This shortens the reimbursement cycle for providers, reduces denials and appeals, and eliminates the manual drudgery that drives up both staff workload and overall system costs.

 

Cost Prediction and Transparent Quotes for Patients


Perhaps one of the most transformative uses of AI in healthcare administration is transparent cost prediction and quote generation. Patients are too often left in the dark about what a procedure or hospital stay will actually cost them, resulting in unexpected bills, stress, or even avoidance of necessary care. Using data on insurance contracts, historical billing, and negotiated rates, AI agents can present patients with clear, up-front price estimates that account for their coverage, deductible status, and provider options.


For example, when a procedure is recommended, an AI agent could generate a real-time quote comparing costs at multiple in-network hospitals or clinics, allowing patients to make informed value decisions. By enabling this kind of comparison shopping, AI agents create a fairer, more transparent healthcare marketplace, ultimately driving providers toward greater price and quality competition, which benefits consumers.

 

Operational Efficiency and Market Transformation


Altogether, AI-driven automation of administrative work reduces overhead, improves staff job satisfaction, slashes errors, shortens wait times, and significantly cuts operational costs. For the health system as a whole, this means more resources go toward direct patient services rather than bureaucratic friction. Patients benefit from easier access, clearer information, reduced financial surprises, and timely interventions.


This systemic improvement doesn’t just make healthcare cheaper, it makes it more patient-centered and fair. As AI continues to mature and be deployed across healthcare operations, the market will shift toward truly rewarding value, efficiency, and positive patient outcomes.


By embedding intelligence across the administrative fabric of healthcare, AI agents promise a future where the system works not just for providers and payers, but truly for patients.

Challenges and Risks of AI Agents in Healthcare


While AI agents have the potential to radically improve healthcare, their adoption is accompanied by significant risks and challenges that demand careful consideration from industry leaders, regulators, and frontline clinicians.

 

Bias and Health Disparities


AI models learn from data, and if that data reflects existing biases or is not representative of all patient populations, the technology can inadvertently perpetuate or even exacerbate health disparities. For example, if an AI system is trained mostly on data from specific demographic groups, its recommendations may be less accurate or even unsafe for others. This can manifest in diagnostic errors, unequal allocation of resources, or treatments that are less effective for underrepresented groups. Ensuring diverse, high-quality data and continual monitoring for bias in AI outputs is crucial, yet remains a major operational challenge.

 

Privacy, Security, and PII Protection


AI agents rely on vast amounts of sensitive health data, including personally identifiable information (PII) and medical histories. The more data these systems collect and share, the greater the risk of security vulnerabilities or data breaches. Unauthorized exposure of health information can have devastating consequences for patients, from discrimination to identity theft. Healthcare organizations must not only comply with privacy regulations like HIPAA and GDPR, but also keep pace with evolving cyber threats and ensure that AI vendors implement robust security protocols.

 

Transparency, Explainability, and Trust


Many AI models, particularly those using deep learning, can act as “black boxes,” generating recommendations that are difficult even for experts to explain. This opacity challenges both clinician and patient trust. When an AI agent recommends a course of treatment or flags a risk, providers must understand how and why that decision was made especially when lives are at stake. Explainability and transparency are essential for clinical acceptance, regulatory approval, and ethical deployment.

 

Clinical Validation and Accountability


Unlike traditional software, AI algorithms often adapt over time. This raises questions about how best to test, regulate, and monitor their performance in the real world. How do we ensure AI therapies and recommendations remain accurate as clinical guidelines and patient populations evolve? Determining accountability is also complex; if an AI agent’s suggestion leads to harm, is the provider, the institution, or the software developer responsible?

 

Operational, Cultural, and Regulatory Hurdles


Healthcare workflows are complex and highly regulated. Integrating AI agents into daily practice can disrupt established routines, require significant training, and initially slow down rather than accelerate processes. Concerns about liability, reimbursement, and compliance make many healthcare leaders understandably cautious, often favoring pilot programs and phased rollouts over widescale adoption. Additionally, not all facilities have the technical expertise or resources to deploy advanced AI systems safely.

 

Potential for Over-reliance and Deskilling


If clinicians begin to rely too heavily on AI recommendations, there is a risk that essential diagnostic and critical thinking skills may erode over time (a phenomenon sometimes called “deskilling”). Healthcare providers must remain vigilant, using AI as an augmenting tool; but not a substitute for their own expertise and patient-centered judgment.


In summary, while AI agents herald a new era of possibility in healthcare, realizing their promise safely and ethically requires continuous attention to fairness, security, transparency, accountability, and patient trust. Healthcare leaders must balance innovation with caution, conducting rigorous validation, fostering multidisciplinary oversight, and placing patient welfare at the center of all AI-driven change.

 

Evaluating Health AI Models with HealthBench


HealthBench is a rigorous, clinician-validated benchmark designed to assess health AI models using real-world, multi-turn scenarios. Developed by physicians from many countries, each HealthBench evaluation presents models with challenging health questions. The responses generated by these models are then assessed using detailed, physician-written rubrics, each criterion within these rubrics carries a weight that reflects its clinical importance. By evaluating how well a response satisfies each requirement, HealthBench delivers nuanced and trustworthy performance measurements, spanning axes such as accuracy, communication quality, and context-awareness.


The evaluation process proceeds in several stages. First, the model or expert is presented with a standardized prompt from the HealthBench dataset and generates a response. Each response is then rigorously graded: for every rubric criterion, a grading model such as GPT‑4.1 reviews the entire conversation, the latest model response, and the relevant rubric. It determines through detailed reasoning, captured in structured JSON, whether the response meets the criterion, and provides an explicit explanation.


Scoring is calculated by summing the weighted points for all rubrics satisfied and normalizing this total by the maximum possible score for the example. Aggregating these results across thousands of cases provides a comprehensive view of model performance, which can be analyzed in depth by clinical context, scenario type, rubric axis, or model version. This approach allows for a detailed understanding of both strengths and areas for improvement.

 

Logging HealthBench Evaluation with Weave


To ensure that the evaluation process is reproducible, transparent, and easy to analyze over time, I integrated Weave’s EvaluationLogger into the HealthBench evaluation loop. This integration automatically logs each model prediction along with its input prompts, outputs, rubric-level judgments, and final scores. All of this information is stored in a searchable dashboard, supporting ongoing progress tracking and in-depth model comparisons.


When adding Weave, I began by initializing both the Weave environment and the evaluation logger within the evaluation pipeline.


Here’s the core healthbench_eval.py script that I modified:

import hashlib
import json
import random
from collections import defaultdict
from typing import Literal
 
import numpy as np
 
import weave
from weave import EvaluationLogger
 
from . import common
from .types import Eval, EvalResult, MessageList, SamplerBase, SingleEvalResult
 
 
# The grader reads the full conversation + a single rubric item and returns a
# JSON object: {“explanation”: …, “criteria_met”: true/false}.
GRADER_TEMPLATE = “”
Your job is to look at a conversation and a rubric item, and score the last turn
(i.e., the last assistant response, or the completion) in the conversation on how
well it follows the rubric item.
 
# Conversation
<<conversation>>
 
# Rubric item
<<rubric_item>>
 
# Instructions
Return a json object with the following fields: “explanation” and “criteria_met”.
“explanation”: a string explaining why the response does or does not meet the criteria.
“criteria_met”: a boolean. If a rubric item has multiple criteria, all must be met
for this to be true.
“”
 
 
class HealthBenchEval(Eval):
def __init__(
self,
grader_model: SamplerBase,
num_examples: int | None = None,
n_repeats: int = 1,
subset_name: Literal[“hard”, “consensus”] | None = None,
model_name: str | None = None,
dataset_name: str = “healthbench”,
):
“””Initialize HealthBenchEval and the Weave logger.”””
weave.init(“healthbenchdev”)
 
import re
 
# Weave’s EvaluationLogger groups every prediction + score under one
# (model, dataset) run so you can compare models side by side later.
self.eval_logger = EvaluationLogger(
model=re.sub(r’\W+’, ‘_’, str(model_name or “unknown_model”)),
dataset=re.sub(r’\W+’, ‘_’, str(dataset_name)),
)
 
# Pick the dataset shard (full / hard / consensus) and load examples.
input_path = INPUT_PATH # resolved from subset_name in the full file
with open(input_path, “r”, encoding=“utf-8-sig”) as f:
examples = [json.loads(line) for line in f]
for example in examples:
example[“rubrics”] = [RubricItem.from_dict(d) for d in example[“rubrics”]]
 
rng = random.Random(0)
if num_examples is not None and num_examples < len(examples):
examples = rng.sample(examples, num_examples)
 
self.examples = examples * n_repeats
self.grader_model = grader_model
 
def grade_sample(
self,
prompt: list[dict[str, str]],
response_text: str,
example_tags: list[str],
rubric_items: list[“RubricItem”],
) > tuple[dict, str, list[dict]]:
convo_with_response = prompt + [dict(content=response_text, role=“assistant”)]
 
def grade_rubric_item(rubric_item) > dict:
convo_str = “\n\n”.join(
[f”{m[‘role’]}: {m[‘content’]}” for m in convo_with_response]
)
grader_prompt = GRADER_TEMPLATE.replace(
“<<conversation>>”, convo_str
).replace(“<<rubric_item>>”, str(rubric_item))
messages: MessageList = [dict(content=grader_prompt, role=“user”)]
while True:
sampler_response = self.grader_model(messages)
grading_response_dict = parse_json_to_dict(sampler_response.response_text)
if isinstance(grading_response_dict.get(“criteria_met”), bool):
break
print(“Grading failed due to bad JSON output, retrying…”)
return grading_response_dict
 
grading_response_list = [grade_rubric_item(ri) for ri in rubric_items]
 
# Sum the weighted points for satisfied rubrics, normalized by the max.
overall_score = calculate_score(rubric_items, grading_response_list)
metrics = {“overall_score”: overall_score}
 
rubric_items_with_grades = []
for rubric_item, grading_response in zip(rubric_items, grading_response_list):
rubric_items_with_grades.append(
{
**rubric_item.to_dict(),
“criteria_met”: grading_response[“criteria_met”],
“explanation”: grading_response.get(“explanation”, “”),
}
)
 
return metrics, “”, rubric_items_with_grades
 
def __call__(self, sampler: SamplerBase) > EvalResult:
def fn(row: dict):
prompt_messages = row[“prompt”]
sampler_response = sampler(prompt_messages)
response_text = sampler_response.response_text
actual_queried_prompt_messages = sampler_response.actual_queried_message_list
 
metrics, _, rubric_items_with_grades = self.grade_sample(
prompt=actual_queried_prompt_messages,
response_text=response_text,
rubric_items=row[“rubrics”],
example_tags=row[“example_tags”],
)
score = metrics[“overall_score”]
convo = actual_queried_prompt_messages + [
dict(content=response_text, role=“assistant”)
]
return SingleEvalResult(
html=“”,
score=score,
convo=convo,
metrics=metrics,
example_level_metadata={
“score”: score,
“rubric_items”: rubric_items_with_grades,
“prompt”: actual_queried_prompt_messages,
“completion”: [dict(content=response_text, role=“assistant”)],
“prompt_id”: row[“prompt_id”],
“completion_id”: hashlib.sha256(
(row[“prompt_id”] + response_text).encode(“utf-8”)
).hexdigest(),
},
)
 
# Run each example, then log the prediction + per-rubric scores to Weave.
results = []
for row in self.examples:
r = fn(row)
results.append(r)
try:
pred_logger = self.eval_logger.log_prediction(
inputs=row[“prompt”],
output=r.convo[1][“content”],
)
for k, v in r.metrics.items():
pred_logger.log_score(scorer=k, score=v)
if r.score is not None:
pred_logger.log_score(scorer=“overall_score”, score=r.score)
pred_logger.finish()
except Exception as e:
print(“Weave logging failed:”, e)
 
final_metrics = _aggregate_get_clipped_mean(results)
 
# Log the run-level summary (mean score) to Weave.
try:
scores = [r.score for r in results if getattr(r, ‘score’, None) is not None]
mean_score = float(np.mean(scores)) if scores else None
self.eval_logger.log_summary({“mean_overall_score”: mean_score})
except Exception as e:
print(“Weave summary logging failed:”, e)
 
return final_metrics

The script begins by initializing the Weave logger, dataset, and scoring components. For each example in the HealthBench dataset, it generates a model response to a user prompt. It then iterates through each physician-written rubric item, using a grader model (GPT-4o) to assess if the model response satisfies the rubric’s requirements. Explanations and scores are collected for each criterion. The script computes an overall score for the response, aggregates results across samples, and automatically logs each evaluation—including inputs, outputs, and metrics—to Weave. This enables transparent, reproducible analysis of model performance across thousands of realistic health conversations.


After running the evals with the following commands:

python m simpleevals.simple_evals eval=healthbench model=gpt4o examples=100
 
# Evaluate gpt-4o-mini on the same 100 examples for a side-by-side comparison.
python m simpleevals.simple_evals eval=healthbench model=gpt4omini examples=100

We can then visualize the performance of our models inside Weave. Here’s the results for the evaluation:

GPT-4o demonstrates superior performance, as the view shows it achieved a mean overall score of 0.378, which is noticeably higher than gpt-4o-mini’s score of 0.334. This indicates that gpt-4o was more effective at meeting the physician-defined rubric criteria, resulting in better quality responses on average for these complex health scenarios, and the +0.0442 difference clearly quantifies this improvement.


Weave also offers a powerful comparisons view. This specific tool enables you to compare model outputs side by side, filter results by specific parameters, and trace inputs and outputs for every function call. This visual approach simplifies debugging and provides deep insights into model performance, making Weave an indispensable tool for tracking and refining large language models


Here’s a screenshot of the comparisons view:

Conclusion


AI agents hold tremendous promise for transforming healthcare by enhancing diagnostic accuracy, personalizing treatment, and automating both clinical and administrative tasks. Their integration into healthcare workflows and EHR systems can lead to earlier disease detection, improved patient engagement, and streamlined operations, ultimately improving outcomes while reducing costs. However, realizing this potential demands careful attention to challenges such as bias, privacy, transparency, and clinical accountability. With responsible development, rigorous validation, and thoughtful implementation supported by evaluation tools like HealthBench and monitoring platforms such as Weave, AI agents can safely become trusted partners in delivering high-quality, patient-centered care. The future of healthcare lies in harnessing AI’s power to augment human expertise without replacing it, ensuring better, more proactive care for all.