Driving toward zero traffic accidents: How Woven by Toyota builds video AI agents for automated driving with W&B Weave

“The W&B team added video support to Weave, just for us, so we could advance this project. W&B Weave has been great for us, not just tracking results and logging all experiments, but evaluating inputs and outputs, identifying and fixing bugs much faster, and giving us a lot of great metrics (like weighted precision and weighted recall) to analyze.”
Suigen Koide
Head of DevBoost, World Understanding, and AD / ADAS

For Suigen Koide, the Head of DevBoost, World Understanding, AD / ADAS (Automated Driving and Advanced Driver Assistance Systems) at Woven by Toyota (the mobility technology subsidiary of Toyota Motor Corporation), car accidents are not just a critical societal problem to work on, but also a very personal one.

Koide was struck by a truck when he was just four years old, and came very close to losing his life. He was unconscious for four days, and still remembers his grandfather sitting on his bed begging him to wake up, as well as the overwhelming grief and trauma his mother felt, before he miraculously woke up and recovered.

“It may take years, but our goal and my goal is to work toward a future with zero traffic accidents,” said Koide. “I want to build autonomous driving systems that could have saved my 4-year-old self.”

And the stakes couldn’t be higher: with two people dying every minute in car accidents, and over 1.3 million lives lost annually worldwide, making roads safer through autonomous vehicles represents one of the most important technological challenges AI can tackle. But it is also a very difficult one, and Koide and his team have encountered several bottlenecks over the past 8+ years building Woven by Toyota’s machine learning environment for automated driving from the ground up.

Read on to learn more about those challenges, and how Weights & Biases (W&B) Weave helped Woven by Toyota overcome those obstacles to build a cutting-edge video AI agent that improved triage and set them on a path toward safer autonomous vehicles.

From data to deployment: Why human triage slowed down automated driving development

Automated driving consists of breaking human driving down into three core steps – perception, planning, and control – and translating all three into software. Koide’s team had developed a sophisticated workflow for building that software to meet the high safety standards of driving.


That workflow starts, as with all AI projects, by collecting real-world driving data. Most of this data is collected through test ego cars – on either closed courses or the open road – to see what they perceive and whether the system planned the right maneuver. When either phase goes wrong, the event is detected as a bug, and that bug is uploaded to the cloud. Koide and his team then classify the issue, improve the system, run simulations, test it on a closed course, and finally deploy the fix via OTA (over-the-air) update.

But when they mapped out the full workflow, one major bottleneck continuously emerged: human triage. Triage was the slow, manual process of reviewing driving logs to find the root cause of system failures. The team would individually look through logged footage, analyze the video, and determine what went wrong, often checking frame-by-frame to identify exactly what happened.

With many different subcategories of bugs, each one needed to be classified correctly so it could be routed to the right team for a fix – a painstaking, time-consuming process.

But what if GenAI could provide a scalable solution? What if they showed the input driving log videos to GenAI to predict the root cause of each issue, and then classified it accordingly? That was the idea behind AutoTriage, and Koide and his team set out to build it, automating the human triage step of their workflow.
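The core idea can be sketched as a simple loop: send a driving-log video to a multimodal model, get back a predicted root-cause category, and route the ticket to the owning team. The sketch below is a minimal illustration under assumed names – the categories, team names, and the `predict_root_cause` stub are all hypothetical, not Woven by Toyota’s actual implementation, and the model call is stubbed out rather than hitting a real GenAI API.

```python
# Minimal AutoTriage-style sketch. All names (categories, teams,
# predict_root_cause) are hypothetical; a real system would call a
# multimodal model on the video instead of this stub.

# Hypothetical mapping of root-cause categories to owning teams.
CATEGORY_TO_TEAM = {
    "perception/traffic_light": "perception-team",
    "planning/lane_change": "planning-team",
    "control/braking": "control-team",
}

def predict_root_cause(video_path: str) -> str:
    """Stub for a multimodal model call that inspects a driving-log
    video and predicts the bug's root-cause category."""
    # A real implementation would upload video frames to a GenAI model
    # and parse its answer into one of the known categories.
    return "perception/traffic_light"

def auto_triage(video_path: str) -> dict:
    """Classify a bug ticket from its video log and route it to a team."""
    category = predict_root_cause(video_path)
    # Unknown categories fall back to human review rather than mis-routing.
    team = CATEGORY_TO_TEAM.get(category, "human-triage-review")
    return {"video": video_path, "category": category, "assigned_to": team}

ticket = auto_triage("logs/drive_0042.mp4")
print(ticket["assigned_to"])  # perception-team
```

The fallback to human review is the important design choice here: an automated triager should hand off, not guess, when its prediction falls outside the known taxonomy.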

From prototype to production: How W&B Weave helped mature a video AI agent


The initial version of AutoTriage seemed to be an immediate success, reviewing video logs and automatically classifying bugs into the right category. But the team soon learned a critical lesson: bug classification was complex and nuanced, even within a single category. And AutoTriage was not quite getting it right.

For example, it might predict that the ego car was “Unable to detect a green traffic light,” when a human reviewer would more accurately describe the ground truth as “Lost perception of green traffic light.” That type of small but important distinction proved critical to their workflows. As Koide worked to improve AutoTriage’s performance, the team hit roadblocks at every stage of AI agent development:

  • Data wasn’t ready or good enough:
    • Only low-resolution videos were being attached to bug tickets, while high-quality logs that were used and reviewed by humans were not saved. Triage, in that sense, was still a very manual process focused on human investigation.
  • Prompts were too complicated:
    • Koide’s team didn’t have enough domain knowledge about triage, thus making prompts hard to write. They didn’t know where to look, what to tell the model, and ultimately how to write good enough prompts that covered every aspect of the domain accurately.
  • Models struggled to deliver accurate results:
    • The multimodal models the team tried initially struggled, primarily with limited inputs and a shallow understanding of the domain. At the beginning of the project, they tried GPT-4o, which was supposed to support video, but uploading more than 10 images caused input errors. Switching to Gemini yielded similar issues, with the model struggling even to tell a red traffic light from a green one.
  • Experiments run by the team were scattered and poorly tracked:
    • The team ran constant experiments to yield better results, but all experiment results, metrics and reports were scattered across the organization, and saved in Microsoft Word and PowerPoint. The lack of effective tracking, visualization, or ability to put those experiments into action slowed the team down.

Koide and his team took a divide-and-conquer approach to each of these problems and bottlenecks, and after a year of tinkering without meaningful improvements, they finally got the results they were after.

For data, they built a high-resolution video pipeline so the GenAI models could better “see” what the human triagers were seeing, immediately improving the video inputs. For prompts, they recruited domain experts into the project and collaborated closely with them to write better prompts. The models themselves improved: the release of Gemini 2.5 Pro, a multimodal model with reasoning, unlocked a whole new level of detailed video analysis. Finally, to improve experiment tracking, the team turned to W&B Weave.

“The W&B team added video support to Weave, just for us, so we could advance this project,” said Koide. “Our relationship with W&B goes back many years – we even had the three founders of the company visit us in Japan a long time ago to help us set this up – and we deeply value this partnership, this level of care, and it’s been impressive to see how much the company and product has grown and improved.”

Using W&B Weave for all video inputs and experiments tracking led to massive gains in productivity, accuracy, and also sanity and peace of mind for the team.

“W&B Weave has been great for us, not just tracking results and logging all experiments, but evaluating inputs and outputs, identifying and fixing bugs much faster, and giving us a lot of great metrics (like weighted precision and weighted recall) to analyze,” said Koide. “Finally after one year of effort, we started seeing high accuracy on AutoTriage and classifying our bugs.”

“Before W&B, I had to manually review everything. Just seeing all that data visualized in W&B’s graphs has been genuinely moving for me, and so impactful for our team.”

By addressing the bottlenecks of data, prompts, models, and experiments, the team achieved a 10x increase in the speed and scale of their bug triaging efforts, with the same team and resources.
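The weighted precision and recall that Koide mentions average each class’s precision and recall weighted by its support – which matters for triage, where bug categories are heavily imbalanced. A self-contained sketch of the computation (equivalent in spirit to scikit-learn’s `average="weighted"` option; the toy labels are illustrative, not real triage data):

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred):
    """Per-class precision and recall, averaged with each class
    weighted by its share of the true labels (its support)."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = 0.0
    for cls, count in support.items():
        # True positives: samples of this class predicted correctly.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted_as_cls = sum(1 for p in y_pred if p == cls)
        cls_precision = tp / predicted_as_cls if predicted_as_cls else 0.0
        cls_recall = tp / count
        weight = count / n
        precision += weight * cls_precision
        recall += weight * cls_recall
    return precision, recall

# Toy triage labels: "lost" = lost perception, "nodet" = unable to detect.
y_true = ["lost", "lost", "nodet", "lost"]
y_pred = ["lost", "nodet", "nodet", "lost"]
p, r = weighted_precision_recall(y_true, y_pred)
print(p, r)  # 0.875 0.75
```

Because the majority class dominates the weighted average, these metrics reward getting the common bug categories right while still penalizing systematic mistakes on rare ones.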


What’s next: Building the future of automated driving with multi-agent systems

The team wants to take things further, moving from a single agent to a multi-agent workflow with an LLM-as-judge. The goal is for AutoTriage to classify bugs as it does today, while a new Triager agent compares each bug to existing ones, drawing on expert-made explanation videos and prompts.
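That planned flow can be sketched as two agents plus a judge: AutoTriage proposes a label, the Triager agent retrieves the closest match among known bugs, and a judge decides whether the two agree or the case should escalate to a human. Everything below is a hypothetical illustration with stubbed agent internals – the function names, the word-overlap matching, and the agreement rule are assumptions for the sketch, not Woven by Toyota’s actual design.

```python
# Sketch of a two-agent triage flow with an LLM-as-judge.
# Agent internals are stubbed; names and logic are hypothetical.

def autotriage_agent(video: str) -> str:
    """First agent: propose a root-cause label for the bug video (stub)."""
    return "lost perception of green traffic light"

def triager_agent(video: str, known_bugs: list) -> str:
    """Second agent: match the bug against known bugs. A real agent would
    use expert-made explanation videos and prompts; this stub uses naive
    word overlap as a stand-in for semantic comparison."""
    words = set(autotriage_agent(video).split())
    return max(known_bugs, key=lambda bug: len(words & set(bug.split())))

def judge(label_a: str, label_b: str) -> str:
    """LLM-as-judge (stub): accept when both agents agree, else escalate."""
    return "accept" if label_a == label_b else "escalate-to-human"

known = [
    "lost perception of green traffic light",
    "unable to detect a green traffic light",
]
a = autotriage_agent("logs/drive_0042.mp4")
b = triager_agent("logs/drive_0042.mp4", known)
print(judge(a, b))  # accept
```

The judge is what lets the system scale past human triage capacity safely: agreement between independent agents is auto-accepted, and only disagreements consume human attention.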

“This is the future of triage,” said Koide. “With AI-assisted triage, we can scale far beyond the maximum human triage capacity. Spend 95% of your efforts on getting the data right, collaborate closely with domain experts, stay on top of evolving GenAI models, and finally experiment a lot! And use W&B Weave as the hub for all your experiments.”

The path to zero traffic accidents is possible, thanks to the hard work and innovation of people like Suigen Koide at Woven by Toyota. Making the roads safer for everyone, including his 4-year-old self, is a mission that he and his team are truly passionate about.