
Building a Computer Use Agent for UI Testing at Autodesk

2026-03-10
AI, Computer Use, VLMs, Automation, Engineering

The Problem

UI testing is one of the most expensive and time-consuming parts of software QA. At Autodesk, we had close to 1200 manual UI test cases spread across ~10 workspaces — and every release, QA engineers had to run them by hand. That's hundreds of person-weeks of work per year.

The challenge: could we automate this with AI?

The Approach

Most UI automation frameworks require you to hard-code selectors (button#submit, div.toolbar) or record-and-replay click sequences. These break every time the UI changes — and in an active product, the UI changes constantly.

We took a different approach: build a Computer Use Agent (CUA) that sees the UI the same way a human does, using a Vision-Language Model (VLM).

Visual Grounding & Localization

The core technical challenge was grounding — given an instruction like "click the Export button", how do you tell the model exactly where on screen that button is?

Off-the-shelf VLMs are great at describing what they see, but terrible at pinpointing pixel coordinates. We developed a multi-stage visual grounding method that:

  1. Uses the VLM to identify the region of interest on screen
  2. Crops and zooms into that region
  3. Re-queries the model for precise localization within the crop
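The steps above can be sketched in a short coarse-to-fine loop. This is an illustrative reconstruction, not our production code: `query_vlm` is a hypothetical stand-in for whatever VLM client you use, and is assumed to return normalized coordinates in [0, 1] relative to the image it was shown.

```python
def crop_region(screen_size, region_box, pad=0.05):
    """Expand a normalized region box slightly and clamp it to the screen.

    Returns a pixel-space crop rectangle (left, top, right, bottom).
    """
    w, h = screen_size
    x0, y0, x1, y1 = region_box
    x0 = max(0.0, x0 - pad); y0 = max(0.0, y0 - pad)
    x1 = min(1.0, x1 + pad); y1 = min(1.0, y1 + pad)
    return (round(x0 * w), round(y0 * h), round(x1 * w), round(y1 * h))

def to_screen_coords(crop_box, point_in_crop):
    """Map a normalized point inside the crop back to absolute screen pixels."""
    left, top, right, bottom = crop_box
    px, py = point_in_crop
    return (left + px * (right - left), top + py * (bottom - top))

def ground(instruction, screenshot, screen_size, query_vlm):
    # Stage 1: ask the VLM for a coarse region of interest on the full screen.
    region = query_vlm(screenshot, f"Return the region containing: {instruction}")
    crop_box = crop_region(screen_size, region)
    # Stage 2: re-query on the zoomed crop for a precise click point,
    # then map that point back into full-screen coordinates.
    point = query_vlm(("crop", crop_box), f"Return the exact point of: {instruction}")
    return to_screen_coords(crop_box, point)
```

The key detail is the coordinate remapping in stage 2: the model only ever answers relative to the image it saw, so the crop rectangle has to be carried along to translate its answer back to the screen.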

The result: precision jumped from 45% → 93%, and runtime dropped from 30–60s → 10–16s per step.

Results

After building the system:

  • ~1200 previously manual test cases are now fully automated
  • 400 QA-weeks saved per year across the org
  • The system is UI-agnostic — no per-feature engineering needed

We also implemented full observability with Langfuse to track cost, latency, and model quality across every run.
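In production this telemetry goes to Langfuse; the shape of what gets recorded per step can be sketched with a plain in-memory recorder like the one below. All names here are illustrative, not Langfuse's API.

```python
import time
from dataclasses import dataclass, field

@dataclass
class StepTrace:
    name: str
    latency_s: float
    input_tokens: int
    output_tokens: int

@dataclass
class RunTrace:
    steps: list = field(default_factory=list)

    def record(self, name, fn, *args, **kwargs):
        """Run one agent step, timing it and logging its token usage.

        `fn` is expected to return (result, input_tokens, output_tokens).
        """
        start = time.perf_counter()
        result, in_tok, out_tok = fn(*args, **kwargs)
        self.steps.append(StepTrace(name, time.perf_counter() - start, in_tok, out_tok))
        return result

    def summary(self):
        """Aggregate per-run figures for cost and latency dashboards."""
        return {
            "steps": len(self.steps),
            "total_latency_s": sum(s.latency_s for s in self.steps),
            "total_tokens": sum(s.input_tokens + s.output_tokens for s in self.steps),
        }
```

Per-step traces like this are what make a multi-step agent debuggable: when a run fails, you can see exactly which step was slow, expensive, or wrong.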

Lessons Learned

VLMs + structured prompting > fine-tuning — We didn't need to fine-tune any models. Careful prompt engineering and multi-stage reasoning got us to production quality using off-the-shelf VLMs.
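As a hypothetical illustration of the structured-prompting style (the schema and names here are illustrative, not our actual prompts): constrain the model to a fixed JSON action schema, then validate its reply before acting on it.

```python
import json

ACTION_PROMPT = """You are a UI testing agent.
Given the screenshot and the instruction, respond ONLY with JSON:
{{"action": "click" | "type" | "scroll", "target": "<element description>", "text": "<optional>"}}
Instruction: {instruction}"""

ALLOWED_ACTIONS = {"click", "type", "scroll"}

def parse_action(raw: str) -> dict:
    """Validate the model's reply; raise instead of executing a malformed action."""
    data = json.loads(raw)
    if data.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {data.get('action')!r}")
    if not data.get("target"):
        raise ValueError("missing target")
    return data
```

Rejecting malformed replies at the parsing boundary keeps schema drift from turning into stray clicks in the application under test.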

Benchmarks matter from day one — We built sanity, regression, and benchmarking datasets before shipping anything. This was critical for catching regressions as models and prompts evolved.
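One kind of metric such a benchmark can track — sketched here with illustrative names, not our actual harness — is grounding precision: did the predicted click point land inside the ground-truth bounding box of the target element?

```python
def hit(pred_point, gt_box):
    """True if the predicted click point lies inside the labeled box."""
    x, y = pred_point
    x0, y0, x1, y1 = gt_box
    return x0 <= x <= x1 and y0 <= y <= y1

def grounding_precision(predictions, ground_truth):
    """Fraction of benchmark cases where the predicted point hits the box."""
    hits = sum(hit(p, b) for p, b in zip(predictions, ground_truth))
    return hits / len(ground_truth)
```

Running a metric like this on a frozen dataset after every prompt or model change is what turns "the agent feels worse" into a number you can bisect.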

Observability is not optional — Without tracing, debugging a multi-step agentic system is nearly impossible. Langfuse integration paid for itself on day one.


Building agentic systems that interact with real UIs is one of the hardest and most exciting problems in applied AI right now. Happy to chat more — reach out at d.elchoueiry@gmail.com.