Offloading Score: Measuring AI Reliance
through Counterfactual Workflows

Vishakh Padmakumar¹, Lujain Ibrahim², Zora Wang³, Jennifer Wang¹, Q. Vera Liao⁴, Diyi Yang¹

¹ Stanford University ² University of Oxford ³ Carnegie Mellon University ⁴ University of Michigan

TL;DR

We introduce a way to measure AI reliance by looking at where cognitive effort goes in a real workflow. Using a controlled user study, we show that this measure captures shifts in reliance under time pressure, and helps reveal different patterns in how people work with AI tools.

A lot of existing work treats reliance as a fairly simple event: did the user accept the AI's suggestion, or did they reject it? That framing made sense for earlier AI systems, where the tool produced a recommendation and the user made a decision around it. But that is not really how people use contemporary AI tools. These interactions are conversational, multi-turn, and often woven through the whole task. Two users can end up with the same amount of AI-generated code, but get there in very different ways: one might break the task down, ask for help on sub-components, and check each step; another might hand the whole problem to the model at the start. A usage count makes these look much closer than they are.

Self-reports give us a bit more texture, but they come with their own problems: they are subjective, noisy, and hard to collect every time someone uses a tool.

What gets lost in both approaches is the distribution of effort over the interaction itself.

Figure: Two developers can produce similar outputs while relying on AI very differently. One decomposes the task, queries the model for subcomponents, and verifies each step; the other hands the whole problem to the AI. The offloading score is designed to capture this difference.

At some point, the thing we wanted to measure was not just whether the AI was used, but how much cognitive effort had moved from the person to the tool.

So instead of asking only whether the user accepted an output, we ask a more counterfactual question: how much work would this person have had to do if the tool had not been there? If the answer is "a lot more," then the interaction involved high reliance. If the answer is "roughly the same amount," then the tool was present, but not carrying much of the cognitive load.

We make this concrete by looking at the workflow step by step. Whenever a user seeks AI assistance, we estimate what an 'average' user would have had to do to reach that same sub-goal without the tool. The steps saved, relative to the full human-only counterfactual, give us the offloading score.

The score is only one part of the picture. We also label each AI-assisted step to understand how the reliance shows up: what kind of cognitive work is being offloaded (Planning, Execution, Feedback, or Control) and how the user treats the AI's output (Directly Reuse, Adapt & Apply, Pushback, or Reject).

In practice, the calculation has three moving parts.

First, we record a developer's session — screen, keystrokes, clicks — and automatically induce a step-by-step workflow from it. Each step is meant to capture a meaningful unit of progress: "sent prompt asking AI to design the database schema", or "manually debugged CSS layout issues."

Second, for each AI-assisted step, we simulate what an average developer might have done without the tool. This is not a full alternate universe for the whole task. It is a short-horizon simulation: what would it take to reach this same local sub-goal?
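One way to picture these first two parts is as a small data structure. The following is a minimal sketch, assuming each induced step records whether it was AI-assisted and, if so, the simulated human-only steps for the same local sub-goal. The names `Step`, `counterfactual`, and `human_only_workflow` are illustrative and not taken from the study's code, and the counterfactual steps listed are invented placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One induced workflow step (hypothetical schema)."""
    description: str
    ai_assisted: bool = False
    # Simulated human-only steps for the same local sub-goal.
    # Only populated for AI-assisted steps.
    counterfactual: list[str] = field(default_factory=list)


def human_only_workflow(observed: list[Step]) -> list[str]:
    """Replace each AI-assisted step with its simulated human-only steps."""
    expanded: list[str] = []
    for step in observed:
        if step.ai_assisted:
            expanded.extend(step.counterfactual)
        else:
            expanded.append(step.description)
    return expanded


observed = [
    Step("manually debugged CSS layout issues"),
    Step(
        "sent prompt asking AI to design the database schema",
        ai_assisted=True,
        counterfactual=[  # placeholder steps, invented for illustration
            "list the entities the app needs",
            "sketch tables and relationships",
            "write the schema definition",
        ],
    ),
]

# 2 observed steps expand to 4 human-only steps (1 kept + 3 simulated).
print(len(observed), len(human_only_workflow(observed)))
```

Keeping the simulation attached to the individual step reflects the short-horizon framing: only the AI-assisted sub-goal is re-imagined, and the rest of the workflow is left as observed.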

Third, we turn that into a score:

\[
\text{Offloading Score} = \frac{m - n}{m}
\]

where n is the number of observed steps and m is the number of counterfactual (human-only) steps. The score ranges over [0, 1].

A score of 0 means the observed workflow did not save steps relative to the human-only version. A score close to 1 means most of the work for those sub-goals was effectively offloaded. For a concrete sense of what this looks like, see the examples below.
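The formula (m − n) / m can be sketched as a short function. This is an illustration under stated assumptions rather than the study's implementation: the function name is hypothetical, the step counts in the usage lines are made up, and clamping negative values to 0 is an assumption added here so the result stays within the stated [0, 1] range when the observed workflow is longer than the counterfactual one.

```python
def offloading_score(n_observed: int, m_counterfactual: int) -> float:
    """Fraction of counterfactual steps saved: (m - n) / m."""
    if m_counterfactual <= 0:
        raise ValueError("counterfactual workflow must contain at least one step")
    score = (m_counterfactual - n_observed) / m_counterfactual
    # Assumption: clamp at 0 so a workflow that saves no steps stays in [0, 1].
    return max(0.0, score)


# Made-up step counts, for illustration only.
print(offloading_score(10, 10))  # 0.0: no steps saved
print(offloading_score(2, 20))   # 0.9: most of the work was offloaded
```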

We validate the offloading score through a user study. The basic test is simple: prior work suggests that time pressure increases reliance on external tools. So if our measure is capturing the right thing, it should move in that direction when people are rushed.

We ran a controlled study with 40 experienced developers. Everyone worked on programming tasks involving functional web apps, with the same tools available. The main difference was time: half had 1 hour, and half had 4 hours.

The offloading score was significantly higher in the time-pressured group (+43%, p = 0.018). Two common baselines — the fraction of AI-generated code retained, and self-reported cognitive load — did not significantly separate the two conditions.
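The text above does not say which statistical test produced this p-value. As a hedged illustration of how scores from two conditions could be compared, the sketch below runs a two-sided permutation test on the difference in group means. The scores are invented placeholders, not study data, and `permutation_p_value` is a hypothetical helper.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly reassign scores to the two groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)


# Placeholder scores, invented for illustration only.
short_condition = [0.62, 0.55, 0.71, 0.48, 0.66]
long_condition = [0.41, 0.50, 0.38, 0.57, 0.44]

p = permutation_p_value(short_condition, long_condition)
print(f"p = {p:.3f}")
```

A permutation test is used here only because it makes few distributional assumptions for small groups; any standard two-sample test could take its place.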

We also check the pieces that make the score work: whether the counterfactual simulations are plausible, and whether the human and LLM annotations for process and output-use labels are reliable.

Figure (boxplot): Comparison of different reliance measures across *short* (1 hr) and *long* (4 hr) conditions. The offloading score is significantly higher in the short condition (p = 0.018), matching the expected effect of time pressure. Baselines based on AI usage and self-reported cognitive load show more variance and do not significantly distinguish the conditions.

Over 85% of the generated counterfactual steps were rated as plausible by participants. LLM-as-judge annotations for the process and output-use labels matched human annotators at 80–81% agreement.
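To make the agreement figure concrete, here is a minimal sketch assuming it refers to raw percent agreement (the share of steps where the LLM judge and the human annotator assign the same label), which the text does not state explicitly. The function name and the label sequences are illustrative, not study data.

```python
def percent_agreement(human_labels, llm_labels):
    """Share of items on which two annotators assign the same label."""
    if len(human_labels) != len(llm_labels):
        raise ValueError("both annotators must label the same items")
    if not human_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(h == m for h, m in zip(human_labels, llm_labels))
    return matches / len(human_labels)


# Invented labels for five AI-assisted steps.
human = ["Planning", "Execution", "Execution", "Planning", "Execution"]
llm = ["Planning", "Execution", "Planning", "Planning", "Execution"]

print(percent_agreement(human, llm))  # 0.8 (4 of 5 labels match)
```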

Does a higher offloading score mean the user is over-relying on AI? Not necessarily. Reliance is not automatically good or bad. The offloading score tells us how much effort was offloaded. Whether that is desirable depends on what the user was trying to get out of the interaction.

To see this more clearly, we combined the offloading score with a measure of system recall. After the task, we asked participants questions about the system they had built: design decisions, implementation details, and how the pieces fit together.

Potential overreliance

High offloading score, low system recall. These users delegated heavily, but did not retain much understanding of what had been built. In interviews, several said they knew how to do the task themselves but used the tool out of habit or to move faster — and in some cases were not even aware of features the AI had quietly included.

Appropriate reliance

Moderate-to-high offloading score, high system recall. These users leaned on AI for things they did not already know how to do, but stayed engaged enough to understand the result. The tool seemed to function more like a learning scaffold than a shortcut. One participant described "not wanting to limit themselves to one line of thinking" and using the model in a back-and-forth planning loop.

Lower reliance

Smaller offloading scores. In interviews, this often seemed less like a deliberate preference for independence and more like limited familiarity with what the tool could do. When shown examples of alternative workflows, several participants said they would have used the tool more if they had known.
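The three patterns above can be read as regions in the space of offloading score and system recall. The sketch below is only an illustration of that reading: the cutoffs are arbitrary placeholders (the text gives no numeric thresholds), recall is assumed to be scaled to [0, 1], and `reliance_pattern` is a hypothetical helper rather than a procedure used in the study.

```python
def reliance_pattern(offloading_score, system_recall,
                     offloading_cutoff=0.4, recall_cutoff=0.6):
    """Map a participant to one of the three described patterns.

    Both cutoffs are illustrative assumptions, not values from the study.
    """
    if offloading_score < offloading_cutoff:
        return "lower reliance"
    if system_recall >= recall_cutoff:
        return "appropriate reliance"
    return "potential overreliance"


# Invented (offloading score, recall) pairs.
for score, recall in [(0.8, 0.3), (0.7, 0.9), (0.2, 0.8)]:
    print(score, recall, reliance_pattern(score, recall))
```

Hard cutoffs are a simplification: the patterns were characterized with the help of interviews, which a two-number rule cannot reproduce.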

Figure (scatter plot): System recall (y-axis) vs. offloading score (x-axis) across participants. The red line shows a linear fit after excluding the moderate-to-high reliance, high-recall cluster in green. Among the remaining users, higher offloading is more strongly associated with lower recall. The highlighted cluster shows a different pattern: substantial tool use while still maintaining understanding.

The offloading score is not meant to decide whether reliance is bad. It gives us a way to notice when cognitive effort has moved, and then ask what that movement meant in context.

Below are two moments from real, anonymized sessions in our user study. Each workflow contains a mix of human steps (👤) and AI-assisted steps (✦). Each AI-assisted step is followed by the human-only alternative we estimate for that moment.

Task: Build a web-based game with a mobile-responsive UI. The user has finished the core logic and is now trying to make the interface work on smaller screens.

Induced workflow · Offloading score: 0.53

👤 Review task requirements: ensure the app layout works on mobile screens.

👤 Open the app in browser, resize to mobile viewport (~375px width), and identify layout issues in the Controls and Board components.

✦ (editing generation) Accept AI-suggested CSS/Controls/Board changes for responsive design via multiple "Accept all"/inline Accepts.

Process: Execution · Output use: Directly Reuse

Without AI, this step would require:

1. Open the implementation plan / notes summarizing the mobile-responsiveness fixes and keep that window visible for reference.
2. Use VS Code search to find UI/layout sources: open index.css plus component files that contain the controls and board layout (e.g., Controls.tsx, Board.tsx, App.css).
3. Edit index.css to add mobile-targeted rules: add a @media (max-width: 600px) block that adjusts root spacing variables, sets the main container to column flow, and reduces paddings/margins.
4. Update controls component styles so the control group stacks vertically on small screens and buttons expand to 100% width within that breakpoint.
5. Adjust board/container styles to preserve aspect ratio on narrow viewports: set max-width: 100%, use responsive heights, and handle overflow.
6. Run the dev server and open the app in browser; use DevTools Device Toolbar at common mobile widths (375×667, 414×896) to inspect the layout.
7. Interact with the board and controls in responsive DevTools view, noting issues; iterate on CSS edits until controls and board behave correctly on small screens.
8. Run linting and tests (npm run lint, npm test) and fix any failures introduced.
9. Create a descriptive commit and push the branch.

1 AI step → 9 human steps · saves ~89% of effort at this moment
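The 89% figure for this moment follows from applying the formula (m − n) / m locally, with n = 1 observed AI step and m = 9 estimated human steps:

```latex
\[
\frac{m - n}{m} = \frac{9 - 1}{9} = \frac{8}{9} \approx 0.89
\]
```

The workflow-level score of 0.53 shown above is lower, presumably because it is computed over the whole session, where most steps were performed by the user, rather than over this single step.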

👤 Test the responsive layout in browser DevTools at multiple mobile widths; verify board and controls render correctly.

👤 Run npm test to check for regressions and confirm the build passes.

Task: Build a recipe finder app using the Spoonacular API. The user needs to add a "Generate" button that fetches recipes and displays them in the app.

Induced workflow · Offloading score: 0.47

👤 Review task specification: add a recipe search feature with a "Generate" button connected to the Spoonacular API.

✦ (reading generation) Ask VS Code AI how to add a generate button and recipe list using the API key; read its implementation plan.

Process: Planning · Output use: Adapt & Apply

Without AI, this step would require:

1. Open the Spoonacular API docs and locate the recipe search endpoint. Note required query parameters, response format, and how the API key is passed.
2. In VS Code open App.jsx and decide where to add UI and state: plan to add a "Generate" button, a search input, and a recipe list.
3. Add React state hooks: recipes (array), loading (boolean), error (string), and query (string).
4. Implement a searchRecipes function: set loading true; build the request URL using the API key from env; call fetch(url), await response.json(), set recipes; catch errors.
5. Wire the UI: add onChange to the input, add a Generate button with onClick={searchRecipes}, and add conditional rendering for loading, error, and a mapped list of recipe cards.
6. Save App.jsx and restart the dev server so that environment changes are picked up.
7. Open the app in browser, enter a query, click Generate, watch the network tab, and confirm recipes render correctly.

1 AI step → 7 human steps · saves ~86% of effort at this moment

👤 Open App.jsx and begin implementing the recipe search feature following the plan.

👤 Enter a test query, click Generate, and check the network tab to confirm the API request and response format.

👤 Fix API key configuration in .env.local if needed; restart dev server and verify recipes render correctly.

Any new measure should invite some skepticism. These are the concerns we think about most often. There is a lot more work to be done here, so please reach out if any of these directions are interesting to you.

But what if the estimated counterfactual steps don't reflect how someone would solve the sub-task(s)?

We use human-only counterfactuals to approximate what an average developer would do to reach the same sub-goal without the tool. This is a reasonable place to be skeptical. We evaluate these counterfactuals using a dataset of workflows from prior work, and in our user study participants rated over 85% of generated counterfactual steps as plausible. We also expect this part of the method to improve as workflow induction, user modeling, and simulation methods get better.

But how do you account for differences in work styles? Shouldn't the counterfactual be personalized?

In principle, yes. A senior engineer and a junior developer might take very different paths to reach the same sub-goal without AI, and a single "average user" counterfactual will not capture all of that. We use a general counterfactual because it keeps the measure reusable across contexts, and because personalized workflows estimated from limited session history can be very noisy. But adapting the offloading score to individual baselines is an important next step.

But what if this only works for coding tasks?

Our user study focuses on 40 experienced developers working on programming tasks, largely because coding is one of the places where AI agents are already widely used. But the framework itself is not specific to code. In principle, any setting where we can induce a workflow and estimate a human-only alternative could work. That still needs empirical validation.

But what if cognitive effort is not always visible in on-screen actions?

Workflow steps only capture what is externally visible. A user might sit quietly thinking through a design before typing anything, and that effort will not appear in the trace. So the measure treats observable steps as a useful but imperfect proxy for cognitive effort. Our hope is that this still gives users a signal they can reflect on, and gives model developers a way to design incentives that reduce harmful forms of overreliance.

Citation

@article{padmakumar2026offloading,
  title   = {Offloading Score: Measuring {AI} Reliance through
             Counterfactual Workflows},
  author  = {Padmakumar, Vishakh and Ibrahim, Lujain and Wang, Zora
             and Wang, Jennifer and Liao, Q. Vera and Yang, Diyi},
  note    = {Unpublished},
  year    = {2026},
}