Deft: Building an On-Device AI Phone Agent

Matthew J. Whitney

June 24, 2026 • 9 min read

artificial intelligence, llm, ai integration, machine learning

Building an on-device AI phone agent that actually works — not a demo, not a prototype, but something that handles real user flows on a real Android device without phoning home — turns out to be one of the harder engineering problems I've tackled in twenty years of building production systems. We shipped Deft. Here's what it took.

The short version: Deft is a fully local AI agent for Android that uses Gemma 4 to interpret natural language instructions, traverse the Android Accessibility Service tree, and take actions on behalf of users — all without a single byte leaving the device. No cloud backend. No inference API. No data retention risk. The three open-source libraries we built to make this work are the real story, and so are the three failure modes that nearly killed the project.

Why On-Device AI Integration Changes the Privacy Equation

The current wave of phone agents — Gemini Live, Apple Intelligence actions, various Android automation tools — all share one architectural assumption: the model lives in the cloud. Your screen state, your intent, fragments of your messages — they get serialized and shipped to a data center. For most consumer use cases, people have accepted that trade. For enterprise, healthcare, and financial applications, it's a non-starter.

Deft's design constraint was absolute: the LLM infers locally, the action plan executes locally, and the only network traffic is what the user's underlying apps would have generated anyway. That constraint forced every interesting architectural decision we made. It also meant Gemma 4 — specifically the 4B instruction-tuned variant — was the only realistic choice. At 4B parameters with Google's updated quantization, it fits in working memory on a mid-range 2025 Android device with 8GB RAM without killing the foreground app. Barely. We'll get to that.

The Architecture: Three Layers, Three Libraries

Deft's runtime breaks into three distinct layers, and each one required a library that didn't exist when we started.

Layer 1 — Intent Parsing. The user speaks or types a goal: "Book the cheapest flight on the Kayak app for next Tuesday." Gemma 4 converts that into a structured action plan: a sequence of typed steps with rollback semantics, preconditions, and a confidence score per step. We built deft-intent to handle this translation, including a prompt templating system that keeps the context window under control by stripping irrelevant history.
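To make "structured action plan" concrete, here is a minimal sketch of what such a plan could look like, written in Python for brevity. It is not the deft-intent API: the class names, fields, step kinds, and confidence values below are all hypothetical, chosen only to illustrate typed steps that carry preconditions, rollback semantics, and a per-step confidence score.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class StepKind(Enum):
    # Hypothetical step types; a real planner would likely have more.
    OPEN_APP = "open_app"
    TAP = "tap"
    TYPE_TEXT = "type_text"
    SCROLL = "scroll"


@dataclass(frozen=True)
class PlanStep:
    kind: StepKind
    target: str                      # human-readable description of the UI target
    preconditions: tuple = ()        # conditions that must hold before the step runs
    rollback: Optional[str] = None   # how to undo the step; None means it cannot be undone
    confidence: float = 0.0          # model-reported confidence in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


@dataclass
class ActionPlan:
    goal: str
    steps: list = field(default_factory=list)

    def weakest_step(self) -> PlanStep:
        """Return the step the model is least sure about."""
        return min(self.steps, key=lambda s: s.confidence)

    def irreversible_steps(self) -> list:
        """Steps with no rollback are the ones a guard layer should hold."""
        return [s for s in self.steps if s.rollback is None]


# Illustrative plan for the example goal in the text (values are made up).
plan = ActionPlan(
    goal="Book the cheapest flight on the Kayak app for next Tuesday",
    steps=[
        PlanStep(StepKind.OPEN_APP, "Kayak", (), "press home", 0.98),
        PlanStep(StepKind.TYPE_TEXT, "departure date field",
                 ("search form visible",), "clear field", 0.91),
        PlanStep(StepKind.TAP, "cheapest result",
                 ("results loaded",), "press back", 0.84),
        PlanStep(StepKind.TAP, "confirm purchase",
                 ("fare summary visible",), None, 0.72),
    ],
)
```

Modeling rollback as an optional field makes the irreversible steps fall out of the data: anything with no rollback is a candidate for user confirmation before it runs.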

Layer 2 — UI Tree Comprehension. Android's AccessibilityService gives you the full UI node tree. In theory, great. In practice, a complex app like Gmail or a travel booking site renders 400–600 nodes, many of them decorative, duplicated, or semantically meaningless. We built deft-tree to prune, compress, and semantically annotate that tree before it ever touches the context window. More on this below — it's where the project nearly died.

Layer 3 — Action Validation and Execution. Before Deft taps a button, submits a form, or confirms a purchase, it passes the proposed action through deft-guard, our validation layer. This is where irreversible actions — sends, purchases, deletes — get held for explicit user confirmation regardless of model confidence. The model can be wrong. The model will be wrong. deft-guard is the circuit breaker.

The Hard Problem Nobody Talks About: UI Trees at Scale

Every tutorial on LLM-based UI automation shows a clean 30-node tree. Real apps don't look like that. The first time I ran Deft against the Kayak Android app, the accessibility tree came back at 547 nodes. Feeding that raw into Gemma 4's context window — even with its extended 128k context — produces two problems simultaneously: latency spikes past anything a user will tolerate, and the model starts hallucinating node IDs that don't exist because it's lost in the noise.

deft-tree's solution is a three-pass compression pipeline:

1. Structural pruning: remove nodes with no contentDescription, no text, no clickable flag, and no children that have any of those. These are layout containers. They add tokens, not meaning.
2. Semantic deduplication: collapse repeated patterns (list items with identical structure) into a typed template plus a count. A RecyclerView with 40 flight results becomes FlightResultList[40] with one representative item expanded.
3. Relevance scoring: given the current action step's intent, score remaining nodes by semantic proximity using a lightweight embedding model (MiniLM-L6-v2, running locally) and drop anything below threshold.
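The first two passes can be sketched in a few lines. This is an illustration under simplifying assumptions, not deft-tree's actual code: the `Node` class is a hypothetical stand-in for Android's accessibility nodes, "identical structure" is approximated by comparing class names, clickability, and child shapes, and only consecutive siblings are collapsed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    # Hypothetical, simplified stand-in for an accessibility node.
    cls: str
    text: str = ""
    content_description: str = ""
    clickable: bool = False
    children: list = field(default_factory=list)
    repeat: int = 1  # >1 when this node stands in for N similar siblings


def is_meaningful(node: Node) -> bool:
    return bool(node.text or node.content_description or node.clickable)


def prune(node: Node) -> Optional[Node]:
    """Pass 1: drop nodes with no label, no click target, and no useful descendants."""
    kept = [c for c in (prune(ch) for ch in node.children) if c is not None]
    if not kept and not is_meaningful(node):
        return None
    return Node(node.cls, node.text, node.content_description,
                node.clickable, kept, node.repeat)


def shape(node: Node) -> tuple:
    """Structural signature that ignores text, so list rows compare equal."""
    return (node.cls, node.clickable, tuple(shape(c) for c in node.children))


def dedupe(node: Node) -> Node:
    """Pass 2: collapse runs of same-shaped siblings into one representative."""
    collapsed = []
    for child in (dedupe(c) for c in node.children):
        if collapsed and shape(collapsed[-1]) == shape(child):
            collapsed[-1].repeat += child.repeat
        else:
            collapsed.append(child)
    return Node(node.cls, node.text, node.content_description,
                node.clickable, collapsed, node.repeat)


def count(node: Node) -> int:
    return 1 + sum(count(c) for c in node.children)


# A toy screen: one decorative container plus a list of 40 flight rows.
rows = [Node("LinearLayout", clickable=True,
             children=[Node("TextView", text=f"Flight {i}")]) for i in range(40)]
raw = Node("FrameLayout", children=[
    Node("View", children=[Node("View")]),   # decorative, no labels anywhere
    Node("RecyclerView", children=rows),
])

pruned = prune(raw)
compact = dedupe(pruned)
```

On this toy tree the decorative subtree disappears in the first pass, and the 40 rows collapse into a single representative row whose `repeat` field records the count, which mirrors the FlightResultList[40] idea described above.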

After all three passes, that 547-node Kayak tree compresses to 38–60 nodes depending on the action. Latency dropped from 11.2 seconds per step to 2.8 seconds on a Pixel 8 Pro. Still not fast. But usable.
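The relevance-scoring pass is the easiest one to misread, so here is a toy version. One loud assumption: instead of a real embedding model, this sketch uses bag-of-words vectors and cosine similarity, which is a much cruder notion of "semantic proximity" than MiniLM would give. The function names, the labels, and the 0.25 threshold are all made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def filter_relevant(intent: str, labels: list, threshold: float = 0.25) -> list:
    """Keep only node labels whose similarity to the step intent meets the threshold."""
    query = embed(intent)
    return [label for label in labels
            if cosine(query, embed(label)) >= threshold]


labels = [
    "Cheapest flight $212",
    "Sign up for price alerts",
    "Flight results",
    "Privacy policy",
]
kept = filter_relevant("select cheapest flight result", labels)
```

Here `kept` retains the two flight-related labels and drops the other two. A real embedding model would also catch paraphrases that share no words with the intent, which is exactly what word overlap cannot do.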

The latency problem on constrained hardware is real, and it doesn't have a clean solution. We're running a 4B-parameter model on a mobile SoC. The Pixel 8 Pro's NPU helps, but MediaTek's and Qualcomm's NPU acceleration for transformer models is still maturing, and we saw 30–40% variance in inference time depending on thermal state. On a warm device after ten minutes of use, Deft slows down measurably. We surface this to users with a thermal indicator in the UI. Hiding it would be dishonest.

Validating Irreversible Actions: The deft-guard Design

This is the problem that cost us the most design iteration. An AI agent that can tap buttons on your behalf is useful. An AI agent that can confirm a $600 flight purchase because it misread a precondition is a liability.

deft-guard classifies every proposed action against a risk taxonomy before execution:

Reversible — navigation, scrolling, text input into a non-submitted field. Execute immediately.
Soft-irreversible — form submission, account changes, sending messages. Pause and show the user a plain-English summary of what's about to happen.
Hard-irreversible — financial transactions, deletions, account closures. Full stop. Require explicit confirmation with a 3-second delay (enough to interrupt an accidental tap).

The classification runs rule-based first — we pattern-match on AccessibilityService action types and node semantics — and falls back to a Gemma 4 classification call only when the rule layer is uncertain. That keeps validation latency under 200ms for the common case.
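A rough sketch of that rules-first, model-second shape follows. None of this is deft-guard's real rule set: the keyword lists, the action names, and the gate labels are hypothetical, and the model call is represented by a plain callback so the control flow is visible.

```python
import re
from enum import Enum
from typing import Callable, Optional


class Risk(Enum):
    REVERSIBLE = 1
    SOFT_IRREVERSIBLE = 2
    HARD_IRREVERSIBLE = 3


# Hypothetical rules; a real rule layer would use richer node semantics.
SAFE_ACTIONS = {"scroll", "navigate", "set_text"}
HARD_WORDS = ("pay", "purchase", "buy", "delete", "close account")
SOFT_WORDS = ("send", "submit", "save changes")


def _mentions(label: str, words: tuple) -> bool:
    text = label.lower()
    return any(re.search(rf"\b{re.escape(w)}\b", text) for w in words)


def classify_by_rules(action: str, label: str) -> Optional[Risk]:
    """Cheap first pass. Returns None when the rules cannot decide."""
    if action in SAFE_ACTIONS:
        return Risk.REVERSIBLE
    if _mentions(label, HARD_WORDS):
        return Risk.HARD_IRREVERSIBLE
    if _mentions(label, SOFT_WORDS):
        return Risk.SOFT_IRREVERSIBLE
    return None


def classify(action: str, label: str,
             model_fallback: Callable[[str, str], Risk]) -> Risk:
    """Rules first; only pay for a model call when the rules are uncertain."""
    verdict = classify_by_rules(action, label)
    return verdict if verdict is not None else model_fallback(action, label)


def required_gate(risk: Risk) -> str:
    """Map each risk tier to the handling described in the taxonomy."""
    return {
        Risk.REVERSIBLE: "execute",
        Risk.SOFT_IRREVERSIBLE: "summarize_and_pause",
        Risk.HARD_IRREVERSIBLE: "confirm_with_delay",
    }[risk]


fallback_calls = []


def stub_model(action: str, label: str) -> Risk:
    # Stand-in for the LLM call; answers conservatively and records its use.
    fallback_calls.append((action, label))
    return Risk.SOFT_IRREVERSIBLE
```

Returning `None` from the rule layer, rather than a default tier, is what keeps the two paths separate: the common case never touches the model, and the ambiguous case is visible as an explicit fallback call.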

What surprised us: the failure mode we'd worried about most, the model confidently executing a wrong action, turned out to be less common than the opposite one, where the model picked the right action and the validation layer over-triggered on a false positive. An overly cautious agent that interrupts you every three steps is annoying enough that users disable the guard entirely. We tuned the thresholds hard against a test corpus of 200 real-world flows before the false-positive rate reached an acceptable level.

Machine Learning at the Edge: What Gemma 4 Gets Right and Wrong

Gemma 4's instruction-following on structured action plans is genuinely impressive for its size. When the context is clean — well-pruned UI tree, clear user intent, simple app — it executes correctly north of 85% of the time on our benchmark flows. That's not GPT-4 territory, but it's enough to be useful for a large class of tasks.

Where it breaks down: ambiguous UI state, apps that use non-standard accessibility labeling (looking at you, every React Native app ever shipped), and multi-step flows where an early error compounds. The model doesn't recover gracefully from mid-flow failures. It tends to retry the same failed action rather than backtrack and replan. We partially address this with a retry budget and forced replan trigger in the orchestration layer, but it's a real limitation of the current model.
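The retry budget and forced replan can be pictured as a small loop around each step. This is a sketch of the idea only, with hypothetical names and limits; the real orchestration layer presumably tracks far more state than a list of log strings.

```python
from typing import Callable


def run_step(step: str,
             execute: Callable[[str], bool],
             replan: Callable[[str], str],
             max_attempts: int = 3,
             max_replans: int = 1) -> list:
    """Try a step up to max_attempts times, then force a replan instead of
    retrying the same action forever. Give up once replans are exhausted."""
    log = []
    current = step
    replans = 0
    while True:
        for _ in range(max_attempts):
            ok = execute(current)
            log.append(f"{'ok' if ok else 'fail'}:{current}")
            if ok:
                return log
        if replans >= max_replans:
            log.append("abort")
            return log
        current = replan(current)  # ask the planner for a different action
        replans += 1
        log.append(f"replan:{current}")


# Fake executor and planner, for illustration only.
def fake_execute(step: str) -> bool:
    return step == "tap Continue"


def fake_replan(failed_step: str) -> str:
    return "tap Continue"


log = run_step("tap Next", fake_execute, fake_replan)
```

In this run the original step fails three times, the loop forces one replan, and the replacement succeeds on its first attempt. If the replacement also failed, the loop would end with an explicit "abort" rather than retrying indefinitely.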

The other honest limitation: Gemma 4 at 4B parameters has a narrower world model than the frontier models. It knows what a "checkout button" is. It doesn't always know that a specific airline's booking flow has an unusual confirmation step that looks like an upsell screen. Fine-tuning on domain-specific flows helps significantly — we saw a 22-point accuracy improvement on travel booking flows after fine-tuning on 1,400 labeled examples — but that's work that has to be repeated per vertical.

What We Open-Sourced and Why

deft-intent, deft-tree, and deft-guard are all MIT-licensed on GitHub. The decision to open-source was straightforward: the value in Deft isn't in these libraries in isolation, it's in the tuned models, the curated flow datasets, and the integration work. Keeping the libraries closed would have slowed the ecosystem without protecting anything meaningful.

The response from the Android automation and accessibility communities has been faster than I expected. deft-tree in particular has gotten traction from developers building accessibility tools — the UI tree compression problem isn't unique to AI agents, it's a general problem for anyone trying to programmatically understand an Android screen.

The broader trend here matters. As on-device model performance continues improving — and it is improving fast, with each new Qualcomm and MediaTek SoC generation — the architectural case for cloud inference weakens. The privacy case was always there. The performance case is catching up. Projects like Deft are early, but the trajectory is clear.

What Comes Next for Deft

Three immediate priorities:

Multi-modal context. Right now Deft works purely off the accessibility tree. Adding a lightweight vision model to capture screenshot context for apps with poor accessibility labeling would significantly expand coverage. We're evaluating whether this fits within the memory budget.

Cross-app flows. Today Deft operates within a single app per task. Flows that span apps — "take this confirmation email and add it to my calendar" — require orchestration across AccessibilityService contexts, which introduces state management complexity we haven't fully solved.

Flow sharing. The fine-tuned flow data is where the real value accumulates. We're designing a privacy-preserving mechanism for users to contribute anonymized flow traces to improve the shared model — opt-in, differential privacy applied, no raw UI data transmitted. The architecture for this is closer to federated learning than traditional telemetry.

The on-device AI phone agent category is going to be crowded within 18 months. Every major platform vendor is moving here. What we learned building Deft — particularly around UI tree compression and irreversible action validation — is directly applicable to whatever the next generation of these tools looks like. That's why we shipped the libraries before the product was finished. The hard problems are worth solving in the open.
