Dominic Jainy has spent his career at the intersection of artificial intelligence, machine learning, and blockchain, guiding enterprises from cautious pilots to scaled agentic systems. He blends hands-on build experience with governance and human-in-the-loop rigor, and has recently advised teams piloting service desk automation, cross-functional agent mediation, and digital twins tied to real-world workflows. In this conversation with Alistair Miller, he unpacks early wins and common pitfalls with AI agents, how to design orchestration and monitoring for emergent behavior, and the playbooks that keep humans confidently in control while pushing speed, cost, and quality forward.
Many leaders expect AI agents to create more confusion than value over the next year. Where do you see the biggest early wins versus pitfalls, and how would you sequence pilots to de-risk adoption? Please share examples, decision criteria, and guardrails that actually worked.
I’m not surprised that 70% of folks in the room agreed agents may create more confusion than value in the next 12 months; that sentiment reflects the reality of immature operating models. The biggest early wins live in high-volume, well-instrumented workflows with clear outcomes, like knowledge retrieval for internal policies, repetitive ticket triage, and data reconciliation between systems. I sequence pilots in three waves: first, narrow-scope assistive agents with hard stop points and transparent logs; second, semi-autonomous flows with bounded tool access and human approval; and third, cross-team orchestration only after monitoring and audit trails are stable. Decision criteria that worked for me include measurable cycle-time gaps (like a process currently taking days), readily available ground-truth data, and a low blast radius if something goes sideways. Guardrails that kept us safe: per-function autonomy thresholds, an explicit “refuse and escalate” behavior for ambiguous tasks, error budgets tied to business tolerance, and dashboards that surface both individual agent actions and collective patterns before they spiral.
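As a concrete illustration of how those guardrails can be wired, here is a minimal Python sketch; the function name, thresholds, and error budget are hypothetical, not figures from any real deployment:

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    """Per-function guardrail: an autonomy threshold plus an error budget."""
    function: str
    autonomy_threshold: float  # minimum confidence before the agent may act alone
    error_budget: float        # tolerated error rate before autonomy is throttled

def decide(confidence: float, errors_seen: int, actions_taken: int,
           g: Guardrail) -> str:
    """Return 'execute', or 'refuse_and_escalate' for ambiguous tasks
    or an exhausted error budget."""
    if actions_taken and errors_seen / actions_taken > g.error_budget:
        return "refuse_and_escalate"  # budget spent: hand control back to a human
    if confidence < g.autonomy_threshold:
        return "refuse_and_escalate"  # ambiguous task: don't guess, ask
    return "execute"

# A first-wave assistive agent runs with a high threshold and a tight budget.
triage = Guardrail("ticket_triage", autonomy_threshold=0.85, error_budget=0.02)
print(decide(0.78, errors_seen=1, actions_taken=120, g=triage))  # refuse_and_escalate
```

The dashboards he mentions would then chart how often each function hits the escalate branch, which is an early signal that a pilot's scope is miscalibrated.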
Some enterprises have shifted roughly 90% of IT service desk tasks to autonomous systems. What processes qualified for handoff first, which stayed manual, and how did you redesign workflows to prevent failure cascades? Walk through change management, SLAs, and KPIs you tracked.
In IT service desks we handed off the predictable, template-driven tasks first: password resets, basic access requests, device configuration checks, and status updates. What stayed manual were edge cases touching security policy exceptions, complex software conflicts, and anything involving sensitive entitlements—those remained human-led with agents as advisors. To prevent failure cascades, we decoupled workflows: each agent action updated a canonical ticket record with reversible steps, and any confidence dip triggered a human checkpoint rather than allowing daisy-chained automation to pile on. Change management hinged on two points: visible upskilling so that 85% of affected staff moved into higher-value roles, and clear communication that the remaining 10% of tasks would still require human stewardship. SLAs remained the same on paper but gained an auto-triage clause—if the agent couldn’t meet resolution confidence, it had to escalate early. KPIs that mattered were first-contact resolution, re-open rates after agent actions, and average time to human takeover; we also watched sentiment in free-text feedback because a frustrated “this didn’t help” often surfaced long before hard metrics shifted.
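A rough sketch of that decoupling, assuming a hypothetical confidence floor of 0.8 and illustrative ticket fields: every step lands on the canonical record with an undo handle, and a confidence dip pauses automation instead of letting steps daisy-chain:

```python
from dataclasses import dataclass, field
from typing import Callable

CONFIDENCE_FLOOR = 0.8  # illustrative; below this, a human checkpoint takes over

@dataclass
class Ticket:
    ticket_id: str
    history: list = field(default_factory=list)  # canonical, append-only record

def apply_step(ticket: Ticket, description: str, do: Callable[[], None],
               undo: Callable[[], None], confidence: float) -> bool:
    """Apply one reversible step, or escalate to a human on a confidence dip."""
    if confidence < CONFIDENCE_FLOOR:
        ticket.history.append(("escalated", description, confidence))
        return False
    do()
    ticket.history.append(("applied", description, confidence, undo))  # keep the undo
    return True

def rollback_last(ticket: Ticket) -> None:
    """Reverse the most recent applied step using its stored undo handle."""
    for i in range(len(ticket.history) - 1, -1, -1):
        if ticket.history[i][0] == "applied":
            ticket.history[i][3]()  # invoke the undo callable
            ticket.history[i] = ("reverted", *ticket.history[i][1:3])
            return

# Example: grant access reversibly, then roll it back.
state = {"access": "none"}
t = Ticket("INC-1042")
apply_step(t, "grant basic access",
           do=lambda: state.update(access="basic"),
           undo=lambda: state.update(access="none"), confidence=0.93)
rollback_last(t)
print(state["access"])  # 'none' again
```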
Cross‑functional workflows have dropped from days to seconds with agent mediation. How did you map data flows, define agent permissions, and verify accuracy at speed? Describe the before/after state, error budgets, and the feedback loops that kept humans confidently in control.
Before agent mediation, something like a finance–sales alignment took multiple handoffs and often stretched to four days; after, the mediated handshakes dropped to roughly eight seconds when the data lived in accessible systems and the contracts were standardized. We started by mapping data lineage from source systems to decision points, annotating which fields were authoritative and which were derived; that let us define per-agent scopes: read-only for sensitive ledgers, read/write for staging environments, and “propose-only” for final approvals. Accuracy at speed came from a dual-check: the agent had to cite sources for any recommendation and pass a lightweight critic review that flagged inconsistent fields; only then could it queue a human approval where policy required it. Error budgets were framed by business tolerance—if a mismatch risked downstream financial errors, the budget was essentially zero and the agent defaulted to “propose, don’t commit.” Feedback loops included an always-available “why” explainer, a one-click rollback for any committed change, and weekly human review of a random stratified sample focused on borderline cases to keep trust grounded in real evidence.
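One way to express those scopes and the dual-check, as a hedged sketch: the Scope names mirror the ones just described, while the gate logic and field names are assumptions about how such a policy might be wired:

```python
from enum import Enum

class Scope(Enum):
    READ_ONLY = "read_only"        # e.g., sensitive ledgers
    READ_WRITE = "read_write"      # e.g., staging environments
    PROPOSE_ONLY = "propose_only"  # e.g., final approvals

def gate(proposal: dict, scope: Scope, critic_ok: bool) -> str:
    """Dual-check: a recommendation needs cited sources and a passing critic
    review before it can even enter the human-approval queue."""
    if not proposal.get("citations"):
        return "rejected: no cited sources"
    if not critic_ok:
        return "rejected: critic flagged inconsistent fields"
    if scope is Scope.PROPOSE_ONLY:
        return "queued for human approval"  # propose, don't commit
    return "committed"

print(gate({"citations": ["ledger:2024-Q3"]}, Scope.PROPOSE_ONLY, critic_ok=True))
```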
When agents and humans collaborate through an orchestration layer, outcomes improve on speed, cost, and quality. Which operating model patterns scale best, and which anti-patterns derail teams? Share a client-style roadmap, role design, and the metrics that proved adoption stuck.
The most durable pattern is a hub-and-spoke orchestration layer with clear roles: planners to decompose goals, tool-using executors to perform tasks, and critics/directors to enforce policy and reduce sycophancy. The anti-patterns are monolithic “super agents” that try to do everything and opaque chains-of-thought that humans can’t audit; both lead to brittle systems and eroded trust. A practical roadmap starts with a 90-day foundation: instrumented pilots, autonomy thresholds per function, and a shared glossary for prompts and policies; next, a 6–9 month scale-out that adds cross-functional flows like the four-day-to-eight-seconds transition; finally, a maturity phase where teams operate against transformation targets such as compressing a 26-month roadmap to eight months for eligible programs. Role design pairs each agent role with a human: a product owner for planners, a domain SME for critics, and an SRE-like steward for executors. Adoption stuck when we saw stable cycle-time reductions, cost-per-transaction trending down, and quality signals like lower rework; the telltale sign was teams choosing agents for new work without mandates, because speed and clarity simply felt better.
Some organizations cut multi‑year goals to months using agentic systems. What levers—task decomposition, retrieval, tool use, or parallelization—mattered most? Please break down the architecture, the bottlenecks you hit, and the benchmarks you used to validate real throughput gains.
The step-change came from combining task decomposition with parallelization—breaking work into crisp sub-goals and running them concurrently instead of serializing everything behind a single path. Retrieval and tool use mattered in service of those two: retrieval grounded each sub-task in the right context, and tool use gave agents hands to execute rather than just advise. Architecturally we ran a planner that emitted structured tasks, a pool of specialized executors with scoped permissions, and critics that enforced policy before commits; an orchestration bus handled messaging, while observability streamed traces so humans could intervene. Bottlenecks showed up at integration points—when tools throttled, you felt it—so we cached intermediate results and designed fallback steps that didn’t stall the entire pipeline. We validated throughput with end-to-end cycle-time cuts like the 26-month to eight-month compression on real deliverables, not just synthetic benchmarks; a secondary proof was improved quality on first pass, which let us retire unnecessary human checkpoints without raising risk.
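In miniature, that shape reduces to a planner emitting structured tasks and a pool running them concurrently with caching and fallbacks; this sketch uses stand-in sub-task names and trivially fast executors:

```python
import concurrent.futures
from functools import lru_cache

# Stand-ins for the structured tasks a planner might emit for one goal.
SUBTASKS = ["extract_terms", "check_inventory", "draft_summary", "validate_totals"]

@lru_cache(maxsize=None)
def run_subtask(name: str) -> str:
    """Placeholder executor; caching intermediate results means a retried or
    repeated sub-task doesn't hit a throttled tool twice."""
    return f"{name}: done"

def run_with_fallback(name: str) -> str:
    """Fail soft: a broken integration defers one sub-task instead of
    stalling the whole pipeline."""
    try:
        return run_subtask(name)
    except Exception:
        return f"{name}: deferred"

# Parallelization: run independent sub-goals concurrently, not serially.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(run_with_fallback, SUBTASKS)))
```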
Engineering teams now spin up multiple solution paths in parallel with background agents. How do you manage branching complexity, code quality, and re-merge strategies? Describe your repository hygiene, test gates, and the observability signals engineers rely on during exploration.
We embrace branching, but we constrain it with explicit lanes: exploratory branches for agent-generated alternatives and a protected integration branch that only accepts code passing a full test suite. Agents operate in sandboxes with read-only access to the main history and propose changes via pull requests that include rationale, cited references, and a “what I changed and why” summary. Test gates start simple—linting, unit tests, and policy checks—and escalate to integration tests and security scans before anything hits the protected branch; humans review deltas with an agent critic that flags risky diffs and ambiguous changes. Observability is our north star during exploration: we watch agent confidence trends, test flakiness rates, and anomaly signatures in commit patterns; spikes in reverts or repeated fixes to the same area tell us to pause and reassess the branch. Re-merge isn’t a free-for-all; we stage merges behind feature flags and maintain one-click rollback so an ambitious exploration never jeopardizes stability.
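A minimal sketch of the escalating gates; the commands and test paths are hypothetical and would differ per repository:

```python
import subprocess

# Gates run cheapest-first; an exploratory branch must clear all of them
# before it may merge into the protected integration branch.
GATES = [
    ("lint",        ["python", "-m", "pyflakes", "src"]),
    ("unit tests",  ["python", "-m", "pytest", "tests/unit", "-q"]),
    ("integration", ["python", "-m", "pytest", "tests/integration", "-q"]),
]

def run_gates() -> bool:
    for name, cmd in GATES:
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            print(f"gate failed: {name}")  # fail fast; skip costlier gates
            return False
        print(f"gate passed: {name}")
    return True

if __name__ == "__main__":
    run_gates()
```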
In multi‑agent settings, emergent behaviors can create side effects. How do you detect, triage, and resolve them without halting progress? Walk through your monitoring stack, anomaly signatures you watch, escalation paths, and an incident you learned from—metrics and timelines included.
Emergence shows up as subtle pattern drifts before it becomes a headline, so we monitor both individual agents and their collective conversation graph. Our stack captures action traces, tool invocations, policy critic decisions, and cross-agent dialogues; we visualize clusters of interactions so we can spot echo chambers—classic sycophancy—forming. Anomaly signatures include sudden convergence on the same answer without new evidence, repeated back-and-forth handoffs with no progress, and confidence rising while citation quality drops. Escalation paths are tiered: auto-throttle autonomy for the implicated agents, route the task to a human lead, and trigger a critic review that inspects the last N steps across the team. In one incident, a set of agents negotiating a cross-functional task began reinforcing a false assumption sourced from an outdated doc; within hours we saw rising confidence and near-identical recommendations. We intervened by resetting context with authoritative sources, adding directors to inject dissent, and requiring citations; the loop calmed quickly, and we finished the work the same day without halting the broader program.
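Two of those signatures lend themselves to mechanical checks; a sketch with illustrative windows and scores (a real stack would compute these over streaming traces):

```python
from statistics import mean

def convergence_without_evidence(answers: list[str], new_citations: int) -> bool:
    """Flag agents converging on identical answers with no new evidence."""
    return len(set(answers)) == 1 and new_citations == 0

def confidence_citation_divergence(confidences: list[float],
                                   citation_scores: list[float]) -> bool:
    """Flag confidence trending up while citation quality trends down."""
    half = len(confidences) // 2
    conf_rising = mean(confidences[half:]) > mean(confidences[:half])
    cites_falling = mean(citation_scores[half:]) < mean(citation_scores[:half])
    return conf_rising and cites_falling

# Illustrative four-step window, like the incident described above.
print(convergence_without_evidence(["approve", "approve", "approve"], 0))            # True
print(confidence_citation_divergence([0.6, 0.7, 0.85, 0.9], [0.8, 0.7, 0.5, 0.4]))   # True
```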
Effective orchestration needs communication management, task assignment, and conflict resolution among agents. Which coordination primitives (planners, critics, directors) have been most reliable, and why? Outline your playbook for role design, contention handling, and autonomy thresholds by function.
Planners, critics, and directors each earn their keep. Planners shine when they output structured task trees with explicit dependencies; critics act as the immune system, catching policy and logic slips; directors shape tone and inject healthy disagreement to counter sycophancy. My playbook pairs each business function with a specific autonomy ceiling: finance often runs at “propose-only,” sales operations may allow “propose-and-commit” in staging, and IT maintenance can reach “commit” on low-risk actions like metadata updates. Contention handling starts with evidence-based debate: each agent must cite its sources, critics score the arguments, and directors pick the next step if the score is tied. Humans sit above as tie-breakers when stakes are high; we’ve found that simple, explicit rules beat elaborate, opaque heuristics every time.
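Those ceilings are worth encoding explicitly rather than burying in prompts; a sketch with the function-to-ceiling mapping as an illustrative example:

```python
from enum import IntEnum

class Autonomy(IntEnum):
    PROPOSE_ONLY = 1         # agent drafts; a human commits
    PROPOSE_AND_COMMIT = 2   # agent may commit, but only in staging
    COMMIT = 3               # agent commits low-risk actions directly

# Illustrative ceilings mirroring the examples above.
CEILINGS = {
    "finance": Autonomy.PROPOSE_ONLY,
    "sales_ops": Autonomy.PROPOSE_AND_COMMIT,
    "it_maintenance": Autonomy.COMMIT,
}

def allowed(function: str, requested: Autonomy) -> bool:
    """Permit an action only up to the function's autonomy ceiling;
    unknown functions default to the most conservative level."""
    return requested <= CEILINGS.get(function, Autonomy.PROPOSE_ONLY)

print(allowed("finance", Autonomy.COMMIT))         # False: finance is propose-only
print(allowed("it_maintenance", Autonomy.COMMIT))  # True: low-risk commits allowed
```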
Human-in-the-loop remains essential, especially in higher‑risk domains. Where do you place human checkpoints, and how do you prove they’re meaningful, not rubber stamps? Share review criteria, sampling strategies, and a step-by-step example of an intervention that changed an agent’s outcome.
I place human checkpoints where policy meets impact: before committing to sensitive systems, at decision boundaries with low signal, and whenever agents disagree with credible evidence on both sides. To keep reviews meaningful, reviewers must reject or request revisions with reasoned notes tied to clear criteria—source quality, consistency with policy, and alignment with business objectives—so rubber stamping simply isn’t an option. We also sample a stratified set of “green” cases that look easy, because that’s where complacency creeps in; catching one subtle miss there pays dividends. In one case, an agent proposed reconciling a cross-team record in seconds, citing logs and a standard template; a human spotted that the source doc had been updated but not propagated, so they requested fresh retrieval and added a director instruction to encourage disagreement. The agent re-ran, found the new doc, and altered the recommendation; that small intervention prevented a needless cleanup later and strengthened the agent’s future behavior.
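Making the sampling deliberately include "green" cases is straightforward; a sketch with hypothetical risk bands:

```python
import random

def stratified_review_sample(cases: list[dict], per_stratum: int = 5,
                             seed: int = 0) -> list[dict]:
    """Draw human-review cases from every stratum, including the 'green'
    band that looks easy, which is where complacency creeps in."""
    rng = random.Random(seed)
    strata: dict[str, list[dict]] = {}
    for case in cases:
        strata.setdefault(case["risk_band"], []).append(case)
    sample: list[dict] = []
    for members in strata.values():
        rng.shuffle(members)
        sample.extend(members[:per_stratum])
    return sample

# Even 40 easy-looking cases still contribute to the weekly review queue.
cases = [{"id": i, "risk_band": band}
         for i, band in enumerate(["green"] * 40 + ["amber"] * 8 + ["red"] * 2)]
print(len(stratified_review_sample(cases, per_stratum=2)))  # 2 per band = 6
```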
Digital twin simulations can spark real-world interactions but raise privacy concerns. How do you source public and first‑party data responsibly, set consent defaults, and honor opt‑outs at runtime? Describe your data minimization rules, red-teaming for identity risks, and user transparency practices.
The right starting point is clear consent and data minimization: we ingest only what's necessary to create useful, grounded behavior (basic profiles, high-level goals, and public-facing content like contributor pages) and nothing gratuitous. Consent defaults should be honest and revocable; at runtime we maintain an active opt-out registry so a person can vanish from the twin world without a trace, rather than merely having new collection stop. We red-team for identity leakage by attempting linkages across twins to see whether one twin can infer sensitive traits about another; directors can also instruct twins to welcome disagreement, which counteracts herd behavior and helps surface mismatches before they spread. Transparency matters: we show people their twin's background and goals and let them inspect conversations; that visibility creates trust and invites course correction. In one simulation, no one opted out at first, but the presence of an easy exit was essential; it turned a provocative experiment into a respectful experience.
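A minimal sketch of such a runtime opt-out registry, with hypothetical IDs; the point is that visibility is re-checked on every simulation step, not only at ingestion:

```python
class OptOutRegistry:
    """Active registry consulted at runtime, not just at ingestion time."""
    def __init__(self) -> None:
        self._opted_out: set[str] = set()

    def opt_out(self, person_id: str) -> None:
        self._opted_out.add(person_id)

    def is_visible(self, person_id: str) -> bool:
        return person_id not in self._opted_out

def twin_participants(all_ids: list[str], registry: OptOutRegistry) -> list[str]:
    """Filter every simulation step through the registry, so opting out
    removes the twin immediately rather than only stopping new collection."""
    return [pid for pid in all_ids if registry.is_visible(pid)]

registry = OptOutRegistry()
registry.opt_out("user-42")
print(twin_participants(["user-17", "user-42", "user-88"], registry))  # user-42 gone
```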
Reducing agent sycophancy sometimes involves explicit prompts to encourage disagreement. Beyond prompt engineering, what governance or diversity mechanisms make agent debates productive? Please detail evaluation rubrics, tie-breaker logic, and a case where structured dissent improved a decision.
Governance beats clever prompts alone. We enforce diversity at the role level—critics trained to flag weak evidence, directors who instruct agents to surface minority views, and planners that solicit alternatives rather than defaulting to the first path. Our evaluation rubric scores each proposal on evidence quality, policy alignment, and novelty of insight; proposals must include citations and an uncertainty readout. Tie-breaking is transparent: if scores tie, the director asks for a short counterfactual—“what would make this wrong?”—and the option with the clearer failure analysis advances. In practice we saw this shine when a cross-functional decision looked settled; a dissenting agent cited updated guidance and forced a re-check, shifting a days-long process to seconds without cutting corners. The win wasn’t just speed; it was confidence that the quickest path was also the right one.
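The rubric and tie-breaker might be encoded like this sketch; the weights, fields, and the length-of-counterfactual heuristic are illustrative stand-ins, not his production scoring:

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    agent: str
    evidence_quality: float   # 0..1, scored by critics
    policy_alignment: float   # 0..1
    novelty: float            # 0..1
    has_citations: bool
    counterfactual: str       # answer to "what would make this wrong?"

def rubric_score(p: Proposal) -> float:
    """Weighted rubric; proposals without citations are disqualified outright."""
    if not p.has_citations:
        return 0.0
    return 0.5 * p.evidence_quality + 0.35 * p.policy_alignment + 0.15 * p.novelty

def pick(a: Proposal, b: Proposal) -> Proposal:
    """Advance the higher score; on a tie, the clearer failure analysis wins
    (length is a crude stand-in for clarity here)."""
    sa, sb = rubric_score(a), rubric_score(b)
    if abs(sa - sb) > 1e-6:
        return a if sa > sb else b
    return a if len(a.counterfactual) > len(b.counterfactual) else b

a = Proposal("agent_a", 0.8, 0.9, 0.3, True, "wrong if the Q3 policy doc is stale")
b = Proposal("agent_b", 0.8, 0.9, 0.3, True, "none")
print(pick(a, b).agent)  # agent_a: equal scores, clearer failure analysis
```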
Upskilling moves people into higher‑value roles alongside agents. Which curricula, labs, and certifications actually change behavior, and how do you measure skill lift? Share an onboarding path, adoption metrics over 90 days, and stories of role transitions that stuck.
Behavior changes when learning is task-centered and hands-on. We run labs where people configure agents with real tools, practice orchestrating planners–critics–directors, and learn to read traces and intervene; short certifications cap each module so progress feels tangible. The first 30 days focus on safe sandboxes; by day 60, learners run semi-autonomous flows with human checkpoints; by day 90, they manage a live pilot with clear KPIs. We measure skill lift by tracking how often individuals choose to use agents without being told, how quickly they spot and correct agent missteps, and whether they can articulate policy trade-offs rather than just copy prompts. It’s powerful to see 85% of workers moving into higher-value roles when supported well; those stories stick because people feel ownership over the tools and the outcomes.
For teams building personal or on-device agents connected to third‑party models, how do you balance autonomy, security, and cost? Outline your deployment steps, key configuration choices, and the observability you keep on-device versus in the cloud.
On-device agents strike a great balance when you design for privacy-first and dial up autonomy only where you have strong safeguards. Deployment starts with an on-device core limited to local data, then careful connections to third-party models with scoped APIs; we keep sensitive context local and send only minimal prompts or embeddings outward as needed. Configuration choices include hard caps on tool permissions, caching to reduce repeated calls, and director policies that nudge agents to question assumptions rather than blindly comply. Observability splits: on-device we log actions, permissions used, and local errors; in the cloud we capture coarse-grained telemetry and policy outcomes without shipping private content. That mix gives you enough visibility to debug and improve while keeping the user’s data footprint under control.
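The split can be as blunt as an allow-list for what leaves the device; a sketch with hypothetical field names:

```python
CLOUD_FIELDS = {"event_type", "policy_outcome", "latency_ms"}  # coarse telemetry only

def split_telemetry(event: dict) -> tuple[dict, dict]:
    """Route fields: anything not on the cloud allow-list (e.g., prompt
    content, local errors, permissions used) stays on-device by default."""
    local = {k: v for k, v in event.items() if k not in CLOUD_FIELDS}
    cloud = {k: v for k, v in event.items() if k in CLOUD_FIELDS}
    return local, cloud

event = {"action": "summarize_notes", "permission_used": "files.read",
         "event_type": "tool_call", "policy_outcome": "allowed",
         "latency_ms": 112, "prompt": "private content stays here"}
local_log, cloud_stream = split_telemetry(event)
print(sorted(cloud_stream))  # ['event_type', 'latency_ms', 'policy_outcome']
```

Defaulting unlisted fields to the local log is the safer failure mode: a new field added upstream stays private until someone consciously promotes it to the cloud stream.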
Choosing platforms for hosting and network‑native agent collaboration can be daunting. What selection criteria—latency, tool integration, policy control, or ecosystem—carry the most weight? Walk through a vendor assessment you conducted and the trade-offs you accepted.
For network-native collaboration, I weigh policy control and tool integration first, then latency and ecosystem maturity. The orchestration fabric needs native roles—planners, critics, directors—and the ability to enforce autonomy thresholds per function; integrations should cover the tools your agents actually need, not just boast a long catalog. In one assessment, we favored a platform with strong governance and open connectors over a faster but opaque alternative; the trade-off was accepting slightly higher latency in exchange for clearer audit trails and human-in-the-loop hooks. We also valued an open approach that aligned with the goal of secure agent collaboration across the internet; that made future interoperability less risky. The proof came when cross-functional workflows dropped from days to seconds without sacrificing transparency, which is the right kind of trade.
Many leaders expect AI agents to create more confusion than value. How would you sequence pilots to build confidence specifically among skeptics, and what evidence do you surface to change minds without overpromising?
I start skeptics on narrow, reversible use cases with crisp baselines—show them a task that historically took days and walk them through how agents safely reduce it to seconds. Every pilot has a live dashboard with traces, citations, and one-click rollbacks so people can see, touch, and question the system; you win hearts when they feel in control. I never pitch “autonomy” first; I pitch “augmentation with guardrails” and then let the data speak—like a visible handoff that escalates if confidence dips or sources conflict. The turning point is when skeptics witness an agent catch an inconsistency and ask for help; that moment flips the script from fear to partnership.
What mistakes do teams make when they try to go from a single agent to orchestrated teams, and how do you phase the transition to avoid regressions in speed or quality?
The classic mistake is treating orchestration as just “more agents,” rather than a shift in roles, governance, and observability. Teams often skip adding critics and directors, over-trust a planner, and end up with agreeable but shallow outcomes. I phase the transition by codifying roles and autonomy thresholds, adding dashboards that show inter-agent dynamics, and running shadow deployments where the new team observes and proposes but doesn’t commit. Once the signals look healthy—fast cycles, low rework, and no signs of echo chambers—I flip commits on for low-risk tasks and expand from there. It’s slower at first, but you avoid the whiplash of a big-bang cutover that burns trust.
In your experience, what keeps adoption sticky after the initial excitement fades, and how do you maintain momentum without constant executive pressure?
Stickiness comes from everyday wins and cultural rituals, not slogans. When a salesperson sees a quote–approval jump from four days to eight seconds, or an engineer tries five ideas in parallel without weeks of toil, they come back because it feels better. We institutionalize success with lightweight forums—demos, postmortems, and leaderboards that celebrate quality improvements and safe catches as much as speed. The quiet engine of momentum is observability: when people can see what agents did, why they did it, and how to fix it, they trust the system on ordinary Tuesdays, not just during big launches.
Do you have any advice for our readers?
Start with one valuable, low-blast-radius workflow and instrument it so well that every stakeholder can see what’s happening in real time. Treat orchestration, monitoring, and human-in-the-loop as first-class features, not add-ons you’ll fix later. Encourage healthy disagreement—among humans and agents—so you catch brittle assumptions early; directors that invite dissent are a practical antidote to sycophancy. Most of all, invest in your people: give them sandboxes, let them play, and celebrate the 85% who’ll move into higher-value roles when you back them with real training and trust.
What is your forecast for AI agents in the enterprise?
Over the next year, we’ll still see that 70% anxiety show up in boardrooms, but the organizations that pair orchestration with observability and human accountability will be the ones that turn skepticism into measurable wins. Expect more internal success stories like the 90% service desk handoff and cross-functional jumps from days to seconds, with guarded autonomy where policy demands it. We’ll also see digital twins move from novelty to utility, provided consent and opt-outs are honored at runtime and data stays minimal by design. In short, the future favors teams that are brave in experimentation and meticulous in governance—those who can compress a 26-month ambition to eight months without losing their footing.
