Beyond the Script: Building a "Self-Healing" Autonomous Web Agent
January 9, 2026
Traditional web automation is brittle. Anyone who has written a Selenium or Puppeteer script knows the pain: you spend hours coding a sequence of clicks, only for the entire script to crash the next day because a button moved three pixels to the left or an unexpected "Sign Up for our Newsletter" popup appeared.
I wanted to build something better. I didn't just want a script that followed instructions blindly; I wanted an agent that could see, think, and adapt when things went wrong.
Here is a look at Sentinel, an autonomous web agent framework I built using React, Node.js, and Google’s Gemini models.
The Goal: Resilience Over Rigidity

The primary goal of this project was to create an agent framework that could navigate the messy, unpredictable reality of the live web without constant hand-holding.
To achieve this, the system needed three things:
Eyes: The ability to see the screen exactly as a human does.
Hands: A way to interact with a real browser instance.
A Brain: A cognitive engine to interpret visuals and make decisions.
The Architecture

I designed the solution as a full-stack application with a clear separation of concerns between the "Mission Control" and the "Agent."
- The "Mission Control" (Frontend): Built with React and Vite, this dashboard serves as the human interface. It allows me to set a target URL and a high-level goal (e.g., "Go to Google and search for Gemini API").
Crucially, it provides real-time telemetry. Using WebSockets, it streams logs from the backend and displays a live feed of the agent’s "computer vision"—screenshots of exactly what the bot is looking at in that moment.
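To make the telemetry concrete, here is a minimal sketch of the kind of frame the backend might push to the dashboard over the WebSocket. The function name and field names are illustrative assumptions, not Sentinel's actual wire format:

```javascript
// Hypothetical telemetry frame builder. Field names are illustrative,
// not Sentinel's actual protocol.
function makeTelemetryFrame(level, message, screenshotBase64 = null) {
  return JSON.stringify({
    type: screenshotBase64 ? "vision" : "log", // vision frames carry a screenshot
    level,                                     // e.g. "INFO", "ERROR", "THINKING"
    message,
    screenshot: screenshotBase64,              // PNG as base64, or null for plain logs
    timestamp: Date.now(),
  });
}

// The backend would broadcast each frame to connected dashboard clients,
// e.g. wsServer.clients.forEach(c => c.send(frame));
const frame = makeTelemetryFrame("INFO", "Navigated to target URL");
console.log(JSON.parse(frame).type); // "log"
```

Tagging each frame with a `type` lets the React side route plain log lines to the console panel and screenshot frames to the live "computer vision" feed with a single message handler.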
- The "Hands" (Backend): The backend is a Node.js server controlling a real instance of Google Chrome via Puppeteer. I spent significant effort here keeping the browser environment stable, using dedicated user profiles to avoid session crashes and specific launch flags to reduce bot-detection blocking.
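A launch configuration along these lines illustrates the approach. This is a sketch using Puppeteer's standard `launch()` options; the profile path and the exact flag set are assumptions, not Sentinel's precise settings:

```javascript
// Illustrative Puppeteer launch options, assuming the standard launch() API.
// The profile path and flags are examples of the technique, not exact values.
const launchOptions = {
  headless: false,                    // drive a visible, real Chrome window
  userDataDir: "./profiles/sentinel", // dedicated profile: cookies and logins
                                      // persist across restarts instead of
                                      // starting from a fresh session each run
  args: [
    "--no-first-run",                 // skip Chrome's first-run dialogs
    "--disable-blink-features=AutomationControlled", // hide a common bot signal
  ],
};

// const browser = await require("puppeteer").launch(launchOptions);
```

The `userDataDir` option is what gives the agent a stable identity between runs, while the `AutomationControlled` flag suppresses the `navigator.webdriver` tell that many bot detectors check first.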
- The "Brain" (AI Integration): This is where the magic happens. I integrated Google's Gemini multimodal models to serve as the agent's cognitive engine.
To balance speed with intelligence, I implemented a "two-speed" cognitive architecture:
System 1 (Fast): Gemini 1.5 Flash. This model handles the standard "perceive-and-act" loop. It takes a screenshot, analyzes the interactive elements, and decides on the next click almost instantly.
System 2 (Slow): Gemini 1.5 Pro. This heavier, reasoning-focused model is held in reserve for complex problem-solving.
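The escalation rule behind the two-speed design can be sketched in a few lines. The state shape and function name here are hypothetical; the point is that the expensive model is only consulted when the cheap loop gets stuck:

```javascript
// Minimal sketch of two-speed model selection. The state field is a
// placeholder; the escalation rule is the interesting part.
const FAST_MODEL = "gemini-1.5-flash"; // System 1: the perceive-and-act loop
const DEEP_MODEL = "gemini-1.5-pro";   // System 2: reserved for recovery

function pickModel(state) {
  // Escalate to the slow, reasoning-heavy model only when the fast loop
  // has hit an error it could not act through on its own.
  return state.lastActionFailed ? DEEP_MODEL : FAST_MODEL;
}

console.log(pickModel({ lastActionFailed: false })); // gemini-1.5-flash
console.log(pickModel({ lastActionFailed: true }));  // gemini-1.5-pro
```

Keeping Flash on the hot path keeps the per-step latency and token cost low; Pro's slower, deeper reasoning is paid for only on the rare failure.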
The Breakthrough: Self-Healing in Action

The most exciting part of this build wasn't watching the agent successfully click a button; it was watching it fail and then recover on its own.
During testing on a live website, the agent attempted to click a search bar. However, an unexpected modal dialog (a popup) was obscuring the element. A standard script would have thrown an ElementClickIntercepted error and crashed instantly.
Sentinel did something different.
1. Error Detection: The backend caught the click error.
2. Cognitive Shift: Instead of crashing, the system triggered a "Deep Reasoning" state, switching from the fast Flash model to the smarter Pro model.
3. Diagnosis & Planning: The Pro model analyzed the screenshot of the error state, recognized the obscuring popup, and formulated a Recovery Strategy: "Close modal dialog."
4. Execution: The agent found the "X" on the popup, clicked it, and then resumed its original goal successfully.
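The recovery loop described above boils down to a try/catch with an escalation path. This is a hedged sketch: `perform`, `diagnose`, and `executeRecovery` are hypothetical stand-ins for the Puppeteer action, the Pro-model call, and the recovery click, injected here so the control flow is visible on its own:

```javascript
// Sketch of the self-healing loop. The three callbacks are hypothetical
// stand-ins for the real Puppeteer action, the Pro-model diagnosis, and
// the recovery execution.
async function actWithRecovery(action, { perform, diagnose, executeRecovery }) {
  try {
    return await perform(action);                 // normal fast-loop attempt
  } catch (err) {                                 // e.g. click intercepted
    const strategy = await diagnose(err, action); // deep reasoning: analyze the
                                                  // error-state screenshot and
                                                  // plan a recovery
    await executeRecovery(strategy);              // e.g. close the modal dialog
    return perform(action);                       // resume the original goal
  }
}
```

Because the recovery logic wraps every action uniformly, the agent does not need popup-specific code anywhere; any obstruction that the Pro model can diagnose from a screenshot becomes recoverable.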
A snapshot of the system logs showing the self-healing loop in action:
```
[ERROR]    Action failed: ElementClickIntercepted
[SYSTEM]   Cooling down... engaging Deep Reasoning...
[THINKING] Error detected. Analyzing recovery options...
[ACTION]   Recovery Strategy formulated: Close modal dialog.
[SUCCESS]  Recovery successful. Resuming main task.
```
Conclusion

Building Sentinel taught me that the future of automation isn't about writing perfect scripts for perfect worlds. It's about building resilient systems that can handle imperfection. By combining traditional web drivers with multimodal AI, we can create agents that don't just follow orders; they adapt to their environment.