A voice/text-controlled browser automation agent powered by GPT-4o-mini. 100% Chrome Extension based - no Playwright, no Selenium!
┌─────────────────────────────────────────────────────────────┐
│                       Chrome Browser                        │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Chrome Extension (extension/)                         │  │
│  │ • Collects page context (DOM, text, elements)         │  │
│  │ • Executes actions (click, type, scroll, etc.)        │  │
│  │ • Opens/manages tabs                                  │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                              │ HTTP (localhost:8765)
┌─────────────────────────────────────────────────────────────┐
│                  Python Backend (agent.py)                  │
│  • AI Brain (GPT-4o-mini)                                   │
│  • Plans actions based on page context                      │
│  • Web UI for sending commands                              │
└─────────────────────────────────────────────────────────────┘
Install the Python dependencies:
    pip3 install -r requirements.txt
Set your OpenAI API key:
    export OPENAI_API_KEY='your-key-here'
Load the extension in Chrome:
- Open Chrome and navigate to chrome://extensions
- Enable Developer mode (toggle in top-right corner)
- Click "Load unpacked"
- Select the extension folder from this project
- You should see the "Agent Bridge" extension loaded
Start the backend:
    python3 agent.py
You should see:
BROWSER AGENT (Chrome Extension)
======================================================================
Setup Instructions:
1. Open Chrome and go to: chrome://extensions
2. Enable 'Developer mode' (top right)
3. Click 'Load unpacked' and select the 'extension' folder
4. The extension will handle all browser interactions!
How to use:
• Web UI: http://127.0.0.1:8765
• Terminal: Type commands here
• Shortcut: Type '1' for a GUI dialog box
Agent ready! Type 'exit' or press Ctrl+C to quit.
======================================================================
Web UI:
- Open http://127.0.0.1:8765 in your browser
- Type a command (e.g., "Go to Reddit and click the first post")
- Click "Run"
Terminal:
- Type commands directly in the terminal where agent.py is running
- Example: Go to YouTube and search for cats
GUI dialog:
- Type 1 in the terminal
- A dialog box will appear
- Enter your command and click OK
💬 Go to reddit.com
💬 Open a new tab and go to youtube.com
💬 Click on the first post
💬 Search for python tutorials
💬 Scroll down and click the login button
💬 Type "hello world" in the search box
- OBSERVE: Extension sends page context to Python backend
  - URL, title, text content
  - All interactive elements (buttons, links, inputs)
  - Element positions (x, y coordinates)
- DECIDE: GPT-4o-mini analyzes the context
  - Reads what's on the page
  - Plans the next action
- ACT: Backend sends action to extension
  - Extension executes the action (click, type, etc.)
  - Action happens in the actual Chrome browser
- LOOP: Repeat until task is complete
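The observe/decide/act cycle above can be sketched in Python roughly as follows. This is an illustration only, not the project's actual code: `run_task`, `get_context`, `decide`, `send_action`, the `max_steps` safety cap, and the action dictionary format are all hypothetical.

```python
from typing import Callable, Dict, List

Action = Dict[str, object]  # assumed shape, e.g. {"type": "click", "x": 10, "y": 20}


def run_task(
    command: str,
    get_context: Callable[[], dict],
    decide: Callable[[str, dict, List[Action]], Action],
    send_action: Callable[[Action], None],
    max_steps: int = 20,  # assumed cap so a confused model cannot loop forever
) -> List[Action]:
    """Run the loop until the model reports completion or the cap is hit."""
    history: List[Action] = []
    for _ in range(max_steps):
        context = get_context()                     # OBSERVE: latest page context
        action = decide(command, context, history)  # DECIDE: model picks next action
        history.append(action)
        if action.get("type") == "task_complete":   # LOOP ends when the task is done
            break
        send_action(action)                         # ACT: hand the action to the extension
    return history
```

Passing the three callables in keeps the loop testable without a browser or an API key: a scripted `decide` function is enough to exercise it.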
manifest.json: Extension configuration
- Permissions: tabs, scripting, activeTab
- Runs on all URLs
content.js: Runs on every web page
- Collects page context every 4 seconds
- Executes actions (click, type, scroll, navigate)
- Polls backend for actions every 1.2 seconds
service_worker.js: Background script
- Relays context to Python backend
- Handles tab creation
- Polls for actions every 1 second
popup.html/js: Extension popup
- Shows connection status
- Quick link to open backend UI
agent.py: Main entry point
- Starts web UI server
- Manages command queue
- Routes commands to agent_runner
agent_runner.py: AI brain
- Sends context + command to GPT-4o-mini
- Receives tool calls (actions)
- Executes actions via extension_bridge
actions.py: Action functions
- click(x, y): Click at coordinates
- type_text(text): Type text
- scroll(direction, amount): Scroll page
- press_key(key): Press keyboard key
- navigate_url(url): Navigate current tab
- open_tab(url): Open new tab
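As a rough idea of what these functions might hand to the extension, the sketch below builds one plain dictionary per action. The payload keys (`type`, `x`, `y`, and so on) and the set of accepted scroll directions are assumptions made for illustration; the real `actions.py` may use a different format.

```python
from typing import Any, Dict

VALID_DIRECTIONS = {"up", "down"}  # assumed; the README does not list the directions


def click(x: int, y: int) -> Dict[str, Any]:
    """Describe a click at page coordinates (x, y)."""
    return {"type": "click", "x": x, "y": y}


def type_text(text: str) -> Dict[str, Any]:
    """Describe typing text into the focused element."""
    return {"type": "type_text", "text": text}


def scroll(direction: str, amount: int) -> Dict[str, Any]:
    """Describe a scroll; rejects directions outside the assumed set."""
    if direction not in VALID_DIRECTIONS:
        raise ValueError(f"unsupported direction: {direction!r}")
    return {"type": "scroll", "direction": direction, "amount": amount}


def press_key(key: str) -> Dict[str, Any]:
    """Describe a single key press, e.g. "Enter"."""
    return {"type": "press_key", "key": key}


def navigate_url(url: str) -> Dict[str, Any]:
    """Describe navigating the current tab to url."""
    return {"type": "navigate_url", "url": url}


def open_tab(url: str) -> Dict[str, Any]:
    """Describe opening url in a new tab."""
    return {"type": "open_tab", "url": url}
```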
extension_bridge.py: Communication layer
- Queue for actions (Python → Extension)
- Storage for page context (Extension → Python)
- Thread-safe
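A thread-safe bridge of this kind could look something like the sketch below. The class name `ExtensionBridge` and its method names are hypothetical; the sketch only shows one common way to combine a `queue.Queue` for outgoing actions with a lock-protected slot for the latest context.

```python
import queue
import threading
from typing import Any, Dict, Optional


class ExtensionBridge:
    """Hypothetical sketch: action queue plus latest-context storage."""

    def __init__(self) -> None:
        self._actions: "queue.Queue[Dict[str, Any]]" = queue.Queue()  # Python -> Extension
        self._context: Optional[Dict[str, Any]] = None                # Extension -> Python
        self._lock = threading.Lock()

    def enqueue_action(self, action: Dict[str, Any]) -> None:
        """Called by the agent; queue.Queue handles its own locking."""
        self._actions.put(action)

    def next_action(self) -> Optional[Dict[str, Any]]:
        """Called when the extension polls; returns None if nothing is pending."""
        try:
            return self._actions.get_nowait()
        except queue.Empty:
            return None

    def set_context(self, context: Dict[str, Any]) -> None:
        """Store a copy of the newest page context sent by the extension."""
        with self._lock:
            self._context = dict(context)

    def get_context(self) -> Optional[Dict[str, Any]]:
        """Return a copy so callers cannot mutate the stored context."""
        with self._lock:
            return dict(self._context) if self._context is not None else None
```

Returning `None` from `next_action` instead of blocking suits a polling client: each poll gets an immediate answer.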
context_capture.py: Context retrieval
- Gets latest page context from extension
- Formats it for GPT-4o-mini
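One plausible way to format the context for the model is a short text summary with a capped element list, as sketched below. The function name `format_context` and the context keys (`url`, `title`, `elements`, `tag`, `text`, `x`, `y`) are assumptions; the cap of 80 comes from the limit this README mentions for interactive elements.

```python
from typing import Any, Dict, List

MAX_ELEMENTS = 80  # this README states the AI sees up to 80 interactive elements


def format_context(context: Dict[str, Any], max_elements: int = MAX_ELEMENTS) -> str:
    """Render page context as plain text, one line per interactive element."""
    lines = [
        f"URL: {context.get('url', '')}",
        f"Title: {context.get('title', '')}",
        "Interactive elements:",
    ]
    elements: List[Dict[str, Any]] = context.get("elements", [])
    for index, element in enumerate(elements[:max_elements]):
        tag = element.get("tag", "?")
        text = element.get("text", "")
        x, y = element.get("x"), element.get("y")
        lines.append(f"[{index}] <{tag}> {text!r} at ({x}, {y})")
    return "\n".join(lines)
```

Capping the list keeps each observation within a predictable token budget, at the cost of hiding elements further down long pages.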
web_ui.py: HTTP server
- Serves web interface
- API endpoints for extension communication
The Python backend exposes these endpoints:
- GET / - Web UI homepage
- POST /api/run - Submit a command
- POST /api/extension/context - Extension sends page context
- GET /api/extension/context - Get latest context
- GET /api/extension/next_action - Extension polls for actions
- POST /api/extension/action_result - Extension reports results
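To make the polling protocol concrete, here is a minimal standard-library sketch of three of these endpoints plus a small client helper that plays the role of the extension. It is not the project's `web_ui.py`: the handler, the helper `fetch_json`, and the JSON response shapes (`{"ok": true}`, `{"action": ...}`) are assumptions.

```python
import json
import queue
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Optional

pending_actions: "queue.Queue[dict]" = queue.Queue()
latest_context: dict = {}
context_lock = threading.Lock()


class BridgeHandler(BaseHTTPRequestHandler):
    def _send_json(self, payload, status: int = 200) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self) -> None:
        if self.path == "/api/extension/context":
            length = int(self.headers.get("Content-Length", 0))
            data = json.loads(self.rfile.read(length) or b"{}")
            with context_lock:  # replace the stored context atomically
                latest_context.clear()
                latest_context.update(data)
            self._send_json({"ok": True})
        else:
            self._send_json({"error": "not found"}, status=404)

    def do_GET(self) -> None:
        if self.path == "/api/extension/context":
            with context_lock:
                snapshot = dict(latest_context)
            self._send_json(snapshot)
        elif self.path == "/api/extension/next_action":
            try:
                action = pending_actions.get_nowait()
            except queue.Empty:
                action = None  # nothing queued; the poller tries again later
            self._send_json({"action": action})
        else:
            self._send_json({"error": "not found"}, status=404)

    def log_message(self, format, *args) -> None:
        pass  # silence per-request logging


def make_server(port: int = 8765) -> ThreadingHTTPServer:
    """Bind to 127.0.0.1 only, so the API is not reachable from other machines."""
    return ThreadingHTTPServer(("127.0.0.1", port), BridgeHandler)


# Local traffic should not be routed through a system proxy.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_json(url: str, payload: Optional[dict] = None) -> dict:
    """GET when payload is None, otherwise POST the payload as JSON."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with _opener.open(request, timeout=5) as response:
        return json.loads(response.read())
```

To try it, start `make_server().serve_forever()` in a background thread and call `fetch_json("http://127.0.0.1:8765/api/extension/next_action")`.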
The AI can use these tools:
- get_screen_state() - See what's on the page
- click(x, y) - Click at coordinates
- type_text(text) - Type text
- scroll(direction, amount) - Scroll page
- press_key(key) - Press keyboard key
- navigate_url(url) - Navigate to URL
- open_tab(url) - Open new tab
- task_complete() - Mark task as done
- ask_user(question) - Ask for user input
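Tools like these are typically described to the model as JSON schemas in the OpenAI function-calling format. The sketch below builds such a list for the nine tools named above. The helper `make_tool` and the parameter types (integers for coordinates and scroll amount, strings elsewhere) are assumptions; the project's actual definitions may differ.

```python
from typing import Any, Dict, List


def make_tool(name: str, description: str, properties: Dict[str, Any]) -> Dict[str, Any]:
    """Build one tool entry; every listed parameter is marked required."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(properties),
            },
        },
    }


INTEGER = {"type": "integer"}
STRING = {"type": "string"}

TOOLS: List[Dict[str, Any]] = [
    make_tool("get_screen_state", "See what's on the page", {}),
    make_tool("click", "Click at coordinates", {"x": INTEGER, "y": INTEGER}),
    make_tool("type_text", "Type text", {"text": STRING}),
    make_tool("scroll", "Scroll page", {"direction": STRING, "amount": INTEGER}),
    make_tool("press_key", "Press keyboard key", {"key": STRING}),
    make_tool("navigate_url", "Navigate to URL", {"url": STRING}),
    make_tool("open_tab", "Open new tab", {"url": STRING}),
    make_tool("task_complete", "Mark task as done", {}),
    make_tool("ask_user", "Ask for user input", {"question": STRING}),
]
```

A list in this shape is what the OpenAI Chat Completions API accepts as its `tools` argument.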
- Make sure the Chrome Extension is loaded
- Check that you're on a web page (not chrome:// URLs)
- Open the extension popup to check connection status
- Make sure python3 agent.py is running
- Check that port 8765 is not in use
- Try restarting the Python backend
- Check browser console for errors (F12 → Console)
- Make sure the extension has permissions
- Try reloading the extension
- Go to chrome://extensions
- Click the refresh icon on the "Agent Bridge" extension
- Reload the web page
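For the port check mentioned above, a quick way to see whether anything is listening on port 8765 is a plain TCP connection attempt. The helper name `is_port_open` is hypothetical; it uses only the standard `socket` module.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 8765, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0
```

If this returns `True` before `agent.py` has been started, another process is probably holding the port. If it returns `False` while `agent.py` is supposed to be running, the backend is not reachable.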
- The extension works on any website (except chrome:// pages)
- Context is sent automatically every 4 seconds
- Actions are polled every 1-1.2 seconds
- You can have multiple tabs open - the extension works on all of them
- The AI sees up to 80 interactive elements per page
This version completely removes:
- Playwright
- Firefox automation
- PyAutoGUI for browser control
- Any Python-based browser automation
Everything is now handled by the Chrome Extension!
- Model: GPT-4o-mini
- Cost: ~$0.15 per 1M input tokens, ~$0.60 per 1M output tokens
- Speed: ~2-3 seconds per action
- Token usage: ~500-1000 tokens per observation
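The pricing figures above translate into a per-task estimate with simple arithmetic. The sketch below is a back-of-the-envelope calculator; the function name `estimate_cost` is hypothetical, and the example token counts in the note that follows are illustrative assumptions, not measurements.

```python
INPUT_PRICE_PER_MILLION = 0.15   # USD per 1M input tokens, from the figures above
OUTPUT_PRICE_PER_MILLION = 0.60  # USD per 1M output tokens, from the figures above


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token counts."""
    return (
        input_tokens * INPUT_PRICE_PER_MILLION
        + output_tokens * OUTPUT_PRICE_PER_MILLION
    ) / 1_000_000
```

For example, a 10-step task at 1,000 input tokens per observation and an assumed 100 output tokens per action uses 10,000 input and 1,000 output tokens, so `estimate_cost(10_000, 1_000)` comes to about $0.0021. Real usage will be somewhat higher if earlier conversation turns are resent with each request.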
- Backend only listens on 127.0.0.1 (localhost)
- Extension only communicates with localhost:8765
- No external connections except OpenAI API
Use freely for personal or commercial projects!