CodCodingCode/mediapipe

Browser Agent - Chrome Extension Architecture

A voice/text-controlled browser automation agent powered by GPT-4o-mini. 100% Chrome Extension based - no Playwright, no Selenium!

🎯 Architecture

┌────────────────────────────────────────────────────────────┐
│                     Chrome Browser                         │
│  ┌────────────────────────────────────────────────────┐    │
│  │  Chrome Extension (extension/)                     │    │
│  │  • Collects page context (DOM, text, elements)     │    │
│  │  • Executes actions (click, type, scroll, etc.)    │    │
│  │  • Opens/manages tabs                              │    │
│  └────────────────────────────────────────────────────┘    │
└────────────────────────────────────────────────────────────┘
                          ↕ HTTP (localhost:8765)
┌────────────────────────────────────────────────────────────┐
│              Python Backend (agent.py)                     │
│  • AI Brain (GPT-4o-mini)                                  │
│  • Plans actions based on page context                     │
│  • Web UI for sending commands                             │
└────────────────────────────────────────────────────────────┘

🚀 Setup Instructions

1. Install Python Dependencies

pip3 install -r requirements.txt

2. Set OpenAI API Key

export OPENAI_API_KEY='your-key-here'

3. Load the Chrome Extension

  1. Open Chrome and navigate to: chrome://extensions
  2. Enable Developer mode (toggle in top-right corner)
  3. Click "Load unpacked"
  4. Select the extension folder from this project
  5. You should see the "Agent Bridge" extension in the list of loaded extensions

4. Start the Python Backend

python3 agent.py

You should see:

🎭 BROWSER AGENT (Chrome Extension)
======================================================================

📋 Setup Instructions:
  1. Open Chrome and go to: chrome://extensions
  2. Enable 'Developer mode' (top right)
  3. Click 'Load unpacked' and select the 'extension' folder
  4. The extension will handle all browser interactions!

📋 How to use:
  • Web UI: http://127.0.0.1:8765
  • Terminal: Type commands here
  • Shortcut: Type '1' for a GUI dialog box

✅ Agent ready! Type 'exit' or press Ctrl+C to quit.
======================================================================

πŸ“ How to Use

Option 1: Web UI

  1. Open http://127.0.0.1:8765 in your browser
  2. Type a command (e.g., "Go to Reddit and click the first post")
  3. Click "Run"

Option 2: Terminal

  1. Type commands directly in the terminal where agent.py is running
  2. Example: Go to YouTube and search for cats

Option 3: GUI Dialog

  1. Type 1 in the terminal
  2. A dialog box will appear
  3. Enter your command and click OK

🎬 Example Commands

💬 Go to reddit.com
💬 Open a new tab and go to youtube.com
💬 Click on the first post
💬 Search for python tutorials
💬 Scroll down and click the login button
💬 Type "hello world" in the search box

🔧 How It Works

The Observe-Act Cycle

  1. OBSERVE: Extension sends page context to Python backend

    • URL, title, text content
    • All interactive elements (buttons, links, inputs)
    • Element positions (x, y coordinates)
  2. DECIDE: GPT-4o-mini analyzes the context

    • Reads what's on the page
    • Plans the next action
  3. ACT: Backend sends action to extension

    • Extension executes the action (click, type, etc.)
    • Action happens in the actual Chrome browser
  4. LOOP: Repeat until task is complete
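
The cycle above can be sketched as a small loop. This is a minimal sketch, not the project's actual code: `run_task`, `get_context`, `decide`, and `send_action` are hypothetical names, and the stand-in `fake_decide` function replaces the GPT-4o-mini call so the example runs offline.

```python
from typing import Callable


def run_task(command: str,
             get_context: Callable[[], dict],
             decide: Callable[[str, dict], dict],
             send_action: Callable[[dict], None],
             max_steps: int = 10) -> int:
    """Run the observe-decide-act loop and return the number of actions sent."""
    steps = 0
    for _ in range(max_steps):
        context = get_context()              # OBSERVE: latest page context
        action = decide(command, context)    # DECIDE: the model picks one action
        if action["name"] == "task_complete":
            break                            # LOOP ends when the task is done
        send_action(action)                  # ACT: hand the action to the extension
        steps += 1
    return steps


# Stand-ins for the extension and the model (assumptions for illustration only).
sent = []


def fake_context() -> dict:
    return {"url": "https://example.com", "title": "Example", "elements": []}


def fake_decide(command: str, context: dict) -> dict:
    # Navigate once, then report completion.
    if not sent:
        return {"name": "navigate_url", "args": {"url": "https://reddit.com"}}
    return {"name": "task_complete", "args": {}}


steps = run_task("Go to reddit.com", fake_context, fake_decide, sent.append)
print(steps, sent)
```

The `max_steps` cap is an assumption added here so that a model that never calls `task_complete` cannot loop forever.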

Extension Components

manifest.json: Extension configuration

  • Permissions: tabs, scripting, activeTab
  • Runs on all URLs

content.js: Runs on every web page

  • Collects page context every 4 seconds
  • Executes actions (click, type, scroll, navigate)
  • Polls backend for actions every 1.2 seconds

service_worker.js: Background script

  • Relays context to Python backend
  • Handles tab creation
  • Polls for actions every 1 second

popup.html/js: Extension popup

  • Shows connection status
  • Quick link to open backend UI

Python Backend Components

agent.py: Main entry point

  • Starts web UI server
  • Manages command queue
  • Routes commands to agent_runner

agent_runner.py: AI brain

  • Sends context + command to GPT-4o-mini
  • Receives tool calls (actions)
  • Executes actions via extension_bridge

actions.py: Action functions

  • click(x, y): Click at coordinates
  • type_text(text): Type text
  • scroll(direction, amount): Scroll page
  • press_key(key): Press keyboard key
  • navigate_url(url): Navigate current tab
  • open_tab(url): Open new tab

extension_bridge.py: Communication layer

  • Queue for actions (Python → Extension)
  • Storage for page context (Extension → Python)
  • Thread-safe
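
One way such a bridge could be built is with `queue.Queue` for outgoing actions and a lock around the latest context. This is a sketch under assumptions: the class and method names below are hypothetical and are not taken from extension_bridge.py.

```python
import queue
import threading
from typing import Optional


class ExtensionBridge:
    """Hypothetical thread-safe bridge between the backend and the extension."""

    def __init__(self) -> None:
        self._actions: queue.Queue = queue.Queue()   # Python -> Extension
        self._lock = threading.Lock()
        self._context: Optional[dict] = None         # Extension -> Python

    def enqueue_action(self, action: dict) -> None:
        self._actions.put(action)

    def next_action(self) -> Optional[dict]:
        """Return the oldest pending action, or None if nothing is queued."""
        try:
            return self._actions.get_nowait()
        except queue.Empty:
            return None

    def set_context(self, context: dict) -> None:
        with self._lock:
            self._context = context

    def get_context(self) -> Optional[dict]:
        with self._lock:
            return self._context


bridge = ExtensionBridge()
bridge.enqueue_action({"name": "click", "args": {"x": 120, "y": 340}})
bridge.set_context({"url": "https://example.com"})
```

`queue.Queue` already does its own locking, so only the context slot needs an explicit lock. Returning `None` from `next_action` lets a polling endpoint answer immediately when nothing is pending.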

context_capture.py: Context retrieval

  • Gets latest page context from extension
  • Formats it for GPT-4o-mini
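
As an illustration of the formatting step, the sketch below turns a context dictionary into a compact text block for the model. The field names (`url`, `title`, `elements`, `tag`, `text`, `x`, `y`) and the function name `format_context` are assumptions. The 80-element cap mirrors the limit mentioned under Tips, although where the real code applies that cap is not stated.

```python
def format_context(context: dict, max_elements: int = 80) -> str:
    """Render page context as plain text, keeping at most max_elements elements."""
    lines = [
        f"URL: {context.get('url', '')}",
        f"Title: {context.get('title', '')}",
        "Elements:",
    ]
    for index, element in enumerate(context.get("elements", [])[:max_elements]):
        lines.append(
            f"  [{index}] {element['tag']} {element['text']!r} "
            f"at ({element['x']}, {element['y']})"
        )
    return "\n".join(lines)


example = {
    "url": "https://example.com",
    "title": "Example",
    "elements": [{"tag": "button", "text": "Log in", "x": 120, "y": 340}],
}
formatted = format_context(example)
print(formatted)
```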

web_ui.py: HTTP server

  • Serves web interface
  • API endpoints for extension communication

🌐 API Endpoints

The Python backend exposes these endpoints:

  • GET / - Web UI homepage
  • POST /api/run - Submit a command
  • POST /api/extension/context - Extension sends page context
  • GET /api/extension/context - Get latest context
  • GET /api/extension/next_action - Extension polls for actions
  • POST /api/extension/action_result - Extension reports results
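
To make the context round trip concrete, here is a minimal standard-library sketch of the two `/api/extension/context` endpoints. The JSON request and response shapes are assumptions, and this is not the code in web_ui.py. The demo binds to a free port chosen by the OS rather than 8765 so that it does not collide with a running backend.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

STATE = {"context": None}          # latest page context from the extension
STATE_LOCK = threading.Lock()


class Handler(BaseHTTPRequestHandler):
    def _send_json(self, payload: dict, status: int = 200) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self) -> None:
        if self.path != "/api/extension/context":
            self._send_json({"error": "not found"}, status=404)
            return
        length = int(self.headers.get("Content-Length", 0))
        data = json.loads(self.rfile.read(length) or b"{}")
        with STATE_LOCK:
            STATE["context"] = data
        self._send_json({"ok": True})

    def do_GET(self) -> None:
        if self.path != "/api/extension/context":
            self._send_json({"error": "not found"}, status=404)
            return
        with STATE_LOCK:
            context = STATE["context"]
        self._send_json({"context": context})

    def log_message(self, format, *args) -> None:
        pass  # keep the demo output quiet


def request_json(port: int, method: str, path: str, payload=None) -> dict:
    """Send one request to the local server and decode the JSON reply."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if body else {}
    conn.request(method, path, body=body, headers=headers)
    data = json.loads(conn.getresponse().read())
    conn.close()
    return data


# Bind to localhost only; port 0 asks the OS for any free port.
server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

post_result = request_json(port, "POST", "/api/extension/context",
                           {"url": "https://example.com", "title": "Example"})
stored = request_json(port, "GET", "/api/extension/context")
server.shutdown()
server.server_close()
print(post_result, stored)
```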

🎯 Available Actions

The AI can use these tools:

  1. get_screen_state() - See what's on the page
  2. click(x, y) - Click at coordinates
  3. type_text(text) - Type text
  4. scroll(direction, amount) - Scroll page
  5. press_key(key) - Press keyboard key
  6. navigate_url(url) - Navigate to URL
  7. open_tab(url) - Open new tab
  8. task_complete() - Mark task as done
  9. ask_user(question) - Ask for user input
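
For a function-calling model, a tool list like this is usually described as JSON schemas. The sketch below builds definitions in the shape used by the OpenAI Chat Completions `tools` parameter. The parameter types and the `tool` helper are assumptions; the tool names and descriptions come from the list above.

```python
from typing import Optional


def tool(name: str, description: str, properties: Optional[dict] = None) -> dict:
    """Build one function-tool definition; every listed parameter is required."""
    properties = properties or {}
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(properties),
            },
        },
    }


NUMBER = {"type": "number"}   # assumed type for coordinates and amounts
STRING = {"type": "string"}   # assumed type for text, keys, and URLs

TOOLS = [
    tool("get_screen_state", "See what's on the page"),
    tool("click", "Click at coordinates", {"x": NUMBER, "y": NUMBER}),
    tool("type_text", "Type text", {"text": STRING}),
    tool("scroll", "Scroll page", {"direction": STRING, "amount": NUMBER}),
    tool("press_key", "Press keyboard key", {"key": STRING}),
    tool("navigate_url", "Navigate to URL", {"url": STRING}),
    tool("open_tab", "Open new tab", {"url": STRING}),
    tool("task_complete", "Mark task as done"),
    tool("ask_user", "Ask for user input", {"question": STRING}),
]

tool_names = [entry["function"]["name"] for entry in TOOLS]
print(tool_names)
```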

πŸ” Troubleshooting

"No context received"

  • Make sure the Chrome Extension is loaded
  • Check that you're on a regular web page (not a chrome:// URL)
  • Open the extension popup to check connection status

"Backend offline"

  • Make sure python3 agent.py is running
  • Check that port 8765 is not in use
  • Try restarting the Python backend
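
A quick way to check whether something is already listening on a port is to attempt a TCP connection to it. The helper below is a small sketch (the name `port_in_use` is hypothetical). The demo uses a temporary listener on a free port so it runs without the backend; for the real check, call `port_in_use(8765)`.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0


# Demo: open a temporary listener, check it, close it, and check again.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))      # port 0 asks the OS for any free port
listener.listen(1)
demo_port = listener.getsockname()[1]

busy_while_listening = port_in_use(demo_port)
listener.close()
busy_after_close = port_in_use(demo_port)
print(busy_while_listening, busy_after_close)
```

If `port_in_use(8765)` returns True while agent.py is not running, another process is holding the port.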

"Actions not executing"

  • Check browser console for errors (F12 → Console)
  • Make sure the extension has permissions
  • Try reloading the extension

Extension not working after Chrome update

  1. Go to chrome://extensions
  2. Click the refresh icon on the "Agent Bridge" extension
  3. Reload the web page

💡 Tips

  • The extension works on any website (except chrome:// pages)
  • Context is sent automatically every 4 seconds
  • Actions are polled every 1-1.2 seconds
  • You can have multiple tabs open - the extension works on all of them
  • The AI sees up to 80 interactive elements per page

🚫 What Was Removed

This version completely removes:

  • ❌ Playwright
  • ❌ Firefox automation
  • ❌ PyAutoGUI for browser control
  • ❌ Any Python-based browser automation

Everything is now handled by the Chrome Extension!

📊 Cost & Performance

  • Model: GPT-4o-mini
  • Cost: ~$0.15 per 1M input tokens, ~$0.60 per 1M output tokens
  • Speed: ~2-3 seconds per action
  • Token usage: ~500-1000 tokens per observation
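
As a rough worked example of these figures, the sketch below estimates the cost of one step and of a 20-step task. The 100 output tokens per step and the 20-step task length are assumptions for illustration; the prices and the 1,000-token observation size come from the list above.

```python
INPUT_PRICE_PER_M = 0.15    # USD per 1M input tokens (from the list above)
OUTPUT_PRICE_PER_M = 0.60   # USD per 1M output tokens (from the list above)


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the approximate cost in USD for one model call."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# One observation of ~1000 tokens, with an assumed ~100 output tokens.
per_step = estimate_cost(1000, 100)
per_task = 20 * per_step            # assumed 20-step task
print(f"{per_step:.5f} USD per step, {per_task:.4f} USD per 20-step task")
```

Under these assumptions a step costs about $0.0002, so even long tasks stay well under one cent.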

πŸ” Security

  • Backend only listens on 127.0.0.1 (localhost)
  • Extension only communicates with localhost:8765
  • No external connections except to the OpenAI API

📄 License

Use freely for personal or commercial projects!
