RoundupForge by Grimfaste

Amazon Roundup Scout — a Grimfaste tool that automates Amazon product research for roundup articles at scale.

Paste up to 10,000 keywords, and RoundupForge searches Amazon across 21 country marketplaces, collects product ASINs, and delivers organized results — ready for article creation with tools like ZimmWriter.


Technology Statement

RoundupForge is a server-side web application built with Next.js 16 and TypeScript. It runs as a self-hosted service on macOS or Linux, using SQLite for local development and PostgreSQL for multi-worker production deployments.

The application is designed for headless batch processing — users submit keyword lists, and the system scrapes Amazon search results and product pages in the background using a pool of scraping API providers. All processing runs server-side with progress tracked in the database, so users can close the browser and return later.

RoundupForge is part of the Grimfaste platform and serves as the data collection layer for DojoClaw, the AI-powered article generation and publishing system.


Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| Framework | Next.js 16 (App Router) | Server-side rendering, API routes, React UI |
| Language | TypeScript (strict mode) | Type safety across frontend and backend |
| Styling | Tailwind CSS 4 | Utility-first CSS framework |
| ORM | Prisma 7 | Database abstraction with migrations |
| Database | SQLite (dev) / PostgreSQL (prod) | Data persistence, job state, settings |
| HTML Parsing | Cheerio | Server-side DOM extraction from scraped pages |
| Scraping | Multi-provider pool | ScrapeOwl, ScraperAPI, ScrapingBee, ZenRows, DataForSEO |
| Concurrency | p-limit | Keyword-level parallel processing (1–50 concurrent) |
| Job Queue | Custom sequential queue | globalThis singleton with DB-backed state |
| LLM Integration | OpenAI-compatible API | Relevance filter for product scoring |
| Encryption | AES-256-GCM | Secrets encrypted at rest in the database |
| Testing | Vitest | Unit tests for parsers, scrapers, services |
| Google Sheets | googleapis (npm) | Keyword import and result export |
| Real-time Updates | Server-Sent Events (SSE) | Live progress streaming with polling fallback |
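
The "globalThis singleton" noted for the job queue is a common Next.js pattern: stashing the instance on globalThis keeps a single object alive across dev-server hot reloads. A hypothetical sketch (names are illustrative, not RoundupForge's actual identifiers):

```typescript
// Queue state kept on globalThis so dev-mode hot reloads reuse one
// instance instead of creating a fresh queue per reload.
type QueueState = { runningProjectId: string | null; queuedIds: string[] };

const globalForQueue = globalThis as unknown as { __rfQueue?: QueueState };

function getQueue(): QueueState {
  if (!globalForQueue.__rfQueue) {
    globalForQueue.__rfQueue = { runningProjectId: null, queuedIds: [] };
  }
  return globalForQueue.__rfQueue;
}
```

The same trick is typically applied to the Prisma client singleton to avoid exhausting database connections in development.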

Architecture

┌─────────────────────────────────────────────────────────────┐
│  Browser (React UI)                                         │
│  ├── Home — keyword input, Google Sheets, batch config      │
│  ├── Dashboard — analytics, credit usage, failure patterns  │
│  ├── Projects — progress, products, export, relevance       │
│  ├── Profiles — scrape profiles per Amazon marketplace      │
│  └── Settings — scrapers, LLM, Google Sheets, auth          │
└──────────────────────┬──────────────────────────────────────┘
                       │ HTTP / SSE
┌──────────────────────▼──────────────────────────────────────┐
│  Next.js API Routes                                         │
│  ├── /api/projects      — CRUD, run, stop, export           │
│  ├── /api/queue         — queue status, recovery            │
│  ├── /api/bulk-queue    — multi-tab Google Sheets queue     │
│  ├── /api/dashboard     — aggregated analytics              │
│  ├── /api/profiles      — scrape profile management         │
│  ├── /api/settings      — scrapers, LLM, Google, general    │
│  ├── /api/sheets        — keyword load, result sync         │
│  ├── /api/system/status — health check, diagnostics         │
│  └── /api/auth/session  — optional admin authentication     │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────┐
│  Backend Services                                           │
│  ├── Queue Processor    — sequential project execution      │
│  ├── Runner             — keyword processing with retries   │
│  ├── Scraper Pool       — primary + fallback adapters       │
│  ├── Plugin Registry    — extensible scraper registration   │
│  ├── Product Cache      — ASIN dedup across projects        │
│  ├── Lifecycle Hooks    — preScrape, postScrape, onFailure  │
│  ├── Relevance Filter   — LLM-based product scoring         │
│  ├── Settings Service   — encrypted DB-backed config        │
│  ├── Job Run Service    — durable job tracking + heartbeat  │
│  └── Failure Summary    — error categorization (10 types)   │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────┐
│  Data Layer                                                 │
│  ├── Project, KeywordResult, Product  — core scrape data    │
│  ├── JobRun                           — durable job state   │
│  ├── AppSetting                       — encrypted settings  │
│  ├── ExportSnapshot                   — export versioning   │
│  ├── ScrapeProfile                    — per-domain config   │
│  └── LlmProvider                      — LLM routing         │
│                                                             │
│  SQLite (local dev) ──or──▶ PostgreSQL (multi-worker prod)  │
└─────────────────────────────────────────────────────────────┘

Data Flow

Keywords (paste / Google Sheets / bulk queue)
        │
        ▼
  Queue Project (status: queued → running)
        │
        ▼
  Build Amazon search URLs (domain from scrape profile)
        │
        ▼
  Fetch search results via scraper pool
  (ScrapeOwl → ScraperAPI → ScrapingBee → ZenRows → DataForSEO)
        │
        ▼
  Extract product links + ASINs (dedupe, randomize count)
        │
        ▼
  Check ASIN cache ──▶ cached? reuse ──▶ not cached? scrape
        │
        ▼
  Fast mode: done ─── Full mode: visit each product page
        │                          extract title, bullets,
        │                          description, specs, reviews
        ▼
  Store in database, track credits, update progress via SSE
        │
        ▼
  Auto-retry failed keywords (exponential backoff)
        │
        ▼
  Queue: advance to next project
        │
        ▼
  Export: Roundup packs / CSV / JSON / Google Sheets
        │
        ▼
  Optional: LLM relevance filter (per-keyword scoring)
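
The auto-retry step above combines exponential backoff, jitter, and retry-after support. A sketch of that delay calculation, with illustrative numbers (1 s base, 60 s cap, full jitter) rather than RoundupForge's actual defaults:

```typescript
// Compute the wait before retrying a failed keyword. A provider's
// Retry-After hint takes precedence; otherwise back off exponentially
// with full jitter to avoid thundering-herd retries.
function backoffDelayMs(attempt: number, retryAfterMs?: number): number {
  if (retryAfterMs !== undefined) return retryAfterMs; // provider hint wins
  const capMs = 60_000;
  const expMs = Math.min(capMs, 1_000 * 2 ** attempt); // 1s, 2s, 4s, ...
  return Math.floor(Math.random() * expMs); // full jitter
}
```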

Multi-Worker Architecture (Production)

┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  M3 Ultra    │  │  M2 Ultra    │  │  Mac Mini 1  │  │  Mac Mini 2  │
│  Worker      │  │  Master Node │  │  Worker      │  │  Worker      │
│              │  │              │  │              │  │              │
│ RoundupForge │  │ RoundupForge │  │ RoundupForge │  │ RoundupForge │
│ DojoClaw     │  │ DojoClaw     │  │ DojoClaw     │  │ DojoClaw     │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │                 │
       └─────────────────┼─────────────────┼─────────────────┘
                         │
                ┌────────▼────────┐
                │  PostgreSQL     │
                │  (M2 Ultra)     │
                │                 │
                │  Shared DB:     │
                │  - RF tables    │
                │  - DC tables    │
                │  - ASIN cache   │
                └─────────────────┘

Features

Scraping & Data Collection

  • Batch keyword processing — paste or load up to 10,000 keywords at once
  • 21 Amazon marketplaces — US, UK, DE, FR, IT, ES, CA, AU, JP, IN, BR, MX, NL, SE, PL, BE, SG, SA, AE, TR, EG
  • Two scraping modes — Fast (1 API call per keyword) or Full (1 + N per keyword)
  • Randomized product counts — set a range (e.g., 7–15) for natural-looking roundups
  • Multi-scraper pool — ScrapeOwl primary with automatic failover to 4 other providers
  • Scraper plugin registry — extensible adapter system for new scraping backends
  • Exponential backoff — retries with jitter and retry-after header support
  • Typed error classification — RateLimitError, BlockedError, TimeoutError, AuthError, ParseError
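
The typed error classes listed above might look roughly like the following; the actual fields in RoundupForge may differ:

```typescript
// Base class carries whether the failure is worth retrying; subclasses
// set their own name so error logs and summaries stay readable.
class ScrapeError extends Error {
  constructor(message: string, readonly retryable: boolean) {
    super(message);
    this.name = new.target.name; // "RateLimitError", "ParseError", ...
  }
}

class RateLimitError extends ScrapeError {
  constructor(readonly retryAfterMs?: number) {
    super("rate limited by provider", true); // transient, retry later
  }
}

class ParseError extends ScrapeError {
  constructor(message: string) {
    super(message, false); // bad markup will not improve on retry
  }
}
```

Classifying errors this way lets the retry loop make one decision (`retryable`) while the failure summary still groups by concrete type.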

Queue & Job Management

  • Sequential project queue — projects run one at a time, auto-advance on completion
  • Bulk queue from Google Sheets — queue all sheet tabs as separate projects in one click
  • Global max concurrency — configurable cap (1-50) applied across all projects
  • Retry/Resume bypasses queue — runs immediately in parallel with queued projects
  • Durable job runs — JobRun model with heartbeat tracking survives server restarts
  • Graceful shutdown — SIGTERM/SIGINT handlers for clean process termination
  • Queue recovery — orphaned "running" projects auto-recovered on restart
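
The queue-recovery idea can be sketched as a pure function: any project still marked "running" whose JobRun heartbeat has gone stale is treated as orphaned and re-queued on restart. Field names and the staleness threshold below are illustrative, not the real schema:

```typescript
// A JobRun-like record: the heartbeat timestamp is refreshed while the
// worker is alive, so a stale heartbeat implies a crashed or killed run.
interface JobRunLike {
  projectId: string;
  status: "queued" | "running" | "done" | "failed";
  lastHeartbeatMs: number;
}

function findOrphanedProjects(
  runs: JobRunLike[],
  nowMs: number,
  staleAfterMs = 60_000
): string[] {
  return runs
    .filter((r) => r.status === "running" && nowMs - r.lastHeartbeatMs > staleAfterMs)
    .map((r) => r.projectId);
}
```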

LLM & Filtering

  • Relevance filter — LLM-based product scoring per keyword (manual trigger)
  • Conservative prompt — only drops wrong-category items (accessories, toys, unrelated)
  • Per-keyword progress — live filtering progress with error resilience
  • Multiple LLM providers — OpenAI, Claude, OpenRouter, Ollama, LM Studio

Export & Integration

  • Roundup export — ZimmWriter-compatible format, auto-split into packs of 100
  • "Save All in One File" — combine all packs into a single download
  • CSV and JSON export — full structured data with exclusion filtering
  • Google Sheets sync — load keywords from and push results back to Sheets
  • Export versioning — snapshot records with content hash for audit trail
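
The "packs of 100" split in the roundup export reduces to a simple chunking step, sketched here (function name is illustrative):

```typescript
// Split export rows into fixed-size packs; the last pack holds the
// remainder. Each pack becomes one downloadable file.
function splitIntoPacks<T>(items: T[], size = 100): T[][] {
  const packs: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    packs.push(items.slice(i, i + size));
  }
  return packs;
}
```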

Monitoring & Analytics

  • Dashboard — projects, keywords, products, credits, success rate, daily stats
  • Failure patterns — 10-category error summarization on dashboard
  • Credit tracking — ScrapeOwl credits tracked per project
  • Browser notifications — desktop alerts on project completion/failure
  • SSE progress — real-time streaming with polling fallback
  • System status API — database, queue, integrations health check
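
Behind the SSE progress stream is the standard event-stream wire format: each event is framed as `event:`/`data:` lines terminated by a blank line. A minimal framing helper (the event name and payload shape are illustrative):

```typescript
// Serialize one server-sent event. Browsers' EventSource splits the
// stream on the blank line and dispatches by the "event:" field.
function sseFrame(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}
```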

Security & Settings

  • Optional admin auth — APP_ADMIN_TOKEN for deployment protection
  • Encrypted secrets — AES-256-GCM for API keys stored in database
  • Persisted settings — DB-backed config with environment variable fallback
  • Masked API keys — secrets never exposed to browser
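
The AES-256-GCM encryption at rest can be sketched with Node's built-in crypto module. This is a minimal version assuming a 32-byte master key; RoundupForge's actual key derivation and payload layout may differ:

```typescript
import { randomBytes, createCipheriv, createDecipheriv } from "node:crypto";

// Encrypt a secret for storage: a fresh 96-bit nonce per message, with
// nonce, GCM auth tag, and ciphertext packed into one base64 string.
function encrypt(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // 96-bit nonce recommended for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return [iv, cipher.getAuthTag(), ct].map((b) => b.toString("base64")).join(".");
}

function decrypt(payload: string, key: Buffer): string {
  const [iv, tag, ct] = payload.split(".").map((s) => Buffer.from(s, "base64"));
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // tampered ciphertext fails at final()
  return Buffer.concat([decipher.update(ct), decipher.final()]).toString("utf8");
}
```

GCM's auth tag means a flipped bit anywhere in the stored payload makes decryption throw instead of returning garbage.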

Scrape Profiles

  • Amazon marketplace dropdown — quick profile creation for any supported country
  • Profile validation — domain, selector, and affiliate code validation before use
  • Test-scrape preview — test a profile against a single URL before saving
  • CSS selector config — title, image, feature bullets, description, reviews

Quick Start

Prerequisites

  • Node.js and npm
  • A ScrapeOwl API key (other scraper providers are optional)

Installation

git clone https://github.com/MeyerThorsten/grimfaste-roundupforge.git
cd grimfaste-roundupforge
npm install
cp .env.example .env       # add your SCRAPEOWL_API_KEY
npx prisma db push
npx prisma generate
npm run dev

Open http://localhost:3000.

First Run

  1. Go to Settings and add your ScrapeOwl API key
  2. Paste keywords or load them from Google Sheets
  3. Select Fast mode (default) for ASIN collection
  4. Click Run Batch — project is queued and starts automatically
  5. Watch progress with live updates
  6. Click Export Roundup for ZimmWriter-ready output

Scraping Modes

Fast Mode (default)

  • 1 API call per keyword — fetches only the Amazon search results page
  • Extracts: ASIN, title, image URL, product URL, affiliate URL
  • Speed: ~3,600 keywords/hour at 25 concurrent requests
  • Cost: 1 ScrapeOwl credit per keyword

Full Mode

  • 1 + N API calls per keyword — fetches search page + each product page
  • Extracts: everything from Fast mode, plus feature bullets, description, specs, reviews
  • Speed: depends on products per keyword and concurrency
  • Cost: 1 + N ScrapeOwl credits per keyword
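
Both modes fan keywords out under the p-limit concurrency cap from the tech stack. The dependency-free sketch below shows the underlying idea: never more than `max` keyword tasks in flight at once (RoundupForge itself uses the p-limit package, not this code):

```typescript
// Minimal promise limiter: callers beyond `max` wait in a FIFO queue
// and are released one at a time as in-flight tasks finish.
function createLimiter(max: number) {
  let active = 0;
  const waiting: Array<() => void> = [];
  return async function run<T>(fn: () => Promise<T>): Promise<T> {
    if (active >= max) await new Promise<void>((res) => waiting.push(res));
    active++;
    try {
      return await fn();
    } finally {
      active--;
      waiting.shift()?.(); // wake the next waiter, if any
    }
  };
}
```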

Environment Variables

| Variable | Required | Description |
|---|---|---|
| DATABASE_URL | Yes | file:./dev.db (SQLite) or postgresql://... |
| SCRAPEOWL_API_KEY | Yes | ScrapeOwl API key |
| APP_ADMIN_TOKEN | No | Admin auth token for deployment protection |
| APP_SETTINGS_MASTER_KEY | No | Encryption key for secrets (auto-generated if not set) |
| GOOGLE_SERVICE_ACCOUNT_JSON | No | Google Cloud service account JSON |
| GOOGLE_SHEET_ID | No | Default Google Sheet spreadsheet ID |

All scraper keys, LLM providers, and settings are configurable from Settings in the app.


API Reference

Projects

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/projects | List all projects |
| POST | /api/projects | Create and auto-queue project |
| GET | /api/projects/[id] | Get project with keywords + products |
| PATCH | /api/projects/[id] | Update project name |
| POST | /api/projects/[id]/run | Retry/resume (bypasses queue) |
| POST | /api/projects/[id]/stop | Stop running or dequeue |
| GET | /api/projects/[id]/export?format=json\|csv\|roundup | Export results |
| GET | /api/projects/[id]/progress | SSE progress stream |
| POST | /api/projects/[id]/relevance | Run relevance filter |

Queue & Bulk

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/queue | Queue status (running + queued projects) |
| POST | /api/bulk-queue | Queue all Google Sheets tabs as projects |

System

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/dashboard | Aggregated analytics and stats |
| GET | /api/system/status | Health check and diagnostics |
| GET | /api/scrapers | Active scraper summary + plan limits |

Settings

| Method | Endpoint | Description |
|---|---|---|
| GET/POST | /api/settings/general | Retry count, max concurrency |
| GET/POST | /api/settings/scrapers | Scraper keys, plans, toggles |
| GET/POST | /api/settings/google | Google Sheets configuration |
| GET/POST/DELETE | /api/settings/llm | LLM provider management |

Project Structure

grimfaste-roundupforge/
├── prisma/
│   └── schema.prisma                 # Database schema (8 models)
├── src/
│   ├── app/
│   │   ├── layout.tsx                # Root layout with nav
│   │   ├── page.tsx                  # Home — keywords, Sheets, batch config
│   │   ├── dashboard/page.tsx        # Analytics dashboard
│   │   ├── profiles/page.tsx         # Scrape profile editor
│   │   ├── projects/[id]/page.tsx    # Results — progress, products, export
│   │   ├── settings/page.tsx         # All settings management
│   │   ├── components/               # Shared UI components
│   │   └── api/                      # REST API routes
│   ├── lib/
│   │   ├── prisma.ts                 # Prisma client singleton
│   │   ├── services/                 # Project, product, settings, job-run services
│   │   ├── scraping/                 # Adapter interface, 5 providers, pool, registry
│   │   ├── sheets/                   # Google Sheets service
│   │   ├── jobs/                     # Queue processor, runner, cancellation
│   │   ├── hooks/                    # Scrape lifecycle hooks
│   │   ├── observability/            # Failure categorization
│   │   ├── auth/                     # Admin authentication
│   │   ├── settings/                 # Crypto, scraper config
│   │   ├── export/                   # CSV + Roundup serializers
│   │   ├── llm/                      # LLM provider abstraction
│   │   └── parsing/                  # Keyword input parser
│   └── types/index.ts                # TypeScript interfaces
├── docs/
│   ├── design/DESIGN.md              # Architecture document
│   └── roadmap/                      # Phase planning documents
├── middleware.ts                     # Auth middleware
├── vitest.config.ts                  # Test configuration
└── package.json

Development

npm run dev              # Start dev server (port 3000)
npm run test             # Run vitest tests
npx tsc --noEmit         # Type check
npm run build            # Production build
npx prisma db push       # Push schema changes
npx prisma generate      # Regenerate Prisma client
npx prisma studio        # Browse database

About

RoundupForge is built and maintained by Grimfaste — the analytics command center for publishers managing hundreds of WordPress sites.

RoundupForge serves as the data collection layer in the Grimfaste platform, feeding product data to DojoClaw for AI-powered article generation and multi-site publishing.

Learn more at grimfaste.com


License

RoundupForge is licensed under the GNU Affero General Public License v3.0.
