Crenwuste/Website-Technologies-Scraper

Website Technologies Scraper

Features implemented in the code

1. Reading domain input

  • The code reads domains from api.snappy.parquet using pandas.
  • Each input domain is analyzed, and detected technologies are extracted.

2. Collecting website data

  • For each domain, it builds the URL (https:// + domain, falling back to http:// + domain when the HTTPS request fails).
  • It sends a GET request to the homepage.
  • It collects:
    • page HTML
    • HTTP headers
    • cookies returned by the server
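The collection step above could be sketched as follows (a simplified version, not the repository's exact code; timeout and return shape are illustrative choices):

```python
import requests

def fetch_homepage(domain: str, timeout: float = 10.0):
    """Fetch the homepage over https://, falling back to http:// when the
    HTTPS request fails. Returns (html, headers, cookies) or None."""
    for scheme in ("https://", "http://"):
        try:
            resp = requests.get(scheme + domain, timeout=timeout)
        except requests.RequestException:
            continue
        return resp.text, dict(resp.headers), resp.cookies.get_dict()
    return None
```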

3. Extracting signals

  • It parses HTML with BeautifulSoup.
  • It extracts:
    • all URLs from <script src="..."> tags
    • inline JavaScript from <script> tags
    • cookies in key=value format
    • headers as text (Key: Value)
    • lowercase-normalized headers for robust matching
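The signal-extraction step might look like this (a sketch; the dictionary keys are illustrative names, not necessarily those used in the repository):

```python
from bs4 import BeautifulSoup

def extract_signals(html: str, headers: dict, cookies: dict) -> dict:
    """Parse the fetched page and collect the signals listed above."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        # all external script URLs
        "script_src": [t["src"] for t in soup.find_all("script", src=True)],
        # inline JavaScript from <script> tags without a src attribute
        "inline_js": "\n".join(t.get_text() for t in soup.find_all("script")
                               if not t.has_attr("src")),
        # cookies in key=value format
        "cookies": "; ".join(f"{k}={v}" for k, v in cookies.items()),
        # headers as "Key: Value" text
        "headers_text": "\n".join(f"{k}: {v}" for k, v in headers.items()),
        # lowercase-normalized headers for robust matching
        "headers_lower": {k.lower(): v.lower() for k, v in headers.items()},
    }
```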

4. Rule-based detection (RULES)

  • The code defines a set of technology rules (e.g., Shopify, WordPress, GTM, etc.).
  • Each rule has multiple signals (script_src, inline_js, cookie, header, html).
  • For each signal, it searches for the pattern in extracted data.
  • When a match is found, it records the matched pattern as evidence in the result.
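A condensed sketch of the rule-matching loop. The two RULES entries shown here are illustrative, not the repository's actual rule set:

```python
# Illustrative rules; each maps a signal type to a substring pattern.
RULES = {
    "WordPress": {"script_src": "wp-content", "cookie": "wordpress_", "html": "wp-json"},
    "Shopify": {"script_src": "cdn.shopify.com", "cookie": "_shopify_"},
}

def match_rules(signals: dict) -> list[dict]:
    """Search each rule's patterns in the extracted signals and collect
    evidence for every match."""
    results = []
    for tech, rule in RULES.items():
        evidence = []
        for signal_type, pattern in rule.items():
            haystack = signals.get(signal_type, "")
            if isinstance(haystack, list):  # script_src is a list of URLs
                haystack = " ".join(haystack)
            if pattern in haystack:
                evidence.append(f"{signal_type}: {pattern}")
        if evidence:
            results.append({"technology": tech, "evidence": evidence})
    return results
```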

5. Confidence scoring for each detection

  • Based on the number of matched signals:
    • 1 signal -> low
    • 2 signals -> medium
    • 3+ signals -> high
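The tier mapping above is a direct translation into code:

```python
def confidence(n_signals: int) -> str:
    """Map the matched-signal count to a confidence tier."""
    if n_signals >= 3:
        return "high"
    if n_signals == 2:
        return "medium"
    return "low"
```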

6. HEAD checks on external scripts

  • For external scripts (absolute URLs), it sends a HEAD request.
  • It extracts headers:
    • Server
    • X-Powered-By
  • It adds extra detections with evidence such as:
    • HEAD <url> -> Server: ...
    • HEAD <url> -> X-Powered-By: ...
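A sketch of the HEAD-check step; treating the header value itself as the detected technology name is an assumption about how the extra detections are labeled:

```python
import requests

def head_detections(script_urls: list[str]) -> list[dict]:
    """Probe external scripts with HEAD requests and record the
    Server / X-Powered-By headers as extra evidence."""
    detections = []
    for url in script_urls:
        if not url.startswith(("http://", "https://")):
            continue  # only absolute/external URLs are probed
        try:
            resp = requests.head(url, timeout=5, allow_redirects=True)
        except requests.RequestException:
            continue
        for header in ("Server", "X-Powered-By"):
            value = resp.headers.get(header)
            if value:
                detections.append({
                    "technology": value,  # assumption: header value as name
                    "evidence": f"HEAD {url} -> {header}: {value}",
                })
    return detections
```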

7. Detection from Content-Security-Policy (CSP)

  • It reads the Content-Security-Policy header.
  • It splits the content into tokens and searches for known domains in CSP_DOMAIN_MAP.
  • If a mapped domain is found, it adds the corresponding technology with high confidence.
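The CSP step could be sketched like this; the CSP_DOMAIN_MAP entries are illustrative, and the simple exact-token lookup is a simplification of whatever matching the repository does:

```python
# Illustrative mapping of CSP source domains to technologies.
CSP_DOMAIN_MAP = {
    "www.googletagmanager.com": "Google Tag Manager",
    "cdn.shopify.com": "Shopify",
}

def csp_detections(csp_header: str) -> list[dict]:
    """Tokenize the Content-Security-Policy header and look up known
    domains, adding high-confidence detections for matches."""
    results = []
    for token in csp_header.replace(";", " ").split():
        token = token.removeprefix("https://").removeprefix("http://")
        tech = CSP_DOMAIN_MAP.get(token)
        if tech:
            results.append({"technology": tech, "confidence": "high",
                            "evidence": f"CSP: {token}"})
    return results
```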

8. Combining results

  • For each domain, it combines the 3 detection sources:
    • rule-based results
    • HEAD script results
    • CSP results
  • It returns a final object with:
    • domain
    • technologies (each containing technology, confidence, evidence)
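The combination step is essentially a concatenation of the three lists into one final object per domain (as the "Main issues" section notes, this is what allows duplicates across sources):

```python
def analyze_domain(domain: str, rule_hits: list, head_hits: list,
                   csp_hits: list) -> dict:
    """Combine the three detection sources into the final result object."""
    return {"domain": domain,
            "technologies": rule_hits + head_hits + csp_hits}
```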

9. Exporting results

  • All detections are saved in results.json.
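The export step amounts to a single JSON dump (indentation and encoding options here are illustrative choices):

```python
import json

def export_results(results: list[dict], path: str = "results.json") -> None:
    """Write all per-domain detections to results.json."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
```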

Reasoning and implementation decisions

  • Evaluated approaches - I evaluated three options: rule-based detection, headless browser rendering, and external technology detection services.
  • What I chose and why - I chose rule-based detection + HTTP signals (HEAD and CSP) because it is faster than the alternatives and provides clear evidence for each detection.
  • How I built the rules - I extracted most technology rules from manual inspection of website HTML, and I built CSP domain mappings by exploring linked resources and headers in Postman.
  • What I did not choose for now - I did not use a headless browser for all domains because the current approach prioritizes simpler execution and more predictable processing for large batches.
  • Accepted trade-offs - I gained speed and operational simplicity, but accuracy may decrease on sites that load technologies strictly dynamically.

Main issues in the current implementation

  • False positives from simple pattern matching - The current implementation uses substring matching, which can generate false positives in some cases. I would reduce this with stricter rules and multi-signal validation.
  • No deduplication between detection sources - The same technology can be detected from rules, CSP, and HEAD, resulting in duplicate entries. I would add a deduplication step and aggregate evidence into a single object per technology.
  • Overly simple confidence score - Current confidence is based only on the number of signals. I would switch to weighted scoring, where strong signals (specific cookie/header) weigh more than generic signals.
  • Limited performance on large batches - Sequential execution becomes slow at scale. I would optimize with controlled concurrency (async/thread pool), retry logic, and caching for repeated requests.
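The deduplication and weighted-scoring fixes proposed above could be sketched as follows (an illustrative design, not existing code; the weight values are assumptions):

```python
CONF_ORDER = {"low": 0, "medium": 1, "high": 2}

def dedupe(technologies: list[dict]) -> list[dict]:
    """Merge duplicate detections across sources: aggregate evidence per
    technology and keep the strongest confidence."""
    merged: dict[str, dict] = {}
    for det in technologies:
        entry = merged.setdefault(det["technology"],
                                  {"technology": det["technology"],
                                   "confidence": "low", "evidence": []})
        entry["evidence"].extend(det.get("evidence", []))
        if CONF_ORDER[det.get("confidence", "low")] > CONF_ORDER[entry["confidence"]]:
            entry["confidence"] = det["confidence"]
    return list(merged.values())

# Illustrative weights: specific cookies/headers count for more than
# generic HTML or inline-JS matches.
WEIGHTS = {"cookie": 3, "header": 3, "script_src": 2, "inline_js": 1, "html": 1}

def weighted_confidence(matched_signal_types: list[str]) -> str:
    """Weighted replacement for the signal-count scoring."""
    score = sum(WEIGHTS.get(s, 1) for s in matched_signal_types)
    if score >= 5:
        return "high"
    if score >= 3:
        return "medium"
    return "low"
```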

How I will discover new technologies in the future

  • Analysis of unknown domains in results - I will inspect domains with low confidence or no detections and extract new patterns from their scripts, cookies, and headers.
  • Automatic rule suggestion - I will build a step that proposes candidate rules from repeated signals (for example: cookie prefixes, JS endpoints, global variables).
  • Using rule sets from open-source projects - I will adapt relevant rules from open-source projects (for example, Wappalyzer) into my own rule set.
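The automatic rule-suggestion idea could start as simply as counting signals that recur across many undetected domains; this is a hypothetical sketch for cookie names only:

```python
from collections import Counter

def suggest_cookie_rules(cookie_names_by_domain: dict[str, list[str]],
                         min_domains: int = 3) -> list[str]:
    """Propose candidate cookie-based rules: cookie names that appear on
    at least min_domains otherwise-undetected domains."""
    counts = Counter()
    for names in cookie_names_by_domain.values():
        counts.update(set(names))  # count each name once per domain
    return [name for name, n in counts.items() if n >= min_domains]
```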
