Crenwuste/Website-Technologies-Scraper

Website Technologies Scraper

Features implemented in the code

1. Reading domain input

  • The code reads domains from api.snappy.parquet using pandas.
  • Each input domain is analyzed, and detected technologies are extracted.

2. Collecting website data

  • For each domain, it builds the URL (https:// + domain, falling back to http:// + domain when the HTTPS request fails).
  • It sends a GET request to the homepage.
  • It collects:
    • page HTML
    • HTTP headers
    • cookies returned by the server
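The collection step above could be sketched as follows (a simplified version, not the repository's exact code; timeout and return shape are illustrative choices):

```python
import requests

def fetch_homepage(domain: str, timeout: float = 10.0):
    """Fetch the homepage over https://, falling back to http:// when the
    HTTPS request fails. Returns (html, headers, cookies) or None."""
    for scheme in ("https://", "http://"):
        try:
            resp = requests.get(scheme + domain, timeout=timeout)
        except requests.RequestException:
            continue
        return resp.text, dict(resp.headers), resp.cookies.get_dict()
    return None
```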

3. Extracting signals

  • It parses HTML with BeautifulSoup.
  • It extracts:
    • all URLs from <script src="..."> tags
    • inline JavaScript from <script> tags
    • cookies in key=value format
    • headers as text (Key: Value)
    • lowercase-normalized headers for robust matching
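The signal-extraction step might look like this (a sketch; the dictionary keys are illustrative names, not necessarily those used in the repository):

```python
from bs4 import BeautifulSoup

def extract_signals(html: str, headers: dict, cookies: dict) -> dict:
    """Parse the fetched page and collect the signals listed above."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        # all external script URLs
        "script_src": [t["src"] for t in soup.find_all("script", src=True)],
        # inline JavaScript from <script> tags without a src attribute
        "inline_js": "\n".join(t.get_text() for t in soup.find_all("script")
                               if not t.has_attr("src")),
        # cookies in key=value format
        "cookies": "; ".join(f"{k}={v}" for k, v in cookies.items()),
        # headers as "Key: Value" text
        "headers_text": "\n".join(f"{k}: {v}" for k, v in headers.items()),
        # lowercase-normalized headers for robust matching
        "headers_lower": {k.lower(): v.lower() for k, v in headers.items()},
    }
```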

4. Rule-based detection (RULES)

  • The code defines a set of technology rules (e.g., Shopify, WordPress, GTM, etc.).
  • Each rule has multiple signals (script_src, inline_js, cookie, header, html).
  • For each signal, it searches for the pattern in extracted data.
  • When a match is found, it records the matched pattern as evidence in the result.
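A condensed sketch of the rule-matching loop. The two RULES entries shown here are illustrative, not the repository's actual rule set:

```python
# Illustrative rules; each maps a signal type to a substring pattern.
RULES = {
    "WordPress": {"script_src": "wp-content", "cookie": "wordpress_", "html": "wp-json"},
    "Shopify": {"script_src": "cdn.shopify.com", "cookie": "_shopify_"},
}

def match_rules(signals: dict) -> list[dict]:
    """Search each rule's patterns in the extracted signals and collect
    evidence for every match."""
    results = []
    for tech, rule in RULES.items():
        evidence = []
        for signal_type, pattern in rule.items():
            haystack = signals.get(signal_type, "")
            if isinstance(haystack, list):  # script_src is a list of URLs
                haystack = " ".join(haystack)
            if pattern in haystack:
                evidence.append(f"{signal_type}: {pattern}")
        if evidence:
            results.append({"technology": tech, "evidence": evidence})
    return results
```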

5. Confidence scoring for each detection

  • Based on the number of matched signals:
    • 1 signal -> low
    • 2 signals -> medium
    • 3+ signals -> high
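The tier mapping above is a direct translation into code:

```python
def confidence(n_signals: int) -> str:
    """Map the matched-signal count to a confidence tier."""
    if n_signals >= 3:
        return "high"
    if n_signals == 2:
        return "medium"
    return "low"
```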

6. HEAD checks on external scripts

  • For external scripts (absolute URLs), it sends a HEAD request.
  • It extracts headers:
    • Server
    • X-Powered-By
  • It adds extra detections with evidence such as:
    • HEAD <url> -> Server: ...
    • HEAD <url> -> X-Powered-By: ...
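A sketch of the HEAD-check step; treating the header value itself as the detected technology name is an assumption about how the extra detections are labeled:

```python
import requests

def head_detections(script_urls: list[str]) -> list[dict]:
    """Probe external scripts with HEAD requests and record the
    Server / X-Powered-By headers as extra evidence."""
    detections = []
    for url in script_urls:
        if not url.startswith(("http://", "https://")):
            continue  # only absolute/external URLs are probed
        try:
            resp = requests.head(url, timeout=5, allow_redirects=True)
        except requests.RequestException:
            continue
        for header in ("Server", "X-Powered-By"):
            value = resp.headers.get(header)
            if value:
                detections.append({
                    "technology": value,  # assumption: header value as name
                    "evidence": f"HEAD {url} -> {header}: {value}",
                })
    return detections
```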

7. Detection from Content-Security-Policy (CSP)

  • It reads the Content-Security-Policy header.
  • It splits the content into tokens and searches for known domains in CSP_DOMAIN_MAP.
  • If a mapped domain is found, it adds the corresponding technology with high confidence.
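The CSP step could be sketched like this; the CSP_DOMAIN_MAP entries are illustrative, and the simple exact-token lookup is a simplification of whatever matching the repository does:

```python
# Illustrative mapping of CSP source domains to technologies.
CSP_DOMAIN_MAP = {
    "www.googletagmanager.com": "Google Tag Manager",
    "cdn.shopify.com": "Shopify",
}

def csp_detections(csp_header: str) -> list[dict]:
    """Tokenize the Content-Security-Policy header and look up known
    domains, adding high-confidence detections for matches."""
    results = []
    for token in csp_header.replace(";", " ").split():
        token = token.removeprefix("https://").removeprefix("http://")
        tech = CSP_DOMAIN_MAP.get(token)
        if tech:
            results.append({"technology": tech, "confidence": "high",
                            "evidence": f"CSP: {token}"})
    return results
```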

8. Combining results

  • For each domain, it combines the 3 detection sources:
    • rule-based results
    • HEAD script results
    • CSP results
  • It returns a final object with:
    • domain
    • technologies (each containing technology, confidence, evidence)
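The combination step is essentially a concatenation of the three lists into one final object per domain (as the "Main issues" section notes, this is what allows duplicates across sources):

```python
def analyze_domain(domain: str, rule_hits: list, head_hits: list,
                   csp_hits: list) -> dict:
    """Combine the three detection sources into the final result object."""
    return {"domain": domain,
            "technologies": rule_hits + head_hits + csp_hits}
```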

9. Exporting results

  • All detections are saved in results.json.
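The export step amounts to a single JSON dump (indentation and encoding options here are illustrative choices):

```python
import json

def export_results(results: list[dict], path: str = "results.json") -> None:
    """Write all per-domain detections to results.json."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
```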

Reasoning and implementation decisions

  • Evaluated approaches - I evaluated three options: rule-based detection, headless browser rendering, and external technology detection services.
  • What I chose and why - I chose rule-based detection + HTTP signals (HEAD and CSP) because it is faster than the alternatives and provides clear evidence for each detection.
  • How I built the rules - I extracted most technology rules from manual inspection of website HTML, and I built CSP domain mappings by exploring linked resources and headers in Postman.
  • What I did not choose for now - I did not use a headless browser for all domains because the current approach prioritizes simpler execution and more predictable processing for large batches.
  • Accepted trade-offs - I gained speed and operational simplicity, but accuracy may decrease on sites that load technologies strictly dynamically.

Main issues in the current implementation

  • False positives from simple pattern matching - The current implementation uses substring matching, which can generate false positives in some cases. I would reduce this with stricter rules and multi-signal validation.
  • No deduplication between detection sources - The same technology can be detected from rules, CSP, and HEAD, resulting in duplicate entries. I would add a deduplication step and aggregate evidence into a single object per technology.
  • Overly simple confidence score - Current confidence is based only on the number of signals. I would switch to weighted scoring, where strong signals (specific cookie/header) weigh more than generic signals.
  • Limited performance on large batches - Sequential execution becomes slow at scale. I would optimize with controlled concurrency (async/thread pool), retry logic, and caching for repeated requests.
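The deduplication and weighted-scoring fixes proposed above could be sketched as follows (an illustrative design, not existing code; the weight values are assumptions):

```python
CONF_ORDER = {"low": 0, "medium": 1, "high": 2}

def dedupe(technologies: list[dict]) -> list[dict]:
    """Merge duplicate detections across sources: aggregate evidence per
    technology and keep the strongest confidence."""
    merged: dict[str, dict] = {}
    for det in technologies:
        entry = merged.setdefault(det["technology"],
                                  {"technology": det["technology"],
                                   "confidence": "low", "evidence": []})
        entry["evidence"].extend(det.get("evidence", []))
        if CONF_ORDER[det.get("confidence", "low")] > CONF_ORDER[entry["confidence"]]:
            entry["confidence"] = det["confidence"]
    return list(merged.values())

# Illustrative weights: specific cookies/headers count for more than
# generic HTML or inline-JS matches.
WEIGHTS = {"cookie": 3, "header": 3, "script_src": 2, "inline_js": 1, "html": 1}

def weighted_confidence(matched_signal_types: list[str]) -> str:
    """Weighted replacement for the signal-count scoring."""
    score = sum(WEIGHTS.get(s, 1) for s in matched_signal_types)
    if score >= 5:
        return "high"
    if score >= 3:
        return "medium"
    return "low"
```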

How I will discover new technologies in the future

  • Analysis of unknown domains in results - I will inspect domains with low confidence or no detections and extract new patterns from their scripts, cookies, and headers.
  • Automatic rule suggestion - I will build a step that proposes candidate rules from repeated signals (for example: cookie prefixes, JS endpoints, global variables).
  • Using rule sets from open-source projects - I will adapt relevant rules from open-source projects (for example, Wappalyzer) into my own rule set.
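The automatic rule-suggestion idea could start as simply as counting signals that recur across many undetected domains; this is a hypothetical sketch for cookie names only:

```python
from collections import Counter

def suggest_cookie_rules(cookie_names_by_domain: dict[str, list[str]],
                         min_domains: int = 3) -> list[str]:
    """Propose candidate cookie-based rules: cookie names that appear on
    at least min_domains otherwise-undetected domains."""
    counts = Counter()
    for names in cookie_names_by_domain.values():
        counts.update(set(names))  # count each name once per domain
    return [name for name, n in counts.items() if n >= min_domains]
```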
