
pyccwebgraph: Python Interface to CommonCrawl Webgraph

Available on PyPI. Requires Python 3.8+. MIT license.

Discover related domains using link topology from CommonCrawl's webgraph.

Installation

Prerequisites: Python 3.8 or later.

pip install pyccwebgraph

First use downloads graph data:

from pyccwebgraph import CCWebgraph, get_available_versions

# List available versions
versions = get_available_versions()
print(versions[:3])  # ['cc-main-2024-nov-dec-jan', 'cc-main-2024-feb-apr-may', ...]

webgraph = CCWebgraph.setup(
    webgraph_dir="/data/my-webgraph",
    version="cc-main-2024-feb-apr-may"
)

# Find domains that link TO seeds (backlinks)
results = webgraph.discover_backlinks(
    seeds=["cnn.com", "bbc.com", "nytimes.com"],
    min_connections=3  # Must link to all seeds
)

print(f"Found {len(results['nodes'])} domains")
print(f"Top result: {results['nodes'][0]}")
# {'domain': 'news-aggregator.com', 'connections': 15, 'percentage': 50.0}
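The result dictionary can be post-processed with plain Python. A small sketch ranking discovered domains by connection count, using illustrative data that follows the `results['nodes']` shape shown in the example output above:

```python
# Sample discovery result matching the documented shape (illustrative data only).
results = {
    "nodes": [
        {"domain": "news-aggregator.com", "connections": 15, "percentage": 50.0},
        {"domain": "media-watch.org", "connections": 4, "percentage": 13.3},
        {"domain": "press-links.net", "connections": 9, "percentage": 30.0},
    ]
}

# Rank by number of connections to the seed set, strongest first.
ranked = sorted(results["nodes"], key=lambda n: n["connections"], reverse=True)
top_domains = [n["domain"] for n in ranked]
print(top_domains)
# ['news-aggregator.com', 'press-links.net', 'media-watch.org']
```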

Working with NetworkX

# Get results as NetworkX graph
G = webgraph.discover_backlinks(
    seeds=["cnn.com", "bbc.com"],
    min_connections=2,
    format='networkx'  # Returns nx.DiGraph
)

# Run standard NetworkX algorithms
import networkx as nx

# Centrality analysis
pr = nx.pagerank(G)
bc = nx.betweenness_centrality(G)

# Community detection
from cdlib import algorithms
communities = algorithms.louvain(G)

# Visualization
from pyvis.network import Network
net = Network(notebook=True)
net.from_nx(G)
net.show("network.html")
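Because the `networkx` format returns a standard `nx.DiGraph`, everyday NetworkX operations apply directly. A minimal sketch on a toy graph (standing in for a discovery result) that prunes low-degree nodes before visualization:

```python
import networkx as nx

# Toy directed graph standing in for a discovery result.
G = nx.DiGraph()
G.add_edges_from([
    ("blog-a.com", "cnn.com"),
    ("blog-a.com", "bbc.com"),
    ("blog-b.com", "cnn.com"),
    ("hub.com", "blog-a.com"),
])

# Keep only nodes with total degree >= 2 to reduce visual clutter.
keep = [n for n in G if G.degree(n) >= 2]
H = G.subgraph(keep).copy()
print(sorted(H.nodes()))
# ['blog-a.com', 'cnn.com']
```

The pruned graph `H` can then be handed to pyvis or the centrality calls above in place of `G`.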

Performance: Large Graphs with NetworKit

For large discovered subgraphs (>100K nodes), use NetworKit instead of NetworkX:

# Discover large subgraph
G_nk, name_map = webgraph.discover_backlinks(
    seeds=seed_list,
    min_connections=2,
    format='networkit'  # Returns NetworKit graph
)
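NetworKit algorithms operate on integer node IDs, so results need to be translated back to domains via `name_map`. Assuming `name_map` maps NetworKit node IDs to domain strings (the exact shape is an assumption here, not confirmed by the API), the translation is a plain dictionary lookup:

```python
# Hypothetical name_map shape: NetworKit node ID -> domain string.
name_map = {0: "cnn.com", 1: "bbc.com", 2: "news-aggregator.com"}

# Node IDs as ranked by some NetworKit analysis (illustrative values).
top_ids = [2, 0]
top_domains = [name_map[i] for i in top_ids]
print(top_domains)
# ['news-aggregator.com', 'cnn.com']
```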

Domain/ID mapping and graph queries

# Check if domain exists in graph
vid = webgraph.domain_to_id("example.com")
if vid is not None:
    print(f"Found at vertex ID {vid}")

# Get all domains this domain links to
outlinks = webgraph.get_successors("cnn.com")
print(f"CNN links to {len(outlinks)} domains")

# Get all domains linking to this domain  
backlinks = webgraph.get_predecessors("cnn.com")
print(f"{len(backlinks)} domains link to CNN")

# Validate seeds before discovery
found, missing = webgraph.validate_seeds(["cnn.com", "fake-site.xyz"])
print(f"Found: {found}")
print(f"Missing: {missing}")
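A typical workflow validates seeds first and passes only the found ones to discovery. The gating that `validate_seeds` performs amounts to set filtering, sketched here in plain Python (`known` is a stand-in for the graph's domain index):

```python
def filter_seeds(requested, known_domains):
    """Split requested seeds into (found, missing), preserving input order."""
    found = [d for d in requested if d in known_domains]
    missing = [d for d in requested if d not in known_domains]
    return found, missing

known = {"cnn.com", "bbc.com", "nytimes.com"}
found, missing = filter_seeds(["cnn.com", "fake-site.xyz"], known)
print(found)    # ['cnn.com']
print(missing)  # ['fake-site.xyz']
```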
