Discover related domains using link topology analysis from the CommonCrawl web graph.
Based on:
- Carragher, P., Williams, E. M., & Carley, K. M. (2024). Detection and Discovery of Misinformation Sources using Attributed Webgraphs. ICWSM 2024.
- Carragher, P., Williams, E. M., Spezzano, F., & Carley, K. M. (2025). Misinformation Resilient Search Rankings with Attributed Webgraphs. ACM TIST.
Dataset:
- CommonCrawl webgraph (Nov-Dec 2024, Jan 2025)
- 93.9M domains, 1.6B edges
- Domain-level aggregation
What this notebook does: Given a list of seed domains, it discovers other domains that are connected to them via backlinks or outlinks in the CommonCrawl web graph.
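The idea can be illustrated on a toy, in-memory edge list. This is a minimal sketch, not the notebook's implementation (which queries the CommonCrawl graph through the cc-webgraph tools): the function name `discover_related`, its parameters, and the ranking by number of linked seeds are all assumptions made for illustration.

```python
from collections import Counter


def discover_related(edges, seeds, direction="backlinks", min_seed_links=1):
    """Rank non-seed domains by how many distinct seeds they are linked with.

    edges: iterable of (source_domain, target_domain) pairs (hypothetical toy input).
    direction: "backlinks" returns domains that link TO a seed;
               "outlinks" returns domains that a seed links to.
    """
    if direction not in ("backlinks", "outlinks"):
        raise ValueError("direction must be 'backlinks' or 'outlinks'")
    seeds = set(seeds)
    seen = set()        # (seed, other) pairs already counted
    counts = Counter()  # other domain -> number of distinct seeds
    for src, dst in edges:
        seed, other = (dst, src) if direction == "backlinks" else (src, dst)
        if seed in seeds and other not in seeds and (seed, other) not in seen:
            seen.add((seed, other))
            counts[other] += 1
    # Most-connected domains first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(domain, n) for domain, n in ranked if n >= min_seed_links]


toy_edges = [
    ("a.example", "seed1.example"),
    ("a.example", "seed2.example"),
    ("b.example", "seed1.example"),
    ("seed1.example", "c.example"),
    ("seed2.example", "c.example"),
    ("seed1.example", "seed2.example"),  # seed-to-seed links are ignored
]
toy_seeds = ["seed1.example", "seed2.example"]
backlink_hits = discover_related(toy_edges, toy_seeds, "backlinks")
outlink_hits = discover_related(toy_edges, toy_seeds, "outlinks")
```

Raising `min_seed_links` keeps only domains linked with several seeds, which is one plausible way to filter out domains that are connected to a single seed by chance.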
⏱️ Time: ~15 minutes (first time only)
- Click Runtime → Change runtime type
- Set Runtime shape to High-RAM
- Set Hardware accelerator to GPU (optional, for faster processing)
- Click Save
Why? The CommonCrawl webgraph requires >40GB RAM to process.
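A memory check along these lines can confirm that the runtime meets the requirement before anything is downloaded. This is a sketch under assumptions: it parses Linux `/proc/meminfo`-style text, the helper names are hypothetical, and the 40 threshold is simply the figure quoted above, compared against GiB.

```python
REQUIRED_GB = 40  # threshold quoted in the text above


def mem_total_gb(meminfo_text):
    """Return the MemTotal value from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB (KiB)
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


def has_enough_ram(meminfo_text, required_gb=REQUIRED_GB):
    """True if total memory exceeds the required amount."""
    return mem_total_gb(meminfo_text) > required_gb
```

On a Linux runtime such as Colab, the text can be read with `open("/proc/meminfo").read()` and passed to `has_enough_ram`.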
Recommended! Mounting Google Drive caches the ~23GB webgraph so you don't re-download it every session.
Run the "Mount Google Drive" cell below and follow the prompts.
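One way the caching logic could work is sketched below: use a folder on Drive when it is mounted, otherwise fall back to local storage, and download only the files that are not already cached. The paths, the folder name `webgraph_cache`, and both helper names are assumptions for illustration; `/content/drive/MyDrive` is where Drive typically appears in Colab when mounted at `/content/drive`.

```python
from pathlib import Path


def choose_cache_dir(drive_root="/content/drive/MyDrive",
                     local_root="/content",
                     name="webgraph_cache"):
    """Prefer a persistent Drive folder when mounted, else use local storage."""
    drive = Path(drive_root)
    base = drive if drive.is_dir() else Path(local_root)
    return base / name


def missing_files(cache_dir, expected_names):
    """Return the expected file names that are not yet present in cache_dir."""
    cache_dir = Path(cache_dir)
    return [n for n in expected_names if not (cache_dir / n).is_file()]
```

If `missing_files` returns an empty list, the download step can be skipped for that session.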
- Check Available RAM - Verifies you have enough memory
- Mount Google Drive - (Optional) For persistent caching
- Install Java 17 - Required for WebGraph (~2 min)
- Download cc-webgraph Tools - Clones and builds tools (~2 min)
- Download CommonCrawl Webgraph - Downloads pre-built graph files (~10 min for 23GB)
- Verify Installation - Confirms everything is ready
Note: Graph files are pre-built by CommonCrawl - no build step needed!
Scroll down to Section 3: Discovery Interface and interact with the form!
If you use this notebook in your research, please cite:
@article{carragher2024detection,
title={Detection and Discovery of Misinformation Sources using Attributed Webgraphs},
author={Carragher, Peter and Williams, Evan M and Carley, Kathleen M},
journal={Proceedings of the International AAAI Conference on Web and Social Media},
volume={18},
pages={218--229},
year={2024},
url={https://arxiv.org/abs/2401.02379}
}
@article{carragher2025misinformation,
title={Misinformation Resilient Search Rankings with Attributed Webgraphs},
author={Carragher, Peter and Williams, Evan M and Spezzano, Francesca and Carley, Kathleen M},
journal={ACM Transactions on Intelligent Systems and Technology},
year={2025}
}
Links:
- Paper (ICWSM 2024): https://arxiv.org/abs/2401.02379
- GitHub Repository: https://github.com/CASOS-IDeaS-CMU/Detection-and-Discovery-of-Misinformation-Sources
- CommonCrawl Webgraphs: https://commoncrawl.org/web-graphs
- cc-webgraph Tools: https://github.com/commoncrawl/cc-webgraph
Contact:
- Peter Carragher: [email protected]
- CASOS Lab: http://casos.cs.cmu.edu/
License: MIT
Acknowledgments: This notebook uses the CommonCrawl web graph dataset and the WebGraph framework developed by Sebastiano Vigna and Paolo Boldi.