Posts

2026-02-24: The 10th Computational Archival Science (CAS) Workshop Trip Report

IEEE BigData 2025 - The 10th Computational Archival Science (CAS) Workshop Home Page. The 10th Computational Archival Science (CAS) Workshop is part of the 2025 IEEE Big Data Conference (IEEE BigData 2025). It was an online workshop held on Tuesday, December 9, 2025. It included close to 70 participants, with a keynote from Dr. Phang Lai Tee, National Archives of Singapore and Chair of the UNESCO Memory of the World Preservation Sub-Committee on Artificial Intelligence, and 18 papers from 27 institutions in 8 countries spanning 5 continents: Canada, USA (North America) / Brazil (South America) / Scotland, Spain, Switzerland (Europe) / South Africa (Africa) / Korea (Asia). Michael Kurtz, who passed away on December 17th, 2022, launched the CAS initiative in 2016 with Victoria Lemieux, Mark Hedges, Maria Esteva, William Underwood, Mark Conrad, and Richard Marciano. The 10th CAS workshop was organized by the CAS Workshop Chairs: Mark Hedges from King’s College Lond...

2026-02-13: Paper Summary: "High Fidelity Web Archiving of News Sites and New Media with Browsertrix"

Figure 1: The Browsertrix Tool Suite. For the Saving Ads project and the Game Walkthroughs and Web Archiving project, we have used Browsertrix Crawler, ArchiveWeb.page, and ReplayWeb.page, which are Webrecorder tools that were discussed in Walsh et al.’s paper “High Fidelity Web Archiving of News Sites and New Media with Browsertrix.” In this paper, Walsh et al. describe the tools that are integrated with Browsertrix and the features that differentiate their tools from other web archive crawlers and replay systems. Browsertrix is a free and open-source web archiving platform that can be run locally, self-hosted, or used through Webrecorder's hosted service. Browsertrix uses Browsertrix Crawler for archiving web pages, ArchiveWeb.page to patch archived web pages, and ReplayWeb.page to replay archived web pages (Figure 1). Walsh, Tessa, Henry Wilkinson, and Ilya Kreymer. “High Fidelity Web Archiving of News Sites and New Media with Browsertrix.” In Proceedings of the 2024 Int...

2026-02-12: How to archive web pages in bulk using the Internet Archive Google Sheets service

Results of archiving web pages using the Wayback Machine Google Sheets service. One of the greatest services the Internet Archive (IA) offers is the Save Page Now (SPN) service. It is easy to use, and it was completely overhauled in 2019 with added features. However, the UI is limited to archiving a single URL at a time. To overcome this limitation, the IA launched the Google Sheets service, which allows you to submit a Google Sheet with the first column full of URLs (up to 5,000 URLs), and it will archive them all in the Wayback Machine. I needed to archive 1.5 million URLs that I collected from four Arabic and English news websites published between 1999 and 2022. This is obviously not doable one URL at a time, but the IA's Google Sheets service makes it possible because I can archive 5,000 URLs at a time and up to 30,000 URLs (six sheets) per day. The first step is to create a Google Sheet and paste the URLs you need to archive in the first column (one URL in each row)...
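
To get a feel for the numbers above, here is a minimal Python sketch that splits a URL list into sheet-sized batches and estimates how many sheets and days a job would take. It is only an illustration: the limits (5,000 URLs per sheet, 30,000 URLs per day) are the ones quoted in this post, while the function names `split_into_sheets` and `plan` and the `example.com` URLs are hypothetical and not part of any Internet Archive tool.

```python
import math

# Limits quoted in the post; adjust them if the service changes.
SHEET_LIMIT = 5_000    # maximum URLs per Google Sheet
DAILY_LIMIT = 30_000   # maximum URLs per day (six sheets)


def split_into_sheets(urls, sheet_limit=SHEET_LIMIT):
    """Return consecutive batches of at most `sheet_limit` URLs.

    Each batch is meant to be pasted into the first column of one sheet.
    (Hypothetical helper, not part of the IA service.)
    """
    return [urls[i:i + sheet_limit] for i in range(0, len(urls), sheet_limit)]


def plan(total_urls, sheet_limit=SHEET_LIMIT, daily_limit=DAILY_LIMIT):
    """Return (number of sheets, number of days) needed for `total_urls`."""
    sheets = math.ceil(total_urls / sheet_limit)
    days = math.ceil(total_urls / daily_limit)
    return sheets, days


if __name__ == "__main__":
    sheets, days = plan(1_500_000)
    print(f"1.5 million URLs -> {sheets} sheets over at least {days} days")

    # Placeholder URLs, only to show how batching behaves.
    demo = [f"https://example.com/article/{i}" for i in range(12_001)]
    for n, batch in enumerate(split_into_sheets(demo), start=1):
        print(f"sheet {n}: {len(batch)} URLs")
```

Under these assumptions, 1.5 million URLs work out to 300 sheets and at least 50 days of submissions, which is why batching ahead of time is worth the effort.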