Wayback Machine – Internet Archive Blogs https://blog.archive.org Updates from the Internet Archive

Wayback Machine Director Pushes Back on AI Scraping Fears Driving Archive Blocks https://blog.archive.org/2026/02/18/wayback-machine-director-pushes-back/ Wed, 18 Feb 2026 00:57:47 +0000

As reported by Nieman Lab last month, some major media organizations—including The New York Times, The Guardian, and Reddit—have started blocking the Wayback Machine from archiving their sites over unfounded concerns about AI scraping.

Last week, tech writer Mike Masnick (Techdirt) explained why this is “a mistake we’re going to regret for generations.”

Today, Mark Graham, director of the Wayback Machine, has published a response to the Nieman Lab reporting, pushing back on the media organizations’ concerns about the Wayback Machine being a backdoor to AI scraping. Graham writes:

“These concerns are understandable, but unfounded… like others on the web today, we expend significant time and effort working to prevent such abuse.”

Read the post to learn how Graham is working to protect the integrity of the Wayback Machine, and why limiting web archiving threatens our shared digital history.

Preserving the Open Web: Inside the New Wayback Machine Plugin for WordPress https://blog.archive.org/2026/02/04/inside-the-new-wayback-machine-plugin-for-wordpress/ Wed, 04 Feb 2026 12:00:00 +0000

Link rot. There’s nothing quite as frustrating as clicking on a link that leads to nowhere.

WordPress, which powers more than 40% of websites online, recently partnered with the Internet Archive to address this problem. Engineers from the Internet Archive and Automattic worked together to create a plugin that can be added to a WordPress website to improve the user experience and check the Wayback Machine for an archived version of any webpage that has been moved, changed or taken down.

The free Internet Archive Wayback Machine Link Fixer, publicly launched last fall, combats link rot by seamlessly redirecting the user to a reliable backup page when it encounters a missing page. When the plugin is added to a website, it will do a scan, see what pages exist, and then automatically save those pages to a queue to be archived. If an archived copy doesn’t exist, the page will be sent for capture.

DOWNLOAD THE PLUGIN

Once the software is installed on a WordPress website, the plugin will automatically redirect users to the Wayback Machine version of a missing page.
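The lookup-then-redirect idea described above can be sketched in a few lines. This is not the plugin's code (the post does not describe its internals); it is a minimal Python illustration that assumes the public Wayback Machine availability endpoint and its JSON response shape. The sample response bodies and the function names `availability_query` and `pick_redirect` are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed lookup endpoint: the public Wayback Machine availability service.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url: str) -> str:
    """Build the lookup URL for a page that may have gone missing."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def pick_redirect(response_body: str) -> Optional[str]:
    """Return the archived URL to redirect to, or None if no snapshot exists."""
    data = json.loads(response_body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hypothetical response bodies, shaped like the endpoint's JSON.
found = json.dumps({"archived_snapshots": {"closest": {
    "available": True,
    "url": "http://web.archive.org/web/20250101000000/http://example.com/old-page",
    "timestamp": "20250101000000",
    "status": "200",
}}})
missing = json.dumps({"archived_snapshots": {}})

print(availability_query("example.com/old-page"))
print(pick_redirect(found))    # archived copy to redirect to
print(pick_redirect(missing))  # None: the page would be queued for capture instead
```

Keeping the URL construction and the response handling separate from any network call makes the decision logic easy to check on its own; a real integration would fetch the query URL and pass the body to `pick_redirect`.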

Broken links are one of the web’s most relentless problems. Pew Research found that 38% of the web has disappeared over the past decade and for web admins, “It’s a never-ending game of whack-a-mole to keep links working,” said Matt Blumberg, Product Manager with the Wayback Machine. “This new tool prevents those inevitable 404s by automatically updating links to a preserved copy and it proactively archives pages in the Wayback Machine, where they’re kept accessible for free, long-term, so your site stays usable without manual fixes.”

Many WordPress websites are homespun and are most susceptible to having links go dead. Remedying this problem is not only valuable to individuals, but also to the overall culture, said Alexander Rose, Director of Long-term Futures for Automattic Inc., the technology company behind WordPress.com.

“We need to have an accurate memory of the things that get said, posted, and the ways that we have communicated over time,” Rose said. “Otherwise we’re either doomed to repeat errors or we’re going to make choices that are uninformed by the past.”

The link fixer is expanding the “heroic effort” made by the Internet Archive over the years to preserve everything from small websites to NASA.gov and WhiteHouse.gov, he said.

“It’s very important that websites have a memory and that the web overall has a memory,” Rose said. “We are increasingly using [the web] as our only source of truth. When links go dead, in effect, the truth goes dead. This has become even more important in the world of AI.”

As the plugin rolls out, Rose and Blumberg said they are open to feedback. The goal is to make the software as easy as possible to use. Next, they will fine-tune the features and promote its broad use.

“As it becomes a solid piece of software that people know and like, then I think it has a path to being integrated much more deeply,” Rose said. “It’s early days, but every person I’ve talked to about it is excited to see the potential end of the dreaded 404 error.”

Follow the Changes: 9 Ways Web Archives are Used in Digital Investigations https://blog.archive.org/2026/02/02/follow-the-changes/ Mon, 02 Feb 2026 16:49:01 +0000

Guest post from Thais Lobo, Liliana Bounegru & Jonathan W. Y. Gray, King’s College London.

This work was supported by the Centre for Digital Culture and Department of Digital Humanities at King’s College London and developed further through collaborations with researchers and students at the University of Amsterdam.


Digital journalists increasingly turn to web archives like the Wayback Machine to follow how things on the Internet break, change or disappear – from deleted posts to quietly edited pages.

The web has become not only a source of information but also the subject of media investigations, prompting journalists, researchers and activists to use digital archives to reconstruct timelines, verify claims, uncover hidden connections and hold powerful actors to account.

As online materials grow more fragile and prone to disappearance, the Internet Archive’s Wayback Machine has been critical in making “lost” web pages available – recently celebrating archiving over a trillion pages.

As we’ve previously written about on this blog, the Wayback Machine is an important resource for our work as media researchers, helping us to trace histories of digital media objects (for example, changes in ad tracker signatures of viral “fake news” sites over time).

We are also interested in how others use web archives across fields, and what we can learn from each other.

In this piece we draw on the Internet Archive’s News Stories collection to surface practices and use cultures of the Wayback Machine amongst journalists and media organisations. We analysed a dataset of about 8,600 news articles, assembled by the IA via daily Google News keyword searches since 2018.

Drawing on a combination of digital methods, machine learning and lots of reading – we surfaced nine ways that journalists use the Wayback Machine in their reporting.

***

1. following what is deleted

Shifting political alliances are a common driver of online footprint erasure. Deleted tweets have revealed past critics in current allies (here and here), and current career aspirations were juxtaposed with earlier conflicting stances in personal blogs and websites (here, here, here and here). 

Unannounced takedowns of collections or site sections on government websites often prompt investigations using archival snapshots. Examples include removed editions of presidential newsletters and deleted staff contact lists for services supporting vulnerable groups, signaling access-to-information breaches. 

The removal of official publications also prompted further contextualisation, revealing cases in which information was deleted because it was incomplete, inaccurate or inconveniently timed.

Beyond politics, erasures on corporate websites highlight commercial and reputational pressures, such as deleted statements on forced labour, product safety and climate deception.

2. following what has been altered

Subtle alterations on webpages can also reveal a plain-to-see effort to reshape narratives.

Reporting based on archived pages shows how wording edits can move in opposite directions: from hardening language on migration ahead of a policy announcement to softening controversial statements in view of a political nomination, or erasing customer protection promises prior to a bankruptcy filing. 

In other cases, small additions to online content have proved just as revealing. A before and after snapshot of a blog post showed how a supposed early warning about a virus threat was added only after the pandemic began. Similarly, changes to a social media platform’s API rules appeared shortly after third-party apps were banned, subtly reframing the policy to align with new restrictions.  

3. following what is banned

Sometimes removals are deliberate, often at the request of companies seeking to enforce copyright, control branding, or limit liability.

Reports from media investigations highlight how such bans can affect games (here, here, here and here), apps and technical reviews.

In some cases, the bans intersect with political pressures, such as Hong Kong news outlets being shuttered under pro‑Beijing pressure, and disinformation networks being taken down due to links to state actors.

4. following what is broken

Archived snapshots are also often the only way to reconstruct what preceded a link break, when it happened, and what information was effectively cut off.

For example, an investigation into a set of broken URLs on a government website revealed that the pages themselves had not been removed, but the links pointed to outdated servers, creating a false impression of secrecy that sparked a conspiracy theory.

In another case, a major technical glitch took multiple Nigerian government websites offline, cutting off access to official information and showing how even unintentional failures can undermine transparency.

5. following what is hacked

Archived snapshots of compromised websites and social media accounts offer another kind of traceable historical record.

For example, past screenshots of Twitter’s bio page revealed inconsistencies in claims about an alleged takeover of the US president’s social media account. In other cases, such snapshots helped surface a forensic trail and distinguish unauthorised activity carried out by activists (here and here) from activity linked to cybercriminal groups (here).

6. following what is connected

Archived web data often uncovers unexpected ownership links between domains that appear unrelated on the surface.

For example, journalists used analytics codes found in archived copies of sites held by the Wayback Machine to uncover disinformation networks. In another investigation, archived records verified that a website redirect to Joe Biden’s presidential campaign was unrelated to him, debunking conspiracy theories about the domain’s ownership.

Snapshots of a fake Black Lives Matter Facebook page and its associated websites allowed reporters to trace the individuals behind the operation. Similarly, archived versions of Amazon storefronts exposed networks of accounts generating affiliate revenue from coordinated product listings.
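One way the analytics-code technique mentioned above might look in practice is sketched below. The piece does not say which analytics identifiers the journalists used, so the example assumes legacy Google Analytics property IDs (of the form `UA-<account>-<property>`) and works on hypothetical HTML strings standing in for archived snapshots. The domains, IDs and the helper name `shared_accounts` are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed identifier format: legacy Google Analytics IDs such as "UA-1234567-1".
# The captured account part ("UA-1234567") is common to sites under one account.
UA_PATTERN = re.compile(r"\b(UA-\d{4,10})-\d{1,4}\b")


def shared_accounts(snapshots: Dict[str, str]) -> Dict[str, List[str]]:
    """Group domains by analytics account, keeping accounts seen on 2+ domains."""
    groups = defaultdict(set)
    for domain, html in snapshots.items():
        for account in UA_PATTERN.findall(html):
            groups[account].add(domain)
    return {acct: sorted(doms) for acct, doms in groups.items() if len(doms) > 1}


# Hypothetical archived HTML, keyed by the domain it was captured from.
snapshots = {
    "news-a.example": "<script>ga('create', 'UA-1234567-1', 'auto');</script>",
    "news-b.example": "<script>ga('create', 'UA-1234567-2', 'auto');</script>",
    "other.example": "<script>ga('create', 'UA-7654321-1', 'auto');</script>",
}

print(shared_accounts(snapshots))
```

A shared account is a lead rather than proof of common ownership, so a match like this would normally be checked against other evidence.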

7. following what is reported

Archived web pages have proven vital for tracing how stories are presented across media outlets and platforms.

Investigations have examined archived versions of individual pages, such as headline coverage relying heavily on unverified claims, a news agency’s premature editorial assessment, or the unflagging of branded content.

In another case, snapshots of the Google homepage captured during the 2018 State of the Union speech disproved a viral claim that Google ignored Donald Trump’s address in favour of Barack Obama.

8. following what is unchanged

In other investigations, the most revealing detail is what did not change.

For example, during a bushfire crisis in Australia, archived pages showed that a key policy statement by the Greens party was left untouched, despite a disinformation campaign claiming the contrary.

Similarly, a social media account circulated as having been reactivated under a new wave of laissez-faire moderation was, in fact, never suspended.

9. following what is saved 

When forums, platforms and websites vanish, it is crowdsourced archivists who capture their traces before they are gone for good.

In several reported cases, users raced to preserve spaces such as a long-running forum for sex workers, a 16-year-old Q&A site, a meme-sharing platform, and a free music library.

Archiving web pages can become part of the story.

***

These are some of the ways we’ve noticed journalists using web archives – and there are many more! If you know of other interesting examples, we’d love to hear from you.

We hope that these nine ways may help to inspire critical and creative uses of web archives to “follow the changes” – exploring what they can tell us about digital culture and society, and the times we live in.

About the authors

Thais Lobo is a research associate at the Department of Digital Humanities, King’s College London, with a previous career in journalism.

Jonathan W. Y. Gray is Co-director of the Centre for Digital Culture and Reader in Critical Infrastructure Studies at the Department of Digital Humanities, King’s College London. He is also co-founder of the Public Data Lab and research associate at the Digital Methods Initiative (University of Amsterdam) and the médialab (Sciences Po, Paris). More about his work can be found at jonathangray.org.

Liliana Bounegru is Senior Lecturer (Associate Professor) in Digital Media, Culture and Society at the Department of Digital Humanities, King’s College London. She is also co-founder of the Public Data Lab, member of the Digital Methods Initiative at the University of Amsterdam and associate of the Sciences Po Paris médialab. More about her work can be found at lilianabounegru.org.

Celebrate 1 Trillion Web Pages with Original Net.Art Works: Internet Archive x Gray Area https://blog.archive.org/2025/10/24/celebrate-1-trillion-web-pages-with-original-net-art-works-internet-archive-x-gray-area/ Fri, 24 Oct 2025 18:17:42 +0000
Pretty Guardian Shrine (2025) by Ophira Horwitz

Internet Archive x Gray Area: Trillionth Webpage Net.Art Commissions
Date: Saturday, November 1
Time: 5:00 to 8:00pm
Location: Internet Archive, 300 Funston Avenue, San Francisco
Admission: Free
REGISTER NOW!

The Internet Archive has reached an extraordinary milestone: one trillion web pages archived. This civilization-scale achievement marks decades of dedication to preserving the ephemeral nature of digital culture and ensuring universal access to human knowledge.

To commemorate this historic moment, San Francisco interdisciplinary arts and technology non-profit Gray Area has partnered with the Internet Archive to commission a series of original net.art works that engage with the vast holdings of the Internet Archive and explore what it means to create, preserve, and access culture online.

Commissioned Artists

  • Chia Amisola
  • Spencer Chang
  • Sarah Friend & Arkadiy Kukarkin
  • Ophira Horwitz
  • Mai Ishikawa-Sutton & Raúl Feliz
  • Olivia McKayla Ross
  • Jesse Walton
  • Rodell Warner

The commissioned artists have drawn from the Internet Archive’s expansive collections to create web-based artworks that reflect on themes of memory, digital archaeology, and the human stories embedded within preserved data. These works exist as both online experiences and physical installations at the Internet Archive, bridging the digital and material worlds in ways that honor the Archive’s dual nature as both a technological achievement and a profoundly human endeavor.

Curated by Amir Esfahani (Internet Archive) and Wade Wallerstein (Gray Area)

Web Archive 96: How the Smithsonian Helped Create One of the First Wayback Machine Collections https://blog.archive.org/2025/10/20/web-archive-96-how-the-smithsonian-helped-create-one-of-the-first-wayback-machine-collections/ Mon, 20 Oct 2025 12:00:00 +0000
Screenshot from the Wayback Machine of the Web Archive 96 project page (October 11, 1997).

In 1996, the World Wide Web was starting to catch on. Politicians were just beginning to explore how to use online communication to reach voters. And in a house in San Francisco, the fledgling Internet Archive was starting to archive pieces of the web before they disappeared.

That same year, a letter arrived from Washington, D.C., with the Smithsonian Institution’s iconic sunburst logo at the top. The Smithsonian had agreed to partner with the Internet Archive to preserve the digital record of the 1996 U.S. presidential election.

“It was a major milestone for us,” recalls Internet Archive founder Brewster Kahle. “The big Smithsonian was working with this new little Internet Archive nonprofit library.”

Together, the two institutions launched Web Archive 96, one of the first web collections the Internet Archive ever created. It captured the early campaign webpages of candidates Bill Clinton, Bob Dole, and Ross Perot — online brochures filled with policy positions, photos, and promises — along with news coverage of the race. It was a pioneering effort to preserve the political life of a nation as it moved onto the web. The collection is now a foundational part of our cultural history on the web, and is available for public access via the Wayback Machine.

Explore Web Archive 96 via the Wayback Machine

Nearly thirty years later, that collaboration still stands out as visionary: two institutions, one old and one new, working together to recognize the internet as part of our shared cultural record.

Politics Goes Digital

We the People: Winning the Vote exhibit installation, National Museum of American History, 1996-2000.

In Washington, D.C., the National Museum of American History added a personal computer displaying online presidential election website content to its “We The People” campaign exhibit. “It was delivered and was displayed next to campaign buttons from the 1800s,” Kahle recalled.

Indeed, Smithsonian curators Larry Bird and Harry Rubenstein traveled to New Hampshire and Iowa every four years to collect buttons, signs and physical memorabilia from the campaign offices. Just as television changed the political landscape in the 1960s, they recognized the potential influence of the web in 1996. When they heard Kahle was archiving campaigns, Bird said they were “ecstatic” to collaborate.

“We were all over it,” said Bird, now a curator emeritus from the Smithsonian division of political history. “We were super glad that we could take this non-dimensional thing and for it to have a presence on the floor – even in this most rudimentary, stripped down way – limited to the candidates’ websites. It was an acknowledgement of where things were heading.”

Jeff Ubois, who forged the partnership in 1996, recalled “Why would anyone care about the ephemera of the web?” as the prevailing attitude at the time. “The Smithsonian helped change some of that.”

Once the Internet Archive partnered with the Smithsonian, “it wasn’t possible to dismiss web archiving as irrelevant, impossible, useless,” Ubois said.

People contact the Smithsonian often, Bird said, and the Internet Archive outreach was unexpected, but welcome. “We were constantly looking at the way things were shifting in politics, which always takes what’s popular and successful in the real world and bends it into its own political world or reality,” he said. “And this just seemed to be yet, the latest iteration of that as a cultural phenomenon….To have [the Internet Archive] assemble it wasn’t anything that any of us could have done at the time.”

‘Collection of Record for the Web’

Bird said the Internet Archive is a “remarkable resource” that he and other researchers have relied on for years.

“The museum is the collection of record for material things, objects, and dimensional things. And the Internet Archive is the collection of record for the web and all that implies,” Bird said. “There’s hardly anything that it doesn’t touch anymore. It didn’t start out that way, but it’s become that. It’s the collection of record that people use and cite and compare. It’s a tremendous historical resource.”

Preserving the evolution of political campaigns is important to anyone trying to do research or understand political trends over time, said David Almacy, president and chief executive officer of Far Post Media, a digital public affairs firm in Virginia, and former White House E-Communications Director for President George W. Bush. In 1996, campaign websites were primarily online brochures: just text and photos without much customization. Today, websites are more advanced, integrating video and interactive elements that can be tailored to the user.

“The value is to provide an archive and a record of what was said, and basically a snapshot in time politically,” Almacy said. “It actually becomes fascinating to go back and look at the issues that were facing the country that would be deemed priorities in 1996 and how that compares to today. I assume a lot are the same – the economy, education, immigration, national security, global peace – but they’ve evolved in different ways. Many are very important to Americans, just as they were back then.”

Celebrating 1 Trillion Webpages Archived: Share Your Wayback Story https://blog.archive.org/2025/09/23/celebrating-1-trillion-webpages-archived-share-your-wayback-story/ Tue, 23 Sep 2025 17:07:47 +0000

This October, the Internet Archive’s Wayback Machine will reach an extraordinary milestone: 1 trillion webpages preserved.

Since 1996, the Wayback Machine has been capturing the web—saving the voices, creativity, and communities that make up our shared digital history. Nearly one trillion pages later, we’re still archiving, so that future generations can look back and understand the world as we lived it online.

Now we want to invite you to share your story with us!

Record a video answering the question: “Why is the Wayback Machine important to you?”

Guidelines:

  • Keep it to about 1 minute, record in vertical/portrait format, and leave a second of silence at the start and end so nothing gets cut off.
  • Use any device you like: your phone, webcam, etc.

Share your video so we can find it:

  • Post it on your preferred social media platform with the hashtag #Wayback1T
  • Or, upload it directly to Archive.org!

Uploading to archive.org

  • Create a free account: Sign up here.
  • Use the upload form and select your video file.
  • Add the subject tag Wayback1T when filling out the form.

We’ll be sharing some of our favorites on our channels as part of this celebration.

The web changes fast, but thanks to you—and thanks to one trillion pages saved—the memory of the internet endures.

Join the celebration. Tell your Wayback story today.

Looking back on “Preserving the Internet” from 1996 https://blog.archive.org/2025/09/02/looking-back-on-preserving-the-internet-from-1996/ Tue, 02 Sep 2025 18:41:15 +0000

As the Internet Archive celebrates 1 trillion web pages archived, it’s worth revisiting what founder Brewster Kahle imagined back in 1996, when the web was still young and the Wayback Machine was years away from its public debut.

Nearly three decades ago, Internet Archive founder Brewster Kahle sketched out a bold vision for preserving the web before it could slip away—warning that without action, the digital age might echo the cultural losses of Alexandria’s library or early film reels.

Today, in 2025, many of the ideas he laid out in “Preserving the Internet,” published in the March 1997 issue of Scientific American, have come to life: a global digital library, tools that fight link rot, and researchers mining web history to understand our present. Other challenges he foresaw—like obsolete formats, legal battles, and questions of digital memory—remain pressing, but his optimism still holds: by building archives together, we can create a more reliable, enduring memory for the internet age.

Read the published paper in Scientific American.
Read the original pre-print via the Wayback Machine, or below:


Preserving the Internet
Brewster Kahle
Internet Archive

11/4/96
Bold efforts to record the entire Internet are expected to lead to new services.
Submitted to Scientific American for March 1997 Issue

The early manuscripts at the Library of Alexandria were burned, much of early printing was not saved, and many early films were recycled for their silver content. While the Internet’s World Wide Web is unprecedented in spreading the popular voice of millions that would never have been published before, no one recorded these documents and images from 1 year ago. The history of early materials of each medium is one of loss and eventual partial reconstruction through fragments. A group of entrepreneurs and engineers have determined to not let this happen to the early Internet.

Even though the documents on the Internet are the easy documents to collect and archive, the average lifetime of a document is 75 days and then it is gone. While the changing nature of the Internet brings a freshness and vitality, it also creates problems for historians and users alike. A visiting professor at MIT, Carl Malamud, wanted to write a book citing some documents that were only available on the Internet’s World Wide Web system, but was concerned that future readers would get a familiar error message “404 Document not found” by the time the book was published. He asked if the Internet was “too unreliable” for scholarly citation.

Where libraries serve this role for books and periodicals that are no longer sold or easily accessible, no such equivalent yet exists for digital information. With the rise of the importance of digital information to the running of our society and culture, accompanied by the drop in costs for digital storage and access, these new digital libraries will soon take shape.

The Internet Archive is such a new organization that is collecting the public materials on the Internet to construct a digital library. The first step is to preserve the contents of this new medium. This collection will include all publicly accessible World Wide Web pages, the Gopher hierarchy, the Netnews bulletin board system, and downloadable software.

If the example of paper libraries is a guide, this new resource will offer insights into human endeavor and lead to the creation of new services. Never before has this rich a cultural artifact been so easily available for research. Where historians have scattered club newsletters and fliers, physical diaries and letters, from past epochs, the World Wide Web offers a substantial collection that is easy to gather, store, and sift through when compared to its paper antecedents. Furthermore, as the Internet becomes a serious publishing system, then these archives and similar ones will also be available to serve documents that are no longer “in print”.

Apart from historical and scholarly research uses, these digital archives might be able to help with some common infrastructure complaints:

– Internet seems unreliable: “Document not found”
– Information lacks context: “Where am I? Can I trust this information?”
– Navigation: “Where should I go next?”

When working with books, libraries help with some of these issues, with “the stacks” of books, links to other libraries and librarians to help patrons.

Preservation of our Digital History

Where we can read the 400 year-old books printed by Gutenberg, it is often difficult to read a 15 year-old computer disk. The Commission for Preservation and Access in Washington DC has been researching the thorny problems faced trying to ensure the usability of the digital data over a period of decades. Where the Internet Archive will move the data to new media and new operating systems every 10 years, this only addresses part of the problem of preservation.

Using the saved files in the future may require conversion to new file formats. Text, images, audio, and video are undergoing changes at different rates. Since the World Wide Web currently has most of its textual and image content in only a few formats, we hope that it will be worth translating in the future, whereas we expect that the short-lived or seldom-used formats will not be worth the future investment. Saving the software to read discarded formats often poses problems of preserving or simulating the machines that they ran on.

The physical security of the data must also be considered. Natural and political forces can destroy the data collected. Political ideologies change over time, making what was once legal illegal. We are looking for partners in other geographic and national locations to provide a robust archive system over time. To give some level of security from commercial forces that might want exclusive access to this archive, the data is donated to a special non-profit trust for long-term caretaking. This non-profit organization is endowed with enough money to perform the necessary maintenance on the storage media over the years.

Packaging enough meta-data (information about the information) is necessary to inform future users. Since we do not know what future researchers will be interested in, we are documenting the methods of collection and attempting to be complete in those collections. As researchers start to use these data, the methods and data recorded can be refined.

Technical Issues of Gathering Data

Building the Internet Archive involves gathering, storing, and serving the terabytes of information that at some point were publicly accessible on the Internet.

Gathering these distributed files requires computers to constantly probe the servers looking for new or updated files. The Internet has several different subsystems to make information available such as the World Wide Web (WWW), File Transfer Protocol (FTP), Gopher, and Netnews. New systems for three-dimensional environments, chat facilities, and distributed software require new efforts to gather these files. Each of these systems requires special programs to probe and download appropriate files. Estimating the current size, turnover, and growth of the public Internet has proven tricky because of the dynamic nature of the systems being probed.

Protocol   Number of Sites      Total Data   Change Rate
WWW        400,000              1,500GB      600GB/month
Gopher     5,000                100GB        declining (from Veronica Index)
FTP        10,000               5,000GB      not known
Netnews    20,000 discussions   240GB        16GB/month

The World Wide Web is vast, growing rapidly, and filled with transient information. Estimated at 50 million pages with the average page online for only 75 days, the turnover is considerable. Furthermore, the number of pages is reported to be doubling every year. Using the average web page size of 30 kilobytes (including graphics) brings the current size of the Web to 1.5 terabytes (or 1.5 million megabytes).
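The size estimate can be checked with a few lines of arithmetic. The sketch below uses decimal units (1 GB = 1000 MB). The monthly turnover figure rests on a simplifying assumption that is not stated in the text: that the whole Web is replaced once per average page lifetime. Under that assumption the result happens to match the 600 GB/month shown in the table, although the paper does not say how that figure was derived.

```python
# Back-of-the-envelope check of the Web size estimate.
PAGES = 50_000_000          # estimated number of pages
AVG_PAGE_KB = 30            # average page size, including graphics
AVG_LIFETIME_DAYS = 75      # average time a page stays online

KB_PER_GB = 1000 * 1000     # decimal units: 1 GB = 1000 MB = 1,000,000 KB
total_gb = PAGES * AVG_PAGE_KB / KB_PER_GB
total_tb = total_gb / 1000

# Assumption (ours, not the paper's): content turns over uniformly,
# once per average page lifetime.
turnover_gb_per_month = total_gb * 30 / AVG_LIFETIME_DAYS

print(f"Total size: {total_gb:,.0f} GB ({total_tb} TB)")
print(f"Turnover:   {turnover_gb_per_month:,.0f} GB/month")
```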

Gathering the World Wide Web requires computers specifically programmed to “crawl” the net by downloading a web page, then finding the links to graphics and other pages on it, and then downloading those and continuing the process. This is the technique that the search engines, such as AltaVista, use to create their indices to the World Wide Web. The Internet Archive currently holds 600GB of information of all types. In 1997 we will have collected a snapshot of the documents and images.
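The crawl loop described here can be sketched in a few lines. This is a minimal illustration, not the Archive's crawler: the names `LinkExtractor`, `crawl`, and `FAKE_WEB` are hypothetical, and a small in-memory dictionary stands in for real network requests so the example can run anywhere.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the targets of <a href> and <img src> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def crawl(start_url, fetch, max_pages=1000):
    """Breadth-first crawl. `fetch(url)` returns the page text, or None."""
    seen = {start_url}
    queue = deque([start_url])
    archive = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:            # page is gone or unreachable
            continue
        archive[url] = body         # keep a copy of what was downloaded
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:   # follow links to graphics and pages
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive


# Hypothetical three-file "web" used in place of HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <img src="logo.gif">',
    "http://example.com/a.html": '<a href="/">home</a> <a href="/missing.html">x</a>',
    "http://example.com/logo.gif": "",
}
snapshot = crawl("http://example.com/", FAKE_WEB.get)
print(sorted(snapshot))
```

The `seen` set is what keeps the crawler from downloading the same page twice when pages link back to each other, and the `max_pages` cap is one simple way to bound a crawl of a collection that never stops growing.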

The information collected by these “crawlers” is not, unfortunately, all the information that can be seen on the Internet. Much of the data is restricted by the publisher, or stored in databases that are accessible through the World Wide Web but are not available to the simple crawlers. Other documents might have been inappropriate to collect in the first place, so authors can mark files or sites to indicate that crawlers are not welcome. Thus the collected Web will be able to give a feel of what the web looked like at a particular time, but will not simulate the full online environment.
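One widely used way for authors to mark sites as off limits to crawlers is a robots.txt exclusion file. The paper does not name a specific mechanism, so treating robots.txt as that mechanism is an assumption here. The sketch below uses Python's standard `urllib.robotparser`; the rules, the agent name `example-archiver`, and the helper `may_collect` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion file a site owner might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def may_collect(url, agent="example-archiver"):
    """True if the site's exclusion rules allow `agent` to fetch `url`."""
    return rules.can_fetch(agent, url)


print(may_collect("http://example.com/index.html"))
print(may_collect("http://example.com/private/diary.html"))
```

A crawler would call a check like this before each download and skip any URL the publisher has excluded.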

While the current sizes are large, the Internet is continuing to grow rapidly. When it is common to connect one’s home camcorder to the upcoming high bandwidth Internet, it will not be practical to archive it all. At some point we will have to become more selective about what data will be of the most value in the future, but currently we can afford to gather it all.

Storing Terabytes of Data Cost Effectively

Crucial to archiving the Internet, and digital libraries in general, is the cost effective storage of terabytes of data while still allowing timely access. Since the cost of storage has been dropping rapidly, the archiving cost is dropping. The flip side, of course, is that people are making more information available.

To stay ahead of this onslaught of text, images, and soon video information we believe we have to store the information for much less money than the original producers paid for their storage. It would be impractical to spend as much on our storage as everyone else combined.

Storage Technology | Cost per GigaByte | Random Access Time
Memory (RAM) | $12,000/GB | 70 nanoseconds
Hard Disk | $200/GB | 15 milliseconds
Optical Disk Jukebox | $140/GB | 10 seconds
Tape Jukebox | $20/GB | 4 minutes
Tapes on shelf | $2/GB | human assistance required

(1 GigaByte = 1000 MegaBytes, 1 TeraByte = 1000 GigaBytes. A GigaByte is roughly enough to store 1000 books or 1 hour of compressed video)

With these prices, we chose hard disk storage for a small amount of the frequently accessed data combined with tape jukeboxes. In most applications we expect a small amount of information to be accessed much more frequently than the rest, leveraging the use of the faster disk technology rather than the tape jukebox.
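The trade-off behind this choice can be made concrete with the prices and access times from the table. The sketch below is illustrative only: the 1,500 GB collection size is borrowed from the Web estimate earlier in the paper, while the 10% disk share and the 90% disk hit rate are assumptions of ours, not figures from the text.

```python
# Prices and access times from the storage table (1996 figures).
COST_PER_GB = {"disk": 200, "tape_jukebox": 20}          # dollars
ACCESS_SECONDS = {"disk": 0.015, "tape_jukebox": 240}    # 15 ms, 4 minutes


def blended_cost(total_gb, disk_percent):
    """Dollar cost with `disk_percent` of the data on disk, the rest on tape."""
    disk_gb = total_gb * disk_percent / 100
    tape_gb = total_gb - disk_gb
    return disk_gb * COST_PER_GB["disk"] + tape_gb * COST_PER_GB["tape_jukebox"]


def expected_access_seconds(disk_hit_percent):
    """Average access time if `disk_hit_percent` of requests are served from disk."""
    miss_percent = 100 - disk_hit_percent
    return (disk_hit_percent * ACCESS_SECONDS["disk"]
            + miss_percent * ACCESS_SECONDS["tape_jukebox"]) / 100


TOTAL_GB = 1500   # the 1.5 TB Web estimate
for percent in (0, 10, 100):
    print(f"{percent:3d}% on disk: ${blended_cost(TOTAL_GB, percent):,.0f}")

# Assumed hit rate: 90% of requests touch the data kept on disk.
print(f"Average access at 90% disk hits: {expected_access_seconds(90):.1f} s")
```

Under these assumptions the mixed design costs far less than an all-disk archive, while the average access time is dominated by the occasional tape fetch. That is why the share of requests served from disk matters more than the share of data stored there.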

Providing Access and New Services

After gathering and storing the public contents of the Internet, what services would then be of greatest value with such a repository? While it is impossible to be certain, digital versions of paper services might prove useful.

For instance, we can provide a “reliability service” for documents that are no longer available from the original publisher. This is similar to one of the roles of a library. In this way, one document can refer, through a hypertext link, to a document on another server and a reader will be able to follow that link even if the original is gone. We see this as an important piece of infrastructure if the global hypertext system is to become a medium for scholarly publishing.
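Such a service amounts to a fallback lookup: serve the live document when it still exists, otherwise serve the most recent archived copy. The sketch below shows that logic with in-memory dictionaries standing in for the live Web and the archive; the function `resolve` and the sample data are hypothetical, and this is not a description of how any deployed service works.

```python
def resolve(url, live_web, archive):
    """Return (source, content) for `url`, preferring the live copy.

    `live_web` maps URLs to current content.
    `archive` maps URLs to a list of (date, content) snapshots.
    """
    content = live_web.get(url)
    if content is not None:
        return "live", content
    snapshots = archive.get(url, [])
    if snapshots:
        # ISO-style dates sort chronologically, so max() picks the latest.
        date, content = max(snapshots, key=lambda snap: snap[0])
        return f"archived {date}", content
    return "missing", None


# Hypothetical data: one page still online, one that has vanished.
LIVE = {"http://example.com/now.html": "current text"}
ARCHIVE = {
    "http://example.com/gone.html": [
        ("1996-05-01", "first version"),
        ("1996-09-15", "last version before removal"),
    ],
}

print(resolve("http://example.com/now.html", LIVE, ARCHIVE))
print(resolve("http://example.com/gone.html", LIVE, ARCHIVE))
print(resolve("http://example.com/never.html", LIVE, ARCHIVE))
```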

Another application for a central archive would be to store an “official copy of record” of public information. These records are often of legal interest, helping to determine what was said or known at a particular time.

Historians have already found the material useful. David Allison of the Smithsonian Institution has used the materials for an exhibit on Presidential Election websites, which he thinks might be the equivalent of saving videotapes of early TV campaign advertisements. David Eddy Spicer of Harvard’s Kennedy School of Government has used the materials for their “case studies” in much the same way they collect old newspaper articles to capture a point in time.

With copies of the Internet over time and cross correlation of data from multiple sources, new services might help users understand what they are reading, when it was created, and what other people thought of it. With these services, people might be able to give a context to the information they are seeing and therefore know if they can trust it. Furthermore, the coordination of this meta-information and usage data can help build services for navigating the sea of data that is available.

Companies are also interested in saving similar information and building similar services based on their internal information to help employees effectively learn from the experiences of others.

The technologies and the services that will grow out of building digital archives and digital libraries could lead towards building a reliable system of information interchange based on electrons rather than paper. People might consult the “library” many times a day to retrieve documents that are no longer available on the Internet.

Legal and Social Issues

Creating an archive of informal and personal information raises many difficult legal and social issues even if the material was intended to be publicly accessible at some point. Such a collection treads into the murky area of intellectual property in the digital era. What can be done with the digital works that are collected gets into the area of copyright, privacy, import/export restrictions, and possession of stolen property.

To give a few examples: what if a college student made a web page that had pictures of her then-current boyfriend, but later wanted to take it down and “tear it up”, yet it lived on in digital archives (whether accessible or not)? Should she have the right to remove that document? Should a candidate for political office be able to go back 15 years to erase his postings to public bulletin boards that have been saved in the Archive? What if a software program that is legal to publish in Denmark but illegal in the United States is collected by an archive: should this program be removed and hidden even from historians and scholars? The legal and social issues raised by the construction of the Archive are not easily resolved.

By allowing authors to exclude their information from the Archive we hope to avoid some of the immediate issues, and allow enough time to pass to understand the larger issues at hand.

The Internet Archive might be able to help resolve some of these issues by publicly drawing the issues out and by participating in the debates. While many of these questions will take years to resolve, we feel it is important to proceed with the collection of the material since it can never be recovered in the future.

Where does it go from here?

The new technologies and services currently being created might be useful in all digital libraries and help make the Internet more robust and useful.

Through an archive of what millions of people are interested in making public, we might be able to detect new trends and patterns. Since these materials are in computer readable form, searching them, analyzing them, and distributing them has never been easier. A variety of services built on top of large data sets will allow us to connect people and ideas in new ways.

For instance, Firefly Inc. is using the individual tastes in music and movies to help suggest other CD’s and videos based on finding “similar” people. They have even found that people are interested in communicating with the other “similar” people directly thus forming communities based on similar interests. This kind of computer matchmaking which is based on detailed portraits of people’s preferences suggests similar services based on reading habits.

Trends in academic fields might be able to be detected more easily by studying gross statistics of the communications in the field. The hypertext links of the World Wide Web form an informal citation system similar to the footnote system already in use. Studying the topography of these links and their evolution might provide insights into what any given community thought was important.
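One of the simplest statistics of this kind is the number of distinct pages that link to each page, much as one counts citations to a paper. The sketch below computes that count for a small hypothetical link graph; the function `inbound_link_counts` and the page names are made up for illustration, and real link analysis would need to handle far larger graphs.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count how many distinct pages link to each page.

    `links` maps each page to the list of pages it links to.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):     # count each source once per target
            if target != source:        # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical link graph: three pages all point at "c".
LINKS = {
    "a": ["c", "b"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "c"],    # repeated links from one page count once
}

counts = inbound_link_counts(LINKS)
print(counts.most_common(1))
```

Comparing such counts across snapshots taken at different times would be one way to follow how the links evolve.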

If archiving cultural and personal histories becomes useful commercially, then the efforts can be expanded to record radio and video broadcasts. These systems might allow us to study these effects and influences on our lives.

Current terabyte technologies (storage hardware and management software) are relatively rare and specialized because of their costs, but as the costs drop we might see new applications that have traditionally used non-computer media. For instance,

– A video store holds about 5,000 video titles, or about 7 terabytes of compressed data.
– A music radio station holds about 10,000 LP’s and CD’s or about 5 terabytes of uncompressed data.
– The Library of Congress contains about 20 million volumes, or about 20 terabytes of text if typed into a computer.
– A semester of classroom lectures of a small college is about 18 terabytes of compressed data.

Therefore the continued reduction in price of data storage, and also data transmission, could lead to interesting applications as all the text of a library, music of a radio station, and video of a video store become cost effective to store and later transmit in digital form.
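Dividing each total by its item count shows the per-item sizes these estimates imply. The sketch below does that arithmetic in decimal units (1 terabyte = 1,000,000 megabytes) for the three entries that give an item count; the classroom-lecture figure is left out because the text gives no count to divide by. The helper name `mb_per_item` is ours.

```python
# (number of items, total terabytes) as quoted in the list of estimates.
ESTIMATES = {
    "video store (titles)":          (5_000, 7),
    "music radio station (albums)":  (10_000, 5),
    "Library of Congress (volumes)": (20_000_000, 20),
}
MB_PER_TB = 1000 * 1000   # decimal units


def mb_per_item(items, terabytes):
    """Average megabytes per item implied by a collection-level estimate."""
    return terabytes * MB_PER_TB / items


for name, (items, terabytes) in ESTIMATES.items():
    print(f"{name}: {mb_per_item(items, terabytes):,.0f} MB each")
```

The implied 1 MB per volume agrees with the earlier rule of thumb that a gigabyte holds roughly 1000 books.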

In the end, our goal is to help people answer hard questions. Not “what is my bank balance?”, or “where can I buy the cheapest shoes”, or “where is my friend Bill?” – these will be answered by smaller commercial services. Rather, we want to help answer the hard questions like: “Should I go back to graduate school?” or “How should I raise my children?” or “What book should I read next?”. Questions such as these can be informed by the experiences of others. Can machines and digital libraries really help in answering such questions? In the long term, we believe yes, but perhaps in new ways which would have importance in education and day-to-day life.

Further Reading:

Preserving Digital Objects: Recurrent Needs and Challenges, December 1995 presentation at 2nd NPO conference on Multimedia Preservation, Brisbane, Australia.

The Vanished Library, Luciano Canfora. University of California Press, 1990.

Biography:

Brewster Kahle is a founder of the Internet Archive, established in April 1996. Before that, he invented the Wide Area Information Servers (WAIS) system in 1989 and founded WAIS Inc in 1992. WAIS helped bring commercial and government agencies onto the Internet by selling Internet publishing tools and production services to companies such as Encyclopaedia Britannica, New York Times, and the Government Printing Office.

Schooled at MIT (BSEE ’82), Brewster designed supercomputers in the 80’s at Thinking Machines Corporation.

]]>
https://blog.archive.org/2025/09/02/looking-back-on-preserving-the-internet-from-1996/feed/ 1
From India to the World: A Scholar’s Tribute to the Internet Archive https://blog.archive.org/2025/08/29/from-india-to-the-world-a-scholars-tribute-to-the-internet-archive/ Fri, 29 Aug 2025 20:33:26 +0000 https://blog.archive.org/?p=29181 Every day, people around the world use the Internet Archive to learn, research, and discover. Aadarsh Pathak, a scholar in India, called the Internet Archive “a guardian of our collective digital heritage” in a recent note. His words inspire us—and we’d love to hear yours as we celebrate 1 trillion web pages archived.

Share your story through our testimonial form.

Aadarsh Pathak,
Research Scholar,
Deen Dayal Upadhyaya Gorakhpur University
I am writing to you as a research scholar to express my profound gratitude for your visionary creation, the Internet Archive. It is not merely a digital library; for academics like myself, it is an indispensable and unparalleled resource.

Your incredible project has preserved countless historical documents, books, and web materials that would have otherwise been lost to time. The ability to access primary sources, trace the evolution of ideas through archived web pages, and find rare texts has been absolutely critical to the depth and authenticity of my research. The Wayback Machine, in particular, has often been my last resort for retrieving crucial online information that has disappeared from the live web.

The Internet Archive is more than just a tool; it is a guardian of our collective digital heritage and a powerful democratizing force for knowledge. Your contribution to education, research, and the open access movement is truly monumental and an inspiration to us all.

Thank you for your unwavering commitment to preserving our history and for building a foundation upon which so much future discovery will depend.

With deepest appreciation, 
Aadarsh Pathak 
Research Scholar
Deen Dayal Upadhyaya Gorakhpur University, Gorakhpur, Uttar Pradesh, India 

]]>
Wayback Machine to Hit ‘Once-in-a-Generation Milestone’ this October: One Trillion Web Pages Archived https://blog.archive.org/2025/07/01/wayback-machine-to-hit-once-in-a-generation-milestone-this-october-one-trillion-web-pages-archived/ https://blog.archive.org/2025/07/01/wayback-machine-to-hit-once-in-a-generation-milestone-this-october-one-trillion-web-pages-archived/#comments Tue, 01 Jul 2025 12:00:00 +0000 https://blog.archive.org/?p=28965 Illustration of a towering monolith with "1T" engraved on it, symbolizing the Internet Archive's milestone of archiving 1 trillion web pages. The monolith stands against a cosmic backdrop with a glowing light behind it, evoking a sense of scale and wonder. The Internet Archive logo appears in the lower left corner.

This October, the Internet Archive’s Wayback Machine is projected to hit a once-in-a-generation milestone: 1 trillion web pages archived. That’s one trillion memories, moments, and movements—preserved for the public, forever.

We’ll be commemorating this historic achievement on October 22, 2025, with a global event: a party at our San Francisco headquarters and a livestream for friends and supporters around the world. More than a celebration, it’s a tribute to what we’ve built together: a free and open digital library of the web.

Join us in marking this incredible milestone. Together, we’ve built the largest archive of web history ever assembled. Let’s celebrate this achievement—in San Francisco and around the world—on October 22.

Here’s how you can take part:

1. RSVP
Sign up now to be the first to know when registration opens for our in-person event and livestream.
RSVP now

2. Support the Internet Archive
Help us continue preserving the web for generations to come.
Donate today!

3. Share Your Story
What does the web mean to you? How has the Wayback Machine helped you remember, research, or recover something important?
Submit your story

Let’s work together toward October 22—a day to look back, share stories, and celebrate the web we’ve built and preserved together.

]]>
https://blog.archive.org/2025/07/01/wayback-machine-to-hit-once-in-a-generation-milestone-this-october-one-trillion-web-pages-archived/feed/ 2
BBC News: Can the Internet Archive Save Our Digital History?  https://blog.archive.org/2025/05/05/bbc-news-can-the-internet-archive-save-our-digital-history/ Mon, 05 May 2025 17:09:00 +0000 https://blog.archive.org/?p=28773 “A Time Machine for the Web” — the BBC just released a must-watch video on the Internet Archive and why our mission matters more than ever.

Watch here: https://www.youtube.com/watch?v=jh98N46DM5k

Inside the Internet Archive’s San Francisco headquarters, you’ll find racks of servers preserving humanity’s digital memory — from old websites to disappearing government data, books to historic videotapes.

“We are a digital library for our times — and hopefully, for all times,” says Mark Graham, director of the Wayback Machine.

But preserving access to information isn’t always easy. From political pressure to digital vanishing acts, the work of saving knowledge requires both care and courage.

In a time when websites can be taken down overnight — from climate change pages to stories celebrating diversity — the Wayback Machine ensures they’re not lost forever.

Former Air Force engineer Jessica Peterson, whose achievements were erased from the live web:

“I didn’t know [the Wayback Machine] existed… It gave me some relief.”

Whether you’re a researcher, student, journalist, or citizen — our goal is the same:
Universal access to all knowledge.

If you value a free and open internet, watch this video.
Then explore the Wayback Machine: https://web.archive.org/

]]>