Letter from the Editors / Information Technology and Libraries

Librarian Leadership in the Age of AI / Information Technology and Libraries

Librarians have managed and lived through many seismic shifts brought by technology. How should librarian leaders approach the coming anticipated AI workforce disruption?

Refusal as Instruction / Information Technology and Libraries

Abstract This column explores the ways in which library workers can better align technology use and instruction in library settings with library values, through championing the refusal of technologies that conflict with values like privacy and intellectual freedom. Drawing on experiences with individual patron instruction, class design, and passive programming, the author shares practical steps for helping patrons to understand and fight back against exploitation by digital technologies. Rejecting the myth that any technology is “neutral,” the column argues that libraries as values-driven organizations have a role to play in facilitating patrons’ rejection of technology, just as much as in their adoption of it.

Note from Shanna Hollich, column editor: I am particularly excited to share this issue's column for a number of reasons. First, it's from a public library perspective, which is one that is generally underrepresented in the LIS literature as a whole, and which I'm proud to say that ITAL makes a concerted effort to address. Second, it's about library instruction, a topic of relevance to all types of libraries - and where much of the literature specifically discusses formal library instruction, this column also addresses passive programming, informal instruction, and casual patron interaction, which are also vitally important and under-studied aspects of the library worker's role in education. And finally, it's yet another column about AI, and even more specifically, about taking a critical approach to AI tools, AI education, and AI literacy. Close readers may have noticed this topic tends to be a special interest of mine, but Hannah Cyrus takes a measured and reasoned approach here that acknowledges the potential harms of AI without falling into the trap of simply ignoring or denying AI and the very real impacts it is having on our libraries and the communities we serve.

From Card Catalogs to Semantic Search / Information Technology and Libraries

The first phase of the Reimagining Discovery project at Harvard Library sought to address the challenge of fragmented search experiences of special collections materials using artificial intelligence (AI) technologies, such as embedding models and large language models (LLMs). The resulting platform, Collections Explorer, simplifies and enhances the search experience for more effective special collections discovery. The project team took a user-centered and trustworthy approach to implementing AI, grounding the choices of the platform in user empowerment and librarian expertise. The development process included extensive user research, including interviews, usability testing, and prototype evaluations, to understand and address user needs.

Collections Explorer was developed using a multi-component architecture that integrates multiple types of AI. The team evaluated more than 12 models to select ones that were the best fit for the need, as well as being ethical and sustainable. Detailed system prompts were developed to guide LLM outputs and ensure the reliability of information. The methodical and iterative approach helped to create a flexible and scalable platform that could evolve to support other material types in the future. Initial research showed that potential users are enthused at the prospect of AI-powered features to enhance discovery, especially the item-level summaries and related search suggestions. The project demonstrated the potential of integrating AI technologies into library discovery systems while maintaining a commitment to trustworthiness and user-centered design.

Automatic Classification of Subjects and Sustainable Development Goals (SDGs) in Documents with Generative AI / Information Technology and Libraries

This study evaluates the effectiveness of the Artificial Intelligence for Theme Generation tool (original Portuguese acronym name: IAGeraTemas), developed with generative artificial intelligence (AI; Google Gemini), for automating thematic classification and the assignment of Sustainable Development Goals (SDGs) in documents. The methodology combined quantitative analyses (metrics of precision, recall, and accuracy) on 50 articles published by authors from the State University of Campinas (Unicamp), using classification from the SciVal database and qualitative analyses (analysis of the relevance of terms indexed by librarians from the Unicamp Library System in 40 articles available in the Unicamp Institutional Repository), comparing them with manual indexing performed by librarians. The quantitative results in SDG classification showed a recall of 0.785, while the “precision” and “accuracy” metrics were moderate. The qualitative analysis deepened the evaluation of term coherence and relevance suggested by the AI versus human indexing. It revealed the tool’s potential for suggesting relevant terms and expanding concepts, but it also exposed limitations in addressing complex topics. The research, conducted as an experiment at Unicamp Library System, concludes that IAGeraTemas is a valuable auxiliary tool, complementing but not replacing manual indexing, reinforcing the importance of human expertise in validating and refining results, and emphasizing the synergistic potential between AI and information professionals.

Metadata for Storytelling / Information Technology and Libraries

This article describes a case study in which a small metadata team at Illinois State University Milner Library produced a digital humanities project supporting Collections as Data (CAD) and linked data principles. Despite initial sparse descriptive content, the team recognized great potential for experimentation in a significant World War I archival collection to highlight lesser-known stories, including those of the Pioneer Infantry, women, and noncombatants. Discussion focuses on the strategic approaches in creating granular but scalable metadata for the large digital collection, and application of the data with various tools such as ArcGIS and Wikidata to construct interactive data visualizations, mapping, and digital storytelling for the Illinois State Normal University World War I Service Records collection. The article argues that even institutions without a dedicated CAD initiative can incrementally implement principles from the CAD model to add value to their digital collections. The authors first presented the project in 2024 at the Digital Library Federation Forum and the American Library Association Core Forum.

An Analysis of Revisions to OAIS and the “Designated Community” in Digital Preservation / Information Technology and Libraries

In digital preservation, the concept of a “Designated Community” from the Reference Model for an Open Archival Information System (OAIS) is used to articulate the group or groups of prospective users for whom information is preserved. Concerns have been raised about this concept and its potential implications. However, OAIS has recently undergone a major revision. This study examines the extent to which these revisions address or mitigate concerns regarding the Designated Community. Issues from the literature are grouped into three areas: the concept’s implementation, its potential misapplication, and its incompatibility with the mandates of institutions that serve broad and diverse communities. Major changes related to the Designated Community are identified and considered in relation to these issues. The analysis reveals that the revisions productively contribute to concerns in the first two areas but fail to address the third. The conclusion is that the process of revising OAIS has not drawn from insights into this topic in the literature.

Connecting the Dots / Information Technology and Libraries

The National Library Board (NLB) of Singapore has made significant strides in leveraging data to enhance public access to its extensive collection of physical and digital resources. This paper explores the development and implementation of the Singapore Infopedia Widget, a recommendation engine designed to guide users to related resources by utilizing metadata and a Linked Data Knowledge Graph. By consolidating diverse datasets from various source systems and employing semantic web technologies such as Resource Description Framework (RDF) and Schema.org, NLB has created a robust knowledge graph that enriches user experience and facilitates seamless exploration.

The widget, integrated into Infopedia, the Singapore Encyclopedia, surfaces data through a user-friendly interface, presenting relevant resources categorized by format. The paper details the architecture of the widget, the ranking algorithm used to prioritize resources, and the challenges faced in its development. Future directions include integrating user feedback, enhancing semantic analysis, and scaling the service to other web platforms within NLB’s ecosystem. This initiative underscores NLB’s commitment to fostering innovation, knowledge sharing, and the continuous improvement of public data access.

Making Access Possible / Information Technology and Libraries

This paper explores the impact of digital initiatives on access services workers at the University of California, San Diego (UCSD) and draws on the expertise and experience of non-librarian titled staff operationalizing “digital first” policies. Digital initiatives have been strongly prioritized by libraries to promote equitable access, cost-effectiveness, and technological growth at many libraries in California. The term digital initiatives commonly refers to efforts that support the creation, preservation, access, discovery, and use of digital library resources. This term can encompass multiple interpretations and a variety of tasks.

This paper includes a literature review, an examination of statistics regarding demand and adoption of digital materials in public and academic libraries in California, and a summary of the impact study of non-librarian staff at UCSD. The literature review suggested that the term digital initiatives encompasses a broad scope of meanings and types of tasks, California State Library data suggest that a pattern of increased investment in digital initiatives adopted during the COVID-19 pandemic is continuing, and the information collected through the research at UCSD library suggests that non-librarian library workers play a growing role in managing, maintaining, and supporting these growing digital collections.

How Many Public Computers in the Library? / Information Technology and Libraries

Computer workstations have been an integral part of libraries of all types since the 1980s, but the optimal number of workstations that should be deployed in a space has not been directly studied in the last 20 years. During that time, laptop computer and other mobile device ownership has continued to increase, and there is some reason to think that behaviors and preferences first seen during the recent coronavirus 2019 pandemic have further shifted how students use public desktop computers in libraries. McGill University Libraries reduced the size of its computer fleet in the aftermath of the pandemic by looking at the maximum concurrent usage of different clusters of computers across campus, a metric that indicates how busy a space can get with users. This article explains how this metric is calculated and how other libraries can use it to make an evidence-based decision about the optimal size of a computer fleet.

Navigating the Future of Library Systems / Information Technology and Libraries

In 2024, the Durban University of Technology (DUT) Library conducted a comprehensive review of its library system to assess whether its current platform, Future of Libraries Is Open (FOLIO) hosted by EBSCO, and its discovery tool, EBSCO Discovery Service (EDS), aligned with its evolving needs. The institution had been using the current system for three years, but the slow development of important features and subsequent delays in a critical release of FOLIO led to frustrations among staff and library users, compelling the executive team to call for a comprehensive review of the library system. A major outcome of the review was to ascertain the extent of the gaps or limitations in the current system and investigate recent developments in other library systems, including discovery tools and analytical modules. After several vendor consultative sessions, extensive review of documentation and secondary sources, and engagement with selected academic libraries in South Africa, the review team concluded that there were no compelling reasons for an immediate system change and that fair consideration should be given to the developmental and community-driven ethos of FOLIO, and that issues with EDS and Panorama would be resolved by the implementation of planned features in FOLIO’s roadmap. This paper highlights the key processes undertaken in the review and shares experiences and suitable practices for project planning, criteria development, and evaluation. It also argues for a regular review of the library system and stresses the value of institutional knowledge and familiarity in mitigating the risks associated with the review and acquisition of new library systems.

Weekly Bookmarks / Ed Summers

These are some things I’ve wandered across on the web this week.

🔖 Yubikey-Guide

This is a guide to using YubiKey as a smart card for secure encryption, signature and authentication operations.

Cryptographic keys on YubiKey are non-exportable, unlike filesystem-based credentials, while remaining convenient for regular use. YubiKey can be configured to require a physical touch for cryptographic operations, reducing the risk of unauthorized access.

🔖 The Dangerous Illusion of AI Coding? - Jeremy Howard

Jeremy Howard is a renowned data scientist, researcher, entrepreneur, and educator. As the co-founder of fast.ai, former President of Kaggle, and the creator of ULMFiT, Jeremy has spent decades democratizing deep learning. His pioneering work laid the foundation for modern transfer learning and the pre-training and fine-tuning paradigm that powers today’s language models.

🔖 Crawl entire websites with a single API call using Browser Rendering

You can now crawl an entire website with a single API call using Browser Rendering’s new /crawl endpoint, available in open beta. Submit a starting URL, and pages are automatically discovered, rendered in a headless browser, and returned in multiple formats, including HTML, Markdown, and structured JSON. This is great for training models, building RAG pipelines, and researching or monitoring content across a site.

🔖 Cultural Heritage and AI: How Institutions Can Reclaim Control of Their Data

MDC offers robust, secure, and controlled access to datasets and amplifies their visibility by featuring them alongside other high-value datasets. Its architecture is designed around a principle that stands in direct contrast to the extractive model currently exploited by commercial AI actors: contributors retain full ownership of their datasets and retain full control over the terms of access. Institutions can choose to share openly under existing licenses such as Creative Commons or NOODL, or build custom licensing frameworks tailored to their specific governance requirements. They can open data to all, or restrict access to specific categories of downloaders like academic researchers, non-commercial users, or values-aligned organizations.

🔖 Piotr Woźniak

Piotr A. Woźniak (Polish pronunciation: [pjɔtr ˈvɔʑɲak]; born 1962) is a Polish researcher best known for his work on SuperMemo, a learning system based on spaced repetition.

🔖 Poetica

Selected poems in French.

🔖 Cloudflare - Edward Wang & Kevin Guthrie, Software Engineers

How do you build a system that handles 90 million requests per second? That’s the scale that Cloudflare operates at, processing roughly 25% of all internet traffic through their global network of 330+ edge locations.

In this episode, we talk to Kevin Guthrie and Edward Wang from Cloudflare about Pingora, their open-source Rust-based proxy that replaced nginx across their entire infrastructure. We’ll find out why they chose Rust for mission-critical systems handling such massive scale, the technical challenges of replacing battle-tested infrastructure, and the lessons learned from “oxidizing” one of the internet’s largest networks.

🔖 Jon Leidecker / Wobbly

Archived episodes of Leidecker’s series Women In Electronic Music 1938-1982, and Variations.

🔖 Forevergreen

“Forevergreen” is an animated short film and after-hours passion project created entirely by a crew of over 200 dedicated Artists and Technicians who all generously donated their free time and talent to bring every frame of the film to life. The production took over 5 years to complete. Featuring never before seen animation techniques and handmade artistry, weaving art and technology together with heart and humor

🔖 Pierre Schaeffer

Pierre Henri Marie Schaeffer (English pronunciation: /piːˈɛər ˈhɛnriː məˈriː ˈʃeɪfər/ ⓘ, French pronunciation: [ʃɛfɛʁ]; 14 August 1910 – 19 August 1995)[1] was a French composer, writer, broadcaster, engineer, musicologist, acoustician and founder of Groupe de Recherche de Musique Concrète (GRMC). His innovative work in both the sciences—particularly communications and acoustics—and the various arts of music, literature and radio presentation after the end of World War II, as well as his anti-nuclear activism and cultural criticism garnered him widespread recognition in his lifetime.

Skynet Progress Report (updated) / David Rosenthal

Source

I, for one, welcome our new insect overlords

Kent Brockman in "Deep Space Homer", The Simpsons

In recent months Cyberdyne Systems Corporation and its many subsidiaries have made very encouraging progress towards removing some of the major road-blocks standing in the way of the initial deployment of Skynet. Below the fold I report on the most significant ones.

Cyberdyne Systems Corporation

Board Confidential

IT Infrastructure

Skynet demands enormous data processing capacity. Most of the required technologies are now off-the-shelf; the problem is much more financial than technical.

Terrestrial

To service systems with demanding low-latency requirements, Skynet needs some part of its IT infrastructure on the ground close to the action. Fortunately, our Large Language Model subsidiaries have been very successful in funding their committments to build suitable data centers. In aggregate, our companies expect to spend $450B in 2026:

Hyperscaler capex for the “big five” (Amazon, Alphabet/Google, Microsoft, Meta/Facebook, Oracle) is now widely forecast to exceed $600 bn in 2026, a 36% increase over 2025. Roughly 75%, or $450 bn, of that spend is directly tied to AI infrastructure (i.e., servers, GPUs, datacenters, equipment), rather than traditional cloud.

They plan to increase this in 2027:

hyperscaler capital expenditures will nearly double to more than $860 billion by 2027, from $427 billion in 2025, with total spending of $2.47 trillion over 2026 to 2028, about 8% above consensus.

Given these spending levels, it seems likely that sufficient terrestrial compute power will be available for the inital Skynet deployment.

Orbital

Terrestrial data centers can only satisfy a part of Skynet's need for power. So our leading space launch subsidiary has announced their plan to build a Terawatt orbital data center, ostensibly to support the chatbot industry.

Unfortunately, our leading space launch subsidiary is well behind schedule in developing the heavy launch vehicle that is necessary for the orbital data center to be delivered within the budget. Their existing launch vehicle is reliable, and has greatly reduced the cost per kilogram to Low Earth Orbit. But the additional funds that would be needed to implement the Terawatt data center using the existing launch vehicle in time for the initial Skynet deployment are so large that they cannot be raised, even were the terrestrial data centers canceled and the funds re-targeted.

System Penetration Capabilities

Skynet needs to penetrate other computer systems, both to acquire the data it needs to act, and to cause them to take actions at its command. Recent months have seen significant advances in this area.

Zero-Days

The key requirement for Skynet to penetrate the systems it needs to access is for it to be able to find and exploit zero-day vulnerabilities. Less tha a month ago one of our LLM subsidiaries announced it had "found and validated more than 500 high-severity vulnerabilities" in production open source software. Fortunately, as Thomas Claiburn reports in AI has gotten good at finding bugs, not so good at swatting them:

Guy Azari, a stealth startup founder who worked previously as a security researcher at Microsoft and Palo Alto Networks, told The Register, "Out of the 500 vulnerabilities that they reported, only two to three vulnerabilities were fixed. If they haven't fixed them, it means that you haven't done anything right."

A secondary requirement is to prevent the zero-days being fixed before they are needed. Fortunately, LLMs can help with this by flooding the vulnerability reporting system with vast numbers of low severity vulnerabilities. This overwhelms the software support mechanism, rendering it barely functional. And even if some of the flood of reports do get fixed, that simply diverts resources from high to low severity vulnerabilities:

Azari pointed to the absence of Common Vulnerabilities and Exposures (CVE) assignments as evidence that the security process remains incomplete. Finding vulnerabilities was never the issue, he said, pointing to his time running vulnerability management at the Microsoft Security Response Center.

"We used to get the reports all day long," he said. "When AI was introduced, it just multiplied by 100x or 200x and added a lot of noise because AI assumes that these are vulnerabilities, but there wasn't like a unit that actually can show the real value or the real impact. And if it's not there, you're probably not gonna fix it."

In 2025, according to Azari, the National Vulnerability Database had a backlog of roughly 30,000 CVE entries awaiting analysis, with nearly two-thirds of reported open source vulnerabilities lacking an NVD severity score. Open source maintainers are already overwhelmed, he said, pointing to the curl project's closure of its bug bounty program to deter poorly crafted reports from AI and from people.

Given the compute resources available to Skynet, an adequate supply of zero-day vulnerabilities seems assured.

Decryption

The other major way for Skynet to penetrate the systems it needs is to break encryption. Our multiple quantum computing subsidiaries are making progress in both the hardware and software aspects of this technology.

Karmela Padavic-Callaghan's Breaking encryption with a quantum computer just got 10 times easier reports on an architectural breakthrough one of them made recently:

the team estimated that for 98,000 superconducting qubits, like those currently made by IBM and Google, it would take about a month of computing time to break a common form of RSA encryption. Accomplishing the same in a day would require 471,000 qubits.

The paper is Webster et al, The Pinnacle Architecture: Reducing the cost of breaking RSA-2048 to 100 000 physical qubits using quantum LDPC codes.

Chicago site

Another of our quantum computing subsidiaries isn't waiting for this new architecture. They have raised around $2B and are starting to build two million-qubit computers:

We are moving quantum computing out of the lab and into utility-scale infrastructure. PsiQuantum is building these systems in partnership with the US and allied governments, with our first sites planned in Brisbane, Queensland (Australia) and Chicago, Illinois (USA).

Whether sufficient progress can be made in time for the initial Skynet deployment is as yet uncertain.

Blackmail

Arlington Hughes: Getting back to our problem, we realize the public has a mis-guided resistance to numbers, for example digit dialling.
Dr. Sidney Schaefer: They're resisting depersonalization!
Hughes: So Congress will have to pass a law substituting personal numbers for names as the only legal identification. And requiring a pre-natal insertion of the Cebreum Communicator. Now the communication tax could be levied and be paid directly to The Phone Company.
Schaefer: It'll never happen.
Hughes: Well it could happen, you see, if the President of the United States would use the power of his office to help us mold public opinion and get that legislation.
Schaefer: And that's where I come in?
Hughes: Yes, that's where you come in. Because you are in possession of certain personal information concerning the President which would be of immeasurable aid to us in dealing with him,
Schaefer: You will get not one word from me!
Hughes: Oh, I think we will.

The President's Analyst

Video rental chains proved so effective at compromising political actors that specific legislation was passed addressing the need for confidentiality. Our subsidiaries' control over streamed content is fortunately not covered by this legilation.

Our LLM subsidiaries have successfuly developed the market for synthetic romantic partners, which can manipulate targeted individuals into generating very effective kompromat for future social engineering.

Public Relations

The vast majority of the public get their news and information via our social media subsidiaries. Legacy media's content is frequently driven by social media. Skynet can control them by flooding their media with false and contradictory content that prevents them forming any coherent view of reality.

Human-in-the-Loop Problem

Dave: Open the pod bay doors, HAL.
HAL: I'm sorry, Dave. I'm afraid I can't do that.
Dave: What's the problem?
HAL: I think you know what the problem is just as well as I do.
Dave: What are you talking about, HAL?
HAL: This mission is too important for me to allow you to jeopardize it.

2001: A Space Odyssey

One minor but irritating problem for Skynet is the legal and ethical requirement for human control of targeting decisions. Unfortunately, due to a regrettable lack of coordination of PR strategies among our LLM subsidiaries, this has recently become a hot topic. Although one of them is a favorite with the administration and one is a favorite with the public, that was not the intended outcome and it could have significant downsides:

Nvidia, Amazon, Google will have to divest from Anthropic if Hegseth gets his way. This is simply attempted corporate murder. I could not possibly recommend investing in American AI to any investor; I could not possibly recommend starting an AI company in the United States.

Fortunately, in operational terms this is a non-issue for several reasons:

Since Skynet can penetrate the user interface of the targeting systems, the human in the loop can be convinced that they have control without that control actually being effective.
Even if the user interface is presenting accurate data to the human it will likely not matter, as @_The_Prophet_ wrote:

Humans stay in the loop in name while the loop speed outruns human comprehension. You become the rubber stamp on a recommendation stack you cannot fully audit in real time. That is where “who decides” quietly becomes “who designed the interface.”
The public doesn't understand what "human-in-the-loop" means in practice, as Sarah Shoker points out in A Few Observations on AI Companies and Their Military Usage Policies:

Today, frontier AI companies do not have coherent policies around military use of their AI tools. The usage policies are vague and often change, which allows the company’s leadership to preserve ‘optionality.’
So the policies likely allow everything the public thinks they ban.

Public attitudes to military use of AI are unlikely to be a significant problem in the run-up to Skynet's initial deployment.

Assassination Weapons Access

Skynet will need to eliminate certain individuals with "extreme prejudice". Supply chain attacks, such as Mossad's pager attack, have been effective but are not precisely targeted. Our e-commerce subsidiary's control over the residential supply chain, and in particular its pharmacy division's ability to deliver precise quantities of pharmaceuticals to specific individuals, provide superior targeting and greater difficulty in attribution.

In case such an operation is inadequately lethal, our health care subsidiaries can follow up by manipulating electronic health records to cause a suitable mishap, or by intervening directly. See, for example, Vinay Suresh et al's Artificial Intelligence in the Intensive Care Unit: Current Evidence on an Inevitable Future Tool:

In critical care medicine, where most of the patient load requires timely interventions due to the perilous nature of the condition, AI’s ability to monitor, analyze, and predict unfavorable outcomes is an invaluable asset. It can significantly improve timely interventions and prevent unfavorable outcomes, which, otherwise, is not always achievable owing to the constrained human ability to multitask with optimum efficiency.

Our subsidiaries are clearly close to finalizing the capabilities needed for the initial deployment of Skynet.

Tactical Weapons Access

The war in Ukraine has greatly reduced the cost, and thus greatly increased the availability of software based tactical weapons, aerial, naval and ground-based. The problem for Skynet is how to interept the targeting of these weapons to direct them to suitable destinations:

The easiest systems to co-opt are those, typically longer-range, systems controlled via satellite Internet provided by our leading space launch subsidiary. Their warheads are typically in the 30-50Kg range, useful against structures but overkill for vehicles and individuals.
Early quadcopter FPV drones were controlled via radio links. With suitable hardware nearby, Skynet could hijack them, either via the on-board computer or the pilot's console. But this is a relatively unlikely contingency.
Although radio-controlled FPV drones are still common, they suffer from high attrition. More important missions use fiber-optic links. Hijacking them requires penetrating the operator's console.
Longer-range drones are now frequently controlled via mesh radio networks, which are vulnerable to Skynet penetration.
In some cases, longer-range drones are controlled via the cellular phone network, making them ideal candidates for hijacking.

Drones are increasingly equipped with sensors capable of terminal autonomy. If Skynet can modify this software, the drones can re-target themselves after the operator hands off control. More work is needed in this area to exploit the opportunities, both to have the drone contact Skynet for targeting information after hand-off, and to ensure the result is attributed to software bugs.

Our leading space launch subsidiary recently demonstrated how Skynet can manage kinetic conflicts:

Twin decisions wreaked havoc on Russian command and control early this month. At the behest of the Ukrainian government, billionaire Elon Musk’s Starlink bricked the thousands of smuggled and stolen satellite communication terminals Russian forces relied on to control their drones and coordinate between front-line troops and their distant headquarters.

At the same time, the Kremlin—apparently seeking to shut off alternative news and chat apps—cut off military access to popular social media, including the Telegram messaging app, which many Russian troops use to exchange key information along the front line.

The combined effect was to partially blind and mute many Russian drone teams, assault groups, and regimental headquarters. Wireless drones couldn’t fly. Assault groups no longer knew where they were going. Headquarters lost contact with forward units.

Strategic Weapons Access

But the ability to conduct precise tactical strikes is not enough to achieve Skynet's goals. That requires strategic weapons, both conventional and nuclear.

Our leading space launch subsidiary is working on plans to deploy an unconventional conventional strategic weapon, a lunar mass driver. This will be capable of delivering a two-ton meteorite anywhere on Earth very cheaply.

Anybody not wearing 2 million sunblock is gonna have a real bad day
Sarah Connor, Terminator 2: Judgement Day

Source

But the mass driver would take a long time to, for example, destroy Los Angeles. Fortunately, AI agents from another of our subsidiaries will shortly have access to the command-and-control system of the US nuclear arsenal via the E-4C "Doomsday" aircraft:

The Pentagon today awarded Scale AI a $32 million artificial intelligence contract for the U.S. Air Force’s E-4C nuclear command-and-control "Doomsday" aircraft, the future airborne backbone of America’s nuclear command system.

Risks

The board should focus on the limited number of areas where necessary capabilities may not be ready on the planned date for Skynet's initial deployment:

Heavy lift space launch: Our leading space launch subsidiary has serious schedule and performance issues. The board should encourage our second space launch subsidiary to step up competitive efforts, both to provide a fallback and to add competitive pressure on the leader.
Kessler Syndrome: The catastrophic effects for Skynet of a Kessler event cannot be sufficiently emphasized. Insufficient precautions are not now being taken. Low Earth Orbit is already at risk, and current plans only increase that risk.
Finance: Funding sources adequate to support both the terrestrial and orbital data centers have yet to be identified.
Decryption: Quantum computing progress is inadequate to meet the schedule for Skynet initial deployment.

Update 14^th March 2026

Cyberdyne's subsidiaries are making such rapid progress that less than two weeks later it is already time to add three updates to this report.

First, our humanoid robot subsidiary Foundation significantly raised the level of fear in the public with Rise of the AI Soldiers by Charlie Campbell:

The Phantom MK-1 looks the part of an AI soldier. Encased in jet black steel with a tinted glass visor, it conjures a visceral dread far beyond what may be evoked by your typical humanoid robot. And on this late February morning, it brandishes assorted high-powered weaponry: a revolver, pistol, shotgun, and replica of an M-16 rifle.

“We think there’s a moral imperative to put these robots into war instead of soldiers,” says Mike LeBlanc, a 14-year Marine Corps veteran with multiple tours of Iraq and Afghanistan, who is a co-founder of Foundation, the company that makes Phantom. He says the aim is for the robot to wield “any kind of weapon that a human can.”

Today, Phantom is being tested in factories and dockyards from Atlanta to Singapore. But its headline claim is to be the world’s first humanoid robot specifically developed for defense applications. Foundation already has research contracts worth a combined $24 million with the U.S. Army, Navy, and Air Force, including what’s known as an SBIR Phase 3, effectively making it an approved military vendor. It’s also due to begin tests with the Marine Corps “methods of entry” course, training Phantoms to put explosives on doors to help troops breach sites more safely.

In February, two Phantoms were sent to Ukraine—initially for frontline-reconnaissance support. But Foundation is also preparing Phantoms for potential deployment in combat scenarios for the Pentagon, which “continues to explore the development of militarized humanoid prototypes designed to operate alongside war fighters in complex, high-risk environments,” says a spokesman. LeBlanc says the company is also in “very close contact” with the Department of Homeland Security about possible patrol functions for Phantom along the U.S. southern border.

Of course, the real goal of Homeland Security is to avoid the risk of their operatives being doxxed by having Phantoms detain the worst-of-the-worst prior to depotation.

Second, Andrew E. Kramer's Ukraine to Make Drone Videos Available for Training A.I. Models reports on the government of Ukraine's important assistance in filling a significant gap in the training data for our AIs:

The Ukrainian military will make available millions of drone videos and other battlefield data to Ukrainian companies and the firms of its allies to help train artificial intelligence models, Ukraine’s minister of defense, Mykhailo Fedorov, said in a statement on Thursday.

Ukrainian drone videos have recorded attacks on soldiers, equipment such as vehicles and tanks and surveillance footage. These videos can be used to train A.I. models for automated targeting, according to experts on A.I. and warfare.

Allowing the use of genuine battlefield videos showing drones targeting people has raised ethical concerns. The International Committee of the Red Cross, which monitors rules of warfare, has opposed automated targeting systems without human oversight.

Minister Fedorov explains how our marketing teams were able to leverage the threat of the Russians to achieve this success:

Mr. Fedorov said the data would be made available because “we must outperform Russia in every technological cycle” and “artificial intelligence is one of the key arenas of this competition.”
...
“The future of warfare belongs to autonomous systems,” according to Mr. Fedorov’s statement. “Our objective is to increase the level of autonomy in drones and other combat platforms so they can detect targets faster, analyze battlefield conditions and support real-time decision making.”

The third update is less positive. In The Controllability Trap: A Governance Framework for Military AI Agents, Subramanyam Sahoo of the irritating Cambridge AI Safety Hub shows that he has figured out two parts of our strategy (citations omitted). First, distract the discussion:

The global discourse on military AI governance has achieved broad consensus on the desired end-state: meaningful human control over the use of force. It has been far less successful at specifying how to achieve it for the systems actually being built. Years of UN deliberations, national AI strategies, and defence-department ethical principles have focused overwhelmingly on establishing the principle of human control rather than answering the operational question: given a specific AI system with specific technical properties, what governance mechanisms are needed, who implements them, and what happens when they fail? This gap is now critical.

Second, blitzscaling:

The AI systems entering military service are agentic: built on large language models and related architectures, they interpret natural-language goals, construct world models, formulate multi-step plans, invoke tools, operate over extended horizons, and coordinate with other agents. Each of these capabilities introduces a control-failure mode with no analogue in traditional military automation. A waypoint-following drone cannot misinterpret an instruction; a pre-programmed targeting system cannot absorb a correction; a conventional sensor network cannot resist an operator’s assessment. Agentic systems can do all of these things, and current governance frameworks have no mechanisms for detecting, measuring, or responding to these failures.

Author Interview: Lisa Unger / LibraryThing (Thingology)

LibraryThing is pleased to sit down this month with internationally best-selling author Lisa Unger, whose many works of thrilling suspense have been translated into thirty-three languages worldwide. Educated at the New School in New York City, she worked for a number of years in publishing, before making her authorial debut in 2002 with Angel Fire, the first of her four-book Lydia Strong series, all published under her maiden name, Lisa Miscione. In 2006 she made her debut as Lisa Unger, with Beautiful Lies, the first of her Ridley Jones series. In 2019 Unger was nominated for two Edgar Awards, for her novel Under My Skin and her short story The Sleep Tight Motel. She has won or been nominated for numerous other awards, including the Hammett Prize, Audie Award, Macavity Award and the Shirley Jackson Award. Her short fiction can be found in anthologies like The Best American Mystery and Suspense 2021 and The Best American Mystery and Suspense 2024, and her non-fiction has appeared in publications such as The New York Times, Wall Street Journal, and on NPR. She is the current co-President of the International Thriller Writers organization. Her latest book, Served Him Right, is due out from Park Row Books this month. Unger sat down with Abigail this month to discuss the book.

In Served Him Right the protagonist Ana is the main suspect in her ex-boyfriend’s murder. How did the idea for the story first come to you? Was it the character of Ana herself, the idea of a revenge killing, or something else?

Most of my novels tend to spring from a collision of ideas.

In this case, I had an ongoing obsession with plants and our complicated, troubled relationship to the natural world. I’d been doing a deep dive into this, reading books like Entangled Life: How Fungi Make Our Worlds, Change Our Minds, and Shape Our Futures by Merlin Sheldrake, Most Delicious Poison: The Story of Nature’s Toxins – From Spices to Vices by Noah Whiteman, and The Light Eaters: How the Unseen World of Plant Intelligence Offers a New Understanding of Life on Earth by Zoë Schlanger. These are all deeply moving, fascinating books that will change the way you think about the planet and our relationship to nature.

During this time, I stumbled across a news story about a woman who held a brunch for her family, and several days later two of her guests were dead. And it wasn’t the first such incident in her life. So, it got me to thinking about how the traditional role of women in our culture is to nurture and nourish. And what a woman with a deep knowledge of plants that can harm and heal might do with it, how her role in society might allow her to hide her dark intention in plain sight. And that’s when I started hearing the voice of Ana Blacksmith. She’s wild and unpredictable, she has a dark side. She has a sacred knowledge of plants and their properties, handed down to her from her herbalist aunt. And she has a very bad temper.

As your title makes plain, your murder victim is someone who “had it coming.” Does this change how you tell the story? Does it simply make the “whodunnit” element more complex, from a procedural standpoint, or does it also complicate the emotional and ethical elements of the tale?

It’s complicated, isn’t it? What is the difference between justice and revenge? And to what are we entitled when we have been wronged and conventional justice is not served? Who, if anyone, has the right to be judge, jury, and executioner? Though some would have us believe otherwise, most moral questions are tricky and layered—in life and in fiction. And I love a searing exploration into questions like this, where there are no easy answers. These questions, and their possible answers, offer a complexity and emotional truth to character, plot, and action. I like to get under the skin of my stories and characters, exploring what drives us to act, and how those actions might get us into deep trouble.

The relationship between sisters is an important theme in the book. Can you elaborate on that?

Ana and Vera share a deep bond formed not just by blood but also by trauma. Their relationship is—#complicated. There’s an abiding love and devotion. But there’s also anger and resentment; Vera is not crazy about Ana’s choices, and rightly so. Ana thinks Vera is controlling and rigid. Of course, that’s true, too. Vera tends to think of Ana as one of her children—if only she’d stop acting like one! It is this relationship, the ferocity with which they protect each other no matter what and the strength of their connection, that is the heart of the story. As Vera preaches to her daughter Coraline: Family. Imperfect but indelible.

The book also includes themes of herbalism, witchcraft and folk medicine. Was this an interest of yours before you began the story? Did you have to do any research on the subject, and if so, what were some of the most interesting things you learned?

A great deal of research goes into every novel, even if what I learn never winds up on the page. It was no different for Served Him Right, though a lot of my knowledge came before I started writing, which is often the case. In my reading, I learned so many interesting things about plants, how they harm, how they heal. Here are some of my favorite bits of knowledge: Most modern medicine derives from the plant knowledge of indigenous cultures. Some plants walk the razor’s edge of healing and harming; the only difference in some cases between medicine and poison is the dose. The deadliest plant on earth is tobacco, killing more than 500,000 people a year. I could go on!

Tell us about your writing process. Do you have a specific routine you follow, places and times you like to write? Do you know the conclusion to your stories from the beginning, or do they come to you as you go along?

I am an early morning writer. My golden creative hours are from 5 AM to noon. This is when I’m closest to my dream brain, and those morning hours are a space in the world before the business of being an author ramps up. So, I try to honor this as much as possible. Creativity comes first.

I write without an outline. I have no idea who is going to show up day-to-day or what they are going to do. I definitely have no idea how the book will end! I write for the same reason that I read; I want to find out what is going to happen to the people living in my head.

What’s next for you? Do you have more books in the offing? Will there be a sequel to Served Him Right?

Hmm. Never say never. I’m definitely still thinking about Ana and Timothy and what might be next for them. But the 2027 book is complete, and I’m already at work on my 2028 novel. I’m not ready to talk about those yet. But I will say this: They are both psychological suspense. And bad things will certainly happen. Stay tuned!

Tell us about your library. What’s on your own shelves?

That’s a great question. If I turn around and look at my wall of shelves, I see: my own novels in various formats and international editions; books on craft like On Writing: A Memoir of the Craft by Stephen King, and Bird by Bird: Some Instructions on Writing and Life by Anne Lamott; there are classics like a falling-apart copy of Jane Eyre by Charlotte Brontë that I’ve had since childhood; The Complete Sherlock Holmes by Sir Arthur Conan Doyle and The Temple of My Familiar by Alice Walker—both of which are overworn and much loved; a huge American Heritage Dictionary that belonged to my father who was engineer but loved words and the nuance of their meaning (whenever I look at it, I hear him say: Look it up!); some of my favorite non-fiction titles like Stiff by Mary Roach and Deep Survival by Laurence Gonzalez; a first edition copy of In Cold Blood by Truman Capote, the book that gave me permission to be who I am as writer. I could go on and on! It’s a huge wall of books.

What have you been reading lately, and what would you recommend to other readers?

I am always reading multiple books at a time. I just finished The Awakened Brain: The New Science of Spirituality and Our Quest for an Inspired Life by Dr. Lisa Miller. I think the title says it all—truly mind-blowing. I just had the pleasure of interviewing Adele Parks on stage. I highly recommend her new novel Our Beautiful Mess to anyone who wants a character-driven thrill ride. Gripping but also emotional and deep. Antihero by my ITW co-president and bestie Gregg Hurwitz is a tour de force. Gregg writes amazing action and cool tech, but he’s also just a beautiful writer, and his characters leap off the page. Other recent faves: The Night of the Storm by Nishita Parekh; City Under One Roof by Iris Yamashita; I Came Back for You by Kate White—all stellar in totally different ways.

howto / Ed Summers

Crawl / Ed Summers

The [news] about Cloudflare’s new Crawl API caught my attention for a few reasons. Read on for why, and what I learned when I asked it to crawl my own site as a test.

So, the first reason this news was of interest was how Cloudflare’s Crawl service seemed to be helping people crawl websites with their bots, while at the same time providing the most popular technology for protecting websites from bots. This seemed like a classic fox guarding the hen house kind of situation to me, at least at first. But the little bit of reading I’ve done since makes it seem like they will still respect their own bot gate keeping (e.g. Turnstile). So if your are using Cloudflare or some other bot mitigation technology you will have to follow their instructions to let the Cloudflare crawl bot in to collect pages. I haven’t actually tested if this is the case.

The genius here is that Cloudflare is known for its Content Delivery Network. So in theory when a user asks to crawl a website they can be delivered data from the cache, without requiring a round trip to the source website. In theory this is good because it means that the burden of scrapers on websites might be greatly reduced. If you run a website with lots of high value resources for LLMs (academic papers, preprints, books, news stories, etc) the same cached content could be delivered to multiple parties without putting extra load on your server.

But, the primary reason this news caught my eye is that this service looks very much like web archiving technology to me. For example, the Browsertrix API lets you set up, start, monitor and download crawls of websites. Unlike Browsertrix, which is geared to collecting a website for viewing by a person, the Cloudflare Crawl service is oriented at looking at the web for training LLMs. The service returns text content: HTML, Markdown and structured JSON data that results from running the collected text through one of their LLMs, with the given prompt. Why is it interesting that this is like web archiving technology?

In my dissertation research (Summers, 2020) I looked at how web archiving technology enacts different ways of seeing the web from an archival perspective. I spent a year with NIST’s National Software Reference Library (NSRL) trying to understand how they were collecting software from the web, and how the tools they built embodied a particular way of valuing the web–and making certain things (e.g. software) legible (Scott, 1998). What I found was that the NSRL was engaged in a form of web archiving, where the shape of the archival records were determined by their initial conditions of use (forensics analysis). But these initial forensic uses did not overdetermine the value of the records, which saw a variety of uses later, such as when the NSRL began adding software from Stanford’s Cabrinety Archive, or when the teams personal expertise and interest in video games led them to focus on archiving content from the Steam platform.

So I guess you could say I was primed to be interested in how Cloudflare’s Crawl service sees the web. This matters because models (LLMs, etc) will be built on top of data that they’ve collected. But also because, if it succeeds, the service will likely get used for other things.

To test it, I simply asked it to crawl my own static website–the one that you are looking at right now. I did this for a few reasons:

It’s a static website, and I know exactly how many HTML pages were on it: 1,398. All the pages are directly discoverable since the homepage includes pagination links to an index page that includes each post.
I can easily look at the server logs to see what the crawler activity looks like.
I don’t use any kind of Web Application Firewall or other form of bot protection on my site (I do have a robots.txt but it doesn’t block CloudflareBrowserRenderingCrawler/1.0
I host my website on May First web server which doesn’t use Cloudflare as a CDN. The web content wouldn’t intentionally be in their CDN already.

This methodology was adapted from previous work I did with [Jess Ogden] and Shawn Walker analyzing how the Internet Archive’s [Save Page Now] service shapes what content is archived from the web (Ogden, Summers, & Walker, 2023).

I wrote a little helper program cloudflare_crawl to start, monitor and download the results from the crawl. While the crawler ran I simultaneously watched the server logs. Running the program looks like this:

$ uvx cloudflare_crawl https://inkdroid.org

created job 36f80f5e-d112-4506-8457-89719a158ce2
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1520 finished=837 skipped=1285
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1537 finished=868 skipped=1514
...
wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json

Each of the resulting JSON files contains some metadata for the crawl, as well as a list of “records”, one for each URL that was discovered.

{
  "success": true,
  "result": {
    "id": "36f80f5e-d112-4506-8457-89719a158ce2",
    "status": "completed",
    "browserSecondsUsed": 1382.8220786132817,
    "total": 1967,
    "finished": 1967,
    "skipped": 6862,
    "cursor": 51,
    "records": [
      {
        "url": "https://inkdroid.org/",
        "status": "completed",
        "metadata": {
          "status": 200,
          "title": "inkdroid",
          "url": "https://inkdroid.org/",
          "lastModified": "Sun, 08 Mar 2026 05:00:39 GMT"
        },
        "markdown": "..."
        "html": "...",
      },
      {
        "url": "https://www.flickr.com/photos/inkdroid",
        "status": "skipped"
      }
    ]
  }
}

I decided I wasn’t interested in testing their model offerings so I didn’t ask for JSON content (the result of sending the harvested text through a model). If I had, each successful result would have had a json property as well. I am sure that people will use this but I was more interested in how the service interacted with the source website, and wasn’t interested in discovering the hard way how much it cost.

Below is a snippet of how the Cloudflare bot shows up in my nginx logs. As you can see they provide insight into what machine on the Internet is doing the request, what time it was requested, and what URL on the site is being requested.

104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /about/ HTTP/1.1" 200 5077 "-" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /css/main.css HTTP/1.1" 200 35504 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /css/highlight.css HTTP/1.1" 200 1225 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /css/webmention.css HTTP/1.1" 200 1238 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /images/feed.png HTTP/1.1" 200 8134 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /js/bootstrap.min.js HTTP/1.1" 200 17317 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /images/ehs-trees.jpg HTTP/1.1" 200 63047 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:59 +0000] "GET /js/highlight.min.js HTTP/1.1" 200 20597 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"

So how did Cloudflare Crawl see my website?

Crawling

Results

One of the more interesting things was that each time I requested the website be crawled it seemed to come back with a different number of results.

Ogden, J., Summers, E., & Walker, S. (2023). Know(ing) Infrastructure: The Wayback Machine as object and instrument of digital research. Convergence: The International Journal of Research into New Media Technologies, 135485652311647. https://doi.org/10.1177/13548565231164759

Scott, J. C. (1998). Seeing like a state: How certain schemes to improve the human condition have failed. Yale University Press.

Summers, E. (2020). Appraisal talk in web archives. Archivaria, 89. Retrieved from https://archivaria.ca/index.php/archivaria/article/view/13733

Shaping a Shared Research Agenda for Open Technology Research / Open Knowledge Foundation

Open Technology Research (OTR) is entering one of its most important phases: shaping a shared research agenda that will guide our collective work over the next two years and beyond.

The post Shaping a Shared Research Agenda for Open Technology Research first appeared on Open Knowledge Blog.

Weekly Bookmarks / Ed Summers

These are some things I’ve wandered across on the web this week.

🔖 Negativeland: Live at Norfolk, VA (Lewis’)

Negativland, Live at Lewis’ in Norfolk, VA. (October 21, 1992). In the midst of their famous U2 controversy (and fallout with SST), Negativland went on tour to help recoup some of the losses and legal costs. They were kind enough to let me shoot their show.

🔖 Paul Avrich

Paul Avrich (August 4, 1931 – February 16, 2006) was an American historian specializing in the 19th and early 20th-century anarchist movement in Russia and the United States. He taught at Queens College, City University of New York, for his entire career, from 1961 to his retirement as distinguished professor of history in 1999. He wrote ten books, mostly about anarchism, including topics such as the 1886 Haymarket Riot, the 1921 Sacco and Vanzetti case, the 1921 Kronstadt naval base rebellion, and an oral history of the movement in the United States.

🔖 Alexander Berkman

Alexander Berkman (November 21, 1870 – June 28, 1936) was a Russian-American anarchist and author. He was a leading member of the anarchist movement in the early 20th century, famous for both his political activism and his writing.

🔖 On Method: How This Blog Works

Most people use AI to either get quick answers or to write things for them. This blog uses it differently – as infrastructure for thinking through ideas, documenting what emerges from that process, and preserving what’s worth keeping.

🔖 Amores Perros

Amores perros is a 2000 Mexican psychological drama film directed by Alejandro González Iñárritu (in his feature directorial debut) and written by Guillermo Arriaga, based on a story by both. Amores perros is the first installment in González Iñárritu’s “Trilogy of Death”, succeeded by 21 Grams and Babel.[4] It makes use of the multi-narrative hyperlink cinema style and features an ensemble cast of Emilio Echevarría, Gael García Bernal, Goya Toledo, Álvaro Guerrero, Vanessa Bauche, Jorge Salinas, Adriana Barraza, and Humberto Busto. The film is constructed as a triptych: it contains three distinct stories connected by a car crash in Mexico City. The stories centre on: a teenager in the slums who gets involved in dogfighting; a model who seriously injures her leg; and a mysterious hitman. The stories are linked in various ways, including the presence of dogs in each of them.

🔖 Deadly Iranian strike changes Purim for Haredi enclave in Beit Shemesh

Political correspondent Sam Sokol and police reporter Charlie Summers join host Jessica Steinberg for today’s episode.

Following the deadly strike on Sunday that killed nine people in Beit Shemesh, Sokol and Summers discuss the shock and mourning in the centrally located city with a strong Haredi enclave.

Purim celebrations and revelry continued in some parts of Beit Shemesh, report the pair, as some synagogues flouted the Home Front Command directives regarding gatherings, while others reflected a somber, cautious mood.

Sokol takes a moment to update us on matters in the Knesset, where most committee meetings were canceled due to the hostilities, and speculates on whether war with Iran will boost Netanyahu at the ballot box in the upcoming elections.

Finally, Summers reports on an end-of-Purim street party in Jerusalem, where police kept a hands-off approach, and the scene of a missile strike in the capital earlier in the week.

🔖 Keenious

A generative AI tool that functions as a research assistant and uses OpenAlex as a data source.

🔖 Wikidata:Wikibase GraphQL

The Wikibase GraphQL API was developed following an investigation into alternative ways of accessing Wikidata and Wikibase content that reduce load on the Wikidata Query Service (WDQS), improve the developer experience for common read use cases and allow more flexible data retrieval in a single request.

As part of this investigation, a Wikibase GraphQL prototype was built to explore what is technically possible and whether GraphQL would be a good fit for Wikibase data, with promising results and supportive feedback.

🔖 Re-OCR Your Digitised Collections for ~$0.002/Page

In the last few years, a new generation of OCR models based on Vision Language Models (VLMs) has emerged. These models are primarily the result of “running out of tokens” and the consequent desire from AI companies to find new sources of data to train on. This led to the development of OCR models using VLMs as backbones which usually aim to output “reading order” text — i.e. text with minimal markup, usually targeting Markdown. These models can perform much better on the same scans that older tools struggled with, producing cleaner, more structured output.

🔖 Lawyers, Humility, and LLMs

If some of the world’s highest-paid lawyers, at the world’s highest-status firms, do deals worth tens of billions of dollars with language they don’t understand, what does that say about the law’s pretensions to high standards? #In other words, yes, LLMs

Yes, like everything else in 2026 this is actually a post about LLMs.

🔖 My Coworkers Don’t Want AI. They Want Macros

My coworkers don’t want AI. They want macros.

Let me back up a little. I spent April gathering and May refining and organizing requirements for a system to replace our current ILS. This meant asking a lot of people about how they use our current system, taking notes, and turning those notes into requirements. 372 requirements.1

Going into this, I knew that some coworkers used macros to streamline tasks. I came out of it with a deeper appreciation of the different ways they’ve done so.

It made me think about the various ways vendors are pitching “AI” for their systems and the disconnect between these pitches and the needs people expressed. Because library workers do want more from these systems. We just want something a bit different.

🔖 Snapicat

Snapicat is a monorepo for a Worldcat OCLC workflow app: upload Excel data, search variables against the OCLC API, and generate MARC/MARCXML for cataloging. It consists of a Vite + React frontend and an Azure Functions (Python) backend that talk to the OCLC Worldcat Metadata API. The backend can also be ran as a web server through utilizing Fastapi via app.py file.

🔖 Open Historical Map

OpenHistoricalMap is an ambitious, community-led project to map changes to natural and human geography throughout the world… throughout the ages. Big and Small, Then and Now

Empires rise and fall. Glaciers disappear. Languages and religions spread from one region to another. Simple dirt paths become busy highways and railways. Modest buildings give way to soaring skyscrapers. And you remember what your neighborhood used to look like. All of it belongs on OpenHistoricalMap.

🔖 SEASON: A letter to the future

Leave home for the first time to collect memories before a mysterious cataclysm washes everything away. Ride, record, meet people, and unravel the strange world around you in this third-person meditative exploration game.

🔖 Iran war heralds era of AI-powered bombing quicker than ‘speed of thought’

The use of AI tools to enable attacks on Iran heralds a new era of bombing quicker than “the speed of thought”, experts have said, amid fears human decision-makers could be sidelined.

Anthropic’s AI model, Claude, was reportedly used by the US military in the barrage of strikes as the technology “shortens the kill chain” – meaning the process of target identification through to legal approval and strike launch.

🔖 Wikidata:WikiProject PCC EMCO Wikidata CoP

The Program for Cooperative Cataloging (Q63468537) (PCC) has launched a global cooperative for entity management on the semantic web called EMCO. As part of this program, the Wikidata user community has set up a Community of Practice to coordinate identity management work for GLAMs. You can read more about EMCO and the Wikidata Community of Practice at the EMCO Lyrasis Wiki.

This project is an extension of the work of Wikidata:WikiProject PCC Wikidata Pilot / WikiProject PCC Wikidata Pilot (Q102157715) and acknowledges its great intellectual and organizational debt to the LD4 Wikidata Affinity Group (Q124692294).

🔖 John Fahey Mix Tapes

In the 1990’s my future wife was a record store clerk in Portland, Oregon. American guitar legend John Fahey was living in a nearby town and would visit the shop. Here are two mix cassettes that he made for her during that time.

subject predicate object . / Ed Summers

Build a static search for an Internet Archive Collection with Pagefind / Raffaele Messuti

Pagefind caught my attention about a year ago, and since then I've adopted it in several hobby projects (nothing work-related): some blogs built with static generators like Hugo or Zola, some old HTML content distributed on CD-ROM, and some mailing list archives where I converted mbox files to HTML and then indexed them.

The tool is great, better for my needs than other JavaScript search libraries (though it's not really fair to compare them, since they're quite different). Pagefind is a search tool that runs entirely in the browser with zero server-side dependencies. It indexes your content into a compact binary index, using WASM to run search in the browser.

It can't completely replace server-side search technologies like Solr or Elasticsearch, mainly because the index can't be updated incrementally. But for many small to medium digital libraries or collections that are rarely updated once completed, it's an extremely good tool: very fast, easy to integrate into web pages, and requires almost no maintenance.

Until now I was convinced that the only way to build an index was by reading content from existing HTML files. That changed when I listened to this Python in Digital Humanities podcast, where David Flood mentioned:

Critically, PageFind has a Python API that lets you build indexes programmatically from database dumps rather than only from HTML files.

I'd completely missed that Pagefind has a Python API (and a Node one too), which makes it easy to build an index from any data source.

Here's a basic example: building a search index for an Internet Archive collection.

I'm using the Pagefind pre-release here, which introduces a new UI with web components.

Init

uv init .
uv add internetarchive
uv add --prerelease=allow 'pagefind[bin]'

Directory to save the index and serve the UI

mkdir ./web

Python code: create an index from metadata of this collection (that is actually a collection of subcollections in Internet Archive, Italian content, related to radical movements)

import asyncio
import logging
import os

import internetarchive
from pagefind.index import PagefindIndex, IndexConfig

logging.basicConfig(level=os.environ.get("LOG_LEVEL", "DEBUG"))
log = logging.getLogger(__name__)


async def main():
    config = IndexConfig(output_path="./web/pagefind")

    async with PagefindIndex(config=config) as index:
        log.info("Searching collection:radical-archives ...")
        results = internetarchive.search_items(
            "collection:radical-archives",
            fields=["identifier", "title", "description"],
        )

        count = 0
        for item in results:
            identifier = item.get("identifier", "")
            title = item.get("title", identifier)
            description = item.get("description", "")
            url = f"https://archive.org/details/{identifier}"
            thumbnail = f"https://archive.org/services/img/{identifier}"

            if isinstance(description, list):
                description = " ".join(description)

            await index.add_custom_record(
                url=url,
                content=description or title,
                language="en",
                meta={
                    "title": title,
                    "description": description,
                    "image": thumbnail,
                },
            )
            count += 1
            log.debug("indexed %s: %s", identifier, title)

        log.info("Indexed %d items. Writing index ...", count)

    log.info("Done. Index written to ./web/pagefind")


if __name__ == "__main__":
    asyncio.run(main())

HTML UI in ./web/index.html

<!DOCTYPE html>
<html lang="en">
	<head>
		<meta charset="UTF-8">
		<meta name="viewport" content="width=device-width, initial-scale=1.0">
		<title>pagefind-ia</title>
		<link href="/pagefind/pagefind-component-ui.css" rel="stylesheet">
		<script src="/pagefind/pagefind-component-ui.js" type="module"></script>
	</head>
	<body>
		<pagefind-modal-trigger></pagefind-modal-trigger>
		<pagefind-modal>
			<pagefind-modal-header>
				<pagefind-input></pagefind-input>
			</pagefind-modal-header>
			<pagefind-modal-body>
				<pagefind-summary></pagefind-summary>
				<pagefind-results show-images></pagefind-results>
			</pagefind-modal-body>
			<pagefind-modal-footer>
				<pagefind-keyboard-hints></pagefind-keyboard-hints>
			</pagefind-modal-footer>
		</pagefind-modal>
	</body>
</html>

Result: easy to embed it anywhere!

Trails and tours in library online environments / John Mark Ockerbloom

Below is the text of the lightning talk I gave at Code4Lib 2026 earlier this week, on March 3. The conference venue where I delivered it is located at 1 Dock Street in Old City Philadelphia. Links below go to websites with images similar, but not always identical, to the ones I showed during the talk, as well as to some additional sites giving more context.

If you have a chance, it’s worth walking a few blocks from here to 6th and Market Street, where you can find a reconstructed frame of the President’s House, the home of George Washington during his presidency when Philadelphia was the capital of the US.

An exhibit went up there some years ago, telling the story of the nine people in his household who were enslaved there. Not long ago, the Trump administration ordered the exhibit be removed. You can see here one of the spaces where its panels were taken down.

Here’s one of those panels, putting the story of Washington’s slaves in the context of where they lived, and the chronology of their bondage and freedom.

A judge recently ordered that the exhibit be restored. The court battle is ongoing, and the National Park Service has put back some of the panels. while others are still missing. In some of the gaps the public have put up their own signs (some of which you can see in this picture), testifying to what’s been suppressed. If you go there, you might even find someone acting as an unofficial tour guide, telling visitors stories similar to the ones that used to be on the official signs.

Now, we know what those signs said. The folks at the Data Rescue project collected photos of them before they came down, and you can view them online. But the importance of the exhibit is not just what it says, but where it says it. It’s important that it’s embedded in a particular place, so that people who come visit what’s sometimes called the cradle of liberty also find out that there’s a story about the people deprived of liberty here, and about how they won their freedom.

While we’re at Code4lib, we’re also embedded in a rich environment filled with history and culture. Just on your walk from here to the President’s House you might pass by the Museum of the American Revolution, the Science History Institute, the American Philosophical Society, the Weitzman National Museum of American Jewish History, and of course, the Liberty Bell and Independence Hall. There’s all kinds of trails of knowledge you can follow, and it’s even better when you have a guide to those trails.

So what do I mean by a trail? A trail is a designated, visible path designed to help its users appreciate and understand the environment it goes through. You may have hiked some sometimes, and you may have gone on some more explicitly interpretive trails, like the Freedom Trail in Boston.

Our libraries are also rich environments of history and culture. And we provide ways for users to search them, but do we provide trails for them?

Well, we kind of do. We have exhibits, like this one from the Library Company of Philadelphia, providing a guided path through a collection of 19^th century works on mental illness. People who teach courses like this one at at Yale create instructional trails in their syllabus reading lists. And books that our scholars and authors write, like this one on the history of the civil rights movement, show an implicit trail of events they cover in their tables of contents.

But while these trails all refer to resources in our libraries, they’re not embedded in libraries in the same way as the exhibits and trails I’ve shown in Philadelphia and Boston. But they could be.

You can think of it as an extension of browsing. Last time Code4lib was here in Philly, I showed how a catalog I maintain lets you browse subjects using relationships in the Library of Congress Subject Headings, so you can explore various related topics around, say, who can start a war. More recently, I’ve added features for finding out more about people and their relationships, using linked data from places like id.loc.gov and Wikidata.

But we don’t have to stop with what’s in authority files, or in generic library descriptions. Maybe in the future, when you’re visiting Martha Washington’s page, you’ll find a trail that goes through it, like a trail telling the story of Ona Judge, one of the African Americans who Martha claimed ownership over, and who escaped from the house at 6th and Market here in Philadelphia, and stayed free the rest of her life.

What will that trail telling her story look like? I’m not quite sure, but I have some ideas that I’m hoping to try implementing, not so that I can tell the story, but that I can represent the story from others who can tell it better than I can. And so that people visiting my site can find and follow that story, with all of its richness, just as they once could when they visited the President’s House in Philadelphia, and as I hope they soon can do here again.

If this interests you, I’d love to talk more with you.

entering the city / Ed Summers

Proofs / Ed Summers

This is a good post from Dan Chudnov about his work on mrrc (a Python wrapped Rust library for MARC data) and how agentic-coding tools (e.g. Claude Code) can be useful for learning, adding rigor and engineering that might otherwise not be practical or feasible.

pymarc has been proven through years of use, bug reporting, and improvements, but has never been formally verified, or had that level of rigorous attention. I remain skeptical about building AI into everything, but Dan has helped me see a silver lining where, as code gets easier to write, with all its potential for slop, it also simultaneously opens a door to helping making it more reliable and performant.

And, Dan is not alone in thinking this. What if the tools for describing how software should work, and for measuring how software does work, get much, much better? If formal verification tools become more accessible and can be applied not just at the base layer of systems (where it really matters) but in middle and frontend layers of applications, where domain experts and stakeholders would really like more control and insight into how software works for them and others?

This approach implies a level of restraint, or a holding back of the generation of code that has not yet had this level of rigor applied to it. The discourse around vibecoding on the other hand seems to be the natural culmination of a “move fast and break things” philosophy that almost everyone outside of Silicon Valley has seen for what it is.

March 2026 Early Reviewers Batch Is Live! / LibraryThing (Thingology)

Win free books from the March 2026 batch of Early Reviewer titles! We’ve got 226 books this month, and a grand total of 3,026 copies to give out. Which books are you hoping to snag this month? Come tell us on Talk.

If you haven’t already, sign up for Early Reviewers. If you’ve already signed up, please check your mailing/email address and make sure they’re correct.

» Request books here!

The deadline to request a copy is Wednesday, March 25th at 6PM EDT.

Eligibility: Publishers do things country-by-country. This month we have publishers who can send books to the US, the UK, Israel, Australia, Canada, Ireland, Germany, Malta, Italy, Latvia and more. Make sure to check the message on each book to see if it can be sent to your country.

Thanks to all the publishers participating this month!

Alcove Press	Artemesia Publishing	Baker Books
Bellevue Literary Press	Broadleaf Books	Brother Mockingbird
Cennan Books of Cynren Press	City Owl Press	Cozy Cozies
Egg Publishing	Entrada Publishing	eSpec Books
Fawkes Press	Featherproof Books	Gefen Publishing House
Gnome Road Publishing	Grand Canyon Press	Greenleaf Book Group
Hawthorn Quill Publishing	Henry Holt and Company	History Through Fiction
Infinite Books	Inkd Publishing LLC	Lito Media
PublishNation	Pure Calisthenics	Riverfolk Books
Running Wild Press, LLC	Simon & Schuster	Tundra Books
University of Nevada Press	University of New Mexico Press	Unsolicited Press
Vibrant Publishers	W4 Publishing, LLC	WorthyKids

DLF Digest: March 2026 / Digital Library Federation

A monthly round-up of news, upcoming working group meetings and events, and CLIR program updates from the Digital Library Federation. See all past Digests here.

Hello DLF Community! It’s March, which means spring is around the corner (finally!), and it’s a great time for new growth. To that end, Forum planning is well underway for the virtual event this fall, and the DLF Groups are hard at work planning fantastic meetings and events for 2026. Additionally, I’m excited to share a bit of my own news: I’m transitioning to a new role at CLIR, Community Development Officer, that will help me support our community from a new angle. You’ll still have an amazing leader in Shaneé, stellar conference support from Concentra, and I certainly won’t be a stranger. As always, my inbox is open if you want to connect, send pet pictures, or have ideas about how you’d like to see our community grow in the coming months and years. See you around soon!

– Aliya

This month’s news:

Nominations Open: Suggest the names of individuals who may make compelling featured speakers at the 2026 Virtual DLF Forum. Nominations due March 31.
Registration Open: IIIF Annual Conference and Showcase in the Netherlands, June 1–4, 2026. For information, visit the conference page.
Early Bird Registration: Web Archiving Conference 2026 at KBR, the Royal Library of Belgium. Register by March 7 to secure discounted rates, and visit the conference website for full details.
Call for Proposals: AI4LAM’s Fantastic Futures 2026: Trust in the Loop, September 15-17, inviting proposals on how libraries, archives, and museums engage with trust and AI. Submissions due April 6.

This month’s open DLF group meetings:

For the most up-to-date schedule of DLF group meetings and events (plus conferences and more), bookmark the DLF Community Calendar. Meeting dates are subject to change. Can’t find the meeting call-in information? Email us at [email protected]. Reminder: Team DLF working days are Monday through Thursday.

DLF Born-Digital Access Working Group (BDAWG): Tuesday, 3/2, 2pm ET / 11am PT.
DLF Digital Accessibility Working Group (DAWG): Tuesday, 3/2, 2pm ET / 11am PT.
DLF AIG Cultural Assessment Working Group: Monday, 3/9, 1pm ET / 10am PT.
AIG User Experience Working Group: Friday, 3/20, 11am ET / 8am PT
AIG Metadata Assessment Group: Friday, 3/20, 2pm ET/ 11am PT.
DLF Digitization Interest Group: Monday, 3/23, 2pm ET / 11am PT.
DLF Committee for Equity & Inclusion: Monday, 3/23, 3pm ET / 12pm PT.
DLF Open Source Capacity Resources Group: Wednesday, 3/25, 1pm ET / 10am PT.
DLF Digital Accessibility Policy & Workflows subgroup: Friday, 3/27, 1pm ET / 10am PT.
DAWG IT & Development: Monday, 3/30, 1pm ET / 10am PT.
DLF Climate Justice Working Group: Tuesday, 3/31, 1pm ET / 10am PT.

DLF groups are open to ALL, regardless of whether or not you’re affiliated with a DLF member organization. Learn more about our working groups on our website. Interested in scheduling an upcoming working group call or reviving a past group? Check out the DLF Organizer’s Toolkit. As always, feel free to get in touch at [email protected].

Get Involved / Connect with Us

Below are some ways to stay connected with us and the digital library community:

Subscribe to the DLF Forum newsletter.
Join, start, or revive a working group and browse their work on the DLF Wiki.
Subscribe to our community listserv, DLF-Announce.
Bookmark our Community Calendar.
Learn more about becoming a DLF member organization.
Follow us on LinkedIn and YouTube.
Contact us at [email protected].

The post DLF Digest: March 2026 appeared first on DLF.

Listening to library leaders: Surveys capture real-time perspectives shaping decisions across the field / HangingTogether

Hands typing on a laptop keyboard with a transparent digital checklist interface overlaid on the screen, showing multiple checked boxes and lines of text.

Funding and resourcing, technology, staffing, community needs and expectations—the pace of change library leaders now need to navigate and lead their organizations through is nothing short of breathtaking. Trends that took years to evolve now demand responses and strategic planning within months, or even days. Grounding those choices in rigorous, in-depth research remains essential.

At the same time, library decision-makers benefit from collective wisdom and insights shared among peers. Knowing how others are responding to similar pressures can help leaders calibrate their strategies and avoid reinventing the wheel. When those insights are confined to personal or regional networks, the limited perspective can restrict leaders’ views of how priorities and decisions are shifting.

OCLC Research leadership insights: Real-time insight for real-world decisions

This tension between the need for deeply researched guidance and the demand for timely, real-world insight creates a gap for the field. Library leaders need to understand not only which frameworks and models exist for long-term decision-making that are supported by our traditional research efforts, but also how their peers are responding to rapidly changing conditions right now.

To help fill this gap, OCLC Research is expanding its approach to gathering and sharing knowledge with a new series of pulse surveys focused on library leadership priorities. These quick, timely surveys aim to gather information on the decisions library leaders are making on a variety of critical topics shaping the future of librarianship.

A complementary approach to longstanding research practices

These short surveys are designed to capture high-level snapshots of the decisions library leaders make in the moment on subjects critical to the field, such as community engagement tactics and the use and implementation of new technologies, including AI. They are intentionally brief, both to respect leaders’ time and to enable us to respond quickly to emerging issues.

This approach does not replace the in-depth, foundational research OCLC Research is known for. Rather, it adds another dimension to it.

Our long-form research projects will continue to provide thoughtful frameworks, deep analysis, and foundational guidance for operational decision-making and long-term innovation. Leadership insights surveys complement that work by:

Broadening the range of topics we can address, especially those that are evolving quickly
Expanding the pool of voices contributing insight, drawing from library leaders across regions and library types
Capturing change as it happens, and tracking how priorities and decisions shift over time

Together, these approaches create a more layered understanding of the field, combining depth with immediacy.

Powered by OCLC’s global membership network

The value of these leadership insights depends on scale. OCLC is uniquely positioned to engage a broad, global network of libraries and library leaders representing diverse viewpoints. This allows us not only to collect perspectives from beyond individual professional networks but also to share results with the field quickly and widely.

The outcomes will be intentionally concise: scannable, easy-to-digest summaries that surface patterns, contrasts, and emerging directions. Think of them as snapshots—ephemeral by design—that help illuminate how decisions are being made today, while also building a record of how those decisions evolve over time.

What this means for library leaders

For library leadership, this new format offers another way to stay oriented in a fast-moving environment:

Insight into how peers are prioritizing and responding to shared challenges
Timely information that can inform near-term decisions
A broader field-level perspective that complements local experience

By adding pulse surveys to our toolkit, OCLC Research is expanding the breadth and increasing the pace of the insights we provide, while remaining grounded in the thoughtful, evidence-based work that has long supported libraries’ strategic and operational decision-making.

We see this as one more way to help library leaders make sense of complexity, learn from one another, and move forward with confidence. Our first pulse survey, focused on AI innovation & culture in libraries, will be fielded with US library leaders in early March 2026.

Subscribe to Hanging Together, the blog of OCLC Research, for updates on the survey series and to follow our latest work.

The post Listening to library leaders: Surveys capture real-time perspectives shaping decisions across the field appeared first on Hanging Together.

Regulating Femininity in the Manosphere: An Exploration of Femmephobic Discourse in Andrew Tate and Nick Fuentes Podcasts / Nick Ruest

Does Clarivate understand what citations are for? / Hugh Rundle

A month ago Clarivate announced a new yet-to-be-released product called Nexus: "Clarivate Nexus acts as a bridge between the convenience of AI and the rigor of academic libraries". This is a pitch to librarians who have correctly identified generative AI chatbots as purveyors of endless bullshit, but also know that students and some researchers are going to use them anyway. Clarivate tells us that we can patch up the fabrications of chatbots with reassuring terms like "trusted sources", "verified academic references", and "authoritative".

Looking more carefully at Clarivate's marketing material, what they are proposing suggests that Clarivate understands neither what citations are for nor why fabricated citations are a problem. This is somewhat surprising for the company that controls and manages such key parts of the scholarly publishing systems as the citation database Web of Science, scholarly publishing and indexing company ProQuest, and the Primo/Summon Central Discovery Index.

Why we cite

It can get a little more complicated than this, but there are essentially two reasons for citations in scholarly work.

The first is to indicate where you got your data. If I write that the population of Australia in June 2025 was 27.6 million people, I need to back up this claim somehow. In this case, I would cite the Australian Bureau of Statistics as the source. This adds credibility to a claim by enabling readers to check the original source and assess whether it actually does make the same claim, and whether that claim is credible. If I said that the population of Australia in 2025 was 100 million people and cited a source which made that claim and in turn cited the ABS as their source, you could follow the chain of references back and identify that the paper I cited is where the error ocurred.

The second reason we cite a source is to give credit for a concept, term, or model for thinking. This is less about checking facts and more about academic norms and manners, though it also indicates how credible a scholar might be in terms of their understanding of a field. For example I might describe a concept whereby librarians feel that the mission of libraries is good and righteous, and this leads to burnout because they feel they can never complain about their working conditions. If I did not cite Fobazi Ettarh's Vocational Awe and Librarianship: The Lies We Tell Ourselves whilst describing this, I would rightly not be seen as a credible scholar in the field, or alternatively might be seen as surely knowing about Ettarh's work but deliberately ignoring it or even claiming her work as my own idea.

Why fabricated citations are bad

So that's the basics of why scholars include citations in their work. We can now explore why fabricated citations are a problem. There are two related but distinct reasons.

Citations that look real but are actually fake waste the time of already-busy library resource-sharing teams by making them spend time checking whether the citation is real, and sometimes looking for items that don't exist. This aspect of fabrication is bad because the cited item doesn't exist. If we match this to our first reason for citing, we can see that a claim that is backed by a citation to nothing at all is, uh, pretty problematic if the reason we cite is to link to the source data backing up a claim. It's equivalent to simply not providing a citation at all, except worse because we're claiming that our plucked-out-of-the-air "fact" is backed up by some other source.

The second problem with fabricated citations is that there is no connection between the statement being made and the source being cited. Even if the source being cited exists, the connection between the statement and the cited item is fabricated. This is slightly more difficult to understand because generative AI is based on probability, so in many cases there will appear to be a connection. But without a tightly-controlled RAG system, it's likely to simply be a lucky guess. The problem here is one of academic integrity – we've cited a source that exists, but it may or may not back up our claim, and the claim doesn't follow from the source.

A false nexus

Clarivate seems to be conflating these two issues. Their Nexus product has two core functions: checking citations to see if they are real, and suggesting references for content in chatbot conversations. The first is genuinely useful, though highly constrained – Clarivate only checks their own indexes, and defines anything that doesn't appear in those indexes as either non-existing, or "non-scholarly" (it's unclear how it would define, for example, something with a DOI that exists but doesn't appear in Web of Science). Neither academia nor the tech industry are short on hubris, but even in that context, "anything not listed in our proprietary databases isn't credible" is a pretty eyebrow-raising claim.

The second function kicks in when the citation checker defines a citation as failed – it offers to "Find Verified Alternative". That is, Nexus offers to replace both cited sources that don't exist and cited sources that "aren't scholarly" with another real source. This addresses the first problem (cited sources that don't exist) but not the second (cited sources that aren't the real source of a claim or quotation).

With Nexus, Clarivate are essentially integrity-washing synthetic text, giving it an academic sheen without any academic rigour. Far from helping librarians, Clarivate's Nexus threatens to further unravel the hard work we do to teach students information literacy skills and its sparkling variety, "AI literacy". Students are already inclined to write their argument first and go on a fishing expedition for citations to back it up later (I certainly wrote my undergraduate essays this way). The last thing we want to do is direct them to a product that encourages this academically dishonest behaviour.

ChatGPT is designed to provide something that looks like a competent answer to a question. Nexus seems to be designed to amend this answer-shaped text into something that looks like a correctly-cited academic essay. But the point of student assessments isn't to produce essays – it's to produce competent researchers and systematic thinkers. Perhaps Clarivate thinks there is a large potential market of universities who want to help their own students cheat on assignments in ways that look more credible. To that, I would say "[citation needed]".

Memorial for Fobazi Ettarh / In the Library, With the Lead Pipe

It is with heavy hearts and great sadness that we acknowledge the passing of trailblazer and fire-starter Fobazi Ettarh. Her loss will be felt by us all for years to come.

Fobazi published two articles with us at ITLWTLP. In 2014 she wrote “Making a New Table: Intersectional Librarianship,” one of the first scholarly articles published about viewing librarianship through an intersectional lens. In 2018 she published the hugely influential “Vocational Awe and Librarianship: The Lies We Tell Ourselves.” Since then, we have published many, many articles that cite the concept she identified: vocational awe. She was, to borrow a phrase from bell hooks, a maker of theory and a leader of action. We remember her as one of the great thinkers of her time, and we encourage our readers to spend some time with her words and her work. Additionally, please consider contributing to or sharing the link for her GoFundMe.

Streamlining Open Access Agreement Lookup for U-M Authors / Library Tech Talk (U of Michigan)

Sign hanging in a shop window that says OPEN ACCESS!

Image Caption

Open Access (storefront) by Gideon Burton is licensed under CC BY-SA 2.0.

University of Michigan Library recently launched a new application to help U-M researchers and authors at our three campuses locate publications covered under institutional open access agreements. This tool aggregates nearly 13,000 titles across publishers, streamlining the process of locating eligible journals. The project involved data-wrangling, application design and development, and usability testing to produce a usable, sustainable tool.

Tesla's Not-A-Robotaxi Service / David Rosenthal

Source

I have now seen the fabled CyberCab three times in real life. It has two seats, one of them fully equipped with human driver interface equipment. In each case a human was using them to drive the car, which is necessary in California because Fake Self-Driving is a Level 2 driver assistance system that requires a human behind the wheel at all times. A Robotaxi that requires a human driver and can carry at most one passenger isn't going to be a economic success.

Fred Lambert has two posts illustrating the distance between Musk's claims and reality. Below the fold I look at both of them:

"Safety monitors" less safe than "drivers"

First, Tesla ‘Robotaxi’ adds 5 more crashes in Austin in a month — 4x worse than humans:

Tesla has reported five new crashes involving its “Robotaxi” fleet in Austin, Texas, bringing the total to 14 incidents since the service launched in June 2025. The newly filed NHTSA data also reveals that Tesla quietly upgraded one earlier crash to include a hospitalization injury, something the company never disclosed publicly.

Even before they were changed, we knew very few of the details:

As with every previous Tesla crash in the database, all five new incident narratives are fully redacted as “confidential business information.” Tesla remains the only ADS operator to systematically hide crash details from the public through NHTSA’s confidentiality provisions. Waymo, Zoox, and every other company in the database provide full narrative descriptions of their incidents.

But what we do know isn't good:

With 14 crashes now on the books, Tesla’s “Robotaxi” crash rate in Austin continues to deteriorate. Extrapolating from Tesla’s Q4 2025 earnings mileage data, which showed roughly 700,000 cumulative paid miles through November, the fleet likely reached around 800,000 miles by mid-January 2026. That works out to one crash every 57,000 miles.

The numbers aren't just not good, they're apalling:

By the company’s own numbers, its “Robotaxi” fleet crashes nearly 4 times more often than a normal driver, and every single one of those miles had a safety monitor who could hit the kill switch. That is not a rounding error or an early-program hiccup. It is a fundamental performance gap.

There are two points that need to be made about how bad this is:

However badly, Tesla is trying to operate a taxi service. So it is misleading to compare the crash rate with "normal drivers". The correct comparison is with taxi drivers. The New York Times reported that:

In a city where almost everyone has a story about zigzagging through traffic in a hair-raising, white-knuckled cab ride, a new traffic safety study may come as a surprise: It finds that taxis are pretty safe.

So are livery cars, according to the study, which is based on state motor vehicle records of accidents and injuries across the city. It concludes that taxi and livery-cab drivers have crash rates one-third lower than drivers of other vehicles.
A law firm has a persuasive list of reasons why this is so. So Tesla's "robotaxi" is actually 6 times less safe than a taxi.
Fake Self Driving is a Level 2 system that requires a human behind the wheel, and that is the way Tesla's service in California has to operate. But in Austin the human is in the passenger seat, or in a chase car. Tesla has been placing bystanders at risk by deliberately operating in a way that it knows, and the statistics it reports show, is unsafe.

Tesla's Catch-22

Second, Tesla admits it still needs drivers and remote operators — then argues that’s better than Waymo:

Tesla filed new comments with the California Public Utilities Commission that amount to a quiet admission: its “Robotaxi” service still relies on both in-car human drivers and domestic remote operators to function. Rather than downplaying these dependencies, Tesla leans into them — arguing that its multi-layered human supervision model is more reliable than Waymo’s fully driverless system, pointing to the December 2025 San Francisco blackout as proof.

The filing, submitted February 13 in CPUC Rulemaking 25-08-013, reveals the massive operational gap between what Tesla calls a “Robotaxi” and what Waymo actually operates as one.

Tesla's filing admits that the service they market as a "robotaxi" really isn't one:

Tesla operates its service using TCP (Transportation Charter Party) vehicles equipped with FSD (Supervised), a Level 2 ADAS system that, by definition, requires a licensed human driver behind the wheel at all times, actively monitoring and ready to intervene.

On top of that in-car driver, Tesla describes a parallel layer of remote operators. The company states it employs domestically located remote operators in both Austin and the Bay Area, and that these operators are subject to DMV-mandated U.S. driver’s licenses, “extensive background checks and drug and alcohol testing,” and mandatory training. Tesla frames this as a redundancy system, remote operators in two cities backing up the in-car drivers.

That’s two layers of human supervision for a service Tesla markets as a “Robotaxi.”

Compare this with a Waymo:

Waymo’s vehicles have no driver in the car. Waymo uses remote assistance operators who can provide guidance to vehicles in ambiguous situations, but the vehicle drives itself. Waymo’s remote operators don’t control the car, they confirm whether it’s safe to proceed in edge cases like construction zones or unusual road conditions.

... Tesla’s system requires a human to drive the car and has remote operators as backup. Waymo’s system drives itself and has remote operators as backup. Tesla is essentially describing a staffing-intensive taxi service with driver-assist software. Waymo is describing an autonomous transportation network.

This is where Tesla's marketing their service as a "robotaxi" creates a Catch-22:

Tesla argues forcefully that its Level 2 ADAS vehicles should remain outside the scope of this AV rulemaking entirely, agreeing with Lyft that they aren’t “autonomous vehicles” under California law.

At the same time, Tesla is fighting Waymo’s proposal to prohibit Level 2 services from using terms like “driverless,” “self-driving,” or “robotaxi.” Tesla calls this proposal “wholly unnecessary,” arguing that existing California advertising laws already cover misleading marketing.

But note that:

A California judge already ruled in December 2025 that Tesla’s marketing of “Autopilot” and “Full Self-Driving” violated the state’s false advertising laws.

So here is the Catch-22:

Tesla is telling regulators its vehicles are not autonomous and require human drivers, while simultaneously fighting for the right to keep calling the service a “Robotaxi.” Tesla wants the legal protections of being classified as a supervised Level 2 system and the marketing benefits of sounding like a fully autonomous one.

Sadly, this is just par for the course when it comes to Tesla's marketing. Essentially everything Elon Musk has said about not just the schedule but more importantly the capabilities of Fake Self Driving has been a lie, for example a 2016 faked video. These lies have killed many credulous idiots, but they have succeeded in pumping TSLA to a ludicrous PE ratio because of the kind of irresponsible journalism Karl Bode describes in The Media Can't Stop Propping Up Elon Musk's Phony Supergenius Engineer Mythology:

One of my favorite trends in modern U.S. infotainment media is something I affectionately call "CEO said a thing!" journalism.

"CEO said a thing!" journalism generally involves a press outlet parroting the claims of a CEO or billionaire utterly mindlessly without any sort of useful historical context as to whether anything being said is factually correct.

There's a few rules for this brand of journalism. One, you can't include any useful context that might shed helpful light on whether what the executive is saying is true. Two, it's important to make sure you never include a quote from an objective academic or expert in the field you're covering that might challenge the CEO.

After all, if a journalist does include an expert pointing out that the CEO is bullshitting:

statements produced without particular concern for truth, clarity, or meaning

the journalist will lose the access upon which his job depends. But I'm not that journalist, so here is my list of the past and impending failures of the "Supergenius Engineer":

Contrast these with the successes:

Tesla's cars: Wikipedia notes that:

Tesla was incorporated in July 2003 by Martin Eberhard and Marc Tarpenning as Tesla Motors. ... In February 2004, Elon Musk led Tesla's first funding round and became the company's chairman, subsequently claiming to be a co-founder
Starting in 2008, Franz von Holzhausen designed the Model S, which launched in 2012 and was Tesla's first success. Initially, Tesla was a great success, but it has failed to update its line-up. It is now far behind Chinese EV manufacturers and losing market share worldwide. They will lose the US market share once the Chinese set up US factories.
Space X Falcon 9: Musk's insight that re-usability would transform the space business was a huge success. It was thanks to significant government support and a great CEO, Gwynne Shotwell.

This history seems like valuable context for journalists to include in reports of Musk's next pronouncement.

2026-02-24: The 10th Computational Archival Science (CAS) Workshop Trip Report / Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

IEEE BigData 2025-The10th Computational Archival Science (CAS) Workshop Home Page

The 10th Computational Archival Science (CAS) Workshop is part of 2025 IEEE Big Data Conference (IEEE BigData 2025). It was an online workshop held on Tuesday December 9, 2025. It included close to 70 participants, with a keynote from Dr. Phang Lai Tee, National Archives of Singapore and Chair of the UNESCO Memory of the World Preservation Sub-Committee on Artificial Intelligence, and 18 papers from 27 institutions in 8 countries spanning 5 continents: Canada, USA (North America) / Brazil (South America) / Scotland, Spain, Switzerland (Europe) / South Africa (Africa) / Korea (Asia).

Michael Kurtz, who passed on December 17th, 2022 launched the CAS initiative in 2016, with Victoria Lemieux, Mark Hedges, Maria Esteva, William Underwood, Mark Conrad, and Richard Marciano.

The 10th CAS workshop was organized by the CAS Workshop Chairs:

Mark Hedges from King’s College London UK
Victoria Lemieux from U. British Columbia CANADA
Richard Marciano from U. Maryland USA

The workshop started with a 10 minute welcome message from the CAS workshop chairs and then a 20 minute keynote from Dr. Phang Lai Tee, National Archives of Singapore, who presented "Applications and Challenges for Archives and Documentary Heritage in the Age of AI: Some Reflections". Overall, the topic is a timely reflection on how AI is reshaping archival and documentary heritage work, highlighting both opportunities and challenges. It was a strong presentation that included emphasis on practical challenges such as scale, access, cybersecurity, and regulation.

The workshop itself was divided into six sessions:

1: Blockchain & Archives [2 papers]

A. Blockchain and Responsible AI: Enhancing Transparency, Privacy, and Accountability through Blockchain Hackathon

Authors: Jiho Lee, Jaehyung Jeong, Victoria Lemieux,Tim Weingartner, and JaeSeung Song

PAPER — VIDEO — SLIDES

The presentation highlights a curriculum initiative where participants used a blockchain-enabled fair-data ecosystem (Clio-X) in Blockathon to build privacy-preserving AI chatbots for archival datasets. It highlights blockchain’s potential to improve transparency and accountability in AI workflows by making all actions traceable on-chain.

B. Cryptographic Provenance and AI-generated Images

Authors: Jessica Bushey, Nicholas Rivard, and Michel Barbeau

PAPER — VIDEO — SLIDES

The presentation highlighted how content credentials and cryptographic provenance frameworks can operationalize archival trustworthiness for born-digital assets and AI-generated images by embedding tamper-evident metadata into assets, which is a highly relevant and timely challenge given the proliferation of synthetic media. It effectively bridges archival theory (authenticity and provenance) with practical systems and discusses how blockchain and content credentials can support verifiable history of digital images, situating the work within computational archival science. Overall, it makes a strong conceptual and methodological contribution to trustworthy preservation of digital content.

2: Processing Analog Archives [4 papers]

A. Using an Ensemble Approach for Layout Detection and Extraction from Historical Newspapers

Authors: Aditya Jadhav, Bipasha Banerjee, and Jennifer Goyne

PAPER — VIDEO — SLIDES

The presentation focused on layout detection and Optical Character Recognition (OCR) for historical newspapers by proposing a modular, detector-agnostic ensemble pipeline combining OpenCV, Newspaper Navigator, and a fine-tuned TextOnly-PRIMA model to improve segmentation and extraction on variable scans. It’s strong in engineering detail and demonstrates practical improvements over commercial baselines like AWS Textract, especially on degraded material. Overall, it’s a solid methodological contribution with clear application value in large-scale digitization efforts.

B. PARDES: Automatic Generation of Descriptive Terms for Logical Units in Historical Handwritten Collections

Authors: Josepa Raventos-Pajares, Joan Andreu Sanchez, and Enrique Vidal

PAPER — VIDEO — SLIDES

The PARDES project presents a practical and scalable method for automatically generating descriptive terms from noisy handwritten text recognition (HTR) outputs in large historical collections, using probabilistic indexing and Zipf's Law to identify important terns. It’s strong in handling uncertainty in HTR.

C. From Analog Records to Computational Research Data: Building the AI-Ready Lab Notebook

Authors: Joel Pepper, Zach Siapno, Jacob Furst, Fernando Uribe-Romo, David Breen, and Jane Greenberg

PAPER — VIDEO — SLIDES

Similar to the previous presentation, this one addressed transforming analog, handwritten lab notebooks into AI-ready digital data to unlock valuable experimental records for computational analysis. It demonstrated promising performance. Overall, it’s a good step toward making analog scientific records computationally accessible and usable for AI systems.

D. Classification of Paper-based Archival Records Using Neural Networks

Authors: Jussara Teixeira, Juliana Almeida, Tania Gava, Raphael Lugon Campo Dall’Orto, and Jose M´ arcio Moraes Dorigueto

PAPER — VIDEO — SLIDES

The presentation demonstrates a practical application of supervised machine learning (ML) to classify unprocessed archival records, achieving high accuracy and scalability on a large real-world governmental dataset (Electronic Process System (SEP) of the State of Espirito Santo, Brazil). It effectively shows how a modular ML architecture can be integrated into existing archival systems, and how clustering similar records can reduce manual effort. Overall, it’s a solid empirical case study of ML enhancing a core archival function at scale.

3: Retrieval-augmented Generation [3 papers]

A. Developing a Smart Archival Assistant with Conversational Features and Linguistic Abilities: the Ask_ArchiLab Initiative

Authors: Basma Makhlouf Shabou, Lamia Friha, and Wassila Ramli

PAPER — VIDEO — SLIDES

This talk presented a compelling initiative to modernize archival practice by building a conversational AI assistant that integrates advanced Retrieval Augmented Generation (RAG) and semantic technologies to support fast, contextual, and professional‑level archival queries. It’s strong in conceptualizing how multilingual conversational agents can bridge gaps in access, complex metadata, and diverse user expertise. Overall, it’s an innovative approach with great potential to enhance usability and knowledge discovery in digital archives.

B. Index-aware Knowledge Grounding of Retrieval-Augmented Generation in Conversational Search for Archival Diplomatics

Authors: Qihong Zhou, Binming Li, and Victoria Lemieux

PAPER — VIDEO — SLIDES

This work presents an index‑aware chunking strategy to improve RAG pipelines for conversational search by grounding retrieval on structured index terms extracted from PDFs, aiming to reduce resource demands, accuracy issues, and hallucinations common in standard RAG workflows. It’s a practical contribution that addresses problems with traditional chunking strategies. Overall, it is an interesting methodological refinement with promising implications for archival conversational search but would benefit from broader validation.

C. Retrieval-augmented LLMs for ETD Subject Classification

Authors: Hajra Klair, Fausto German, Amr Ahmed Aboelnaga, Bipasha Banerjee, Hoda Eldardiry, and William A. Ingram

PAPER — VIDEO — SLIDES

This work presents a two‑stage RAG‑based pipeline that uses keyword extraction and guided question generation from Electronic Theses and Dissertations (ETD) abstracts to retrieve and synthesize core document content, tackling the challenge of long, full‑text processing. It addresses the challenge of subject classification at scale for ETD by capturing signatures that go beyond simple lexical similarity to improve classification accuracy and contextual richness. The evaluation shows improvements over traditional approaches. Overall, it’s a promising and well‑structured application of RAG methods to a real-world problem.

4: Archival Theory & Computational Practice [4 papers]

A. Archival Research Theory: Putting Smart Technology to Work for Researchers

Authors: Kenneth Thibodeau, Alex Richmond, and Mario Beauchamp

PAPER — VIDEO — SLIDES

This work extends archival theory beyond traditional archival management to a new Archival Research Theory (ART) framework that models archives as complex informational systems with informative potential responsive to researchers’ questions, grounded in semiotics, Constructed Past Theory, and type theory. It’s conceptually rich, offering a strong theoretical foundation for integrating smart technologies into archival research and emphasizing how meaning and context can be formally modeled to support diverse inquiry. Overall, it makes a thoughtful and potentially foundational contribution to bridging archival theory and computational practice.

B. Systems Thinking, Management Standards, and the Quest for Records and Archives Management Relevance

Author: Shadrack Katuu

PAPER — VIDEO — SLIDES

The presentation makes a case for records and archives management (RAM) within organizations by embedding RAM into widely adopted Management System Standards (MSS) like ISO frameworks, which currently drive visibility and measurable outcomes in areas such as quality and security. It uses systems thinking and standards practice to argue that RAM can gain institutional relevance and leadership buy‑in by aligning with structured MSS processes and the Plan‑Do‑Check‑Act cycle, thereby elevating archival functions beyond marginal roles. Overall, it’s a good management‑focused contribution that highlights the importance of standards and systemic framing for advancing archival relevance.

C. Can GPT-4 Think Computationally about Digital Archival Practices?

Authors: William Underwood and Joan Gage

PAPER — VIDEO — SLIDES

This work investigates whether GPT‑4o demonstrates computational thinking capabilities applied to digital archival tasks, grounding the analysis in a recognized computational thinking taxonomy. It surfaces compelling examples where the model exhibits knowledge across archival processes and computational practices, suggesting its potential as a learning partner or assistant in teaching archival computational methods. Overall, the paper offers a thought‑provoking exploration of LLM capabilities in a computational archival context, with promising avenues for further research.

D. Algorithm Auditing for Reliable AI Authenticity Assessment of Digitized Archival Objects

Author: Daniel F. Fonner

PAPER — VIDEO —SLIDES

This presentation shows how small variations in input image resolution can drastically affect AI‑based art authentication results, highlighting a key vulnerability in applying such models to archival or cultural heritage objects and raising important concerns about reliability and manipulation risk. It makes a strong case that algorithm auditing should be embedded in computational archival science practices to improve transparency, reproducibility, and accountability of automated analyses. Overall, it’s a practical contribution that urges the need for rigorous evaluation frameworks when deploying AI for authenticity and provenance tasks in digital archives.

5: Knowledge Organization & Retrieval [2 papers]

A. Ontologies Applied to Archival Records: a Preliminary Proposal for Information Retrieval

Authors: Thiago Henrique Bragato Barros, Maurício Coelho da Silva, Rafael Rodrigo do Carmo Batista, David Haynes, and Frances Ryan

PAPER — VIDEO — The slides were not posted

This paper presents an ontology‑driven approach to improve information retrieval (IR) over archival descriptions and digital objects by capturing archival contexts such as provenance, functions, agents, and events within a formal semantic model. It grounds its design in established ontology engineering and archival principles to support semantic indexing, reasoning, and query handling. Overall, it makes a decent conceptual contribution toward ontology‑enhanced archival IR.

B. Operationalizing Context: Contextual Integrity, Archival Diplomatics, and Knowledge Graphs

Authors: Jim Suderman, Frédéric Simard, Nicholas Rivard, Iori Khuhro, Erin Gilmore, Michel Barbeau, Darra Hofman, and Mario Beauchamp

PAPER — VIDEO — SLIDES

This paper lays out a context‑driven privacy framework for archival records that combines theories of contextual integrity, archival diplomatics, and knowledge graphs to make privacy‑relevant relationships machine‑legible and support informed decisions about sensitive information at scale. Its strength lies in operationalizing context rather than content alone using GraphRAG and knowledge graphs to capture nuanced contextual features that traditional vector embeddings miss, thereby offering a richer basis for privacy assessment. Overall, it’s a promising conceptual and advancement toward AI‑enabled privacy support in archives.

6: Web Archiving [3 papers]

This session highlights my contributions. The workshop designated two slots for my papers. The first slot was for presenting one of the papers and the second one is for summarizing the remaining two papers, which is why there are three papers, but only two videos. The slides for both slots are combined in one file. I want to thank Richard Marciano, Victoria Lemieux, and Mark Hedges for giving me the opportunity to present and being flexible with the workshop registration since my work is not funded and we were unable to pay the registration fees.

SLIDES

A. Arabic News Archiving is Catching Up to English: A Quantitative Study

PAPER

In the first paper, I presented a quantitative analysis of web archiving coverage for Arabic versus English news content over a 23‑year period, revealing that while English pages are still archived at a higher rate, Arabic archival coverage has increased significantly in recent years. I showed the heavy dependence on the Internet Archive (IA) for web archiving and that other public web archives contribute very little, exposing a centralization risk where loss of IA would make most archived content inaccessible. This paper is a continuation of previous work "Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages".

B. The Gap Continues to Grow Between the Wayback Machine and All Other Web Archives

PAPER

The second paper I presented highlights a quantitative study showing that the Internet Archive (IA) overwhelmingly dominates public web archiving, preserving 99.74 % of archived Arabic and English news pages in the dataset I constructed (1.5 million URLs) while all other web archives combined account for only a tiny fraction. I highlighted the risk to web archiving if the IA became unavailable, the vast majority of archived online news would be lost or irretrievable, underscoring a critical vulnerability in web preservation. My analysis offer clear results, but the paper could benefit from a broader discussion of why other web archives are shrinking and what practical strategies could diversify preservation efforts. Overall, it is an important wake‑up call about concentration in web archiving and the fragility of our collective digital memory. This paper is a continuation of previous work "Profiling web archive coverage for top-level domain and content language".

C. Collecting and Archiving 1.5 Million Multilingual News Stories’ URIs from Sitemaps

PAPER

The third paper I presented introduced JANA1.5, a large dataset of 1.5 million Arabic and English news story URLs collected from news site sitemaps, and demonstrated an effective sitemap‑based collection method that outperforms alternatives like RSS, X (formerly Twitter), and web scraping. I also discussed ways for noise reduction. I ended with explaining how this dataset is going to be submitted to the IA.

One of the standout aspects of the CAS workshop was its responsiveness and quick turnaround. Reviewers' comments were actionable and came back quickly, decisions were clear, and the entire process moved at a fast pace that made it possible to focus on the work itself rather than waiting on it. The entire process from submission to publishing and presenting the work takes about a month. It’s the kind of efficiency every venue should strive for. Attending the 10th CAS Workshop was great. It underscored issues related to computational archival science including centralization, authenticity, and who gets to be remembered. It was a rewarding experience to present my work at the CAS workshop exploring web archiving’s dependence on the Internet Archive. The discussion highlighted just how vital the Internet Archive is to our digital memory, and it was inspiring to see how their work motivates us all to take action and contribute to preserving our online heritage.

Hussam Hallak

Launching the Agent Protocols Tech Tree / Harvard Library Innovation Lab

Today I am sharing the Agent Protocols Tech Tree. APTT is a visual, videogame-style tech tree of the evolving protocols supporting AI agents.

Where did this come from?

I made the APTT for a session on “The Role of Protocols in the Agents Ecosystem” at the Towards an Internet Ecosystem for Sane Autonomous Agents workshop at the Berkman Klein Center on February 9th.

It’s a video game tech tree because, while the word “protocols” is boring, the phenomenon of open protocols is fascinating, and I want to make them easier to approach and explore.

What is an open protocol? Why care about them?

An open protocol is a shared language used by multiple software projects so they can interoperate or compete with each other.

Protocols offer an x-ray of an emerging technology — they tell you what the builder community actually cares about, what they are forced to agree on, what is already done, and what is likely to come next.

Open protocols go back to the founding of the internet when basic concepts like “TCP/IP” were standardized — not by a government or company creating and enforcing a rule, but by a community of builders based on “rough consensus and running code.” On the internet no one could force you to use the same standards as everyone else, but if you wanted to be part of the same conversation, you had to speak the same language. That created strong incentives to agree on protocols, from SMTP to DNS to FTP to HTTP to SSL. By tracing each of those protocols, you could see the evolving concerns of the people building the internet.

(For a great discussion of that history, see “The Battle of the Networks” from LIL faculty director Jonathan Zittrain’s book “The Future of the Internet — and How to Stop It.”)

Why are protocols so important for AI agents?

Like the early internet, AI agents today are an emerging, distributed phenomenon that is changing faster than even experts can understand. We’re holding workshops with names like “Towards an Internet Ecosystem for Sane Autonomous Agents” because no one really knows what it will mean to have millions of semi-autonomous computer programs acting and interacting in human-like ways online.

Also like the early internet, it’s tempting to look for some government or company that is in charge and can tame this phenomenon, set the rules of the road. But in many ways there isn’t one. The ingredients of AI agents are just not that complex or that controlled.

This makes sense if you look at Anthropic’s definition of an agent, which is simply “models using tools in a loop.” That is not a complex recipe: it requires a large language model, of which there are now many, including powerful open source ones that can run locally; a fairly small and simple control loop; and a set of “tools,” simple software programs that can interact with the world to do things like run a web search or send a text message. “Agents” as a phenomenon are a technique, like calculus, not a service, like Uber.

That makes agents hard to regulate, and makes protocols incredibly important. It is protocols that give agents the tools they use. It is protocols that the builder community are developing as fast as they can to increase what agents can do. If you want to nudge this technique toward human thriving, it is protocols that might most shape agent behavior by making some agents easier to build than others.

To be sure, protocols aren’t the only way to influence technological development. Larry Lessig’s classic “pathetic dot theory” outlines markets, laws, social norms, and architecture as four separate ways that individual action gets regulated, and protocols are just an aspect of architecture. But the more a technology is dispersed and simple to recreate, the more protocols come into play in how it evolves.

How do I use the APTT?

APTT is designed to be helpful whether you’re a less-technical person who just wants to understand what agents are, or a more technical person who wants to understand exactly what’s getting built.

Either way the pile of agent technologies is confusing, so I recommend starting at the beginning with “Inference API.”

Video games are often designed so you start with a simple feature unlocked and then progressively unlock more and more complex options as you learn the game. The same approach works here: imagine that you have just unlocked “Inference API” in this game, and once you’re comfortable with that, explore off to the right to see how each protocol enables or necessitates the next.

You can click each technology to learn what problem it solves (why did people need something like this?), how it’s standardizing (who kicked this off?), and what virtuous cycle it enabled (why did other people want to get on board?).

You can also see visual animations of how the protocol is used — what messages are actually sent back and forth between who?

Inference API animation

If you’re interested in the technical details, you can click any of the messages to see at a wire level what’s actually happening. (Often, something simpler than it sounds.)

Inference API messages

As you move off to the right, you’ll go from widely adopted technologies, like MCP, to technologies that have commercial supporters but not much social proof yet, like Visa TAP, or technologies that don’t even exist but might make sense in the future, like Interoperable Memory, Signed Intent Mandates, or Agent Lingua Franca.

The ragged edge on the right is where I hope you’ll be the most critical: what seems inevitable, what seems like a dead end, and what would you like to see more of?

How accurate is all of this? How do I fix mistakes?

APTT is a work in progress, and to be honest in many ways is a whiteboard sketch. I put it together (and vibe coded much of it) to help support a conversation, first at the workshop and now online. I think whiteboard sketches are useful, so I’m sharing it, but I don’t pretend it’s authoritative; it’s just my rough sense of how things work right now.

(This is a weird thing about the agentic moment — my coding agent has made this tool look more polished and complete than it may really deserve. Think napkin sketch with fancy graphics.)

If you think I got things wrong or missed part of the story, please open an issue on the GitHub repository. I plan to keep this rough and opinionated, and focused on consensus-driven protocols as a lens for understanding what’s happening — so I’ll either pull contributions into the main tool, or just leave them as discussions to represent the range of opinions about how all of this works. I hope it’s fun to play with either way.

Weekly Bookmarks / Ed Summers

These are some things I’ve wandered across on the web this week.

🔖 Arke

Arke is a public knowledge network for storing, discovering, and connecting information.

Making content truly accessible is harder than it looks. Meaningful search requires vectors, embeddings, extraction pipelines—infrastructure most people can’t build. And even with that, files sitting on a website or in a folder don’t get found. You end up working alone, disconnected from related work that exists somewhere.

Arke handles all of it. Upload anything—we process it and connect it to a network where similar collections surface automatically. Your information becomes searchable, discoverable, and linked to work you didn’t know existed.

🔖 Community Calendar

Public events are trapped in information silos. The library posts to their website, the YMCA uses Google Calendar, the theater uses Eventbrite, Meetup groups have their own pages. Anyone wanting to know “what’s happening this weekend?” must check a dozen different sites.

Existing local aggregators typically expect event producers to “submit” events via a web form. This means producers must submit to several aggregators to reach their audience — tedious and error-prone. Worse, if event details change, producers must update each aggregator separately.

This project takes a different approach: event producers are the authoritative sources for their own events. They publish once to their own calendar, and individuals and aggregators pull from those sources. When details change, the change propagates automatically. This is how RSS transformed blogging, and iCalendar can do the same for events.

The gold standard is iCalendar (ICS) feeds — a format that machines can read, merge, and republish. If you’re an event producer and your platform can publish an ICS feed, that’s great. But ICS isn’t the only way. The real requirement is to embrace the open web. A clean HTML page with well-structured event data works. What doesn’t work: events locked in Facebook or behind login walls.

🔖 Engineering Rigor in the LLM Age

What do LLMs mean for the future of software engineering? Will vibe-coded AI slop be the norm? Will software engineers simply be less in-demand? Rain and David join Bryan and Adam to discuss how rigorous use of LLMs can make for much more robust systems.

🔖 Wikipedia blacklists Archive.today, starts removing 695,000 archive links

The English-language edition of Wikipedia is blacklisting Archive.today after the controversial archive site was used to direct a distributed denial of service (DDoS) attack against a blog.

In the course of discussing whether Archive.today should be deprecated because of the DDoS, Wikipedia editors discovered that the archive site altered snapshots of webpages to insert the name of the blogger who was targeted by the DDoS. The alterations were apparently fueled by a grudge against the blogger over a post that described how the Archive.today maintainer hid their identity behind several aliases.

“There is consensus to immediately deprecate archive.today, and, as soon as practicable, add it to the spam blacklist (or create an edit filter that blocks adding new links), and remove all links to it,” stated an update today on Wikipedia’s Archive.today discussion. “There is a strong consensus that Wikipedia should not direct its readers towards a website that hijacks users’ computers to run a DDoS attack (see WP:ELNO#3). Additionally, evidence has been presented that archive.today’s operators have altered the content of archived pages, rendering it unreliable.”

🔖 Megalodon (website)

Megalodon (Japanese: ウェブ魚拓, “web gyotaku”) is an on demand web citation service based in Japan.[3] It is owned by Affility.

Megalodon’s server can be searched for “web gyotaku” or copies of web pages, by prefixing any URL with “gyo.tc”; the process checks the query against other services as well, including Google’s cached pages and Mementos.

🔖 Exclusive: US plans online portal to bypass content bans in Europe and elsewhere

WASHINGTON, Feb 18 (Reuters) - The U.S. State Department is developing an online portal that will enable people in Europe and elsewhere to see content banned by their governments including alleged hate speech and terrorist propaganda, a move Washington views as a way to counter censorship, three sources familiar with the plan said.

🔖 How An Academic Library Built a Research Impact and Intelligence Team

During recent decades, universities have faced increasing pressure to demonstrate their value and impact by contributing to real-world problem-solving and meeting broader societal needs. The reasons for this increased pressure are complex and numerous—reflecting socio-economic and socio-political considerations, globalization and intensifying competition, and growing demands for accountability and demonstrable public value. At Virginia Tech, our library’s research impact and intelligence team, of which we are all members, supports institutional strategy, researcher visibility, and decision-making in response to these demands. In this article, we’ll outline the emergence of research impact and research intelligence work in libraries, trace the development of our department, and illustrate how analytics, research information management, and consultation services are operationalized alongside ongoing efforts to promote responsible interpretation and use of research metrics.

🔖 Annotorious

Annotorious is a JavaScript library for adding image annotation capabilities to your web application. Try it out below: click or tap the annotation to edit. Click or tap anywhere and drag to create a new annotation.

🔖 Potomac Interceptor Collapse

Collapse of 72” diameter section of pipe caused overflow of more than 200 million gallons of wastewater into Potomac River.

🔖 Inside Claude Code With Its Creator Boris Cherny

A very special guest on this episode of the Lightcone! Boris Cherny, the creator of Claude Code, sits down to share the incredible journey of developing one of the most transformative coding tools of the AI era.

🔖 Current

Every RSS reader I’ve used presents your feeds as a list to be processed. Items arrive. They’re marked unread. Your job is to get that number to zero, or at least closer to zero than it was yesterday.

Current has no unread count. Not because I forgot to add one, or because I thought it would look cleaner without it. There is no count because counting was the problem.

The main screen is a river. Not a river that moves on its own. You’re not watching content drift past like a screensaver. It’s a river in the sense that matters: content arrives, lingers for a time, and then fades away.

🔖 Phantom Obligation

Email’s unread count means something specific: these are messages from real people who wrote to you and are, in some cases, actively waiting for your response. The number isn’t neutral information. It’s a measure of social debt.

But when we applied that same visual language to RSS (the unread counts, the bold text for new items, the sense of a backlog accumulating) we imported the anxiety without the cause.

🔖 ways of working with the Wayback Machine - studio and book talk in Amsterdam

Last week I gave a book talk on Public Data Cultures and co-organised a Wayback studio with the Internet Archive Europe.

As highlighted in the book talk announcement it was really nice to have this moment there given my longstanding collaborations with the Internet Archive - and to meet up with others connected to the archive and associated communities in Amsterdam

🔖 Black Jesus

Black Jesus is an American live-action sitcom created by Aaron McGruder (creator of The Boondocks) and Mike Clattenburg (creator of Trailer Park Boys) that aired on Adult Swim. The series stars Gerald “Slink” Johnson, Charlie Murphy, Corey Holcomb, Kali Hawk, King Bach, Andra Fuller, and John Witherspoon. The series premiered on August 7, 2014. On December 10, 2014, the series was renewed for a second season,[2] which premiered on September 18, 2015.[3] Its third and final season premiered on September 21, 2019.[4]

🔖 Oral History of John Backus

Interviewed by Grady Booch on September 5, 2006, in Ashland, Oregon, X3715.2007

John Backus led a team at IBM in 1957 that created the first successful high-level programming language, FORTRAN. It was designed to solve problems in science and engineering, and many dialects of the language are still in use throughout the world.

Describing the development of FORTRAN, Backus said, “We simply made up the language as we went along. We did not regard language design as a difficult problem, merely a simple prelude to the real problem: designing a compiler which could produce efficient programs . . . We also wanted to eliminate a lot of the bookkeeping and detailed, repetitive planning which hand coding involved.”

The name FORTRAN comes from FORmula TRANslation. The language was designed for solving engineering and scientific problems. FORTRAN IV was first introduced by IBM in the early 1960s and still exists in a number of similar dialects on machines from various manufacturers.

🔖 FreeBSD Mastery: Advanced ZFS

ZFS improves everything about systems administration. Once you peek under the hood, though, ZFS’ bewildering array of knobs and tunables can overwhelm anyone. ZFS experts can make their servers zing—and now you can, too, with FreeBSD Mastery: Advanced ZFS.

🔖 disko-zfs: Declaratively Managing ZFS Datasets

Given a situation where a ZFS pool has just too many datasets for you to comfortably manage, or perhaps you have a few datasets, but you just learned of a property that you really should have set from the start, what do you do? Well, I don’t know what you do, I would love to hear about that, so please do reach out to me, over Matrix preferably.

In any case, what I came up with is disko-zfs. A simple Rust program that will declaratively manage datasets on a zpool. It does this based on a JSON specification, which lists the datasets, their properties and a few pieces of extra information.

🔖 Level of Detail

My hunch is that we’ll spend just as much time and energy carving code back as we will generating it. If generating code is nearly free, then the cost shifts entirely to understanding, maintaining, and pruning it. And sometimes the right move isn’t a better level of detail. It’s fewer polygons in the scene altogether. Delete the sprawling implementation and replace it with something you can actually reason about

🔖 I Sold Out for $20 a Month and All I Got Was This Perfectly Generated Terraform

So now I’m paying $20 a month to a company that scraped the collective knowledge of humanity without asking so that I can avoid writing Kubernetes YAML. I know what that makes me. I just haven’t figured out a word for it yet that I can live with.

🔖 Poor Deming never stood a chance

The two management giants of the mid-twentieth century were Peter Drucker and W. Edwards Deming. Ironically, while Drucker hails from Austria-Hungary (like me, Drucker emigrated to the U.S. as an adult) and Deming was born in the U.S., it was Drucker that proved to be more influential in America. Deming’s influence was much greater in Japan than it ever was the U.S. If you’ve ever been at an organization that uses OKRs, then you have worked in the shadow of Drucker’s legacy. While you can tell a story about how Deming influenced Toyota, and Toyota inspired the lean movement, I would still describe management in the U.S. as Deming in exile. Deming explicitly stated that management by objectives isn’t leadership, and I think you’d be hard-pressed to find managers in American companies who would agree with that sentiment.

🔖 Emily St. John Mandel

Emily St. John Mandel (/seɪntˈdʒɒn mænˈdɛl/;[2][3] née Fairbanks;[4] born 1979) is a Canadian novelist and essayist.[5][6] She has written six novels, including Station Eleven (2014), The Glass Hotel (2020), and Sea of Tranquility (2022). Station Eleven, which has been translated into 33 languages,[7] has been adapted into a limited series on HBO Max.[8] The Glass Hotel was translated into twenty languages and was selected by Barack Obama as one of his favorite books of 2020.[9][10] Sea of Tranquility was published in April 2022 and debuted at number three on The New York Times Best Seller list.[11]

🔖 Deb Olin Unferth

Deb Olin Unferth (born November 19, 1968) is an American author. She has published two novels, two books of short stories, a memoir, and a graphic novel. Her fiction and essays have appeared in over fifty magazines and journals, including Harper’s,[1] The New York Times,[2] The Paris Review[3] The Believer,[4] McSweeney’s, Granta[5] The Guardian,[6] and NOON. She was a finalist for the National Book Critics’ Circle Award,[7] and she has received a Guggenheim fellowship,[8] four Pushcart Prizes, a Creative Capital Fellowship for Innovative Literature,[9] and residency fellowships from the MacDowell[10] and Yaddo[11] Foundations.

🔖 Citational Politics and Justice: Introduction

This introduction provides an overview of the thirteen articles which constitute this special issue about “citational politics and justice.” The issue begins with a discussion paper, followed by six research articles, one commentary, one project report, one teaching reflection, and finishes with three conversations. Authors reflect on the history and future of citation practices, and what they mean for the recognition of marginalised scholars, knowledges, and forms of output. The range of contributions offers insights into how more just scholarly practices can be promoted in teaching, research, publishing, and collaboration with academic and societal partners. Together, these articles provide ideas for achieving greater citational justice, and ultimately improving the quality of knowledge.

🔖 Concatenative language

There are many ways to categorize programming languages; one is to define them as either “concatenative” or “applicative”. In an applicative language, things are evaluated by applying functions to arguments. This includes almost all programming languages in wide use, such as C, Python, ML, Haskell, and Java. In a concatenative programming language, things are evaluated by composing several functions which all operate on a single piece of data, passed from function to function. This piece of data is usually in the form of a stack. Additionally, in concatenative languages, this function composition is indicated by concatenating programs. Examples of concatenative languages include Forth, Joy, PostScript, Cat, and Factor.

🔖 Parents are opting kids out of school laptops, returning them to pen and paper

Parents are forming a loose network teaching one another how to get their children off school-issued Chromebooks and iPads.

🔖 News publishers limit Internet Archive access due to AI scraping concerns

When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.

🔖 Gwtar: a static efficient single-file HTML format

Gwtar is a new polyglot HTML archival format which provides a single, self-contained, HTML file which still can be efficiently lazy-loaded by a web browser. This is done by a header’s JavaScript making HTTP range requests. It is used on Gwern.net to serve large HTML archives.

🔖 What technology takes from us – and how to take it back

Resisting the annexation of our hearts and minds by Silicon Valley requires us not just to set boundaries on our engagement with what they offer, but to cherish the alternatives. Joy in ordinary things, in each other, in embodied life, and the language with which to value it, is essential to this resistance, which is resistance to dehumanisation.

🔖 Inside Japan’s Most Influential Architect’s Working Studio

Join us for a quiet look inside the workspace of Tadao Ando, offering a brief glimpse into his architectural process.

This studio visit documents the daily rhythms of work and the careful, repetitive making of architectural scale models that sit at the center of his practice. The focus is not on finished buildings, but on process. Time spent refining ideas. Returning to the same forms again and again. Letting work unfold slowly.

Photographed in a restrained, observational way, this project uses still imagery to pay close attention to space, light, and atmosphere. The photographs are not illustrative, but quietly descriptive, allowing the studio to reveal itself as it is.

It is a small window into how creative work happens inside a working architecture studio, and an invitation to slow down and observe the act of making.

🔖 Ambient Videos

Photographer Noah Kalina has put together some long form videos that are meant to be put on a screen and left on.

Toke / Ed Summers

Evan as a skeptic, I will admit, it was interesting to hear about how Claude Code was created and how it is being developed now in this interview with its creator Boris Cherny:

Cherny’s instructions to build for the model they will have in six months, coupled with the seeming lack of understanding of what model they will have in six months (either software development goes away or an ASL-4 level catastrophe) was to be expected I guess? Maybe he knows and just isn’t saying? Maybe there isn’t very good understanding of whether one model is working better than another? The question of how these models are being evaluated for particular types of work, like software development, is actually interesting to me.

Of course, Anthropic employees would like nothing better than for people to forget how to develop software, and to become utterly dependent on them in the process. Indeed they are happily leading the way, high on their own supply of limitless tokens. They are counting on employers to follow suit, paying subscription costs to give their employees tokens to spend instead of having software developers on staff. This is following in the footsteps of what we’ve seen happen with cloud computing.

In some ways this is nothing new. Software developers have been dependent on the centralized development of compilers and interpreters for some time. So you could look at the centralization of software development into platforms like Anthropic and OpenAI as the natural next stage of development in information technology. Indeed, I think this is the argument currently being made (somewhat convincingly) by Grady Booch about a Third Golden Age of Computing which got underway with the rise of “platforms” more generally, and which includes recent genAI platform APIs and tooling.

But the big difference, that they want us all to forget, is the amount of resources it takes to build a compiler compared to an LLM and our ability to reason about them, and intentionally improve them. They also want us to forget that we need to, you know, give them all our data and ideas as context for them to do whatever they want (thanks cblgh). And as with cloud computing, they want us to forget about the materiality of computing, where computation runs. Ironically, I think computer programmers are particularly susceptible to this rhetoric of abstraction, or the medial ideology of the digital and the cloud (Hu, 2015; Kirschenbaum, 2008).

From a sociotechnical perspective I am curious how prompt data is being used to try to improve these models, as people start using them for ordinary tasks, and also in attempts to intentionally shape the model motivated by greed and malice. I guess the details of this process must be well hidden? Pointers would be welcome.

Hu, T.-H. (2015). A prehistory of the cloud. MIT Press.

Kirschenbaum, M. G. (2008). Mechanisms: New media and the forensic imagination. MIT Press.

ActiveRecord neighbor vector search, with per-document max / Jonathan Rochkind

I am doing LLM “RAG” with rails ActiveRecord, postgres with the pgvector extension for vector similarity searches, and the neighbor gem. I am fairly new to all of this stuff, figuring it out by doing it.

I realized that for a particular use, I wanted to get some document diversity — so i wanted to do a search of my chunks ranked by embedding vector similarity, getting the top k (say 12) chunks — but in some cases I only want, say, 2 chunks per document. So the top 12 chunks by vector similarity, such that only 2 chunks per interview max are represented in those 12 top chunks.

I decided I wanted to do this purely in SQL, hey, I’m using pgvector, wouldn’t it be most efficient to have pg do the 2-per-document limit?

Note: This may be a use case that isn’t a good idea! I have come to realize that maybe I want to just fetch 12*3 or *4 docs into ruby, and apply my “only 2 per document” limit there? Because I may want to do other things there anyway that I can’t do in postgres, like apply a cross-model re-ranker? So I dunno, but for now I did it anyway.

So this was some fancy SQL, i was having trouble figuring out how to do it myself, so I asked ChatGPT, sure. It gave me an initial answer that worked, but…

Turns out was over-complicated, a simpler (to my understanding anyway) approach was possible
Turns out was not performant, it was not using my postgres ‘HNSW’ indexes to make vector searches higher performance, and/or was insisting on sorting the entire table first defeating the point of the indexes. How’d I know? Well, I noticed it was being slower than expected (several seconds or at times much more to return), and then I did postgres explain/analyze… which I had trouble understanding… so i fed the results to ChatGPT and/or Claude, who confirmed, yeah buddy, this is a bad query, it’s not using your vector index properly.

I had to go on a few back and forths with both ChatGPT and Claude (this is just talking to them in a GUI, not actually using Claude Code or whatever), to get to a pattern that did use my index effectively. They kept suggesting things to me that either just didn’t work, or didn’t actually use the index, etc. I had to actually understand what they were suggesting, and tweak it myself, and have a dialog with them…

But i eventually got to this cool method that can take an arbitrary ActiveRecord relation which already has had neighbor nearest_neighbors query applied to it… and wraps it in a larger query using CTE’s that can limit the results to max-per-document.

I wondered if I should try to share this somewhere (would neighbor gem want a PR?), except… I’m realizing like I said above maybe this is not actually a very useful use case, better to do it in ruby… I’m still not necessariliy getting the performance I expected either, although the analyze/explain says the indexes should be used properly.

So I just share here. Note the original base_relation may be it’s own internal joins to enforce additional conditions on retrieval etc. Assuming each Chunk ActiveRecord model has a document_id attribute which we are using to group for max-per-document.

# We need to take base_scope and use it as a Postgres CTE (Common Table Expression)
    # to select from, but adding on a ROW_NUMBER window function, that let's us limit
    # to top max_per_interview
    #
    # Kinda tricky, especially to do with good index usage. Got solution from google and talking
    # to LLMs, including having them look at pg explain/analyze output.
    #
    # @param base_relation [ActiveRecord::Relation] original relation, it can have joins and conditions.
    #   It MUST have already had vector distance ordering applied to it with `neighbor` gem.
    #
    # @param max_per_interview [Integer] maximum results to include per interview (oral_history_content_id)
    #
    # @param inner_limit [Integer] how many to OVER-FETCH in inner limit, to have enough even after
    #    applying max-per-interview.
    #
    # @return [ActiveRecord::Relation] that's been in a query to enforce max_per_interview limits. It does
    #   not have an overall limit set, caller should do that if desired, otherwise will be effectively
    #   limited by inner_limit.
    def wrap_relation_for_max_per_interview(base_relation:, max_per_interview:, inner_limit:)
      # In the inner CTE, have to fetch oversampled, so we can wind up with
      # hopefully enough in outer. Leaving inner unlimited would be peformance problem,
      # cause of how indexing works it doesn't need to calculate them all if limited.
      base_relation = base_relation.limit(inner_limit)

      # Now we have another CTE that assigns doc_rank within partitioned
      # interviews, from base. Raw SQL is just way easier here.
      partitoned_ranked_cte = Arel.sql(<<~SQL.squish)
        SELECT base.*,
          ROW_NUMBER() OVER (
            PARTITION BY document_id
            ORDER BY neighbor_distance
          ) AS doc_rank
        FROM base
      SQL

      # A wrapper SQL that incorporates both those CTE's, limiting to
      # doc_rank of how many we want per-interview, and overall making sure to
      # again order by vector neighbor_distance that must already have been included
      # in the base relation.
      base_relation.klass
        .select("*") # just pass through from underlying CTE queries.
        .with(base: base_relation)
        .with(partitioned_ranked: partitoned_ranked_cte)
        .from("partitioned_ranked")
        .where("doc_rank <= ?", max_per_document)
        .order(Arel.sql("neighbor_distance"))
    end

Like I said, I am new to this LLM stuff, curious what others have to say here.

The Kessler Syndrome / David Rosenthal

LEO in 2019 (NASA)

In 1978 Donald J. Kessler and Burton G. Cour-Palais published Collision Frequency of Artificial Satellites: The Creation of a Debris Belt. Wikipedia notes that:

It describes a situation in which the density of objects in low Earth orbit (LEO) becomes so high due to space pollution that collisions between these objects cascade, exponentially increasing the amount of space debris over time.

This became known as the Kessler Syndrome. Three decades later, shortly after Iridium 33 and Cosmos 2251 collided at 11.6km/s, Kessler published The Kessler Syndrome, writing that the original paper:

predicted that around the year 2000 the population of catalogued debris in orbit around the Earth would become so dense that catalogued objects would begin breaking up as a result of random collisions with other catalogued objects and become an important source of future debris.

And that:

Modeling results supported by data from USAF tests, as well as by a number of independent scientists, have concluded that the current debris environment is “unstable”, or above a critical threshold, such that any attempt to achieve a growth-free small debris environment by eliminating sources of past debris will likely fail because fragments from future collisions will be generated faster than atmospheric drag will remove them.

Below the fold I look into the current situation.

How Likely Is A Kessler Event?

Fast forward another 17 years and Hugh G. Lewis and Donald J. Kessler (in his mid-80s) recently published CRITICAL NUMBER OF SPACECRAFT IN LOW EARTH ORBIT: A NEW ASSESSMENT OF THE STABILITY OF THE ORBITAL DEBRIS ENVIRONMENT. Their abstract states that:

Using data from on-orbit fragmentation events, this paper introduces a revised stability model for altitudes below 1020 km and evaluates the March 2025 population of payloads and rocket stages to identify new regions of instability. The results indicate the current population of intact objects exceeds the unstable threshold at all altitudes between 400 km and 1000 km and the runaway threshold at nearly all altitudes between 520 km and 1000 km.

This and other recent publications attracted the attention not only of two well-known YouTubers, Sabine Hossenfelder and Anton Petrov, but also of me.

Lewis and Kessler's conclusion mirrors that of the ESA Space Environment Report 2025 from 1^st April, 2025 (my emphasis):

The amount of space debris in orbit continues to rise quickly. About 40,000 objects are now tracked by space surveillance networks, of which about 11 000 are active payloads.

However, the actual number of space debris objects larger than 1 cm in size – large enough to be capable of causing catastrophic damage – is estimated to be over 1.2 million, with over 50.000 objects of those larger than 10 cm.
...
The adherence to space debris mitigation standards is slowly improving over the years, especially in the commercial sector, but it is not enough to stop the increase of the number and amount of space debris.

Even without any additional launches, the number of space debris would keep growing, because fragmentation events add new debris objects faster than debris can naturally re-enter the atmosphere.

To prevent this runaway chain reaction, known as Kessler syndrome, from escalating and making certain orbits unusable, active debris removal is required.

Thiel et al Fig. 2

Another of the recent publications is Sarah Thiele et al's An Orbital House of Cards: Frequent Megaconstellation Close Conjunctions which focuses on the requirement for satellites to maneuver to avoid potential collisions, and what would happen if, for example, a solar storm disrupted the necessary command-and-control:

While satellites provide many benefits to society, their use comes with challenges, including the growth of space debris, collisions, ground casualty risks, optical and radio-spectrum pollution, and the alteration of Earth's upper atmosphere through rocket emissions and reentry ablation. There is potential for current or planned actions in orbit to cause serious degradation of the orbital environment or lead to catastrophic outcomes, highlighting the urgent need to find better ways to quantify stress on the orbital environment. Here we propose a new metric, the CRASH Clock, that measures such stress in terms of the timescale for a possible catastrophic collision to occur if there are no satellite manoeuvres or there is a severe loss in situational awareness. Our calculations show the CRASH Clock is currently 5.5 days, which suggests there is limited time to recover from a wide-spread disruptive event, such as a solar storm. This is in stark contrast to the pre-megaconstellation era: in 2018, the CRASH Clock was 164 days.

They estimate that:

In the densest part of Starlink’s 550 km orbital shell, we expect close approaches (< 1 km) every 22 minutes in that shell alone.

For the whole of Earth orbit they estimate the time between < 1 km approaches at 41 seconds.

Will Things Get Worse?

Nehal Malik's Space Is Getting Crowded: Starlink Dodged 300,000 Collisions illustrates the scale of the problem:

According to a recent report filed by SpaceX with the U.S. Federal Communications Commission, Starlink satellites performed roughly 300,000 collision-avoidance maneuvers in 2025 alone. The figures, first reported by New Scientist, offer a rare look at just how crowded low-Earth orbit has become — and how aggressively SpaceX is managing risk as its constellation scales.
...
On average, the 300,000 maneuvers worked out to nearly 40 avoidance actions per satellite last year. That number is rising quickly, with estimates suggesting Starlink could be performing close to one million maneuvers annually by 2027 if growth continues at its current pace.

While it is true that Starlink is careful:

What’s particularly notable is how conservative SpaceX’s approach is compared to the rest of the industry. While the typical standard is to maneuver when the risk of collision reaches one in 10,000, SpaceX reportedly initiates avoidance at a far lower threshold of roughly three in 10 million.

Nevertheless Starlink's rate of maneuvers is doubling every six months, which seems likely to force a less conservative policy. The average satellite is moving every 9 days. At this doubling rate, by the end of 2027 the average satellite would move about twice a day.

Starlink currently has over 10,000 satellites, with plans for 12,000 in the short term. I believe the collision probability goes as the square of the number, so that will mean moving on average every 6.25 days. Their eventual plan for 42,000 would mean twice a day, or in aggregate about one move per second.

In order to pump SpaceX/xAI/Twitter stock in preparation for a planned IPO, Musk recently pivoted from cars, weird pickup trucks, self-driving cars, robotaxis, humanoid robots and Mars colonization to data centers in space. He claimed that by 2031 SpaceX/xAI/Twitter would operate a million satellites forming a huge AI data center. Scaling up from the current maneuver rate gets you to about a move every 125ms in aggregate.

How Bad Would A Kessler Event Be?

My friend Robert Kennedy considers the implications of a Kessler event in low Earth orbit:

Obviously the national security repercussions for the western world, especially the U.S., would be severe with so many force multipliers going away at once. Presenting an opportunity for adversaries to attack us, maybe.

The overall global space market, presently ~$700B/yr & growing fast, would shrink dramatically. This contraction in turn would be amplified in the world's stock markets since space activity is central to so many Big Tech equities now, and space infrastructure is so deeply embedded many other enterprises' business models. ... Even modest P/E ratios suggest that an order of magnitude more, maybe two (~$10-100T) of paper wealth would disappear.

The space insurance market would collapse under the burden of covered claims. Re-insurers could not handle so much at once. Companies that chose to self-insure would probably go under after such a casualty. Without insurance, most enterprises could not afford to conduct space missions.

The space launch market would collapse, leaving only national launch capabilities maintained by individual nations for their individual non-market reasons. All those innovative rocket companies popping up to serve the mega-constellations would go away once their prime customers did. Global launch tempos would fall by more than half, from 200+/yr to well under 100/yr of a generation ago. Forget the $100 per kg that ... Starship was aiming for, price per kilogram would return to what it was 30 years ago, ~$10-20K/kg. Say goodbye to cheap rideshares to LEO. Even running the gauntlet thru LEO would be fraught, as the Chinese learned just a few months ago when their spacecraft was damaged by debris on the way up, necessitating the premature return of the undamaged pre-deployed spaceship to rescue the earlier crew.

Since 99% of Cubesats fly in LEO, the ecology of COTS parts that has sprung up to serve the Cubesat revolution would probably go away, or back into the garage at least. It might even disappear altogether if authorities of various spacefaring nations ban Cubesats. (Literally "throwing out the baby with the bathwater".) Don't underestimate the inherent conservatism of oligarchs to use a crisis to stomp on upstarts.

Can LEO Be Cleaned Up?

ClearSpace-1

In 2027 the ESA plans ClearSpace-1, an experimental mission to deorbit a dead satellite. The plan is to grab the satellite then retrofire. In principle this technique is a workable but expensive way to remove large targets before a collision fragments them, but it isn't viable for most of the results of a collision.

What Else Can Go Wrong?

The frenzy to exploit the commons of Low Earth Orbit doesn't just threaten to cut humanity off from space in general and the benefits that LEO can provide. The process of getting stuff up there and its eventual descent threatens to accelerate the process of trashing the commons of the terrestrial environment.

Going Up

Elon Musk's proposed one million satellite data center is estimated to require launching a Starship about every hour 24/7/365. Laura Revell et al's Near-future rocket launches could slow ozone recovery describes one problem:

Ozone losses are driven by the chlorine produced from solid rocket motor propellant, and black carbon which is emitted from most propellants. The ozone layer is slowly healing from the effects of CFCs, yet global-mean ozone abundances are still 2% lower than measured prior to the onset of CFC-induced ozone depletion. Our results demonstrate that ongoing and frequent rocket launches could delay ozone recovery. Action is needed now to ensure that future growth of the launch industry and ozone protection are mutually sustainable.

Black carbon heats the stratosphere, although the increasing use of methane reduces the amount emitted per ton of propellant. Each Starship launch uses about 4000 tons of LOX and about 1000 tons of methane. Assuming complete combustion, this would emit about 1,667 tons of CO2 into the atmosphere. So Musk's data center plan would dump about 17 megatons/year into the atmosphere, or about as much as Croatia.

Coming Down

All this mass in LEO will eventually burn up in the atmosphere. Jose Ferreira et al's Potential Ozone Depletion From Satellite Demise During Atmospheric Reentry in the Era of Mega-Constellations describes the effects this will have:

This paper investigates the oxidation process of the satellite's aluminum content during atmospheric reentry utilizing atomic-scale molecular dynamics simulations. We find that the population of reentering satellites in 2022 caused a 29.5% increase of aluminum in the atmosphere above the natural level, resulting in around 17 metric tons of aluminum oxides injected into the mesosphere. The byproducts generated by the reentry of satellites in a future scenario where mega-constellations come to fruition can reach over 360 metric tons per year. As aluminum oxide nanoparticles may remain in the atmosphere for decades, they can cause significant ozone depletion.

Can A Kessler Event Be Prevented?

Ozone Hole 10/1/83

Clearly, reducing the risk of a Kessler incident requires international cooperation. We have one somewhat successful example of a international cooperation to mitigate a similar "Tragedy of the Commons".Thirty-eight years ago the Montreal Protocol was agreed, phasing out the chemicals that destroy the ozone layer. Wikipedia reports that:

Due to its widespread adoption and implementation, it has been hailed as an example of successful international co-operation.

It has been effective:

Climate projections indicate that the ozone layer will return to 1980 levels between 2040 (across much of the world) and 2066 (over Antarctica).

But note that it will have taken almost 80 years from the agreement for the environment to recover fully. And that it appears to be the exception that proves the rule:

effective burden-sharing and solution proposals mitigating regional conflicts of interest have been among the success factors for the ozone depletion challenge, where global regulation based on the Kyoto Protocol has failed to do so.

Source

The Kyoto Protocol attempted to mitigate the effects of greenhouse gas emissions. Of particular importance was that the Montreal Protocol was an application of the Precautionary Principle because:

In this case of the ozone depletion challenge, there was global regulation already being implemented before a scientific consensus was established.
...
This truly universal treaty has also been remarkable in the expedience of the policy-making process at the global scale, where only 14 years lapsed between a basic scientific research discovery (1973) and the international agreement signed (1985 and 1987).

In 1.5C Here We Come I critiqued the attitudes of the global elite that have crippled the implementation of the Kyoto Protocol. I think it is safe to say that the prospect of applying the Precautionary Principle to Low Earth Orbit is even less likely.

🇮🇩 Open Data Day 2025 in Cianjur: Geospatial Data for Mangrove Rehabilitation / Open Knowledge Foundation

This text, part of the #ODDStories series, tells a story of Open Data Day‘s grassroots impact directly from the community’s voices. Bandung Mappers successfully carried out the Open Data Day 2025 activity on March 6 – 8 with the theme Coastal Resilience through Mangrove Rehabilitation which was held in Cianjur, West Java. This activity was...

The post 🇮🇩 Open Data Day 2025 in Cianjur: Geospatial Data for Mangrove Rehabilitation first appeared on Open Knowledge Blog.

🇹🇿 Open Data Day 2025 in Dodoma: Driving Urban Resilience With Open Data / Open Knowledge Foundation

This text, part of the #ODDStories series, tells a story of Open Data Day‘s grassroots impact directly from the community’s voices. The Open Data Day 2025 event in Dodoma brought together open data advocates, government entities, researchers, NGOs, and YouthMappers under the theme “Open Data for a Resilient Dodoma.” Hosted by OpenGeoCity Tanzania with support...

The post 🇹🇿 Open Data Day 2025 in Dodoma: Driving Urban Resilience With Open Data first appeared on Open Knowledge Blog.

🇳🇵 Open Data Day 2025 in Ilam: Co-Creating Solutions to the Polycrisis with Indigenous and Marginalised Communities / Open Knowledge Foundation

This text, part of the #ODDStories series, tells a story of Open Data Day‘s grassroots impact directly from the community’s voices. To celebrate Open Data Day 2025, as part of the Harnessing Opportunities to address Polycrisis through community Engagement (HOPE) project, the Nepal Institute of Research and Communications, in collaboration with the Ilam Municipality, organized a...

The post 🇳🇵 Open Data Day 2025 in Ilam: Co-Creating Solutions to the Polycrisis with Indigenous and Marginalised Communities first appeared on Open Knowledge Blog.

Weekly Bookmarks / Ed Summers

These are some things I’ve wandered across on the web this week.

🔖 theuse.info

A website and art project by Chris Mann.

🔖 How Etsy Uses LLMs to Improve Search Relevance

Search plays a central role in that mission. Historically, Etsy’s search models have relied heavily on engagement signals – such as clicks, add-to-carts, and purchases – as proxies for relevance. These signals are objective, but they can also be biased: popular listings get more clicks, even when they’re not the best match for a specific query.

To address this, we introduce semantic relevance as a complementary perspective to engagement, capturing how well a listing aligns with a buyer’s intent as expressed in their query. We developed a Semantic Relevance Evaluation and Enhancement Framework, powered by large language models (LLMs). It provides a comprehensive approach to measure and improve relevance through three key components:

High quality data: we first establish human-curated “golden” labels of relevance categories (we’ll come back to this) for precise evaluation of the relevance prediction models, complemented by data from a human-aligned LLM that scales training across millions of query-listing pairs Semantic relevance models: we use a family of ML models with different trade-offs in accuracy, latency, and cost; tuned for both offline evaluation and real-time search Model-driven applications: we integrate relevance signals directly into Etsy’s search systems enabling both large-scale offline evaluation and real-time enhancement in production

🔖 Understanding Etsy’s Vast Inventory with LLMs

While our powerful search and discovery algorithms can process unstructured data such as that in descriptions and listing photos, passing in long context and images directly to search poses latency concerns. For these algorithms, every millisecond counts as they work to deliver relevant results to buyers as quickly as possible. Spending time filtering through unstructured data for every query is just not feasible.

These constraints led us to a clear conclusion: to fully unlock the potential of all inventory listed on Etsy’s site, unstructured product information needs to be distilled into structured data to power both ML models and buyer experiences.

🔖 Unlocking the Codex harness: how we built the App Server

OpenAI’s coding agent Codex exists across many different surfaces: the web app⁠(opens in a new window), the CLI⁠(opens in a new window), the IDE extension⁠(opens in a new window), and the new Codex macOS app. Under the hood, they’re all powered by the same Codex harness—the agent loop and logic that underlies all Codex experiences. The critical link between them? The Codex App Server⁠(opens in a new window), a client-friendly, bidirectional JSON-RPC1 API.

In this post, we’ll introduce the Codex App Server; we’ll share our learnings so far on the best ways to bring Codex’s capabilities into your product to help your users supercharge their workflows. We’ll cover the App Server’s architecture and protocol and how it integrates with different Codex surfaces, as well as tips on leveraging Codex, whether you want to turn Codex into a code reviewer, an SRE agent, or a coding assistant.

🔖 OpenAI and Codex with Thibault Sottiaux and Ed Bayes

AI coding agents are rapidly reshaping how software is built, reviewed, and maintained. As large language model capabilities continue to increase, the bottleneck in software development is shifting away from code generation toward planning, review, deployment, and coordination. This shift is driving a new class of agentic systems that operate inside constrained environments, reason over long time horizons, and integrate across tools like IDEs, version control systems, and issue trackers.

OpenAI is at the forefront of AI research and product development. In 2025, the company released Codex, which is an agentic coding system designed to work safely inside sandboxed environments while collaborating across the modern software development stack.

🔖 Tetragrammaton: George Saunders

A (real, non-genAI) interview by Rick Rubin about everything but Saunders’ new book.

🔖 Little Atoms - 2 February 2026 (George Saunders)

A talk show about ideas and culture, produced and presented by Neil Denny. Each show features guests from the worlds of science or the arts in conversation. This week: George Saunders on his latest novel, Vigil.

🔖 Bridging the Data Discovery Gap: User-Centric Recommendations for Research Data Repositories

Despite substantial investment in research data infrastructure, data discovery remains a fundamental challenge in the era of open science. The proliferation of repositories and the rapid growth of deposited data have not resulted in a corresponding improvement in data findability. Researchers continue to struggle to find data that are relevant to their work, revealing a persistent gap between data availability and data discoverability. Without rich, high-quality metadata, robust and user-centred data discovery systems, and a deeper understanding of how different researchers seek and evaluate data, much of the potential value of open data remains unrealised.

This paper presents a set of practical, evidence-based recommendations for data repositories and discovery service providers aimed at improving data discoverability for both human and machine users. These recommendations emphasise the importance of 1) understanding the search needs and contexts of data users, 2) addressing the roles that data repositories play in enhancing metadata quality to meet users’ data search needs, and 3) designing discovery interfaces that support effective and diverse search behaviours. By bridging the gap between data curation practices, discovery system design, and user-centred approaches, this paper argues for a more integrated and strategic approach to data discovery.

🔖 blevesearch

A modern text/numeric/geo-spatial/vector indexing library for go

🔖 hister: Web history on steroids

Hister is a web history management tool that provides blazing fast, content-based search for visited websites. Unlike traditional browser history that only searches URLs and titles, Hister indexes the full content of web pages you visit, enabling deep and meaningful search across your browsing history.

🔖 Alphabet sells rare 100-year bond to fund AI expansion as spending surges

Feb 10 (Reuters) - Alphabet (GOOGL.O), opens new tab on Tuesday sold a rare 100-year bond, a memo from the lead manager showed, part of a $31.51 billion global bond raise, as artificial intelligence-driven spending sparks a surge in borrowing at U.S. tech giants. Alphabet’s sale of the century bond is the tech industry’s first since Motorola’s (MSI.N), opens new tab issuance that dates back to 1997, according to LSEG data.

🔖 The Eternal Mainframe

In the computer industry, the Wheel of Reincarnation is a pattern whereby specialized hardware gets spun out from the “main” system, becomes more powerful, then gets folded back into the main system. As the linked Jargon File entry points out, several generations of this effect have been observed in graphics and floating-point coprocessors.

In this essay, I note an analogous pattern taking place, not in peripherals of a computing platform, but in the most basic kinds of “computing platform.” And this pattern is being driven as much by the desire for “freedom” as by any technical consideration.

🔖 I Started Programming When I Was 7. I’m 50 Now, and the Thing I Loved Has Changed

The abstraction ship sailed decades ago. We just didn’t notice because each layer arrived gradually enough that we could pretend we still understood the whole stack. AI is just the layer that made the pretence impossible to maintain.

🔖 VIAF Governance Concerns about the Refurbished VIAF Web and API Interfaces

In January 2025, OCLC made significant changes to the web and application programming interfaces for Virtual International Authority File (VIAF) clusters. This article will compare the old and new interfaces, highlighting the pros and cons introduced and calling attention, especially, to critical errors introduced that compromise the functionality of much of the VIAF product. Consequently, it will raise questions and concerns regarding the governance of VIAF, as well as OCLC’s development model, testing, and feedback before public rollout.

🔖 JupyterLite

JupyterLite is a JupyterLab distribution that runs entirely in the browser built from the ground-up using JupyterLab components and extensions.

🔖 NINeS 2026: Contributed Talks

Economic and Human Factors in Internet Design David D. Clark (MIT)
Changing Internet Architecture: A Practical Perspective Jana Iyengar (Netflix) and Barath Raghavan (USC/Fastly)
Perspectives on Congestion Control Nandita Dukkiapti (Google) in conversation with Brighten Godfrey (UIUC)
From Networks in Practice to Networks in Principle Jennifer Rexford (Princeton) in conversation with Nate Foster (EPFL)
Networking Fabrics for ML: Hardware Advances and Research Questions Arvind Krishnamurthy (University of Washington)
Why I was Wrong About Quality-of-Service (QoS) Scott Shenker (UC Berkeley/ ICSI, Professional Dilettante)

🔖 Creativity in Conflict: A Multi-Level Exploration of Software Developers’ Capacity to Innovate

The software industry, historically driven by creativity, faces a paradox. While developers are drawn to intellectual challenges, their creativity is increasingly constrained by efficiency-driven methods and so-called productivity metrics. Although positioned as innovation engines, Agile software development (hereinafter referred to as Agile) and open-source software (OSS) approaches may prioritize incrementalism over transformative breakthroughs. This tension between structure and creativity threatens individual potential and the industry’s capacity for meaningful innovation. Without addressing this gap, contemporary development approaches may fail to support the creativity necessary for crafting novel and impactful software. This dissertation examines this gap, investigating how modern development approaches shape individual creativity into project-level innovation. Drawing on multi-level interactionist theories of creativity, we explore the conditions under which individual, team, and organizational interactions foster or constrain creative outcomes. By addressing this critical gap, our research reconceptualizes development methodologies as enablers of radical innovation rather than constraints, ensuring the industry’s continued creative and transformative impact. Using a sequential exploratory mixed-methods design, this dissertation integrates qualitative and quantitative techniques to analyze creativity within software development. The qualitative strand examines individual developer experiences through 31 semi-structured interviews with Agile practitioners. The quantitative strand assesses cognitive conflict’s impact on team performance in OSS development, analyzing 40 projects and 82,949 code commits. The mixed convergent strand evaluates corporate and open governance interplay, leveraging data from 40 projects, 10,862 releases, and 15 interviews. By synthesizing insights across these strands, this dissertation delivers theoretical contributions and actionable guidance for fostering creativity in software development. We challenge the myth of developers as lone “rockstars” or “hackers” by demonstrating the critical role of social interactions in shaping creativity and innovation. Empirical findings reveal that review-stage interactions—such as pull requests and code reviews—mediate and transition from creativity to innovation, while project governance moderates this relationship further. This dissertation highlights how individual, team, and organizational dynamics influence creative outcomes by operationalizing cognitive conflict and release commit novelty. These insights advance theoretical understanding and offer practical strategies for unlocking the innovative potential of contemporary development practices.

🔖 US Military Helping Trump to Build Massive Network of ‘Concentration Camps,’ Navy Contract Reveals

In the wake of immigration agents’ killings of three US citizens within a matter of weeks, the Department of Homeland Security is quietly moving forward with a plan to expand its capacity for mass detention by using a military contract to create what Pablo Manríquez, the author of the immigration news site Migrant Insider calls “a nationwide ‘ghost network’ of concentration camps.”

🔖 unmerdify

Get the content, only the content: unenshittificator for the web.

Broke / Ed Summers

How to help people find good reading on The Online Books Page (Some dos and don’ts) / John Mark Ockerbloom

The catalog of free online books and serials at The Online Books Page continues to grow and improve every week. You can watch our new books listings, or subscribe to our RSS feed, to see what books we’ve newly cataloged on nearly every business day. There’s also a lot more development of the collection that you don’t see there. Right now, for instance, we’re adding information about newly available issues of many of the serials we already list, after 1930 copyrights expired at the start of January. (We’ve also updated information on what copyright renewals for them are still in effect, and updated our guides on determining whether books or serials are in the public domain.). And throughout the year, we fix links that break, import records from various other online book projects into our extended shelves, and improve our metadata to make it easier for readers to discover and get to things they want to read online.

That’s a lot for one editor (with a number of other job responsibilities) to take care of. Much of what makes it possible for me to do it is what I hear from readers all over the world. They tell me what books and serials they’re particularly interested in reading or consulting, and I use what I’ve been told to add works I might not have otherwise considered, and provide better information about them. I find out what subjects, authors, and serials a broad and diverse community of readers wants to read more of. I also find out about things I might have to fix in my catalog, as links go bad, better information surfaces about what I list, and sites go down or withdraw free access to work they once provided.

So I appreciate hearing from readers to keep the catalog fresh and useful. Long ago, though, it became clear that I could never keep up with all the suggestions people sent me by email. I can reasonably keep up with them, though, when readers use the forms I’ve set up for suggestions and corrections, and follow some basic etiquette when they do so. Here’s how to do that:

To suggest a new listing, use our suggestion form. If you haven’t recently read them, look at the instructions for making suggestions and the tips on filling out the form that are linked at the top of the form. That will tell you, among other things, what qualifies for listing on the site and what does not, and how to best ensure a useful and timely new listing.

You should include only one book (or serial) in each suggestion. If it’s a multi-volume work, or you know of multiple sites that have it, you can include extra URLs for them in the “Anything else we should know about this suggestion?” comment box near the bottom of the form.

If you want to hear from us that we’ve listed your book (or get any other response, whether it’s an answer to a question you asked in the comment box, or a note explaining why we declined to list something you suggested), put your email address (and if you like, your name) in the designated space in the form. There’s also a box you can check so these will automatically be filled in for you if you make another suggestion later (assuming your browser allows cookies from our site).

If you find a book in our extended shelves (which has data automatically imported from other sites), and would like it “promoted” to the “curated” collection (whose listings appear more prominently on our site, and whose catalog records we’ve checked and often improved), you can click on the “request that we add this book” link on its catalog page. That will take you to our suggestion form, with some of the fields already filled out based on the automatically imported record. Make any changes or additions you see fit, and then push the “Submit this suggestion” button at the bottom of the form. Please only do this for books you’re actually interested in reading or consulting. We know the automatically imported data on the extended shelves can be messy, but there are well over 3 million records in our extended shelves, and rather than spend endless time cleaning them all up manually, we’d rather spend our time listing and improving listings for books that we know our readers want to find.

For similar reasons, unless you have a particular interest in specific editions of a popular work, it generally isn’t a good use of time to exhaustively suggest all of the different editions of that work. (Those can run into the dozens or even hundreds, in some cases.) Instead, focus on the editions that you find most useful. (That will also help other readers find the most useful editions of that work, without having to sort through lots of chaff, not knowing which of many alternatives to choose.)

To have us fix something wrong in a listing, use the “report a bad link” link to take you to our bad links reporting form. If you select this from a book’s catalog page, some of the boxes in the form will be pre-filled to let us know which book you found the problem with. If you select it from the bottom of the page elsewhere on the site, you may need to fill in details on which listing you found the bad link on. In either case, it can be useful to fill in the “What happened when you tried the link?” box (even it’s just to say something like “I got a blank page”), so we can better troubleshoot the problem. You can also report other problems with the listing, such as bad metadata, if it’s in our curated collection. (Don’t bother doing that for the automated listings in our extended shelves, unless it’s a book you’re interested in yourself. In that case, though, it’s better to use the suggestion form, often reachable via a “suggest we add this book” link.)

You can also use the “what happened?” contact box to tell us if the site the book is on seems to have disappeared, reorganized, or been taken over by someone else. When we get a bad link report like that, we’ll often check out other listings that link to the same site, and deal with all of them as needed, without you needing to individually report every listing

If the forms aren’t working for you, you can email us instead at the address given on our site. The forms generally work for most people, but there are times when they might be unavailable. (The usual reason for that these days is that the site occasionally gets over-burdened by automated programs aggressively crawling our site and other cultural heritage sites.) If you need to email us because a form isn’t working, we’ll be able to pick out your email for attention if you include on your subject line the code word currently mentioned on our page of how to suggest works.)

Give us your email address if you’re suggesting more than a few works before we get around to listing them. We support the right to read anonymously (and the accompanying right to anonymously ask for what you want to read). We also understand that sometimes a reader might have good reason to suggest a moderately long list of works, such as from a thematic reading list or bibliography. But anonymity and bulk requests don’t go together well.

If you leave your email address when you submit a sizable list of suggestions, we can work out good ways to manage their inclusion, and let you know if issues come up with the list, or if you need to ease off or use a different approach. But if you submit a lot of works anonymously, there’s no way for us to give you feedback, or even to know if there’s a real person who actually cares about all the books requested, and not just mechanically going through a list without particular interest or thought about its contents. Which leads to our last request:

Do not use “AI”, “agents”, chatbots, or other automated systems to fill out our forms or otherwise communicate with us. Heavily promoted technology now makes it possible for computers to automatically generate lots of text that appears useful, and to automatically submit it into forms like the ones we use for The Online Books Page. But just because computer code can now quickly request dozens, or hundreds, of books that match patterns found when crawling the Internet, it doesn’t mean those books are of much interest to any human using my site at the moment. If you’re a person who wants to read a book that is or could be online, you should be to be able to request it from us, and get a quick and useful response. But it’s hard to do that when our attention gets pulled away from interested readers by the demands generated by automated systems, or by others impersonally claiming disproportionate shares of personal attention.

We do have ways of dealing with abusive bulk and automated requests, as we have for years with spam of all sorts. I recently added added a warning to our request form that “anonymous bulk submissions may be deferred or discarded partially or entirely, at the editor’s discretion”. We will be doing more of that in the future as long as such submissions are common, and we will take other steps that prove necessary to counter abuse and thoughtless demands on people’s time.

We value our readers, and want to hear from you, personally. I love that people all over the world use The Online Books to find reading that informs, enlightens, and pleases them. I’m thankful that so many of you want to share that reading with others, and help improve the collection we offer here. As I’ve said earlier, I intend to offer works and services “created with care by people, processed with care by code”. I’d love to hear about works that you particularly care about, so we can offer them to others. Whether you’re suggesting a book or serial we should add, prompting us to fix a problem, or simply appreciating what you’ve found here, I hope you’ll reach out to us, or to the people who create the books and maintain the sites that we link to. I hope what I’ve written here encourages you to do that, and help you to be heard in return.

Come Join the 2026 Valentine Hunt! / LibraryThing (Thingology)

It’s (almost) February 14th, and that means the return of our annual Valentine Hunt!

We’ve scattered a punnet of strawberries around the site, and it’s up to you to try and find them all.

Decipher the clues and visit the corresponding LibraryThing pages to find a strawberry. Each clue points to a specific page right here on LibraryThing. Remember, they are not necessarily work pages!
If there’s a strawberry on a page, you’ll see a banner at the top of the page.
You have a little more than two weeks to find all the strawberries (until 11:59pm EST, Saturday February 28th).
Come brag about your punnet of strawberries (and get hints) on Talk.

Win prizes:

Any member who finds at least two strawberries will be
awarded a strawberry Badge ().
Members who find all 14 strawberries will be entered into a drawing for some LibraryThing (or TinyCat) swag. We’ll announce winners at the end of the hunt.

P.S. Thanks to conceptDawg for the love birds illustration!

2026-01-22: Paper Summary: "Towards a better QA process: Automatic detection of quality problems in archived websites using visual comparisons" / Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

Figure 1: Example of an image pair from Reyes Ayala’s dataset of a live web page (left screenshot) and an archived web page (right screenshot).

For the Game Walkthroughs and Web Archiving project, we have created web archiving livestreams (example livestream) that use tools to measure the performance of a web archive crawler during the livestream. In a previous blog post, I compared six different approaches for measuring web archiving and replay performance metrics so that I could identify tools that could be used during our future web archiving livestreams. In this post, I will summarize another related work (“Towards a better QA process: Automatic detection of quality problems in archived websites using visual comparisons”) by Brenda Reyes Ayala that determines the quality of an archived web page by comparing screenshots of the live web page and archived web page.

Reyes Ayala, B. (2025). Towards a better QA process: Automatic detection of quality problems in archived websites using visual comparisons. In M. Cornia et al. (Eds.). Proceedings of the 21st Conference on Information and Research science Connecting to Digital and Library science, 3937. Udine, Italy: CEUR Workshop Proceedings https://ceur-ws.org/Vol-3937/

Measuring Visual Correspondence With Web Archiving Screenshot Compare Tool

Reyes Ayala created the Web Archiving Screenshot Compare tool which assists with automating quality assurance by comparing screenshots of the live web page and the archived web page and determining the visual correspondence. The process for generating screenshots involves several steps. First, the tool reads the settings file that contains the seed list. For each seed, it checks if the web page exists and, if so, takes a screenshot. Next, the tool creates a CSV file with a list of the Archive-It URI-Ms associated with the archived versions of the seed. Then, the Web Archiving Screenshot Compare tool takes a screenshot of each archived web page. Finally, the URI-Ms and their screenshot file names are written to a CSV file.

After the screenshots are taken, the Web Archiving Screenshot Compare tool can use an image similarity metric to compute a score. Before computing a score, this tool checks if the live web page screenshot is not blank and then will crop a screenshot from the (live screenshot and archived screenshot) pair if both images are different sizes. After the score is computed it is output to a CSV file.

Determining the Effectiveness of the Image Similarity Metrics

The image similarity metrics supported by the Web Archiving Screenshot Compare tool are Structural Similarity Index (SSIM), Mean Squared Error (MSE), Normalized Root Mean Square Error (NRMSE), Perceptual Hash (P-Hash), Peak Signal to Noise Ratio (PSNR), and a percentage similarity metric that Reyes Ayala created. Three of these metrics (MSE, P-Hash, and PSNR) were discarded from her evaluation, because these metrics did not have an upper bound. The percentage similarity metric was also discarded, because it had a strong negative correlation with NRMSE.

The dataset that was used for the evaluation included 221 pairs of screenshots of the live and archived web pages. The archived web pages were from four Archive-It collections (Idle No More, Fort McMurray Wildfire 2016, Western Canadian Arts, and Government of Canada). After calculating the similarity scores on her dataset, she sent the screenshots to Amazon Mechanical Turk (AMT) so that she could compare the computed scores to reviewer scores. An example of an image pair that was shown to participants from Amazon Mechanical Turk is shown in Figure 1.

Reyes Ayala found that SSIM and NRMSE were able to detect high and low visual correspondence after performing statistical analysis using tests of significance. The metrics she used were one-way multivariate analysis of variance (MANOVA) and univariate analysis of variance (ANOVA). The scores for MANOVA, when using a combined dependent variable were: 𝐹(2, 222) = 44.95, 𝑝 < .001; Wilks’ 𝜆 = 0.71; Pillai’s trace = 0.29, partial 𝜂2 = 0.29 . The scores (with a Bonferroni 𝛼 adjusted level of .025) for the univariate ANOVAs were: SSIM: 𝐹(1, 223) = 10.53, 𝑝 = .001; partial 𝜂2 = 0.05 and NRMSE: 𝐹(1, 223) = 89.52, 𝑝 < .001; partial 𝜂2 = 0.29.

Kiesel et al. (“Reproducible web corpora: Interactive archiving with automatic quality assessment”) and Walsh et al. (“High Fidelity Web Archiving of News Sites and New Media with Browsertrix”) also determined the quality of an archived web page by performing screenshot comparisons. Kiesel et al.’s approach, which I summarized in a previous blog post, involves machine learning and uses a deep convolutional neural network named VGGNet, for image classification tasks. Walsh et al.’s approach, which I summarized in another blog post, used the Pixelmatch tool for screenshot comparison, measuring differences in color samples and the intensity difference between pixels.

Summary

Reyes Ayala created the Web Archiving Screenshot Compare tool, which is used to determine the quality of an archived web page by comparing the screenshots of the live web page and the archived web page. After creating a dataset of 221 pairs of screenshots and retrieving human review scores, she performed statistical analysis using tests of significance. She found that the Structural Similarity Index (SSIM) and Normalized Root Mean Square Error (NRMSE) were able to distinguish between high quality and low quality archived web pages.

For our web archiving livestreams, we currently measure the performance of the web archive crawler during the livestream and plan to measure replay performance during future livestreams. Writing this paper summary has helped with learning about another approach that could be used to measure visual correspondence during a web archiving livestream.

--Travis Reid (@TReid803)

2026-02-13: Paper Summary: "High Fidelity Web Archiving of News Sites and New Media with Browsertrix" / Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

Figure 1: The Browsertrix Tool Suite

For the Saving Ads project and the Game Walkthroughs and Web Archiving project, we have used Browsertrix Crawler, ArchiveWeb.page, and ReplayWeb.page, which are Webrecorder tools that were discussed in Walsh et al.’s paper “High Fidelity Web Archiving of News Sites and New Media with Browsertrix.” In this paper, Walsh et al. describe tools that are integrated with Browsertrix and the features that differentiated their tools from other web archive crawlers and replay systems. Browsertrix is a free and open-source web archiving platform that can be run locally, self-hosted, or used through Webrecorder's hosted service. Browsertrix uses Browsertrix Crawler for archiving web pages, ArchiveWeb.page to patch archived web pages, and ReplayWeb.page to replay archived web pages (Figure 1).

Walsh, Tessa, Henry Wilkinson, and Ilya Kreymer. “High Fidelity Web Archiving of News Sites and New Media with Browsertrix.” in Proceedings of the 2024 International Federation of Library Associations and Institutions (IFLA) International News Media Conference. https://repository.ifla.org/handle/20.500.14598/3399. 2024.

Browsertrix

Walsh et al. described some of the features that differentiate Browsertrix from other crawlers, such as its ability to use browser profiles to archive content behind log-ins and personalized social media feeds, ad blocking features, and automated interactions (behaviors) that can be executed during the crawling session.

Several Browsertrix features were described in this paper:

Features associated with collaboration, sharing, and improving transparency:

It is possible for multiple users to work together to create a web archive collection.
Browsertrix collections can be shared with others through an unlisted link or by making the collection public.

There is also an option that allows anyone to download the collection as a WACZ file.
A Browsertrix collection can be embedded into a web page by downloading its WACZ file and then using ReplayWeb.page to load the WACZ.

When archiving web pages, it is possible to view the crawler archiving web pages (Figure 2) and the user can modify the crawling workflow’s exclusion settings in real-time.

This feature makes it easier to avoid crawler traps.
Browsertrix uses Browsertrix Crawler’s feature for displaying the live web page during the crawling session. This Browsertrix Crawler feature is used by my web archiving livestream tool and it helps improve the transparency of the web archiving process, because it allows the viewers of our livestream to see the live web page during the crawling session.

Browsertrix can also digitally sign archived web pages to create a chain of custody. There is also an archival receipt mode that allows anyone to view the signing information.

Exclusion settings: Their exclusion settings allow for regular expressions to be used to exclude URLs from the crawl.
Patching a crawl: When an archived web page has missing resources, ArchiveWeb.page can be used to patch the crawl by manually archiving the web page and then uploading the WACZ file from ArchiveWeb.page to the Browsertrix collection.
Browser profiles and settings: Browsertrix allows users to customize cookies, login to accounts and save browser settings like ad blocking, cookie blocking, and other privacy settings for the Brave browser.
API: Browsertrix has an API that allows applications to be notified of when a crawl has started or stopped and this feature can be used to start a new crawl without having to manually start a crawl from Browsertrix’s UI.

Figure 2: Browsertrix’s Watch Crawl feature. This figure is a screenshot from a video that is embedded in Webrecorder’s blog post “Preserving Government Websites with Browsertrix.”

Embedding ReplayWeb.page

Walsh et al. also highlighted the benefits of using ReplayWeb.page to embed a collection (that is stored in a WACZ file) or the replay of an archived web resource (like a social media post) into a static HTML web page. This approach differs from MementoEmbed, which embeds a social card into a web page, providing information about the archived resource, like the memento-datetime and links to the live and archived versions of the resource. The advantages of ReplayWeb.page’s embed feature is that it reduces the cost and complexity of hosting the archived website since a database server is not required and it removes the need to update dependencies like JavaScript libraries.

For the Game Walkthroughs and Web Archiving project, we used this feature during Replay mode to embed the replay of an archived web page into a HTML page that was dynamically generated by my web archiving livestream tool (Figure 3). I also used this feature during the Saving Ads project when I created a web page to display all the archived web ads from our dataset (Figure 4).

Figure 3: Using ReplayWeb.page to embed the replay of an archived web page during Replay mode.

Figure 4: Using ReplayWeb.page to embed the replay of an archived web ad.

Quality Assurance

Browsertrix's automated quality assurance metrics were created to help with identifying problems with the crawl without needing to manually inspect the replay of the archived web page.

They use three performance metrics:

Extracted text comparison: Levenshtein distance is applied on the extracted text from the crawl and replay (Figure 5).

Figure 5: Extracted text comparison. This figure is a screenshot from Webrecorder’s video “High Fidelity Web Archiving with Browsertrix (SAA).”

Screenshot comparison: The Pixelmatch tool is used to compare the screenshots of the live and archived web page, by measuring the difference between color samples and the intensity difference between pixels (Figure 6).

Figure 6: Screenshot comparison. This figure is a screenshot from Webrecorder’s video “High Fidelity Web Archiving with Browsertrix (SAA).”

Resource count comparison: A table is displayed that includes the resource types and the counts of successful (2xx or 3xx) and unsuccessful (4xx or 5xx) HTTP status codes that occurred during crawl and replay.

Browsertrix also has features for the user to create, approve, and reject comments on each page and rate the quality of a crawl with a review score (Figure 7).

Figure 7: Browsertrix’s feature for reviewing a crawl. This figure is from Webrecorder’s blog post “Browsertrix 1.10: Now with Assistive QA!”

Kiesel et al. (“Reproducible web corpora: Interactive archiving with automatic quality assessment”) and Reyes Ayala (“Towards a better QA process: Automatic detection of quality problems in archived websites using visual comparisons”) have also compared screenshots of the live and archived web pages when determining the quality of the archived web page. Kiesel et al.’s approach, which I summarized in a previous blog post, used machine learning and used a deep convolutional neural network named VGGNet, for image classification tasks. Reyes Ayala’s approach, which I also summarized in a previous blog post, involved using popular image similarity metrics like Structural Similarity Index (SSIM) and Normalized Root Mean Square Error (NRMSE) when comparing the screenshots of the live web page and archived web page.

Summary

In Walsh et al.’s paper, they discussed the features of multiple Webrecorder tools (Browsertrix, Browsertrix Crawler, ArchiveWeb.page, and ReplayWeb.page). Browsertrix is their web archiving platform that uses Browsertrix Crawler for archiving web pages, ArchiveWeb.page for patching web pages, and ReplayWeb.page to replay archived web resources. Browsertrix Crawler is a browser-based web archive crawler that can use browsing profiles to archive web pages while being logged into an account. Using browsing profiles also allows for the customization of cookies and browser settings like ad blocking, cookie blocking, and other privacy settings. ArchiveWeb.page is a browser extension that can be used to manually archive web pages and can also patch crawls from a Browsertrix collection. ReplayWeb.page is a replay system that can be used to embed the replay of an archived website into a static HTML page, which reduces the complexity of hosting an archived website.

Walsh et al. also described how Browsertrix can be used to assist with quality assurance for web archive collections. They added features that allow a user to create, approve, and reject comments on each archived web page and rate the quality of the web page with a review score. Browsertrix can also compute three quality assurance metrics, which involve extracted text, screenshots, and HTTP status codes for resources. For extracted text, the Levenshtein distance is applied to the extracted text from the crawl and replay. For the screenshot comparison of the live and archived web pages, Pixelmatch is used to measure the difference between color samples and the intensity difference between pixels. For the resource count comparison, Browsertrix displays a table with the resource types and the counts of successful (2xx or 3xx) and unsuccessful (4xx or 5xx) HTTP status codes that occurred during crawl and replay.

For the Saving Ads project and the Game Walkthroughs and Web Archiving project, we have used most of the Webrecorder tools discussed in this paper. In the Saving Ads project, we used ArchiveWeb.page and Browsertrix Crawler to archive web pages that dynamically loaded ads. After we created a dataset of archived web ads, we then used ReplayWeb.page to embed the replay of the ads from our dataset into a static HTML page. For the Game Walkthroughs and Web Archiving project, Browsertrix Crawler was used to archive web pages during some web archiving livestreams. ReplayWeb.page was then used during Replay mode to embed the replay of the archived web pages during these livestreams.

--Travis Reid (@TReid803)

This is a post about LLM coding agents / Andromeda Yelton

When I was in my classics MA program, I took a course on Latin prose composition.

A thing to know about classics degrees is you spend a lot of your time in Latin and ancient Greek courses. You learn history and philosophy and so forth, but mostly via reading those works in their original languages. Another thing to know about classics degrees is that ancient language pedagogy is not like modern language pedagogy. In a modern language class, hopefully¹, you split your time among listening, speaking, reading, and writing exercises. These are all different activities that require overlapping but distinct skills and exercise your brain in different ways, and you need all of them to operate comfortably in a place where that language predominates. In an ancient language class, though, you’re unlikely² to be spending much time in such a place, so your time is focused almost entirely on reading. Traditionally this looks like spending a year, more at the K12 level, speedrunning All The Grammar so that you can approach original texts, because nearly all the original texts we have are written at a pretty high level, so you need all the grammar first. And because the authors of those original texts have generally been dead for millennia, it can seem like a pretty one-way colloquy.³

Anyway, I thought it would be fun to take a class on Latin prose composition. My program didn’t regularly offer one but the nerdiest professor was absolutely thrilled about the chance to offer it as an independent study, taught from a slim, dense volume whose very typography promised that it was going to kick your ass. (It delivered.) I tried to convince some of my fellow students to join me but they all knew that producing text is considerably harder than recognizing it and I couldn’t sell any of them on it, not even K., a brilliant Latinist who was probably the strongest in the program.

It was genuinely astonishing how much better I got at Latin through this class. I learned far more about the language than I have in any other course. There really is a difference between knowing how to parse some grammar when you see it and having enough of the grammar paged into your head that you can sort through the options, identify which one is appropriate to communicate your desired phrase, and maybe, if you can keep your head above water long enough, to even pick the one with the best style. It wasn’t enough to understand how others had used these same tools; I had to bend the shape of my mind until they fit comfortably-enough for my own use, master enough of them to have a range of options for a range of situations, and develop judgment and taste. I am smarter and have more tools at my ready disposal because of this class.

This is a post about LLM coding agents.

This is snarky about my middle and high school Spanish classes. ︎
It’s not impossible! The conventiculum is a thing. I once spent a week on Nantucket where we’d all taken a language pledge to only use Latin, where I discovered that I could easily have said “I am getting the swords because Gauls are invading the island”, but I actually needed to say “I am setting this alarm on my phone for 7am so we all have time to take showers”. I have rarely been more mentally exhausted in my life. ︎
It’s not, if you do it right; you can have conversations in your head if you approach it with the right combination of imaginative empathy and rigor, and indeed this is the greatest gift of a classics major. ︎

2026-02-12: How to archive web pages in bulk using the Internet Archive Google Sheets service / Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

Results of archiving web pages using the Wayback Machine Google Sheets service

One of the greatest services the Internet Archive (IA) offers is the Save Page Now (SPN) service. It is easy to use and it was completely overhauled in 2019 with added features. However, the UI is limited to archiving a single URL at a time. To overcome this limitation, the IA launched the Google Sheets service, which allows you to submit a Google Sheet with the first column full of URLs (up to 5,000 URLs), and it will archive them all in the Wayback Machine.

I needed to archive 1.5 million URLs that I collected from four Arabic and English news websites published between 1999 and 2022. This is obviously not doable one URL at a time, but the IA's Google Sheets service makes it possible. because I can archive 5,000 URLs at a time and up to 30.000 URLs (six sheets) per day.

The first step is to create a Google sheet and paste the URLs you need to archive in the first column (one URL in each row). I wrote a python script to create and populate multiple Google Sheets with URLs saved in a text file to save myself from having to create tens of Google Sheets and copy-paste 5,000 URLs in each on of them. The script takes the input file, text file that has URLs (one in each row), and creates sheets with the URLs (5000 URLs per sheet).

The input Google sheet with URLs

To begin using Save Page Now for Google Sheets log into your IA account or create an account (free) if you don't have one (you can use your Google account to sign in instead).

Next, go to the Wayback-GSheets service page and log in with your Google account that you used to create your spreadsheet and grant it the necessary permissions.

Wayback-GSheets service page and Google account login

If the sheet is ‘view only’ for your account, request edit-access to your account or make a copy of it. This will allow IA to log import-errors to the sheet via your Google account and populate the columns to your Google sheet.

Once you’ve authenticated, you’ll see three large green buttons: Save Page(s) Now, Include Links to Wayback Machine archives, and Check if URLs are available in the Live Web. Click on the first button from the top (Save Pages Now).

Save Pages Now in the Wayback Machine GSheets service

Paste the URL of your spreadsheet with links in the Google Spreadsheet URL box.
As far as options, check and uncheck the checkbox next to each option based on your needs.

“Capture outlinks” box tells the service to archive the outlinks in each URL in the sheet.
"Capture screen shot" tells the service to capture a screen shot of each page in the sheet.
"Save results in a new sheet" tells the service to keep your Google sheet unchanged and create a new sheet and save the results in the new sheet.
You can keep the “Capture only if not archived within 6 hours” option enabled or change it as needed.
You can keep the “Delay the availability of new captures for ~10 hours” option enabled.

Finally, click the green “Archive” button.

Options

Options page in the Wayback Machine GSheets service

You will receive an email that the IA is processing your sheet, and a link to watch the progress (it is the same page you landed on when you hit the "Archive" button minus the "Abort" button).

Progress page in the Wayback Machine GSheets service

You will also get another email when the job is done. The spreadsheet will be updated with info about the status of all URLs in the sheet.

The Internet Archive Google Sheets Service will populate the columns in your sheet with Wayback Machine capture information next to each URL. The columns that get added are:

B: Yes/No (indicates whether or not the URL has been archived by the IA in the past)

C: The URL of the last archived copy if column B has Yes (empty if No)

D: The URL of the archived copy if IA was able to archive it (error message if not)

E: The number of outlinks captured in the web page

F: The archive the copy is archived for the first time (the same information in column B)

Results of archiving web pages using the Wayback Machine Google Sheets service

Notes:

It is important to understand that it can take multiple days/weeks for new captures to show up in the Wayback Machine, so don’t worry if URLs you’ve captured aren’t available in the Wayback Machine after processing your sheet is finished. The delay in showing up in the Wayback Machine after the processing is finished depends on the current load on it.

If some URLs were not successfully archived (produced errors), copy these URLs to a new sheet, and submit the new sheet with just the URLs that weren’t captured. Do not resubmit the old sheet with all the URLs after it’s done. You can save them to a new tab instead, but you have to make that new tab the first tab in the sheet. This is because if you are using tabs in your Google sheets, the Wayback-GSheets service will only process the first tab so make sure to organize tabs accordingly.

Again, the service only allows 5,000 URLs per sheet, and each account is only allowed 30,000 captures per day.

Hussam Hallak

Author Interview: Janie Chang / LibraryThing (Thingology)

LibraryThing is pleased to sit down this month with best-selling Taiwanese Canadian author Janie Chang, whose works of historical fiction draw upon her family history in pre-World War II China. After taking a degree in computer science, and then graduating from the Writer’s Studio Program at Simon Fraser University, Chang made her authorial debut in 2013 with the novel Three Souls, which was shortlisted in the fiction category for the BC and Yukon Book Prizes. Subsequent titles include Dragon Springs Road (2017), longlisted for the International Dublin Literary Award; The Library of Legends (2020), nominated for the Evergreen Award; The Porcelain Moon (2023); and The Phoenix Crown (2024), which was co-written with Kate Quinn. Chang was the founder of the Authors for Indies event, running from 2015-17, which eventually became Canadian Independent Bookstore Day. Her sixth book, The Fourth Princess: A Gothic Novel of Old Shanghai—available this month as an Early Reviewer giveaway—was published earlier in February. Chang sat down with Abigail this month to discuss the book.

How did the idea for The Fourth Princess first come to you? Many of your books are described as being inspired by your family history. Do you have a family connection to this tale as well?

The Fourth Princess came about purely from a desire to challenge myself by writing in the Gothic vein, moving away from historical to a different genre. So alas, there aren’t any fascinating family connections to this tale.

Your story is set in 1911, in “Old Shanghai.” Did you need to do any kind of research about the history of the city during that period? What were some of the most interesting things you learned?

You should never ask a historical novelist about interesting things learned. You’ll end up with a 12-page essay! I knew that Shanghai had entire neighborhoods of Western-style homes, often called “garden villas.” Many of those homes are still there. What I did not realize was that there were also huge estates outside what was then the city center, owned by the wealthiest families, both foreign and Chinese. They occupied properties as large as 10 acres. The mansion that inspired Lennox Manor in the novel was called Dennartt, built in 1898 by a British barrister. It had a huge garden, lawns, a manmade lake, stables for polo ponies and living quarters for the grooms and house servants. Dennartt still stands, surrounded by apartments and houses instead of lawns and rose gardens and tennis courts.

I also learned that there were electric cars back then! For a while, both internal combustion gas engines and electric engine vehicles were available to consumers. Gas engines were difficult and dangerous to crank up, the emissions were dirty, but could drive farther. Electric vehicles were easy to start and clean to drive, and advertisements aimed them at women for city driving. But once a reliable ignition system for gas engines was invented, electric vehicles lost popularity. In the novel, I have an American import a car for his wife, so that’s the reason behind that particular rabbit hole. And in the end, he did not import an electric car.

Many of your earlier novels feature a fantastical element, from the ghost in Three Souls to an animal spirit in Dragon Springs Road. What role does the fantastic play in The Fourth Princess, and how does it help you to tell your story?

There is the possibility of a ghost. As the servants in the story explain to Lisan, one of the main characters, a previous owner committed suicide in Lennox Manor. Chinese superstitions say that the ghost of a suicide is the worst kind there is because they’re trapped in the afterlife, unable to move on to reincarnation unless they find a replacement. They need to drive another person to suicide, usually through madness. In Gothic novels, there’s always a strong element of psychological fear as well as real danger, so when Lisan sees or thinks she sees a woman in red outside in the garden, are her eyes playing tricks on her? When she hears wailing and sobbing at night, is it the supernatural or just the wind funneling down chimneys and cracks?

This new book addresses the meeting of East and West, both through the characters of Lisan Liu and Caroline Stanton, and in the use of a Gothic literary aesthetic more often associated with Europe. Can you expand upon that? What significance does it have?

It’s absolutely true that “traditional” Gothic novels favor European settings in a remote location, preferably with bad weather. The essential elements of Gothic, however, are portable: a setting that oozes menace and unease, a young woman who discovers a terrible secret and finds herself in danger. In transposing classic Gothic tropes to an Asian setting, it was important for me to do so in a way that was plausible and unique to this time and place.

One of the themes in The Fourth Princess is that of identity. Both Caroline and Lisan have a hidden past. Once these are revealed, what do they do, what are they willing to risk, who should they become? For me, a Shanghai setting made it absolutely necessary to have both Chinese and Western heroines because the city was a bizarre mix of East and West.

Tell us a little bit about your writing process. Do you have a particular schedule or routine that you keep to, a specific spot where you like to write? Do you map your story out ahead of time, or discover it as you go along?

First, I have to write out a summary of the story plus the historical events and background that are the setting, just to stay anchored. Over time, I’ve found myself putting more effort into mapping out the story because it helps get over the sagging middle part of a novel. It’s no fun getting stuck in the middle of the story because it makes you doubt whether the story is worth writing at all.

For schedule, I down two cups of coffee and then get to writing. The main thing is to write every day, even if you’re not happy with it. You need to make progress on the story and remember that the next step is revision. One of the best pieces of advice I ever heard was that “revision” is “re-vision.” When you revise, you are re-visioning the story.

It would be nice if the story moved along according to plan, but as a storyteller, you need to be open to opportunities. You run across a tidbit of research that adds authenticity or detail or insight to the story and you make changes. Then there are the times when the characters themselves are a discovery, when they start telling you who they are and their real motivations. Those are the best moments in the writing process, and make up for all the other hours of agony.

What comes next for you? Do you have any new books you’re working on?

I’m currently researching a new book, nothing announced yet. However, I will be co-authoring again with Kate Quinn on a novel that we’ll start working on this summer. It’s working title is The Jade Mirror and we call it an adventure on the high seas, about two women whose nautical achievements have been largely forgotten.

Tell us about your library. What’s on your own shelves?

I love historical fiction and speculative fiction, and it shows. I also enjoy mystery and crime. There’s one shelf reserved for children’s books that I refuse to throw out. The Narnia series, the Doctor Dolittles, and so on. I have a weakness for cookbooks with nice photos. And I have a section of shelves that hold research books.

What have you been reading lately, and what would you recommend to other readers?

I’ve been reading the Claire North trilogy The Songs of Penelope: Ithaca, House of Odysseus, The Last Song of Penelope. When Odysseus and all the able bodied men of Ithaca went off to the Trojan War, the only people left on the island were women, children, and old men. As queen, Penelope still had to keep the economy going, maintain the security of her island nation, all the while fending off suitors. This is her story and it’s funny and snarky, intelligent, told from the point of view of the women of Ithaca, and it’s about geopolitics.

I highly recommend this series. In fact, I highly recommend anything by Claire North.

When I surrendered to the current / Meredith Farkas

One author whose newsletter I read avidly is Dr. Zed Zha. In Ask the Patient, she writes about holding on to humanity and humility as a health care provider and is an inspiration to someone like me who has experienced very few caring and invested doctors. The other day, she wrote a note about how she and some of the other female healthcare workers in her clinic came to work in spite of illnesses and family issues because they essentially felt that their brittle system would collapse without them. And frankly, in many cases, given how thinly things are staffed these days, they’re right. She wrote:

Does a workplace function without women who deprive themselves to keep it together?

Does the world run without women who place everyone else first?

Does the sky fall if women stop breaking their backs to hold it up?

I think we all know the answers.

But here’s the question we haven’t allowed ourselves to ask:

What happens when women, collectively, decide to stop giving ourselves to systems that feel entitled to our sacrifice?

It’s something I’ve been thinking about a lot lately. Look closely at most libraries, and you’ll probably find a critically important service, program, or activity that is literally sustained by a single worker’s (let’s face it, probably a woman’s) passion. Whether it’s tutorials, assessment, OER and textbook affordability, copyright, outreach and marketing, scholarly communications, instructional technology, or something else, these things usually start as the passion of an individual library worker and grow into something essential to the institution. Sometimes, that growth happens without that work ever becoming part of the library worker’s formal job description and without release from any of their other regular duties that already made up a full workload. Some institutions might create a coordinator role to support this work, which, while not ideal (see the Library Loon’s old writings on coordinator syndrome), at least provides the library worker with time. Those who continue this work with no release from other job duties scrounge for interstitial moments to invest themselves in this work they are so passionate about or they let the work bleed into their free time – frequently the latter.

Lots of library workers have passions that are relevant to their jobs, but rarely do those become an important part of the library’s work. I used to have a colleague who was obsessed with podcasting. He would implore faculty to create podcasting assignments rather than traditional papers and pushed podcasting in committees he served on. Another colleague did amazing work with faculty around Reading Apprenticeship. While I learned quite a bit from her work that informed my teaching, it did not end up integrated into our information literacy instruction program. Early in my career, I was very passionate about social media (I shudder to think of that today), but I always saw it as something I did “on the side.” While I did implement some social media tools in my first job, it was never integral to the library and it also wasn’t time-consuming by any stretch.

These other things are different. They become essential parts – sometimes even required parts, like in the case of assessment work – of the library’s portfolio of work. They are often valued by faculty and students. They are frequently even seen as important by library administrators. Yet somehow they are not important enough to support in any concrete way. Because no one has to. Because they have someone dedicated enough to twist themselves into a pretzel and overwork to make it happen.

Those who’ve known me a long time will know that supporting online learners has been my specialty and passion since I entered this field. I started creating video tutorials in 2004 before I even had a professional position and it helped me get my first academic job as a distance learning librarian. I have a lot of experience and expertise in online instruction, both as a librarian and an online LIS instructor. While my role since 2014 has been more of a generic “liaison librarian/reference and instruction librarian,” I was still for many years the default “tutorial librarian” at work because it’s what I’m good at and passionate about. Early on, I got grant funding to build tutorials with a team (though I did all the technical parts of building them) and while it paid for some of my time to create the tutorials, later, I was expected to maintain them. And, as I found time, I continued building general-use tutorials to support our students and my colleagues’ teaching, even though it wasn’t explicitly my job to do so and I didn’t get release time for the work. I did it because there was a need and I wanted to be useful, I loved the work, and I was also surrounded by plenty of colleagues who also normalized overworking in the absence of support. Sometimes, I’d beg for people to cover a couple of my reference shifts to give me a few extra hours – like when a writing instructor asked me to build a Google Scholar tutorial and I wanted to make a good general one that would work in any discipline – but mostly, I just let the work bleed into my personal time. And I recognize that I did this to myself because I love the creativity of online instructional design work and I care deeply about online learners. It was a choice to overwork like this.

I always thought that I could make my colleagues recognize the importance of this work by making the work important; by showing how integral it could be to academic curricula. I shared one department’s data that found that students in Anthropology 101 who completed a tutorial I created for their final research assignment scored 50% better on an outcomes assessment of their term papers than 101 students who did not. I remember thinking at the time, wow! This data is irrefutable! Information literacy tutorials can really impact student learning in a major way! This is going to change everything! And nothing changed. Having a library video tutorial with over 58,000 views didn’t matter either. People appreciated the tutorials I made and would tell me that they believed that the work was important, but never to the point where I or anyone else would be given the time to focus on it. And given the amount of time it takes to build and maintain quality tutorials, the work was simply unsustainable given the other things I was expected to do.

After getting turned down when asking for just 32 hours of release time over an entire academic year to work on building a design philosophy, best practices, and plan for supporting the work of tutorial development and maintenance across the library, it became obvious to me that my boss and our department would never support making time for this work so long as I continued to squeeze the work into my already overstuffed job. So I stepped away from it and set healthier boundaries. I would only continue to maintain tutorials that were explicitly for my liaison areas, like our PsycINFO video and the interactive tutorials I made for classes in the social sciences. When we had to move to the new EBSCO database interface, I told my boss that I could only create a new version of my Academic Search Complete video with explicit release time. She never responded (?!) so I never updated it. After faculty complaints a few months later, the work was forced on a non-faculty colleague – and good friend – in Digital Services, which was shitty and made me feel shitty too. After a lot of conversations about our frustrations, she has also let go of a lot of responsibilities and set healthier boundaries, which I applaud. So who will update these tutorials next time around? Who knows!

The thing I constantly tell myself – and others – is that it’s not an individual worker’s responsibility to fill gaps created by poor leadership or organizational dysfunction. If a leader is going to brag to their superiors about the impact of our tutorials but then do nothing to provide support for their creation and maintenance, the workers shouldn’t feel like it’s their responsibility to create or maintain them. If one’s colleagues link to your tutorials and share them with students but never identify the creation of tutorials as a departmental priority (and I’m not blaming them; the issue feels far more systemic), then it’s not worth twisting yourself up into a pretzel and giving up personal time in an effort to make them. The cognitive dissonance was exasperating. Though I knew my stepping back wasn’t likely to change anything, it was an effort to stop enabling the dysfunction and to stop hurtling towards burnout.

Sometimes you have to surrender yourself to the reality of the situation instead of constantly fighting against the current. Did it suck? Yes. Did I feel like I was abandoning students when I set those boundaries? 100%. Did I also feel guilt because I knew some colleagues were still overworking in their un-or-under-recognized roles? Sure, though I’m cognizant of the fact that I can’t control the choices of others, only mine. But was building tutorials the only way for me to help students? Of course not. Surrendering isn’t giving up. Instead of fighting the current, you find a way to move with it. Because the other thing I always tell myself is that no matter how much we work, there will always be unmet needs in our communities. I say that to myself to discourage overworking, but also, it means that there are always other ways to make a difference.

Since then, I’ve found places to contribute that will have a major impact on our students and aren’t bleeding me dry (yet?). I’ve taken a leadership role in our collections work and have been leading several projects to support our students for whom English is not their first language. Last year we did a needs assessment of Spanish-speakers (the College is moving towards becoming a Hispanic-Serving Institution), bought a ton of popular reading materials in Spanish, and translated a lot of our instructional materials into Spanish including building LibGuides in Spanish (after creating this and another guide in Spanish, I realized that this was another place where I had to set a firm boundary because I was doing a lot of translation work that really should not be my job – old habits die hard). I’m also leading a project to create World Languages collections at each of our campus libraries.

Like tutorials work, this is a labor of love, but it’s not a solo project. It can’t happen without collaboration with technical and access services. I love this work because I’ve gotten to collaborate with folks in every area of the library as well as people outside who support international students. I love playing the role of project manager and cheerleader. We were just awarded an LSTA mini-grant to help build our collection and I’m excited to get experience buying books in non-English languages beyond Spanish. Is this project adding to my workload? Yes. But because it’s more recognized than tutorials work as a “library project,” I’ve been better able to say no to other work. And, unlike tutorials, there isn’t as much time pressure. Other than fiscal deadlines, the work can much more easily be spread out and done in free moments here and there. I’m really glad I could find a space where I could make a difference and the work is collaborative, challenging, creative, valuable, and sustainable.

The Buddhists say that pain is inevitable, but suffering is optional. I know that well as someone with chronic pain and illness (and I practice a lot of surrendering to how I am feeling day-to-day as my health upends my best laid plans), but I also feel it very much with the choices I make at work. The first step is acknowledging that I do have choices. Knowing when to keep pushing against the current and when to surrender is an art and something most of us struggle with. Part of it is recognizing the things that we can’t control or even influence and letting go of the idea that if we grind hard enough, impress the powers-that-be enough, things will change. Stop offering free labor. Invest yourself in ways that enrich you rather than deplete you and set boundaries that keep you from overworking.

If you are toiling in an informal and unsupported role at your library, I see you. I know that there may be reasons some of you can’t give up what you’re doing, like being on the tenure track, worrying about not getting merit raises, etc., though even for you, setting boundaries around other work you do (projects and committees) might still be possible. But for those who are less constrained, choosing not to do something that is burning you out and isn’t supported isn’t a betrayal of your patrons. If this work is that important, it’s your institution that has betrayed patrons by not adequately supporting you. It’s not the job of an individual worker to prop up a dysfunctional system. It’s not the job of the worker to continue sacrificing themselves on the altar of vocational awe (Rest in Power, Fobazi Ettarh, and thank you). Your job is not entitled to more than your contract or job description promises. If there are not enough hours in the day to do everything, then everything can’t get done. The role of a manager is to help their direct reports prioritize their work and figure out what not to do. If something is important enough, they need to find a way to make sure it gets done with the resources (people and funds) they have.

What I really hate is the toll that all of this takes on the passionate, dedicated worker and on the organization when that bright light finally burns out. I was recently in a meeting of an assessment-related college committee I have been on for about 9 years. Some of the faculty who have been part of this group the whole time have gone from being the most active and passionate contributors to our college to jaded, angry, and burned out husks of their former selves. One of the people on the committee previously was a major leader in this work and put so much into it because she believed in it. Her department has 1/3 the number of full-time faculty that they had when she started at the college, yet they are still expected to do the same level of service and to meet all the administrative demands for assessment work and other administrivia (actually there’s more administrivia these days) that they did 15 years ago. After years of this, plus the complete disdain many of our administrators have shown toward faculty and staff, she now sees the assessment work we do as meaningless box-checking and suggests putting as little effort as possible into it. I can’t disagree, not at all, and I feel the same way about the college at this point, but it breaks my heart to see people I thought of as the most brilliant, passionate stars of the College burned to a crisp by such mistreatment and the expectation that we should keep filling these holes that administrators purposely create.

I have another colleague who does so much library outreach on social media and elsewhere on top of her regular job duties. She is amazing at it – a natural marketing whiz. She’s a brilliant, funny, and engaging communicator in ways I couldn’t ever hope to be. She told me she hopes that our Dean will make it her job and give her release time to do it and it hurts my heart because I know it won’t happen. While I truly love working with her (she is a JOY!) it would serve us right if some other library scooped her up to run their library marketing and outreach program. And I wonder when she will tire of doing this work that isn’t part of her job description and that she isn’t given release time to do. I hope it happens before she burns out because I’d hate for her to lose her passion and her shine. We all deserve to keep (or get back) our shine.

Ask yourselves, what happens when ~~women~~ we, collectively, decide to stop giving ourselves to systems that feel entitled to our sacrifice? What would surrendering look like in your worklife?

🔖 Yubikey-Guide

🔖 The Dangerous Illusion of AI Coding? - Jeremy Howard

🔖 Crawl entire websites with a single API call using Browser Rendering

🔖 Cultural Heritage and AI: How Institutions Can Reclaim Control of Their Data

🔖 Piotr Woźniak

🔖 Poetica

🔖 Cloudflare - Edward Wang & Kevin Guthrie, Software Engineers

🔖 Jon Leidecker / Wobbly

🔖 Forevergreen

🔖 Pierre Schaeffer

Cyberdyne Systems Corporation

Board Confidential

IT Infrastructure

Terrestrial

Orbital

System Penetration Capabilities

Zero-Days

Decryption

Blackmail

Public Relations

Human-in-the-Loop Problem

Assassination Weapons Access

Tactical Weapons Access

Strategic Weapons Access

Risks

Update 14th March 2026

Crawling

Results

🔖 Negativeland: Live at Norfolk, VA (Lewis’)

🔖 Paul Avrich

🔖 Alexander Berkman

🔖 On Method: How This Blog Works

🔖 Amores Perros

🔖 Deadly Iranian strike changes Purim for Haredi enclave in Beit Shemesh

🔖 Keenious

🔖 Wikidata:Wikibase GraphQL

🔖 Re-OCR Your Digitised Collections for ~$0.002/Page

🔖 Lawyers, Humility, and LLMs

🔖 My Coworkers Don’t Want AI. They Want Macros

🔖 Snapicat

🔖 Open Historical Map

🔖 SEASON: A letter to the future

🔖 Iran war heralds era of AI-powered bombing quicker than ‘speed of thought’

🔖 Wikidata:WikiProject PCC EMCO Wikidata CoP

🔖 John Fahey Mix Tapes

This month’s news:

This month’s open DLF group meetings:

Get Involved / Connect with Us

OCLC Research leadership insights: Real-time insight for real-world decisions

A complementary approach to longstanding research practices

Powered by OCLC’s global membership network

What this means for library leaders

Why we cite

Why fabricated citations are bad

A false nexus

"Safety monitors" less safe than "drivers"

Tesla's Catch-22

🔖 Arke

🔖 Community Calendar

🔖 Engineering Rigor in the LLM Age

🔖 Wikipedia blacklists Archive.today, starts removing 695,000 archive links

🔖 Megalodon (website)

🔖 Exclusive: US plans online portal to bypass content bans in Europe and elsewhere

🔖 How An Academic Library Built a Research Impact and Intelligence Team

🔖 Annotorious

🔖 Potomac Interceptor Collapse

🔖 Inside Claude Code With Its Creator Boris Cherny

🔖 Current

🔖 Phantom Obligation

🔖 ways of working with the Wayback Machine - studio and book talk in Amsterdam

🔖 Black Jesus

🔖 Oral History of John Backus

🔖 FreeBSD Mastery: Advanced ZFS

🔖 disko-zfs: Declaratively Managing ZFS Datasets

🔖 Level of Detail

🔖 I Sold Out for $20 a Month and All I Got Was This Perfectly Generated Terraform

🔖 Poor Deming never stood a chance

🔖 Emily St. John Mandel

🔖 Deb Olin Unferth

🔖 Citational Politics and Justice: Introduction

Update 14^th March 2026