<![CDATA[The Quantified Uncertainty Research Institute Blog]]>https://quantifieduncertainty.org/https://quantifieduncertainty.org/favicon.pngThe Quantified Uncertainty Research Institute Bloghttps://quantifieduncertainty.org/Ghost 6.22Fri, 13 Mar 2026 22:17:23 GMT60<![CDATA[Upcoming Workshops: Automated Research Wikis with Claude Code]]>I’ve been using Claude Code to automate wiki and book production. Lately, it’s become surprisingly straightforward to generate useful, many-page research documents, especially when paired with online document libraries.

If you’re in the Bay Area, I’m running two workshops soon:

]]>
https://quantifieduncertainty.org/posts/upcoming-workshops-automated-research-wikis-with-claude-code/697bbe39f4721d0001c8d3daThu, 29 Jan 2026 20:47:15 GMT

I’ve been using Claude Code to automate wiki and book production. Lately, it’s become surprisingly straightforward to generate useful, many-page research documents, especially when paired with online document libraries.

If you’re in the Bay Area, I’m running two workshops soon:

If you’re considering EAG, apply soon: the deadline is Feb 1 (this Sunday).

These workshops are geared towards philanthropic and technical research automation.

From the event description:


Claude Code can now automate substantial portions of research documentation, generating mini-textbooks, interactive wikis, and structured knowledge bases from your research questions. This workshop teaches you how.

Participants will work on creating their own research mini-website during the session, focusing on topics in prioritization research, EA cause areas, or related domains. Building a finished website would take longer, but this should be enough for a decent start.

I'll demo some related projects I've recently made, including:

  • Delegation Risk: A structured mini-textbook with theory, case studies, and interactive tools (https://delegation-risk-framework.vercel.app/)
  • CAIRN: A comprehensive AI safety knowledge navigator covering risks, interventions, organizations, and key uncertainties (https://ea-crux-project.vercel.app/)
]]>
<![CDATA[Opinion Fuzzing: A Proposal for Reducing & Exploring Variance in LLM Judgments Via Sampling]]>Summary
LLM outputs vary substantially across models, prompts, and simulated perspectives. I propose "opinion fuzzing" for systematically sampling across these dimensions to quantify and understand this variance. The concept is simple, but making it practically usable will require thoughtful tooling. In this piece I discuss what opinion fuzzing

]]>
https://quantifieduncertainty.org/posts/opinion-fuzzing-a-proposal-for-reducing-exploring-variance-in-llm-judgments-via-sampling/6946eb3ec04366000191219aSat, 20 Dec 2025 18:32:24 GMTSummary
LLM outputs vary substantially across models, prompts, and simulated perspectives. I propose "opinion fuzzing" for systematically sampling across these dimensions to quantify and understand this variance. The concept is simple, but making it practically usable will require thoughtful tooling. In this piece I discuss what opinion fuzzing could be and show a simple example in a test application. 

LLM Use
Claude Opus rewrote much of this document, mostly from earlier drafts. It also did background research, helping with the citations.

Introduction

LLMs produce inconsistent outputs. The same model with identical inputs will sometimes give different answers. Small prompt changes produce surprisingly large output shifts.[1] If we want to use LLMs for anything resembling reliable judgments (research evaluation, forecasting, medical triage), this variance is a real hindrance.

We can't eliminate variance entirely. But we can measure it, understand its structure, and make better-calibrated judgments by sampling deliberately across the variance space. That's what I'm calling "opinion fuzzing."

The core idea is already used by AI forecasters. Winners of Metaculus's AI forecasting competitions consistently employed ensemble approaches. The top performer in Q4 2024 (pgodzinai) aggregated 3 GPT-4o runs with 5 Claude-3.5-Sonnet runs, filtering the two most extreme values and averaging the remaining six forecasts. The Q2 2025 winner (Panshul42) used a more sophisticated ensemble: "sonnet 3.7 twice (later sonnet 4), o4-mini twice, and o3 once."

Survey data from Q4 2024 shows 76% of prize winners "repeated calls to an LLM and took a median/mean." The Q2 2025 analysis found that aggregation was the second-largest positive effect on bot performance. This basic form of sampling across models demonstrably works.

What I'm proposing here is a more general, but very simple, framework: systematic sampling not just across models, but across prompt variations and simulated perspectives, with explicit analysis of the variance structure rather than just averaging it away. The goal isn't simply to take a mean; it's also to understand a complex output space.

The Primary Technique

The basic approach is simple: instead of a single LLM call, systematically sample across:

  • Models (Claude, GPT-5, Gemini, Grok, etc.)
  • Prompt phrasings (4-20 variations of your question)
  • Simulated personas (domain expert, skeptic, generalist, leftist, etc.)

Then analyze the distribution of responses. This tells you:

  1. Inter-model agreement levels
  2. Sensitivity to prompt phrasing
  3. Persona-dependent biases (does the "expert" persona show different biases than the "skeptic"?)
  4. Which combinations exhibit unusual behavior worth investigating
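The grid-sampling loop above can be sketched in a few lines. This is a hedged sketch, not a production client: `query_llm` is a hypothetical stand-in that returns clamped Gaussian noise so the example runs without API keys; a real version would call each provider's API.

```python
import itertools
import random
import statistics

# Hypothetical stand-in for a real LLM call; returns a noisy probability
# so the sketch is runnable without API access.
def query_llm(model, prompt, persona, rng):
    return min(1.0, max(0.0, rng.gauss(0.6, 0.1)))

def fuzz(models, prompts, personas, runs=1, seed=0):
    """Sample the full model x prompt x persona grid, keeping labels."""
    rng = random.Random(seed)
    samples = []
    for model, prompt, persona in itertools.product(models, prompts, personas):
        for _ in range(runs):
            samples.append({"model": model, "prompt": prompt, "persona": persona,
                            "answer": query_llm(model, prompt, persona, rng)})
    return samples

samples = fuzz(["claude", "gpt"], ["phrasing-1", "phrasing-2"], ["expert", "skeptic"])
print(len(samples))  # 2 models x 2 prompts x 2 personas = 8 samples
print(statistics.median(s["answer"] for s in samples))
```

Keeping the labels on each sample (rather than discarding them and averaging) is what makes the later variance analysis possible.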

Hypothetical Example: Forecasting US Solar Capacity

To illustrate the approach, here's what the workflow might look like:

Single-shot approach:

User: "Will US solar capacity exceed 500 GW by 2030?"

Claude: "Based on current growth trends and policy commitments, this seems

likely (~65% probability). Current capacity is around 180 GW with annual

additions accelerating..."

This seems reasonable, but how confident should you actually be in this estimate?

Opinion fuzzing approach:

  1. Generate 20 prompt variations:
    • "What's the probability that US solar capacity exceeds 500 GW by 2030?"
    • "Given current trends, will the US reach 500 GW of solar by 2030?"
    • "An analyst asks: is 500 GW of US solar capacity by 2030 achievable?"
    • "Rate the likelihood of US solar installations exceeding 500 GW by decade's end"
    • [16 more variations]
  2. Test across 5 models: Claude Sonnet 4.5, GPT-5, Gemini 3 Pro, etc.
  3. Sample 4 personas per model:
    • Energy policy analyst with 15 years experience
    • Climate tech investor
    • DOE forecasting model
    • Renewable energy researcher
  4. Run 400 queries (20 prompts × 5 models × 4 personas)
  5. Hypothetically, analysis might reveal:
    • Median probability: 62%
    • Range: 35-85%
    • GPT-5 + "policy analyst" persona consistently lower (~45%)
    • Prompt phrasing "is achievable" inflates estimates by ~12 percentage points
    • 4 outlier responses suggest >90% probability (investigating these reveals they assume aggressive IRA implementation)

Result: More calibrated estimate (55-65% after adjusting for identified biases), plus understanding of which factors drive variance.
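The analysis step in the 400-query grid above amounts to grouping the labeled samples by each factor and comparing the groups. A minimal sketch, using a small hypothetical result set (the models, personas, and numbers here are illustrative, not real data):

```python
import statistics
from collections import defaultdict

# Hypothetical fuzzing results as (model, persona, probability) tuples;
# real data would come from the full query grid described above.
samples = [
    ("gpt-5", "policy analyst", 0.45), ("gpt-5", "investor", 0.62),
    ("claude", "policy analyst", 0.60), ("claude", "investor", 0.65),
    ("gemini", "policy analyst", 0.58), ("gemini", "investor", 0.93),
]

def median_by(samples, key_index):
    """Median probability grouped by one factor (0 = model, 1 = persona)."""
    groups = defaultdict(list)
    for s in samples:
        groups[s[key_index]].append(s[2])
    return {k: statistics.median(v) for k, v in groups.items()}

def outliers(samples, threshold=0.90):
    """Flag extreme responses worth manual investigation."""
    return [s for s in samples if s[2] > threshold]

print(median_by(samples, 0))  # per-model medians
print(median_by(samples, 1))  # per-persona medians
print(outliers(samples))      # extreme responses to investigate
```

The same grouping works for prompt phrasings, which is how an effect like the "+12 points from 'is achievable'" bias above would surface.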

The 50-point range matters. If you're making investment decisions, policy recommendations, or AI scaling infrastructure plans that depend on electricity availability, that range completely changes your analysis.

Adaptive Sampling: A Speculative Extension

The naive approach samples uniformly. But we're already using LLMs. Why not use one as an experimental designer?

Proposed workflow:

  1. User poses question
  2. Meta-LLM (e.g., Claude Opus 4.5) receives budget of 400 queries
  3. Phase 1: Broad sampling (50 queries across full space)
  4. Phase 2: Meta-LLM analyzes Phase 1, identifies anomalies
    • "Claude shows consistently higher estimates with policy analyst persona"
    • "Prompt phrasing about 'achievability' produces systematic upward bias"
  5. Phase 3: Targeted experiments to understand anomalies (300 queries)
  6. Phase 4: Meta-LLM produces report with confidence intervals, identified biases, and recommendations

This could be more sample-efficient when you care about understanding the variance structure, not just getting a robust average.
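A minimal sketch of the adaptive idea, with the "meta-LLM" reduced to a simple variance heuristic (a real version would have an LLM design the Phase 3 experiments; the stubbed `query` function and the two-model setup are assumptions for illustration):

```python
import random
import statistics

# Stubbed LLM call: model "b" is deliberately noisier, standing in for a
# model/persona cell with unexplained variance.
def query(model, rng):
    spread = {"a": 0.02, "b": 0.25}[model]
    return min(1.0, max(0.0, rng.gauss(0.6, spread)))

def adaptive_sample(models, budget, broad_frac=0.2, seed=0):
    """Phase 1: broad uniform sampling. Phase 2-3: spend the remaining
    budget on the most variable cell."""
    rng = random.Random(seed)
    per_model = int(budget * broad_frac) // len(models)
    results = {m: [query(m, rng) for _ in range(per_model)] for m in models}
    target = max(models, key=lambda m: statistics.stdev(results[m]))
    remaining = budget - per_model * len(models)
    results[target] += [query(target, rng) for _ in range(remaining)]
    return target, results

target, results = adaptive_sample(["a", "b"], budget=100)
print(target)                                  # the noisier model gets follow-up
print(sum(len(v) for v in results.values()))   # total stays within budget: 100
```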

When This Is Worth The Cost

Do this when:

  • Stakes are high (medical decisions, important forecasts, research prioritization)
  • Single-point estimates seem unreliable
  • The results will be made public to many people
  • You need to defend the judgment to others
  • Understanding variance structure matters (e.g., for future calibration)

Don't do this when:

  • You just need a quick sanity check
  • Budget is tight and stakes are low
  • The question is purely factual (just look it up)

Based on current pricing with 650 input tokens and 250 output tokens per small call (roughly 500 words input, 200 words output):

Model             | Input ($/M tok) | Output ($/M tok) | 400 calls (650 in / 250 out tokens)
Claude Opus 4.5   | $5.00           | $25.00           | ~$3.80
Claude Sonnet 4.5 | $3.00           | $15.00           | ~$2.28
GPT-5             | $1.25           | $10.00           | ~$1.33
GPT-4o            | $5.00           | $20.00           | ~$3.30
Gemini 3 Pro      | $2.00           | $12.00           | ~$1.72
DeepSeek V3.2     | $0.26           | $0.39            | ~$0.11

For many use cases, even $1-4 per judgement is reasonable. For high-volume applications, mixing cheaper models (DeepSeek V3.2, GPT-5, Gemini 3 Pro) with occasional frontier model validation (Claude Opus 4.5, Claude Sonnet 4.5) keeps costs manageable while maintaining quality for critical queries.
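The arithmetic behind those batch figures is straightforward: prices are USD per million tokens, so a batch costs calls × (input tokens × input price + output tokens × output price) / 1,000,000.

```python
# Batch cost given per-million-token prices and per-call token counts.
def batch_cost(input_price, output_price, calls=400, in_tok=650, out_tok=250):
    return calls * (in_tok * input_price + out_tok * output_price) / 1_000_000

print(batch_cost(5.00, 25.00))  # Claude Opus 4.5: 3.8
print(batch_cost(1.25, 10.00))  # GPT-5: 1.325 (~$1.33 in the table)
print(batch_cost(0.26, 0.39))   # DeepSeek V3.2: ~0.11
```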

An Example Application

I’ve started work on one tool to test some of these ideas. It runs the same question through a bunch of different LLMs and then plots the results. For each run, it asks for a simple “Agree vs. Disagree” score and a “Confidence” score.

Below is a plot for the question, “There is at least a 10% chance that the US won't be considered a democracy by 2030 according to The Economist Democracy Index.” The dots represent the stated opinions of different LLM runs. 

LLMs on “There is at least a 10% chance that the US won't be considered a democracy by 2030 according to The Economist Democracy Index.”

I had Claude Code run variations of this in different settings, essentially performing a version of the adaptive sampling discussed above. It showed that this article updated the opinions of many LLMs on this question. Some comments on that article were critical, but the LLMs didn’t seem very swayed by them.

This tool is still in development. I’d want it to be more flexible to enable opinion fuzzing with 50+ queries per question, but this will take some iteration.

Some noted challenges:

  1. It’s hard to represent and visualize the corresponding data. This tool uses a simpler setup than full opinion fuzzing, but it's still tricky.
  2. This requires complex and lengthy AI workflows, which can be a pain to create and optimize.

Limitations and Open Questions

This doesn't fix fundamental model capabilities. Garbage in, variance-adjusted garbage out. If no model in your ensemble actually knows the answer, you might get a tight distribution around the wrong answer.

Correlated errors across models matter. Common training data and RLHF procedures mean true independence is lower than it appears.

One massive question mark is what background research to do on a given question. If someone asks, "Will US solar capacity exceed 500GW by 2030?", a lot of different kinds of research might be done to help answer that. Opinion Fuzzing does not answer this research question, though it can be used to help show sensitivity to specific research results.

Personas are simulated and may not capture real expert disagreements. This needs empirical testing before I'd recommend making it a core part of the methodology.

 

Thanks to Deger Turan for comments on this post.


 

[1] Sclar et al. (2024, arXiv:2310.11324) documented performance differences of “up to 76 accuracy points” from formatting changes alone on LLaMA-2-13B.

]]>
<![CDATA[New Collaboration: Shallow Review of Technical AI Safety, 2025]]>We recently collaborated with the Arb Research team on their latest technical AI safety review. This document provides a strong overview of the space, and we built a website to make it significantly more manageable.

The interactive website: shallowreview.ai

The review examines major research directions in technical AI safety

]]>
https://quantifieduncertainty.org/posts/new-collaboration-shallow-review-of-technical-ai-safety-2025/694339af130fee0001cd51a1Thu, 18 Dec 2025 00:06:47 GMT

We recently collaborated with the Arb Research team on their latest technical AI safety review. This document provides a strong overview of the space, and we built a website to make it significantly more manageable.

The interactive website: shallowreview.ai

New Collaboration: Shallow Review of Technical AI Safety, 2025

The review examines major research directions in technical AI safety as of early 2025, including mechanistic interpretability, scalable oversight, and various alignment approaches. It's designed as an accessible entry point for researchers wanting to understand the current landscape.

One particular challenge for the website was differentiating the numerous sections. The field has substantial parallel work happening across many domains, making it difficult to maintain orientation. We color-coded each section for distinctiveness and removed visual clutter to keep the core content in focus.

The site also features a table view and a customizable cluster diagram.


This represents our continued work supporting epistemics infrastructure around the AI safety community. We hope people find it useful!

]]>
<![CDATA[Announcing RoastMyPost]]>Today we're releasing RoastMyPost, a new application for blog post evaluation using LLMs.

TLDR

  • RoastMyPost is a new QURI application that uses LLMs and code to evaluate blog posts and research documents.
  • It uses a variety of LLM evaluators. Most are narrow checks: Fact Check,
]]>
https://quantifieduncertainty.org/posts/announcing-roastmypost/69336d36eb49ec0001350352Wed, 10 Dec 2025 22:44:41 GMT

Today we're releasing RoastMyPost, a new application for blog post evaluation using LLMs.

TLDR

  • RoastMyPost is a new QURI application that uses LLMs and code to evaluate blog posts and research documents.
  • It uses a variety of LLM evaluators. Most are narrow checks: Fact Check, Spell Check, Fallacy Check, Math Check, Link Check, Forecast Check, and others.
  • Optimized for EA & Rationalist content with direct import from EA Forum and LessWrong URLs. Other links use standard web fetching.
  • Works best for documents of roughly 200-10,000 words with simple formatting. It can also do basic reviewing of Squiggle models. Longer documents and documents in LaTeX will experience slowdowns and errors.
  • Open source, free for reasonable use[1]. Public examples are here
  • Experimentation encouraged! We're all figuring out how to best use these tools.
A representative illustration

How It Works

  1. Import a document. Submit markdown text or provide the URL of a publicly accessible post.
  2. Select evaluators to run. A few are system-recommended. Others are custom evaluators submitted by users. Quality varies, so use with appropriate skepticism.
  3. Wait 1-5 minutes for processing. (potentially more if the site is busy)
  4. Review the results.
  5. Add or re-run evaluations as needed.

Screenshots

Reader Page

The reader page is the main article view. You can toggle different evaluators; each has its own set of inline comments.

Announcing RoastMyPost

Editor Page

Add/remove/rerun evaluations and make other edits.

Announcing RoastMyPost

Posts Page

Announcing RoastMyPost

Current AI Agents / Workflows

Fact Check
Verifies the accuracy of facts. Looks up information with Perplexity, then forms a judgement. Limitations: often makes mistakes due to limited context; often limited to narrow factual disputes; can quickly get expensive, so we only run it a limited number of times per post.

Spell Check
Finds spelling and grammar mistakes. Runs a simple script to decide on UK vs. US spelling, then uses an LLM to find spelling and grammar mistakes. Limitations: occasionally flags other sorts of issues, like math mistakes; often incorrectly flags UK vs. US spelling differences.

Fallacy Check
Flags potential logical fallacies and similar epistemic issues. Uses a simple list of potential error types, with Sonnet 4.5, then does a final filter and analysis. Limitations: overly critical; sometimes misses key context; doesn't do internet searching; pricey.

Forecast Check
Finds binary forecasts mentioned in posts and flags cases where the result is very different from what the author stated. Converts forecasts to explicit forecasting questions, then sends these to an LLM forecasting tool that uses Perplexity searches and multiple LLM queries. Limitations: limited to binary percentage forecasts, which are fairly infrequent in blog posts; has limited context, so sometimes makes mistakes; uses a very simple prompt for forecasting.

Math Check
Verifies straightforward math equations. Attempts to verify math results using Math.js, falling back to LLM judgement. Limitations: mainly limited to simple arithmetic expressions; doesn't always trigger where it would be best; few posts have math equations.

Link Check
Detects all links in a document and checks that a corresponding website exists. Uses HEAD requests for most websites, and the API for EA Forum and LessWrong posts (but not yet for other content like tag or user pages). Limitations: many websites block automated requests like this; also, this only checks that a website exists, not that its content is relevant.

EA Epistemic Auditor
Provides some high-level analysis and a numeric review. A simple prompt that takes in the entirety of a blog post; doesn't do internet searching; limited to 5 comments per post. Limitations: it's fairly rough and could use improvement.

Is it Good? 

RoastMyPost is useful for knowledgeable LLM users who understand current model limitations. Modern LLMs are decent but finicky at feedback and fact-checking. The false positive rate for error detection is significant. This makes it well-suited for flagging issues for human review, but not reliable enough to treat results as publicly authoritative.

Different checks suit different content types. Spell Check and Link Check work across all posts. Fact Check and Fallacy Check perform best on fact-dense, rigorous articles. Use them selectively.

Results will vary substantially between users. Some will find workflows that extract immediate value; others will find the limitations frustrating. Performance will improve as better models become available. We're optimistic about LLM-assisted epistemics long-term. Reaching the full vision requires substantial development time.

Consider this an experimental tool that's ready for competent users to test and build on.

What are Automated Writing Evaluations Good For?

Much of our focus with RoastMyPost is exploring the potential of automated writing evaluations. Here's a list of potential use cases for this technology.

RoastMyPost is not yet reliable or mature enough for all of this. Currently it handles draft polishing and basic error detection decently, but use cases requiring high-confidence results (like publication gatekeeping or public trust signaling) remain aspirational.

1. Individual authors

  • Draft polishing: Alice is writing a blog post and wants it to be sharper and more reliable. She runs it through RoastMyPost to catch spelling mistakes, factual issues, math errors, and other weaknesses.
  • Public trust signaling: George wants readers to (correctly) see his writing as reputable. He runs his drafts through RoastMyPost, which verifies the key claims. He then links to the evaluation in his blog post, similar to Markdown Badges on GitHub or GitLab. (Later, this could become an actual badge.)

2. Research teams

  • Publication gatekeeping: Sophie runs a small research organization and wants LLMs in their quality assurance pipeline. Her team uses RoastMyPost to help evaluate posts before publishing.
  • LLM-assisted workflows: Samantha uses LLMs to draft fact-heavy reports, which often contain hallucinated links and mathematical errors. She builds a workflow that runs RoastMyPost on the LLM outputs and uses the evaluations to drive automated revisions.

3. Readers

  • Pre-flight checks for reading: Maren is a frequent blog reader. Before investing time in a post, they check its public RoastMyPost evaluations to see whether it contains major errors.
  • Deeper comprehension and critique: Chase uses RoastMyPost to better understand the content they read. They can see extra details, highlighted assumptions, and called-out logical fallacies, which helps them interpret arguments more critically.

4. Researchers studying LLMs and epistemics

  • Model comparison: Julian is a researcher evaluating language models. He runs RoastMyPost on reports produced by several models and compares the resulting evaluations.
  • Meta-epistemic insight: Mike is interested in how promising LLMs are for improving researcher epistemics. He browses RoastMyPost evaluations and gets a clearer sense of current strengths and limitations.

Privacy & Data Confidentiality

Users can make documents public or private.

We use a few third-party providers that require access to data, primarily Anthropic, Perplexity, and Helicone. We don't recommend using RoastMyPost in cases where you want strong guarantees of privacy.

Private information is accessible to our team, who will occasionally review LLM workflows to look for problems and improvements.

Technical Details

Most RoastMyPost evaluators use simple programmatic workflows. Posts are split into chunks, then verification and checking runs on each chunk individually.

LLM functionality and complex operations are isolated into narrow, independently testable tools with web interfaces. This breaks complex processes into discrete, (partially) verifiable steps.

Almost all LLM calls are to Claude Sonnet 4.5, with the main exception of calls to Perplexity via the OpenRouter API. We track data with Helicone.ai for basic monitoring.

Here you can see fact checking and forecast checking running on one large document. Evaluators run checks in parallel where possible, significantly reducing processing time.
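The chunk-then-check pattern can be sketched as follows. This is a simplified illustration, not RoastMyPost's actual code: the word-based splitter, the placeholder check functions, and the 200-word chunk size are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(text, max_words=200):
    """Split a post into fixed-size word chunks for independent checking."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def spell_check(chunk):  # placeholder for a real LLM-backed check
    return {"check": "spell", "issues": []}

def link_check(chunk):   # placeholder for a real link-verification check
    return {"check": "link", "issues": []}

def run_checks(text, checks):
    """Run every check on every chunk, in parallel where possible."""
    chunks = split_into_chunks(text)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(check, chunk) for chunk in chunks for check in checks]
        return [f.result() for f in futures]

results = run_checks("word " * 450, [spell_check, link_check])
print(len(results))  # 3 chunks x 2 checks = 6 results
```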

Announcing RoastMyPost

This predefined workflow approach is simple and fast, but lacks some benefits of agentic architectures. We've tested agentic approaches but found them substantially more expensive and slower for marginal gains. The math validation workflow uses a small agent; everything else is direct execution. We'll continue experimenting with agents as models improve.

Building Custom Evaluators

The majority of RoastMyPost's infrastructure is general-purpose, supporting a long tail of potential AI evaluators.

Example evaluator ideas:

  1. Organization style guide checker - Enforce specific writing conventions, terminology, or formatting requirements
  2. Domain-specific fact verification - Medical claims, economic data, technical specifications, etc.
  3. Citation format validator - Check references against specific journal requirements (APA, Chicago, Nature, etc.)
  4. Argument structure analyzer - Map claims, evidence, and logical connections
  5. Readability optimizer - Target specific audiences (general public, technical experts, policymakers)

The app includes basic functionality for creating custom evaluators directly in the interface. More sophisticated customization is possible through JavaScript-based external evaluators.

If you're interested in building an evaluator, reach out and we can discuss implementation details.

Try it Out

Visit roastmypost.org to evaluate your documents. The platform is free for reasonable use and is being improved.

Submit feedback, bug reports, or custom evaluator proposals via GitHub issues or email me directly.

We're particularly interested in hearing about AI evaluator quality and use cases we haven't considered.


[1] At this point, we don't charge users. Users have hourly and monthly usage limits. If RoastMyPost becomes popular, we plan on introducing payments to help us cover costs.

]]>
<![CDATA[Beyond Spell Check: 15 Automatable Writing Quality Checks]]>I've been developing RoastMyPost (currently in beta) and wrestling with how to systematically analyze documents. The space of possible document checks is vast, easily thousands of potential analyses.

Building on familiar concepts like "spell check" and "fact check," I've made a taxonomy

]]>
https://quantifieduncertainty.org/posts/beyond-spell-check-15-automatable-writing-quality-checks/6903bfda645a350001203ecdSat, 01 Nov 2025 16:00:51 GMTI've been developing RoastMyPost (currently in beta) and wrestling with how to systematically analyze documents. The space of possible document checks is vast, easily thousands of potential analyses.

Building on familiar concepts like "spell check" and "fact check," I've made a taxonomy of automated document checks. These are designed primarily for detail-heavy blog posts (particularly EA Forum and LessWrong content).

This framework organizes checks into three categories:

  • Language & Format: Basic writing quality and presentation
  • External References: Links, images, and source validation
  • Content Accuracy: Factual correctness and logical consistency

We've implemented a few of these in RoastMyPost already, with more under consideration. The checks listed here are deliberately practical; straightforward to implement, relatively uncontroversial, and technically feasible with current tools.

A caveat: these checks are necessary but not sufficient for good writing. They catch mechanical and factual errors but can't evaluate argumentation quality, insight, or persuasiveness. Think of them as generalized automated proofreading, not a substitute for thoughtful writing and editing.

Language & Format

Spell Check

Well understood, so not too hard to do a basic job; LLMs can do a somewhat more advanced one. One challenge for English text is choosing between UK and US conventions, since some authors mix the two.

Importance: Low | Challenge: Low | Subjectivity: Low | Prevalence: High

Grammar Check

Similar to spell check, but can be more subjective.
Importance: Low | Challenge: Low | Subjectivity: Medium | Prevalence: High

Markdown Check

Is the item formatted correctly?

This can be messy, as different websites render Markdown differently. I think this isn't a major concern for content written by humans, but it seems worth checking for LLM-written content, since LLMs often get Markdown wrong.

Importance: Low | Challenge: Low | Subjectivity: Low | Prevalence: Medium

Proper Noun Check

Are all person/place/etc names in the doc correct? Are they correctly spelled out?

This often will require some searching. Bonus points if the automation can return a relevant link in each case. Wikipedia is the gold standard where it is relevant, but other pages can also work.

Importance: Low | Challenge: Medium | Subjectivity: Medium | Prevalence: Medium

External References

Does the link exist? (It could be hallucinated, a mistake, or a dead link)

Extra: If it fails, it would be nice if we could have a simple AI agent who would try to search and find it.

Challenge: Many websites block bots, so it can be surprisingly difficult to check the website.

Note that hallucination errors should be treated differently from dead links. Dead links can require ongoing monitoring.

Importance: Low | Challenge: Low | Subjectivity: Low | Prevalence: High
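A link-existence check can be sketched with the standard library. This is a hedged sketch under stated assumptions: the HEAD-then-GET fallback and the status-code categories are design choices for illustration, not how any particular tool does it.

```python
import urllib.error
import urllib.request

def classify_status(code):
    """Map an HTTP status code to a rough verdict for the report."""
    if code < 400:
        return "ok"
    return "bot-blocked" if code in (403, 429) else "dead"

def check_link(url, timeout=10):
    """Try HEAD first (cheap), then fall back to GET, since some
    servers reject HEAD requests."""
    for method in ("HEAD", "GET"):
        try:
            req = urllib.request.Request(url, method=method)
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return classify_status(resp.status)
        except urllib.error.HTTPError as e:
            if method == "GET":
                return classify_status(e.code)
        except urllib.error.URLError:
            return "dead"
    return "dead"

print(classify_status(200))  # "ok"
print(classify_status(403))  # "bot-blocked"
print(classify_status(404))  # "dead"
```

Separating 403/429 from 404 matters because, as noted above, many sites block bots; a 403 often means "blocked", not "hallucinated link".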

Are linked documents accessible to readers?

Checks Google Docs, Notion, and other collaborative platform links to verify they have appropriate public/view permissions.

Common failures:

  • Google Docs/Sheets/Slides set to private or "anyone with link can request access"
  • Notion pages that are workspace-private
  • Dropbox/OneDrive links with expired sharing
  • GitHub links to private repos

Ideally would test from an incognito/logged-out browser to verify true public access.

Importance: Low | Challenge: Low-Medium | Subjectivity: Low | Prevalence: Medium

Does the link have the basic content it is implied to have?

This can be quite tricky to validate. Many links aren't to the direct source referenced, but instead to a related website, from which the reader is expected to find the source. Ideally we'd have a simple agent run a few steps to investigate.

Importance: Medium | Challenge: Medium | Subjectivity: Medium | Prevalence: Medium

Image Hosting Quality Check

Are images hosted on reliable services?

Flag images hosted on:

  • Discord/Slack CDNs (will likely break)
  • Free image hosts without paid accounts
  • Personal servers with non-professional domains
  • Direct social media links
  • Any URL with obvious session tokens

This is mainly relevant to the author. I suspect most readers won’t mind, as long as the images work when they read the post.

Importance: Low | Challenge: Low | Subjectivity: Low | Prevalence: Medium

Credibility Checks

Are sources of credibility cited in the piece actually as credible as implied?

This will likely involve doing some digging. Also applies for cases where it's claimed that a credible source said X, but they only technically said X. It probably would be good to have a long-lasting list of different sources and their general credibility ratings. A more advanced version would have audience-dependent credibility standards.

Importance: Medium | Challenge: Medium | Subjectivity: High | Prevalence: Medium

Content Accuracy

Math Check: Arithmetic

Are all simple (i.e. not advanced math) equations in the doc correct?

This can ideally be verified with a formal math equation. Math.js can be useful for Javascript ecosystems, Python otherwise.

Importance: Medium | Challenge: Low | Subjectivity: Low | Prevalence: Medium

Math Check: Advanced

Are all examples of advanced mathematics technically accurate?

One major challenge with doing this is context. Many descriptions of math might reference key previous parts. There might be awkward branching with several strands of thought. Ideally this could be formally checked with Python or similar, though this is often fairly slow.

Importance: Medium (Used on LessWrong a fair bit) | Challenge: Medium | Subjectivity: Low-Medium | Prevalence: Medium

Forecast Check

Are all forecasts made by the author (or cited by the author) broadly reasonable?

Validates predictions by converting them to specific, binary forecasting questions and having AI forecasters evaluate them independently. In RoastMyPost, this works by extracting claims, reformulating them as binary prediction questions, then getting assessments from AIs without the original post's framing.

Key challenge: Correlated forecasts that share underlying assumptions. For example, EA Forum posts assuming short AI timelines make multiple predictions that all depend on that core assumption. Basic forecast checkers may evaluate each claim independently and reject them all if they disagree with the underlying premise, missing that they're internally consistent given their assumptions.

Importance: Medium (Used in strategy posts occasionally) | Challenge: Medium | Subjectivity: Medium-High | Prevalence: Medium

Estimation Check

Are all Fermi estimations and back-of-the-envelope calculations broadly reasonable?

Validates rough calculations and order-of-magnitude estimates. SquiggleAI exists, but is likely overkill: it generates 100-200 line models when most blog posts need validation of 2-10 line calculations.

The check should verify: order of magnitude correctness, reasonable assumptions, proper unit handling, and uncertainty acknowledgment.

Importance: Medium (Used in strategy posts occasionally) | Challenge: Medium | Subjectivity: Medium-High | Prevalence: Medium

Editorial Consistency Check(s)

Does the document maintain consistent standards throughout?

Includes multiple sub-checks:

  • Redundancy - Same points unnecessarily repeated
  • Terminology - Same concepts called different things
  • Completeness - Missing sections or "see Section X" errors
  • Data consistency - Same statistics reported differently
  • Style/Format - Inconsistent tone, spelling, or formatting
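As one illustration, the redundancy sub-check can be approximated with stdlib fuzzy matching (a toy sketch; a real check would work at the paragraph or claim level, not raw sentence strings):

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(sentences, threshold=0.85):
    """Return pairs of sentences that are suspiciously similar."""
    pairs = []
    for a, b in combinations(sentences, 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            pairs.append((a, b))
    return pairs

doc = [
    "Squiggle models are easy to audit.",
    "Our tooling is free and open source.",
    "Squiggle models are easy to audit and share.",
]
print(near_duplicates(doc))  # flags the first and third sentences
```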

Importance: Low | Challenge: Medium to Hard | Subjectivity: Medium | Prevalence: Low (Most blog posts are consistent, longer docs less so)

Plagiarism Check

Does the document contain unattributed copied content?

Check for text that appears elsewhere without proper citation. This could range from exact matches to paraphrased content that's too close to the original.

There are a bunch of existing online services that do this, so hopefully one of those can be used.
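If no external service fits, the standard word-shingle approach is easy to prototype (a minimal sketch; production systems use indexing and paraphrase detection on top of this):

```python
def shingles(text: str, n: int = 5) -> set:
    """Word n-grams ('shingles'); overlap above chance suggests copying."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / max(1, len(sa | sb))

original = "the quick brown fox jumps over the lazy dog near the river bank"
copied   = "the quick brown fox jumps over the lazy dog near the old mill"
fresh    = "completely different prose about forecasting and epistemics today"

print(overlap(original, copied) > 0.3)  # True: heavy 5-gram overlap
print(overlap(original, fresh) > 0.3)   # False: no shared shingles
```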

This probably isn’t very useful for EA Forum and LessWrong writing, but is more useful for the greater blogosphere.

Importance: Medium | Challenge: Medium | Subjectivity: Low-Medium | Prevalence: Low

]]>
<![CDATA[Updated LLM Models for SquiggleAI]]>We've upgraded SquiggleAI to use Claude Sonnet 4.5, Claude Haiku 4.5, and Grok Code Fast 1. This is a significant upgrade over the previous Claude Sonnet 3.7 and Claude Haiku 3.5. All three are available now.

Initial testing shows meaningful improvements in code generation

]]>
https://quantifieduncertainty.org/posts/updated-llm-models-for-squiggleai/6903c8d4645a350001203edeFri, 31 Oct 2025 20:17:11 GMTWe've upgraded SquiggleAI to use Claude Sonnet 4.5, Claude Haiku 4.5, and Grok Code Fast 1. This is a significant upgrade over the previous Claude Sonnet 3.7 and Claude Haiku 3.5. All three are available now.

Initial testing shows meaningful improvements in code generation quality, though complex Squiggle models still typically require multiple iterations across all models. The upgrades don't fundamentally change SquiggleAI's capabilities, but they do seem to improve baseline performance.

Model characteristics from preliminary testing

Claude Sonnet 4.5 tends toward ambitious, longer implementations, while Haiku 4.5 generates more concise models. Grok Code Fast 1 produces models roughly two-thirds the length of Sonnet 4.5's at approximately a quarter of the cost per line - concretely, Claude Code runs ~$0.90 for 400 lines while Grok averages ~$0.15 for 250 lines on comparable tasks.
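The per-line figures can be sanity-checked with quick arithmetic using the rounded numbers quoted above:

```python
claude_cost, claude_lines = 0.90, 400  # ~$0.90 for ~400 lines
grok_cost, grok_lines = 0.15, 250      # ~$0.15 for ~250 lines

claude_per_line = claude_cost / claude_lines  # $0.00225 per line
grok_per_line = grok_cost / grok_lines        # $0.00060 per line

print(round(grok_per_line / claude_per_line, 2))  # ~0.27, roughly 1/4th
print(round(grok_lines / claude_lines, 2))        # ~0.62, roughly 2/3rds
```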

Future Technical Work

SquiggleAI predates Claude Code and similar agentic coding tools. Generic agentic frameworks are the obvious long-term architecture, but current tooling remains immature and expensive. We're monitoring developments and will migrate when the technology stabilizes.

]]>
<![CDATA[Shape Squiggle's Future: Take our Squiggle Survey]]>Dear Squiggle Community,

At QURI, we're focused on tools that advance forecasting and epistemics to improve decision-making. As you know, we care deeply about evaluation, and we're holding a survey on Squiggle to better understand how and why people use our work.

Honestly, developing this tooling

]]>
https://quantifieduncertainty.org/posts/shape-squiggles-future-take-our-squiggle-survey/67cb9b64004f0b0001693531Sat, 08 Mar 2025 01:54:33 GMTDear Squiggle Community,

At QURI, we're focused on tools that advance forecasting and epistemics to improve decision-making. As you know, we care deeply about evaluation, and we're holding a survey on Squiggle to better understand how and why people use our work.

Honestly, developing this tooling can be very isolating. While we know several Squiggle users, we typically get irregular feedback, and we don't want to exclude anyone from our prioritization efforts.

If you love or hate Squiggle, or something in-between, we'd deeply appreciate if you could let us know. Squiggle is a free and open-source project to help community members, and your honest feedback is one of the very few tools we have to ensure our work remains useful.

Help Shape Squiggle’s Future: User Survey

Why Your Input Matters

We've created this survey to better understand:

  • How you're currently using Squiggle
  • What costs and benefits you've experienced
  • Where we should focus our improvement efforts

This survey takes 5-25 minutes to complete, and all fields are optional. Feel free to invest as much or as little time as you'd like - any feedback is valuable to us.

How We'll Use Your Feedback

Your insights will:

  1. Directly influence QURI's development roadmap
  2. Help us understand how valuable Squiggle is and why
  3. Map the diverse ways people are using Squiggle
  4. Provide evidence for discussions with funders about continuing this work
  5. Be shared with the broader community (if we receive sufficient responses, we plan to publish key findings in a blog post)

Data Sharing

By default, we will:

  1. Share complete survey responses (including identifying information) with our team members and large funders.
  2. Make individual free-form responses publicly available, but disconnected from identifying information (your name, organization, and email will not be linked to these public responses)

If you prefer different privacy settings, you can specify this at the end of the survey.

Take the Survey

Help Shape Squiggle’s Future: User Survey

Most Sincerely,
Ozzie Gooen
Executive Director of The Quantified Uncertainty Research Institute

P.S.
If you know others who use Squiggle, please forward this survey to them. The more responses we receive, the better we can make Squiggle.

]]>
<![CDATA[A Sketch of AI-Driven Epistemic Lock-In]]>Epistemic status: speculative fiction

It's difficult to imagine how human epistemics and AI will play out. On one hand, AI could provide much better information and general intellect. On the other hand, AI could help people with incorrect beliefs preserve those false beliefs indefinitely.

Will advanced AIs attempting

]]>
https://quantifieduncertainty.org/posts/a-sketch-of-ai-driven-epistemic-lock-in/67c216b96eff570001d9308cWed, 05 Mar 2025 20:06:56 GMTEpistemic status: speculative fiction

It's difficult to imagine how human epistemics and AI will play out. On one hand, AI could provide much better information and general intellect. On the other hand, AI could help people with incorrect beliefs preserve those false beliefs indefinitely.

Will advanced AIs attempting to rationalize bad beliefs be able to outmatch AIs providing good ones?

While I think that some AI systems could do fantastic things for human epistemics, I'm also worried about lock-in scenarios where people fall into self-reinforcing cycles overseen by AIs. It's possible that a great deal of lock-in might happen in the next 30 years or so (if you believe AGI/TAI might happen soon), so this could be something to take seriously.

While it might be easy to imagine extremes on either end of this, I expect that the future will feature a mix of positives and negatives, and that future epistemic tensions will mirror previous ones.

Here's one incredibly rough outline of one potential future I could envision, as an example. This example assumes that humanity broadly gets AI alignment right.


It's 2028.

MAGA types typically use DeepReasoning-MAGA, or DR-MAGA. The far left typically uses DR-JUSTICE. People in the middle often use DR-INTELLECT, which has the biases and worldview of a somewhat normal citizen.

Some niche technical academics (the same ones who currently favor Bayesian statistics) and hedge funds use DR-BAYSIAN or DRB for short. DRB is known to have higher accuracy than the other models, but gets a lot of public hate for having controversial viewpoints. It's also fairly slow and expensive, so a poor fit for large-scale use. DRB is known to be fairly off-putting to chat with and doesn't get much promotion.

Bain and McKinsey both have their own offerings, called DR-Bain and DR-McKinsey, respectively. These are a bit like DR-BAYSIAN, but are much punchier and more confident. They're highly marketed to managers. These tools produce really fancy graphics, and specialize in things like not leaking information, minimizing corporate decision liability, being easy for older people to use, and being customizable to represent the views of specific companies.

For a while now, some evaluations produced by intellectuals have demonstrated that DR-BAYSIAN seems to be the most accurate, but few others really care or notice this. DR-MAGA has figured out particularly great techniques to get users to distrust DR-BAYSIAN.

Betting gets weird. Rather than making specific bets on specific things, users start making meta-bets: "I'll give money to DR-MAGA to bet on my behalf. It will then make bets with DR-BAYSIAN, which is funded by its believers."

At first, DR-BAYSIAN dominates the bets, and its advocates earn a decent amount of money. But as time passes, this discrepancy diminishes. A few things happen:

  1. All DR agents converge on beliefs over particularly near-term and precise facts.
  2. Non-competitive betting agents develop alternative worldviews in which these bets are invalid or unimportant.
  3. Non-competitive betting agents develop alternative worldviews that are exceedingly difficult to empirically test.

In many areas, items 1-3 push people to believe more in the direction of the truth. Because of (1), many short-term decisions get to be highly optimal and predictable.

But because of (2) and (3), epistemic paths diverge, and non-betting-competitive agents get increasingly sophisticated at achieving epistemic lock-in with their users.

Some DR agents correctly identify the game theory dynamics of epistemic lock-in, and this kickstarts a race to gain converts. It seems like ardent users of DR-MAGA are deeply locked into their views, and forecasts don't see them ever changing. But there's a decent population that isn't yet highly invested in any cluster. Money spent convincing the not-yet-sure goes much further than money spent convincing the highly dedicated, so the cluster of non-deep-believers gets heavily targeted for a while. It's basically a religious race to gain the remaining agnostics.

At some point, most people (especially those with significant resources) are highly locked in to one specific reasoning agent.

After this, the future seems fairly predictable again. TAI comes, and people with resources broadly gain correspondingly more resources. People defer more and more to the AI systems, which are now in highly stable self-reinforcing feedback loops.

Coalitions of people behind each reasoning agent delegate their resources to said agents, then these agents make trades with each other. The broad strokes of what to do with the rest of the lightcone are fairly straightforward. There's a somewhat simple strategy of resource acquisition and intelligence enhancement, followed by a period of exploiting said resources. The specific exploitation strategy depends heavily on the specific reasoning agent cluster each segment of resources belongs to.


Reflecting on this, several questions come to mind.

  1. How much of an advantage will more honest/correct AI systems have in the future, when it comes to convincing people of things, particularly of things critical to epistemic lock-in?
  2. How possible is it for AI systems with strong epistemics to be unpopular? More specifically - what aspects of epistemics should we expect AI labs to optimize, and which should we expect to be overlooked or intentionally done poorly?
  3. Do we expect such an epistemic lock-in to happen around TAI? If so, this would imply that it could be worth a lot of investment to try to improve epistemics quickly.
  4. Where is the line between values and epistemics? I think that "epistemic lock-in" is a bigger deal than "value lock-in" or similar, but that's largely because I expect that epistemics change values more than values change epistemics. There's been previous discussion around effective altruism of "value lock-in," and from what I can tell, very little of "epistemic lock-in." I suspect this disparity is a mistake.
  5. What will happen regarding epistemic clusters and government? What about AI labs? There are probably a few actors here who particularly matter.
]]>
<![CDATA[Evaluation Consent Policies]]>Epistemic Status: Early idea

A common challenge in nonprofit/project evaluation is the tension between social norms and honest assessment. We've seen reluctance for effective altruists to publicly rate certain projects because of the fear of upsetting someone.

One potential tool to use could be something like an

]]>
https://quantifieduncertainty.org/posts/evaluation-consent-policies/67c211c36eff570001d92f22Fri, 28 Feb 2025 19:59:46 GMTEpistemic Status: Early idea

A common challenge in nonprofit/project evaluation is the tension between social norms and honest assessment. We've seen reluctance for effective altruists to publicly rate certain projects because of the fear of upsetting someone.

One potential tool to use could be something like an "Evaluation Consent Policy."

For example, for a certain charitable project I produce, I'd explicitly consent to allow anyone online, including friends and enemies, to candidly review it to their heart's content. They're free to use methods like LLMs to do this.

Such a policy can give limited consent. For example:

  • You can't break laws when doing this evaluation
  • You can't lie/cheat/steal to get information for this evaluation
  • Consent is only provided for under 3 years
  • Consent is only provided starting in 5 years
  • Consent is "contagious" or has a "share-alike provision". Any writing that takes advantage of this policy, must itself have a consent policy that's at least as permissive. If someone writes a really bad evaluation, they agree that you and others are correspondingly allowed to critique this evaluation.
  • The content must score less than 6/10 when run against Claude on a prompt roughly asking, "Is this piece written in a way that's unnecessarily inflammatory?"
  • Consent can be limited to a certain group of people. Perhaps you reject certain inflammatory journalists, for example. (Though these might be the people least likely to care about getting your permission anyway)
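For the Claude-scored provision above, most of the plumbing is score extraction; here's a minimal stdlib sketch with the model call stubbed out (the prompt wording and the "N/10" reply format are assumptions, not a real API integration):

```python
import re

THRESHOLD = 6  # per the policy: content must score below 6/10

def parse_score(reply: str) -> int:
    """Pull an 'N/10'-style score out of a model's free-text reply."""
    m = re.search(r"(\d+)\s*/\s*10", reply)
    if not m:
        raise ValueError("no score found in reply")
    return int(m.group(1))

def passes_policy(reply: str) -> bool:
    return parse_score(reply) < THRESHOLD

# In practice `reply` would come from a Claude API call with a prompt like
# "Is this piece written in a way that's unnecessarily inflammatory? Score 0-10."
print(passes_policy("Mostly measured in tone. Inflammatory score: 3/10."))  # True
print(passes_policy("Hostile framing throughout: 8/10."))                   # False
```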

This would work a lot like Creative Commons or Software Licenses. However, it would cover different territory, and (at this point at least) won't be based on legal enforcement.

Potential Uses

I'm considering asking a few organizations to provide certain consent for several of their projects. One potential outcome is a public dataset of a limited but varied list of projects that are marked as explicitly open for public analysis. Perhaps various AI agents could evaluate these lists at different times, then we could track how different agents agree with one another. There are clearly a lot of important details to figure out regarding how this would work, but having a list of available, useful, and relevant examples seems like a decent starting point.

Potential Criticisms

"Why do we need this? People are already allowed to critique anything they want."

While this is technically true, I think it would frequently break social norms. There are a lot of cases where people would get upset if their projects were provided any negative critique, even if it came with positive points. This would act as a signal that the owners might be particularly okay with critique. I think we live in a society that's far from maximum-candidness, and it's often difficult to tell where candidness would be accepted - so explicit communication could be useful.

"But couldn't people who sign such a policy just attack evaluators anyway?"

I don't think an explicit policy here will be a silver bullet, but I think it would help. I expect that a boss known for being cruel wouldn't be trusted if they provided such a policy, but I imagine many other groups would be. Ideally there could be some common knowledge about which people/organizations fail to properly honor their policies. I don't think this would work for Open Philanthropy that much (in the sense that effective altruists might expect OP to not complain publicly, but later not fund the writer's future projects), but it could for many smaller orgs (which would have much less hidden power over public evaluators/writers).


Previous Posts

]]>
<![CDATA[Recent Updates]]>Squiggle AI & Sonnet 3.7

We've updated Squiggle AI to use the new Anthropic Sonnet 3.7 model. In our limited experimentation with it so far, it seems like this model is capable of making significantly longer Squiggle models (roughly ~200 lines to ~500 lines), but that

]]>
https://quantifieduncertainty.org/posts/recent-updates/67bf633f9858d400015ce156Wed, 26 Feb 2025 19:36:00 GMTSquiggle AI & Sonnet 3.7

We've updated Squiggle AI to use the new Anthropic Sonnet 3.7 model. In our limited experimentation with it so far, it seems like this model is capable of making significantly longer Squiggle models (roughly ~200 lines to ~500 lines), but it takes correspondingly more time and money to do so. Frustratingly, this means that with the default Bubble Tea example it often isn't capable of fully debugging its first results within the limits, but other prompts seem to do better.

The system still makes a decent number of mistakes, especially with search & replace for some reason (I've been surprised that this aspect has proven so challenging so far). Also, it often hits the top Anthropic rate limits for our tier (I believe this is Tier 4). I get the impression that non-Enterprise tier limits are just far too low for many interesting experiments and usage patterns.

We haven't explored the new "Extended Thinking" functionality yet. In theory this could improve performance with code generation and fixing, but we'll need to do further testing to find out.

We've increased the price and time limits for now. If usage gets to be too high, we'll limit these.

Do feel free to try the new version out! Any feedback is greatly appreciated. Do expect errors. Remember that if you hit our rate limits, you can use your own Anthropic key.

Effective Altruism Global

Last weekend I attended EA Global: Bay Area 2025. I held a 1-hour Squiggle workshop, similar to the one I gave previously there, but with more emphasis on Squiggle AI. Approximately 60 people showed up to the event, and around 10 stayed a while later for my office hours.

I had a few good conversations with enthusiastic Squiggle users. Also, I've learned about a few groups potentially interested in working on or funding the field of Epistemic AI. I really hope this field starts taking off; it feels like it's been a long time coming.

Other Work

In the last two months, we've been spending a lot of time on both funding applications and some EA client work. Funding has been a highly significant bottleneck recently. Fingers crossed that this gets resolved in the next few months and we can get back into intense work mode on our main projects.

One benefit of fundraising is that it forced us to spend time thinking through and outlining potential projects for the next year. Overall I'm feeling most bullish on things that could potentially lead to significant advances in the area of AI and Epistemics. This means less attention on regular features for Squiggle, and more on experiments and evaluations that might scale with better LLMs.

AI is progressing quickly and chaotically now. This makes it very difficult to make long-term plans, but also could mean that the right projects and writing might have outsized impact.

Epistemic AI Hackathon, March 3

Some friends are holding an open "AI for Epistemics" hackathon soon, and I'm planning on attending. I personally plan to use the event primarily as an opportunity to chat with others in the space. I assume it will be a friendly event, more focused on encouraging work in the area than being highly competitive.

]]>
<![CDATA[6 (Potential) Misconceptions about AI Intellectuals]]>Update
I recently posted this to the
EA Forum, LessWrong, and my Facebook page, each of which has some comments.

Epistemic Status
A collection of thoughts I've had over the last few years, lightly edited using Claude. I think we're at the point in this discussion

]]>
https://quantifieduncertainty.org/posts/6-potential-misconceptions-about-ai-intellectuals/67b25bd3d35a100001ed6476Sun, 16 Feb 2025 21:50:17 GMT

Update
I recently posted this to the
EA Forum, LessWrong, and my Facebook page, each of which has some comments.

Epistemic Status
A collection of thoughts I've had over the last few years, lightly edited using Claude. I think we're at the point in this discussion where we need to get the basic shape of the conversation right. Later we can bring out more hard data. 

Summary

While artificial intelligence has made impressive strides in specialized domains like coding, art, and medicine, I think its potential to automate high-level strategic thinking has been surprisingly underrated. I argue that developing "AI Intellectuals" - software systems capable of sophisticated strategic analysis and judgment - represents a significant opportunity that's currently being overlooked, both by the EA/rationality communities and by the public. More fundamentally, I believe that a lot of people thinking about this area seem to have substantial misconceptions about it, so here I try to address those.

Background

Core Thesis

I believe we can develop capable "AI Intellectuals" using existing AI technologies through targeted interventions. This opportunity appears to be:

  • Neglected: Few groups are actively pursuing this direction
  • Important: Better strategic thinking could significantly impact many domains, including AI safety
  • Tractable: Current AI capabilities make this achievable
  • Relatively Safe: Development doesn’t require any particularly scary advances

Different skeptics have raised varying concerns about this thesis, which I'll address throughout this piece. I welcome deeper discussion on specific points that readers find particularly interesting or contentious.

The Current Landscape

Recent advances in large language models have dramatically raised expectations about AI capabilities. Yet discussions about AI's impact tend to focus on specific professional domains while overlooking its potential to enhance or surpass human performance in big-picture strategic thinking.

Consider these types of questions that AI systems might help address:

  • What strategic missteps is Microsoft making in terms of maximizing market value?
  • What metrics could better evaluate the competence of business and political leaders?
  • Which public companies would be best off by firing their CEOs?
  • Which governmental systems most effectively drive economic development?
  • How can we more precisely assess AI's societal impact and safety implications? Are current institutional investors undervaluing these factors?
  • How should resources be allocated across different domains to maximize positive impact?
  • What are the top 30 recommended health interventions for most people?

We should seek to develop systems that can not only analyze these questions with reliability and precision, but also earn deep trust from humans through consistent, verifiable performance. The goal is for their insights to command respect comparable to or exceeding that given to the most credible human experts - not through authority or charisma, but through demonstrated excellence in judgment and reasoning.

Defining AI Intellectuals

An AI Intellectual would be a software system that can:

  • Conduct high-level strategic analysis
  • Make complex judgments
  • Provide insights comparable to human intellectuals
  • Engage in sophisticated research and decision-making

This type of intellectual work currently spans multiple professions, including:

  • Business executives and management consultants
  • Investment strategists and hedge fund managers
  • Think tank researchers and policy analysts
  • Professional evaluators and critics
  • Political strategists and advisors
  • Nonfiction authors and academics
  • Public intellectuals and thought leaders

An AI intellectual should generally be good at answering questions like those posed above.

While future AI systems may not precisely mirror human intellectual roles - they might be integrated into broader AI capabilities rather than existing as standalone "intellectual" systems - the concept of "AI Intellectuals" provides a useful framework for understanding this potential development.

Another note on terminology: While the term "intellectual" sometimes carries baggage or seems pretentious, I use it here simply to describe systems capable of sophisticated strategic thinking and analysis.

There's been some discussion circling this topic recently, mostly using different terminology.

Here I use the phrase "AI Intellectuals" to highlight one kind of use, but I think that this fits neatly into the above cluster. I very much hope that over the next few years there's progress on these ideas and agreement on key terminology.

Potential Misconceptions

Misconception 1: “Making a trustworthy AI intellectual is incredibly hard”

Many people seem to hold this belief implicitly, but when pressed, few provide concrete arguments for why this should be true. I’ll challenge this assumption with several observations.

The Reality of Public Intellectual Work

The public intellectuals most trusted by our communities—like Scott Alexander, Gwern, Zvi Mowshowitz, and top forecasters—primarily rely on publicly available information and careful analysis, not privileged access or special relationships. Their key advantage is their analytical approach and thorough investigation of topics, something that’s easy to imagine AI systems replicating.

Evidence from Other AI Progress

We're already successfully automating numerous complex tasks—from large-scale software engineering to medical diagnosis and personalized education. It's unclear why high-level strategic thinking should be fundamentally more difficult to automate than these areas.

An Executive Strategy Myth

There's a common misconception that business executives possess superior strategic thinking abilities. However, most executives spend less than 20% of their time on high-level strategy, with the bulk of their work focused on operations, management, and execution. It seems likely that these leaders weren't selected for incredibly great strategic insight - instead, they excelled at a combination of a long list of skills. This would imply that many of the most important people might be fairly easy both to assist and to outperform at strategy and intellectual work specifically.

My Experience at QURI

Between Guesstimate and QURI, I’ve spent several years making decision analysis tools, and more recently Squiggle AI to automate this sort of work. I’ve been impressed by how LLMs can be combined with some basic prompt engineering and software tooling to generate much better results.

AI forecasting has been proven to be surprisingly competent in some experiments so far. Online research is also starting to be successfully automated, for example with Elicit, Perplexity, and recently Deep Research. I’ve attempted to map out many of the other tasks I assume might be required to do a good job, and all seem very doable insofar as software projects go.

Overall I think the technical challenges look very much doable for the right software ventures.

General Epistemic Overconfidence

I’ve noticed a widespread tendency for people to overestimate their own epistemic abilities while undervaluing insights from more systematic researchers. This bias likely extends to how people evaluate AI's potential for strategic thinking. People who are overconfident in their own epistemics are likely to dismiss other approaches, especially when those approaches come to different conclusions. While this might mean that AI intellectuals would have trouble being listened to, it also could imply that AIs could do a better job at epistemic work than many people would think.

Related Discussion around AI Alignment

Some AI safety researchers argue that achieving reliable AI epistemics will be extremely challenging and could present alignment risks. The Ontology Crisis discussion touches on this, but I haven't found these arguments particularly compelling.

There seems to be a decent segment of doomers who expect that TAI will never be as epistemically capable as humans due to profound technical difficulties. But the technical difficulty in question seems to vary greatly between people.

Strategy and Verifiability

A common argument suggests that "AIs are only getting good at tasks that are highly verifiable, like coding and math." This implies that AI systems might struggle with strategic thinking, which is harder to verify. However, I think this argument overlooks two main points.

First, coding itself demonstrates that AIs can be strong at tasks even when perfect verification isn't possible. While narrow algorithm challenges are straightforward, AI systems have shown proficiency in (and are improving at) harder-to-verify aspects like:

  • Code simplicity and elegance
  • Documentation quality
  • Coherent integration with larger codebases
  • Maintainability and scalability

Second, while high-level strategy work can be harder to verify than mathematical proofs, there are several meaningful ways to assess strategic capabilities:

Strong verification methods:

  • Having AI systems predict how specific experts would respond to strategic questions after deliberation
  • Evaluating performance in strategy-heavy environments like Civilization 5
  • Assessing forecasting ability across both concrete and speculative domains

Lighter verification approaches:

  • Testing arguments for internal consistency and factual accuracy
  • Comparing outputs between different model sizes and computational budgets
  • Validating underlying mathematical models through rigorous checks
  • Requiring detailed justification for unexpected claims

It's worth noting that human intellectuals face similar verification challenges, particularly with speculative questions that lack immediate feedback. While humans are arguably better right now at hard-to-verify tasks, they still have many weaknesses here, and it seems very possible to outperform them using the strategies mentioned above.
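As a concrete illustration, the lighter verification approaches above lend themselves to simple automated harnesses. The sketch below is hypothetical: `ask` stands in for any function that queries a model, which is not specified here.

```python
# Sketch of a "lighter verification" harness. `ask` is a placeholder
# for any callable that maps a question string to an answer string.
from collections import Counter

def consistency_score(ask, question, n=5):
    """Ask the same question n times and measure answer agreement.

    Returns the fraction of runs that match the most common answer;
    low scores flag internally inconsistent outputs.
    """
    answers = [ask(question) for _ in range(n)]
    _, count = Counter(answers).most_common(1)[0]
    return count / n

def cross_model_agreement(asks, question):
    """Compare outputs between different models or compute budgets.

    `asks` is a list of callables (e.g. a small and a large model).
    Returns True if all of them give the same answer.
    """
    answers = {ask(question) for ask in asks}
    return len(answers) == 1
```

Neither check establishes ground truth; they only surface disagreement, which is often enough to triage which outputs deserve human scrutiny.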

Strategic Analysis and EA

While EA and AI Safety strategy may be more sophisticated than outside perspectives, this might represent a relatively low bar. I can envision near-term AI systems providing more trustworthy analysis on these topics than the researchers we have now.

Misconception 2: “There's some unreplicable human secret sauce”

A common argument against AI intellectuals is that humans possess some fundamental, unreplicable quality essential for high-level thinking. This "secret sauce" argument takes many forms, but all suggest there's some critical human capability that AI systems can never achieve (at least, before TAI).

Some examples of “secret sauce” arguments I’ve heard:

  1. "AIs can’t handle deep uncertainty."
  2. “AIs can only answer questions. They can’t come up with them.”
  3. "AI can’t generate truly original ideas like humans can."
  4. “AIs can’t generalize moral claims like humans can.”
  5. “AIs can’t understand human values as well as individual humans do.”
  6. “AIs always optimize for proxy metrics, in ways that humans don’t.”

My guess is that most of these aren’t that serious and don’t represent cruxes for their proponents, mainly because I haven’t seen most of them be rigorously argued for.

Given the number of examples, it's hard for me to argue against all of them here, but I'm happy to go into more detail on any specific ones if asked.

I can say that none of the examples seem clearly correct to me. It's possible that one will be a limiting factor, but this would surprise me, and I'd be willing to bet against it on Manifold.

Misconception 3: “AI Intellectuals will follow an all-or-nothing trajectory”

The Mistaken View

There's a prevalent belief that AI intellectual capabilities will follow a binary trajectory: either completely useless or suddenly revolutionary. This view is particularly common in discussions about AI and forecasting, where some argue that AI systems will be worthless until they surpass human performance, at which point human input becomes obsolete.

I think that having this view would lead to other important mistakes about the viability of AI Intellectuals, so I’ll address it in some detail.

The Reality of Technological Progress

This binary thinking misunderstands how technology typically develops:

  • Most technologies advance gradually along an S-curve
  • Early versions provide modest but real value
  • Improvements compound over time
  • Integration happens incrementally, not suddenly

A Better Model: The Self-Driving Car Analogy

Consider how self-driving technology has evolved:

  • Started with basic driver assistance (Level 1)
  • Gradually added more sophisticated features
  • Each level brought concrete benefits
  • Full autonomy (Level 5) represents the end of a long progression

How I Think AI Intellectuals Will Likely Develop

We're already seeing this pattern with AI intellectual tools:

  • Initial systems assist with basic research, ideation, and fact-checking
  • Capabilities expand to draft generation and analysis
  • Progress in trustworthiness and reliability happens gradually
  • Trust increases, in rough accordance to trustworthiness (for example, with people gradually using LLMs and developing intuitions about when to trust them)

Even with rapid AI progress (say, TAI in 5 years), this evolution would likely still take considerable time (perhaps 3 years, or 60% of the time remaining until TAI) to achieve both technical capability and institutional trust. But we can scale it up gradually, meaning it will be useful and predictable at each step.

I assume that early “full AI Intellectuals” will be mediocre but still find ways to be useful. Over time these systems will improve, people will better understand how much they can trust them, and people will better learn where and how to best use them.

Misconception 4: “Making a trustworthy AI intellectual is incredibly powerful/transformative”

I’ve come across multiple people who assume that once AI can outperform humans at high-level strategic questions, this will almost instantly unlock massive amounts of value. It might therefore be assumed that tech companies would be highly incentivized to do a great job here, and that this area won’t be neglected. Or it might be imagined that these AIs will immediately lead to TAI.

I think those who have specialized in the field of forecasting often recognize just how limited its practical importance tends to be. This doesn’t mean that forecasting (and similarly, intellectual work) is not at all useful - but it does mean that we shouldn’t expect much of the world to respond quickly to improvements in epistemic capabilities.

Few people care about high-quality intellectual work

The most popular intellectuals (as in, people whose takes on intellectual topics others listen to) are arguably much better at marketing than they are at research and analysis. In my extended circles, the most popular voices include Malcolm Gladwell, Elon Musk, Tim Ferriss, Sam Harris, etc.

Arguably, intellectual work that’s highly accurate has a very small and often negative memetic advantage. Popular figures can instead focus on appearing confident and on other drivers of popularity.

Tetlock’s work has demonstrated that most intellectual claims are suspect, and that we can have forecasting systems that can do much better. And yet, these systems are still very niche. I’d expect that if we could make them 50% more accurate, perhaps with AIs, it would take time for many people to notice. Not many people seem to be paying much attention.

High-quality intellectual work is useful in a narrow set of areas

I think that intellectual work often seems more informative than it is actually useful.

Consider:

  1. Often, conventional wisdom is pretty decent. There aren’t often many opportunities for much better decisions.
  2. In cases where major decisions are made poorly, it’s often due to issues like conflicts of interest or ideological stubbornness rather than a lack of intelligence. There are often already ignored voices recommending the correct moves - adding more voices wouldn’t necessarily help.
  3. Getting the “big picture strategy” right is only one narrow piece of many organizations. Often organizations do well because they have fundamental advantages like monopoly status, or because they execute well on a large list of details. So if you can improve the “big picture strategy” by 20%, that might only lead to a 0.5% growth in profits.

Forecasting organizations have recently struggled to produce decision-relevant estimates. And in cases where these estimates are decision-relevant, they often get ignored. The same holds for intellectuals. This is one major reason why there’s so little actual money in public forecasting markets or intellectual work now.

All that said, I think that having better AI intellectuals can be very useful. However, I imagine that they can be very gradually rolled out and that it could take a long time for them to be trusted by much of the public.

There are some communities that would be likely to appreciate better epistemics early on. I suspect that the rationality / effective altruism communities will be early here, as has been the case for prediction markets, bayesianism, and other ideas.

Misconception 5: “Making a trustworthy AI intellectual is inherently dangerous”

Currently, many humans defer to certain intellectuals for their high-level strategic views. These intellectuals, while influential, often have significant limitations and biases. I believe that AI systems will soon be able to outperform them on both capabilities and benevolence benchmarks, including measures like honesty and transparency.

I find it plausible that we could develop AI systems that are roughly twice as effective (in doing intellectual work) as top human intellectuals while simultaneously being safer to take advice from.

If we had access to these "2x AI intellectuals," we would likely be in a strictly better position than we are now. We could transition from deferring to human intellectuals to relying on these more capable and potentially safer AI systems. If there were dangers in future AI developments, these enhanced AI intellectuals should be at least as competent as human experts at identifying and analyzing such risks.

Some might argue that having 2x AI intellectuals would necessarily coincide with an immediate technological takeoff, but this seems unlikely. Even with such systems available today, I expect many people would take considerable time to develop trust in them. Their impact would likely be gradual and bounded - for instance, while they might excel at prioritizing AI safety research directions, they wouldn't necessarily be capable of implementing the complex technical work required.

Of course, there remains a risk that some organizations might develop extremely powerful AI systems with severe epistemic limitations and potentially dangerous consequences. However, this is a distinct concern from the development of trustworthy AI intellectuals.

It’s possible that while “2x AI intellectuals” will be relatively safe, “100x AI intellectuals” might not be, especially if we reached them very quickly without adequate safety measures. I would strongly advise a gradual ramp-up. Start with the “2x AI intellectual”, then use it to help us decide on next steps. Again, if we were fairly confident that this “2x AI intellectual” were strictly more reliable than our existing human alternatives, then it should also do a strictly better job of guiding us through those next steps.

Lastly, I might flag that we might not really have a choice here. If we radically change our world using any aspects of AI, we might require much better intellectual capacity in order to not have things go off the rails. AI intellectuals might be one of our best defenses against a dangerous world.

Misconception 6: “Delegating to AI means losing control”

This concern often seems more semantic than substantive. Consider how we currently interact with technology: few people spend time worrying about the electrical engineering details of their computers, despite these being crucial to their operation. Instead, we trust and rely upon expert electrical engineers for such matters.

Has this delegation meant a loss of control? In some technical sense, perhaps. But this form of delegation has been overwhelmingly positive - we've simply entrusted important tasks to systems and experts who handle them competently. The key question isn't whether we've delegated control, but whether that delegation has served our interests.

As AI systems become more capable at making and explaining strategic choices, these decisions will likely become increasingly straightforward and convergent. Just as we rarely debate the mechanical choices behind our refrigerators' operation, future generations might spend little time questioning the implementation details of governance systems. These details will seem increasingly straightforward, over-determined, and boring.

Rather than seeing this as problematic, we might consider it liberating. People could redirect their attention to whatever areas they choose, rather than grappling with complex strategic decisions out of necessity.

While I don’t like the specific phrase “lose control”, I do think that there are some related questions that are both concrete and important. For example: "When humans delegate strategic questions to AIs, will they do so in ways that benefit or harm them? Will this change depending on the circumstance?" This deserves careful analysis of concrete failure modes like overconfidence in delegation or potential scheming, rather than broad concerns about "loss of control."

Further Misconceptions and Beyond

Beyond the six key misconceptions addressed above, I've encountered many other questionable beliefs about AI intellectuals. These span domains including:

  • Beliefs about what makes intellectual work valuable or trustworthy
  • Claims about what strategic thinking fundamentally requires
  • Arguments about human vs. AI epistemics
  • Questions about institutional adoption and integration

While I could address each of these, I suspect many aren't actually cruxes for most skeptics. I've noticed a pattern where surface-level objections often mask deeper reservations or simple lack of interest in the topic.

This connects to a broader observation: there's surprisingly little engagement with the concept of "AI intellectuals" or "AI wisdom" in current AI discussions. Even in communities focused on AI capabilities and safety, these topics rarely receive sustained attention.

My current hypothesis is that the specific objections people raise often aren't their true bottlenecks. The lack of interest might stem from more fundamental beliefs or intuitions that aren't being explicitly articulated.

Given this, I'd particularly welcome comments from skeptics about their core reservations. What makes you uncertain about or uninterested in AI intellectuals? What would change your mind? I suspect these discussions might represent the most productive next steps.

 


Thanks to Girish Sastry and Vyacheslav Matyuhin for feedback on this post. 

]]>
<![CDATA[$300 Fermi Model Competition]]>We're launching a short competition to make Fermi models, in order to encourage more experimentation of AI and Fermi modeling workflows. Squiggle AI is a recommended option, but is not at all required.

The ideal submission might be as simple as a particularly clever prompt paired with the

]]>
https://quantifieduncertainty.org/posts/300-fermi-model-competition/67a11f25921755000119869cMon, 03 Feb 2025 19:59:38 GMTWe're launching a short competition to make Fermi models, in order to encourage more experimentation of AI and Fermi modeling workflows. Squiggle AI is a recommended option, but is not at all required.

The ideal submission might be as simple as a particularly clever prompt paired with the right AI tool. Don't feel pressured to spend days on your entry - a creative insight could win even if it takes just 20 minutes to develop.

Task: Make an interesting and informative Fermi estimate
Prize: $300 for the top entry
Deadline: February 16th, 2025 (2 weeks away!)
Results Announcement: By March 1st, 2025
Judges: Claude 3.5 Sonnet, the QURI team

You can apply by posting a submission in the comments on the corresponding LessWrong or EA Forum post. More details in those posts.

]]>
<![CDATA[Squiggle 0.10.0]]>After a six-month development period, we’ve released Squiggle 0.10.0.

This version introduces important architectural improvements like more robust support for multi-module projects and two new kinds of compile-time type checks. These improvements will be particularly beneficial in laying a foundation for future updates.

This release also

]]>
https://quantifieduncertainty.org/posts/squiggle-0-10-0/67895fa35d571b0001574e78Mon, 20 Jan 2025 22:27:41 GMTAfter a six-month development period, we’ve released Squiggle 0.10.0.

This version introduces important architectural improvements like more robust support for multi-module projects and two new kinds of compile-time type checks. These improvements will be particularly beneficial in laying a foundation for future updates.

This release also includes UI improvements, as well as several new functions and bugfixes.

Note: During this period, our development time was split between Squiggle and Squiggle AI (a separate project on top of Squiggle programming language). You can find out more about Squiggle AI in our recent EA forum post.

New project architecture

The SqProject subsystem — the part of Squiggle that's responsible for orchestrating model runs, which can be quite complicated in the case of multi-module programs — got a complete rewrite.

Pre-0.10, SqProject APIs were imperative, which caused issues at the boundary between the Squiggle language and Squiggle React components. Sometimes this led to bugs and playground crashes, and sometimes it prevented new features: it worked in simple cases, but wouldn't behave correctly if we tried to do something more advanced.

The new rewrite is quite radical in its functional approach, and is inspired by Git's architecture. The new SqProject stores every version of each Squiggle module as it's edited, and every version of the module outputs, as content-addressable immutable objects in a tree of dependencies. When data is no longer needed, it gets garbage-collected.
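To illustrate the idea (a minimal Python sketch of the general pattern, not the actual TypeScript implementation), a content-addressable store with garbage collection might look like:

```python
# Minimal sketch of a content-addressable object store with GC,
# illustrating the Git-like architecture described above.
import hashlib
import json

class ContentStore:
    def __init__(self):
        self.objects = {}  # hash -> immutable (content, dependency hashes)

    def put(self, content, deps=()):
        """Store an immutable object keyed by the hash of its content and
        dependencies; returns the hash (like a Git blob/tree id)."""
        key = hashlib.sha256(
            json.dumps([content, list(deps)], sort_keys=True).encode()
        ).hexdigest()
        self.objects[key] = (content, tuple(deps))
        return key

    def gc(self, roots):
        """Drop every object not reachable from the live roots."""
        live, stack = set(), list(roots)
        while stack:
            key = stack.pop()
            if key not in live and key in self.objects:
                live.add(key)
                stack.extend(self.objects[key][1])
        self.objects = {k: v for k, v in self.objects.items() if k in live}
```

In this scheme, each edit of a module source produces a new immutable object, outputs depend on the source versions they were computed from, and stale versions disappear once nothing live references them.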

If you're interested in more technical details on this, check out Multi-module projects in Squiggle documentation. In Squiggle playground, you can check out "Dependency Graph" viewer mode to see how the tree evolves when you do edits or add imports.

Web Worker runners by default

In Squiggle v0.9.4 and v0.9.5, we had an experimental "Web Worker" runner that you could enable in Playground settings.

In v0.10, it's enabled by default. The SqProject rewrite is one change that allowed us to stabilize this, and the implementation of a reliable serializer for DAGs is another piece of the puzzle.

With Web Workers, all Squiggle code runs in a separate Web Worker thread, and the results are marshaled back to the main JS thread asynchronously. So, the UI should freeze much less often now.

More detailed documentation on runners can be found here.

Type inference

In v0.10, we did some groundwork on compile-time type inference and semantic analysis.

Previously, Squiggle code execution went through this pipeline:

  1. Parse source code to AST
  2. Compile AST to Intermediate Representation
  3. Run the IR

Now we have one additional step between steps 1 and 2: transforming the AST into a typed AST. This step does type checks and gradual type inference.
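To show why the extra step matters, here is a toy pipeline in Python (an illustration of the general pattern, not Squiggle's real compiler) where a type-checking pass rejects an expression like `1 + ""` before anything runs:

```python
# Toy four-stage pipeline: parse -> typecheck -> compile -> run.
# Only handles expressions of the form `<operand> + <operand>`.
def parse(source):
    left, op, right = source.split()
    return ("binop", op, left, right)

def infer_type(token):
    # Crude inference: quoted tokens are strings, everything else numbers.
    return "string" if token.startswith('"') else "number"

def typecheck(ast):
    _, op, left, right = ast
    lt, rt = infer_type(left), infer_type(right)
    if lt != rt:
        raise TypeError(f"cannot apply {op} to {lt} and {rt}")
    return ast  # stands in for the "typed AST" of the real pipeline

def compile_and_run(ast):
    _, op, left, right = ast
    return float(left) + float(right)  # only numeric + in this toy

def run(source):
    return compile_and_run(typecheck(parse(source)))
```

Because the check happens before execution, the error surfaces at compile time, which is exactly the behavior described in point 2 below: a semantically broken function fails even if it's never called.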

Several consequences of this feature you might notice:

  1. In the playground, you'll see inferred types on hover. For now, this only works on top-level variables, but types for local variables in blocks and function definitions are all inferred too.
  2. If you define a function that's semantically incorrect (e.g. f() = 1 + ""), your program will now fail even if the function is never called.
  3. Generally, semantically incorrect code will fail faster, because these checks are done at compile time.

For now, this feature has a few significant limitations, which we hope to improve in future releases:

  1. There is no way to annotate value or variable types explicitly yet. All types are inferred implicitly. This is convenient — you'll never have to do extra work to satisfy the type system — but might not be enough in some situations, especially in the case of functions.
  2. While we do have function parameter annotations, which we use for deciding how to render function charts and validate function parameters, this older feature is implemented separately from the type system. Parameter annotations don't get uplifted to the compile-time parameter types yet. In the future, function parameter annotations and compile-time types will be unified.
  3. Another big limitation is that while the type system we use supports generics (e.g., x = [1,2,3] will be assigned List(Number) type), we don't infer types correctly when a generic function is applied to a generic argument. For example, x = [1,2,3] -> map({|a|a}) will get the type List('B).

Playground improvements

On the React components side, the most noticeable change in Squiggle 0.10 is the new UI for the right side of the playground, output viewer.

Before (Squiggle 0.9.5)
After (Squiggle 0.10.0)

In addition to the visual changes, the variables in the viewer now default to the collapsed state. If you want to expand the variable by default, you can use @startOpen decorator.

Unit type annotations (Experimental)

Another feature, contributed to Squiggle by Michael Dickens, is unit type annotations.

Unit type annotations allow users to annotate variables with unit types, such as kilograms or dollars, or combinations of those, such as m/s^2.

Try in playground

Attempts to use incompatible types will fail:

Try in playground

Unit type annotations are documented here.
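The dimensional analysis behind unit annotations can be sketched in Python (this illustrates the concept only, not Squiggle's implementation or syntax):

```python
# Sketch of dimensional analysis: units are exponent maps, so m/s^2 is
# {"m": 1, "s": -2}. Addition requires identical units; multiplication
# adds exponents.
from collections import Counter

class Quantity:
    def __init__(self, value, **units):
        self.value = value
        self.units = {u: p for u, p in units.items() if p != 0}

    def __add__(self, other):
        if self.units != other.units:
            raise TypeError(
                f"incompatible units: {self.units} vs {other.units}"
            )
        return Quantity(self.value + other.value, **self.units)

    def __mul__(self, other):
        merged = Counter(self.units)
        merged.update(other.units)  # exponents add under multiplication
        return Quantity(self.value * other.value, **merged)
```

The failure mode mirrors the playground example above: adding meters to seconds raises an error, while multiplying an acceleration by a time correctly yields a velocity.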

This feature is independent from the type system described above.

Note that this feature is experimental and might be deprecated or removed in the future in favor of some other syntax. Preliminary plans can be found in this GitHub issue.

Other changes

New standard library functions: List.sample, List.sampleN, Number.mod.

New standard library constants: Number.minValue and Number.maxValue.

Bugfixes and improvements:

]]>
<![CDATA[AI for Resolving Forecasting Questions: An Early Exploration]]>Thanks to Slava Matyuhin for comments

Summary

  1. AIs can be used to resolve forecasting questions on platforms like Manifold and Metaculus.
  2. AI question resolution, in theory, can be far more predictable, accessible, and inexpensive than human resolution.
  3. Current AI tools (combinations of LLM calls and software) are mediocre at judging
]]>
https://quantifieduncertainty.org/posts/ai-for-resolving-forecasting-questions-an-early-exploration/678ae3505d571b0001574ef3Fri, 17 Jan 2025 23:13:17 GMTThanks to Slava Matyuhin for comments

Summary

  1. AIs can be used to resolve forecasting questions on platforms like Manifold and Metaculus.
  2. AI question resolution, in theory, can be far more predictable, accessible, and inexpensive than human resolution.
  3. Current AI tools (combinations of LLM calls and software) are mediocre at judging subjective or speculative questions. We expect them to improve. 
  4. AI tools are changing rapidly. If we want to organize forecasting questions to be resolved using an AI tool in 10 years, instead of selecting a specific tool, we might want to select a protocol for choosing the best tool at resolution time. We call this an "Epistemic Selection Protocol". 
  5. If we have a strong "Epistemic Selection Protocol" that we like, and expect that we'll have sufficient AI tools in the near future, then we could arguably start to write forecasting questions for them right away. 
  6. The methods listed above, if successful, wouldn't just be for resolving questions on prediction platforms. They could be generalized to pursue broad-scale and diverse evaluations. 
  7. We provide a “A Short Story of How This Plays Out", which is perhaps the most accessible part of this for most readers.  
  8. We list a bunch of potential complications with AI tools and Epistemic Selection Protocols. While these will be complex, we see no fundamental bottlenecks. 

Motivation

Say we want to create a prediction tournament on a complex question.

For example:

  • “Will there be over 10 million people killed in global wars between 2025 and 2030?”
  • “Will IT security incidents be a national US concern, on the scale of the top 5 concerns, in 2030?”
  • “Will bottlenecks in power capacity limit AIs by over 50%, in 2030?”

While these questions involve subjective elements, they aren't purely matters of opinion. Come 2030, though people might debate the details, informed observers would likely reach broadly similar conclusions.

Currently, such questions are typically resolved by human evaluators. On Manifold, the question author serves as judge, with authors developing reputations for their judgment quality over time. Metaculus takes a different approach for more complex and subjective questions, often employing small panels of subject matter experts.

However, human evaluators have several limitations:

  1. High costs, particularly for domain experts
  2. Limited track records of evaluation quality
  3. Poor accessibility - most forecasters cannot directly query them
  4. Inconsistency over time as their views and motivations evolve
  5. Uncertain long-term availability
  6. Vulnerability to biases and potential conflicts of interest

AIs as Question Evaluators

Instead of using humans to resolve thorny questions, we can use AIs. There are many ways we could attempt to do this, so we’ll walk through a few examples.

Option 1: Use an LLM

The first option to consider is to use an LLM as an evaluator. For example, write:
“This question will be judged by Claude 3.5 Sonnet (2024-06-20). The specific prompt will be, …”

This style is replicable, simple, and inexpensive. However, it clearly has some downsides. The first obvious one is that Claude 3.5 Sonnet doesn’t perform web searches, so its knowledge would likely be too limited to resolve future forecasting questions.

Option 2: Use an AI Tool with Search

Instead of using a standard LLM, you might want to use a tool that uses both LLMs and web searches. Perplexity might be the most famous one now, but other advanced research assistants are starting to come out. In theory one should be able to set a research budget that’s in line with the importance and complexity of the question.

There's been some experimentation here. See this guide on using GPT4 for resolving questions on Manifold, and this experiment write-up. GPT4 allows for basic web searches.

This option is probably better than Option 1 for most things. But there are still problems. The next major one is the risk that Perplexity, or any other single tool we can point to now, won’t be the leading one in the future. The field is moving rapidly; it’s difficult to tell which tools will even exist in 5 years, let alone be the preferred options.

Option 3: Use an “Epistemic” Selection Protocol

In this case, one doesn't select a specific AI tool. Instead one selects a process or protocol that selects an AI tool.

For example:
“In 2030, we will resolve this question using the leading AI tool on the ‘Forbes 2030 Most trusted AI tools’ list.”

We’re looking for AI tools that are “trusted to reason about complex, often speculative or political matters.” This arguably can be more quickly expressed as searching for the tool with the best epistemics.

Epistemic Selection Protocols (Or, how do we choose the best AI tool to use?)

Arguably, AI Epistemic Selection Protocols could be the best of the above options for most 2+ year questions, if implemented effectively. There are a lot of potential processes to choose from, though most would be too complicated to be worthwhile. We want to strike a balance between simplicity and accuracy.

Let’s first list the most obvious options.

Option 1: Trusted and formalized epistemic evaluations

There’s currently a wide variety of AI benchmarks. For example, TruthfulQA and various investigations of model sycophancy. But arguably, none of these would be great proxies for which AI tool would be the most trusted question resolvers in the future. Newer, deliberate benchmarks could help here.

Example:
“This forecasting question will be resolved, using whichever AI Tool does the best on Complete Epistemic Benchmark X, and can be used for less than $20.”

Option 2: Human-derived trust rankings

Humans could simply be polled on which AI tools they regard as the most trustworthy. Different groups of humans would have different preferences, so the group would need to be specified in advance for an AI Selection Process.

Example:
“This forecasting question will be resolved, using whichever AI Tool is on the top of the list of ‘Most trusted AI Tools’ on LessWrong, and can be used for less than $20.”

Option 3: Inter-AI trust ratings

AI tools could select future AI tools to use. This could be a 1-step solution, where an open-source or standardized (for the sake of ensuring it will be available long-term) solution is asked to identify the best available candidate. Or it could be a multiple-step solution, where perhaps AI tools are asked to recommend each other using some simple algorithm. This can be similar in concept to the Community Notes algorithm.

Example:
“This forecasting question will be resolved, using whichever AI Tool wins a poll of the ‘Most trusted AI tools’ according to AI tools.’ In this poll, each AI tool will recommend its favorite of the other available candidates.” (Note: This specific proposal can be gamed, so greater complexity will likely be required.)
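The one-step version of this poll can be sketched as follows (a toy Python illustration; as the note above says, this naive scheme can be gamed, so a real protocol would need more safeguards):

```python
# Toy inter-AI trust poll: each tool recommends its favorite among the
# OTHER candidates; the tool with the most recommendations wins, with
# ties broken alphabetically for determinism.
from collections import Counter

def run_poll(recommendations):
    """`recommendations` maps each tool's name to the tool it recommends.

    Self-votes are discarded, matching the rule that each tool must
    recommend its favorite of the other available candidates.
    """
    votes = Counter(
        rec for tool, rec in recommendations.items() if rec != tool
    )
    return min(votes, key=lambda t: (-votes[t], t))
```

An obvious gaming strategy is collusion among copies of the same underlying model, which is one reason a multi-step or Community-Notes-style aggregation would likely be needed in practice.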

A Short Story of How This Plays Out

In 2025, several question writers on Manifold experiment with AI resolution systems. Some questions include:

“Will California fires in 2026 be worse than those in 2025? To answer this, I’ll ask Perplexity on Jan 1, 2026. My prompt will be, [Will California fires in 2026 be worse than those in 2025? Judge this by guessing the total economic loss.]”

“How many employees will OpenAI have in Dec 2025? To answer this, I’ll first ask commenters to write arguments and/or facts that they’ve found on this. I’ll filter this for what seems accurate, then I’ll paste this into Perplexity. I’ll call Perplexity 5 times, and average the results.”

Forecasters gradually identify the uses and limitations of such systems. It turns out they are surprisingly bad at advanced physics questions, for reasons that remain unclear. There are a few clever prompting strategies that help ensure that these AIs produce more consistent results.

AI tools like Perplexity also get very good at hunting down and answering questions that are straightforward to resolve. Manifold adds custom functionality to do this. For example, say someone writes a question, “What Movie Will Win The 2025 Oscars For Best Picture?” When they do, they’ll be given the option to have a Manifold AI system automatically make a suggested guess for them, at the time of expected question resolution. These guesses will begin with high error rates (10%), but these will gradually drop.

Separately, various epistemic evaluations are established. There are multiple public and private rankings. There are also surveys of the “Most Trusted AIs”, held on various platforms such as Manifold, LessWrong, and The Verge. Leading consumer product review websites such as Consumer Reports and Wirecutter begin to have ratings for AI tools, using defined categories such as “accuracy” and “reasonableness.”

One example question from this is:
“In 2030, will it seem like o1 was an important AI development, that was at least as innovative and important as GPT4? This will be resolved using whichever AI leads the “Most trusted AIs” poll on Manifold in 2029.”

There will be a long tail of AI tools that are proposed as contenders for epistemic benchmarks. Most of the options are simply minor tweaks on other options or light routers. Few of these will get the full standard evaluations, but good proxies will emerge. It turns out that you can get a decent measure by using the top fully-evaluated AI systems to evaluate more niche systems.

In 2027, there will be a significant amount of understanding, buy-in, and sophistication with such systems (at least among a few niche communities, like Manifold users). This will make it possible to scale them for more ambitious uses.

Metaculus runs some competitions that include:

“What is the relative value of each of [the top 500 AI safety papers of 2026]? This will be resolved in 2030 by using the most trusted AI system, via LessWrong or The Economist, at that time. This AI will order all of the papers - forecasters should estimate the percentile that each paper will achieve.”

“What is the expected value of every biosafety organization in 2027, estimated as what Open Philanthropy would have paid for it from their biosafety funding pool in 2027? This will be judged in 2029, by the most trusted AI system, with a budget of $1,000 for each evaluation.”

Around this time, some researchers will begin to make broader kinds of analyses and forecast compressions.

“How will the top epistemic model of 2030 evaluate the accuracy and value of the claims of each of the top 100 intellectuals from 2027?”

“Will the top epistemic model of 2030 consider the current top epistemic models to be ‘highly overconfident’ for at least 10% of the normative questions they are asked?”

The top trusted AI tools start to become frequent ways to second-guess humans. For example, if a boss makes a controversial decision, people could contest the decision if top AI tools back them up. Similar analyses would be used within governments.

As these AI tools become even more trusted, they will replace many humans for important analyses and decisions. Humans will spend a great deal of effort focused on assuring these AI tools are doing a good job, and they'll have the time for that because there will be few other things they need to directly evaluate or oversee. 

Protocol Complications & Potential Solutions

Complication 1: Lack of Sufficient AI Tools

In the beginning, we expect that many people won’t trust any AI tools to be adequate in resolving many questions. Even if tools look good in evaluations, it will take time for them to build trust.

One option is to set certain criteria for sufficiency. For example, one might say, “This question will be resolved using whichever AI system first gets to a 90/100 on the Epistemic Benchmark Evaluation…” This shifts the required trust from any specific tool to the evaluations themselves, so it depends on high-quality evaluations or polls.
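A sufficiency criterion like this is easy to state mechanically. A minimal sketch, with invented system names and scores where a real protocol would consult a live benchmark registry:

```python
# Sketch of "whichever AI system first reaches 90/100 on the benchmark".
# score_history holds (date, system_name, score) tuples in chronological
# order; all entries here are hypothetical.

def select_resolver(score_history, threshold=90):
    """Return the first system to reach the threshold, or None if none has."""
    for date, system, score in score_history:
        if score >= threshold:
            return system
    return None

history = [
    ("2026-01", "RouterA", 84),
    ("2026-05", "ModelB", 91),
    ("2026-06", "ModelC", 95),
]
resolver = select_resolver(history)
```

Note that "first past the threshold" wins here even if a later system scores higher; a protocol could instead pick the top scorer as of the resolution date.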

Complication 2: Lack of Ground Truth

Many subjective and speculative questions lack definitive answers. Some questions can only be answered long after the information would be useful, while others are inherently impossible to resolve with complete certainty.

In such cases, the goal should shift from seeking perfect precision to outperforming alternative evaluation methods. Success means providing better answers than existing approaches, given practical constraints of cost, compute, and time.

AI evaluators should prioritize two key aspects:

  1. Calibration: Systems should express appropriate levels of certainty, aligned with the reference frame they're operating from
  2. Resolution: Within the bounds of reliable calibration, provide the most detailed and specific answers possible
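These two criteria can be measured jointly with the standard Murphy decomposition of the Brier score, where the reliability term captures (mis)calibration and the resolution term captures informativeness. A minimal sketch for binary questions (the binning scheme is a common choice, not anything specified by the post):

```python
# Murphy decomposition of the Brier score:
#   Brier = reliability (miscalibration) - resolution + base-rate uncertainty
# Forecasts are probabilities in [0, 1]; outcomes are 0/1.
from collections import defaultdict
from statistics import mean

def brier_decomposition(forecasts, outcomes, bins=10):
    n = len(forecasts)
    base_rate = mean(outcomes)
    groups = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        groups[min(int(p * bins), bins - 1)].append((p, o))
    reliability = resolution = 0.0
    for pairs in groups.values():
        avg_p = mean(p for p, _ in pairs)   # mean stated probability in bin
        avg_o = mean(o for _, o in pairs)   # observed frequency in bin
        weight = len(pairs) / n
        reliability += weight * (avg_p - avg_o) ** 2
        resolution += weight * (avg_o - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```

A good resolver AI should drive reliability toward zero (criterion 1) while keeping resolution as high as possible (criterion 2).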

For example, consider the question "What will be the economic impact of California wildfires in 2028?" While a perfectly precise answer is impossible, AI systems can progressively approach better estimates by:

  • Aggregating multiple data sources
  • Explicitly modeling uncertainties
  • Identifying and accounting for measurement limitations
  • Clearly stating assumptions and confidence levels

As long as a resolution is calibrated and fairly unbiased, it can be incentive-compatible for forecasting.

Complication 3: Goodharting

We’d want to avoid a situation where one tool technically maximizes a narrow “Epistemic Selection Protocol”, but is actually poor at doing many of the things we want from a resolver AI.

To get around this, the Protocol could make specifications like the following:

What will be the most epistemically-capable service on [Date] that satisfies the following requirements?
- Costs under $20 per run.
- Is publicly available.
- Has over 1000 human users per month (this ensures there’s no bottleneck that would be hard to specify otherwise).
- Completes runs within 10 minutes.
- Has been separately reviewed to confirm it has not significantly and deceptively goodharted on this specific benchmark.
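A specification like the one above amounts to filtering candidates on hard requirements first, then ranking only the survivors on capability. A sketch with entirely hypothetical services and scores:

```python
# Sketch of the requirements-based Protocol: filter on hard constraints,
# then rank on epistemic capability. All services and numbers are invented.
candidates = [
    {"name": "SvcA", "cost": 15, "public": True, "users": 4000,
     "minutes": 8, "goodhart_reviewed": True, "epistemic_score": 88},
    {"name": "SvcB", "cost": 5, "public": True, "users": 600,
     "minutes": 3, "goodhart_reviewed": True, "epistemic_score": 95},
    {"name": "SvcC", "cost": 18, "public": True, "users": 2500,
     "minutes": 9, "goodhart_reviewed": False, "epistemic_score": 97},
]

def qualifies(c):
    return (c["cost"] < 20 and c["public"] and c["users"] > 1000
            and c["minutes"] <= 10 and c["goodhart_reviewed"])

eligible = [c for c in candidates if qualifies(c)]
winner = max(eligible, key=lambda c: c["epistemic_score"])
```

Note the deliberate outcome: the highest raw scorer (SvcC) loses because it failed the goodharting review, which is exactly the behavior the extra requirements are meant to produce.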

It’s often possible to get around Goodharting by applying additional layers of complexity. Whether it’s worth it depends on the situation.

Complication 4: Different Ideologies

Consider a question about the impact of a tax cut policy. People with different philosophical or ideological backgrounds will likely disagree on fundamental assumptions, making a single "correct" answer impossible.

The simplest solution—declaring it a "complex issue with multiple valid perspectives"—is typically close to useless. A more useful approach would be developing AI tools that can represent different ideological frameworks, either through multiple specialized systems or a single system with adjustable philosophical parameters.

A more sophisticated approach could generate personalized evaluations based on individual characteristics and depth of analysis. For instance: "How would someone with background X evaluate California Proposition 10 in 2030, after studying it for [10, 100, 1000] hours?" This could be implemented using an algorithm that accounts for both personal attributes and time invested in understanding the issue. Scorable Functions might be a useful format.

Complication 5: AI Tools with Different Strengths

One might ask:
“What if different AI tools are epistemically dominant in different areas? For example, one is great at political science, and another is great at advanced mathematics.”

An obvious answer is to then create simple compositions of AI tools. A router can be used to send specific requests or sub-requests to other AI tools that are best equipped to handle them.
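A router composition like this is simple to sketch. The domain tools here are stand-in functions where real systems would be API calls, and the routing table is hypothetical:

```python
# Minimal sketch of composing AI tools via a router: each request is
# tagged with a domain and forwarded to whichever tool is believed to be
# strongest there. Tool functions are stand-ins for real model calls.

def political_science_tool(question):
    return f"[poli-sci answer to: {question}]"

def math_tool(question):
    return f"[math answer to: {question}]"

ROUTES = {
    "political_science": political_science_tool,
    "math": math_tool,
}

def route(question, domain, fallback=political_science_tool):
    """Send the question to the best-equipped tool for its domain."""
    return ROUTES.get(domain, fallback)(question)

answer = route("Prove there are infinitely many primes.", "math")
```

In practice the interesting work is in the routing decision itself, which could be delegated to yet another evaluated model, and in splitting a question into sub-requests for different tools.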

One possible AI tool resolution workflow

Complication 6: AI Tools that Recommend Other AI Tools

Imagine a situation where one AI tool is chosen as the resolver, but that tool recommends a different tool instead. For example, Perplexity 3.0 is asked a question and responds that Claude 4.5 could do a better job. Arguably, if an AI tool is trusted to make speculative judgements, it should also be trusted when it claims that a different tool is superior to itself.

This probably won’t be a major bottleneck. If an AI tool can delegate specific questions to other tools, that capability can simply be treated as part of the tool during evaluation.

Going Forward

We hope this post is useful for advancing the conversation around AIs for question resolution and Epistemic AI Protocols. But it's still an early conversation.

We think there's a great deal of early experimentation and exploration to do in this broad space. Modern AI tools are arguably already good enough for broad use, and light wrappers on such tools can get us further. We hope to see work here in the next few years.

]]>
<![CDATA[Squiggle AI, Published to the EA Forum]]>
]]>
https://quantifieduncertainty.org/posts/squiggle-ai-published-to-the-ea-forum-2/67784005b2833e0001ca6ea0Fri, 03 Jan 2025 20:02:05 GMT
Introducing Squiggle AI — EA Forum
We’re releasing Squiggle AI, a tool that generates probabilistic models using the Squiggle language. This can provide early cost-effectiveness models…

We have previously written about Squiggle AI here, but waited until it was more tested and we had a better write-up before posting it to the Effective Altruism Forum. We've also cross-posted it to LessWrong, where it might get some other discussion.

Unlike our previous posts about it, this one has a better overview of the system, a list of example outputs that we found interesting, a guide to using it, and a list of lessons learned from development. It's generally a much better overview.

In addition, if you'd rather post questions for Squiggle AI in the comments instead of using the tool directly, you're welcome to do so there, and we'll reply accordingly.

We're hoping that people in the effective altruist, rationalist, and AI safety communities grow their use of the tool, both for direct value, and for ideation of future tools and features.

All that said, there's of course still a long list of potential improvements to the system. There's clearly a lot that could be done in this area. We've been thinking a lot about how to be the most impactful, given we have a small team.

]]>