<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:base="https://ruxu.dev/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Blog | Roger Oriol</title>
    <link>https://ruxu.dev/</link>
    <atom:link href="https://ruxu.dev/feed.xml" rel="self" type="application/rss+xml" />
    <description>My name is Roger Oriol, I am a Software Architect based in Barcelona, Spain. I am an MSc graduate in Big Data Management, Technologies and Analytics. This blog is my vehicle for sharing and discussing topics on web development, data architecture, software architecture and much more.</description>
    <language>en</language>
    <item>
      <title>CaixaBank importer for Beancount</title>
      <link>https://ruxu.dev/projects/beancount-caixabank/</link>
      <description>&lt;h2 id=&quot;motivation&quot; tabindex=&quot;-1&quot;&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Easily import CaixaBank bank statements into your Beancount ledger.&lt;/p&gt;
&lt;h2 id=&quot;features&quot; tabindex=&quot;-1&quot;&gt;Features&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Format Support&lt;/strong&gt;: Handles CaixaBank Excel exports (.xls and .xlsx)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Robust Parsing&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Flexible header detection (handles statements with metadata rows)&lt;/li&gt;
&lt;li&gt;Supports both Excel date serial numbers and DD/MM/YYYY string dates&lt;/li&gt;
&lt;li&gt;European number format parsing (comma as decimal, dot as thousands separator)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full Beancount Integration&lt;/strong&gt;: Creates proper transactions with metadata, payee, and narration&lt;/li&gt;
&lt;/ul&gt;
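&lt;p&gt;As an illustration of the European number format handling described above, here is a minimal Python sketch (not the importer&#39;s actual code; the function name is made up):&lt;/p&gt;

```python
def parse_european_amount(raw):
    """Parse a European-formatted amount like "1.234,56".

    Dots are thousands separators and the comma is the decimal mark,
    so we drop the dots and turn the comma into a decimal point.
    """
    return float(raw.replace(".", "").replace(",", "."))
```

For example, &lt;code&gt;parse_european_amount(&quot;1.234,56&quot;)&lt;/code&gt; yields &lt;code&gt;1234.56&lt;/code&gt;.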
&lt;h2 id=&quot;contributing&quot; tabindex=&quot;-1&quot;&gt;Contributing&lt;/h2&gt;
&lt;p&gt;Contributions are welcome! Please feel free to submit a PR.&lt;/p&gt;
&lt;h2 id=&quot;license&quot; tabindex=&quot;-1&quot;&gt;License&lt;/h2&gt;
&lt;p&gt;This project is licensed under the MIT License - see the &lt;a href=&quot;https://github.com/rogiia/beancount-caixabank/blob/main/LICENSE.txt&quot;&gt;LICENSE.txt&lt;/a&gt; file for details.&lt;/p&gt;
</description>
      <pubDate>Sun, 04 Jan 2026 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/projects/beancount-caixabank/</guid>
    </item>
    <item>
      <title>Gym sessions tracker</title>
      <link>https://ruxu.dev/projects/gym-tracker/</link>
      <description>&lt;h2 id=&quot;motivation&quot; tabindex=&quot;-1&quot;&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Track gym sessions as a heatmap to guilt myself into going more to the gym.&lt;/p&gt;
&lt;h2 id=&quot;project-overview&quot; tabindex=&quot;-1&quot;&gt;Project Overview&lt;/h2&gt;
&lt;p&gt;Gym Tracker is a containerized web application for tracking gym sessions and muscle group training frequency. It displays a heatmap of workout sessions, tracks muscle group balance, and provides insights into training patterns.&lt;/p&gt;
&lt;h2 id=&quot;features&quot; tabindex=&quot;-1&quot;&gt;Features&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Log gym sessions, including which muscles were trained
&lt;ul&gt;
&lt;li&gt;Can log sessions after the fact&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Visualize at a glance how often I&#39;m training, using a heatmap like GitHub&#39;s&lt;/li&gt;
&lt;li&gt;For each body part, show the last time I trained it, so I can easily tell which body parts to train next session&lt;/li&gt;
&lt;li&gt;Self-hostable application&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;tech-stack&quot; tabindex=&quot;-1&quot;&gt;Tech Stack&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Backend: Node.js with Express&lt;/li&gt;
&lt;li&gt;Database: SQLite&lt;/li&gt;
&lt;li&gt;Frontend: Vanilla JavaScript&lt;/li&gt;
&lt;li&gt;Deployment: Docker container with Nginx reverse proxy&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;contributing&quot; tabindex=&quot;-1&quot;&gt;Contributing&lt;/h2&gt;
&lt;p&gt;Contributions are welcome! Please feel free to submit a PR.&lt;/p&gt;
&lt;h2 id=&quot;license&quot; tabindex=&quot;-1&quot;&gt;License&lt;/h2&gt;
&lt;p&gt;This project is licensed under the MIT License&lt;/p&gt;
</description>
      <pubDate>Sun, 02 Nov 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/projects/gym-tracker/</guid>
    </item>
    <item>
      <title>[Link] GPT-5</title>
      <link>https://ruxu.dev/links/gpt-5/</link>
      <description>&lt;p&gt;OpenAI has finally released its &lt;strong&gt;GPT-5&lt;/strong&gt; model, and as we were already expecting, it&#39;s a hybrid reasoning model. Now the model itself chooses how much to think about each task, and you can force the reasoning effort as well. This probably means the end of the o series of reasoning models from OpenAI, as the regular language models and the reasoning models will now be unified.&lt;/p&gt;
&lt;p&gt;Of course, the benchmarks look good but saturated. What stands out to me is that they announced a &lt;strong&gt;74.9&lt;/strong&gt; score on SWE-bench (with high reasoning effort), which is just a tad over the score from Claude Opus 4.1 just announced this very same week (74.5).&lt;/p&gt;
&lt;p&gt;With the GPT-5 iteration come four new models: &lt;strong&gt;GPT-5, GPT-5-mini, GPT-5-nano and GPT-5 Chat&lt;/strong&gt;. Free users will be allowed to use GPT-5, although when they hit the maximum quota, they will fall back to GPT-5-mini.&lt;/p&gt;
&lt;p&gt;GPT-5 lets you &lt;strong&gt;set the reasoning effort&lt;/strong&gt; using the &amp;quot;reasoning.effort&amp;quot; parameter, although you can also nudge it by telling the model to &amp;quot;Think hard about this&amp;quot;. These new models introduce a new reasoning tier called &amp;quot;minimal&amp;quot;, which produces as few reasoning tokens as possible before answering. The output length can also be customized by setting the &amp;quot;verbosity&amp;quot; parameter, which didn&#39;t exist for past models. This parameter can be set to &amp;quot;high&amp;quot;, &amp;quot;medium&amp;quot; or &amp;quot;low&amp;quot;.&lt;/p&gt;
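&lt;p&gt;A minimal sketch of how these two parameters could appear in a request body; the prompt and exact field layout are illustrative, so check the API reference for the canonical shape:&lt;/p&gt;

```python
# Illustrative request payload using the parameters described above;
# field names mirror the post ("reasoning.effort", "verbosity").
payload = {
    "model": "gpt-5",
    "input": "Summarize this changelog in two sentences.",
    "reasoning": {"effort": "minimal"},  # new "minimal" reasoning tier
    "text": {"verbosity": "low"},        # "high", "medium" or "low"
}
```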
&lt;p&gt;The new models also bring some new quality of life improvements for tool calling:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tool choice:&lt;/strong&gt; While the models can choose to call zero, one or multiple tools, you can now set &amp;quot;tool_choice&amp;quot; to &amp;quot;forced&amp;quot; to force the invocation of at least one tool. You can also require a specific function to be called by passing {&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;function name&amp;quot;} as the &amp;quot;tool_choice&amp;quot; parameter. Finally, &amp;quot;tool_choice&amp;quot; also accepts a list of allowed tools, a subset of the tools provided to the model: {&amp;quot;type&amp;quot;: &amp;quot;allowed_tools&amp;quot;, &amp;quot;mode&amp;quot;: &amp;quot;auto&amp;quot;, &amp;quot;tools&amp;quot;: []}.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tool preambles:&lt;/strong&gt; A new feature that makes the models explain the rationale behind why they are invoking a function. This provides transparency and a better understanding of the model&#39;s process. By default, this feature is not enabled; to enable it, include a system message like &amp;quot;Before you call a tool, explain why you are calling it.&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom tools:&lt;/strong&gt; This feature lets you define functions that accept unstructured, free-form text as input, which frees the model from having to produce a structured JSON object to call the tool. This might improve the model&#39;s ability to call these tools, and can be even more powerful when paired with the Context-Free Grammar feature.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context-Free Grammar:&lt;/strong&gt; This feature lets you constrain the free-form text with grammar rules. You can define these rules using Lark or a regular expression.&lt;/li&gt;
&lt;/ul&gt;
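&lt;p&gt;The three &amp;quot;tool_choice&amp;quot; shapes described above can be sketched as plain values (the function name is hypothetical, and the strings are mirrored from the text, so check the API reference for the exact values):&lt;/p&gt;

```python
# Force the model to invoke at least one tool.
force_any_tool = "forced"

# Require one specific function to be called (tool name is made up).
force_one_tool = {"type": "function", "name": "get_weather"}

# Restrict the model to a subset of the provided tools.
allowed_subset = {"type": "allowed_tools", "mode": "auto", "tools": []}
```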
&lt;p&gt;The GPT-5 models are now available both in ChatGPT and in the OpenAI API, give them a try!&lt;/p&gt;
</description>
      <pubDate>Thu, 07 Aug 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/gpt-5/</guid>
    </item>
    <item>
      <title>[Quote] GPT-5 variants</title>
      <link>https://ruxu.dev/links/gpt-5-variants/</link>
      <description>&lt;p&gt;It&#39;s not at all straightforward to understand the variants of the GPT-5 model released today. The API docs describe four models: gpt-5, gpt-5-mini, gpt-5-nano and gpt-5-chat. However, the system card describes six models that replace older models, and none of the names match the API:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It can be helpful to think of the GPT-5 models as successors to previous models:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Table 1: Model progressions&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Previous model&lt;/th&gt;&lt;th&gt;GPT-5 model&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;GPT-4o&lt;/td&gt;&lt;td&gt;gpt-5-main&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;GPT-4o-mini&lt;/td&gt;&lt;td&gt;gpt-5-main-mini&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;OpenAI o3&lt;/td&gt;&lt;td&gt;gpt-5-thinking&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;OpenAI o4-mini&lt;/td&gt;&lt;td&gt;gpt-5-thinking-mini&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;GPT-4.1-nano&lt;/td&gt;&lt;td&gt;gpt-5-thinking-nano&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;OpenAI o3 Pro&lt;/td&gt;&lt;td&gt;gpt-5-thinking-pro&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The answer is that the gpt-5 model is composed of the gpt-5-main model, the gpt-5-thinking model and a router that selects which model to send the prompt to:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The same applies to the mini model: gpt-5-mini is made of a gpt-5-main-mini model, a gpt-5-thinking-mini model and a router. The nano model only seems to have a thinking variant, not a main one, but this makes sense: a single model without a router is faster. That leaves only the gpt-5-thinking-pro model, which &lt;strong&gt;cannot&lt;/strong&gt; be used via the API, only via ChatGPT with a Pro subscription:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the API, we provide direct access to the thinking model, its mini version, and an even smaller and faster nano version of the thinking model, made for developers (gpt-5-thinking-nano). In ChatGPT, we also provide access to gpt-5-thinking using a setting that makes use of parallel test time compute; we refer to this as gpt-5-thinking-pro.&lt;/p&gt;
&lt;/blockquote&gt;
</description>
      <pubDate>Thu, 07 Aug 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/gpt-5-variants/</guid>
    </item>
    <item>
      <title>[Link] GPT-OSS</title>
      <link>https://ruxu.dev/links/gpt-oss/</link>
      <description>&lt;p&gt;Just like Sam Altman hinted a while ago, OpenAI has released two open-weight models, trying to appease the common criticism that a company with &amp;quot;Open&amp;quot; in its name hasn&#39;t released any open language models in a long while (since GPT-2!).&lt;/p&gt;
&lt;p&gt;The new open-weights models (not open-source, as the name seems to imply) are mixture-of-experts models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;gpt-oss-120b: &lt;strong&gt;116.83 billion parameters&lt;/strong&gt; with &lt;strong&gt;5.13 billion active parameters&lt;/strong&gt;. It has &lt;strong&gt;128 experts&lt;/strong&gt; and activates 4 experts for each token.&lt;/li&gt;
&lt;li&gt;gpt-oss-20b: &lt;strong&gt;20.91 billion parameters&lt;/strong&gt; with &lt;strong&gt;3.61 billion active parameters&lt;/strong&gt;. It has &lt;strong&gt;32 experts&lt;/strong&gt; and activates 4 experts for each token.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both models are &lt;strong&gt;reasoning models&lt;/strong&gt;, and OpenAI therefore compares them to its own o3 and o4 models. It seems like the 120b version is comparable to o4-mini and the 20b version is comparable to o3-mini. The new models have been thoroughly trained for agentic tasks: in the post-training stage, they were trained specifically to use a &lt;strong&gt;browser tool&lt;/strong&gt; and a &lt;strong&gt;Python code execution tool&lt;/strong&gt;, as well as other generic tools.&lt;/p&gt;
&lt;p&gt;OpenAI has also introduced a new tokenizer made specially for these new models, called &lt;strong&gt;harmony&lt;/strong&gt;. What sets this tokenizer apart from others is that it introduces a &lt;strong&gt;&amp;quot;channels&amp;quot;&lt;/strong&gt; concept that allows the model to separate its output between user-facing text and internal-facing outputs. Another interesting concept it introduces is the &lt;strong&gt;&amp;quot;system message&amp;quot;&lt;/strong&gt;, which differs from the already familiar &amp;quot;system prompt&amp;quot;. The system message allows for &lt;strong&gt;configuration of dates&lt;/strong&gt;, like &amp;quot;Knowledge cutoff: 2024-06&amp;quot; and &amp;quot;Current date: 2025-06-28&amp;quot;. It also allows you to &lt;strong&gt;set the reasoning effort&lt;/strong&gt; with &amp;quot;Reasoning: high&amp;quot;. Finally, it also allows the &lt;strong&gt;configuration of channels&lt;/strong&gt;, what they are used for, and the &lt;strong&gt;tools that the model can use&lt;/strong&gt;.&lt;/p&gt;
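&lt;p&gt;Putting those snippets together, a harmony-style system message might look like this (an illustrative sketch assembled from the examples quoted above, not the full harmony format):&lt;/p&gt;

```python
# Assemble the configuration lines quoted above into one system message.
system_message = "\n".join([
    "Knowledge cutoff: 2024-06",
    "Current date: 2025-06-28",
    "Reasoning: high",
])
```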
&lt;p&gt;A great feature of these models is that OpenAI has optimized them to easily fit in &lt;strong&gt;a single H100 80GB GPU&lt;/strong&gt; for the largest model and in &lt;strong&gt;a 16GB consumer GPU&lt;/strong&gt; for the small one. This was achieved using &lt;strong&gt;MXFP4 quantization&lt;/strong&gt; to 4.25 bits per parameter, which very significantly reduces the model size. While it is possible to natively train models in this quantization to reduce quality degradation, it looks like in this case the quantization was applied after training.&lt;/p&gt;
&lt;p&gt;You can easily start using these models locally with Ollama. I recommend downloading the 20b model that fits in a consumer GPU. It runs really fast on my MacBook!&lt;/p&gt;
</description>
      <pubDate>Wed, 06 Aug 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/gpt-oss/</guid>
    </item>
    <item>
      <title>[Quote] How we built our multi-agent research system</title>
      <link>https://ruxu.dev/links/anthropic-multiagent-research-system/</link>
      <description>&lt;p&gt;While reading Anthropic&#39;s great article &amp;quot;&lt;a href=&quot;https://www.anthropic.com/engineering/built-multi-agent-research-system&quot;&gt;How we built our multi-agent research system&lt;/a&gt;&amp;quot;, I stumbled upon this quote, where Anthropic researchers present results showing that multi-agent systems outperform single agents on complex tasks:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For example, when asked to identify all the board members of the companies in the Information Technology S&amp;amp;P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single agent system failed to find the answer with slow, sequential searches.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This makes a ton of sense to me. We know that LLMs do their best when the scope of the task they are given is as narrow as possible and when they have as much relevant context as possible. By using an orchestrator agent to decompose tasks and give them to sub-agents, we are effectively narrowing down the scope of the task, as well as slimming down the amount of context not relevant to the specific subtask that the sub-agent will do.&lt;/p&gt;
&lt;p&gt;Another interesting finding from this article is that Anthropic claims that 80% of the variance in results on the BrowseComp benchmark can be explained by token usage alone:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In our analysis, three factors explained 95% of the performance variance in the BrowseComp evaluation (which tests the ability of browsing agents to locate hard-to-find information). We found that token usage by itself explains 80% of the variance, with the number of tool calls and the model choice as the two other explanatory factors.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This also makes using multiple agents more optimal: they can use more tokens (because they do so in parallel) and use them more efficiently (because, with the context split per subtask, each agent is less likely to hit the point where performance starts to degrade as the context window fills up). It is also in Anthropic&#39;s best interest that you burn tokens at 15x the usual rate (according to them) with multi-agent architectures, so they get paid more. So take this with a grain of salt.&lt;/p&gt;
&lt;p&gt;I encourage you to read the whole article, as there are many very interesting tips for designing multi-agent applications.&lt;/p&gt;
</description>
      <pubDate>Sun, 15 Jun 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/anthropic-multiagent-research-system/</guid>
    </item>
    <item>
      <title>[Link] Artificial Intelligence 3E: Foundations of computational agents</title>
      <link>https://ruxu.dev/links/what-is-an-agent/</link>
      <description>&lt;blockquote&gt;
&lt;p&gt;An agent is something that acts in an environment; it does something. Agents include worms, dogs, thermostats, airplanes, robots, humans, companies, and countries.
&lt;em&gt;Artificial Intelligence: Foundations of Computational Agents, 3rd edition by David L. Poole and Alan K. Mackworth, Cambridge University Press 2023&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is the definition I personally like best for what agents are in the context of AI. Since the overuse of this word has left some of us confused about what it actually means, I would say that any application that uses AI and has the ability to act on its environment, through tools or function calling, is an agent.&lt;/p&gt;
</description>
      <pubDate>Sat, 07 Jun 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/what-is-an-agent/</guid>
    </item>
    <item>
      <title>[Link] AGI is not multimodal</title>
      <link>https://ruxu.dev/links/agi-is-not-multimodal/</link>
      <description>&lt;blockquote&gt;
&lt;p&gt;A true AGI must be general across all domains. Any complete definition must at least include the ability to solve problems that originate in physical reality, e.g. repairing a car, untying a knot, preparing food, etc.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this excellent article, Benjamin Spiegel argues that our current approach to building LLMs cannot lead to an AGI. While the current next-token prediction approach is really good at reflecting human understanding of the world, not everything in this world can be expressed with language, and not all valid language constructs are consistent with the world. Therefore, LLMs are not actually learning world models, but just the minimal language patterns that are useful in our written media.&lt;/p&gt;
&lt;p&gt;Multimodal models can be seen as solving this problem, since they unite multiple ways of seeing the world in a single embedding space. However, in multimodal models the different modalities are unnaturally separated in the training process. Instead of learning about something by interacting with it via different modalities, a separate model is trained for each modality, and the two are then artificially sewn together in the same embedding space.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Instead of pre-supposing structure in individual modalities, we should design a setting in which modality-specific processing emerges naturally.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In conclusion, while LLMs are still getting more capable, those gains are already diminishing and might hit a wall soon. To build a general model that is not constrained by the limitations of human language we should go back to the drawing board and come up with a perception system that can seamlessly unite all modalities.&lt;/p&gt;
&lt;p&gt;This article has also made me think about AI capabilities that are thriving today because they might not need to unite multiple modalities to form an understanding of the world. For example, programming. Software is built and executed in a digital environment with a ruleset that can be easily encoded into plain text. I&#39;m genuinely curious whether you need to know anything about how the world works, apart from how programming languages are used (and maybe the architecture of computers and networks), to be a good programmer.&lt;/p&gt;
</description>
      <pubDate>Thu, 05 Jun 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/agi-is-not-multimodal/</guid>
    </item>
    <item>
      <title>[Quote] Hype Coding - Steve Krouse</title>
      <link>https://ruxu.dev/links/hype-coding/</link>
      <description>&lt;blockquote&gt;
&lt;p&gt;There&#39;s a new kind of coding I call &amp;quot;hype coding&amp;quot; where you fully give into the hype, and what&#39;s coming right around the corner, that you lose sight of whats&#39; possible today. Everything is changing so fast that nobody has time to learn any tool, but we should aim to use as many as possible. Any limitation in the technology can be chalked up to a &#39;skill issue&#39; or that it&#39;ll be solved in the next AI release next week. Thinking is dead. Turn off your brain and let the computer think for you. Scroll on tiktok while the armies of agents code for you. If it isn&#39;t right, tell it to try again. Don&#39;t read. Feed outputs back in until it works. If you can&#39;t get it to work, wait for the next model or tool release. Maybe you didn&#39;t use enough MCP servers? Don&#39;t forget to add to the hype cycle by aggrandizing all your successes. Don&#39;t read this whole tweet, because it&#39;s too long. Get an AI to summarize it for you. Then call it &amp;quot;cope&amp;quot;. Most importantly, immediately mischaracterize &amp;quot;hype coding&amp;quot; to mean something different than this definition. Oh the irony! The people who don&#39;t care about details don&#39;t read the details about not reading the details&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I would summarize this sarcastic piece by Steve Krouse by reminding everyone that, while it&#39;s fun to try new technologies, it&#39;s important not to fall victim to the hype and reach for the latest, shiniest new thing for everything. Instead of choosing a tool based on the hype around it, and on what people say it can do or will be able to do, assess the tool objectively in your own workflow. If it makes YOU more productive, by all means, use it. If it doesn&#39;t, don&#39;t worry: the fad will die down eventually.&lt;/p&gt;
</description>
      <pubDate>Wed, 04 Jun 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/hype-coding/</guid>
    </item>
    <item>
      <title>[Link] OpenAI Codex CLI</title>
      <link>https://ruxu.dev/links/codex/</link>
      <description>&lt;p&gt;Together with the &lt;a href=&quot;https://openai.com/index/introducing-o3-and-o4-mini/&quot; target=&quot;_blank&quot;&gt;launch of the o3 and o4-mini reasoning models&lt;/a&gt;, OpenAI has released a coding assistant for the terminal: Codex.&lt;/p&gt;
&lt;p&gt;Codex is meant to be used with OpenAI models. You can use it to create new projects, make changes to existing projects or ask the model to explain code to you, all in the terminal. It can take multimodal input (e.g. screenshots) and can sandbox your development environment to protect your computer. It also supports context files: &lt;code&gt;~/.codex/instructions.md&lt;/code&gt; for global instructions for Codex and &lt;code&gt;codex.md&lt;/code&gt; in the project root for project-specific context.&lt;/p&gt;
&lt;p&gt;In &lt;code&gt;Full-auto&lt;/code&gt; mode, Codex can not only read and write files, but also run shell commands in an environment confined around the current directory and with network disabled. However, OpenAI suggests that in the future, you will be able to whitelist some shell commands to run with network enabled, once they have polished some security concerns.&lt;/p&gt;
&lt;p&gt;You can install codex via npm:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;npm&lt;/span&gt; i &lt;span class=&quot;token parameter variable&quot;&gt;-g&lt;/span&gt; @openai/codex&lt;/code&gt;&lt;/pre&gt;
</description>
      <pubDate>Wed, 16 Apr 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/codex/</guid>
    </item>
    <item>
      <title>[Link] GPT 4.1</title>
      <link>https://ruxu.dev/links/gpt-4-1/</link>
      <description>&lt;p&gt;After the unimpressive release of GPT-4.5 a month and a half ago, OpenAI is now releasing a new version, going backwards in numbering. Today, they released three new models, exclusive to the API: &lt;code&gt;GPT-4.1&lt;/code&gt;, &lt;code&gt;GPT-4.1 mini&lt;/code&gt; and &lt;code&gt;GPT-4.1 nano&lt;/code&gt;. In the benchmarks, GPT-4.1 easily beats GPT-4.5 at a lower price and higher speed. For this reason, OpenAI has said they will be deprecating GPT-4.5 in three months&#39; time.&lt;/p&gt;
&lt;p&gt;While this is a good step ahead for OpenAI, they are still a bit behind Claude and Gemini in some key benchmarks. In SWE-bench, GPT-4.1 gets a 55%, against 70% for Claude 3.7 Sonnet and 64% for Gemini 2.5 Pro. In Aider Polyglot, GPT-4.1 gets 53%, while Claude 3.7 Sonnet gets 65% and Gemini 2.5 Pro gets 69%.&lt;/p&gt;
&lt;p&gt;On the other hand, GPT-4.1 nano offers a similar price and latency as Gemini Flash 2.0. If the performance of this small model is comparable to Gemini Flash, it can be a great option for simple tasks.&lt;/p&gt;
</description>
      <pubDate>Mon, 14 Apr 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/gpt-4-1/</guid>
    </item>
    <item>
      <title>[Link] The Agent2Agent Protocol</title>
      <link>https://ruxu.dev/links/a2a/</link>
      <description>&lt;p&gt;Just in the middle of the year of agents, Google has released two great tools for building agents: the &lt;a href=&quot;https://google.github.io/A2A/&quot; target=&quot;_blank&quot;&gt;Agent2Agent (A2A)&lt;/a&gt; protocol and the &lt;a href=&quot;https://google.github.io/adk-docs/&quot; target=&quot;_blank&quot;&gt;Agent Development Kit (ADK)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Agent2Agent Protocol is based on &lt;a href=&quot;https://www.jsonrpc.org/specification&quot; target=&quot;_blank&quot;&gt;JSON-RPC&lt;/a&gt;, working both over plain HTTP and &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events&quot; target=&quot;_blank&quot;&gt;SSE&lt;/a&gt;. It is also built with security in mind: it implements the &lt;a href=&quot;https://swagger.io/docs/specification/v3_0/authentication/&quot; target=&quot;_blank&quot;&gt;OpenAPI Authentication Specification&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Agents published using this protocol advertise themselves to other agents via the Agent Card, which by default can be found at the path &lt;code&gt;https://agent_url/.well-known/agent.json&lt;/code&gt;. The Agent Card includes information about the agent&#39;s capabilities and requirements, which helps other agents decide whether to ask it for help.&lt;/p&gt;
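&lt;p&gt;Deriving the default Agent Card location from an agent&#39;s base URL is straightforward; a small sketch (the helper name and example URL are made up):&lt;/p&gt;

```python
def agent_card_url(base_url):
    # The A2A default discovery path, as described above.
    return base_url.rstrip("/") + "/.well-known/agent.json"
```

For example, &lt;code&gt;agent_card_url(&quot;https://agent.example.com/&quot;)&lt;/code&gt; returns &lt;code&gt;https://agent.example.com/.well-known/agent.json&lt;/code&gt;.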
&lt;p&gt;The specification includes definitions for the concepts agents exchange between themselves: Task, Artifact, Message, Part and Push Notification.&lt;/p&gt;
&lt;p&gt;This new protocol is not meant to replace &lt;a href=&quot;https://modelcontextprotocol.io/introduction&quot; target=&quot;_blank&quot;&gt;Anthropic&#39;s Model Context Protocol&lt;/a&gt;. They are actually meant to work together. While MCP allows agents to have access to external tools and data sources, A2A allows agents to communicate and work together.&lt;/p&gt;
</description>
      <pubDate>Sat, 12 Apr 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/a2a/</guid>
    </item>
    <item>
      <title>[Quote] LMArena on X</title>
      <link>https://ruxu.dev/links/lmarena-llama4/</link>
      <description>&lt;blockquote&gt;
&lt;p&gt;Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We now have acknowledgement from LMArena of what we already knew: AI labs are cheating to get their models as high as possible in the LMArena leaderboard / benchmarks.&lt;/p&gt;
&lt;p&gt;This is inevitable: all of them want to win the AI race at any cost. If you don&#39;t want to be fooled by ever-slightly-increasing benchmarks, you should set up your own benchmarks that measure model performance on your own use cases.&lt;/p&gt;
</description>
      <pubDate>Tue, 08 Apr 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/lmarena-llama4/</guid>
    </item>
    <item>
      <title>[Link] The Llama 4 herd</title>
      <link>https://ruxu.dev/links/llama4/</link>
      <description>&lt;p&gt;Meta has finally released the Llama 4 family of models that Zuckerberg hyped up so much. The Llama 4 models are open-source, multi-modal, mixture-of-experts models. First impression: these models are massive. None of them will run on the average computer with a decent GPU, or on any single Mac Mini. This is what we have:&lt;/p&gt;
&lt;h3 id=&quot;llama-4-scout&quot; tabindex=&quot;-1&quot;&gt;Llama 4 Scout&lt;/h3&gt;
&lt;p&gt;The small model in the family: a mixture-of-experts with 16 experts, totaling 109B parameters. According to Meta, after int4 quantization it fits in an H100 GPU, which has 80GB of VRAM. It&#39;s officially the model with the largest context window ever, supporting up to 10M tokens. However, a large context window takes a big toll on the already high VRAM requirements, so you might want to keep the context window contained. As they &lt;a href=&quot;https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/build_with_llama_4.ipynb&quot; target=&quot;_blank&quot;&gt;themselves write in their new cookbook example notebook for Llama 4&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Scout supports up to 10M context. On 8xH100, in bf16 you can get upto 1.4M tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&quot;llama-4-maverick&quot; tabindex=&quot;-1&quot;&gt;Llama 4 Maverick&lt;/h3&gt;
&lt;p&gt;The mid-sized model. This one has 128 experts, totaling 400B parameters, and &amp;quot;only&amp;quot; features a 1M context window due to its larger size. Maverick, as of today, has reached second place in &lt;a href=&quot;https://lmarena.ai/?leaderboard&quot; target=&quot;_blank&quot;&gt;LMArena&lt;/a&gt; with a 1417 ELO, only surpassed by Gemini 2.5 Pro, which is scary considering this is not even the best model in the family.&lt;/p&gt;
&lt;h3 id=&quot;llama-4-behemoth&quot; tabindex=&quot;-1&quot;&gt;Llama 4 Behemoth&lt;/h3&gt;
&lt;p&gt;The big brother of the family: 16 experts, 2 &lt;strong&gt;TRILLION&lt;/strong&gt; parameters. It easily surpasses Llama 3.1 405B, which was the largest Llama model until today. This model has not yet been released, as, according to Meta, it is still training, so we don&#39;t know anything about its capabilities.&lt;/p&gt;
&lt;h3 id=&quot;llama-4-reasoning&quot; tabindex=&quot;-1&quot;&gt;Llama 4 Reasoning&lt;/h3&gt;
&lt;p&gt;We have no details on what it&#39;s going to be, just the announcement that &lt;a href=&quot;https://www.llama.com/llama4-reasoning-is-coming/&quot; target=&quot;_blank&quot;&gt;it&#39;s coming soon&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Overall, these look like very capable frontier models that can compete with OpenAI, Anthropic and Google while at the same time being open-source, which is a huge win. Check out Meta&#39;s &lt;a href=&quot;https://ai.meta.com/blog/llama-4-multimodal-intelligence/&quot; target=&quot;_blank&quot;&gt;post on the models&#39; architecture and benchmarks&lt;/a&gt; and also check the models on &lt;a href=&quot;https://huggingface.co/collections/meta-llama/llama-4-67f0c30d9fe03840bc9d0164&quot; target=&quot;_blank&quot;&gt;HuggingFace&lt;/a&gt;.&lt;/p&gt;
</description>
      <pubDate>Sun, 06 Apr 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/llama4/</guid>
    </item>
    <item>
      <title>[Link] Circuit Tracing: Revealing Computational Graphs in Language Models</title>
      <link>https://ruxu.dev/links/transformer-circuits/</link>
      <description>&lt;p&gt;A group of Anthropic-affiliated scientists has released a paper where they study how human concepts are represented across Claude 3.5 Haiku&#39;s neurons and how these features interact to produce model outputs.&lt;/p&gt;
&lt;p&gt;This is an especially difficult task, since these concepts are not contained within a single neuron. Neurons are &lt;code&gt;polysemantic&lt;/code&gt;, meaning that each one encodes multiple unrelated concepts in its representation. To make matters worse, &lt;code&gt;superposition&lt;/code&gt; means that the representation of a feature is built from a combination of multiple neurons, not just one.&lt;/p&gt;
&lt;p&gt;In this paper, the researchers build a Local Replacement Model, replacing the neural network&#39;s components with simpler, interpretable functions that mimic their behavior. For each prompt, they also show Attribution Graphs that help visualize how the model processes information and how the features smeared across the model&#39;s neurons influence its outputs.&lt;/p&gt;
&lt;p&gt;Also check out the companion paper: &lt;a href=&quot;https://transformer-circuits.pub/2025/attribution-graphs/biology.html&quot; target=&quot;_blank&quot;&gt;On the Biology of a Large Language Model&lt;/a&gt;. There, the researchers use interactive Attribution Graphs to study how models plan ahead in order to produce complex generations that require reasoning through many steps.&lt;/p&gt;
</description>
      <pubDate>Sun, 30 Mar 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/transformer-circuits/</guid>
    </item>
    <item>
      <title>How to Write a Good index.html File</title>
      <link>https://ruxu.dev/articles/web/html-starter-template/</link>
      <description>&lt;p&gt;Every web developer has been there: you&#39;re starting a new project, staring at an empty file called &lt;code&gt;index.html&lt;/code&gt;. You try to remember: which tags were meant to go in the &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; again? Which meta tags are best practice, and which ones are deprecated?&lt;/p&gt;
&lt;p&gt;Recently, I found myself in this exact situation. My first instinct was to copy the head section from a previous project, but as I reviewed the code, I realized some tags were outdated or simply didn&#39;t apply to my new project. What followed was a deep dive into HTML head tags – which ones are essential, which are optional, and which are just cluttering my code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you&#39;re in a hurry and just want the template:&lt;/strong&gt; You can find my complete starter template on &lt;a href=&quot;https://github.com/rogiia/html-starter-template&quot;&gt;GitHub&lt;/a&gt;. The repository contains two main files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;index.html&lt;/code&gt;: A clean, minimalist template with just what you need and no unnecessary extras.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;index-commented.html&lt;/code&gt;: The same template but with detailed comments explaining how and why you should use each tag.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This article is essentially a deep dive into the comments from the &lt;code&gt;index-commented.html&lt;/code&gt; file, providing more context and explanations for each decision made in the template.&lt;/p&gt;
&lt;p&gt;This template represents my opinionated approach after researching current best practices. It aims to be a solid foundation for most web projects while maintaining good performance, accessibility, and search engine optimization.&lt;/p&gt;
&lt;p&gt;Let&#39;s dive into the essential components of a well-structured HTML head.&lt;/p&gt;
&lt;h3 id=&quot;the-tags-you-must-include&quot; tabindex=&quot;-1&quot;&gt;The tags you must include&lt;/h3&gt;
&lt;p&gt;These tags should be present in virtually every HTML document you create. They&#39;re essential for proper rendering, SEO, and accessibility.&lt;/p&gt;
&lt;h4 id=&quot;doctype-html-and-lang-en-setting-the-document-type-and-language&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/code&gt; and &lt;code&gt;lang=&amp;quot;en&amp;quot;&lt;/code&gt;: Setting the document type and language&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token doctype&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;!&lt;/span&gt;&lt;span class=&quot;token doctype-tag&quot;&gt;DOCTYPE&lt;/span&gt; &lt;span class=&quot;token name&quot;&gt;html&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;html&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;lang&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;en&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Always begin your HTML document with the doctype declaration. This tells browsers which version of HTML you&#39;re using (in this case, HTML5) and helps ensure consistent rendering. The &lt;code&gt;lang&lt;/code&gt; attribute on the &lt;code&gt;&amp;lt;html&amp;gt;&lt;/code&gt; tag specifies the language of your page - this is crucial for screen readers, search engines, and browsers. If your content is in a different language, change the code accordingly (e.g., &lt;code&gt;lang=&amp;quot;es&amp;quot;&lt;/code&gt; for Spanish).&lt;/p&gt;
&lt;h4 id=&quot;title-the-page-title&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt;: The page title&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;title&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;Hello world!&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;title&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every HTML document must have a title tag. This text appears in browser tabs, bookmarks, and search engine results. Make your titles descriptive yet concise, ideally under 60 characters. A good title both informs users about your page content and includes relevant keywords for SEO.&lt;/p&gt;
&lt;h4 id=&quot;meta-name-viewport-configuring-viewport-for-responsive-design&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;meta name=&amp;quot;viewport&amp;quot;&amp;gt;&lt;/code&gt;: Configuring viewport for responsive design&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;meta&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;viewport&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=5.0&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In today&#39;s mobile-first world, this meta tag is non-negotiable. It tells mobile browsers how to control the page&#39;s dimensions and scaling. Let&#39;s break down what each parameter does:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;viewport-fit=cover&lt;/code&gt;: Ensures content extends to the edge of the display (especially important for notched phones)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;width=device-width&lt;/code&gt;: Sets the width of the page to follow the screen width of the device&lt;/li&gt;
&lt;li&gt;&lt;code&gt;initial-scale=1.0&lt;/code&gt;: Sets the initial zoom level when the page is first loaded&lt;/li&gt;
&lt;li&gt;&lt;code&gt;minimum-scale=1.0&lt;/code&gt;: Prevents users from zooming out too much&lt;/li&gt;
&lt;li&gt;&lt;code&gt;maximum-scale=5.0&lt;/code&gt;: Allows users to zoom in up to 5x (limiting this completely would harm accessibility)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without this tag, mobile devices will render pages at a typical desktop screen width and then scale them down, resulting in tiny, unreadable text and forcing users to zoom and pan.&lt;/p&gt;
&lt;h4 id=&quot;meta-name-description-meta-name-keywords-meta-name-author-essential-meta-information&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;meta name=&amp;quot;description&amp;quot;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;meta name=&amp;quot;keywords&amp;quot;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;meta name=&amp;quot;author&amp;quot;&amp;gt;&lt;/code&gt;: Essential meta information&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;meta&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;description&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;meta&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;keywords&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;meta&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;author&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These meta tags provide important information about your page:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;description&lt;/code&gt;: A concise summary of your page content (ideally 150-160 characters). This often appears in search engine results below your title.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;keywords&lt;/code&gt;: Relevant keywords for your page content. While less important for Google these days, other search engines and crawlers may still use this information.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;author&lt;/code&gt;: The name of the individual or organization that created the page.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While these tags don&#39;t directly affect page rendering, they&#39;re valuable for SEO and content categorization.&lt;/p&gt;
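&lt;p&gt;As an illustration (the values here are purely hypothetical), a filled-in version for a small recipe site might look like this:&lt;/p&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&amp;lt;meta name=&amp;quot;description&amp;quot; content=&amp;quot;Simple weeknight recipes with five ingredients or fewer.&amp;quot;&amp;gt;
&amp;lt;meta name=&amp;quot;keywords&amp;quot; content=&amp;quot;recipes, cooking, quick meals&amp;quot;&amp;gt;
&amp;lt;meta name=&amp;quot;author&amp;quot; content=&amp;quot;Jane Doe&amp;quot;&amp;gt;&lt;/code&gt;&lt;/pre&gt;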
&lt;h4 id=&quot;link-rel-canonical-avoiding-duplicate-indexation&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;link rel=&amp;quot;canonical&amp;quot;&amp;gt;&lt;/code&gt;: Avoiding duplicate indexation&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;link&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;rel&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;canonical&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;/&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tag helps search engines avoid indexing the same content multiple times when it&#39;s accessible via different URLs. For example, if your page is accessible via multiple URLs (like &lt;code&gt;example.com/page&lt;/code&gt; and &lt;code&gt;example.com/page/index.html&lt;/code&gt;), the canonical tag tells search engines which URL is the &amp;quot;official&amp;quot; version to index, preventing duplicate indexation which can harm your search rankings.&lt;/p&gt;
&lt;p&gt;Fill in the &lt;code&gt;href&lt;/code&gt; attribute with the primary URL you want search engines to associate with this content.&lt;/p&gt;
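&lt;p&gt;For instance, if a hypothetical page were reachable at both &lt;code&gt;example.com/page&lt;/code&gt; and &lt;code&gt;example.com/page/index.html&lt;/code&gt;, both versions would declare the same canonical URL:&lt;/p&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&amp;lt;link rel=&amp;quot;canonical&amp;quot; href=&amp;quot;https://example.com/page&amp;quot; /&amp;gt;&lt;/code&gt;&lt;/pre&gt;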
&lt;h4 id=&quot;css-loading-strategies-critical-inline-css-vs-external-stylesheets&quot; tabindex=&quot;-1&quot;&gt;CSS loading strategies: Critical inline CSS vs. external stylesheets&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;style&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token style&quot;&gt;&lt;span class=&quot;token language-css&quot;&gt;
  &lt;span class=&quot;token selector&quot;&gt;body&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;background&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; #fefefe&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;color&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; #222&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;font-family&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&#39;Roboto&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; sans-serif&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;padding&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; 1rem&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;line-height&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; 1.8&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;style&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;link&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;rel&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;preload&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;main.css&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;as&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;style&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token special-attr&quot;&gt;&lt;span class=&quot;token attr-name&quot;&gt;onload&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;token value javascript language-javascript&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;onload&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;.&lt;/span&gt;rel&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&#39;stylesheet&#39;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;noscript&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;link&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;rel&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;stylesheet&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;main.css&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;noscript&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This strategy for CSS loading optimizes performance in two ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Inline critical CSS&lt;/strong&gt;: By embedding essential styles directly in the HTML document, you avoid making an additional network request that would block rendering. This makes critical styles load blazingly fast and prevents the Flash of Unstyled Content (FOUC) that can occur when styles load after content.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Asynchronous loading for non-critical CSS&lt;/strong&gt;: The preload technique with the onload handler allows the main stylesheet to load without blocking rendering. This means your page can start displaying while the rest of the styles are still loading, creating a better user experience. The noscript tag provides a fallback for users with JavaScript disabled.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Alternatively, if your site doesn&#39;t have significant styling needs above the fold, you can use a simpler approach:&lt;/p&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;link&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;main.css&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;rel&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;stylesheet&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;/&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is more straightforward but can slow initial rendering as the browser must download and parse the CSS before displaying content.&lt;/p&gt;
&lt;h4 id=&quot;script-script-loading-best-practices&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt;: Script loading best practices&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;script&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;module&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;app.js&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token script&quot;&gt;&lt;/span&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;script&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For JavaScript, the &lt;code&gt;type=&amp;quot;module&amp;quot;&lt;/code&gt; attribute offers several advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deferred by default: the script downloads in parallel and executes only after the document has been parsed&lt;/li&gt;
&lt;li&gt;Enables ECMAScript modules for better code organization&lt;/li&gt;
&lt;li&gt;Runs in strict mode by default&lt;/li&gt;
&lt;li&gt;Allows for cleaner dependency management&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For scripts that don&#39;t depend on DOM elements and should run as soon as possible, consider adding the &lt;code&gt;async&lt;/code&gt; attribute:&lt;/p&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;script&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;module&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;analytics.js&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token script&quot;&gt;&lt;/span&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;script&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Additionally, it&#39;s good practice to register a service worker for offline capabilities:&lt;/p&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;script&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token script&quot;&gt;&lt;span class=&quot;token language-javascript&quot;&gt;
  &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&#39;serviceWorker&#39;&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; navigator&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    window&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;addEventListener&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&#39;load&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      navigator&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;serviceWorker&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;register&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&#39;/service-worker.js&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;script&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This inline script runs without &lt;code&gt;defer&lt;/code&gt; or &lt;code&gt;async&lt;/code&gt;, so it executes as soon as it is parsed; the registration itself waits for the window &lt;code&gt;load&lt;/code&gt; event, so it doesn&#39;t compete with critical resources during the initial page load. The service worker runs in the background, independent of your web page, allowing it to handle network requests and caching even when the user navigates away from your site.&lt;/p&gt;
&lt;h3 id=&quot;the-tags-you-maybe-should-include&quot; tabindex=&quot;-1&quot;&gt;The tags you maybe should include&lt;/h3&gt;
&lt;p&gt;These tags aren&#39;t necessary for every project but can be valuable in specific situations. Include them based on your project&#39;s needs.&lt;/p&gt;
&lt;h4 id=&quot;meta-charset-utf-8-character-encoding&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;meta charset=&amp;quot;utf-8&amp;quot;&amp;gt;&lt;/code&gt;: Character encoding&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;meta&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;charset&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;utf-8&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This meta tag specifies the character encoding for your HTML document. HTML5 expects UTF-8, and most modern toolchains emit it by default, so this tag can seem redundant. However, declaring it explicitly (within the first 1024 bytes of the document) ensures consistent behavior across all browsers and prevents potential character rendering issues, especially with special characters or non-Latin alphabets.&lt;/p&gt;
&lt;h4 id=&quot;base-href-defining-a-base-url&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;base href=&amp;quot;/&amp;quot;&amp;gt;&lt;/code&gt;: Defining a base URL&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;base&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;/&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The base tag specifies the base URL for all relative URLs in a document. If all your site&#39;s URLs are already relative to the root path (&amp;quot;/&amp;quot;), you don&#39;t need to include this tag. It&#39;s primarily useful when your site is hosted in a subdirectory but you want paths to be relative to the domain root, or when developing single-page applications with client-side routing.&lt;/p&gt;
&lt;h4 id=&quot;meta-name-application-name-application-details&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;meta name=&amp;quot;application-name&amp;quot;&amp;gt;&lt;/code&gt;: Application details&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;meta&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;application-name&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If your Progressive Web App (PWA) should have a different name than what&#39;s specified in your title tag, use this meta tag. It defines the name that will appear when your web application is installed on a device or pinned to a user&#39;s start menu or taskbar.&lt;/p&gt;
&lt;h4 id=&quot;meta-name-theme-color-browser-ui-theme-color&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;meta name=&amp;quot;theme-color&amp;quot;&amp;gt;&lt;/code&gt;: Browser UI theme color&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;meta&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;theme-color&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;#33d&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This meta tag defines the color used by the user agent in the UI elements around your page, such as the browser&#39;s address bar in mobile browsers or the title bar in some desktop browsers. Choose a color that reflects your brand identity to create a more integrated visual experience.&lt;/p&gt;
&lt;h4 id=&quot;meta-name-color-scheme-light-and-dark-mode-support&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;meta name=&amp;quot;color-scheme&amp;quot;&amp;gt;&lt;/code&gt;: Light and dark mode support&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;meta&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;color-scheme&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;light dark&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tag informs the browser if your site supports light mode, dark mode, or both. The value &lt;code&gt;&amp;quot;light dark&amp;quot;&lt;/code&gt; means that both schemes are supported, with light being preferred. This helps browsers render form controls, scrollbars, and other UI elements in the appropriate color scheme, creating a better user experience that respects system preferences.&lt;/p&gt;
&lt;h4 id=&quot;meta-property-og-social-media-integration-with-open-graph&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;meta property=&amp;quot;og:&amp;quot;&amp;gt;&lt;/code&gt;: Social media integration with Open Graph&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;meta&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;property&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;og:title&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;/&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;meta&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;property&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;og:type&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;website&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;/&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;meta&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;property&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;og:url&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;/&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;meta&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;property&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;og:image&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;/&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open Graph meta tags optimize how your content appears when shared on social media platforms like Facebook, LinkedIn, and X (formerly Twitter). While not essential for basic functionality, they significantly improve the appearance and engagement of your content when shared.&lt;/p&gt;
&lt;p&gt;Key Open Graph tags include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;og:title&lt;/code&gt;: The title of your page/content (can differ from your HTML title)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;og:type&lt;/code&gt;: The type of content (website, article, product, etc.)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;og:url&lt;/code&gt;: The canonical URL of your page&lt;/li&gt;
&lt;li&gt;&lt;code&gt;og:image&lt;/code&gt;: The URL to an image representing your content&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your site&#39;s content is likely to be shared on social platforms, filling these tags with appropriate values can dramatically improve click-through rates and engagement.&lt;/p&gt;
&lt;h4 id=&quot;link-rel-manifest-and-link-rel-icon-pwa-support-and-favicons&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;link rel=&amp;quot;manifest&amp;quot;&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;link rel=&amp;quot;icon&amp;quot;&amp;gt;&lt;/code&gt;: PWA support and favicons&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;link&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;rel&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;manifest&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;manifest.json&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;link&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;rel&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;icon&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;/favicon.ico&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;image/x-icon&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For Progressive Web Apps, use the manifest link to point to a JSON file that contains your app&#39;s metadata, including icons. The manifest should include your app&#39;s icons in various sizes for different devices and contexts.&lt;/p&gt;
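&lt;p&gt;As an illustration of what such a manifest might contain, here is a minimal sketch (the app name, colors, and icon paths are placeholders, not part of the original template):&lt;/p&gt;
&lt;pre class=&quot;language-json&quot;&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &quot;name&quot;: &quot;My Application&quot;,
  &quot;short_name&quot;: &quot;MyApp&quot;,
  &quot;start_url&quot;: &quot;/&quot;,
  &quot;display&quot;: &quot;standalone&quot;,
  &quot;background_color&quot;: &quot;#ffffff&quot;,
  &quot;theme_color&quot;: &quot;#33d&quot;,
  &quot;icons&quot;: [
    { &quot;src&quot;: &quot;/icons/icon-192.png&quot;, &quot;sizes&quot;: &quot;192x192&quot;, &quot;type&quot;: &quot;image/png&quot; },
    { &quot;src&quot;: &quot;/icons/icon-512.png&quot;, &quot;sizes&quot;: &quot;512x512&quot;, &quot;type&quot;: &quot;image/png&quot; }
  ]
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Browsers commonly look for the 192x192 and 512x512 icon sizes when deciding whether a site is installable and when rendering install prompts.&lt;/p&gt;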
&lt;p&gt;For regular websites that aren&#39;t PWAs, use the icon link to define your site&#39;s favicon. While technically optional, a favicon helps with brand recognition and user experience, so most sites should include one.&lt;/p&gt;
&lt;h4 id=&quot;link-rel-alternate-alternate-content-types-for-your-site&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;link rel=&amp;quot;alternate&amp;quot;&amp;gt;&lt;/code&gt;: Alternate Content Types for your site&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;link&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;rel&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;alternate&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;application/rss+xml&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;/feed.xml&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;link&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;rel&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;alternate&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;text/markdown&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;/llms.txt&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This link tag serves multiple purposes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;It helps RSS readers and other feed aggregators discover your site&#39;s RSS feed. Include it if your site provides an RSS feed (common for blogs, news sites, or regularly updated content collections).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It can also be used to specify an &lt;code&gt;llms.txt&lt;/code&gt; file for your site as a &lt;code&gt;type=&amp;quot;text/markdown&amp;quot;&lt;/code&gt; alternate link. This file provides your site&#39;s content in an easy-to-digest format for Large Language Models scanning your site. &lt;strong&gt;Note:&lt;/strong&gt; I stole that idea from &lt;a href=&quot;https://www.gilesthomas.com/2025/03/llmstxt&quot;&gt;Giles Thomas&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If your site doesn&#39;t offer these features, you can safely omit these tags.&lt;/p&gt;
&lt;h4 id=&quot;link-rel-preload-link-rel-preconnect-link-rel-prefetch-resource-optimization&quot; tabindex=&quot;-1&quot;&gt;&lt;code&gt;&amp;lt;link rel=&amp;quot;preload&amp;quot;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;link rel=&amp;quot;preconnect&amp;quot;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;link rel=&amp;quot;prefetch&amp;quot;&amp;gt;&lt;/code&gt;: Resource optimization&lt;/h4&gt;
&lt;pre class=&quot;language-html&quot;&gt;&lt;code class=&quot;language-html&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;link&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;rel&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;preload&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;https://fonts.googleapis.com/css2?family=Roboto:wght@400;700&amp;amp;display=swap&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;as&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;style&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token special-attr&quot;&gt;&lt;span class=&quot;token attr-name&quot;&gt;onload&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;token value javascript language-javascript&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;onload&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;rel&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&#39;stylesheet&#39;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;link&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;rel&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;preconnect&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;https://fonts.gstatic.com&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;link&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;rel&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;prefetch&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;/next-page.html&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These link tags help optimize resource loading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;preload&lt;/code&gt;: Tells the browser to download and cache a resource as soon as possible. Useful for critical resources needed early in the rendering process, like fonts or important images.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;preconnect&lt;/code&gt;: Establishes an early connection to external domains from which you&#39;ll fetch resources later. This saves time by having the connection ready when needed.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;prefetch&lt;/code&gt;: Suggests to the browser that a resource might be needed for future navigation. The browser will download it when idle, making subsequent page loads faster.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use these selectively based on your performance needs. Over-using them can waste bandwidth, so focus on truly critical resources.&lt;/p&gt;
&lt;h3 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;This template gives you a solid starting point for any web project. Of course, this template is completely opinionated, and the best setup for your needs may vary. If I left out some common tags, it&#39;s probably because they aren&#39;t needed in most cases.&lt;/p&gt;
&lt;p&gt;I welcome your feedback! If you think I&#39;ve missed something important, please open an issue or submit a pull request on the &lt;a href=&quot;https://github.com/rogiia/html-starter-template&quot;&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also, if you want a starter template that goes further than just the index.html file, you might want to check out &lt;a href=&quot;https://html5boilerplate.com/&quot;&gt;HTML5 Boilerplate&lt;/a&gt;. It&#39;s a great resource to get up and running really fast when building a Progressive Web App.&lt;/p&gt;
&lt;h3 id=&quot;resources&quot; tabindex=&quot;-1&quot;&gt;Resources&lt;/h3&gt;
&lt;p&gt;If you want to read more about HTML head tags, I suggest starting here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Structuring_content/Webpage_metadata&quot;&gt;Mozilla Web Documentation: Metadata in HTML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://web.dev/learn/html/metadata&quot;&gt;web.dev by Google&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ogp.me/&quot;&gt;The Open Graph Protocol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://w3c.github.io/manifest/&quot;&gt;Web App Manifest specification&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And once more, you can find my index.html template on &lt;a href=&quot;https://github.com/rogiia/html-starter-template&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
</description>
      <pubDate>Sun, 23 Mar 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/web/html-starter-template/</guid>
    </item>
    <item>
      <title>HTML Starter Template</title>
      <link>https://ruxu.dev/projects/html-starter-template/</link>
      <description>&lt;h2 id=&quot;motivation&quot; tabindex=&quot;-1&quot;&gt;Motivation&lt;/h2&gt;
&lt;p&gt;For all of us who can&#39;t remember any of the tags we should/shouldn&#39;t include in the HTML head, or which ones are now deprecated because they used to be a hack for IE6.&lt;/p&gt;
&lt;h2 id=&quot;what-is-it&quot; tabindex=&quot;-1&quot;&gt;What is it?&lt;/h2&gt;
&lt;p&gt;There are two index.html files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A regular index.html with just what you need to have and no bullshit.&lt;/li&gt;
&lt;li&gt;An index-commented.html with comments explaining how and why you should use each tag.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The commented version includes two types of comments:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ALWAYS&lt;/strong&gt;: In my opinion, you should always use those tags.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MAYBE&lt;/strong&gt;: You should use those tags only in certain situations, or they are a matter of preference.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;contributing&quot; tabindex=&quot;-1&quot;&gt;Contributing&lt;/h2&gt;
&lt;p&gt;Feel free to open an Issue or PR if you think I missed anything.&lt;/p&gt;
&lt;h2 id=&quot;license&quot; tabindex=&quot;-1&quot;&gt;License&lt;/h2&gt;
&lt;p&gt;CC0: This work has been marked as dedicated to the public domain.&lt;/p&gt;
</description>
      <pubDate>Sat, 22 Mar 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/projects/html-starter-template/</guid>
    </item>
    <item>
      <title>[Link] Claude Think Tool</title>
      <link>https://ruxu.dev/links/think-tool/</link>
      <description>&lt;p&gt;The Anthropic team has discovered an interesting approach to LLM thinking capabilities. Instead of making the model think deeply before answering or taking an action, they experimented with giving the model a &lt;strong&gt;think tool&lt;/strong&gt;. The think tool does nothing but register a thought in the state. However, it does allow the model to decide when it&#39;s appropriate to stop and think more carefully about the current state and the best approach to move forward.&lt;/p&gt;
&lt;p&gt;The thinking done using the think tool will not be as deep, but it will be more focused on newly obtained information. Therefore, the think tool is especially useful when the model has to carefully analyze the outputs of complex tools and act on them thoughtfully.&lt;/p&gt;
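&lt;p&gt;For context, Anthropic&#39;s published example defines the think tool with a single string parameter; a sketch along those lines (wording paraphrased) looks like:&lt;/p&gt;
&lt;pre class=&quot;language-json&quot;&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &quot;name&quot;: &quot;think&quot;,
  &quot;description&quot;: &quot;Use this tool to think about something. It will not obtain new information or change any state; it only appends the thought to the log.&quot;,
  &quot;input_schema&quot;: {
    &quot;type&quot;: &quot;object&quot;,
    &quot;properties&quot;: {
      &quot;thought&quot;: { &quot;type&quot;: &quot;string&quot;, &quot;description&quot;: &quot;A thought to think about.&quot; }
    },
    &quot;required&quot;: [&quot;thought&quot;]
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the tool has no side effects, its only purpose is to give the model a sanctioned place to pause and reason in the middle of a tool-use loop.&lt;/p&gt;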
</description>
      <pubDate>Sat, 22 Mar 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/think-tool/</guid>
    </item>
    <item>
      <title>[Quote] 🔭 The Einstein AI model</title>
      <link>https://ruxu.dev/links/the-einstein-ai-model/</link>
      <description>&lt;blockquote&gt;
&lt;p&gt;These benchmarks test if AI models can find the right answers to a set of questions we already know the answer to.
However, real scientific breakthroughs will come not from answering known questions, but from asking challenging new questions and questioning common conceptions and previous ideas.
&lt;strong&gt;- Thomas Wolf&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;An interesting reflection from Thomas Wolf of Hugging Face. Current LLMs have limited potential to make breakthroughs, since they cannot &amp;quot;think out-of-the-box&amp;quot; beyond their training data. We might be able to give LLMs the ability to explore outside their known world through mechanisms like reinforcement learning plus live environment feedback, or other mechanisms we haven&#39;t thought of yet. Still, significant breakthroughs will be hard for LLMs, since the discoveries that make a huge impact are usually very far from established knowledge, and therefore very far from the model&#39;s current probability space.&lt;/p&gt;
</description>
      <pubDate>Wed, 12 Mar 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/links/the-einstein-ai-model/</guid>
    </item>
    <item>
      <title>About the Dead Internet Theory and AI</title>
      <link>https://ruxu.dev/articles/ai/dead-internet-theory/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;width: auto&quot; alt=&quot;Peter Steiner&#39;s 1993 cartoon, as published in The New Yorker&quot; src=&quot;https://ruxu.dev/articles/ai/dead-internet-theory/assets/images/dead-internet-theory/Internet_dog.jpg&quot; /&gt;
&lt;figcaption&gt;On the Internet nobody knows you are a dog... or an LLM&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The Dead Internet Theory is a thought that has gained a lot of traction recently. I have to admit, the first time it was explained to me, I felt an eerie realization. Like I had already been experiencing it, but hadn&#39;t paid much attention to it. In that first moment, I felt scared for the future and nostalgic for the past. But that&#39;s an old man&#39;s attitude, to be so resistant to change. After that moment, I started reflecting. How did we get here? And how can we prevent the Internet from continuing down this concerning path?&lt;/p&gt;
&lt;h3 id=&quot;what-is-the-dead-internet-theory&quot; tabindex=&quot;-1&quot;&gt;What Is The &amp;quot;Dead Internet Theory&amp;quot;?&lt;/h3&gt;
&lt;p&gt;The main premise of the Dead Internet Theory is an outrageous statement that challenges our view of the state of the Internet:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Internet feels empty and devoid of people.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This theory originated in a &lt;a href=&quot;https://i.4pcdn.org/x/1587947548944.png&quot;&gt;4chan post&lt;/a&gt; around 2019. This anonymous 4chan user describes his recent experience of the current state of the Internet. While his tone is pretty outrageous and paranoid (as is expected of a 4chan board), he raises some valid concerns that resonate with many other Internet users. This has led to the hypothesis gaining a lot of traction online, with others sharing their own experiences and thoughts on other Internet boards.&lt;/p&gt;
&lt;p&gt;So what do they mean by empty and devoid of people? There&#39;s plenty of people on the Internet, right? I believe there are two factors to it.&lt;/p&gt;
&lt;p&gt;First, we are &lt;strong&gt;increasingly interacting less directly with other humans&lt;/strong&gt;. Now we are talking to an &amp;quot;audience&amp;quot;. But really we are talking to the recommendation algorithm, our god, so it graces our posts with engagement. When we post on &lt;em&gt;social&lt;/em&gt; networks, we don&#39;t expect our friends and relatives to see and interact with it. We expect strangers we don&#39;t know or care about to like it. This impersonal relationship is making the Internet less social and more of a hustle.&lt;/p&gt;
&lt;p&gt;Second, an &lt;strong&gt;increasing amount of users of the Internet are not people&lt;/strong&gt;. They are bots. Fake profiles that algorithmically play the viral content game to gain influence on the social platforms. Advertisement and/or scam e-mails, messages or even calls massively horizontally scaled thanks to bots. &lt;a href=&quot;https://hallofdreams.org/posts/physicsforums/&quot;&gt;Generative AI answering questions on behalf of actual people on PhysicsForums&lt;/a&gt; and other forums like StackOverflow.&lt;/p&gt;
&lt;p&gt;The Internet was supposed to connect people all around the world. And it did that, beautifully. But for the last few years, it&#39;s been going in the opposite direction. It&#39;s driving us apart, isolating us from other people and keeping us content with experiencing an Internet massively filled with content, without the need to interact with other people.&lt;/p&gt;
&lt;h3 id=&quot;is-it-really-that-bad&quot; tabindex=&quot;-1&quot;&gt;Is It Really That Bad?&lt;/h3&gt;
&lt;p&gt;To be fair, it&#39;s not like nobody uses the Internet to communicate with anyone anymore. Messaging apps are very prevalent and the users of these apps are mostly human. We talk with our friends, make plans, catch up with people that are not as much in our lives as they used to be. But this is a &lt;strong&gt;private&lt;/strong&gt; Internet, not an open Internet. A private chat is not meant to be shared or discovered by people interested in what you are talking about, nor is it a place to build a little garden in a corner of the Internet that&#39;s just yours.&lt;/p&gt;
&lt;p&gt;Also, &lt;strong&gt;people are not interacting with the Internet like they used to&lt;/strong&gt;. I believe the rise of smartphones has contributed to that greatly. Smartphones have been very optimized for consuming content, and especially for &lt;strong&gt;infinite, mindless scrolling&lt;/strong&gt;. It&#39;s much more lucrative for your app to have your users trapped in an infinite scroll, consuming content (and ads) by the ton, than to have them create and share. And to achieve that, they provide (impose on) you an amazing recommendation algorithm that will play your psychology to keep you engaged. But if people are only consuming, they are not interacting, discussing, or building. You see a lot of people on your feed, sure, but in such large numbers and impersonal style that it doesn&#39;t feel like there&#39;s people on the other side.&lt;/p&gt;
&lt;p&gt;Paradoxically, recommendation algorithms also make us more likely to interact with content that arouses anger in us than with content we agree with, since this drives more &lt;strong&gt;engagement&lt;/strong&gt;. &lt;a href=&quot;https://www.youtube.com/watch?v=fuFlMtZmvY0&quot;&gt;This amazing video by Kurzgesagt&lt;/a&gt; talks about the psychology of social interactions in an open, algorithmic internet. It comes to the conclusion that small, year-2000 forum-like communities on the Internet were the most similar to the social mechanisms our brains are accustomed to and thrive in.&lt;/p&gt;
&lt;p&gt;And it has gotten worse recently, thanks to Generative AI and Large Language Models. Bots were already popping up everywhere, but with the power of recent Generative AI models, they have gotten so much better at &lt;strong&gt;deceiving people&lt;/strong&gt;. And some content posted by people is not actually thought about and written by those people. They just asked an LLM to write it for them and copy-pasted the output into their post. A lot of news, or articles like this one, are written almost instantly (and without any thought behind them) thanks to Generative AI. Yes, I have also used it to write, but I try to keep it just as a proofreading/editing/brainstorming/idea drafting aid. I don&#39;t post anything I haven&#39;t thought about deeply and that didn&#39;t originate from me.&lt;/p&gt;
&lt;p&gt;StackOverflow is a board for asking and answering programming-related questions. Since ChatGPT, it has seen a lot of answers generated by LLMs. I would argue that if you post a question there, you would like an answer from a person that has experience in that topic. If you wanted an answer from ChatGPT you would ask it yourself. And sometimes you might want to, but ChatGPT generally will not be able to solve problems it has never seen like an expert would, yet it will still be happy to answer. And this makes a lot of the answers confidently wrong and devoid of human expertise. This is why StackOverflow has decided to &lt;a href=&quot;https://meta.stackoverflow.com/questions/421831/policy-generative-ai-e-g-chatgpt-is-banned&quot;&gt;ban all use of Generative AI in content posted to the platform&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The wave of content generated by AI and unapologetically non-reviewed by humans has been so strong that a term has been coined for it: &lt;strong&gt;&lt;a href=&quot;https://x.com/deepfates/status/1787472784106639418&quot;&gt;AI Slop&lt;/a&gt;&lt;/strong&gt;. &lt;a href=&quot;https://benjamincongdon.me/blog/2025/01/25/AI-Slop-Suspicion-and-Writing-Back/&quot;&gt;Slop is usually defined as spam, but for AI-generated content&lt;/a&gt;. I would define it as &lt;strong&gt;low-quality content created by a Generative AI model without any human review or even human thought behind it&lt;/strong&gt;. AI Slop threatens to flood the Internet with useless posts, algorithm-pleasing content, mass AI-generated pictures and videos at a pace unprecedented until now, thanks to no human being needed to create them.&lt;/p&gt;
&lt;p&gt;As if this wasn&#39;t bad enough, &lt;strong&gt;Meta is now experimenting with not only allowing AI bots on the platform, but using those bots themselves to drive up engagement&lt;/strong&gt;. In an &lt;a href=&quot;https://www.ft.com/content/91183cbb-50f9-464a-9d2e-96063825bfcf&quot;&gt;interview with the Financial Times&lt;/a&gt;, Meta&#39;s Connor Hayes said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;They’ll have bios and profile pictures and be able to generate and share content powered by AI on the platform ... that’s where we see all of this going.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] make our apps more entertaining and engaging.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Basically, if you post something on Instagram and nobody likes or comments on it, it might dissuade you from interacting with the app. But if suddenly your post got a lot of likes, comments and messages, even if it&#39;s from AI profiles, you will spend more time on it. Never actually interacting with anyone.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If we continue down this path, soon 99% of all content on the Internet will be AI generated and the Internet will have converted from a place to share and communicate with other people to a place to consume endless slop. It will be lonelier than ever, an Internet without people.&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id=&quot;my-final-thoughts-possible-solutions&quot; tabindex=&quot;-1&quot;&gt;My Final Thoughts + Possible solutions&lt;/h3&gt;
&lt;p&gt;In my opinion, the epicenter of the problem is not AI as a whole but recommendation algorithms and infinite scrolling. The problem is a business model built on the need to drive up engagement and keep users on the platform at any cost. And it&#39;s a titanic problem to tackle, since the whole modern Internet is built around this concept.&lt;/p&gt;
&lt;p&gt;AI-generated content can also be a problem, but I believe it&#39;s more of a &lt;strong&gt;misuse problem&lt;/strong&gt; than a fundamental problem with the technology. Generative AI models should be our helpers and assistants, but not take over our personas.&lt;/p&gt;
&lt;p&gt;Some people advocate for making AI-generated content easier to spot by &lt;strong&gt;watermarking&lt;/strong&gt;. I don&#39;t believe this is the way, not only because it&#39;s difficult to do for all the kinds of output LLMs can produce, but also because it&#39;s impossible to enforce.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://simonwillison.net/2023/Aug/27/wordcamp-llms/#personal-ai-ethics&quot;&gt;Simon Willison has a great oath on Personal AI ethics&lt;/a&gt;. While he admits to using AI as his writing assistant, he promises not to post anything that takes longer to read than it took him to write. &lt;strong&gt;And I promise to do the same as well&lt;/strong&gt;. Because, just like Simon, I think it&#39;s rude to publish text that you haven&#39;t even read yourself.&lt;/p&gt;
&lt;p&gt;As for possible solutions to make the Internet &lt;em&gt;less dead&lt;/em&gt;, and taking into account that deleting the Internet and starting from scratch is not possible, I&#39;ve come up with a few actions that I will be applying:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Build social circles in online social vehicles that don&#39;t feature recommendation algorithms / infinite scrolling and have a limited number of people.&lt;/strong&gt; Think Discord/Slack groups or online forums. The lack of recommendation algorithms will allow you to more directly explore the content and people you are interested in. The lack of infinite scrolling will keep you from endlessly consuming content and encourage you to interact / build more. And the limited number of people will allow for easier connection building with other members.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Also interact outside that circle&lt;/strong&gt; to get content and world views from outside your circle, but be very critical of what you read and see. Take into account that what you read and see might very well be AI generated. Don&#39;t let that take most of your time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Favor subscription feeds like RSS over algorithmic recommendation apps and web pages.&lt;/strong&gt; Subscribe to your favorite blogs (wink), newsletters and podcasts via an RSS reader or similar. Explore the personal pages of other people. Of all the content you consume, this is the kind you are most likely to enjoy, and it won&#39;t keep you stuck infinitely scrolling. When you are done, you are done.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the end, to make your Internet less dead, the goal is to not spend as much time on the Internet, but to be better connected to the things that matter to you.&lt;/p&gt;
</description>
      <pubDate>Sun, 23 Feb 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/dead-internet-theory/</guid>
    </item>
    <item>
      <title>The Rise Of Reasoner Models: Scaling Test-Time Compute</title>
      <link>https://ruxu.dev/articles/ai/test-time-compute/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;Le Penseur statue&quot; src=&quot;https://ruxu.dev/articles/ai/test-time-compute/assets/images/test-time-compute/thethinker-sm.jpg&quot; /&gt;
&lt;figcaption&gt;Le Penseur (1904) in the Musée Rodin in Paris. Picture by CrisNYCa - Wikipedia.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;A new kind of LLM has recently been popping up everywhere: Reasoner models. Kickstarted by OpenAI&#39;s o1 and o3, these models are a bit different from the rest. These models particularly shine when dealing with mathematical problems and coding challenges, where success depends on following precise, logical steps to reach a correct solution. On the other hand, these models take much longer to answer than any conventional model.&lt;/p&gt;
&lt;p&gt;The approach to problem-solving used by these models mirrors a well-known distinction in human cognition: System 1 versus System 2 thinking. Traditional LLMs operate much like System 1 thinking - quick, intuitive, and based on pattern recognition. They generate responses rapidly based on their trained neural networks. In contrast, Reasoner models embody System 2 thinking - deliberate, methodical, and self-correcting. They can pause, reflect on their reasoning, and even backtrack when they detect potential errors in their logic.&lt;/p&gt;
&lt;p&gt;The key innovation enabling these capabilities isn&#39;t found in revolutionary architectures or training methods, but rather in a different approach to computation: scaling test-time compute.&lt;/p&gt;
&lt;h2 id=&quot;what-is-test-time-compute&quot; tabindex=&quot;-1&quot;&gt;What is Test-time Compute?&lt;/h2&gt;
&lt;p&gt;At its core, test-time compute represents a fundamental shift in how we allocate computational resources in AI systems. While traditional models focus on scaling training time and data (train-time compute) to improve, Reasoner models invest those computational resources during the actual problem-solving phase (test-time compute). This approach essentially gives the model more time to &amp;quot;think&amp;quot; about its answers.&lt;/p&gt;
&lt;p&gt;The concept of &amp;quot;thinking longer&amp;quot; might seem similar to existing techniques like Chain-of-Thought (CoT) prompting, but there&#39;s a crucial difference. CoT prompting encourages models to spell out their reasoning and train of thought, but the intermediate steps are not validated or weighed against alternatives. Therefore, even if most steps are spot on, any error in the intermediate reasoning steps will compound, leading to incorrect final answers. Test-time compute solves this by enabling models to actively verify and correct their reasoning process.&lt;/p&gt;
&lt;h2 id=&quot;how-does-test-time-compute-work&quot; tabindex=&quot;-1&quot;&gt;How Does Test-time Compute Work?&lt;/h2&gt;
&lt;p&gt;The simplest method to implement test-time compute is &lt;strong&gt;iterative self-refinement&lt;/strong&gt;. In this method, the model outputs its reasoning and thoughts to solve a problem. Then, all the previous model outputs are passed through the same model again, making it pay attention to its reasoning and try to find and correct errors. After sufficient iterations, the quality of the response should be much higher than the first output. This is, however, a naïve approach that doesn’t allow the model to explore creative alternatives on the way to the right solution.&lt;/p&gt;
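As a rough sketch, iterative self-refinement is just a loop that feeds the model's previous answer back to itself for critique. The `generate` function below is a stand-in for a real LLM call (the toy model at the bottom is purely illustrative), not any specific API:

```python
def self_refine(generate, problem, iterations=3):
    """Iteratively ask the model to critique and improve its own answer.

    `generate` stands in for an LLM call: it takes a prompt string
    and returns a response string.
    """
    # First pass: solve the problem directly.
    answer = generate(f"Solve step by step: {problem}")
    # Refinement passes: feed the previous answer back for review.
    for _ in range(iterations):
        critique_prompt = (
            f"Problem: {problem}\n"
            f"Previous answer: {answer}\n"
            "Review the reasoning above, find any errors, "
            "and write a corrected answer."
        )
        answer = generate(critique_prompt)
    return answer

# Toy stand-in model: marks refined answers so the loop is observable.
toy_model = lambda prompt: prompt.count("Previous answer") * "revised " + "42"
print(self_refine(toy_model, "6 * 7"))
```

With a real model, each pass would get a genuine critique rather than a marker, but the control flow is the same.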
&lt;p&gt;A better approach is &lt;strong&gt;Verifier-guided search&lt;/strong&gt;. In this approach, the model generates multiple different answers and a verifier selects the best one. A high temperature is often used to encourage the model to be creative and explore many different solutions. The same model used to generate can be the verifier, however, it’s a much better approach to train a smaller, separate model for that role.&lt;/p&gt;
&lt;p&gt;The verifier can score the correctness of the overall solution (&lt;strong&gt;ORM - Outcome Reward Model&lt;/strong&gt;) or it can score each of the steps of the solution separately (&lt;strong&gt;PRM - Process Reward Model&lt;/strong&gt;). A PRM is much more expensive to run, since it evaluates every step, but it is the better approach: an ORM will mark a solution as incorrect even if all of its steps but one are correct, while a PRM will credit the correct steps and encourage the model to rethink only the incorrect one to get to the correct solution.&lt;/p&gt;
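The difference between the two reward models can be illustrated with a toy scorer. Everything here (the per-step scores, the aggregation by `min`, the 0.5 threshold) is made up for illustration; real ORMs and PRMs are trained neural networks:

```python
def orm_score(step_scores):
    """Outcome Reward Model (toy): one score for the whole solution.
    A single bad step sinks the entire answer, with no hint of where
    the error is."""
    return min(step_scores)  # toy aggregation: worst step dominates

def prm_feedback(step_scores, threshold=0.5):
    """Process Reward Model (toy): score each step separately and
    return the indices of the steps that need to be rethought."""
    return [i for i, s in enumerate(step_scores) if s < threshold]

steps = [0.9, 0.95, 0.2, 0.9]  # step 2 contains the error
print(orm_score(steps))        # low overall score, no localization
print(prm_feedback(steps))     # [2] -> only step 2 needs rethinking
```

The PRM output is what makes targeted correction possible: the model can regenerate step 2 while keeping the three good steps.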
&lt;p&gt;Since the PRM will need to evaluate many different approaches to the solution, each of which branches at every proposed step, we need efficient search strategies to find the path to the correct solution. The following are a few different search strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Best of N&lt;/strong&gt;: Generate N independent solutions and score each step using the reward model. Select the solution with the highest score in all its steps.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Best of N Weighted&lt;/strong&gt;: The implementation is the same as Best of N, however, identical responses are aggregated together. Therefore, the most common solutions will have higher scores.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Beam Search&lt;/strong&gt;: Generate N candidate first steps and score them all, keeping the top N/M for further exploration. For each kept candidate, generate M different possible next steps, so that N candidates are again scored at the next level. This continues until reaching the final solution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DVTS (Diverse Verifier Tree Search)&lt;/strong&gt;: Very similar to Beam Search, but we start with N/M different subtrees and we select only the best step for each of them. Then, for each of the best steps, we generate M new next steps and we score and select the best of them until we get to the final answer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lookahead search&lt;/strong&gt;: Also similar to Beam Search, but in order to grade each step, we generate the next step for that path and we also grade that next step. We use the score from the next step to score the previous step. Using this information, we end up selecting the paths that will be further explored and the ones that will be discarded. This method is also similar to the Monte Carlo Tree Search algorithm.&lt;/li&gt;
&lt;/ul&gt;
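The first two strategies from the list above are simple enough to sketch directly. In this toy version, each candidate is a (final answer, per-step scores) pair; the scores and the `min` aggregation are stand-ins for a trained verifier, not part of any real implementation:

```python
from collections import defaultdict

def best_of_n(candidates):
    """Best of N (toy): pick the candidate whose worst step scores
    highest, as judged by the reward model."""
    return max(candidates, key=lambda c: min(c[1]))[0]

def best_of_n_weighted(candidates):
    """Best of N Weighted (toy): identical final answers pool their
    scores, so answers the model reaches repeatedly win."""
    totals = defaultdict(float)
    for answer, step_scores in candidates:
        totals[answer] += min(step_scores)
    return max(totals, key=totals.get)

candidates = [
    ("42", [0.8, 0.7]),   # reached twice with decent scores
    ("42", [0.7, 0.9]),
    ("41", [0.95, 0.9]),  # reached once with a high score
]
print(best_of_n(candidates))           # "41": single best-scoring solution
print(best_of_n_weighted(candidates))  # "42": pooled 0.7 + 0.7 beats 0.9
```

The example shows why weighting matters: a lone high-scoring outlier wins plain Best of N, while the answer the model converges on repeatedly wins the weighted variant.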
&lt;p&gt;So, which of these strategies is the best? It depends. For simpler problems and lower compute iteration budget, Best of N weighted seems to be best. But for harder problems and higher compute budgets, Beam Search and its derivatives are definitely better. To achieve the best performance for both simple and complex problems, a &lt;strong&gt;“compute-optimal”&lt;/strong&gt; strategy can be chosen. This strategy will estimate the difficulty of the problem using a model and choose the search strategy accordingly.&lt;/p&gt;
&lt;p&gt;When correct solutions to hard problems are finally found using search, the data from the reasoning of the whole solution can then be used to further improve both the reasoning model and the reward model using reinforcement learning. This further improves the model on its reasoning abilities and makes it able to get to the correct answer faster in subsequent iterations.&lt;/p&gt;
&lt;h2 id=&quot;performance-improvements-through-test-time-compute&quot; tabindex=&quot;-1&quot;&gt;Performance Improvements Through Test-time Compute&lt;/h2&gt;
&lt;p&gt;When we evaluate models that use test-time compute using math and coding benchmarks we find remarkable improvements. In an article from HuggingFace, &lt;em&gt;“Scaling Test-time Compute with Open Models”&lt;/em&gt;, Llama-3.2 3B using 256-iteration test-time compute was found to be better than Llama-3.1 70B, &lt;strong&gt;a model over 20 times larger&lt;/strong&gt;. The paper &lt;em&gt;“Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters”&lt;/em&gt; also finds that a PaLM 2-S small model can &lt;strong&gt;outperform a 14 times larger&lt;/strong&gt; model using test-time compute.&lt;/p&gt;
&lt;p&gt;These findings suggest that some problems, especially reasoning-heavy ones like math and coding, can be solved not by using bigger models, but by making the model “think longer”.&lt;/p&gt;
&lt;h2 id=&quot;limitations-of-test-time-compute&quot; tabindex=&quot;-1&quot;&gt;Limitations of Test-time Compute&lt;/h2&gt;
&lt;p&gt;Does the improvement that reasoning models show mean that, to get better models, we should always let them “think for longer” instead of pretraining bigger models with more data? The already mentioned paper &lt;em&gt;“Scaling Test-time Compute with Open Models”&lt;/em&gt; shows that this does not seem to be the case. The authors of the paper tested which was more effective for solving harder problems; pretraining or scaling test-time compute, and their conclusion is the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Test-time and pretraining compute are not 1-to-1 “exchangeable”. On easy and medium questions, which are within a model’s capabilities, or in settings with small inference requirement, test-time compute can easily cover up for additional pretraining. However, on challenging questions which are outside a given base model’s capabilities or under higher inference requirement, pretraining is likely more effective for improving performance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Therefore, although scaling test-time compute can help models get to the correct answer even if they struggle to get it right, for test-time compute to work, the solution has to be within the model&#39;s capabilities and knowledge to begin with. Scaling test-time compute is not a solution that can solve any problem just by giving it more time.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Reasoner models like o1 and o3 are indeed impressive, and this leads many people to wonder if they are close to achieving Artificial General Intelligence (AGI). Of course, reasoning through test-time compute is not AGI at all. Test-time compute, while powerful, is not sufficient on its own to bridge the gap to AGI and, as we have established, has many shortcomings. Plus, it only seems to work well on tasks whose steps must be objectively correct to reach the right solution, namely math and coding tasks.&lt;/p&gt;
&lt;p&gt;However, this step to achieve strong reasoning with smaller models does seem to fit with OpenAI’s 5-step plan to AGI. Step 1 is conversational AI and step 2 is reasoning AI, which seems to have been achieved. The third step will be autonomous AI, which is already in the crosshairs of many agentic models and agentic systems that are currently being developed.&lt;/p&gt;
&lt;p&gt;So, should you use Reasoner models instead of regular ones? The decision should depend on your specific use case. These models excel at tasks requiring careful reasoning and verification, particularly in domains like mathematics, coding, and logical problem-solving. However, for tasks that require quick responses or deal with more subjective matters, traditional LLMs might still be the better choice.&lt;/p&gt;
&lt;h2 id=&quot;references&quot; tabindex=&quot;-1&quot;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Scaling Test-Time Compute with Open Models. Edward Beeching, Lewis Tunstall, Sasha Rush (HuggingFace). URL: &lt;a href=&quot;https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute&quot;&gt;https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective. Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Bo Wang, Shimin Li, Yunhua Zhou, Qipeng Guo, Xuanjing Huang, Xipeng Qiu. URL: &lt;a href=&quot;https://arxiv.org/abs/2412.14135&quot;&gt;https://arxiv.org/abs/2412.14135&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar. URL: &lt;a href=&quot;https://huggingface.co/papers/2408.03314&quot;&gt;https://huggingface.co/papers/2408.03314&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
      <pubDate>Sat, 11 Jan 2025 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/test-time-compute/</guid>
    </item>
    <item>
      <title>AI in 2024: Year in Review and Predictions for 2025</title>
      <link>https://ruxu.dev/articles/ai/2024-ai-review/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;Person Holding Firecracker at Night&quot; src=&quot;https://ruxu.dev/articles/ai/2024-ai-review/assets/images/2024-ai-review/firecracker.jpg&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;The past year has been transformative for artificial intelligence, marked by breakthrough innovations, emerging regulations, and a shift toward practical AI tools that enhance productivity. As we look ahead to 2025, let&#39;s review the major developments of 2024 and explore what the future might hold.&lt;/p&gt;
&lt;h2 id=&quot;part-i-2024-year-in-review&quot; tabindex=&quot;-1&quot;&gt;Part I: 2024 Year in Review&lt;/h2&gt;
&lt;h3 id=&quot;openai-in-the-front&quot; tabindex=&quot;-1&quot;&gt;OpenAI in the front&lt;/h3&gt;
&lt;p&gt;OpenAI maintained its position at the forefront of AI innovation in 2024. The release of GPT-4o in May marked a significant milestone as the first frontier multimodal model, capable of understanding and generating content across different forms of media. This was followed by o1 in September, along with its lighter counterpart o1-mini, and the announcement of o3 in December.&lt;/p&gt;
&lt;p&gt;Perhaps most notably, OpenAI&#39;s Sora project revolutionized video generation. Announced in February and enhanced with Sora Turbo in December, this technology demonstrated unprecedented capabilities in creating realistic video content from text descriptions.&lt;/p&gt;
&lt;h3 id=&quot;agentic-ai-and-ai-assistants&quot; tabindex=&quot;-1&quot;&gt;Agentic AI and AI Assistants&lt;/h3&gt;
&lt;p&gt;Rather than replacing jobs, AI in 2024 focused on enhancing human productivity through innovative tools and assistants. Google introduced several groundbreaking projects: Astra, an AI assistant for phones and smart glasses, and Mariner, a Chrome extension that enables Gemini to interact with your browser. These tools represent a shift from simple chat interfaces to interactive agents that can understand and manipulate our digital environment.&lt;/p&gt;
&lt;p&gt;Anthropic joined this trend by enabling Claude to use computers the way humans do—viewing screens, moving cursors, and interacting with interfaces. This development opened new possibilities for AI assistance in everyday computer tasks.&lt;/p&gt;
&lt;p&gt;The developer community saw particularly exciting advances. Github&#39;s Copilot and Cursor AI have been very well received by the developer community. On the other hand, autonomous coding agents like Devin and Github Workspace are still rough around the edges. These tools aren&#39;t replacing developers; instead, they&#39;re supercharging their productivity by handling routine tasks and suggesting improvements.&lt;/p&gt;
&lt;h3 id=&quot;ai-regulation&quot; tabindex=&quot;-1&quot;&gt;AI Regulation&lt;/h3&gt;
&lt;p&gt;2024 saw significant progress in AI regulation, particularly with the EU AI Act. This groundbreaking legislation set the first comprehensive framework for AI regulation, establishing rules for AI system development and deployment while balancing innovation with safety and ethical concerns.&lt;/p&gt;
&lt;h2 id=&quot;part-ii-looking-ahead-to-2025&quot; tabindex=&quot;-1&quot;&gt;Part II: Looking Ahead to 2025&lt;/h2&gt;
&lt;h3 id=&quot;small-specialized-models&quot; tabindex=&quot;-1&quot;&gt;Small, specialized models&lt;/h3&gt;
&lt;p&gt;While decoder-only models like GPT and Claude have dominated headlines, encoder-only models will remain an important piece of AI applications in 2025. Just one week before the end of the year, ModernBERT was released, representing a significant advance in this space that was long overdue. These models are particularly crucial for Retrieval-Augmented Generation (RAG) setups, where they excel at information retrieval and classification at a significantly lower cost to run.&lt;/p&gt;
&lt;p&gt;In a similar fashion, the current trend of agentic systems will favor small, specialized models. These models will be fine-tuned to be as good as frontier models at really specific tasks, but, used in conjunction in an agentic workflow, they will allow for really powerful and resource-efficient AI systems.&lt;/p&gt;
&lt;h3 id=&quot;multimodal-models-and-test-time-compute&quot; tabindex=&quot;-1&quot;&gt;Multimodal models and Test-time compute&lt;/h3&gt;
&lt;p&gt;We can expect significant advances in multimodal capabilities, with AI systems becoming more natural at processing and responding to various forms of input—text, voice, images, and video—simultaneously. These improvements will make AI interactions feel more natural and contextually aware.&lt;/p&gt;
&lt;p&gt;The concept of &amp;quot;test-time compute&amp;quot;—giving models more time to think—emerged as a game-changer in 2024 and will likely become more prominent in 2025. Frontier models like OpenAI’s o3 are crushing reasoning, math and coding benchmarks. They are consistently beating humans at really complex, deep reasoning tasks, making us wonder how far Artificial General Intelligence really is.&lt;/p&gt;
&lt;p&gt;Test-time compute is not only showing remarkable results for frontier models. Used with small edge models, it lets them surpass the performance of much larger models when given more processing time. For instance, Llama 3.2 3B outperformed Llama 3.1 70B on the Math-500 benchmark using a test-time compute strategy of 256 iterations.&lt;/p&gt;
&lt;h3 id=&quot;the-evolution-of-ai-regulation&quot; tabindex=&quot;-1&quot;&gt;The evolution of AI regulation&lt;/h3&gt;
&lt;p&gt;Following the EU&#39;s lead, we&#39;re likely to see more regions implement AI regulations in 2025. Interestingly, major tech companies are actively calling for regulation, potentially as a strategy to shape the regulatory environment and raise barriers to entry for competitors.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;2024 has been a year of remarkable progress in AI. We have seen many new ways to use AI in our day-to-day lives, as well as new methods to improve models that are making giant strides closer to AGI. As we move into 2025, the focus appears to be shifting toward more specialized and efficient AI systems, improved reasoning capabilities through techniques like test-time compute, and a focus on AI tools to improve productivity.&lt;/p&gt;
</description>
      <pubDate>Mon, 23 Dec 2024 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/2024-ai-review/</guid>
    </item>
    <item>
      <title>Is the EU Falling Behind in the AI Race?</title>
      <link>https://ruxu.dev/articles/ai/eu-ai/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;Library&quot; src=&quot;https://ruxu.dev/articles/ai/eu-ai/assets/images/eu-ai/eu-flag.jpg&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;The recent announcement that Meta&#39;s &lt;strong&gt;Llama 3.2 Vision models&lt;/strong&gt; won&#39;t be available in the European Union has reignited discussions about the impact of EU regulations on AI innovation and accessibility. This development joins a growing list of AI technologies from major tech companies that are currently unavailable to EU users, including &lt;strong&gt;ChatGPT&#39;s Advanced Voice mode&lt;/strong&gt; and &lt;strong&gt;Apple Intelligence&lt;/strong&gt;, raising concerns about whether the EU might be falling behind in the global AI race.&lt;/p&gt;
&lt;h3 id=&quot;the-eu-ai-act-and-its-impact&quot; tabindex=&quot;-1&quot;&gt;The EU AI Act and Its Impact&lt;/h3&gt;
&lt;p&gt;In April 2021, the European Commission proposed the &lt;strong&gt;EU AI Act&lt;/strong&gt;. This act classifies AI systems according to the risk they pose to users, with higher risk levels entailing stricter regulation. Furthermore, it sets some rules for &lt;strong&gt;General Purpose AI systems&lt;/strong&gt;, like Meta’s Llama or ChatGPT. Model developers must provide technical documentation and instructions, comply with the Copyright Directive, and publish a summary of the content used for training. If the model is open, like Llama, and it presents a &lt;strong&gt;systemic risk&lt;/strong&gt;, which Llama does, its developers must also conduct model evaluations and adversarial testing, track and report serious incidents, and ensure cybersecurity protections.&lt;/p&gt;
&lt;p&gt;According to the study by Stanford University “Do Foundation Model Providers Comply with the Draft EU AI Act?”, as of June 2023, &lt;strong&gt;no foundation models fully comply with the EU AI Act&#39;s requirements&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id=&quot;meta-and-eu-regulation&quot; tabindex=&quot;-1&quot;&gt;Meta and EU regulation&lt;/h3&gt;
&lt;p&gt;The situation becomes more complex when examining Meta&#39;s challenges with training their models in the EU. In June 2024, Meta faced a setback when the &lt;strong&gt;Irish Data Protection Commission&lt;/strong&gt; requested a delay in training their large language models using &lt;strong&gt;public content from adult Facebook and Instagram users in the EU&lt;/strong&gt;. Meta expressed disappointment with this decision, arguing it would hinder European innovation.&lt;/p&gt;
&lt;p&gt;It seems these requirements influenced Meta&#39;s decision not to release the Llama 3.2 Vision models in the EU. However, this also looks like a retaliation tactic to pressure the EU into letting Meta use private user data to train its models.&lt;/p&gt;
&lt;p&gt;As a side note, according to Llama 3.2’s Use Policy, the restriction on using these models applies specifically to companies and individuals based in the EU who wish to use these models and build services on them directly; end users in Europe that use services built on these models are not affected.&lt;/p&gt;
&lt;h3 id=&quot;eu-needs-ai&quot; tabindex=&quot;-1&quot;&gt;EU Needs AI&lt;/h3&gt;
&lt;p&gt;A significant countermovement has emerged in response to these regulatory challenges. The &lt;strong&gt;&amp;quot;EU Needs AI&amp;quot;&lt;/strong&gt; initiative, supported by prominent figures including &lt;strong&gt;Meta&#39;s Chief AI Scientist Yann LeCun&lt;/strong&gt;, argues that fragmented regulation threatens Europe&#39;s competitive position in AI development. Their position statement emphasizes that &amp;quot;Europe has become less competitive and less innovative compared to other regions and it now risks falling further behind in the AI era due to inconsistent regulatory decision making.&amp;quot;&lt;/p&gt;
&lt;h3 id=&quot;what-does-the-future-hold-for-ai-in-the-eu&quot; tabindex=&quot;-1&quot;&gt;What Does the Future Hold for AI in the EU?&lt;/h3&gt;
&lt;p&gt;In my opinion, the pushback against this regulation is probably overblown. I believe that, even with these rules in place, the EU will still have access to some amazing models, and its citizens will enjoy more ethical AI practices.&lt;/p&gt;
&lt;p&gt;While companies in Europe will need to comply with more requirements than companies in the rest of the world, those requirements are not out of place or unnecessary. Many of the requirements the EU AI Act describes for General Purpose AI systems, systemic risk or not, reflect practices that are already widely in place and considered good practice for reliable models. Also, citizens in the EU will enjoy more privacy and security than any other citizens in the world. Their private data, including pictures, won’t be used for training models, and abusive AI systems like social scoring and manipulative AI won’t be a problem for them.&lt;/p&gt;
&lt;p&gt;In conclusion, the EU is not falling behind in the AI race. It will remain competitive, in part thanks to its more ethical practices. And AI regulation is not only not a blocker, it is a necessity, and more countries should follow suit.&lt;/p&gt;
</description>
      <pubDate>Fri, 29 Nov 2024 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/eu-ai/</guid>
    </item>
    <item>
      <title>Build an Advanced RAG App: Query Routing</title>
      <link>https://ruxu.dev/articles/ai/query-routing/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;Library&quot; src=&quot;https://ruxu.dev/articles/ai/query-routing/assets/images/query-routing/query-routing-cover.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;In previous articles, we built a basic RAG application and introduced more advanced techniques to improve it. Today, we will explore how to tie those advanced techniques together. They may do different, sometimes opposite, things, yet we sometimes need all of them to cover every possibility. So let&#39;s see how we can link different techniques together. In this article we will take a look at a technique called Query Routing.&lt;/p&gt;
&lt;h3 id=&quot;the-problem-with-advanced-rag-applications&quot; tabindex=&quot;-1&quot;&gt;The problem with Advanced RAG Applications&lt;/h3&gt;
&lt;p&gt;When our Generative AI application receives a query, we have to decide what to do with it. For simple Generative AI applications, we send the query directly to the LLM. For simple RAG applications, we use the query to retrieve context from a single data source and then query the LLM. But if our case is more complex, we may have multiple data sources, or different queries may need different types of context. So do we build a one-size-fits-all solution, or do we make the application adapt, taking different actions depending on the query?&lt;/p&gt;
&lt;h3 id=&quot;what-is-query-routing&quot; tabindex=&quot;-1&quot;&gt;What is Query Routing?&lt;/h3&gt;
&lt;p&gt;Query Routing is about giving our RAG app the power of decision-making. Query Routing is a technique that takes the query from the user and uses it to make a decision on the next action to take, from a list of predefined choices.&lt;/p&gt;
&lt;p&gt;Query Routing is a module in our Advanced RAG architecture. It is usually found after any query rewriting or guardrails. It analyses the input query and it decides the best tool to use from a list of predefined actions. The actions are usually retrieving context from one or many data sources. It could also decide to use a different index for a data source (like parent-child retrieval). Or it could even decide to search for context on the Internet.&lt;/p&gt;
&lt;h3 id=&quot;which-are-the-choices-for-the-query-router&quot; tabindex=&quot;-1&quot;&gt;Which are the choices for the Query Router?&lt;/h3&gt;
&lt;p&gt;We have to define the choices that the Query Router can take beforehand. We must first implement each of the different strategies, and accompany each one with a nice description. It is very important that the description explains in detail what each strategy does, since this description will be what our router will base its decision on.&lt;/p&gt;
&lt;p&gt;The choices a Query Router takes can be the following:&lt;/p&gt;
&lt;h4 id=&quot;retrieval-from-different-data-sources&quot; tabindex=&quot;-1&quot;&gt;Retrieval from different data sources&lt;/h4&gt;
&lt;p&gt;We can catalog multiple data sources that contain information on different topics. We might have a data source that contains information about a product the user has questions about, and another data source with information about our return policies, etc. Instead of looking for the answers to the user’s questions in all data sources, the query router can decide which data source to use based on the user query and the data source description.&lt;/p&gt;
&lt;p&gt;Data sources can be text stored in vector databases, regular databases, graph databases, etc.&lt;/p&gt;
&lt;h4 id=&quot;retrieval-from-different-indexes&quot; tabindex=&quot;-1&quot;&gt;Retrieval from different indexes&lt;/h4&gt;
&lt;p&gt;Query Routers can also choose to use a different index for the same data source.&lt;/p&gt;
&lt;p&gt;For example, we could have an index for keyword based search and another for semantic search using vector embeddings. The Query Router can decide which of the two is best for getting the relevant context for answering the question, or maybe use both of them at the same time and combine the contexts from both.&lt;/p&gt;
&lt;p&gt;We could also have different indexes for different retrieval strategies. For example, we could have a retrieval strategy based on summaries, or a sentence window retrieval strategy, or a parent-child retrieval strategy. The Query Router can analyze the specificity of the question and decide which strategy is best to use to get the best context.&lt;/p&gt;
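&lt;p&gt;As a rough sketch of the &quot;use both indexes and combine&quot; option, here is a minimal merge of two ranked hit lists. The function name and the interleaving policy are illustrative choices, not code from the article:&lt;/p&gt;

```python
# Minimal sketch: combine hits from two indexes (e.g. keyword and semantic),
# interleaving them and dropping duplicates so neither index dominates.
from itertools import zip_longest

def combine_results(keyword_hits, semantic_hits):
    merged = []
    seen = set()
    for pair in zip_longest(keyword_hits, semantic_hits):
        for doc in pair:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

print(combine_results(["a", "b"], ["b", "c"]))  # prints: ['a', 'b', 'c']
```

In a real pipeline the list items would be retrieved chunks with scores; rank fusion schemes are also common, but a simple interleave already illustrates the idea.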
&lt;h4 id=&quot;other-data-sources&quot; tabindex=&quot;-1&quot;&gt;Other data sources&lt;/h4&gt;
&lt;p&gt;The decision that the Query Router takes is not limited to databases and indexes. It can also decide to use a tool to look for the information elsewhere. For example, it can decide to use a tool to look for the answer online using a search engine. Or it can also use an API from a specific service (for example, weather forecasting) to get the data it needs to get the relevant context.&lt;/p&gt;
&lt;h3 id=&quot;types-of-query-routers&quot; tabindex=&quot;-1&quot;&gt;Types of Query Routers&lt;/h3&gt;
&lt;p&gt;An important part of our Query Router is how it makes the decision to choose one or another path. The decision can vary depending on each of the different types of Query Routers. The following are a few of the most used Query Router types:&lt;/p&gt;
&lt;h4 id=&quot;llm-selector-router&quot; tabindex=&quot;-1&quot;&gt;LLM Selector Router&lt;/h4&gt;
&lt;p&gt;This solution gives a prompt to an LLM, which completes it by selecting the right choice. The prompt includes all the different choices, each with its description, as well as the input query to base the decision on. The LLM&#39;s response is then used to programmatically decide which path to take.&lt;/p&gt;
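&lt;p&gt;A minimal sketch of this idea, assuming a hypothetical &lt;code&gt;call_llm&lt;/code&gt; function that stands in for a real model call; the route names and descriptions are made up for illustration:&lt;/p&gt;

```python
# Sketch of an LLM selector router. `call_llm` is a placeholder for a real
# LLM client (OpenAI, a local model, etc.); the routes are illustrative.
ROUTES = {
    "product_docs": "Questions about product features and usage.",
    "return_policy": "Questions about returns, refunds and shipping.",
}

def build_selector_prompt(query):
    """Build a prompt asking the LLM to pick one route by name."""
    options = "\n".join(f"- {name}: {desc}" for name, desc in ROUTES.items())
    return (
        "Choose the best data source for the question below.\n"
        f"Options:\n{options}\n"
        f"Question: {query}\n"
        "Answer with only the option name."
    )

def route(query, call_llm):
    answer = call_llm(build_selector_prompt(query)).strip().lower()
    # Fall back to the first route if the LLM answers something unexpected.
    return answer if answer in ROUTES else next(iter(ROUTES))

# Stub standing in for a real model, for illustration only.
fake_llm = lambda prompt: "return_policy"
print(route("How do I send back a broken item?", fake_llm))  # prints: return_policy
```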
&lt;h4 id=&quot;llm-function-calling-router&quot; tabindex=&quot;-1&quot;&gt;LLM Function Calling Router&lt;/h4&gt;
&lt;p&gt;This solution leverages the function calling (or tool use) capabilities of LLMs. Some LLMs have been trained to decide to use the tools provided to them in the prompt in order to reach an answer. Using this capability, each of the different choices is phrased as a tool in the prompt, and the LLM is prompted to choose which of the provided tools is best for retrieving the right context to answer the query.&lt;/p&gt;
&lt;h4 id=&quot;semantic-router&quot; tabindex=&quot;-1&quot;&gt;Semantic Router&lt;/h4&gt;
&lt;p&gt;This solution uses similarity search on the vector embedding representation of the user query. For each choice, we will have to write a few examples of a query that would be routed to this path. When a user query arrives, an embeddings model converts it to a vector representation and it is compared to the example queries for each router choice. The example with the nearest vector representation to the user query is chosen as the path the router must route to.&lt;/p&gt;
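&lt;p&gt;A runnable sketch of a semantic router follows. A real implementation would use an embeddings model; here a toy bag-of-words vector stands in so the example is self-contained, and the routes and example queries are invented for illustration:&lt;/p&gt;

```python
# Semantic router sketch: route to the choice whose example query is
# closest to the user query. Bag-of-words cosine similarity stands in
# for a real embeddings model.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# A few example queries per route, as the article describes.
EXAMPLES = {
    "recipes": ["how do I cook chicken", "ingredients for a cake"],
    "tech_support": ["my app crashes on startup", "how to reset my password"],
}

def semantic_route(query):
    qv = embed(query)
    scored = (
        (route, cosine(qv, embed(ex)))
        for route, examples in EXAMPLES.items()
        for ex in examples
    )
    return max(scored, key=lambda pair: pair[1])[0]

print(semantic_route("what ingredients do I need for chicken gyros"))  # prints: recipes
```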
&lt;h4 id=&quot;zero-shot-classification-router&quot; tabindex=&quot;-1&quot;&gt;Zero-shot classification Router&lt;/h4&gt;
&lt;p&gt;For this type of router, a small LLM is selected to act as a router. It is fine-tuned on a dataset of example user queries and the correct routing for each of them, so that its sole purpose becomes classifying user queries. Small LLMs are more cost-effective and more than good enough for a simple classification task.&lt;/p&gt;
&lt;h4 id=&quot;language-classification-router&quot; tabindex=&quot;-1&quot;&gt;Language Classification Router&lt;/h4&gt;
&lt;p&gt;In some cases, the purpose of the Query Router will be to redirect the query to a specific database or model depending on the language the user wrote the query in. Language can be detected in many ways, like using a ML classification model or a Generative AI LLM with a specific prompt.&lt;/p&gt;
&lt;h4 id=&quot;keyword-router&quot; tabindex=&quot;-1&quot;&gt;Keyword router&lt;/h4&gt;
&lt;p&gt;Sometimes the use case is extremely simple. In that case, the solution could be to route one way or another depending on whether certain keywords are present in the user query. For example, if the query contains the word “return”, we could use a data source with useful information about how to return a product. For this solution, a simple code implementation is enough, so no expensive model is needed.&lt;/p&gt;
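&lt;p&gt;A keyword router really is just a lookup in plain code. The keyword sets and route names below are illustrative:&lt;/p&gt;

```python
# Keyword router sketch: no model needed, just keyword membership checks.
KEYWORD_ROUTES = [
    ({"return", "refund", "exchange"}, "returns_kb"),
    ({"price", "cost", "discount"}, "pricing_kb"),
]

def keyword_route(query, default="general_kb"):
    words = set(query.lower().split())
    for keywords, route in KEYWORD_ROUTES:
        if keywords.intersection(words):
            return route
    return default

print(keyword_route("I want a refund now"))  # prints: returns_kb
```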
&lt;h3 id=&quot;single-choice-routing-vs-multiple-choice-routing&quot; tabindex=&quot;-1&quot;&gt;Single choice routing vs Multiple choice routing&lt;/h3&gt;
&lt;p&gt;Depending on the use case, it may make sense for the router to choose just one path and run it. However, in some cases it can also make sense to use more than one choice to answer the same query: to answer a question that spans many topics, the application needs to retrieve information from many data sources, or the response might differ depending on the data source. In that case, we can use all of them to answer the question and consolidate the results into a single final answer.&lt;/p&gt;
&lt;p&gt;We have to design the router taking these possibilities into account.&lt;/p&gt;
&lt;h3 id=&quot;example-implementation-of-a-query-router&quot; tabindex=&quot;-1&quot;&gt;Example implementation of a Query Router&lt;/h3&gt;
&lt;p&gt;Let’s get into the implementation of a Query Router within a RAG application. You can follow the implementation step by step and run it yourself in the &lt;a href=&quot;https://colab.research.google.com/drive/1B1rGvGriKIVe7PMClrMC0z3wMBbsLIYW?usp=sharing&quot;&gt;Google Colab notebook&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For this example, we will showcase a RAG application with a query router. The application can decide to answer questions based on two documents: the first is a paper about RAG and the second is a recipe for chicken gyros. The application can also decide to answer based on a Google search. We will implement a single-source Query Router using an LLM function calling router.&lt;/p&gt;
&lt;h4 id=&quot;load-the-paper&quot; tabindex=&quot;-1&quot;&gt;Load the paper&lt;/h4&gt;
&lt;p&gt;First, we will prepare the two documents for retrieval. Let&#39;s first load the paper about RAG:&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Load RAG paper&quot; src=&quot;https://ruxu.dev/articles/ai/query-routing/assets/images/query-routing/load-rag-paper.png&quot; /&gt;
&lt;/figure&gt;
&lt;h4 id=&quot;load-the-recipe&quot; tabindex=&quot;-1&quot;&gt;Load the recipe&lt;/h4&gt;
&lt;p&gt;We will also load the recipe for chicken gyros. This recipe from Mike Price is hosted on &lt;a href=&quot;http://tasty.co/&quot;&gt;tasty.co&lt;/a&gt;. We will use a simple web page reader to read the page and store it as text.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Load chicken gyros recipe&quot; src=&quot;https://ruxu.dev/articles/ai/query-routing/assets/images/query-routing/load-recipe.png&quot; /&gt;
&lt;/figure&gt;
&lt;h4 id=&quot;save-the-documents-in-a-vector-store&quot; tabindex=&quot;-1&quot;&gt;Save the documents in a vector store&lt;/h4&gt;
&lt;p&gt;After getting the two documents we will use for our RAG application, we split them into chunks and convert them to embeddings using BGE small, an open-source embeddings model. We store those embeddings in two vector stores, ready to be queried.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Create Vector Stores&quot; src=&quot;https://ruxu.dev/articles/ai/query-routing/assets/images/query-routing/create-vector-stores.png&quot; /&gt;
&lt;/figure&gt;
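&lt;p&gt;The splitting step is model-independent, so it can be sketched on its own. The chunk size and overlap below are illustrative defaults, not the notebook&#39;s exact settings:&lt;/p&gt;

```python
# Sketch of the chunking step that precedes embedding. Overlapping windows
# help keep context that straddles a chunk boundary.
def split_into_chunks(text, chunk_size=200, overlap=40):
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = split_into_chunks("some long document text " * 40)
print(len(chunks))
```

Each chunk would then be embedded (with BGE small in the article&#39;s notebook) and stored in the vector store.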
&lt;h4 id=&quot;search-engine-tool&quot; tabindex=&quot;-1&quot;&gt;Search engine tool&lt;/h4&gt;
&lt;p&gt;Besides the two documents, the third option for our router will be to search for information using a Google Search. For this example I have created my own Google Search API keys. If you want this part to work, you should use your own API keys.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Define Google Search Tool&quot; src=&quot;https://ruxu.dev/articles/ai/query-routing/assets/images/query-routing/google-search-tool.png&quot; /&gt;
&lt;/figure&gt;
&lt;h4 id=&quot;create-the-query-router&quot; tabindex=&quot;-1&quot;&gt;Create the Query Router&lt;/h4&gt;
&lt;p&gt;Next, using the LlamaIndex library, we create a Query Engine Tool for each of the three options the router will choose between. We provide a description for each tool, explaining what it is useful for. This description is very important, since it is the basis on which the Query Router decides which path to choose.&lt;/p&gt;
&lt;p&gt;Finally, we create a Router Query Engine, also with LlamaIndex. We give the three query engine tools to this router. We also define the selector, which is the component that makes the choice of which tool to use. For this example, we are using an LLM Selector. It&#39;s also a single selector, meaning it will only ever choose one tool, never more than one, to answer the query.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Create the query router&quot; src=&quot;https://ruxu.dev/articles/ai/query-routing/assets/images/query-routing/create-query-router.png&quot; /&gt;
&lt;/figure&gt;
&lt;h4 id=&quot;run-our-rag-application&quot; tabindex=&quot;-1&quot;&gt;Run our RAG application!&lt;/h4&gt;
&lt;p&gt;Our Query Router is now ready. Let&#39;s test it with a question about RAG. We provided a vector store loaded with information from a paper on RAG techniques. The Query Router should choose to retrieve context from that vector store in order to answer the question. Let&#39;s see what happens:&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Question the app about RAG&quot; src=&quot;https://ruxu.dev/articles/ai/query-routing/assets/images/query-routing/rag-question.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;Our RAG application answers correctly. Along with the answer, we can see that it provides the sources it got the information from. As we expected, it used the vector store with the RAG paper.&lt;/p&gt;
&lt;p&gt;We can also see a &amp;quot;selector_result&amp;quot; attribute in the result. In this attribute we can inspect which of the tools the Query Router chose, as well as the reason the LLM gave for choosing that option.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Selector result for the RAG question&quot; src=&quot;https://ruxu.dev/articles/ai/query-routing/assets/images/query-routing/rag-selection-reason.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;Now let&#39;s ask it a culinary question. The recipe used to create the second vector store is for chicken gyros. Our application should be able to answer which ingredients are needed for that recipe based on that source.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Question the app about the recipe&quot; src=&quot;https://ruxu.dev/articles/ai/query-routing/assets/images/query-routing/recipe-question.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;As we can see, the chicken gyros recipe vector store was correctly chosen to answer that question.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Selector result for the recipe question&quot; src=&quot;https://ruxu.dev/articles/ai/query-routing/assets/images/query-routing/recipe-selection-reason.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;Finally, let&#39;s ask it a question that can be answered with a Google Search.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Question for Google Search&quot; src=&quot;https://ruxu.dev/articles/ai/query-routing/assets/images/query-routing/google-question.png&quot; /&gt;
&lt;/figure&gt;
&lt;h3 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;In conclusion, Query Routing is a great step towards a more advanced RAG application. It lays the base for a more complex system in which our app can better plan how to answer questions. Query Routing can also be the glue that ties together other advanced techniques in your RAG application and makes them work together as a whole system.&lt;/p&gt;
&lt;p&gt;However, the complexity of better RAG systems doesn&#39;t end with Query Routing. Query Routing is just the first stepping stone towards orchestration within RAG applications. The next stepping stone, which lets our RAG applications reason, decide and take actions based on the needs of the users, is Agents. In later articles, we will dive deeper into how Agents work within RAG and Generative AI applications in general.&lt;/p&gt;
</description>
      <pubDate>Thu, 12 Sep 2024 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/query-routing/</guid>
    </item>
    <item>
      <title>Build an Advanced RAG App: Query Rewriting</title>
      <link>https://ruxu.dev/articles/ai/query-rewriting/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;Library&quot; src=&quot;https://ruxu.dev/articles/ai/query-rewriting/assets/images/query-rewriting/query-rewriting-cover.jpg&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;In the last article, I established the basic architecture for a basic RAG app. In case you missed it, I recommend reading that article first. It sets the base from which we can improve our RAG system. In that article, I also listed some common pitfalls that RAG applications tend to fall into. We will tackle some of them with advanced techniques in this article.&lt;/p&gt;
&lt;p&gt;To recap, a basic RAG app uses a separate knowledge base that aids the LLM to answer the user’s questions by providing it with more context. This is also called a retrieve-then-read approach.&lt;/p&gt;
&lt;h3 id=&quot;the-problem&quot; tabindex=&quot;-1&quot;&gt;The problem&lt;/h3&gt;
&lt;p&gt;To answer the user’s question, our RAG app will retrieve the appropriate context based on the query itself. It will find chunks of text in the vector DB with content similar to whatever the user is asking. Other knowledge bases (search engines, etc.) work the same way.
The problem is that the chunk of information where the answer lies might not be similar to what the user is asking. The question can be badly written, or expressed differently from what we expect. And, if our RAG app can’t find the information needed to answer the question, it won’t answer correctly.&lt;/p&gt;
&lt;p&gt;There are many ways to solve this problem, but for this article, we will look at query rewriting.&lt;/p&gt;
&lt;h3 id=&quot;what-is-query-rewriting&quot; tabindex=&quot;-1&quot;&gt;What is Query Rewriting?&lt;/h3&gt;
&lt;p&gt;Simply put, query rewriting means we rewrite the user query in our own words, phrased in the way our RAG app will know best how to answer. Instead of just doing retrieve-then-read, our app will follow a rewrite-retrieve-read approach.&lt;/p&gt;
&lt;p&gt;We use a Generative AI model to rewrite the question. This model can be a large model, like (or the same as) the one we use to answer the question in the final step. Or it can be a smaller model, specially trained to perform this task.&lt;/p&gt;
&lt;p&gt;Also, query rewriting can take many different forms depending on the needs of the app. Most of the time, basic query rewriting will be enough. But, depending on the complexity of the questions we need to answer, we might need more advanced techniques like HyDE, multi-querying or step-back questions. More information on those in the following section.&lt;/p&gt;
&lt;h3 id=&quot;why-does-it-work&quot; tabindex=&quot;-1&quot;&gt;Why does it work?&lt;/h3&gt;
&lt;p&gt;Query Rewriting usually gives better performance in any RAG app that is knowledge intensive. This is because RAG applications are sensitive to the phrasing and specific keywords of the query. Paraphrasing this query is helpful in the following scenarios:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It restructures oddly written questions so they can be better understood by our system.&lt;/li&gt;
&lt;li&gt;It erases context given by the user which is irrelevant to the query.&lt;/li&gt;
&lt;li&gt;It can introduce common keywords, which will give it a better chance of matching up with the correct context.&lt;/li&gt;
&lt;li&gt;It can split complex questions into different sub-questions, which can be answered more easily and separately, each with its corresponding context.&lt;/li&gt;
&lt;li&gt;It can answer questions that require multiple levels of thinking by generating a step-back question, which is a higher-level concept question than the one from the user. It then uses both the original and the step-back question to retrieve context.&lt;/li&gt;
&lt;li&gt;It can use more advanced query rewriting techniques like HyDE to generate hypothetical documents to answer the question. These hypothetical documents will better capture the intent of the question and match up with the embeddings that contain the answer in the vector DB.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&quot;how-to-implement-query-rewriting&quot; tabindex=&quot;-1&quot;&gt;How to implement Query Rewriting&lt;/h3&gt;
&lt;p&gt;We have established that there are different Query Rewriting strategies depending on the complexity of the questions. We will briefly visit how to implement each of them. Afterwards, we will see a real example comparing the results of a basic RAG app with and without Query Rewriting. You can also follow all the examples in &lt;a href=&quot;https://colab.research.google.com/drive/1-NT0_mmyoSnaDQJ1Zuo0XX613TG5lzjZ?usp=sharing&quot; target=&quot;_blank&quot;&gt;the article’s Google Colab notebook&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&quot;zero-shot-query-rewriting&quot; tabindex=&quot;-1&quot;&gt;Zero-shot Query Rewriting&lt;/h4&gt;
&lt;p&gt;This is the simplest form of query rewriting. Zero-shot refers to the prompt engineering practice of giving the LLM no examples of the task, only the instruction.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Zero-shot Query Rewriting&quot; src=&quot;https://ruxu.dev/articles/ai/query-rewriting/assets/images/query-rewriting/zero-shot.png&quot; /&gt;
&lt;/figure&gt;
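&lt;p&gt;The screenshot above shows the notebook version; as a runnable sketch of the same idea, here is a zero-shot rewrite with a hypothetical &lt;code&gt;call_llm&lt;/code&gt; placeholder and a stub standing in for the model:&lt;/p&gt;

```python
# Zero-shot query rewriting sketch: one instruction, no examples.
# `call_llm` is a placeholder for whatever model client you use.
REWRITE_PROMPT = (
    "Rewrite the user question so it retrieves better results from a search "
    "over technical documents. Keep the meaning, drop irrelevant context.\n"
    "Question: {question}\n"
    "Rewritten question:"
)

def rewrite_query(question, call_llm):
    return call_llm(REWRITE_PROMPT.format(question=question)).strip()

# Stub model output, for illustration only.
stub = lambda prompt: "  What tools exist for evaluating RAG pipelines?  "
print(rewrite_query("uh so how would I even test my rag thing", stub))
```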
&lt;h4 id=&quot;few-shot-query-rewriting&quot; tabindex=&quot;-1&quot;&gt;Few-shot Query Rewriting&lt;/h4&gt;
&lt;p&gt;For a slightly better result at the cost of using a few more tokens per rewrite, we can give some examples of how we want the rewrite to be done.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Few-shot Query Rewriting&quot; src=&quot;https://ruxu.dev/articles/ai/query-rewriting/assets/images/query-rewriting/few-shot.png&quot; /&gt;
&lt;/figure&gt;
&lt;h4 id=&quot;trainable-rewriter&quot; tabindex=&quot;-1&quot;&gt;Trainable rewriter&lt;/h4&gt;
&lt;p&gt;We can fine-tune a pre-trained model to perform the query rewriting task. Instead of relying on examples, we can teach it how query rewriting should be done to achieve the best results in context retrieval. We can also further train it using Reinforcement Learning so it learns to recognize problematic queries and avoid toxic and harmful phrases.
Alternatively, we can use an open-source model that somebody else has already trained on the task of query rewriting.&lt;/p&gt;
&lt;h4 id=&quot;sub-queries&quot; tabindex=&quot;-1&quot;&gt;Sub-queries&lt;/h4&gt;
&lt;p&gt;If the user query contains multiple questions, context retrieval can get tricky. Each question probably needs different information, and we are not going to get all of it using the full set of questions as the basis for retrieval. To solve this problem, we can decompose the input into multiple sub-queries and perform retrieval for each of them.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Sub-queries Query Rewriting&quot; src=&quot;https://ruxu.dev/articles/ai/query-rewriting/assets/images/query-rewriting/subqueries.png&quot; /&gt;
&lt;/figure&gt;
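&lt;p&gt;One way to sketch the decomposition step: ask the model for one sub-question per line, then retrieve for each line. The prompt wording is an assumption and &lt;code&gt;call_llm&lt;/code&gt; is a placeholder for a real model call:&lt;/p&gt;

```python
# Sub-query decomposition sketch. The LLM is asked to emit one sub-question
# per line; each line then gets its own retrieval pass.
DECOMPOSE_PROMPT = (
    "Split the question below into independent sub-questions, one per line.\n"
    "Question: {question}\nSub-questions:"
)

def decompose(question, call_llm):
    raw = call_llm(DECOMPOSE_PROMPT.format(question=question))
    # Strip list markers and blank lines from the model's output.
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

# Stub model output, for illustration only.
stub = lambda prompt: "- Who wrote the RAG paper?\n- When was it published?"
for sub in decompose("Who wrote the RAG paper and when was it published?", stub):
    print(sub)
```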
&lt;h4 id=&quot;step-back-prompt&quot; tabindex=&quot;-1&quot;&gt;Step-back prompt&lt;/h4&gt;
&lt;p&gt;Many questions can be a bit too complex for the RAG pipeline’s retrieval to grasp the multiple levels of information needed to answer them. For these cases, it can be helpful to generate multiple additional queries to use for retrieval. These queries will be more generic than the original query. This will enable the RAG pipeline to retrieve relevant information on multiple levels.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Step-back prompt Query Rewriting&quot; src=&quot;https://ruxu.dev/articles/ai/query-rewriting/assets/images/query-rewriting/step-back.png&quot; /&gt;
&lt;/figure&gt;
&lt;h4 id=&quot;hyde&quot; tabindex=&quot;-1&quot;&gt;HyDE&lt;/h4&gt;
&lt;p&gt;Another method to improve how queries are matched with context chunks is Hypothetical Document Embeddings, or HyDE. Sometimes, questions and answers are not that semantically similar, which can cause the RAG pipeline to miss critical context chunks in the retrieval stage. However, even if the query is semantically different, a response to the query should be semantically similar to another response to the same query. The HyDE method consists of creating hypothetical context chunks that answer the query and using them to match the real context that will help the LLM answer.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;HyDE Query Rewriting&quot; src=&quot;https://ruxu.dev/articles/ai/query-rewriting/assets/images/query-rewriting/hyde1.png&quot; /&gt;
&lt;/figure&gt;
&lt;figure&gt;
&lt;img alt=&quot;HyDE Query Rewriting comparison&quot; src=&quot;https://ruxu.dev/articles/ai/query-rewriting/assets/images/query-rewriting/hyde2.png&quot; /&gt;
&lt;/figure&gt;
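&lt;p&gt;A minimal sketch of the HyDE flow, assuming a hypothetical &lt;code&gt;call_llm&lt;/code&gt; placeholder; toy term overlap stands in for real embedding similarity so the example runs on its own:&lt;/p&gt;

```python
# HyDE sketch: generate a hypothetical answer, then use that answer (not the
# question) as the retrieval key against the stored documents.
from collections import Counter

HYDE_PROMPT = "Write a short passage that answers: {question}"

def hyde_retrieve(question, call_llm, documents):
    hypothetical = call_llm(HYDE_PROMPT.format(question=question))
    hv = Counter(hypothetical.lower().split())

    def overlap(doc):
        dv = Counter(doc.lower().split())
        return sum(min(hv[t], dv[t]) for t in hv)

    # Match documents against the hypothetical answer, not the raw question.
    return max(documents, key=overlap)

docs = [
    "gyros are made with chicken garlic and yogurt",
    "rag retrieval uses vector databases to find context",
]
stub = lambda prompt: "rag pipelines retrieve context from vector databases"
print(hyde_retrieve("how does rag find context", stub, docs))
```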
&lt;h3 id=&quot;example-rag-with-vs-without-query-rewriting&quot; tabindex=&quot;-1&quot;&gt;Example: RAG with vs without Query Rewriting&lt;/h3&gt;
&lt;p&gt;Taking the RAG pipeline from the last article, “How to build a basic RAG app”, we will introduce Query Rewriting into it. We will ask it a slightly more advanced question than last time and observe whether the response improves with Query Rewriting. First, let’s build the same RAG pipeline. Only this time, I’ll use only the top document returned from the vector database, to be less forgiving of missed documents.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Example without Query Rewriting&quot; src=&quot;https://ruxu.dev/articles/ai/query-rewriting/assets/images/query-rewriting/example-woqr.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;The response is good and based on the context, but the model got caught up in my asking about evaluation and missed that I was specifically asking for tools. As a result, the context used does have information on some benchmarks, but it misses the next chunk of information, which talks about tools.&lt;/p&gt;
&lt;p&gt;Now, let’s implement the same RAG pipeline, this time with Query Rewriting. In addition to the query rewriting prompts we have already seen in the previous examples, I’ll use a Pydantic parser to extract and iterate over the generated alternative queries.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Example with Query Rewriting&quot; src=&quot;https://ruxu.dev/articles/ai/query-rewriting/assets/images/query-rewriting/example-wqr.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;The new query now matches the chunk of information I wanted the answer to come from, giving the LLM a much better chance of producing a good response to my question.&lt;/p&gt;
&lt;h3 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;We have taken our first step out of basic RAG pipelines and into Advanced RAG. Query Rewriting is a very simple Advanced RAG technique but a powerful one for improving the results of a RAG pipeline. We have gone over different ways to implement it depending on what kind of questions we need to improve. In future articles we will go over other Advanced RAG techniques that can tackle different RAG issues than those seen in this article.&lt;/p&gt;
</description>
      <pubDate>Sun, 30 Jun 2024 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/query-rewriting/</guid>
    </item>
    <item>
      <title>How to build a basic RAG app</title>
      <link>https://ruxu.dev/articles/ai/build-a-basic-rag/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;Library&quot; src=&quot;https://ruxu.dev/articles/ai/build-a-basic-rag/assets/images/basic-rag/library.jpg&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;The dawn of Generative AI makes new kinds of capabilities possible for the applications we build. LLMs can answer users’ questions with incredible skill. So, why not use them as part of our systems? If the user needs help getting around the app, we can add a chat function where the LLM answers all of the user’s questions. If our app has blog posts explaining important concepts, instead of making users read all of them to get the knowledge they need, they could just ask and get an immediate response.&lt;/p&gt;
&lt;h3 id=&quot;why-rag&quot; tabindex=&quot;-1&quot;&gt;Why RAG?&lt;/h3&gt;
&lt;p&gt;We decide to integrate an LLM into our app to bring these features to our users. However, we soon find that the model can’t answer the users’ questions. It doesn’t have any information about our application! If the information needed to answer is not in the LLM’s training data, it can’t answer. Even worse, if it doesn’t know the answer, it might hallucinate a completely wrong fact! This is bad, so how do we fix it? LLMs with the Transformer architecture have shown great in-context learning capabilities, so we can just pass all the facts the model needs in the prompt, together with the question. Uh oh, it will definitely be expensive to stuff all our data into every prompt. So, how do we do it?&lt;/p&gt;
&lt;h3 id=&quot;what-is-rag&quot; tabindex=&quot;-1&quot;&gt;What is RAG?&lt;/h3&gt;
&lt;p&gt;RAG stands for &lt;strong&gt;Retrieval Augmented Generation&lt;/strong&gt;. RAG was born together with Transformers. Initially, it was used to augment the pre-training data of LLMs with additional facts. Once Transformers’ in-context learning capabilities became obvious, it also became common practice to augment the prompt at inference time.&lt;/p&gt;
&lt;p&gt;A basic RAG pipeline consists of three steps: &lt;strong&gt;indexing, retrieval and generation&lt;/strong&gt;. All the information that the LLM needs to answer is indexed in a vector database. When the user asks a question, we can retrieve the relevant parts of the information from that vector database. Finally, we can prompt the LLM with just the relevant information and the user’s question, asking it to answer based on the context we provide. Let’s look in more detail at how to achieve this.&lt;/p&gt;
&lt;h4 id=&quot;indexing&quot; tabindex=&quot;-1&quot;&gt;Indexing&lt;/h4&gt;
&lt;p&gt;First, we &lt;strong&gt;extract the information&lt;/strong&gt; that the model needs from wherever it lives. Generative models work with plain text (some models can also work with images or other formats, which can also be indexed, but that is a topic for another time). If the information is already in plain text, we are in luck. But it might also be in PDF documents, Word documents, Excel spreadsheets, Markdown, etc. We must &lt;strong&gt;convert this data to plain text and clean it&lt;/strong&gt; so the model can use it.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Indexing process&quot; src=&quot;https://ruxu.dev/articles/ai/build-a-basic-rag/assets/images/basic-rag/indexing.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;Once the information is in text format, we can &lt;strong&gt;store it in a vector database&lt;/strong&gt;. The vector database stores the embedding representation of that text, which lets us search for parts of the text whose embedding representation is similar to that of another text, and which are therefore about a similar concept. We will &lt;strong&gt;divide the whole text into smaller parts or chunks&lt;/strong&gt;, calculate the &lt;strong&gt;embedding representation&lt;/strong&gt; for each of them, and finally &lt;strong&gt;store them in the vector database&lt;/strong&gt;.&lt;/p&gt;
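&lt;p&gt;As a toy sketch of the indexing step: the letter-frequency “embedding” and the in-memory list below are stand-ins for a real embeddings model and vector database, used only to make the flow concrete:&lt;/p&gt;

```python
def embed(text):
    # Toy "embedding": a letter-frequency vector. A real pipeline would
    # call an embeddings model (e.g. via an API or a local model) here.
    vec = [0.0] * 26
    for ch in text.lower():
        idx = ord(ch) - ord("a")
        if idx in range(26):
            vec[idx] += 1.0
    return vec

def chunk(text, size):
    # Split the text into fixed-size character chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]

# "Vector database": a plain list of (chunk, embedding) pairs.
index = [(c, embed(c)) for c in chunk("RAG stores text as vectors for retrieval.", 16)]
print(len(index))  # prints 3
```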
&lt;h4 id=&quot;retrieval&quot; tabindex=&quot;-1&quot;&gt;Retrieval&lt;/h4&gt;
&lt;p&gt;When the user asks us a question, we can &lt;strong&gt;convert that question into a vector representation&lt;/strong&gt;, using the same embeddings model we used to index the data. With that vector representation, we will &lt;strong&gt;calculate the similarity between the question and each one of the chunks&lt;/strong&gt; stored in the vector database. We then select the top K chunks that are most similar to the query; their contents cover the same concepts as the question, so they might contain the answer.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Retrieval process&quot; src=&quot;https://ruxu.dev/articles/ai/build-a-basic-rag/assets/images/basic-rag/retrieval_diagram.png&quot; /&gt;
&lt;/figure&gt;
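&lt;p&gt;The retrieval step can be sketched as a cosine-similarity top-K search over the stored vectors. The tiny hand-made index below is illustrative only; a real vector database does this search efficiently at scale:&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k):
    # Rank stored chunks by similarity to the query vector and
    # return the k best matches.
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    return [text for score, text in sorted(scored, reverse=True)[:k]]

# Tiny hand-made index of (chunk, vector) pairs for illustration.
index = [("about cats", [1.0, 0.0]),
         ("about dogs", [0.0, 1.0]),
         ("cats and dogs", [0.7, 0.7])]
print(top_k([1.0, 0.1], index, 2))  # ['about cats', 'cats and dogs']
```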
&lt;h4 id=&quot;generation&quot; tabindex=&quot;-1&quot;&gt;Generation&lt;/h4&gt;
&lt;p&gt;A &lt;strong&gt;prompt&lt;/strong&gt; is built, putting together the &lt;strong&gt;user’s question and the relevant contexts&lt;/strong&gt; to help the LLM answer. We might also include previous messages from the conversation between the user and the AI assistant. The LLM then generates an answer for the user based on the context, instead of relying only on its pre-training data.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Generation process&quot; src=&quot;https://ruxu.dev/articles/ai/build-a-basic-rag/assets/images/basic-rag/generation_diagram.png&quot; /&gt;
&lt;/figure&gt;
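&lt;p&gt;A minimal sketch of the prompt-building step, assuming a simple instruction-plus-context template (the exact wording of the template is an assumption; tune it for your model and use case):&lt;/p&gt;

```python
def build_prompt(question, contexts, history=None):
    # Assemble the generation prompt from retrieved contexts, optional
    # prior conversation turns, and the user's question.
    parts = ["Answer the question using only the context below.", "Context:"]
    parts.extend("- " + c for c in contexts)
    if history:
        parts.append("Conversation so far:")
        parts.extend(history)
    parts.append("Question: " + question)
    return "\n".join(parts)

prompt = build_prompt(
    "What are the drawbacks of Naive RAG?",
    ["Naive RAG struggles with retrieval precision.",
     "It may hallucinate unsupported facts."],
)
print(prompt)
```

&lt;p&gt;This string would then be sent to the LLM as the final prompt, possibly as the user turn of a chat-formatted request.&lt;/p&gt;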
&lt;h3 id=&quot;example&quot; tabindex=&quot;-1&quot;&gt;Example&lt;/h3&gt;
&lt;p&gt;For this example, we will ingest a paper called “Retrieval-Augmented Generation for Large Language Models: A Survey”. We will query the LLM using the information contained in this paper, so it can answer the user’s questions on its contents. You can follow this example in the &lt;a href=&quot;https://colab.research.google.com/drive/1mFmPN0GBHpS-kMDMuU8EDrWu1KENy69e?usp=sharing&quot; target=&quot;_blank&quot;&gt;Google Colab notebook provided for this article&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;First, we will load the PDF document and parse it using LangChain’s PyPDF connector.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Load document using pypdf&quot; src=&quot;https://ruxu.dev/articles/ai/build-a-basic-rag/assets/images/basic-rag/load_doc.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;Once we have the text from the document, we have to split it into smaller chunks. We can use one of LangChain’s available splitters, like RecursiveCharacterTextSplitter in this case:&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Split document into chunks&quot; src=&quot;https://ruxu.dev/articles/ai/build-a-basic-rag/assets/images/basic-rag/split_doc.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;We will be using BGE-small, an open-source embeddings model. We will download it from HuggingFace Hub and run it on all chunks to calculate their vector representations.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Calculate embeddings&quot; src=&quot;https://ruxu.dev/articles/ai/build-a-basic-rag/assets/images/basic-rag/embeddings.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;Once we have the vector representations for all chunks, we can create an in-memory vector database and store all vectors in it. For this example, we will be using a FAISS database.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Load embeddings into the vector db&quot; src=&quot;https://ruxu.dev/articles/ai/build-a-basic-rag/assets/images/basic-rag/vectordb.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;The database is now set up. Now, we will take queries from the user about this information. In this case, the user asks what the drawbacks of Naive RAG are. We encode this query using the same embeddings model as before. Then, we retrieve the top 5 chunks most similar to that query.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Retrieve documents similar to the query from the vectordb&quot; src=&quot;https://ruxu.dev/articles/ai/build-a-basic-rag/assets/images/basic-rag/retrieval.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;After retrieving the relevant context, we build a prompt using this information and the user’s original query. We will use Claude 3 Haiku as the LLM for this example:&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Use the context and the query to generate an answer&quot; src=&quot;https://ruxu.dev/articles/ai/build-a-basic-rag/assets/images/basic-rag/generation.png&quot; /&gt;
&lt;/figure&gt;
&lt;h3 id=&quot;common-problems-and-pitfalls&quot; tabindex=&quot;-1&quot;&gt;Common problems and pitfalls&lt;/h3&gt;
&lt;p&gt;As the title implies, this solution is a basic or naïve RAG implementation. It will empower your application to make the most out of the LLM it’s using and your data. But it won’t work for all cases. These are just some of the most common problems with RAG:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Retrieve irrelevant information.&lt;/strong&gt; If the retriever gets data from the vector database that is not relevant to the question, it will confuse the model when it tries to answer. This might lead to either not using the context to answer the question, or answering something different than what was asked.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Miss important information.&lt;/strong&gt; Maybe the information needed to answer the question is not in the database. Maybe the retrieval mechanism fails to find the relevant chunks. We must find ways to help the retriever find the information it needs more easily and reliably.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generate responses not supported by the context.&lt;/strong&gt; If the context has the information the model needs, but it doesn’t use it and instead relies on its own pre-training data, all this was for nothing. The information from the pre-training data might be outdated or wrong. We must push the model to always answer from the context, or to answer “I don’t know” if the context doesn’t contain the answer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Irrelevant response to the query.&lt;/strong&gt; The LLM might use all the information that you give it to generate a response, but that doesn’t mean that it answers the user’s question. It’s important that the model sticks to the user’s original question, instead of getting lost in a ton of information.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redundant response caused by similar contexts.&lt;/strong&gt; When we ingest multiple documents with similar information, there’s a chance that the retriever will get multiple chunks of information that say almost the same. This might cause the LLM to repeat the same information more than one time in its response.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;how-to-avoid-these-problems&quot; tabindex=&quot;-1&quot;&gt;How to avoid these problems?&lt;/h3&gt;
&lt;p&gt;To avoid these problems, a naïve RAG pipeline might not be enough. We will need to set up a more advanced and complex RAG system. There exist tested techniques to solve the problems we have laid out. We can incorporate them into our RAG pipeline to improve the RAG application’s performance.&lt;/p&gt;
&lt;p&gt;Another important point to address is that, to improve your RAG application, you will need to be able to measure and evaluate the whole process. You can’t improve what you can’t measure. Plus, when you evaluate you might find that a basic RAG setup is enough for your use case, and you don’t need to overcomplicate it. After all, even a very basic RAG implementation can improve your LLM powered application enormously.&lt;/p&gt;
&lt;p&gt;In future articles, I will explain in more detail the advanced RAG techniques that will help us avoid common problems and bring our RAG applications to the next level.&lt;/p&gt;
</description>
      <pubDate>Wed, 29 May 2024 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/build-a-basic-rag/</guid>
    </item>
    <item>
      <title>How to use LLMs: Summarize long documents</title>
      <link>https://ruxu.dev/articles/ai/summarize-long-documents/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;Collaboration&quot; src=&quot;https://ruxu.dev/articles/ai/summarize-long-documents/assets/images/summarize-long-doc/written_text.jpg&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;One of the tasks Generative AI models are good at is summarizing texts. However, if you want to summarize a text, you need the whole text to fit in the context window, plus the command to summarize it. The following is an example prompt (you can test it yourself using the &lt;a href=&quot;https://colab.research.google.com/drive/1z5wwBLOag7TK5ylimSuosEQQdjkBswYu?usp=sharing&quot; target=&quot;_blank&quot;&gt;Google Colab notebook for this article&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Write a concise summary of the following text:
The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine&#39;s ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation was a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel, such as a computer keyboard and screen, so the result would not depend on the machine&#39;s ability to render words as speech. If the evaluator could not reliably tell the machine from the human, the machine would be said to have passed the test. The test results would not depend on the machine&#39;s ability to give correct answers to questions, only on how closely its answers resembled those a human would give. Since the Turing test is a test of indistinguishability in performance capacity, the verbal version generalizes naturally to all of human performance capacity, verbal as well as nonverbal (robotic).

Summary: 
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
&lt;img alt=&quot;Summarization with direct use&quot; src=&quot;https://ruxu.dev/articles/ai/summarize-long-documents/assets/images/summarize-long-doc/direct-use.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;This might get tricky if the text to summarize is too long. There exist LLMs with very large context windows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPT-4 has a variant with a context window of 32k tokens.&lt;/li&gt;
&lt;li&gt;Claude 3 Opus has a context window of 200k tokens.&lt;/li&gt;
&lt;li&gt;Gemini 1.5 Pro can have a context window of up to 1 million tokens.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, sometimes using a LLM with such a large context window is off the table. There are many factors that can make this the case:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Those models might be too expensive.&lt;/li&gt;
&lt;li&gt;Models sometimes have a hard time using all the information in very long prompts.&lt;/li&gt;
&lt;li&gt;Restrictions, like only being able to use open-source models, might rule out all the models with large context windows.&lt;/li&gt;
&lt;li&gt;Your text might be longer than the context window of any available model (yes, even the 1M-token ones).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Luckily, there exists a technique that can get an LLM to summarize a document longer than its context window size. The technique is called MapReduce. It’s based on dividing the text into a collection of smaller texts that do fit in the context window and then summarizing each part separately. The steps to perform a MapReduce summary are the following (again, you can execute and see the whole process for yourself in the &lt;a href=&quot;https://colab.research.google.com/drive/1z5wwBLOag7TK5ylimSuosEQQdjkBswYu?usp=sharing&quot; target=&quot;_blank&quot;&gt;Google Colab notebook for this article&lt;/a&gt;):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First, the long document is divided into chunks using a text splitter. The strategy we use to divide the text can be different depending on the type of document.
&lt;ul&gt;
&lt;li&gt;If the text has sections and all sections are smaller than the context window, we could divide it by those sections.&lt;/li&gt;
&lt;li&gt;If the text has no clear sections or the sections are too large, the text can be divided into equal-sized chunks of characters. This last approach has the problem of separating related sentences into different chunks. To avoid this problem, we can have an overlap between chunks. With this solution, the last N characters of a chunk are repeated as the first N characters of the next chunk, so context is not lost.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;figure&gt;
&lt;img alt=&quot;Summarization with direct use&quot; src=&quot;https://ruxu.dev/articles/ai/summarize-long-documents/assets/images/summarize-long-doc/chunks.png&quot; /&gt;
&lt;/figure&gt;
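&lt;p&gt;The overlapping character-wise splitting described above can be sketched as:&lt;/p&gt;

```python
def split_with_overlap(text, chunk_size, overlap):
    # Fixed-size character chunks; each chunk repeats the last
    # `overlap` characters of the previous one, so sentences that
    # straddle a chunk boundary keep their surrounding context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

&lt;p&gt;In practice you would split on characters counted against the model’s tokenizer (or directly on tokens), but the sliding-window idea is the same.&lt;/p&gt;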
&lt;ul&gt;
&lt;li&gt;Second, all chunks are summarized separately using the LLM. Remember to tune your summarization prompt to help the model understand what kind of document it is and how to summarize it properly. For example, you might want the summary to be a bullet-point list of the main points of the text, or just a few sentences.&lt;/li&gt;
&lt;/ul&gt;
&lt;figure&gt;
&lt;img alt=&quot;Map: Summarization of all chunks&quot; src=&quot;https://ruxu.dev/articles/ai/summarize-long-documents/assets/images/summarize-long-doc/map.png&quot; /&gt;
&lt;/figure&gt;
&lt;ul&gt;
&lt;li&gt;The third step is optional. If the combination of all summaries does not fit into the context window, we won’t be able to ask for a consolidated summary. Instead, we need to make the combination of all summaries smaller.
&lt;ul&gt;
&lt;li&gt;We will group the summaries in groups that fit into the context window.&lt;/li&gt;
&lt;li&gt;Then, we will write a prompt that combines all the summaries to a single summary with the key ideas.&lt;/li&gt;
&lt;li&gt;After reducing all the summary groups, if the resulting combined summaries still don&#39;t fit in the context window, this process is executed again.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;figure&gt;
&lt;img alt=&quot;Combine: Consolidate summaries&quot; src=&quot;https://ruxu.dev/articles/ai/summarize-long-documents/assets/images/summarize-long-doc/combine.png&quot; /&gt;
&lt;/figure&gt;
&lt;ul&gt;
&lt;li&gt;Lastly, when all summaries fit into the context window, we can finally combine or reduce them into the final summary.
&lt;ul&gt;
&lt;li&gt;We will write a prompt that combines all the summaries into a single final summary with all the key ideas. This prompt is usually the same as the one in step 3, since the idea is basically the same: combine a list of summaries into one shorter summary. However, you might need the freedom to let the two prompts diverge to accommodate specific needs for special kinds of documents or summaries.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;figure&gt;
&lt;img alt=&quot;Reduce: consolidate all summaries into a single final summary&quot; src=&quot;https://ruxu.dev/articles/ai/summarize-long-documents/assets/images/summarize-long-doc/reduce.png&quot; /&gt;
&lt;/figure&gt;
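&lt;p&gt;The whole MapReduce loop can be sketched as follows. A toy truncating function stands in for the real LLM summarization calls, and lengths are measured in characters rather than tokens; both are assumptions to keep the sketch self-contained:&lt;/p&gt;

```python
def summarize(text):
    # Stand-in for an LLM call: a real pipeline would prompt the model
    # to summarize `text` here. This toy version just truncates.
    return text[:20]

def map_reduce_summary(document, context_window=60, chunk_size=50):
    # Map: split the document and summarize each chunk separately.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    summaries = [summarize(c) for c in chunks]
    combined = " ".join(summaries)
    # Collapse (optional step 3): while the combined summaries overflow
    # the context window, group them, reduce each group, and repeat.
    while len(combined) > context_window:
        grouped = [" ".join(summaries[i:i + 2]) for i in range(0, len(summaries), 2)]
        summaries = [summarize(g) for g in grouped]
        combined = " ".join(summaries)
    # Reduce: one final summarization over the combined summaries.
    return summarize(combined)

print(len(map_reduce_summary("x" * 500)))  # prints 20
```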
&lt;p&gt;And that’s it! You now have a short summary of the most important points of a large document. But before you start processing your whole documentation, there are a few important notes you need to consider:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This MapReduce method might not be less expensive than using an LLM with a large context window, especially if your cost is per token (as it is in all Model-as-a-service LLMs). Using this method, you will still use all the tokens in the document, plus the tokens of the intermediate summaries (as output and then as input of intermediate prompts). I recommend you study first which method will be more costly for your use case.&lt;/li&gt;
&lt;li&gt;While better models will produce better summaries, the summarization capabilities of not-so-good models will be good enough for most cases. The result after many layers of summarization will be of similar quality, so you might be able to get away with using cheaper models if you decide to use this method.&lt;/li&gt;
&lt;/ul&gt;
</description>
      <pubDate>Mon, 29 Apr 2024 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/summarize-long-documents/</guid>
    </item>
    <item>
      <title>Understanding LLMs: Mixture of Experts</title>
      <link>https://ruxu.dev/articles/ai/mixture-of-experts/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;Collaboration&quot; src=&quot;https://ruxu.dev/articles/ai/mixture-of-experts/assets/images/collaboration.jpg&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;Unlike the Transformers architecture, Mixture of Experts is not a new idea. Still, it is the latest hot topic in Large Language Model architecture. This architecture has been rumored to power OpenAI&#39;s GPT-4 (and maybe GPT-3.5-turbo) and is the backbone of Mistral&#39;s Mixtral 8x7B, Grok-1 and Databricks&#39; DBRX, which rival or even surpass GPT-3.5 with a relatively smaller size. Follow along to learn more about how this kind of architecture works and why it leads to such great results for LLMs.&lt;/p&gt;
&lt;h3 id=&quot;architecture&quot; tabindex=&quot;-1&quot;&gt;Architecture&lt;/h3&gt;
&lt;p&gt;A Mixture of Experts is a model with a sparse layer and a router. The experts reside in the sparse layer, and they are independent models with no connections between them. Each expert specializes in a specific task. The router is a gating mechanism that learns and decides which expert is best equipped to deal with the input. The simplicity of this concept allows the architecture to work with any type of model. In this article we will focus on Transformers, where the experts are feed-forward networks, but they might as well be RNNs, SVMs or even Linear Regression models. Another possibility is hierarchical experts, which use multiple routers at different levels.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Mixture of Experts Architecture&quot; height=&quot;600&quot; src=&quot;https://ruxu.dev/articles/ai/mixture-of-experts/assets/images/moe.svg&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;The big advantage of this kind of architecture is conditional computation. Every single inference doesn’t need to use all the model’s weights. The gating mechanism is trained to choose the top k experts and route the input only to those. This choice also has a degree of random noise, which prevents overloading the most popular experts and ensures that other experts are also trained on all kinds of data.&lt;/p&gt;
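&lt;p&gt;A minimal sketch of the noisy top-k gating described above, with illustrative logits and noise scale (real routers are learned layers producing these scores from the token&#39;s hidden state):&lt;/p&gt;

```python
import math
import random

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_logits, num_active, noise=0.1):
    # The router scores every expert, adds a little random noise so
    # popular experts are not always picked, and keeps the top-k.
    noisy = [g + random.gauss(0.0, noise) for g in gate_logits]
    ranked = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)
    chosen = ranked[:num_active]
    # The chosen experts' outputs are weighted by a softmax over
    # their (noisy) gate scores.
    weights = softmax([noisy[i] for i in chosen])
    return list(zip(chosen, weights))

# 8 experts, 2 active per token (the Mixtral / Grok-1 configuration).
random.seed(0)
selection = route([0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.9, 0.4], num_active=2)
print(selection)
```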
&lt;h3 id=&quot;history&quot; tabindex=&quot;-1&quot;&gt;History&lt;/h3&gt;
&lt;p&gt;The first sentence of this article stated that Mixture of Experts is not a recent idea. In fact, it was first proposed in 1991 in the paper Adaptive Mixture of Local Experts. In that paper, the authors proposed that when a model had to perform different tasks, it was beneficial to have different experts with decoupled weights, so that no expert was affected by other experts fitting their weights to their own tasks.&lt;/p&gt;
&lt;p&gt;Even though the idea is old, the Mixture of Experts architecture benefits greatly from today’s computing power and horizontal scaling. MoE models can easily be distributed between multiple devices. Since not all weights of the model activate on each inference, each expert can be placed on a different device, which frees up the devices holding other experts to handle other tasks in parallel.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Mixture of Experts Communication Example&quot; src=&quot;https://ruxu.dev/articles/ai/mixture-of-experts/assets/images/moe-comm.svg&quot; /&gt;
&lt;/figure&gt;
&lt;h3 id=&quot;how-many-experts-should-a-model-have&quot; tabindex=&quot;-1&quot;&gt;How many experts should a model have?&lt;/h3&gt;
&lt;p&gt;When we train a Mixture of Experts model, we expect each expert to learn and become proficient at specific tasks. Experts do seem to specialize in handling specific inputs. For example, in a language model, experts tend to divide their expertise between handling nouns, verbs, punctuation, numbers and counting, etc. However, they don’t specialize along other lines that we would consider obvious. When we train a MoE model on a multilingual corpus, different experts don’t learn different languages; they all seem to try to learn all of them.&lt;/p&gt;
&lt;p&gt;A crucial decision when designing a Mixture of Experts model is the number of experts it will have. Normally, more experts mean more efficiency, since a smaller part of the whole model will be used for each inference. However, there are some caveats. The advantages of adding another expert diminish the more experts we have; 4 to 16 experts seem to be a sweet spot. Also, even though a MoE doesn’t use all its weights on every inference, which reduces compute time, it must still hold all the weights in VRAM. Looking at some popular models, DBRX has 16 experts (4 active on any inference), while Mixtral and Grok have 8 (2 active).&lt;/p&gt;
&lt;h3 id=&quot;fine-tuning-moe&quot; tabindex=&quot;-1&quot;&gt;Fine-tuning MoE&lt;/h3&gt;
&lt;p&gt;A particular problem with Mixture of Experts models is that they are hard to fine-tune. MoEs are very prone to overfitting. After fine-tuning, they are bad at reasoning tasks but still good at knowledge tasks. One way to mitigate this is to reduce the number of experts, as fewer experts lead to better fine-tuning. Also, a recent study has shed some hope on MoE fine-tuning: it had great success at fine-tuning a Flan MoE, suggesting that MoEs might benefit from instruction fine-tuning.&lt;/p&gt;
&lt;h3 id=&quot;scaling-moe&quot; tabindex=&quot;-1&quot;&gt;Scaling MoE&lt;/h3&gt;
&lt;p&gt;On the other hand, Mixture of Experts models are great for high-throughput scenarios, as opposed to dense models. MoEs can be scaled with many techniques.&lt;/p&gt;
&lt;p&gt;A paper by Google named GShard explored solving device underutilization to successfully scale a MoE horizontally across many devices. They replicated all non-MoE layers between all devices, but MoE layers had a different expert on each device. They also introduced the concept of expert capacity, which is the maximum number of tokens an expert can take before it is considered overflowed, after which the next expert in line takes over.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Mixture of Experts Capacity Factor Example&quot; src=&quot;https://ruxu.dev/articles/ai/mixture-of-experts/assets/images/moe_capacity_factor.svg&quot; /&gt;
&lt;/figure&gt;
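&lt;p&gt;Expert capacity can be sketched as follows. This is a simplified toy following the article&#39;s description, where an overflowing token falls through to the next expert in line; real systems handle overflow in various ways, and the token and capacity values here are illustrative:&lt;/p&gt;

```python
def dispatch(tokens, num_experts, capacity):
    # Assign each token to its preferred expert until that expert is
    # full (has reached its capacity); overflowing tokens fall through
    # to the next expert in line.
    load = [0] * num_experts
    assignment = {}
    for token, preferred in tokens:
        for offset in range(num_experts):
            candidate = (preferred + offset) % num_experts
            if capacity > load[candidate]:
                load[candidate] += 1
                assignment[token] = candidate
                break
    return assignment

# 4 tokens all prefer expert 0, but each expert only has capacity 2,
# so the overflow spills onto expert 1.
result = dispatch([("t0", 0), ("t1", 0), ("t2", 0), ("t3", 0)],
                  num_experts=2, capacity=2)
print(result)  # {'t0': 0, 't1': 0, 't2': 1, 't3': 1}
```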
&lt;p&gt;Another paper, named Switch Transformers, looked at techniques to reduce communication costs between devices and reduce training instabilities. To optimize parallelism, they proposed using a single-expert approach and reducing the capacity factor so that tokens are divided almost equally between the experts (with some small wiggle room for over-choosing a specific expert). Switch Transformers also proposed using bfloat16 precision only for expert layers and full precision for other layers. This stabilizes training, as layers like the router need better precision due to an exponentiating function, while still reducing communication costs between experts.&lt;/p&gt;
&lt;h3 id=&quot;optimizing-moe&quot; tabindex=&quot;-1&quot;&gt;Optimizing MoE&lt;/h3&gt;
&lt;p&gt;Mixture of Experts models can also be optimized through different means. Distillation of a sparse model into a dense model keeps 30% of the sparsity gains while being much smaller in total model size. Another technique is aggregation of MoE, which merges the weights of all experts into one model that still performs very well on all tasks. Also, QMoE is a quantization technique that can store 1.6 trillion parameters in less than 160GB (0.8 bits per parameter!).&lt;/p&gt;
&lt;h3 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;In conclusion, given that there’s a need today for models that perform a multitude of different tasks for millions of people (think ChatGPT or similar products), MoE’s excellence in high-throughput, distributed scenarios shines. Being training- and inference-efficient also means lower costs and faster innovation. Of course, not everything is great; there are some drawbacks. Being hard to fine-tune is a problem, as is needing a lot of VRAM to operate. What is certain is that we will keep seeing better techniques to optimize sparse models, and they will lead to better LLMs.&lt;/p&gt;
</description>
      <pubDate>Sat, 30 Mar 2024 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/mixture-of-experts/</guid>
    </item>
    <item>
      <title>What to Expect for AI in 2024?</title>
      <link>https://ruxu.dev/articles/ai/what-to-expect-ai-2024/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;AI&quot; src=&quot;https://ruxu.dev/articles/ai/what-to-expect-ai-2024/assets/images/ai.jpg&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;2023 was a great year for AI. Large Language Models were already in the spotlight for both users and businesses. ChatGPT had been just released in late 2022 and was taking the world by storm. Still, 2023 has brought more rapid change in the field than we could have imagined. This last year we got the newest version of OpenAI’s model, GPT-4. We also got a ton of open-source models competing with OpenAI, like LLaMa, Falcon, and Mistral. Google didn’t want to miss the party and stepped up its game unveiling Gemini, the successor to Google’s earlier model, PaLM. Anthropic launched Claude, AI21 Labs launched Jurassic-2 and Amazon announced its Generative AI service, Bedrock, as well as their own LLM, Titan. Some LLMs gained a ton of new functionality by going multimodal, as well as embracing agents, which allow them to have up-to-date information and interact with the world around them.&lt;/p&gt;
&lt;p&gt;Also, regulators are beginning to catch up to the rapid evolution of this new technology. The United States of America passed an Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, and the European Union approved the EU AI Act, the first actual regulation on artificial intelligence.&lt;/p&gt;
&lt;p&gt;So, what’s next for 2024? Given the unpredictably fast rise of new AI technology in the last 14 months, any prediction that we make on its evolution in the future will probably be ridiculed by reality. However, an educated guess can be made given the current direction of AI. I invite you to hear about my take on AI in 2024 and of course, take it with a grain of salt.&lt;/p&gt;
&lt;h3 id=&quot;agi&quot; tabindex=&quot;-1&quot;&gt;AGI&lt;/h3&gt;
&lt;p&gt;As new LLMs begin to approach the trillion parameters mark, new architecture ideas improve on the transformer architecture, and AI’s reasoning abilities continue to improve, I predict that we will begin to hear some Artificial General Intelligence claims. Artificial General Intelligence is the highly anticipated milestone when AI will be able to perform any intellectual task that a human could do. While this kind of artificial intelligence is still very far from being a reality, it seems like it will be inevitable. With current AIs improving exponentially, it&#39;s just a matter of time until we begin to hear claims that a top LLM has reached AGI level. While this does not mean that we are there yet, it’s a clear step that we are on the path to that very realistic future.&lt;/p&gt;
&lt;h3 id=&quot;open-source-models&quot; tabindex=&quot;-1&quot;&gt;Open-source models&lt;/h3&gt;
&lt;p&gt;HuggingFace is a platform that makes it easy to share and contribute to models and datasets. It’s an incredible resource for anyone who wants to start using AI models right away, without having to invest time and money in training their own model from scratch. HuggingFace also provides a ton of resources for working with models and easily fine-tuning them to adapt any model to the user’s needs. In my opinion, these kinds of resources are priceless to both individuals and enterprises that want to get into using AI. And given that every day more individuals and businesses introduce AI as part of their workflows and products, it is likely that this kind of platform will grow even further than it already has, providing tons of new open-source models and AI resources.&lt;/p&gt;
&lt;h3 id=&quot;multimodality&quot; tabindex=&quot;-1&quot;&gt;Multimodality&lt;/h3&gt;
&lt;p&gt;Some LLMs, like ChatGPT and Google’s Gemini, have already unveiled multimodal capabilities. Multimodal models are Large Language Models that can understand and interact with a combination of different types of inputs, like text, audio, images, and video. Since LLMs convert text and images to a vector representation to interpret them correctly, it&#39;s quite challenging to find a vector representation that works for both text and images, so this is an incredible feat. Having multimodal capabilities broadens the list of tasks LLMs can perform by a large margin. This year, I expect even more models to embrace multimodality and the capabilities of AI agents to increase significantly.&lt;/p&gt;
&lt;h3 id=&quot;responsible-ai&quot; tabindex=&quot;-1&quot;&gt;Responsible AI&lt;/h3&gt;
&lt;p&gt;As regulators start to intensify legislation on artificial intelligence, Responsible AI will not just be a choice in developing new AI models, but a must. Responsible AI is an approach to training AI models with ethics and safety in mind. Generative AI models should be trained to be fair and unbiased, as well as to avoid harmful responses and avoid endangering human life, property, or the environment. I expect not only to see an improvement in Responsible AI frameworks and expectations but also to witness the first legal consequences for enterprises not following those new rulings.&lt;/p&gt;
&lt;h3 id=&quot;ai-use-in-enterprises&quot; tabindex=&quot;-1&quot;&gt;AI use in enterprises&lt;/h3&gt;
&lt;p&gt;As using LLMs becomes easier and more democratized, the barrier to entry for tons of enterprises that are not primarily technologically focused will begin to fall. More and more kinds of companies will begin to use AI to aid in their daily operations. We will see more AI B2B products to address these new necessities. Some companies may just start using AI for the hype of it but will fail to implement it in a meaningful manner for their bottom line. On the other hand, companies that succeed in introducing AI in their daily operations will see an unprecedented boost in productivity. Sadly, this also might mean that we will see some companies lay off an important part of their workforce, as employees are quickly replaced by AI systems.&lt;/p&gt;
&lt;h3 id=&quot;ai-race&quot; tabindex=&quot;-1&quot;&gt;AI race&lt;/h3&gt;
&lt;p&gt;The potential of artificial intelligence is no longer a secret. Many countries around the world will not only begin to legislate for AI but also invest heavily in it, as being at the forefront of AI innovation can give them an edge in many areas, including military, scientific, economic, and informational. Even if one specific part of the world advocates for AI ethics and halts AI progress to ensure it does not become a threat to humanity, other parts of the world might take this opportunity to put themselves at the front of innovation, which will inevitably make the former’s efforts worthless. We might begin to witness a new AI “space race”, as countries put pressure on each other to never stop investing resources in developing new AI, under the threat of falling behind in the new future that AI is shaping.&lt;/p&gt;
&lt;p&gt;Even if my predictions don’t end up becoming a reality, what is certain is that AI will continue to be a major topic in 2024. I’m excited to see how AI will evolve during the following year and what new capabilities it will bring. Certainly, there are many concerns to be had with the rapid evolution of AI, but I’m hopeful that we will be responsible enough to address them before they do become a problem. As a personal note, this year I will be testing how to use AI in many of my daily activities to make them easier or even completely automate them.&lt;/p&gt;
</description>
      <pubDate>Mon, 29 Jan 2024 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/what-to-expect-ai-2024/</guid>
    </item>
    <item>
      <title>How to supercharge your LLM with Langchain Agents</title>
      <link>https://ruxu.dev/articles/ai/langchain-agents/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;OpenAI&quot; src=&quot;https://ruxu.dev/articles/ai/langchain-agents/assets/images/parrot.jpg&quot; /&gt;
&lt;/figure&gt;
&lt;h2 id=&quot;the-problem-with-llms&quot; tabindex=&quot;-1&quot;&gt;The problem with LLMs&lt;/h2&gt;
&lt;p&gt;LLMs are very capable of performing many feats that seem incredible to us, but &lt;strong&gt;they are bound by the borders of what Generative AI was originally meant to do: generate text based on the data it has been trained on.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;They cannot access any services on the web to give us more accurate and recent answers; instead, they must answer using the training data that their creators gathered, possibly years ago. They cannot access tools as simple as a calculator, which would let them compute mathematical answers instead of having to memorize the result of every mathematical operation. And they rely on us to perform the actions they suggest in the real world and to report back with the results.&lt;/p&gt;
&lt;p&gt;To give our LLMs the powers they are missing to be truly powerful for us, we can use &lt;strong&gt;Agents&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id=&quot;what-are-agents&quot; tabindex=&quot;-1&quot;&gt;What are Agents?&lt;/h2&gt;
&lt;p&gt;An Agent is an LLM that is prompted to &lt;strong&gt;reason about the actions needed to complete a request&lt;/strong&gt;, using a set of &lt;strong&gt;tools&lt;/strong&gt; that it has been provided with. An agent can be used alongside &lt;strong&gt;any LLM&lt;/strong&gt;. It is only a layer on top of it that builds the prompts indicating to the Agent the &lt;strong&gt;context, its personality and the strategies it must use&lt;/strong&gt; to complete requests.&lt;/p&gt;
&lt;p&gt;Obviously, the Agent by itself is not enough. We also need to create or reuse tools for the Agent to call. The &lt;strong&gt;Agent Executor&lt;/strong&gt; is the runtime that executes both the Agent and the tools that it uses. &lt;strong&gt;Tools&lt;/strong&gt; can be standalone functions, or they can come in collections called &lt;strong&gt;toolkits&lt;/strong&gt;.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Agent Executor, Agents and Tools diagram&quot; src=&quot;https://ruxu.dev/articles/ai/langchain-agents/assets/images/LangChainAgents.svg&quot; /&gt;
&lt;/figure&gt;
&lt;h2 id=&quot;tools-and-toolkits&quot; tabindex=&quot;-1&quot;&gt;Tools and toolkits&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Tools are functions&lt;/strong&gt; that perform actions on behalf of the LLM. An agent gets a list of tools it can use, and it will request to use one, several, or none of them. The Agent Executor will execute the required tools and feed the results back to the Agent. An example of a tool is the Google Search function, which allows LLMs to check information that they don&#39;t have by performing a Google search.&lt;/p&gt;
&lt;p&gt;For the Agent to be able to choose correctly which tools it needs, &lt;strong&gt;those tools have to exist, be correctly implemented for the purpose that the Agent needs them for, and be in the list of tools for the Agent&lt;/strong&gt;. But more importantly, these tools &lt;strong&gt;must be thoroughly described&lt;/strong&gt; so Agents can easily decide if they need them and what they will be helpful for.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Toolkits are just sets of tools&lt;/strong&gt; that are usually useful together. These tools might be combined to achieve a multi-step goal, or they might be grouped because they perform similar actions or actions in the same domain. An example of a toolkit is the Gmail Toolkit, which allows LLMs to read emails, draft new emails and delete them.&lt;/p&gt;
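&lt;p&gt;To make the relationship between the Agent, the Agent Executor, and its tools concrete, here is a minimal sketch of an executor loop in plain Python. This is only an illustration, not Langchain&#39;s actual API: the tools are stubbed out, and the &lt;code&gt;agent_decide&lt;/code&gt; function hardcodes a plan where a real agent would query an LLM against the tool descriptions.&lt;/p&gt;

```python
# Minimal sketch of an agent executor loop (illustrative, not Langchain's API).
# Tools are plain functions registered by name.

def search(query):
    """Look up a fact (stubbed out for this example)."""
    return "West Newbury, Massachusetts"

def fahrenheit_to_celsius(value):
    """Convert a Fahrenheit temperature to Celsius."""
    return round((value - 32) * 5 / 9, 1)

TOOLS = {"search": search, "fahrenheit_to_celsius": fahrenheit_to_celsius}

def agent_decide(observation):
    # A real agent would be an LLM choosing the next action from the
    # tool descriptions; here we hardcode a plan for illustration.
    if observation is None:
        return ("search", "City where John Cena was born")
    return ("finish", observation)

def run_executor():
    observation = None
    for _ in range(10):  # safety cap on the number of reasoning steps
        action, arg = agent_decide(observation)
        if action == "finish":
            return arg
        # The executor runs the chosen tool and feeds the result back
        observation = TOOLS[action](arg)
    return observation

print(run_executor())  # West Newbury, Massachusetts
```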
&lt;h2 id=&quot;what-strategies-do-agents-use&quot; tabindex=&quot;-1&quot;&gt;What strategies do Agents use?&lt;/h2&gt;
&lt;p&gt;After a set of tools is provided to the Agent, how does it know which one to use? Agents can use a variety of prompt engineering strategies to make the LLM reason and decide about the actions it has to take. Some popular prompt engineering practices were discussed in the previous article: Prompt Engineering.&lt;/p&gt;
&lt;p&gt;The most popular strategy for agents is the &lt;strong&gt;ReAct&lt;/strong&gt; method. ReAct uses few-shot learning together with some Chain-of-Thought reasoning examples. These examples contain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Thoughts&lt;/strong&gt; transcribed from the reasoning strategies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actions&lt;/strong&gt; that let the LLM interact with its environment in a verbal manner.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observations&lt;/strong&gt; gained after taking the actions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The LLM then learns to act in this manner, interacting with its tools, and can apply the pattern to real interactions. An example of a ReAct prompt would be:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Question: What is the current temperature in the city where John Cena was born, in Celsius?
Thought: I need to search for the city John Cena was born in, then find the current temperature for that city, then convert the temperature to Celsius.
Action: Search[City where John Cena was born]
Observation: Cena was born in West Newbury, Massachusetts.
Thought: I have to find the current temperature in West Newbury, Massachusetts.
Action: WeatherLookup[West Newbury, Massachusetts]
Observation: 81°F. Sunshine to start, then a few afternoon clouds. High 81F. Winds WNW at 5 to 10 mph. 45% humidity.
Thought: I have to convert 81°F to Celsius.
Action: UnitConversion[FahrenheitToCelsius, 81]
Observation: 27.2
Thought: The answer is 27.2 degrees Celsius.
Action: Finish[27.2 degrees Celsius]
&lt;/code&gt;&lt;/pre&gt;
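&lt;p&gt;Under the hood, the Agent Executor has to parse each &lt;code&gt;Action:&lt;/code&gt; line out of the model’s output to know which tool to call and with what input. Assuming the &lt;code&gt;Tool[input]&lt;/code&gt; format shown above, that parsing step can be sketched as:&lt;/p&gt;

```python
import re

# Parse a ReAct "Action: Tool[input]" line into a (tool, input) pair.
# Assumes the Tool[input] format used in the example above.
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")

def parse_action(line):
    match = ACTION_RE.match(line.strip())
    if match is None:
        raise ValueError("Not an action line: " + line)
    return match.group(1), match.group(2)

print(parse_action("Action: Search[City where John Cena was born]"))
# ('Search', 'City where John Cena was born')
```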
&lt;h2 id=&quot;how-to-use-langchain-agents&quot; tabindex=&quot;-1&quot;&gt;How to use Langchain Agents&lt;/h2&gt;
&lt;p&gt;Langchain is a Python library (also available in JS/TS) that is very useful for rapidly getting started with integrating LLMs into your applications. It has many prompting strategies available out of the box, and also many Agents.&lt;/p&gt;
&lt;p&gt;You can get started writing tools of your own using Langchain’s tool decorator, like this (&lt;strong&gt;VERY IMPORTANT: give the function a docstring description, as this will be used by the agent to choose which tool to use&lt;/strong&gt;):&lt;/p&gt;
&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; langchain&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;agents &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; tool

&lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@tool&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;unit_conversion&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;units&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; value&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token triple-quoted-string string&quot;&gt;&quot;&quot;&quot;Converts a value from one unit to another. To convert from Fahrenheit to Celsius, give FahrenheitToCelsius as the first parameter and the value as the second parameter&quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; units &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;FahrenheitToCelsius&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;value &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;raise&lt;/span&gt; ValueError&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Invalid type of unit conversion&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Afterwards, use this tool along with your agent of choice, also with Langchain:&lt;/p&gt;
&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; langchain&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;llms &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; OpenAI
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; langchain&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;agents &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; initialize_agent&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; AgentType

&lt;span class=&quot;token comment&quot;&gt;# We will use OpenAI&#39;s GPT as the LLM&lt;/span&gt;
llm &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; OpenAI&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;temperature&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# We will create the Agent&lt;/span&gt;
agent &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; initialize_agent&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;unit_conversion&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; llm&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; agent&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;AgentType&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ZERO_SHOT_REACT_DESCRIPTION&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; verbose&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

agent&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;run&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;What is 81 degrees Fahrenheit in Celsius?&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can learn more about Langchain and its tools by visiting &lt;a href=&quot;https://python.langchain.com/docs/get_started&quot;&gt;its documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It looks like Agents are going to bring a lot more capabilities to LLMs in the near future. As these new capabilities roll out to applications and tools, more outstanding things will be possible. I’m very curious and looking forward to the new possibilities.&lt;/p&gt;
</description>
      <pubDate>Sun, 06 Aug 2023 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/langchain-agents/</guid>
    </item>
    <item>
      <title>Maximizing the Potential of LLMs: Using Vector Databases</title>
      <link>https://ruxu.dev/articles/ai/vector-stores/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;OpenAI&quot; src=&quot;https://ruxu.dev/articles/ai/vector-stores/assets/images/openai.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;LLMs use Natural Language Processing (NLP) to represent the meaning of text as a vector. This vector representation of the text is called an embedding.&lt;/p&gt;
&lt;h2 id=&quot;the-token-limit-the-llm-prompting-biggest-problem&quot; tabindex=&quot;-1&quot;&gt;The token limit: LLM prompting’s biggest problem&lt;/h2&gt;
&lt;p&gt;Currently, one of the biggest problems with LLM prompting is the &lt;em&gt;token limit&lt;/em&gt;. When GPT-3 was released, the limit for both the prompt and the output combined was 2,048 tokens. With GPT-3.5 this limit increased to 4,096 tokens. Now, GPT-4 comes in two variants. One with a limit of 8,192 tokens and another with a limit of 32,768 tokens, around 50 pages of text.&lt;/p&gt;
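&lt;p&gt;Note that tokens are not the same as words: for English text, a common rule of thumb is roughly four characters per token. A quick, heuristic way to gauge whether a prompt risks exceeding a limit (use a real tokenizer for exact counts) could look like this:&lt;/p&gt;

```python
# Rough token estimate using the ~4 characters per token rule of thumb
# for English text. This is a heuristic, not an exact tokenizer count.

def estimate_tokens(text):
    return max(1, len(text) // 4)

# About 50 pages of text at ~2,000 characters per page
pages = "x" * 2000 * 50
print(estimate_tokens(pages))  # 25000 estimated tokens
```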
&lt;p&gt;So, what can you do when you might want to do a prompt with a context larger than this limit? Of course, the only solution is to make the context shorter. But how can you make it shorter and at the same time have all the relevant information? The solution: &lt;strong&gt;store the context in a vector database and find the relevant context with a similarity search query.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;what-are-vector-embeddings&quot; tabindex=&quot;-1&quot;&gt;What are vector embeddings?&lt;/h2&gt;
&lt;p&gt;Let&#39;s start by explaining what vector embeddings are. &lt;a href=&quot;https://roycoding.com/blog/2022/embeddings.html&quot;&gt;Roy Keyes&#39; definition&lt;/a&gt; is: &amp;quot;Embeddings are learned transformations to make data more useful&amp;quot;. A neural network learns to transform text into a vector space that captures its actual meaning. This is more useful because it can find synonyms and the syntactic and semantic relationships between words. This visual helps to understand how those vectors can encode meaning:&lt;/p&gt;
&lt;figure&gt;
&lt;img alt=&quot;Word2Vec representation of Queen = King - Man + Woman&quot; src=&quot;https://ruxu.dev/articles/ai/vector-stores/assets/images/word2vec-king.png&quot; /&gt;
&lt;/figure&gt;
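&lt;p&gt;We can reproduce this arithmetic with toy vectors. The three-dimensional &amp;quot;embeddings&amp;quot; below are hand-picked for illustration (real embeddings are learned and have hundreds of dimensions), but they show how &lt;em&gt;king - man + woman&lt;/em&gt; lands closest to &lt;em&gt;queen&lt;/em&gt; under cosine similarity:&lt;/p&gt;

```python
from math import sqrt

# Toy 3-d "embeddings" chosen by hand to illustrate the king - man + woman
# arithmetic; real embeddings are learned and much higher-dimensional.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Compute king - man + woman component-wise
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the word whose vector is most similar to the result
closest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(closest)  # queen
```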
&lt;h2 id=&quot;what-do-vector-databases-do&quot; tabindex=&quot;-1&quot;&gt;What do vector databases do?&lt;/h2&gt;
&lt;p&gt;A vector database stores and indexes vector embeddings. This is useful for fast retrieval of vectors and for finding similar vectors.&lt;/p&gt;
&lt;h3 id=&quot;similarity-search&quot; tabindex=&quot;-1&quot;&gt;Similarity search&lt;/h3&gt;
&lt;p&gt;We can find the vectors most similar to a query vector by calculating its distance to all other vectors. The nearest neighbors will be the most similar results to the query vector. This is how flat indexes in vector databases work, but it is not very efficient: in a large database, this might take a very long time.&lt;/p&gt;
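&lt;p&gt;A flat index is easy to sketch in plain Python: compute the distance from the query to every stored vector and keep the closest ones. This toy version (not how a production vector database is implemented) makes the per-query cost of scanning the whole database obvious:&lt;/p&gt;

```python
from math import sqrt

# Brute-force (flat index) nearest-neighbor search: compare the query
# against every stored vector. Exact results, but O(n) work per query.

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, database, k=2):
    # Rank every stored vector by distance to the query, keep the top k
    ranked = sorted(database.items(), key=lambda item: euclidean(query, item[1]))
    return [name for name, _ in ranked[:k]]

docs = {
    "doc_a": [0.0, 0.0],
    "doc_b": [1.0, 1.0],
    "doc_c": [0.9, 1.1],
}
print(nearest([1.0, 1.05], docs))  # ['doc_b', 'doc_c']
```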
&lt;p&gt;To improve the search&#39;s performance, we can calculate the distance for only a subset of the vectors. This approach, called approximate nearest neighbors (ANN), improves speed but sacrifices quality of results. Some popular ANN indexes are Locality-Sensitive Hashing (LSH), Hierarchical Navigable Small Worlds (HNSW) and the Inverted File Index (IVF).&lt;/p&gt;
&lt;h2 id=&quot;integrating-vector-stores-and-llms&quot; tabindex=&quot;-1&quot;&gt;Integrating vector stores and LLMs&lt;/h2&gt;
&lt;p&gt;For this example, I downloaded the whole numpy documentation (with over 2000 pages) as a PDF from this URL: &lt;a href=&quot;https://numpy.org/doc/1.23/numpy-ref.pdf&quot;&gt;https://numpy.org/doc/1.23/numpy-ref.pdf&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We can write Python code to transform the context document into embeddings and save them to a vector store. We will use LangChain to load the document and split it into chunks, and Faiss (Facebook AI Similarity Search) as the vector database.&lt;/p&gt;
&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; langchain&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;embeddings&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;openai &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; OpenAIEmbeddings
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; langchain&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;text_splitter &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; CharacterTextSplitter
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; langchain&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;vectorstores &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; FAISS
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; langchain&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;document_loaders &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; PyPDFLoader

loader &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; PyPDFLoader&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;numpy-ref.pdf&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
pages &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; loader&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;load_and_split&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

embeddings &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; OpenAIEmbeddings&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

db &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; FAISS&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;from_documents&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;pages&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; embeddings&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;save_local&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;numpy_faiss_index&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, we can use this database to perform a similarity search query to find pages that might be related to our prompt. Then, we use the resulting chunks to fill the context of our prompt. We will use LangChain to make it easier:&lt;/p&gt;
&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; langchain&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;embeddings&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;openai &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; OpenAIEmbeddings
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; langchain&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;vectorstores &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; FAISS
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; langchain&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;chains&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;qa_with_sources &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; load_qa_with_sources_chain
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; langchain&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;llms &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; OpenAI

query &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;How to calculate the median of an array&quot;&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Use the same embedding model that was used to build the index&lt;/span&gt;
embeddings &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; OpenAIEmbeddings&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
db &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; FAISS&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;load_local&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;numpy_faiss_index&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; embeddings&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
docs &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;similarity_search&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;query&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

chain &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; load_qa_with_sources_chain&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;OpenAI&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;temperature&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; chain_type&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;stuff&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
chain&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;input_documents&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; docs&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;question&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; query&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; return_only_outputs&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our question for the model is &amp;quot;How to calculate the median of an array&amp;quot;. Even though the context that we give it is way over the token limit, we have overcome this limitation and got an answer:&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;To calculate the median, you can use the numpy.median() function, which takes an input array or object that can be converted to an array and computes the median along the specified axis. The axis parameter specifies the axis or axes along which the medians are computed, and the default is to compute the median along a flattened version of the array. The function returns the median of the array elements.

For example, to calculate the median of an array &quot;arr&quot; along the first axis, you can use the following code:

import numpy as np
median = np.median(arr, axis=0)

This will compute the median of the array elements along the first axis, and return the result in the variable &quot;median&quot;.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is just one clever solution for a very new problem. As LLMs keep evolving, maybe problems like this will be solved without the need for these kinds of clever solutions. However, I&#39;m sure that this evolution will open the door to new capabilities, which might need other clever solutions for the challenges that they bring.&lt;/p&gt;
</description>
      <pubDate>Sun, 16 Apr 2023 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/vector-stores/</guid>
    </item>
    <item>
      <title>Maximizing the Potential of LLMs: A Guide to Prompt Engineering</title>
      <link>https://ruxu.dev/articles/ai/maximizing-the-potential-of-llms/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;OpenAI&quot; src=&quot;https://ruxu.dev/articles/ai/maximizing-the-potential-of-llms/assets/images/openai.png&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;Language models have rapidly improved in recent years, with large language models (LLMs) such as GPT-3 and GPT-4 taking center stage. These models have become popular due to their ability to perform a great variety of tasks with incredible skill. Also, as the number of parameters of these models (in the billions!) has increased, these models have unpredictably gained new abilities.&lt;/p&gt;
&lt;p&gt;In this article, we will explore LLMs, the tasks they can perform, their shortcomings, and various prompt engineering strategies.&lt;/p&gt;
&lt;h3 id=&quot;what-are-llms&quot; tabindex=&quot;-1&quot;&gt;What are LLMs?&lt;/h3&gt;
&lt;p&gt;LLMs are neural networks that have been trained on vast amounts of text data. The training process allows the models to learn patterns in the text, including grammar, syntax, and word associations. The models use these learned patterns to generate human-like text, making them ideal for natural language processing (NLP) tasks.&lt;/p&gt;
&lt;h3 id=&quot;which-llms-are-available&quot; tabindex=&quot;-1&quot;&gt;Which LLMs are available?&lt;/h3&gt;
&lt;p&gt;There are several LLMs available, with GPT-4 being the most popular. Other models include LLaMA, PaLM, BERT, and T5. Each model has its strengths and weaknesses; some of them are open and others are closed and only usable via API.&lt;/p&gt;
&lt;h3 id=&quot;shortcomings-of-llms&quot; tabindex=&quot;-1&quot;&gt;Shortcomings of LLMs&lt;/h3&gt;
&lt;p&gt;Despite their impressive performance, LLMs have several limitations. One significant drawback is their inability to reason beyond the information provided in the prompt. Additionally, LLMs can generate biased text based on the data they were trained on. It is also challenging to control the output of LLMs, making it necessary to use prompt engineering strategies to achieve the desired output.&lt;/p&gt;
&lt;h2 id=&quot;which-tasks-can-you-perform&quot; tabindex=&quot;-1&quot;&gt;Which tasks can you perform?&lt;/h2&gt;
&lt;p&gt;We can direct LLMs to perform specific tasks for us by formatting the prompts for each case. Here&#39;s a list of those tasks, with prompt templates and examples for each.&lt;/p&gt;
&lt;h3 id=&quot;text-summarization&quot; tabindex=&quot;-1&quot;&gt;Text Summarization&lt;/h3&gt;
&lt;p&gt;LLMs can generate summaries of long texts, making it easier to understand and digest the content. We can ask to summarize a text with the following prompt template:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Template&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;&amp;lt;Full text&gt;
Summarize the text above: / Explain the text above in &amp;lt;N&gt; sentences:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, we can ask to summarize this text with the following prompt:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;In a mechanical watch the watch&#39;s gears are turned by a spiral spring called a mainspring. In a manual watch energy is stored in the mainspring by turning a knob, the crown on the side of the watch. Then the energy from the mainspring powers the watch movement until it runs down, requiring the spring to be wound again.
A self-winding watch movement has a mechanism which winds the mainspring using the natural motions of the wearer&#39;s body. The watch contains an oscillating weight that turns on a pivot. The normal movements of the watch in the user&#39;s pocket (for a pocketwatch) or on the user&#39;s arm (for a wristwatch) cause the rotor to pivot on its staff, which is attached to a ratcheted winding mechanism. The motion of the watch is thereby translated into circular motion of the weight which, through a series of reverser and reducing gears, eventually winds the mainspring. There are many different designs for modern self-winding mechanisms. Some designs allow winding of the watch to take place while the weight swings in only one direction while other, more advanced, mechanisms have two ratchets and wind the mainspring during both clockwise and anti-clockwise weight motions.
Explain the text above in 2 sentences:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which gives the following output:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;A mechanical watch uses a mainspring to power the movement of the watch&#39;s gears. In a manual watch, the mainspring is wound by turning a knob, while a self-winding watch uses the natural motion of the wearer&#39;s body to wind the mainspring through an oscillating weight and ratcheted winding mechanism.&lt;/code&gt;&lt;/pre&gt;
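&lt;p&gt;If you are assembling prompts in code rather than typing them, the template above can be captured in a small helper. Here is a minimal Python sketch (the &lt;code&gt;summarize_prompt&lt;/code&gt; name is my own; sending the prompt to an LLM is a separate step):&lt;/p&gt;

```python
def summarize_prompt(text, n_sentences=None):
    """Build a summarization prompt following the template above.

    If n_sentences is given, ask for an explanation in that many sentences;
    otherwise ask for a plain summary.
    """
    if n_sentences is None:
        instruction = "Summarize the text above:"
    else:
        instruction = f"Explain the text above in {n_sentences} sentences:"
    # The full text goes first, the instruction on the line below it.
    return f"{text}\n{instruction}"

print(summarize_prompt("A mechanical watch is powered by a mainspring.", 2))
```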
&lt;h3 id=&quot;information-extraction&quot; tabindex=&quot;-1&quot;&gt;Information Extraction&lt;/h3&gt;
&lt;p&gt;LLMs can identify and extract specific information from a large body of text. This is useful in fields such as journalism and business intelligence. The following template instructs the model to extract information:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Template&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;&amp;lt;Full text&gt;
Mention X in the text above: / List X in the text above:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, we can ask it to extract the names of places from the following text:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;The earliest credible evidence for a successful design is the watch made by the Swiss watchmaker Abraham-Louis Perrelet, who lived in Le Locle. In late 1776 or early 1777, he invented a self-winding mechanism for pocket watches using an oscillating weight inside the watch that moved up and down. The Geneva Society of Arts, reporting on this watch in 1777, stated that 15 minutes walking was necessary to fully wind the watch.
List the names of places in the text above:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the output is:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Le Locle, Geneva.&lt;/code&gt;&lt;/pre&gt;
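&lt;p&gt;Extraction answers often come back as comma-separated text like the output above. A small post-processing helper (a sketch; &lt;code&gt;parse_list_output&lt;/code&gt; is a hypothetical name) turns such an answer into a list:&lt;/p&gt;

```python
def parse_list_output(output):
    """Split a comma-separated model answer into clean items.

    Strips surrounding spaces and trailing periods from each item.
    """
    return [item.strip(" .") for item in output.split(",")]

print(parse_list_output("Le Locle, Geneva."))  # ['Le Locle', 'Geneva']
```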
&lt;h3 id=&quot;question-answering&quot; tabindex=&quot;-1&quot;&gt;Question Answering&lt;/h3&gt;
&lt;p&gt;LLMs can also answer questions based on the information provided in the prompt. The following template can be used to direct the LLM to answer a question:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Template&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Answer the question based on the context below. &amp;lt;Other instructions&gt;. Respond &quot;Not sure&quot; if not sure about the answer.
Context: &amp;lt;Context&gt;
Q: &amp;lt;Question&gt;
A:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, with the following prompt:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Answer the question based on the context below. Answer in a single sentence. Respond &quot;Not sure&quot; if not sure about the answer.
Context: An automatic watch, also known as a self-winding watch or simply an automatic, is a mechanical watch where the natural motion of the wearer provides energy to wind the mainspring, making manual winding unnecessary if worn enough. It is distinguished from a manual watch in that a manual watch must have its mainspring wound by hand at regular intervals.
Q: Do automatic watches need to be manually wound?
A:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And ChatGPT responds with:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;No, automatic watches do not need to be manually wound if worn enough because the natural motion of the wearer provides energy to wind the mainspring.&lt;/code&gt;&lt;/pre&gt;
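&lt;p&gt;The question-answering template is also easy to fill in programmatically. A minimal sketch, with &lt;code&gt;qa_prompt&lt;/code&gt; as my own helper name:&lt;/p&gt;

```python
def qa_prompt(context, question, extra_instructions="Answer in a single sentence."):
    """Build a question-answering prompt following the template above."""
    return (
        f"Answer the question based on the context below. {extra_instructions} "
        'Respond "Not sure" if not sure about the answer.\n'
        f"Context: {context}\n"
        f"Q: {question}\n"
        "A:"  # left open for the model to complete
    )

print(qa_prompt("An automatic watch winds itself from wrist motion.",
                "Do automatic watches need to be manually wound?"))
```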
&lt;h3 id=&quot;text-classification&quot; tabindex=&quot;-1&quot;&gt;Text Classification&lt;/h3&gt;
&lt;p&gt;LLMs can classify text into predefined categories, such as sentiment analysis or spam detection. You can format a text classification instruction as follows:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Template&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Classify the text into &amp;lt;List of choices&gt;:
Text: &amp;lt;Full text&gt;
Classification:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, we can classify a text by whether its sentiment is positive, neutral or negative:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Classify the text into positive, neutral or negative:
Text: That shot selection was awful
Classification:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Negative&lt;/code&gt;&lt;/pre&gt;
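&lt;p&gt;To build this classification prompt for arbitrary label sets, a small helper can join the choices. A sketch (the &lt;code&gt;classify_prompt&lt;/code&gt; name is my own):&lt;/p&gt;

```python
def classify_prompt(text, choices):
    """Build a classification prompt; assumes at least two choices.

    Joins the labels as "a, b or c" to match the template above.
    """
    choice_str = ", ".join(choices[:-1]) + " or " + choices[-1]
    return f"Classify the text into {choice_str}:\nText: {text}\nClassification:"

print(classify_prompt("That shot selection was awful",
                      ["positive", "neutral", "negative"]))
```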
&lt;h3 id=&quot;conversation&quot; tabindex=&quot;-1&quot;&gt;Conversation&lt;/h3&gt;
&lt;p&gt;LLMs can generate human-like conversation, making them useful in chatbots and customer service applications. We can instruct the LLM to follow the conversation acting in a specific manner.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Template&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;The following is a conversation with &amp;lt;Actor&gt;. The &amp;lt;Actor&gt; should answer &amp;lt;Further instructions&gt;.
Person: &amp;lt;Conversation&gt;
&amp;lt;Actor&gt;: &amp;lt;Response&gt;
Person: &amp;lt;Conversation&gt;
&amp;lt;Actor&gt;:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The LLM will fill in the response as the actor we specified, and we can continue the conversation by passing the whole exchange, including the new responses, back to it. For example:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;The following is a conversation with a lawyer. The lawyer should answer simply and in layman terms.
Person: Hello, I would like to know how to dispute a parking ticket
Lawyer:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Sure, I can help. The first step is to check the ticket for any errors or mistakes. If you find any, you can use those as grounds to dispute the ticket. If there are no mistakes, you can still dispute the ticket by providing evidence to support your case, such as photos or witness statements. You can also request a hearing to present your case in person.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;Prompt&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;The following is a conversation with a lawyer. The lawyer should answer simply and in layman terms.
Person: Hello, I would like to know how to dispute a parking ticket
Lawyer: Sure, I can help. The first step is to check the ticket for any errors or mistakes. If you find any, you can use those as grounds to dispute the ticket. If there are no mistakes, you can still dispute the ticket by providing evidence to support your case, such as photos or witness statements. You can also request a hearing to present your case in person.
Person: How do I request a hearing?
Lawyer:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;To request a hearing, you&#39;ll need to follow the instructions on the back of the ticket or on the website provided. Usually, you&#39;ll need to submit a written request within a certain timeframe, along with any evidence or supporting documents you have. Once your request is received, you&#39;ll be notified of the date, time, and location of your hearing.&lt;/code&gt;&lt;/pre&gt;
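&lt;p&gt;Keeping the whole conversation as a list of turns makes it easy to rebuild the prompt on every round. A minimal sketch (names are my own; the model call itself is omitted):&lt;/p&gt;

```python
def conversation_prompt(actor, role, instructions, history, new_message):
    """Build a conversation prompt following the template above.

    actor: description used in the preamble, e.g. "a lawyer".
    role: speaker label used for each turn, e.g. "Lawyer".
    history: list of (person_message, actor_reply) pairs so far.
    """
    lines = [f"The following is a conversation with {actor}. "
             f"The {role.lower()} should answer {instructions}."]
    for person_message, reply in history:
        lines.append(f"Person: {person_message}")
        lines.append(f"{role}: {reply}")
    # End with the new message and an open line for the model to fill in.
    lines.append(f"Person: {new_message}")
    lines.append(f"{role}:")
    return "\n".join(lines)

print(conversation_prompt(
    "a lawyer", "Lawyer", "simply and in layman terms",
    [("How do I dispute a parking ticket?", "Check the ticket for errors first.")],
    "How do I request a hearing?"))
```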
&lt;h3 id=&quot;code-generation&quot; tabindex=&quot;-1&quot;&gt;Code Generation&lt;/h3&gt;
&lt;p&gt;LLMs can generate code, making it easier for developers to create complex software programs. We can write a code comment describing what we want the code to do, and the LLM will generate it to the best of its ability. The more complex the task, the more specific and descriptive the prompt comment will need to be.&lt;/p&gt;
&lt;p&gt;The template may vary depending on the programming language we want to generate, but I would advise roughly following this template:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Template&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;&amp;lt;Comment symbol of your language of choice (/* or &#39;&#39;&#39; or #)&gt;
Method: &amp;lt;Name of the method&gt;
Language: &amp;lt;Programming language&gt;
Description: &amp;lt;Description of what the method must do&gt;
Parameters: &amp;lt;Parameters that the method will receive&gt;
Context: &amp;lt;Further information needed to implement the method (URLs, databases, etc.)&gt;
Output: &amp;lt;Output of the method&gt;
&amp;lt;Comment symbol of your language of choice (/* or &#39;&#39;&#39; or #)&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, if we want to fetch an API with pagination we could write the following comment:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;/* Method: fetchFirst5Pages
 * Language: Javascript
 * Description: Fetch the first 5 pages from this paginated API: http://example.com/pages/?n=x
 * Context: This API accepts the following URL parameters:
 *   n: the number of the page to fetch
 * Output: the content of the first 5 pages
 */&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Here&#39;s an example implementation of the fetchFirst5Pages method in JavaScript, assuming the use of the fetch function:

async function fetchFirst5Pages() {
  const baseUrl = &#39;http://example.com/pages/?n=&#39;;
  const pages = [];

  for (let i = 1; i &amp;lt;= 5; i++) {
    const url = baseUrl + i;
    const response = await fetch(url);

    if (!response.ok) {
      throw new Error(`Error fetching page ${i}`);
    }

    const pageData = await response.json();
    pages.push(pageData);
  }

  return pages;
}

This function uses a loop to fetch the first 5 pages from the API by appending the page number to the base URL. It checks if the response from the server is okay and throws an error if it&#39;s not. Then, it parses the JSON data from the response and adds it to an array of pages. Finally, it returns the array of pages.&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;reasoning&quot; tabindex=&quot;-1&quot;&gt;Reasoning&lt;/h3&gt;
&lt;p&gt;LLMs can struggle with some basic tasks, such as arithmetic problems. However, if we ask the LLM to reason through its response, we can help it reach the right answer. This is the template I recommend:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Template&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Q: &amp;lt;Question&gt;
A: Let&#39;s work this out in a step by step way to be sure we have the right answer. First, &amp;lt;Instructions&gt;.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example we can see that a simple arithmetic question fails if we ask it without reasoning:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Answer the following question. Respond &quot;Not sure&quot; if not sure about the answer.
Q: The sum of prime numbers in the following list is a multiple of 3: [2, 4, 6, 7, 9, 14]
A: [Yes or No]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;No. The sum of prime numbers in the list [2, 4, 6, 7, 9, 14] is not a multiple of 3. Therefore, the answer is &quot;No&quot;.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, if we instruct the LLM to reason its answer, we get a very different result:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Q: The sum of prime numbers in the following list is a multiple of 3: [2, 4, 6, 7, 9, 14]
A: Let&#39;s work this out in a step by step way to be sure we have the right answer. First, we find the prime numbers in the list and add them together. Then we check whether the resulting number can be divided by 3.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Sure, let&#39;s work this out step by step.

The prime numbers in the list are 2 and 7. So we can add them together:

2 + 7 = 9

The resulting number is 9. Now we need to check whether it is a multiple of 3. We can do this by checking if the remainder of 9 divided by 3 is equal to 0:

9 % 3 = 0

Since the remainder is 0, we know that 9 is indeed a multiple of 3.

Therefore, the answer is &quot;Yes&quot;, the sum of prime numbers in the list [2, 4, 6, 7, 9, 14] is a multiple of 3.&lt;/code&gt;&lt;/pre&gt;
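&lt;p&gt;The model&#39;s reasoning can of course be checked in ordinary code. A quick Python sketch of the same steps (find the primes, sum them, test divisibility by 3):&lt;/p&gt;

```python
def is_prime(n):
    """Trial division; assumes n is a non-negative integer."""
    if n in (0, 1):
        return False
    for d in range(2, n):
        if n % d == 0:
            return False
    return True

def prime_sum_is_multiple_of_3(numbers):
    # Sum only the primes, then check divisibility by 3.
    total = sum(x for x in numbers if is_prime(x))
    return total % 3 == 0

print(prime_sum_is_multiple_of_3([2, 4, 6, 7, 9, 14]))  # True (2 + 7 = 9)
```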
&lt;h2 id=&quot;strategies&quot; tabindex=&quot;-1&quot;&gt;Strategies&lt;/h2&gt;
&lt;p&gt;While these templates will help you squeeze the most out of LLMs for your needs, the models can still struggle with certain complex tasks. A few techniques have been developed to help LLMs reach a satisfactory answer. So far we have only used zero-shot prompting, which is defined next, but more powerful techniques are available.&lt;/p&gt;
&lt;h3 id=&quot;zero-shot-prompting&quot; tabindex=&quot;-1&quot;&gt;Zero-shot prompting&lt;/h3&gt;
&lt;p&gt;We have been using zero-shot prompting for all examples in this article. Zero-shot prompting means passing a prompt to the LLM with just the question we want it to answer, without any examples of how it should respond in other cases. For example, in an earlier example we did:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Classify the text into positive, neutral or negative:
Text: That shot selection was awful
Classification:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Negative&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This might not be enough context for the model to answer exactly how we want it to. Let&#39;s see what alternatives we have.&lt;/p&gt;
&lt;h3 id=&quot;few-shot-prompting&quot; tabindex=&quot;-1&quot;&gt;Few-shot prompting&lt;/h3&gt;
&lt;p&gt;As opposed to zero-shot prompting, with few-shot prompting we give the LLM a few examples of how it should respond before we ask the question. For example, let&#39;s say we want to classify the sentiment of a text like in the previous example. But we want it to give us the answer in a very specific format every time. We could give it a few examples beforehand like so:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Classify the text into positive, neutral or negative:
Text: Today the weather is fantastic
Classification: Pos
Text: The furniture is small.
Classification: Neu
Text: I don&#39;t like your attitude
Classification: Neg
Text: That shot selection was awful
Classification:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Neg&lt;/code&gt;&lt;/pre&gt;
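&lt;p&gt;A few-shot prompt like the one above is just the instruction followed by example pairs and the final query, so it is easy to generate from data. A sketch (the &lt;code&gt;few_shot_prompt&lt;/code&gt; name is my own):&lt;/p&gt;

```python
def few_shot_prompt(instruction, examples, query):
    """Build a few-shot classification prompt.

    examples: list of (text, label) pairs shown to the model first,
    demonstrating the exact output format we want.
    """
    lines = [instruction]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Classification: {label}")
    # The query uses the same format, with the classification left open.
    lines.append(f"Text: {query}")
    lines.append("Classification:")
    return "\n".join(lines)

print(few_shot_prompt(
    "Classify the text into positive, neutral or negative:",
    [("Today the weather is fantastic", "Pos"),
     ("The furniture is small.", "Neu")],
    "That shot selection was awful"))
```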
&lt;h3 id=&quot;chain-of-thought-prompting&quot; tabindex=&quot;-1&quot;&gt;Chain of thought prompting&lt;/h3&gt;
&lt;p&gt;For even more complex tasks, we can combine few-shot prompting with making the model reason about its answer. We can use the same example as before to see how reasoning and few-shot prompting affect the result:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;Q: The sum of prime numbers in the following list is a multiple of 3: [2, 4, 6, 7, 9, 14]
A: The prime numbers in the list are 2 and 7. Their sum is 9. 9 % 3 = 0. The answer is yes.
Q: The sum of prime numbers in the following list is a multiple of 3: [1, 2, 4, 7, 8, 10]
A: The prime numbers in the list are 2 and 7. Their sum is 9. 9 % 3 = 0. The answer is yes.
Q: The sum of prime numbers in the following list is a multiple of 3: [1, 2, 3, 4]
A: The prime numbers in the list are 2 and 3. Their sum is 5. 5 % 3 = 2. The answer is no.
Q: The sum of prime numbers in the following list is a multiple of 3: [5, 6, 7, 8]
A:&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-txt&quot;&gt;&lt;code class=&quot;language-txt&quot;&gt;The prime numbers in the list are 5 and 7. Their sum is 12. 12 % 3 = 0. The answer is yes.&lt;/code&gt;&lt;/pre&gt;
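&lt;p&gt;A chain-of-thought prompt has the same shape as a few-shot prompt, except that each exemplar answer spells out the reasoning steps rather than just the final label. A sketch (names are my own):&lt;/p&gt;

```python
def chain_of_thought_prompt(exemplars, question):
    """Build a chain-of-thought prompt.

    exemplars: list of (question, worked_answer) pairs where each answer
    walks through the reasoning before stating the conclusion.
    """
    lines = []
    for q, a in exemplars:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    # The new question ends with an open answer line for the model.
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)

print(chain_of_thought_prompt(
    [("The sum of prime numbers in [1, 2, 3, 4] is a multiple of 3.",
      "The primes are 2 and 3. Their sum is 5. 5 % 3 = 2. The answer is no.")],
    "The sum of prime numbers in [5, 6, 7, 8] is a multiple of 3."))
```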
&lt;h2 id=&quot;final-note&quot; tabindex=&quot;-1&quot;&gt;Final note&lt;/h2&gt;
&lt;p&gt;LLMs have revolutionized the field of natural language processing, but to maximize their potential, prompt engineering is essential. By understanding the tasks LLMs can perform, their shortcomings, and various prompt engineering strategies, developers can harness the power of LLMs to create innovative and effective solutions. More strategies and techniques will likely be developed in the near future, so stay tuned to advancements in this field to get even more out of LLMs. Moreover, as LLMs continue to grow with billions of additional parameters, tasks that we cannot even imagine right now will likely become possible. It&#39;s exciting to think of what will be possible with these new tools and which use cases they will serve in the future.&lt;/p&gt;
</description>
      <pubDate>Sun, 09 Apr 2023 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/ai/maximizing-the-potential-of-llms/</guid>
    </item>
    <item>
      <title>How to install (and keep) extensions in SageMaker Studio</title>
      <link>https://ruxu.dev/articles/mlops/install-extensions-studio/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 3/2&quot; alt=&quot;Puzzle painting&quot; src=&quot;https://ruxu.dev/articles/mlops/install-extensions-studio/assets/images/puzzle.webp&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;If you have been using SageMaker Studio, you might have missed some features that other modern IDEs have. As you might know, SageMaker Studio is built on top of the JupyterLab IDE, extending it with integrations to other SageMaker resources. However, these integrations don&#39;t add any developer-experience features.&lt;/p&gt;
&lt;p&gt;To enrich SageMaker Studio with the features that you need, you can install extensions to JupyterLab. You can do that using different methods.&lt;/p&gt;
&lt;p&gt;You can use the &lt;strong&gt;Extension Manager&lt;/strong&gt;, found in the menu on the left under a puzzle piece icon. Inside the Extension Manager, you can review the already installed extensions. You can also search for extensions by typing a name in the search bar. When you have found the extension you want, click the &amp;quot;Install&amp;quot; button. After installing all the needed extensions, restart JupyterLab by refreshing the page. You should now be able to use them.&lt;/p&gt;
&lt;figure style=&quot;display: flex&quot;&gt;
&lt;img alt=&quot;SageMaker Studio Extension Manager&quot; src=&quot;https://ruxu.dev/articles/mlops/install-extensions-studio/assets/images/sagemaker_extensions.webp&quot; style=&quot;width: 450px; margin: 0 auto&quot; width=&quot;450&quot; height=&quot;750&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;Another method to install extensions is using the &lt;strong&gt;Jupyter CLI tool&lt;/strong&gt;. For this method, you will need to know the name of the extension beforehand. Open a terminal inside the Studio and type the following commands:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;conda activate studio&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the package you are trying to install is in the &lt;strong&gt;NPM package registry&lt;/strong&gt;, you can install it using the Jupyter CLI:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;jupyter labextension &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; my-extension@1.2.3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also install extensions that are in the &lt;strong&gt;pip package registry&lt;/strong&gt; with a pip command:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;pip &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; my-extension&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1.2&lt;/span&gt;.3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, execute the following command:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;restart-jupyter-lab&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and refresh the page.&lt;/p&gt;
&lt;p&gt;There is a catch to installing extensions in SageMaker Studio: every time your JupyterServer shuts down, &lt;strong&gt;it loses all installed extensions and starts from a clean state&lt;/strong&gt;. To keep the installed extensions, you must create a &lt;strong&gt;Lifecycle Configuration&lt;/strong&gt; that reinstalls them on startup by executing a script when the JupyterServer starts. The content of the script will be:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;source&lt;/span&gt; activate studio
jupyter labextension &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; extension-foo
pip &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; extension-bar
restart-jupyter-server&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that in this script we activate the environment with &lt;code&gt;source activate studio&lt;/code&gt; instead of &lt;code&gt;conda activate studio&lt;/code&gt;; this is because conda cannot be used from the lifecycle configuration script.&lt;/p&gt;
&lt;p&gt;There are many ways to create a Lifecycle configuration. You can do it via the Console, using a Cloudformation Stack, or via AWS CLI. Using the CLI you could do:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;aws sagemaker create-studio-lifecycle-config &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--region&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;your-region&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
--studio-lifecycle-config-name my-studio-lcc &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
--studio-lifecycle-config-content &lt;span class=&quot;token variable&quot;&gt;$LCC_CONTENT&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
--studio-lifecycle-config-app-type JupyterServer &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;$LCC_CONTENT is the base64-encoded content of the script described above (the CLI expects the lifecycle configuration content to be base64 encoded). Then, when you create a user profile in the SageMaker Domain, you can bind the Lifecycle Configuration to it:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;aws sagemaker create-user-profile --domain-id &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;DOMAIN-ID&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt; --user-profile-name &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token environment constant&quot;&gt;USER&lt;/span&gt;-PROFILE-NAME&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;--region&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;REGION&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt; --user-settings &lt;span class=&quot;token string&quot;&gt;&#39;{ &quot;JupyterServerAppSettings&quot;: {   &quot;LifecycleConfigArns&quot;:     [&quot;&amp;lt;LIFECYCLE-CONFIGURATION-ARN-LIST&gt;&quot;]   } }&#39;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From now on, when users start their instance of SageMaker Studio, they will always have the extensions installed and will be able to use them from the get-go. They can still install and uninstall extensions from the interface and the terminal, but those changes will only last for the duration of the Studio session.&lt;/p&gt;
</description>
      <pubDate>Sun, 07 Aug 2022 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/mlops/install-extensions-studio/</guid>
    </item>
    <item>
      <title>How to disable the download button in SageMaker Studio</title>
      <link>https://ruxu.dev/articles/mlops/disable-download-button-sagemaker/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 897/467&quot; alt=&quot;Painting desk&quot; src=&quot;https://ruxu.dev/articles/mlops/disable-download-button-sagemaker/assets/images/security.webp&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;Many enterprises choose a cloud environment to power the work of their data science team. If you chose AWS SageMaker Studio, this article might interest you. Having both the data lake and the data science environment on AWS makes it easy to integrate them. You can choose what data any given data scientist is able to see. You might want a data scientist to be able to use this data only inside the SageMaker Studio environment. However, SageMaker Studio has a download button that lets data scientists download any data they have been working on. Once they have downloaded data to their computers, they are free to share it anywhere and with anyone.&lt;/p&gt;
&lt;p&gt;Luckily, &lt;strong&gt;it is possible to disable this download button&lt;/strong&gt;. Until recently, this was only possible in SageMaker Notebook instances. &lt;a href=&quot;https://ujjwalbhardwaj.me/post/disable-download-button-on-the-sagemaker-jupyter-notebook/&quot;&gt;This article from Ujjwal Bhardwaj&lt;/a&gt; shows how to do it there.&lt;/p&gt;
&lt;p&gt;However, AWS has since updated SageMaker Studio so that the download button can be disabled there as well. The update lets us configure Studio to use JupyterLab version 3, in which JupyterLab refactored some features, including the download button. Now, &lt;strong&gt;those features are plugins included by default by JupyterLab, instead of hardcoded in the JupyterLab core&lt;/strong&gt;. This means it is now possible to disable these plugins so they won&#39;t show up in the UI.&lt;/p&gt;
&lt;p&gt;The plugins that include a download button in the JupyterLab UI are the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;@jupyterlab/docmanager-extension:download&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;@jupyterlab/filebrowser-extension:download&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are a couple of ways to disable those plugins. The most straightforward is to run these commands in a SageMaker Studio terminal:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;conda activate studio
jupyter labextension disable @jupyterlab/docmanager-extension:download
jupyter labextension disable @jupyterlab/filebrowser-extension:download
restart-jupyter-server&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also use the JupyterLab configuration files. Edit the file &lt;code&gt;/opt/conda/envs/studio/etc/jupyter/labconfig/page_config.json&lt;/code&gt; with the following content:&lt;/p&gt;
&lt;pre class=&quot;language-json&quot;&gt;&lt;code class=&quot;language-json&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;disabledExtensions&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;@jupyterlab/docmanager-extension:download&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;@jupyterlab/filebrowser-extension:download&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;true&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and run the command:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;restart-jupyter-server&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You might also have to refresh the page to see the changes take place.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The problem with these approaches is that changes will only last for the duration of the session&lt;/strong&gt;. To make the changes permanent, you have to create a &lt;strong&gt;Studio Lifecycle Configuration&lt;/strong&gt;. The Lifecycle Configuration will execute a script when the JupyterServer starts. In this script, you will edit the file in the previous example.&lt;/p&gt;
&lt;p&gt;The content of the script will be:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;{&quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; /opt/conda/envs/studio/etc/jupyter/labconfig/page_config.json
&lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;  &lt;span class=&quot;token entity&quot; title=&quot;&#92;&#92;&quot;&gt;&#92;&#92;&lt;/span&gt;&quot;&lt;/span&gt;disabledExtensions&lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;: {&quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt; /opt/conda/envs/studio/etc/jupyter/labconfig/page_config.json
&lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;    &lt;span class=&quot;token entity&quot; title=&quot;&#92;&#92;&quot;&gt;&#92;&#92;&lt;/span&gt;&quot;&lt;/span&gt;@jupyterlab/docmanager-extension:download&lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;: true,&quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt; /opt/conda/envs/studio/etc/jupyter/labconfig/page_config.json
&lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;    &lt;span class=&quot;token entity&quot; title=&quot;&#92;&#92;&quot;&gt;&#92;&#92;&lt;/span&gt;&quot;&lt;/span&gt;@jupyterlab/filebrowser-extension:download&lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;: true&quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt; /opt/conda/envs/studio/etc/jupyter/labconfig/page_config.json
&lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;  }&quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt; /opt/conda/envs/studio/etc/jupyter/labconfig/page_config.json
&lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;}&quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt; /opt/conda/envs/studio/etc/jupyter/labconfig/page_config.json
restart-jupyter-server&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are several ways to create a Lifecycle Configuration: via the Console, with a CloudFormation stack, or through the AWS CLI. Using the CLI, you could run:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;aws sagemaker create-studio-lifecycle-config &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--region&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;your-region&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
--studio-lifecycle-config-name my-studio-lcc &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
--studio-lifecycle-config-content &lt;span class=&quot;token variable&quot;&gt;$LCC_CONTENT&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
--studio-lifecycle-config-app-type JupyterServer &lt;/code&gt;&lt;/pre&gt;
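&lt;p&gt;Note that the CLI expects the lifecycle script base64-encoded rather than as plain text. As a minimal sketch (the script body here is a placeholder for the full script shown earlier), the value for &lt;code&gt;$LCC_CONTENT&lt;/code&gt; can be prepared like this:&lt;/p&gt;

```javascript
// Sketch: base64-encode the lifecycle script for the CLI call above.
// (In a shell, the equivalent would be: LCC_CONTENT=$(openssl base64 -A -in on-start.sh))
const script = [
  '#!/bin/bash',
  'set -eux',
  // ...the echo lines writing page_config.json, as shown earlier...
  'restart-jupyter-server',
].join('\n');

// Single-line base64, as expected by --studio-lifecycle-config-content.
const lccContent = Buffer.from(script, 'utf-8').toString('base64');
console.log(lccContent);
```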
&lt;p&gt;$LCC_CONTENT is the base64-encoded content of the script described above. Then, when you create a user profile in the SageMaker Domain, you can bind the Lifecycle Configuration to it:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;aws sagemaker create-user-profile --domain-id &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;DOMAIN-ID&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt; --user-profile-name &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token environment constant&quot;&gt;USER&lt;/span&gt;-PROFILE-NAME&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;--region&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;REGION&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt; --user-settings &lt;span class=&quot;token string&quot;&gt;&#39;{ &quot;JupyterServerAppSettings&quot;: {   &quot;LifecycleConfigArns&quot;:     [&quot;&amp;lt;LIFECYCLE-CONFIGURATION-ARN-LIST&gt;&quot;]   } }&#39;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From now on, every time a data scientist opens their instance of SageMaker Studio, it will no longer display any download buttons. This effectively blocks them from downloading files located in their Studio, as long as they cannot revert these changes themselves from their terminal. Note, however, that disabling the download plugin only removes the download buttons from the interface; other means of downloading files, if any exist, are not blocked by this change.&lt;/p&gt;
</description>
      <pubDate>Sat, 06 Aug 2022 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/mlops/disable-download-button-sagemaker/</guid>
    </item>
    <item>
      <title>Feature Store - Why do you need one?</title>
      <link>https://ruxu.dev/articles/mlops/feature-stores/</link>
      <description>&lt;figure&gt;
&lt;img style=&quot;aspect-ratio: 3/2&quot; alt=&quot;Painting desk&quot; src=&quot;https://ruxu.dev/articles/mlops/feature-stores/assets/images/paint.jpg&quot; /&gt;
&lt;figcaption&gt;Painter photo created by rawpixel.com - www.freepik.com&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;&lt;strong&gt;A feature store is a storage system for features&lt;/strong&gt;. Features are properties of data calculated through an ETL process or feature pipeline. This pipeline takes raw data and calculates a property from it. This property - usually a numeric value - will be useful to a machine learning model. Finding adequate, correct, high-quality features matters: the quality of those features is the most important contributor to a model&#39;s success. The model will use the features either to train itself or to make predictions. A feature store helps to organize and use those features.&lt;/p&gt;
&lt;p&gt;At its core, &lt;strong&gt;a feature store is only a database&lt;/strong&gt;. More specifically, there are usually two databases. There is an &lt;strong&gt;offline store&lt;/strong&gt; equipped to store large volumes of data, such as HBase or S3. There is also an &lt;strong&gt;online store&lt;/strong&gt; equipped for fast data serving, such as Cassandra. Features are organized into feature groups, which can be thought of as tables. Features that are used together are stored in the same feature group, so they can be read together quickly and without joins. Many ETL processes (think Spark) write to the offline store. Data from the offline store replicates to the online store to keep the two consistent. Data streams can also write to both the online and offline stores, for fast real-time data access.&lt;/p&gt;
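&lt;p&gt;To make the dual-store layout concrete, here is a toy sketch (my own illustration, not any particular product&#39;s API): each write appends to the offline store&#39;s full history and overwrites the latest value in the online store:&lt;/p&gt;

```javascript
// Toy dual-store layout: the offline store keeps the full history for
// analytics and training; the online store keeps only the latest value
// per (feature group, entity, feature) for fast serving.
const offline = [];
const online = new Map();

function writeFeature(group, entity, feature, value, ts) {
  offline.push({ group, entity, feature, value, ts });
  online.set(`${group}:${entity}:${feature}`, value);
}

writeFeature('customer_stats', 'cust_42', 'orders_30d', 3, '2022-08-01');
writeFeature('customer_stats', 'cust_42', 'orders_30d', 5, '2022-08-02');

console.log(online.get('customer_stats:cust_42:orders_30d')); // 5
console.log(offline.length); // 2
```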
&lt;figure&gt;
&lt;img alt=&quot;Architecture of Michelangelo Pallete feature store developed at Uber&quot; src=&quot;https://miro.medium.com/max/1400/0*F4gGPz8PukepaG5r.png&quot; /&gt;
&lt;figcaption&gt;Architecture of Michelangelo Pallete feature store developed at Uber&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;In this article, I will lay out the advantages of including a feature store in your data architecture. Prescribing the same solution to every case without further thought is not the answer, but almost every data science team will benefit from having a feature store, even a small one.&lt;/p&gt;
&lt;h3 id=&quot;reusable-features&quot; tabindex=&quot;-1&quot;&gt;Reusable features&lt;/h3&gt;
&lt;p&gt;The principal purpose of a feature store is to empower data scientists to &lt;strong&gt;reuse features&lt;/strong&gt;. Building feature pipelines takes around 80% of data scientists&#39; time, so avoiding repeated feature engineering work results in a faster work cycle. One example of feature reuse is sharing features between training and inference: the features used for training are roughly the same as the features used for making a prediction. Another example of feature reuse is between teams or projects, since features related to core enterprise concepts are usually used throughout different ML projects. To encourage reuse, features must be discoverable through the feature store.&lt;/p&gt;
&lt;h3 id=&quot;feature-consistency&quot; tabindex=&quot;-1&quot;&gt;Feature consistency&lt;/h3&gt;
&lt;p&gt;Another benefit of centralizing features in a single feature store is &lt;strong&gt;feature consistency&lt;/strong&gt;. Different data science teams might calculate similar features slightly differently. Those features might be the same concept and data scientists will have to agree to unify them. Then, if the process to calculate the feature changes, it changes for all the projects that use it. Or they might be a different concept, and data scientists will have to categorize them according to their separate quirks.&lt;/p&gt;
&lt;h3 id=&quot;point-in-time-correctness&quot; tabindex=&quot;-1&quot;&gt;Point-in-time correctness&lt;/h3&gt;
&lt;p&gt;Feature stores also enable point-in-time correctness. The online store will always have the latest value for a feature. The offline store will store all historical values the feature had at any point. This enables data scientists to work with old values, aggregate time ranges, and so on. It also ensures the &lt;strong&gt;reproducibility&lt;/strong&gt; of a model. At any point, we can recover the data used in a past training or in a past inference to debug the model.&lt;/p&gt;
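&lt;p&gt;A minimal sketch of what this enables (illustrative data and function names, not a real feature-store API): given the full history in the offline store, we can look up the value a feature had as of any past timestamp:&lt;/p&gt;

```javascript
// Toy point-in-time lookup: the offline store keeps every historical value,
// so we can reconstruct what a feature was worth at any moment in the past.
const offlineStore = [
  { feature: 'avg_order_value', ts: '2022-01-01', value: 41.5 },
  { feature: 'avg_order_value', ts: '2022-03-01', value: 44.2 },
  { feature: 'avg_order_value', ts: '2022-06-01', value: 39.8 },
];

function asOf(store, feature, ts) {
  // Latest row at or before ts: exactly what a training run saw back then.
  const rows = store
    .filter((r) => r.feature === feature)
    .filter((r) => !(r.ts > ts))
    .sort((a, b) => b.ts.localeCompare(a.ts));
  return rows.length ? rows[0].value : null;
}

console.log(asOf(offlineStore, 'avg_order_value', '2022-04-15')); // 44.2
```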
&lt;h3 id=&quot;data-health&quot; tabindex=&quot;-1&quot;&gt;Data health&lt;/h3&gt;
&lt;p&gt;One can also generate statistics from the feature store to monitor the health of the data. If the &lt;strong&gt;data drifts&lt;/strong&gt; (its statistical distribution changes over time), this can be automatically detected in the pipeline. Statistics can also help explain how a feature affects the predictions of each model.&lt;/p&gt;
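&lt;p&gt;As a toy illustration of drift detection (a deliberately naive mean-shift check, not what production monitoring tools actually implement), one could compare recent feature values against the training baseline:&lt;/p&gt;

```javascript
// Naive drift check: flag the feature when the mean of recent values
// moves more than `tolerance` (relative) away from the baseline mean.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function drifted(baseline, recent, tolerance = 0.25) {
  const b = mean(baseline);
  return Math.abs(mean(recent) - b) / Math.abs(b) > tolerance;
}

console.log(drifted([10, 11, 9, 10], [10, 10, 11, 9])); // false
console.log(drifted([10, 11, 9, 10], [19, 21, 20, 22])); // true
```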
&lt;h3 id=&quot;data-lineage&quot; tabindex=&quot;-1&quot;&gt;Data lineage&lt;/h3&gt;
&lt;p&gt;Using the catalog of features and models, you can draw a &lt;strong&gt;data lineage&lt;/strong&gt;. This data lineage shows the data source used to create each feature. It also shows the models or other feature pipelines that use the feature. This graph enables debugging problems with data. It becomes trivial to track down where a piece of data came from and how it is being used.&lt;/p&gt;
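&lt;p&gt;The lineage graph itself can be as simple as a set of edges. Here is a toy sketch with made-up node names showing how upstream and downstream questions become simple lookups:&lt;/p&gt;

```javascript
// Toy lineage graph: edges run from data sources to features and from
// features to the models that consume them.
const edges = [
  ['orders_table', 'orders_30d'],
  ['orders_30d', 'churn_model'],
  ['orders_30d', 'ltv_model'],
];

// Where did this node's data come from?
function upstream(node) {
  return edges.filter(([, to]) => to === node).map(([from]) => from);
}

// Who consumes this node downstream?
function downstream(node) {
  return edges.filter(([from]) => from === node).map(([, to]) => to);
}

console.log(upstream('orders_30d'));   // ['orders_table']
console.log(downstream('orders_30d')); // ['churn_model', 'ltv_model']
```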
&lt;h3 id=&quot;online-store&quot; tabindex=&quot;-1&quot;&gt;Online store&lt;/h3&gt;
&lt;p&gt;In some use cases, an ML model will have a low-latency requirement. For example, if a model is called from an API, the user will expect a response within a few seconds. This requires very fast access to features. Instead of calculating them each time, we can read the precalculated features from the online store. We know that the online store always holds the latest value of each feature, and it is optimized for sub-second queries for a fast response.&lt;/p&gt;
&lt;p&gt;Don&#39;t use a feature store if you don&#39;t have to. But if your organization has a medium-sized ML team or several ML teams, or has any of the needs described above, consider introducing a feature store. &lt;strong&gt;It will only benefit your data science teams in the long run.&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id=&quot;how-to-start-using-a-feature-store-now&quot; tabindex=&quot;-1&quot;&gt;How to start using a feature store now?&lt;/h3&gt;
&lt;p&gt;You can build a feature store by putting together your own components, as Uber did with Michelangelo. You could use Hive for the offline store, Cassandra and Redis for the online store, Kafka for streaming real-time data, and a Spark cluster to run ETL processes. On the other hand, you could also rely on feature stores that others have already built and use their solutions. You can choose an open-source solution and host it yourself. Some open-source solutions are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Feast:&lt;/strong&gt; a minimal Feature Store that lacks some features like an ETL system and data lineage. Feast has integration support with tools from GCP (BigQuery as offline store and Datastore as online store) and AWS (Redshift, DynamoDB). It also has integration support for other agnostic tools like Snowflake, Redis or Kafka.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hopsworks:&lt;/strong&gt; a very complete feature store. It includes tools like a Model Registry, multi-tenant governance, data lineage and much more. It can be deployed on GCP, AWS, Azure or on premises. This is because Hopsworks provides its own technology instead of integrating with external systems as Feast does. Hopsworks is deployed in a Kubernetes cluster. This cluster includes a RonDB database for the online store and integrates with bucket storage such as S3 for the offline store.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can also choose a SaaS tool instead of open-source. Some examples include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Databricks Feature Store:&lt;/strong&gt; it is integrated inside the Databricks Lakehouse Platform. Therefore, it is a good fit if you are already using Databricks as your ML platform. It uses Delta Lake as an offline store and can be integrated with either AWS DynamoDB, AWS RDS or AWS Aurora as an online store.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SageMaker Feature Store:&lt;/strong&gt; fully managed feature store by AWS. It uses S3 as an offline store and DynamoDB as an online store. It integrates with all the other tools in the SageMaker environment and with data sources within AWS like Redshift, Athena and S3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vertex AI Feature Store:&lt;/strong&gt; a feature store fully managed by Google in their cloud provider GCP. It uses BigQuery as an offline store and BigTable as an online store. It integrates with all the other tools in the Vertex AI environment and with BigQuery and GCS as data sources.&lt;/li&gt;
&lt;/ul&gt;
</description>
      <pubDate>Fri, 05 Aug 2022 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/mlops/feature-stores/</guid>
    </item>
    <item>
      <title>Bun - What can I use it for?</title>
      <link>https://ruxu.dev/articles/languages-and-frameworks/bun/</link>
      <description>&lt;p&gt;In recent years, the web development landscape has discovered that, while super stylized websites and web applications with tons of very rich interactions made with Javascript might be appealing to users, the loading time of the page can be a hundred times more influential to the user&#39;s perspective of the experience. As a result, there has been a ton of effort to reduce the loading time metrics of websites. Many frameworks and patterns have (re-)emerged to give solutions to this problem, like Server-Side Rendering and Static Site Generation.&lt;/p&gt;
&lt;p&gt;Enter Jarred Sumner. Jarred noticed that an action as simple as printing to the console was extremely slow in Node.js. Here&#39;s an example of the time it takes to execute a &amp;quot;Hello world&amp;quot; program in Node.js and Rust.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;log.js&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&#39;Hello world!&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ &lt;span class=&quot;token function&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;node&lt;/span&gt; log.js
Hello world&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;
&lt;span class=&quot;token function&quot;&gt;node&lt;/span&gt; log.js  &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;,13s user &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;,02s system &lt;span class=&quot;token number&quot;&gt;100&lt;/span&gt;% cpu &lt;span class=&quot;token number&quot;&gt;0,152&lt;/span&gt; total&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;log.rs&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-rust&quot;&gt;&lt;code class=&quot;language-rust&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;token function-definition function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token macro property&quot;&gt;println!&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Hello world!&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rustc log.rs
$ &lt;span class=&quot;token function&quot;&gt;time&lt;/span&gt; ./log
Hello world&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;
./log  &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;,00s user &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;,00s system &lt;span class=&quot;token number&quot;&gt;86&lt;/span&gt;% cpu &lt;span class=&quot;token number&quot;&gt;0,004&lt;/span&gt; total&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Jarred was very bothered by this fact, as well as by other realities of the Javascript world, like having to choose and harmonize multiple tools - the runtime, the bundler, the transpiler... He was sure this didn&#39;t have to be the case, so he set out on a one-man journey to build Bun.&lt;/p&gt;
&lt;h3 id=&quot;what-exactly-is-bun&quot; tabindex=&quot;-1&quot;&gt;What exactly is Bun?&lt;/h3&gt;
&lt;p&gt;Bun is first and foremost a Javascript runtime like Node.js and Deno. It is designed to be &lt;strong&gt;a blazingly fast Javascript runtime&lt;/strong&gt;. But don’t worry, embracing this new runtime does not mean having to refactor all of your existing code. Bun implements most of the Node APIs and Web APIs like fs, fetch, Buffer, WebSocket, and much more. Furthermore, &lt;strong&gt;Bun is also compatible with most NPM packages&lt;/strong&gt;, even though it uses its own package manager, which is, again, built for speed.&lt;/p&gt;
&lt;p&gt;But Bun is not just a Javascript runtime; it comes with batteries included. &lt;strong&gt;It bundles all the tools usually needed in a Javascript project: a package manager, a bundler, and a transpiler that works not only for Javascript but also for Typescript and JSX out of the box&lt;/strong&gt;. Moreover, it also includes native implementations of &lt;strong&gt;dotenv configuration loading and an SQLite3 client&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Here&#39;s an example of installing a regular-sized package.json using NPM v8.15 and the Bun package manager v0.1.5:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ &lt;span class=&quot;token function&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;npm&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt;

added &lt;span class=&quot;token number&quot;&gt;563&lt;/span&gt; packages, and audited &lt;span class=&quot;token number&quot;&gt;564&lt;/span&gt; packages &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; 3m

&lt;span class=&quot;token function&quot;&gt;npm&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt;  &lt;span class=&quot;token number&quot;&gt;166&lt;/span&gt;,14s user &lt;span class=&quot;token number&quot;&gt;16&lt;/span&gt;,06s system &lt;span class=&quot;token number&quot;&gt;92&lt;/span&gt;% cpu &lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt;:16,08 total&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ bun &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt;
bun &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; v0.1.5

 &lt;span class=&quot;token number&quot;&gt;563&lt;/span&gt; packages installed &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;9&lt;/span&gt;.70s&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;but-how-is-bun-so-fast&quot; tabindex=&quot;-1&quot;&gt;But how is Bun so fast?&lt;/h3&gt;
&lt;p&gt;The speed of Bun can be explained by two main factors: The choice of Javascript engine and low-level optimization of native implementations.&lt;/p&gt;
&lt;p&gt;Both Node.js and Deno use the V8 Javascript engine. While V8 is an awesome engine that powers the Chrome web browser, Bun chose the option that would yield the best performance at any cost. Jarred built Bun on &lt;strong&gt;the JavascriptCore engine&lt;/strong&gt;, the engine that powers Webkit, which seems to perform better in startup time as well as in some specific cases.&lt;/p&gt;
&lt;p&gt;Another decision that helped optimize Bun to the extreme is the choice of &lt;strong&gt;Zig&lt;/strong&gt; as the language it is built in. Zig is a low-level programming language without a garbage collector, like C or Rust. While those two could have been great choices to build Bun in, Zig has some unique features that Jarred valued, which ultimately led him to choose it over other languages. First, Zig has no hidden control flow, which makes it easier to ensure that no secondary functions are unexpectedly called at runtime. Zig also has a feature named &amp;quot;CompTime&amp;quot;: you can mark a piece of code to be executed at compile time instead of at run time, shaving precious time off the final execution.&lt;/p&gt;
&lt;h3 id=&quot;using-bun&quot; tabindex=&quot;-1&quot;&gt;Using Bun&lt;/h3&gt;
&lt;p&gt;To use Bun, first install it with the following command:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt; https://bun.sh/install &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;bash&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Yes, it is a shell command. No, (at the time of writing this article) it does not work for Windows. You can now start using Bun.&lt;/p&gt;
&lt;p&gt;For this example, we will build a web server. Bun comes with an HTTP server out-of-the-box. If the file used to start bun has a default export with a fetch function, it will start the HTTP server, but you can also use &lt;code&gt;Bun.serve()&lt;/code&gt; for clarity. Here&#39;s a basic example:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;http.ts&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&quot;language-typescript&quot;&gt;&lt;code class=&quot;language-typescript&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;export&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  port&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3000&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token function&quot;&gt;fetch&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;request&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; Request&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; Response &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Response&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Hello world!&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that for this example no library is needed. Both Typescript and the HTTP server just work. Now, let&#39;s run the code and check that the server is up and running:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ bun run http.ts&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ &lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt; http://localhost:3000
Hello world&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Excellent! We can now introduce new features to the server. We could use Bun&#39;s implementation of the Node.js fs API to read and return the contents of a file. We can also use the built-in SQLite client to fetch records from a database. Using the tools that Bun provides, we can implement this easily:&lt;/p&gt;
&lt;pre class=&quot;language-typescript&quot;&gt;&lt;code class=&quot;language-typescript&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  readFileSync
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&#39;fs&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  Database
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&#39;bun:sqlite&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;export&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  port&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3000&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;fetch&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;request&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; Request&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;Promise&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;Response&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; urlParts &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; request&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;url&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;match&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token regex&quot;&gt;&lt;span class=&quot;token regex-delimiter&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;token regex-source language-regex&quot;&gt;https?:&#92;/&#92;/.+&#92;/(.*)&lt;/span&gt;&lt;span class=&quot;token regex-delimiter&quot;&gt;/&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;_&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; resource&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; urlParts&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;resource &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&#39;file&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; file&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;readFileSync&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&#39;/home/roger/Projects/experiments/log/file.txt&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; encoding&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&#39;utf-8&#39;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Response&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;file&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;resource &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&#39;db&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; db &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Database&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&#39;db.sqlite&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; query &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&#39;SELECT * FROM foo&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; result &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; query&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;all&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
      db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;close&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Response&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;result&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Response&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&#39;Hello world!&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;what-can-i-use-bun-for&quot; tabindex=&quot;-1&quot;&gt;What can I use Bun for?&lt;/h3&gt;
&lt;p&gt;Let&#39;s be clear: &lt;strong&gt;Bun is not ready for production applications&lt;/strong&gt;. At the time of writing this article, the latest version of Bun is beta version 0.1.5. It still has a small ecosystem and community, many missing APIs, and some bugs and performance issues in edge cases. Nevertheless, these are temporary problems; Jarred and the community are working tirelessly to build the missing tools and APIs and fix the remaining bugs. When I first wrote this article, I described problems such as performance spikes and illegal hardware instructions, but issues like these are being solved so quickly that this article would be constantly out-of-date. Libraries built specifically for Bun are also appearing all the time. In the not-so-distant future, Bun will probably be a great tool for production environments, so it would be wise to keep an eye on it.&lt;/p&gt;
&lt;p&gt;That doesn&#39;t mean you shouldn&#39;t use Bun right now. If you need to build &lt;strong&gt;an application that is not production-sensitive&lt;/strong&gt;, it could be the way to go. If you want to build a quick proof-of-concept app and don&#39;t want to fiddle with TypeScript and transpiler libraries, it could be a great solution. Another possible use is a script that needs to perform at its best, when you can&#39;t be bothered to write it in Rust or C.&lt;/p&gt;
&lt;p&gt;In conclusion, keep an eye on Bun. What Jarred has built in just a year has already made a big impact on the JavaScript community, and it has the potential to completely revolutionize it.&lt;/p&gt;
</description>
      <pubDate>Tue, 26 Jul 2022 00:00:00 GMT</pubDate>
      <dc:creator>Roger Oriol</dc:creator>
      <guid>https://ruxu.dev/articles/languages-and-frameworks/bun/</guid>
    </item>
  </channel>
</rss>
