DuckDB.ExtensionKit: Building DuckDB Extensions in C#

2026-03-20 · https://duckdb.org/2026/03/20/duckdb-extensionkit-csharp

Introduction

DuckDB has a flexible extension mechanism that allows extensions to be loaded dynamically at runtime. This makes it easy to extend DuckDB’s main feature set without adding everything to the main binary. Extensions can add support for new file formats, introduce custom types, or provide new scalar and table functions. A significant part of DuckDB’s functionality is actually implemented using this extension mechanism in the form of core extensions, which are developed alongside the engine itself by the DuckDB team. For example, DuckDB can read and write JSON files via the json extension and integrate with PostgreSQL using the postgres extension.

DuckDB also has a thriving ecosystem of community extensions: third-party extensions, maintained by community members, that cover a wide range of use cases and integrations. For example, the crypto community extension exposes additional cryptographic functionality.

How Extensions Are Built Today

Today, developers can build extensions using the same C++ API that the core extensions use. A template for creating extensions is available in the extension-template repository. While powerful, the C++ extension API is tightly coupled to DuckDB's internals, so it can (and often will) change between DuckDB versions. Additionally, using it requires building the whole DuckDB engine, and its documentation is less complete than that of the C API.

To solve these issues, DuckDB also provides an experimental template for C/C++ based extensions that link with the C Extension API of DuckDB. This API provides a stable, backwards-compatible interface for developing extensions and is designed to allow extensions to work across different DuckDB versions. Because it is a C-based API, it can also be used from other programming languages such as Rust.

Even with the C API, writing extensions still means working at a low level: performing manual memory management and writing a lot of boilerplate code. While the C API solves stability and compatibility, it doesn't improve the developer experience for higher-level ecosystems. This is where DuckDB.ExtensionKit comes in, aiming to make extension development more accessible to developers working in the .NET ecosystem. By building on top of the DuckDB C Extension API and compiling extensions with .NET Native AOT (ahead-of-time) compilation, DuckDB.ExtensionKit offers the best of both worlds: native DuckDB extensions that integrate like any other extension, combined with the productivity and rich library ecosystem of C# and .NET.

DuckDB.ExtensionKit

DuckDB.ExtensionKit provides a set of C# APIs and build tooling for implementing DuckDB extensions. It exposes the low-level DuckDB C Extension API as C# methods, and also provides type-safe, higher-level APIs for defining scalar and table functions, while still producing native DuckDB extensions. The toolkit also includes a source generator that automatically generates the required boilerplate code, including the native entry point and API initialization.

With DuckDB.ExtensionKit, building an extension closely resembles building a regular C# library. Extension authors create a C# project that references the ExtensionKit runtime and implements functions using the provided, type-safe APIs that expose DuckDB concepts.

At build time, the source generator emits the required boilerplate, including the native entry point and extension initialization. The project is then compiled using .NET Native AOT, producing a native DuckDB extension binary that can be loaded and used by DuckDB like any other extension, without requiring a .NET runtime.

To make this process concrete, the following snippet shows a small DuckDB extension implemented with DuckDB.ExtensionKit that exposes both a scalar function and a table function for working with JWTs (JSON Web Tokens). At a high level, writing an extension with DuckDB.ExtensionKit involves defining a C# type that represents the extension and registering functions explicitly. In the example below, this is done by creating a partial class annotated with the [DuckDBExtension] attribute and implementing the RegisterFunctions method. The implementation uses the System.IdentityModel.Tokens.Jwt NuGet package, illustrating how extensions can easily take advantage of existing .NET libraries.

We'll add two functions: a scalar function for extracting a single claim from a JWT and a table function for extracting all of its claims.

public static partial class JwtExtension
{
  private static void RegisterFunctions(DuckDBConnection connection)
  {
    connection.RegisterScalarFunction<string, string, string?>("extract_claim_from_jwt", ExtractClaimFromJwt);

    connection.RegisterTableFunction("extract_claims_from_jwt", (string jwt) => ExtractClaimsFromJwt(jwt),
                                     c => new { claim_name = c.Key, claim_value = c.Value });
  }

  private static string? ExtractClaimFromJwt(string jwt, string claim)
  {
    var jwtHandler = new JwtSecurityTokenHandler();
    var token = jwtHandler.ReadJwtToken(jwt);
    return token.Claims.FirstOrDefault(c => c.Type == claim)?.Value;
  }

  private static Dictionary<string, string> ExtractClaimsFromJwt(string jwt)
  {
    var jwtHandler = new JwtSecurityTokenHandler();
    var token = jwtHandler.ReadJwtToken(jwt);
    return token.Claims.ToDictionary(c => c.Type, c => c.Value);
  }
}

In just 25 lines, we have built an extension that adds the extract_claim_from_jwt and extract_claims_from_jwt functions to DuckDB. We can call them just like any other function. For example, to extract the name claim from a token, we can run:

SELECT extract_claim_from_jwt(
    'eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsImtpZCI6ImExZmIyY2NjN2FiMjBiMDYyNzJmNGUxMjIwZDEwZmZlIn0.eyJpc3MiOiJodHRwczovL2lkcC5sb2NhbCIsImF1ZCI6Im15X2NsaWVudF9hcHAiLCJuYW1lIjoiR2lvcmdpIERhbGFraXNodmlsaSIsInN1YiI6IjViZTg2MzU5MDczYzQzNGJhZDJkYTM5MzIyMjJkYWJlIiwiYWRtaW4iOnRydWUsImV4cCI6MTc2NjU5MTI2NywiaWF0IjoxNzY2NTkwOTY3fQ.N7h2xc4rgS4oPo8IO9wyG1lnr2wqTUC80YudWTXp7rXmU2JdsUiweKmuYVVbygdJAR4PJmbQtak4_VuZg2fZFILVpzDyLvGITfUW_18XuDQ_SIm3VlfAuHOVHfruuvvSAfjUkTW2Jlrv3ihFYgusV58vjhcVFHssOGMEbtMNo10Jf62dczVVGNZXh_OOLS0nTLffhY94sZddqQIE56W8xhLK5YMO4gO8voMzhUwDwucnVvyNfui38MPDNdTSKjn3Ab0hG8jzOVhbYSCHf0eQsbxPzGtXUCJobScWDb78IphFWec6W4ugIYp5CMh3C_noQi94NYjQg2P-AJ5FLCKzKA',
    'name'
);

This returns Giorgi Dalakishvili. Let's test the table function:

SELECT *
FROM extract_claims_from_jwt(
    'eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsImtpZCI6ImExZmIyY2NjN2FiMjBiMDYyNzJmNGUxMjIwZDEwZmZlIn0.eyJpc3MiOiJodHRwczovL2lkcC5sb2NhbCIsImF1ZCI6Im15X2NsaWVudF9hcHAiLCJuYW1lIjoiR2lvcmdpIERhbGFraXNodmlsaSIsInN1YiI6IjViZTg2MzU5MDczYzQzNGJhZDJkYTM5MzIyMjJkYWJlIiwiYWRtaW4iOnRydWUsImV4cCI6MTc2NjU5MTI2NywiaWF0IjoxNzY2NTkwOTY3fQ.N7h2xc4rgS4oPo8IO9wyG1lnr2wqTUC80YudWTXp7rXmU2JdsUiweKmuYVVbygdJAR4PJmbQtak4_VuZg2fZFILVpzDyLvGITfUW_18XuDQ_SIm3VlfAuHOVHfruuvvSAfjUkTW2Jlrv3ihFYgusV58vjhcVFHssOGMEbtMNo10Jf62dczVVGNZXh_OOLS0nTLffhY94sZddqQIE56W8xhLK5YMO4gO8voMzhUwDwucnVvyNfui38MPDNdTSKjn3Ab0hG8jzOVhbYSCHf0eQsbxPzGtXUCJobScWDb78IphFWec6W4ugIYp5CMh3C_noQi94NYjQg2P-AJ5FLCKzKA'
);

This returns:

claim_name  claim_value
iss         https://idp.local
aud         my_client_app
name        Giorgi Dalakishvili
sub         5be86359073c434bad2da3932222dabe
admin       true
exp         1766591267
iat         1766590967
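For readers unfamiliar with JWTs: the claim extraction performed by the example above can be sketched in plain Python. This is an illustration only, not the extension's actual implementation: extract_claim and b64url are hypothetical helpers, and no signature verification is performed (just as JwtSecurityTokenHandler.ReadJwtToken does not verify signatures).

```python
import base64
import json

def extract_claim(jwt: str, claim: str):
    """Decode the (unverified) payload segment of a JWT and pull out one claim."""
    payload_b64 = jwt.split(".")[1]
    # base64url may omit padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get(claim)

def b64url(obj) -> str:
    """Encode a dict as an unpadded base64url JSON segment."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

# Build a sample JWT-shaped token (header.payload.signature) for illustration.
token = ".".join([
    b64url({"alg": "none", "typ": "JWT"}),
    b64url({"iss": "https://idp.local", "name": "Alice", "admin": True}),
    "",  # empty signature segment
])

print(extract_claim(token, "name"))  # prints "Alice"
```

The table function in the example does the same decoding, but returns every claim of the payload as one row.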

How DuckDB.ExtensionKit Works

DuckDB.ExtensionKit relies on several modern C# language and runtime features to efficiently bridge DuckDB’s C extension API to managed code. These features make it possible to build native extensions in C# without introducing a managed runtime dependency at load time.

Function Pointers

DuckDB’s C extension API is exposed as a versioned function table: a large struct (duckdb_ext_api_v1) whose fields are C function pointers (e.g., duckdb_open, duckdb_register_scalar_function, duckdb_vector_get_data, and so on). DuckDB.ExtensionKit mirrors this mechanism in C#. It defines a C# representation of the struct (DuckDBExtApiV1), where each field is declared as a C# function pointer (delegate* unmanaged[Cdecl]<...>). This maps the C ABI directly: calling into DuckDB becomes a simple indirect call through a function pointer field, rather than a delegate invocation with runtime marshaling.
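The function-table pattern described above can be mimicked in any FFI layer. The sketch below uses Python's ctypes purely as an illustration: ToyApiV1, add, and mul are made-up stand-ins, not the layout of the real duckdb_ext_api_v1 struct.

```python
import ctypes

# A C-style function pointer type: int (*)(int, int), cdecl.
BINOP = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int, ctypes.c_int)

# Toy stand-in for a versioned API table: a struct whose fields
# are function pointers, handed across the ABI boundary in one piece.
class ToyApiV1(ctypes.Structure):
    _fields_ = [
        ("add", BINOP),
        ("mul", BINOP),
    ]

api = ToyApiV1(
    add=BINOP(lambda a, b: a + b),
    mul=BINOP(lambda a, b: a * b),
)

# Calling a field is an indirect call through a function pointer --
# the same mechanism an extension uses to invoke the host's API.
print(api.add(2, 3))  # 5
print(api.mul(4, 5))  # 20
```

Versioning the struct (v1, v2, ...) lets the host grow the API by appending fields without breaking extensions compiled against an older table.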

Entrypoint

A DuckDB extension needs to expose an entrypoint function following the C calling convention; the symbol exported from the binary is the name of the extension followed by _init_c_api. This way, DuckDB can locate the entrypoint when the extension is loaded. In the C extension template, this is handled with macros that generate the exported function and the surrounding boilerplate.

DuckDB.ExtensionKit follows the same model, but generates the boilerplate from C# instead of C macros. The source generator emits a native-compatible entrypoint that retrieves the API table (via the access object) and performs the required initialization, just like the C template does. The generated method is annotated with [UnmanagedCallersOnly(EntryPoint = "...")], which instructs the .NET toolchain to export a real native symbol with that name and make it callable from C. With .NET Native AOT, this becomes an actual exported function in the produced binary – allowing DuckDB to load and call into the extension exactly as it would for a C implementation.
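The naming convention makes the lookup purely mechanical. The sketch below derives the symbol name and then, as a stand-in for DuckDB's loader, resolves a function from a shared library by name via ctypes (using the C math library's cos, since we cannot load a real extension here; entrypoint_symbol is a hypothetical helper):

```python
import ctypes
import ctypes.util

def entrypoint_symbol(extension_name: str) -> str:
    # DuckDB derives the exported symbol from the extension name.
    return f"{extension_name}_init_c_api"

print(entrypoint_symbol("jwt"))  # jwt_init_c_api

# Resolving an exported symbol by name from a loaded binary is the
# same dlopen/dlsym-style mechanism a host uses to find an entrypoint.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)
cos = libm.cos
cos.restype = ctypes.c_double
cos.argtypes = [ctypes.c_double]
print(cos(0.0))  # 1.0
```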

Native AOT

Finally, Native AOT is what makes this approach practical for DuckDB extensions. Once the extension code and generated sources are compiled, the project is published using .NET Native AOT. This step produces a native binary with no dependency on a managed runtime at load time. The resulting artifact is a native DuckDB extension that can be loaded and executed in the same way as extensions written in C or C++. From DuckDB’s perspective, there is no difference between an extension built with DuckDB.ExtensionKit and one implemented in a traditional native language.

Current Status and Limitations

DuckDB.ExtensionKit, just like the C extension template, is currently experimental. The APIs are still evolving, and not all extension features supported by DuckDB are exposed yet.

The toolkit relies on .NET Native AOT, which means extensions need to be built for specific target platforms (for example, linux-x64, osx-arm64, or win-x64). As with other native extensions, binaries are platform-specific and need to be built accordingly.

Build Your Own Extension in C#

DuckDB.ExtensionKit is available as an open-source project on GitHub under the MIT license. The project includes example extensions that demonstrate how to define and build DuckDB extensions in C#. The repository contains a JWT-based example extension that showcases both scalar functions and table functions, as well as the full build and publishing workflow using .NET Native AOT.

Feedback, bug reports, and contributions are welcome through GitHub issues.

Closing Thoughts

DuckDB’s extension mechanism has proven to be a flexible foundation for extending the system without complicating the core engine. DuckDB.ExtensionKit explores how this mechanism can be made accessible to a broader audience by leveraging the .NET ecosystem, while still producing native extensions that integrate directly with DuckDB.

Although C# is typically viewed as a high-level language, this project demonstrates that it can also be used to implement low-level, ABI-compatible components when needed. By combining modern C# features with DuckDB’s existing extension interface, it is possible to write extensions in a high-level language without giving up control over native boundaries.

Giorgi Dalakishvili
Big Data on the Cheapest MacBook

2026-03-11 · https://duckdb.org/2026/03/11/big-data-on-the-cheapest-macbook

Apple released the MacBook Neo today and there is no shortage of tech reviews explaining whether it's the right device for you if you are a student, a photographer or a writer. What they don't tell you is whether it fits into our Big Data on Your Laptop ethos. We wanted to answer this using a data-driven approach, so we went to the nearest Apple Store, picked one up and took it for a spin.

What's in the Box?

Well, not much! If you buy this machine in the EU, there isn't even a charging brick included. All you get is the laptop and a braided USB-C cable. But you likely already have a few USB-C bricks lying around – let's move on to the laptop itself!

The only part of the hardware specification that you can select is the disk: you can pick either 256 or 512 GB. As our mission is to deal with alleged “Big Data”, we picked the larger option, which brings the price to $700 in the US or €800 in the EU. The amount of memory is fixed at 8 GB. And while there is only a single CPU option, it is quite an interesting one: this laptop is powered by the 6-core Apple A18 Pro, originally built for the iPhone 16 Pro.

It turns out that we have already tested this phone under some unusual circumstances. Back in 2024, with DuckDB v1.2-dev, we found that the iPhone 16 Pro could complete all TPC-H queries at scale factor 100 in about 10 minutes when air-cooled and in less than 8 minutes while lying in a box of dry ice. The MacBook Neo should definitely be able to handle this workload – but maybe it can even handle a bit more. Cue the inevitable benchmarks!

ClickBench

For our first experiment, we used ClickBench, an analytical database benchmark. ClickBench has 43 queries that focus on aggregation and filtering operations. The operations run on a single wide table with 100M rows, which uses about 14 GB when serialized to Parquet and 75 GB when stored in CSV format.

Benchmark Environment

We ported ClickBench's DuckDB implementation to macOS and ran it on the MacBook Neo using the freshly minted v1.5.0 release. We only applied a small tweak: as suggested in our performance guide, we slightly lowered the memory limit to 5 GB, to reduce relying on the OS' swapping and to let DuckDB handle memory management for larger-than-memory workloads. This is a common trick in memory-constrained environments where other processes are likely using more than 20% of the total system memory.

We also re-ran ClickBench with DuckDB v1.5.0 on two cloud instances, yielding the following lineup:

  • MacBook Neo (6-core Apple A18 Pro, 8 GB RAM, local NVMe SSD)
  • c6a.4xlarge (16 vCPUs, 32 GB RAM, network-attached storage)
  • c8g.metal-48xl (192 vCPUs, 384 GB RAM, network-attached storage)

The benchmark script first loaded the Parquet file into the database. Then, as per ClickBench's rules, it ran each query three times to capture both cold runs (the first run when caches are cold) and hot runs (when the system has a chance to exploit e.g. file system caching).

Results and Analysis

Our experiments produced the following aggregate runtimes, in seconds:

Machine          Cold run (median)   Cold run (total)   Hot run (median)   Hot run (total)
MacBook Neo                  0.57              59.73               0.41              54.27
c6a.4xlarge                  1.34             145.08               0.50              47.86
c8g.metal-48xl               1.54             169.67               0.05               4.35

Cold run. The results start with a big surprise: in the cold run, the MacBook Neo is the clear winner with a sub-second median runtime, completing all queries in under a minute! Of course, if we dig deeper into the setups, there is an explanation for this. The cloud instances have network-attached disks, and accessing the database on these dominates the overall query runtimes. The MacBook Neo has a local NVMe SSD, which is far from best-in-class, but still provides relatively quick access on the first read.

Hot run. In the hot runs, the MacBook's total runtime only improves by approximately 10%, while the cloud machines come into their own, with the c8g.metal-48xl winning by an order of magnitude. However, it's worth noting that on median query runtimes the MacBook Neo can still beat the c6a.4xlarge, a mid-sized cloud instance. And the laptop's total runtime is only about 13% slower despite the cloud box having 10 more CPU threads and 4 times as much RAM.
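The percentages quoted above follow directly from the results table; a quick sanity check:

```python
# Total runtimes (seconds) from the ClickBench results table.
cold_total = {"MacBook Neo": 59.73, "c6a.4xlarge": 145.08, "c8g.metal-48xl": 169.67}
hot_total  = {"MacBook Neo": 54.27, "c6a.4xlarge": 47.86,  "c8g.metal-48xl": 4.35}

# Hot vs. cold improvement on the MacBook: roughly 10%.
macbook_improvement = 1 - hot_total["MacBook Neo"] / cold_total["MacBook Neo"]
print(f"{macbook_improvement:.1%}")  # 9.1%

# Hot-run gap between the laptop and the mid-sized cloud box: about 13%.
slowdown_vs_c6a = hot_total["MacBook Neo"] / hot_total["c6a.4xlarge"] - 1
print(f"{slowdown_vs_c6a:.1%}")  # 13.4%
```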

TPC-DS

For our second experiment, we picked the queries of the TPC-DS benchmark. Compared to the ubiquitous TPC-H benchmark, which has 8 tables and 22 queries, TPC-DS has 24 tables and 99 queries, many of which are more complex and include features such as window functions. And while TPC-H has been optimized to death, there is still some semblance of value in TPC-DS results. Let's see whether the cheapest MacBook can handle these queries!

For this round, we used DuckDB's LTS version, v1.4.4. We generated the datasets using DuckDB's tpcds extension and set the memory limit to 6 GB.

At SF100, the laptop breezed through most queries with a median query runtime of 1.63 seconds and a total runtime of 15.5 minutes.

At SF300, the memory constraint started to show. While the median query runtime was still quite good at 6.90 seconds, DuckDB occasionally used up to 80 GB of space for spilling to disk and it was clear that some queries were going to take a long time. Most notably, query 67 took 51 minutes to complete. But hardware and software continued to work together tirelessly, and they ultimately passed the test, completing all queries in 79 minutes.

Should You Buy One?

Here's the thing: if you are running Big Data workloads on your laptop every day, you probably shouldn't get the MacBook Neo. Yes, DuckDB runs on it, and can handle a lot of data by leveraging out-of-core processing. But the MacBook Neo's disk I/O is lackluster compared to the Air and Pro models (about 1.5 GB/s compared to 3–6 GB/s), and the 8 GB memory will be limiting in the long run. If you need to process Big Data on the move and can pay up a bit, the other MacBook models will serve your needs better and there are also good options for Linux and Windows.

All that said, if you run DuckDB in the cloud and primarily use your laptop as a client, this is a great device. And you can rest assured that if you occasionally need to crunch some data locally, DuckDB on the MacBook Neo will be up to the challenge.

Gábor Szárnyás
Announcing DuckDB 1.5.0

2026-03-09 · https://duckdb.org/2026/03/09/announcing-duckdb-150

We are proud to release DuckDB v1.5.0, codenamed “Variegata” after the paradise shelduck (Tadorna variegata), endemic to New Zealand.

In this blog post, we cover the most important updates for this release around support, features and extensions. As always, there is more: for the complete release notes, see the release page on GitHub.

To install the new version, please visit the installation page. Note that it can take a few days to release some extensions (e.g., the UI) and client libraries (e.g., Go, R, Java) due to the extra changes and review rounds required.

With this release, we will have two DuckDB releases available: v1.4 (LTS) and v1.5 (current). The next release – planned for September – will ship a major version, DuckDB 2.0.

New Features

Command Line Client

For users who use DuckDB through the terminal, the highlight of the new release is a rework of the CLI client with a new color scheme, dynamic prompts, a pager and many other convenience features.

Color Scheme

We shipped a new color palette and harmonized it with the documentation. The color palette is available in both dark mode and light mode. Both use two shades of gray, and five colors for keywords, strings, errors, functions and numbers. You can find the color palette in the Design Manual.

You can customize the color scheme using the .highlight_colors dot command:

.highlight_colors column_name darkgreen bold_underline
.highlight_colors numeric_value red bold
.highlight_colors string_value purple2
FROM ducks;

(Screenshots: the DuckDB CLI in light mode and in dark mode.)

Dynamic Prompts in the CLI

DuckDB v1.5.0 introduces dynamic prompts for the CLI (PR #19579). By default, these show the database and schema that you are currently connected to:

duckdb
memory D ATTACH 'my_database.duckdb';
memory D USE my_database;
my_database D CREATE SCHEMA my_schema;
my_database D USE my_schema;
my_database.my_schema D ...

These prompts can be configured using bracket codes to have a maximum length, run a custom query, use different colors, etc. (#19579).

.tables and DESCRIBE

To show the columns of an individual table, use the DESCRIBE statement:

memory D ATTACH 'https://blobs.duckdb.org/data/animals.db' AS animals_db;
memory D USE animals_db;
animals_db D DESCRIBE ducks;
┌──────────────────────┐
│        ducks         │
│                      │
│ id           integer │
│ name         varchar │
│ extinct_year integer │
└──────────────────────┘

The .tables dot command lists the attached catalogs, the schemas and tables in them, and the columns in each table.

memory D ATTACH 'https://blobs.duckdb.org/data/animals.db' AS animals_db;
memory D ATTACH 'https://blobs.duckdb.org/data/numbers1.db';
memory D .tables
 ────────────── animals_db ───────────────
 ───────────────── main ──────────────────
┌─────────────────┐┌──────────────────────┐
│      swans      ││        ducks         │
│                 ││                      │
│ id      integer ││ id           integer │
│ name    varchar ││ name         varchar │
│ species varchar ││ extinct_year integer │
│ color   varchar ││                      │
│ habitat varchar ││        5 rows        │
│                 │└──────────────────────┘
│     3 rows      │
└─────────────────┘
  numbers1
 ── main ──
┌──────────┐
│   tbl    │
│          │
│ i bigint │
│          │
│  2 rows  │
└──────────┘

Accessing the Last Result Using _

You can access the last result of a query inline using the underscore character _. This is not only convenient but also makes it unnecessary to re-run potentially long-running queries:

memory D ATTACH 'https://blobs.duckdb.org/data/animals.db' AS animals_db;
memory D USE animals_db;
animals_db D FROM ducks WHERE extinct_year IS NOT NULL;
┌───────┬──────────────────┬──────────────┐
│  id   │       name       │ extinct_year │
│ int32 │     varchar      │    int32     │
├───────┼──────────────────┼──────────────┤
│     1 │ Labrador Duck    │         1878 │
│     3 │ Crested Shelduck │         1964 │
│     5 │ Pink-headed Duck │         1949 │
└───────┴──────────────────┴──────────────┘
animals_db D FROM _;
┌───────┬──────────────────┬──────────────┐
│  id   │       name       │ extinct_year │
│ int32 │     varchar      │    int32     │
├───────┼──────────────────┼──────────────┤
│     1 │ Labrador Duck    │         1878 │
│     3 │ Crested Shelduck │         1964 │
│     5 │ Pink-headed Duck │         1949 │
└───────┴──────────────────┴──────────────┘

Pager

Last but not least, the CLI now has a pager! It is triggered when there are more than 50 rows in the results.

memory D .maxrows 100
memory D FROM range(0, 100);

You can navigate on Linux and Windows using Page Up / Page Down. On macOS, use Fn + Up / Down. To exit the pager, press Q.

The initial implementation of the pager was provided by tobwen in #19004.

PEG Parser

DuckDB v1.5 ships an experimental parser based on PEG (Parsing Expression Grammars). The new parser enables better suggestions, improved error messages, and allows extensions to extend the grammar. The PEG parser is currently disabled by default but you can opt in using:

CALL enable_peg_parser();

The PEG parser is already used for generating suggestions. You can cycle through the options using TAB.

animals_db D FROM ducks WHERE habitat IS 
IS           ISNULL       ILIKE        IN           INTERSECT    LIKE

We are planning to make the switch to the new parser in the upcoming DuckDB release.

As a tradeoff, the new parser has a slight performance overhead; however, this is in the range of milliseconds and thus negligible for analytical queries. For more details on the rationale for using a PEG parser and for benchmark results, please refer to the CIDR 2026 paper by Hannes and Mark, or their blog post summarizing the paper.

VARIANT Type

DuckDB now natively supports the VARIANT type, inspired by Snowflake's semi-structured VARIANT data type and available in Parquet since 2025. Unlike the JSON type, which is physically stored as text, VARIANT stores typed, binary data. Each row in a VARIANT column is self-contained with its own type information. This leads to better compression and query performance. Here are a few examples of using VARIANT.

Store different types in the same column:

CREATE TABLE events (id INTEGER, data VARIANT);
INSERT INTO events VALUES
    (1, 42::VARIANT),
    (2, 'hello world'::VARIANT),
    (3, [1, 2, 3]::VARIANT),
    (4, {'name': 'Alice', 'age': 30}::VARIANT);

SELECT * FROM events;
┌───────┬────────────────────────────┐
│  id   │            data            │
│ int32 │          variant           │
├───────┼────────────────────────────┤
│     1 │ 42                         │
│     2 │ hello world                │
│     3 │ [1, 2, 3]                  │
│     4 │ {'name': Alice, 'age': 30} │
└───────┴────────────────────────────┘

Check the underlying type of each row:

SELECT id, data, variant_typeof(data) AS vtype
FROM events;
┌───────┬────────────────────────────┬───────────────────┐
│  id   │            data            │       vtype       │
│ int32 │          variant           │      varchar      │
├───────┼────────────────────────────┼───────────────────┤
│     1 │ 42                         │ INT32             │
│     2 │ hello world                │ VARCHAR           │
│     3 │ [1, 2, 3]                  │ ARRAY(3)          │
│     4 │ {'name': Alice, 'age': 30} │ OBJECT(name, age) │
└───────┴────────────────────────────┴───────────────────┘

You can extract fields from nested variants using the dot notation or the variant_extract function:

SELECT data.name FROM events WHERE id = 4;
-- or 
SELECT variant_extract(data, 'name') AS name FROM events WHERE id = 4;
┌─────────┐
│  name   │
│ variant │
├─────────┤
│ Alice   │
└─────────┘

DuckDB also supports reading VARIANT types from Parquet files, including shredded variants (where parts of the variant are stored as regular typed columns).
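To build intuition for “typed, binary, self-contained per row”, here is a toy tagged encoding in Python. This is emphatically NOT DuckDB's on-disk VARIANT format; encode/decode and the tag values are invented for illustration only.

```python
import struct
import json

# Each value carries a one-byte type tag followed by its payload,
# so every encoded value is self-describing.
TAG_INT, TAG_STR, TAG_JSON = 0, 1, 2

def encode(value) -> bytes:
    if isinstance(value, int) and not isinstance(value, bool):
        return bytes([TAG_INT]) + struct.pack("<q", value)
    if isinstance(value, str):
        raw = value.encode()
        return bytes([TAG_STR]) + struct.pack("<I", len(raw)) + raw
    raw = json.dumps(value).encode()  # lists/objects: JSON bytes as a fallback
    return bytes([TAG_JSON]) + struct.pack("<I", len(raw)) + raw

def decode(buf: bytes):
    tag, payload = buf[0], buf[1:]
    if tag == TAG_INT:
        return struct.unpack("<q", payload)[0]
    if tag == TAG_STR:
        return payload[4:].decode()
    return json.loads(payload[4:])

# Mixed types round-trip through the same "column".
for v in [42, "hello world", [1, 2, 3], {"name": "Alice", "age": 30}]:
    assert decode(encode(v)) == v
```

Because the type tag travels with the value, a reader can dispatch on it per row, which is what makes functions like variant_typeof possible.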

read_duckdb Function

The read_duckdb table function can read DuckDB databases without first attaching them. This can make reading from DuckDB databases more ergonomic – for example, you can use globbing. You can read the example numbers databases as follows:

SELECT min(i), max(i)
FROM read_duckdb('numbers*.db');
┌────────┬────────┐
│ min(i) │ max(i) │
│ int64  │ int64  │
├────────┼────────┤
│      1 │      5 │
└────────┴────────┘

Azure Writes

You can now write to Azure Blob Storage or ADLSv2 storage using the COPY statement:

-- Write query results to a Parquet file on Blob Storage
COPY (SELECT * FROM my_table)
TO 'az://my_container/path/output.parquet';

-- Write a table to a CSV file on ADLSv2 Storage
COPY my_table
TO 'abfss://my_container/path/output.csv';

ODBC Scanner

We are now shipping an ODBC scanner extension. This allows you to query a remote endpoint as follows:

LOAD odbc_scanner;
SET VARIABLE conn = odbc_connect('Driver={Oracle Driver};DBQ=//127.0.0.1:1521/XE;UID=scott;PWD=tiger;');
SELECT * FROM odbc_query(getvariable('conn'), 'SELECT SYSTIMESTAMP FROM dual;');

In the coming weeks, we'll publish the documentation page and release a follow-up post on the ODBC scanner. In the meantime, please refer to the project's README.

Major Changes

Lakehouse Updates

All of DuckDB’s supported Lakehouse formats have received some updates for v1.5.

DuckLake

The main DuckLake change for DuckDB v1.5 is updating the DuckLake specification to v0.4. We are aiming for this to be the same specification that ships with DuckLake 1.0, which will be released in April. Its main highlights include:

  • Macro support.
  • Sorted tables.
  • Deletion inlining and addition of partial delete files.
  • Internal rework of DuckLake options.

We'll announce more details about these features in the blog post for DuckLake v1.

Delta Lake

For the Delta Lake extension, the team has focused on improving support for writes via Unity Catalog, Delta idempotent writes and table CHECKPOINTs.

Iceberg

For the Iceberg extension, the team is working on a larger release for v1.5.1. For v1.5.0, the main feature is the addition of table properties in the CREATE TABLE statement:

CREATE TABLE test_create_table (a INTEGER)
WITH (
    'format-version' = '2', -- recognized properties like format-version are elevated to table metadata
    'location' = 's3://path/to/data', -- location determines where the table data is stored
    'property1' = 'value1', -- other properties are stored as plain table properties
    'property2' = 'value2'
);

Other minor additions have been made to enable passing EXTRA_HTTP_HEADERS when attaching to an Iceberg catalog, which has unlocked Google’s BigLake.

Both Delta and DuckLake have implemented the VARIANT type. Iceberg’s VARIANT type will ship in the v1.5.1 release with some other features that are specific to the Iceberg v3 specification.

Network Stack

The default backend for the httpfs extension has changed from httplib to curl. As one of the most popular and well-tested open-source projects, we expect curl to provide long-standing stability and security for DuckDB. Regardless of the http library used, openssl is still the backing SSL library and options such as http_timeout, http_retries, etc. are still the same.

Our community has been testing the new network stack for the last few weeks. Still, if you encounter any issues, please submit them to the duckdb-httpfs repository.


Due to technical reasons, httplib is still the library we use for downloading the httpfs extension itself. Once httpfs is loaded with the (now default) curl backend, subsequent extension installations go through https://, with the default endpoint for core extensions pointing to https://extensions.duckdb.org.

All core and community extensions are cryptographically signed, so installing them through http:// does not pose a security risk. However, some users reported issues with http:// extension installs in environments with firewalls, which the move to https:// should resolve.

Lambda Syntax

Up to DuckDB v1.2, the syntax for defining lambda expressions used the arrow notation x -> x + 1. While this was a nice syntax, it clashed with the JSON extract operator (->) due to operator precedence and led to error messages that some users found difficult to troubleshoot. To work around this, we introduced a new, Python-style lambda syntax in v1.3, lambda x: x + 1.

While DuckDB v1.5 supports both styles of writing lambda expressions, using the deprecated arrow syntax will now throw a warning:

SELECT list_transform([1, 2, 3], x -> x + 1);
WARNING:
Deprecated lambda arrow (->) detected. Please transition to the new lambda syntax, i.e., lambda x, i: x + i, before DuckDB's next release.

You can use the lambda_syntax configuration option to change this behavior to suppress the warning or to behave more strictly:

-- Suppress the warning
SET lambda_syntax = 'ENABLE_SINGLE_ARROW';
-- Turn the deprecation warning into an error
SET lambda_syntax = 'DISABLE_SINGLE_ARROW';

DuckDB 2.0 will disable the single arrow syntax by default and it will only be available if you opt-in explicitly.

Spatial Extension

The spatial extension ships several important changes.

Breaking Change: Flipping of Axis Order

Most functions in spatial operate in Cartesian space and are unaffected by axis order, e.g., whether the X and Y axes represent “longitude” and “latitude” or the other way around. But there are some functions where this matters, and where the assumption, counterintuitively, is that all input geometries use (x = latitude, y = longitude). These are:

  • ST_Distance_Spheroid
  • ST_Perimeter_Spheroid
  • ST_Area_Spheroid
  • ST_Distance_Sphere
  • ST_DWithin_Spheroid

Additionally, ST_Transform also expects the input geometries to be in the axis order defined by the source coordinate reference system, which for e.g. EPSG:4326 is also (x = latitude, y = longitude).

This has been a long-standing source of confusion and numerous issues, as other databases, formats and GIS systems tend to always treat X as “easting”, “left-right” or “longitude”, and Y as “northing”, “up-down” or “latitude”.

We are changing how this currently works in DuckDB to be consistent with how other systems operate, and hopefully cause less confusion for new users in the future. However, to avoid silently breaking existing workflows that have adapted to this quirk (e.g., by using ST_FlipCoordinates), we are rolling out this change gradually via a new geometry_always_xy setting:

  • In DuckDB v1.5, setting geometry_always_xy = true enables the new behavior (x = longitude, y = latitude). Without it, affected functions emit a warning.
  • In DuckDB v2.0, the warning will become an error. Set geometry_always_xy = false to preserve the old behavior.
  • In DuckDB v2.1, geometry_always_xy = true will become the default.

So to summarize, nothing is changing by default in this release, but to avoid being affected by this change in the future, set geometry_always_xy explicitly now. Set it to true to opt into the new behavior, or false to keep the existing one.
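For example, a sketch of opting in today; the coordinates below are illustrative, and with the new behavior points are written as (longitude, latitude):

```sql
-- Opt in to the new axis order now
SET geometry_always_xy = true;

-- With x = longitude and y = latitude, as in most other GIS systems
SELECT ST_Distance_Sphere(
    ST_Point(4.3571, 52.0116),   -- illustrative lon/lat pair
    ST_Point(4.8952, 52.3702)    -- illustrative lon/lat pair
) AS distance_m;
```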

Geometry Rework

GEOMETRY Becomes a Built-In Type

The GEOMETRY type has been moved from the spatial extension into core DuckDB!

Geospatial data is no longer niche. The Parquet standard now treats GEOMETRY as a first-class column type, and open table formats like Apache Iceberg and DuckLake are moving in the same direction. Many widely used data formats and systems also have geospatial counterparts—GeoJSON, PostGIS, GeoPandas, GeoPackage/Spatialite, and more.

DuckDB already offers extensions that integrate with many of these formats and systems. But there’s a structural problem: as long as GEOMETRY lives inside the spatial extension, other extensions that want to read or write geospatial data must either depend on spatial, implement their own incompatible geometry representation, or force users to handle the conversions themselves.

By moving GEOMETRY into DuckDB’s core, extensions can now produce and consume geometry values natively, without depending on spatial. While the spatial extension still provides most of the functions for working with geometries, the type itself becomes a shared foundation that the entire ecosystem can build on. We’ve already added GEOMETRY support to the Postgres scanner and GeoArrow conversion for Arrow import and export. Geometry support in additional extensions is coming soon.

This change also enables deeper integration with DuckDB’s storage engine and query optimizer, unlocking new compression techniques, query optimizations, and CRS awareness capabilities that were not possible when GEOMETRY only existed as an extension type. This is all documented in the new geometry page in the documentation, but we will highlight some below.

Improved Storage: WKB and Shredding

Geometry values are now stored using the industry-standard little-endian Well-Known Binary (WKB) encoding, replacing the custom format used by the spatial extension. However, we are still experimenting with the in-memory representation we want to use in the execution engine, so you should still use the conversion functions (e.g., ST_AsWKT, ST_AsWKB, ST_GeomFromText, ST_GeomFromWKB) when moving data in and out of DuckDB.
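For example, a minimal round trip through the conversion functions named above:

```sql
-- Text in, WKB out: keep explicit conversions at the boundaries
SELECT ST_AsWKB(ST_GeomFromText('POINT (1 2)')) AS wkb;

-- And back from WKB to text
SELECT ST_AsWKT(ST_GeomFromWKB(ST_AsWKB(ST_GeomFromText('POINT (1 2)')))) AS wkt;
```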

We’ve also implemented a new storage technique specialized for GEOMETRY. When a geometry column contains values that all share the same type and vertex dimensions, DuckDB can additionally apply "shredding": rather than storing opaque blobs, the column is decomposed into primitive STRUCT, LIST, and DOUBLE segments that compress far more efficiently. This can reduce on-disk size by roughly 3x for uniform geometry columns such as point clouds. Shredding is applied automatically for uniform row groups of a certain size, but can be configured via the geometry_minimum_shredding_size configuration option.
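If you want to tune when shredding kicks in, the configuration option can be set explicitly. The threshold value below is purely illustrative; consult the geometry documentation for the actual default:

```sql
-- Illustrative threshold value, not the documented default
SET geometry_minimum_shredding_size = 1024;
```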

Geometry Statistics and Query Optimization

Geometry columns now track per-row-group statistics - including the bounding box and the set of geometry types and vertex dimensions present. The query optimizer can use these to skip row groups that cannot match a query's spatial predicates, similar to min/max pruning for numeric columns. The && (bounding box intersection) operator is the first to benefit; broader support across spatial functions is in progress.
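A sketch of a query that can benefit from this pruning, assuming a table geoms with a GEOMETRY column geom and using ST_MakeEnvelope from the spatial extension:

```sql
-- Row groups whose bounding boxes cannot intersect the window are skipped
SELECT count(*)
FROM geoms
WHERE geom && ST_MakeEnvelope(4.0, 52.0, 5.0, 53.0);
```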

Coordinate Reference System Support

The GEOMETRY type now accepts an optional CRS parameter (e.g., GEOMETRY('OGC:CRS84')), making CRS part of the type system rather than implicit metadata. Spatial functions enforce CRS consistency across their inputs, catching a common class of silent errors that arises when mixing geometries from different coordinate systems. Only a couple of CRSs are built in by default, but loading the spatial extension registers over 7,000 CRSs from the EPSG dataset. While CRS support is still a bit experimental, we are planning to develop it further to support e.g., custom CRS definitions.
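For example, declaring the CRS as part of a column's type, using the CRS parameter shown above:

```sql
-- The CRS is now part of the type, not implicit metadata
CREATE TABLE places (
    name VARCHAR,
    geom GEOMETRY('OGC:CRS84')
);
```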

Optimizations

Non-Blocking Checkpointing

During checkpointing, it's now possible to run concurrent reads (#19867), writes (#20052), insertions with indexes (#20160) and deletes (#20286). The rework of checkpointing benefits concurrent RW workloads and increases the TPC-H throughput score on SF100 from 246,115.60 to 287,122.97, a 17% improvement.

Aggregates

Aggregate functions received several optimizations. For example, the last aggregate function was optimized by community member xe-nvdk to iterate from the end of each vector batch instead of the beginning. In synthetic benchmarks, this results in a 40% speedup.

Distribution

Python Pip

You can install the DuckDB CLI on any platform where pip is available:

pip install duckdb-cli

You can then launch DuckDB in your virtual environment using:

duckdb

Both DuckDB v1.4 and v1.5 are supported. We are working on shipping extensions as extras using the duckdb[extension_name] syntax – stay tuned!

Windows Install Script (Beta)

On Windows, you can now use an install script:

powershell -NoExit iex (iwr "https://install.duckdb.org/install.ps1").Content

Please note that this is currently in the beta stage. If you have any feedback, please let us know.

CLI for Linux with musl libc

We are distributing CLI clients that work with musl libc (e.g., for Alpine Linux, commonly used in Docker images). The archives are available on GitHub.

Note that the musl libc CLI client requires libstdc++. To install this package, run:

apk add libstdc++

Extension Sizes

We reworked our build system to make the extension binaries smaller! The DuckLake extension's size was reduced by ~30%, from 17 MB to 12 MB. For smaller extensions such as Excel, the reduction is more than 60%, from 9 MB to 3 MB.

Summary

These were a few highlights – but there are many more features and improvements in this release. There have been over 6500 commits by close to 100 contributors since v1.4. The full release notes can be found on GitHub. We would like to thank our community for providing detailed issue reports and feedback. And again, our special thanks go to external contributors!

PS: If you visited this blog post through a direct link – we also rolled out a new landing page!

Appendix: Example Dataset

See the code that creates the example databases.
ATTACH 'numbers1.db';
ATTACH 'numbers2.db';
ATTACH 'animals.db';

CREATE TABLE numbers1.tbl AS FROM range(1, 3) t(i);

CREATE TABLE numbers2.tbl AS FROM range(2, 6) t(i);

CREATE TABLE animals.ducks AS
FROM (VALUES
    (1, 'Labrador Duck', 1878),
    (2, 'Mallard', NULL),
    (3, 'Crested Shelduck', 1964),
    (4, 'Wood Duck', NULL),
    (5, 'Pink-headed Duck', 1949)
) t(id, name, extinct_year);

CREATE TABLE animals.swans AS
FROM (VALUES
    (1, 'Aurora', 'Mute Swan', 'White', 'European lakes and rivers'),
    (2, 'Midnight', 'Black Swan', 'Black', 'Australian wetlands'),
    (3, 'Tundra', 'Tundra Swan', 'White', 'Arctic and subarctic regions')
) t(id, name, species, color, habitat);

DETACH numbers1;
DETACH numbers2;
DETACH animals;
]]>
The DuckDB team
Announcing DuckDB 1.4.4 LTS2026-01-26T00:00:00+00:002026-01-26T00:00:00+00:00https://duckdb.org/2026/01/26/announcing-duckdb-144In this blog post, we highlight a few important fixes in DuckDB v1.4.4, the fourth patch release in DuckDB's 1.4 LTS line. The release ships bugfixes, performance improvements and security patches. You can find the complete release notes on GitHub.

To install the new version, please visit the installation page.

Fixes

This version ships a number of performance improvements and bugfixes.

Correctness

Crashes and Internal Errors

Performance

Miscellaneous

Conclusion

This post was a short summary of the changes in v1.4.4. As usual, you can find the full release notes on GitHub. We would like to thank our contributors for providing detailed issue reports and patches. In the coming month, we'll release DuckDB v1.5.0. We'll also keep v1.4 LTS updated until mid-September. We'll announce the release date of v1.4.5 in the release calendar in the coming months.

Earlier today, we pushed an incorrect tag that was visible for a few minutes. No binaries or extensions were available under this tag and we replaced it as soon as we noticed the issue. Our apologies for the erroneous release.

]]>
The DuckDB team
Announcing Vortex Support in DuckDB2026-01-23T00:00:00+00:002026-01-23T00:00:00+00:00https://duckdb.org/2026/01/23/duckdb-vortex-extensionI think it is worth starting this intro by talking a little bit about the established format for columnar data. Parquet has done some amazing things for analytics. If you go back to the times when CSV was the main alternative, then you know how important Parquet is. However, even though the specification has evolved over time, Parquet has some design constraints. A particular limitation is that it is block-compressed: engines need to decompress pages before they can do further operations like filtering, decoding values, etc. For a while, researchers and private companies have been working on alternatives to Parquet that could improve on some of these shortcomings. Vortex, from the SpiralDB team, is one of them.

What is Vortex?

Vortex is an extensible, open source format for columnar data. It was created to handle heterogeneous compute patterns and different data modalities. But what does this mean?

The project was donated to the Linux Foundation by the SpiralDB team in August 2025.

Vortex provides different layouts and encodings for different data types. Some of the most notable are ALP for floating-point encoding and FSST for string encoding. This lightweight compression strategy keeps data sizes down while enabling one of Vortex’s most important features: compute functions. By knowing the encoded layout of the data, Vortex is able to run arbitrary expressions on compressed data. This allows a Vortex reader to execute, for example, filter expressions within storage segments without decompressing data.

We mentioned heterogeneous compute to emphasize that Vortex was designed with the idea of having optimized layouts for different data types, including vectors, large text or even image or audio, but also to maximize CPU or GPU saturation. The idea is that decompression is deferred all the way to the GPU or CPU, enabling what Vortex calls “late materialization”. The FastLanes encoding, a project originating at CWI (like DuckDB), is one of the main drivers behind this feature.

Vortex also supports dynamically loaded libraries (similar to DuckDB extensions) to provide new encodings for specific types as well as specific compute functions, e.g. for geospatial data. Another very interesting feature is encoding WebAssembly into the file, which can allow the reader to benefit from specific compute kernels applied to the file.

Besides DuckDB, other engines such as DataFusion, Spark and Arrow already offer integration with Vortex.

For more information, check out the Vortex documentation.

The DuckDB Vortex Extension

DuckDB is a database, as the name says, but it is also widely used as an engine to query many different data sources. Through core or community extensions, DuckDB can integrate with:

  • Databases like Snowflake, BigQuery or PostgreSQL.
  • Lakehouse formats like Delta, Iceberg or DuckLake.
  • File formats, most notably JSON, CSV, Parquet and most recently Vortex.

The community has gotten very creative, though, so these days you can even read YAML and Markdown with DuckDB using community extensions.

All this is possible due to the DuckDB extension system, which makes it relatively easy to implement logic to interact with different file formats or external systems.

The SpiralDB team built a DuckDB extension for Vortex. Together with the DuckDB Labs team, we have made the extension available as a core DuckDB extension, so that the community can enjoy Vortex as a first-class citizen in DuckDB.

Example Usage

Installing and using the Vortex extension is very simple:

INSTALL vortex;
LOAD vortex;

Then, you can easily use it to read and write, similar to other extensions such as Parquet.

SELECT * FROM read_vortex('my.vortex');

COPY (SELECT * FROM generate_series(0, 3) t(i))
TO 'my.vortex' (FORMAT vortex);

Why Vortex and DuckDB?

Vortex claims to do well primarily at three use cases:

  • Traditional SQL analytics: Through late decompression and compute expressions on compressed data, Vortex can filter down data within the storage segment, reducing IO and memory consumption.
  • Machine learning pre-processing pipelines: By supporting a wide variety of encodings for different data types, Vortex claims to be effective at reading and writing data, whether it is audio, text, images or vectors.
  • AI model training: Encodings such as FastLanes allow for very efficient copying of data to the GPU. Vortex aims to be able to copy data directly from S3 object storage to the GPU.

The promise of more efficient IO and memory use through late decompression is a good reason to try DuckDB and Vortex for SQL analytics. On another note, if you are looking at running analytics on unified datasets that are used for multiple use cases, including pre-processing pipelines and AI training, then Vortex may be a good candidate since it is designed to fit all of these use cases well.

Performance Experiment

For those who are number hungry, we decided to run the TPC-H benchmark at scale factor 100 with DuckDB to understand how Vortex performs as a storage format compared to Parquet. We tried to make the benchmark as fair as possible. These are the parameters:

  • Run on Mac M1 with 10 cores & 32 GB of memory.
  • The benchmark runs each query 5 times and the average is used for the final report.
  • The DuckDB connection is closed after each query to keep runs “colder” and to prevent DuckDB's caching (particularly with Parquet) from influencing the results. OS page caching does influence subsequent runs, but we decided to acknowledge this factor and still keep the first run.
  • Each TPC-H table is a single file, which means that lineitem files for Parquet and Vortex are quite large (both around 20 GB). This allows us to ignore the effect of globbing and having many small files.
  • Data files used for the benchmark are generated with tpchgen-rs and are copied out using DuckDB’s Parquet and Vortex extensions.
  • We compared Vortex against Parquet v1 and v2. The v2 specification allows for considerably faster reading than the v1 specification but many writers do not support this, so we thought it was worth including both.

The results are very good: with Vortex, the TPC-H benchmark runs 18% faster than with Parquet v2 and 35% faster than with Parquet v1 (comparing geometric means, which is the recommended approach).
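The quoted percentages follow directly from the geometric means in the summary table below; a quick check in Python:

```python
# Geometric mean runtimes in seconds, taken from the benchmark summary table
gmean = {"parquet_v1": 2.324712, "parquet_v2": 1.839171, "vortex": 1.507675}

def speedup(baseline: float, contender: float) -> float:
    """Fractional runtime reduction of contender relative to baseline."""
    return 1.0 - contender / baseline

print(f"{speedup(gmean['parquet_v2'], gmean['vortex']):.0%}")  # prints 18%
print(f"{speedup(gmean['parquet_v1'], gmean['vortex']):.0%}")  # prints 35%
```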

Another interesting result is the standard deviation across runs. There was a considerable difference between the first (and coldest) run of each query and subsequent runs in Parquet, while Vortex performed very well across all runs with a much smaller standard deviation.


Format Geometric Mean (s) Arithmetic Mean (s) Avg Std Dev (s) Total Time (s)
parquet_v1 2.324712 2.875722 0.145914 63.265881
parquet_v2 1.839171 2.288013 0.182962 50.336281
vortex 1.507675 1.991289 0.078893 43.808349

The times did vary across different runs of the same benchmark, and subsequent runs have yielded similar results but with slight variations. The differences between Parquet v2 and Vortex have always been around 12-18% in geometric means and around 8-14% in total times. Benchmarking is very hard!

Click here to see a more detailed breakdown of the benchmark results.

This figure shows the results per query, including the standard deviation error bar.
The following is the summary of the sizes of the datasets. Note that both Parquet v1 and v2 are using the default compression used by the DuckDB Parquet writer, which is Snappy. In this case, Vortex is not using any general-purpose compression but still keeps the data sizes competitive.

Table parquet_v1 parquet_v2 vortex
customer 1.15 0.99 1.06
lineitem 21.15 16.02 18.14
nation 0.00 0.00 0.00
orders 6.02 4.54 5.03
part 0.59 0.47 0.54
partsupp 4.07 3.33 3.72
region 0.00 0.00 0.00
supplier 0.07 0.06 0.07
total 33.06 25.40 28.57

Conclusion

Vortex is a very interesting alternative to established columnar formats like Parquet. Its focus on lightweight compression encodings, late decompression and running compute expressions on compressed data makes it a compelling option for a wide range of use cases. With regard to DuckDB, we see that Vortex is already very performant for analytical queries, where it is on par with or better than Parquet v2 on the TPC-H benchmark queries.

Vortex has been backwards compatible since version 0.36.0, which was released more than 6 months ago. Vortex is now at version 0.56.0.

]]>
Guillermo Sanchez, SpiralDB Team
DuckDB on LoongArch2026-01-06T00:00:00+00:002026-01-06T00:00:00+00:00https://duckdb.org/2026/01/06/duckdb-on-loongarch-morefineIt’s not every day that a new CPU architecture arrives on your desk. I grew up on the Intel 486 back in the early 90s. I also still remember AMD releasing its 64-bit x86 extension in 2000. Then not a lot happened until Apple released the ARM-based M1 architecture in 2020. But today is the day again (for me), with the long-awaited arrival of the “MOREFINE M700S” in our office.

The M700S contains a Loongson CPU. Also called “LoongArch” or “Godson”, these processors were developed in China based on the (somewhat esoteric) MIPS architecture, as part of the country's push to become technologically self-sufficient under the government-funded Made in China 2025 plan.

It is probably safe to assume that – given the ongoing trade shenanigans – the Loongson will become much more popular in China as time goes on. DuckDB already sees quite a lot of usage from China, so naturally we want to make sure that DuckDB runs well on the Loongson. Thankfully, one of our community members has already opened a pull request with two minimal changes to allow DuckDB to compile. We became curious.

We purchased the M700S on (where else?) AliExpress for around 500 EUR. Besides the Loongson 8-core 3A6000 CPU it contains 16 GB of main memory and a 256 GB solid-state disk.

Once plugged in and booted up, things feel pretty normal besides the loud fan that seems to be always on. On the screen, a variant of Debian called Loongnix boots up. The GUI seems to be KDE-based and comes with a custom browser “LBrowser” which is a fork of Chromium. Just because it was not obvious we document it here: the default root password is M700S. There is also a user account m700s with the same password.

Overall, the software seems a little dated, even after running apt upgrade: the Linux kernel seems to be version 4.19, which was released back in 2018, and which has been EOL for a year now. The GCC version is 8.3, which similarly came out in 2019.

With the aforementioned patch, we managed to compile DuckDB 1.4.3 on Loongnix. There was one small issue where the CMake file append_metadata.cmake was not compatible with the older CMake version (3.13.4) available on Loongnix. But simply replacing that file with an empty one allowed us to complete the build. Of course we could also have updated CMake, but life is short. Once completed, we ran DuckDB’s extensive unit test suite (make allunit) to confirm that our build runs correctly on the Loongson CPU. Results looked good.

For performance comparison, we re-used the methodology from our previous blog post that ran DuckDB on a Raspberry Pi. In short, we run the 22 TPC-H benchmark queries on “Scale Factor” 100 and 300, which in DuckDB format is a 25 GB and 78 GB database file, respectively. We compare those numbers with the nearest computer, which is my day-to-day MacBook Pro with an M3 Max CPU. For fairness, we limit DuckDB to 14 GB of RAM on both platforms. The reported timings are “hot” runs, meaning we re-ran the query set and took the timings from the second run.

Here are the results, and they are not great. We start with aggregated timings:

SF System Geometric mean Sum
SF100 MacBook 0.6 16.9
SF100 MOREFINE 6.1 192.8
SF300 MacBook 2.8 78.8
SF300 MOREFINE 27.3 791.6

We can see that the MacBook is around ten times faster than the MOREFINE on this benchmark, both in the geometric mean of runtimes as well as in the sum. If you are interested in the individual query runtimes, you can find them below.
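The “around ten times” figure is simply the ratio of the aggregated timings above, for instance of the summed runtimes:

```python
# Ratio of MOREFINE to MacBook total runtimes (seconds, from the table above)
sf100_ratio = 192.8 / 16.9   # ~11.4x
sf300_ratio = 791.6 / 78.8   # ~10.0x
print(round(sf100_ratio, 1), round(sf300_ratio, 1))
```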

Click here to see the individual query runtimes.
Q SF100/MacBook SF100/MOREFINE SF300/MacBook SF300/MOREFINE
1 1.247 7.363 4.528 26.475
2 0.117 1.058 0.474 4.101
3 0.697 8.563 2.759 32.432
4 0.570 7.348 2.331 27.185
5 0.631 8.498 3.217 34.462
6 0.180 1.236 1.395 13.225
7 0.620 7.702 3.119 37.411
8 0.640 5.593 3.611 29.914
9 1.906 30.560 6.670 99.884
10 0.923 11.755 4.036 40.412
11 0.102 1.037 0.709 4.444
12 0.535 6.422 2.918 31.501
13 1.847 21.185 6.394 74.081
14 0.408 5.616 3.240 26.613
15 0.252 2.652 1.906 17.454
16 0.273 3.108 0.879 11.480
17 0.805 5.184 4.655 28.469
18 1.538 15.492 7.619 71.845
19 0.779 9.143 4.379 39.111
20 0.441 4.993 3.234 25.967
21 1.996 23.231 9.503 96.452
22 0.441 5.036 1.237 18.709

It is always exciting to get DuckDB running on a new platform. Of course, we have built DuckDB to be ultra-portable and agnostic to hardware environments while still delivering excellent performance. So it was not that surprising that getting DuckDB running on the MOREFINE with its new-ish CPU was not very difficult. However, performance on the standard TPC-H benchmark was not that impressive, with the MacBook being around ten times faster than the MOREFINE.

Of course, there are many opportunities for improvement. For starters, the GCC toolchain on LoongArch is likely far less mature than its x86/ARM counterparts, so advances there could make a big difference. The same applies to IO performance, which we have not measured separately. But hey, the “glass half full” department could also rightfully claim that the Loongson CPU can complete TPC-H SF300!

One could also argue that a MacBook Pro is much more expensive than the 500 EUR MOREFINE. However, a recent M4 Mac Mini with the same memory and storage specs costs around 700 EUR, not that much more all things considered. It will run circles around the MOREFINE. And it will not constantly annoy you with its fan.

]]>
Hannes Mühleisen
Iceberg in the Browser2025-12-16T00:00:00+00:002025-12-16T00:00:00+00:00https://duckdb.org/2025/12/16/iceberg-in-the-browserIn this post, we describe the current patterns for interacting with Iceberg Catalogs, and pose the question: could it be done from a browser? After elaborating on the DuckDB ecosystem changes required to unlock this capability, we demonstrate our approach to interacting with an Iceberg REST Catalog. It's browser-only, no extra setup required.

Interaction Models for Iceberg Catalogs

Iceberg analytics today

Iceberg is an open table format, which allows you to capture a mutable database table as a set of static files on object storage (such as AWS S3). Iceberg catalogs allow you to track and organize Iceberg tables. For example, Iceberg REST Catalogs provide these functionalities through a REST API.

There are two common ways to interact with Iceberg catalogs:

  • The client–server model, where the compute part of the operation is delegated to a managed infrastructure (such as the cloud). Users can interact with the server by installing a local client or using a lightweight client such as a browser.
  • The client-is-the-server model, where the user first installs the relevant libraries, and then performs queries directly on their machine.

Iceberg engines follow these interaction models: they are either run natively in managed compute infrastructure or they are run locally by the user. Let's see how things look with DuckDB in the mix!

Iceberg with DuckDB

Iceberg with DuckDB

DuckDB supports both Iceberg interaction models. In the client–server model, DuckDB runs on the server to read the Iceberg datasets. From the user's point of view, the choice of engine is transparent, and DuckDB is just one of many engines that the server could use in the background. The client-is-the-server model is more interesting: here, users install a DuckDB client locally and use it through its SQL interface to query Iceberg catalogs. For example:

CREATE SECRET test_secret (
    TYPE S3, 
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
);

ATTACH 'warehouse' AS db (
    TYPE ICEBERG,
    ENDPOINT_URL 'https://your-iceberg-endpoint'
);

SELECT sum(value)
FROM db.table
WHERE other_column = 'some_value';

The client-is-the-server model unlocks empowered clients, which can operate directly on the data.

You can discover the full DuckDB-Iceberg extension feature set, including insert and update capabilities, in our earlier blog post.

Iceberg with DuckDB in the Browser

While setting up a local DuckDB installation is quite simple, opening a browser tab is even quicker. Therefore, we asked ourselves: could we support the client-is-the-server model directly from within a browser tab? This could provide a zero-setup, no-infrastructure, properly serverless option for interacting with Iceberg catalogs.

Iceberg with DuckDB-Wasm

Luckily, DuckDB has a client that can run in any browser! DuckDB-Wasm is a WebAssembly port of DuckDB, which supports loading of extensions.

Interacting with an Iceberg REST Catalog requires a number of functionalities: the ability to talk to a REST API over HTTP(S), the ability to read and write Avro and Parquet files on object storage, and finally, the ability to negotiate authentication to access those resources on behalf of the user. All of this must be done from within a browser, without calling any native components.

To support these functionalities, we implemented the following high-level changes:

  • In the core duckdb codebase, we redesigned HTTP interactions, so that extensions and clients have a uniform interface to the networking stack. (PR)
  • In duckdb-wasm, we implemented such an interface, which in this case is a wrapper around the available JavaScript network stack. (PR)
  • In duckdb-iceberg, we routed all networking through the common HTTP interface, so that native DuckDB and DuckDB-Wasm execute the same logic. (PR)

The result is that you can now query Iceberg with DuckDB running directly in a browser! The same Iceberg catalog is now accessible via the client–server model, the client-is-the-server model, or properly serverless, from the isolation of a browser tab!

Welcome to Serverless Iceberg Analytics

Check out our demo of serverless Iceberg analytics using the DuckDB Table Visualizer

The current credentials in the demo are provided via a throwaway account with minimal permissions. If you enter your own credentials and share a link, you will be sharing your credentials.

Access Your Own Data

By substituting your own S3 Tables bucket ARN and credentials (with the AmazonS3TablesReadOnlyAccess policy), you can also access your own catalog, metadata and data. Computations are fully local, and the credentials and warehouse ID are only sent to the catalog endpoint specified in your ATTACH command. Inputs are translated to SQL and added to the hash segment of the URL.

This means that:

  • no sensitive data is handled or sent to duckdb.org
  • computations are local, fully in your browser
  • you can use the familiar SQL interface with the same code snippets that can run everywhere DuckDB runs
  • if you edit the credentials and share the resulting link, you will be sharing the new credentials

As of today, this works with Amazon S3 Tables. This has been implemented through a collaboration with the Amazon S3 Tables team. To learn more about S3 Tables, how to get started and their feature set, you can take a look at their product page or documentation. A demo of DuckDB querying S3 Tables from a browser was presented at AWS re:Invent 2025 – see the presentation.
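An ATTACH along these lines points DuckDB at an S3 Tables catalog; the ARN below is a placeholder, and the exact option names follow the DuckDB-Iceberg documentation for S3 Tables:

```sql
-- Placeholder ARN: substitute your own S3 Tables bucket ARN
ATTACH 'arn:aws:s3tables:us-east-1:111122223333:bucket/my-bucket'
    AS s3_tables_db (
        TYPE iceberg,
        ENDPOINT_TYPE s3_tables
    );
```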

Conclusion

The DuckDB-Iceberg extension is now supported in DuckDB-Wasm and it can read and edit Iceberg REST Catalogs. Users can now access Iceberg data from within a browser, without having to install or manage any compute nodes!

If you would like to provide feedback or file issues, please reach out to us on either the DuckDB-Wasm or DuckDB-Iceberg repository. If you are interested in using any part of this within your organization, feel free to reach out.

]]>
Carlo Piovesan, Tom Ebergen, Gábor Szárnyas
Announcing DuckDB 1.4.3 LTS2025-12-09T00:00:00+00:002025-12-09T00:00:00+00:00https://duckdb.org/2025/12/09/announcing-duckdb-143In this blog post, we highlight a few important fixes in DuckDB v1.4.3, the third patch release in DuckDB's 1.4 LTS line. You can find the complete release notes on GitHub.

To install the new version, please visit the installation page.

Fixes

This version ships a number of performance improvements and bugfixes.

Correctness

Crashes and Internal Errors

Performance

Miscellaneous

Azure Blob Storage Writes

The azure extension can now write to Azure Blob Storage. This unlocks several other Azure and Fabric features, including the use of OneLake instances.

Windows Arm64

With this release, we are introducing beta support for Windows Arm64 by distributing native DuckDB extensions and Python wheels.

Extension Distribution

On Windows Arm64, you can now natively install core extensions, including complex ones like spatial:

duckdb
PRAGMA platform;
┌───────────────┐
│   platform    │
│    varchar    │
├───────────────┤
│ windows_arm64 │
└───────────────┘
INSTALL spatial;
LOAD spatial;
SELECT ST_Area(ST_GeomFromText(
        'POLYGON((0 0, 4 0, 4 3, 0 3, 0 0))'
    )) AS area;
┌────────┐
│  area  │
│ double │
├────────┤
│  12.0  │
└────────┘

Python Wheel Distribution

We now distribute Python wheels for Windows Arm64 for Python 3.11+. This means that you can take, e.g., a Copilot+ PC, install the native Python interpreter, and run:

pip install duckdb

This installs the duckdb package using the binary distributed through PyPI. Then, you can use it as follows:

python
Python 3.13.9
    (tags/v3.13.9:8183fa5, Oct 14 2025, 14:51:39)
    [MSC v.1944 64 bit (ARM64)] on win32

>>> import duckdb
>>> duckdb.__version__
'1.4.3'

Currently, many Python installations that you'll find on Windows Arm64 computers use the x86_64 (AMD64) Python distribution and run through Microsoft's Prism emulator. For example, if you install Python through the Windows Store, you will get the Python AMD64 installation. To understand which platform your Python installation is using, observe the Python CLI's first line (e.g., Python 3.13.9 ... (ARM64)).
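Besides reading the banner, you can also query the interpreter itself; a small sketch:

```python
# Report the architecture of the running Python interpreter.
# On Windows Arm64, a native build reports 'ARM64', while an emulated
# x86_64 build reports 'AMD64'.
import platform

print(platform.machine())
```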

ODBC Driver

We are now shipping a native ODBC driver for Windows Arm64. Head to the ODBC Windows installation page to try it out!

Conclusion

This post was a short summary of the changes in v1.4.3. As usual, you can find the full release notes on GitHub. We would like to thank our contributors for providing detailed issue reports and patches. Stay tuned for DuckDB v1.4.4 and v1.5.0, both scheduled for release early next year!

The DuckDB team
Writes in DuckDB-Iceberg2025-11-28T00:00:00+00:002025-11-28T00:00:00+00:00https://duckdb.org/2025/11/28/iceberg-writes-in-duckdbOver the past several months, the DuckDB Labs team has been hard at work on the DuckDB-Iceberg extension, with full read support and initial write support released in v1.4.0. Today, we are happy to announce that delete and update support for Iceberg v2 tables is available in v1.4.2!

The Iceberg open table format has become extremely popular in the past two years, with many databases announcing support for the format, which was originally developed at Netflix. This past year, the DuckDB team has made Iceberg integration a priority, and today we are happy to announce another step in that direction. In this blog post, we describe the current feature set of DuckDB-Iceberg in DuckDB v1.4.2.

Getting Started

To experiment with the new DuckDB-Iceberg features, you will need to connect to your favorite Iceberg REST Catalog. There are many ways to do so: please have a look at the Connecting to REST Catalogs page for catalogs like Apache Polaris or Lakekeeper, and at the Connecting to S3 Tables page if you would like to connect to Amazon S3 Tables.

ATTACH 'warehouse_name' AS iceberg_catalog (
    TYPE iceberg,
    other options
);

Inserts, Deletes and Updates

Support for creating tables and inserting into tables was already added in DuckDB v1.4.0: you can use standard DuckDB SQL syntax to insert data into your Iceberg table.

CREATE TABLE iceberg_catalog.default.simple_table (
    col1 INTEGER,
    col2 VARCHAR
);
INSERT INTO iceberg_catalog.default.simple_table
    VALUES (1, 'hello'), (2, 'world'), (3, 'duckdb is great');

You can also use any DuckDB table scan function to insert data into an Iceberg table:

INSERT INTO iceberg_catalog.default.more_data
    SELECT * FROM read_parquet('path/to/parquet');

Starting with v1.4.2, the standard SQL syntax also works for deletes and updates:

DELETE FROM iceberg_catalog.default.simple_table
WHERE col1 = 2;

UPDATE iceberg_catalog.default.simple_table
SET col1 = col1 + 5
WHERE col1 = 1;

SELECT *
FROM iceberg_catalog.default.simple_table;
┌───────┬─────────────────┐
│ col1  │      col2       │
│ int32 │     varchar     │
├───────┼─────────────────┤
│     3 │ duckdb is great │
│     6 │ hello           │
└───────┴─────────────────┘

Iceberg write support currently has two limitations:

Write support is limited to tables that are not partitioned and not sorted. Attempting to perform update, insert, or delete operations on partitioned or sorted tables with DuckDB-Iceberg will result in an error.

DuckDB-Iceberg only writes positional deletes for DELETE and UPDATE statements. Copy-on-write functionality is not yet supported.
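To give an intuition for positional deletes: instead of rewriting data files, merge-on-read records the positions of deleted rows in a separate delete file, and readers filter those positions out at scan time. A simplified Python illustration of the idea (not DuckDB-Iceberg code; data files are modeled as plain lists, mirroring the DELETE and UPDATE from the example above):

```python
# Merge-on-read with positional deletes, modeled on plain Python lists.
# A data file is never rewritten; deletes are recorded as row positions.

data_file = [(1, "hello"), (2, "world"), (3, "duckdb is great")]

# DELETE FROM ... WHERE col1 = 2  -> record position 1 in a delete file
positional_deletes = {1}

# An UPDATE is a positional delete of the old row plus an appended new row:
# UPDATE ... SET col1 = col1 + 5 WHERE col1 = 1
positional_deletes.add(0)
data_file_2 = [(6, "hello")]

def scan(files, deletes_per_file):
    """Merge-on-read scan: skip positions listed in the delete files."""
    for file_id, rows in enumerate(files):
        deleted = deletes_per_file.get(file_id, set())
        for pos, row in enumerate(rows):
            if pos not in deleted:
                yield row

rows = list(scan([data_file, data_file_2], {0: positional_deletes}))
print(rows)  # [(3, 'duckdb is great'), (6, 'hello')]
```

This also shows why updates to partitioned or sorted tables are harder: the appended replacement row would have to land in the right partition or sort position, not simply at the end.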

Functions for Table Properties

Currently, DuckDB-Iceberg only supports merge-on-read semantics. Within the Iceberg table metadata, table properties describe which forms of deletes or updates are allowed. DuckDB-Iceberg respects the write.update.mode and write.delete.mode table properties for updates and deletes. If a table has these properties set to anything other than merge-on-read, DuckDB will throw an error and the UPDATE or DELETE will not be committed. v1.4.2 introduces three new functions to add, remove, and view table properties of an Iceberg table:

  • set_iceberg_table_properties
  • iceberg_table_properties
  • remove_iceberg_table_properties

You can use them as follows:

-- to set table properties
CALL set_iceberg_table_properties(iceberg_catalog.default.simple_table, {
    'write.update.mode': 'merge-on-read',
    'write.file.size': '100000kb'
});
-- to read table properties
SELECT * FROM iceberg_table_properties(iceberg_catalog.default.simple_table);
┌───────────────────┬───────────────┐
│        key        │     value     │
│      varchar      │    varchar    │
├───────────────────┼───────────────┤
│ write.update.mode │ merge-on-read │
│ write.file.size   │ 100000kb      │
└───────────────────┴───────────────┘
-- to remove table properties
CALL remove_iceberg_table_properties(
    iceberg_catalog.default.simple_table,
    ['some.other.property']
);

Iceberg Table Metadata

DuckDB-Iceberg also allows you to view the metadata of your Iceberg tables using the iceberg_metadata() and iceberg_snapshots() functions.

SELECT * FROM iceberg_metadata(iceberg_catalog.default.table_1);
┌──────────────────────┬──────────────────────┬──────────────────┬─────────┬──────────────────┬─────────────────────────────────────────────────────────────┬─────────────┬──────────────┐
│    manifest_path     │ manifest_sequence_…  │ manifest_content │ status  │     content      │                         file_path                           │ file_format │ record_count │
│       varchar        │        int64         │     varchar      │ varchar │     varchar      │                          varchar                            │   varchar   │    int64     │
├──────────────────────┼──────────────────────┼──────────────────┼─────────┼──────────────────┼─────────────────────────────────────────────────────────────┼─────────────┼──────────────┤
│ s3://warehouse/def…  │                    1 │ DATA             │ ADDED   │ EXISTING         │ s3://<storage_location>/simple_table/data/019a6ecc-9e9e-7…  │ parquet     │            3 │
│ s3://warehouse/def…  │                    2 │ DELETE           │ ADDED   │ POSITION_DELETES │ s3://<storage_location>/simple_table/data/d65b1db8-9fa8-4…  │ parquet     │            1 │
│ s3://warehouse/def…  │                    3 │ DELETE           │ ADDED   │ POSITION_DELETES │ s3://<storage_location>/simple_table/data/8d1b92dc-5f6e-4…  │ parquet     │            1 │
│ s3://warehouse/def…  │                    3 │ DATA             │ ADDED   │ EXISTING         │ s3://<storage_location>/simple_table/data/019a6ecf-5261-7…  │ parquet     │            1 │
└──────────────────────┴──────────────────────┴──────────────────┴─────────┴──────────────────┴─────────────────────────────────────────────────────────────┴─────────────┴──────────────┘
SELECT * FROM iceberg_snapshots(iceberg_catalog.default.simple_table);
┌─────────────────┬─────────────────────┬─────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ sequence_number │     snapshot_id     │      timestamp_ms       │                                                manifest_list                                                 │
│     uint64      │       uint64        │        timestamp        │                                                   varchar                                                    │
├─────────────────┼─────────────────────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│               1 │ 1790528822676766947 │ 2025-11-10 17:24:55.075 │ s3://<storage_location>/simple_table/data/snap-1790528822676766947-f09658c4-ca52-4305-943f-6a8073529fef.avro │
│               2 │ 6333537230056014119 │ 2025-11-10 17:27:35.602 │ s3://<storage_location>/simple_table/data/snap-6333537230056014119-316d09bc-549d-46bc-ae13-a9fab5cbf09b.avro │
│               3 │ 7452040077415501383 │ 2025-11-10 17:27:52.169 │ s3://<storage_location>/simple_table/data/snap-7452040077415501383-93dee94e-9ec1-45fa-aec2-13ef434e50eb.avro │
└─────────────────┴─────────────────────┴─────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Time Travel

Time travel is also possible via snapshot ids or timestamps using the AT (VERSION => ...) or AT (TIMESTAMP => ...) syntax.

-- via snapshot id
SELECT *
FROM iceberg_catalog.default.simple_table AT (
    VERSION => snapshot_id
);
┌───────┬─────────────────┐
│ col1  │      col2       │
│ int32 │     varchar     │
├───────┼─────────────────┤
│     1 │ hello           │
│     3 │ duckdb is great │
└───────┴─────────────────┘
-- via timestamp
SELECT *
FROM iceberg_catalog.default.simple_table AT (
    TIMESTAMP => '2025-11-10 17:27:45.602'
);
┌───────┬─────────────────┐
│ col1  │      col2       │
│ int32 │     varchar     │
├───────┼─────────────────┤
│     1 │ hello           │
│     3 │ duckdb is great │
└───────┴─────────────────┘

Viewing Requests to the Iceberg REST Catalog

You may also be curious about what requests DuckDB makes to the Iceberg REST Catalog. To see them, enable HTTP logging, run your workload, then select from the HTTP logs.

CALL enable_logging('HTTP');
SELECT * FROM iceberg_catalog.default.simple_table;
SELECT request.type, request.url, response.status
FROM duckdb_logs_parsed('HTTP');
┌─────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────┐
│  type   │                                                                             url                          │       status       │
│ varchar │                                                                           varchar                        │      varchar       │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────┤
│ GET     │ https://<catalog_endpoint>/iceberg/v1/<warehouse>/iceberg-testing/namespaces/default                     │ NULL               │
│ HEAD    │ https://<catalog_endpoint>/iceberg/v1/<warehouse>/iceberg-testing/namespaces/default/tables/simple_table │ NULL               │
│ GET     │ https://<catalog_endpoint>/iceberg/v1/<warehouse>/iceberg-testing/namespaces/default/tables/simple_table │ NULL               │
│ GET     │ https://<storage_endpoint>/data/snap-5943683398986255948-c2217dde-6036-4e07-88f2-…                       │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/f8c95b93-7b6b-4a24-8557-b98b553723d4-m0.avro                             │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/214a7988-da39-4dac-aa3a-4a73d3ead405-m0.avro                             │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/019a7244-c6e8-7bc9-9dd4-7249fcb04959.parquet                             │ PartialContent_206 │
│ GET     │ https://<storage_endpoint>/data/019a7244-fcb5-7308-96ec-1c9e32509eab.parquet                             │ PartialContent_206 │
│ GET     │ https://<storage_endpoint>/data/7f14bb06-f57a-42b4-ba7f-053a65152759-m0.avro                             │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/71f8b43d-51e7-40e7-be88-e8d869836ecd-deletes.parq…                       │ PartialContent_206 │
│ GET     │ https://<storage_endpoint>/data/64f6c6e2-2f54-470e-b990-b201bc615042-m0.avro                             │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/4e54afed-6dd8-4ba0-88fb-16f972ac1d91-deletes.parq…                       │ PartialContent_206 │
├─────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────┤
│ 12 rows                                                                                                                       3 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Here we can see calls to the Iceberg REST Catalog, followed by calls to the storage endpoint. The first three calls to the Iceberg REST Catalog verify that the schema still exists and fetch the latest metadata.json of the Iceberg table. Next, DuckDB queries the manifest list, the manifest files, and eventually the files containing data and deletes. The data and delete files are cached locally to speed up subsequent reads.

Transactions

DuckDB is an ACID-compliant database that supports transactions, and DuckDB-Iceberg has been built with this in mind. Within a transaction, the following conditions hold for Iceberg tables.

  1. The first time a table is read in a transaction, its snapshot information is stored in the transaction and remains consistent within that transaction.
  2. Updates, inserts, and deletes are only committed to an Iceberg table when the transaction is committed (i.e., on COMMIT).

Point #1 is important for read performance. If you wish to do analytics on an Iceberg table and you do not need to get the latest version of the table every time, running your analytics in a transaction will prevent fetching the latest version for every query.

-- truncate the logs
CALL truncate_duckdb_logs();
CALL enable_logging('HTTP');
BEGIN;
-- first read gets latest snapshot information
SELECT * FROM iceberg_catalog.default.simple_table;
-- subsequent read reads from local cached data
SELECT * FROM iceberg_catalog.default.simple_table;
-- get logs
SELECT request.type, request.url, response.status
FROM duckdb_logs_parsed('HTTP');
┌─────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────┐
│  type   │                                                  url                                                        │       status       │
│ varchar │                                                varchar                                                      │      varchar       │
├─────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────┤
│ GET     │ https://<catalog_endpoint>/iceberg/v1/<warehouse>/iceberg-testing/namespaces/default                        │ NULL               │
│ HEAD    │ https://<catalog_endpoint>/iceberg/v1/<warehouse>/iceberg-testing/namespaces/default/tables/simple_table    │ NULL               │
│ GET     │ https://<catalog_endpoint>/iceberg/v1/<warehouse>/iceberg-testing/namespaces/default/tables/simple_table    │ NULL               │
│ GET     │ https://<storage_endpoint>/data/snap-5943683398986255948-c2217dde-6036-4e07-88f2-1…                         │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/f8c95b93-7b6b-4a24-8557-b98b553723d4-m0.avro                                │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/214a7988-da39-4dac-aa3a-4a73d3ead405-m0.avro                                │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/019a7244-c6e8-7bc9-9dd4-7249fcb04959.parquet                                │ PartialContent_206 │
│ GET     │ https://<storage_endpoint>/data/019a7244-fcb5-7308-96ec-1c9e32509eab.parquet                                │ PartialContent_206 │
│ GET     │ https://<storage_endpoint>/data/7f14bb06-f57a-42b4-ba7f-053a65152759-m0.avro                                │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/71f8b43d-51e7-40e7-be88-e8d869836ecd-deletes.parquet                        │ PartialContent_206 │
│ GET     │ https://<storage_endpoint>/data/64f6c6e2-2f54-470e-b990-b201bc615042-m0.avro                                │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/4e54afed-6dd8-4ba0-88fb-16f972ac1d91-deletes.parquet                        │ PartialContent_206 │
├─────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────┤
│ 12 rows                                                                                                                          3 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Here we see all the same requests we saw in the previous section. However, now we are in a transaction, which means the second time we read from iceberg_catalog.default.simple_table, we do not need to query the REST Catalog for table updates. This means DuckDB-Iceberg performs no extra requests when reading a table a second time, significantly improving performance.
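This snapshot-pinning behaviour can be modeled as a per-transaction cache: the catalog is consulted once per table, and subsequent reads reuse the pinned snapshot. A conceptual Python sketch (not the extension's actual implementation; the snapshot id is taken from the iceberg_snapshots() output above):

```python
# Conceptual model of snapshot pinning inside a transaction: the REST
# catalog is consulted only on the first read of each table.
class IcebergTransaction:
    def __init__(self, catalog):
        self.catalog = catalog      # table name -> latest snapshot id
        self.pinned = {}            # snapshots pinned by this transaction
        self.requests = 0           # simulated catalog round-trips

    def read(self, table):
        if table not in self.pinned:
            self.requests += 1      # fetch the latest snapshot once
            self.pinned[table] = self.catalog[table]
        return self.pinned[table]   # later reads reuse the pinned snapshot

catalog = {"simple_table": 7452040077415501383}
txn = IcebergTransaction(catalog)
txn.read("simple_table")
txn.read("simple_table")            # no extra catalog request
assert txn.requests == 1
```

The trade-off is the usual one for snapshot isolation: reads are fast and consistent, but changes committed by other writers only become visible to a new transaction.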

Conclusion and Future Work

With these features, DuckDB-Iceberg now has strong base support for Iceberg tables, which enables users to unlock the analytical power of DuckDB on their Iceberg data. There is still more work to come, and the Iceberg table specification has many more features the DuckDB team would like to support in DuckDB-Iceberg. If any feature is a priority for your analytical workloads, please reach out to us in the DuckDB-Iceberg GitHub repository or get in touch with our engineers.

Below is a list of improvements planned for the near future (in no particular order):

  • Performance improvements
  • Updates / deletes / inserts to partitioned tables
  • Updates / deletes / inserts to sorted tables
  • Schema evolution
  • Support for Iceberg v3 tables, focusing on binary deletion vectors and row lineage tracking
{"twitter" => "the_Tmonster", "picture" => "/images/blog/authors/tom_ebergen.jpg"}
Data-at-Rest Encryption in DuckDB2025-11-19T00:00:00+00:002025-11-19T00:00:00+00:00https://duckdb.org/2025/11/19/encryption-in-duckdb

If you would like to use encryption in DuckDB, we recommend using the latest stable version, v1.4.2. For more details, see the latest release blog post.

Many years ago, we read the excellent “Code Book” by Simon Singh. Did you know that Mary, Queen of Scots, used an encryption method harking back to Julius Caesar to encrypt her more saucy letters? But alas: the cipher was broken and the contents of the letters got her executed.

These days, strong encryption software and hardware is a commodity. Modern CPUs come with specialized cryptography instructions, and operating systems small and big contain mostly-robust cryptography software like OpenSSL.

Databases store arbitrary information, and it is clear that many if not most datasets of any value should not be plainly available to everyone. Even if stored on tightly controlled hardware like a cloud virtual machine, there have been many cases of files being lost through various privilege escalations. Unsurprisingly, compliance frameworks like the common SOC 2 “highly recommend” encrypting data when it is stored on media like hard drives.

However, database systems and encryption have a somewhat problematic track record. Even PostgreSQL, the self-proclaimed “World's Most Advanced Open Source Relational Database”, has very limited options for data encryption. SQLite, the world's “Most Widely Deployed and Used Database Engine”, does not support data encryption out of the box: its encryption extension is a $2000 add-on.

DuckDB has supported Parquet Modular Encryption for a while. This feature allows reading and writing Parquet files with encrypted columns. However, while Parquet files are great and reports of their impending death are greatly exaggerated, they cannot – for example – be updated in place, a pretty basic feature of a database management system.

Starting with DuckDB 1.4.0, DuckDB supports transparent data encryption of data-at-rest using industry-standard AES encryption.

DuckDB's encryption does not yet meet the official NIST requirements. Please follow issue #20162 “Store and verify tag for canary encryption” to track our progress towards NIST-compliance.

Some Basics of Encryption

There are many different ways to encrypt data, some more secure than others. In database systems and elsewhere, the standard is the Advanced Encryption Standard (AES), which is a block cipher algorithm standardized by US NIST. AES is a symmetric encryption algorithm, meaning that the same key is used for both encryption and decryption of data.

For this reason, most systems choose to only support randomized encryption, meaning that identical plaintexts will always yield different ciphertexts (if used correctly!). The most commonly used industry standard and recommended encryption algorithm is AES – Galois Counter Mode (AES-GCM). This is because on top of its ability to randomize encryption, it also authenticates data by calculating a tag to ensure data has not been tampered with.

DuckDB v1.4 supports encryption at rest using the AES-GCM-256 and AES-CTR-256 (counter mode) ciphers. AES-CTR is a simpler and faster variant of AES-GCM, but less secure, since it does not provide authentication by calculating a tag. The 256 refers to the size of the key in bits, meaning that DuckDB currently only supports 32-byte keys.

GCM and CTR both require as input (1) a plaintext, (2) an initialization vector (IV), and (3) an encryption key. The plaintext is the data that a user wants to encrypt. The IV is a unique bytestream, usually 16 bytes, that ensures that identical plaintexts get encrypted into different ciphertexts. A number used once (nonce) is a bytestream of usually 12 bytes that, together with a 4-byte counter, constructs the IV. Note that the IV needs to be unique for every encrypted block, but it does not necessarily have to be random. Reusing the same IV is problematic, since an attacker could XOR the two ciphertexts and extract both messages. The tag in AES-GCM is calculated after all blocks are encrypted, much like a checksum, but it adds an integrity check that securely authenticates the entire ciphertext.
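The IV-reuse pitfall is easy to demonstrate with the keystream model of a counter-mode cipher: if two messages are encrypted under the same key and IV, XOR-ing the ciphertexts cancels the keystream and leaks the XOR of the plaintexts. A self-contained sketch (random bytes stand in for the AES keystream; this is an illustration, not real AES):

```python
import os
import struct

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A 16-byte IV built from a 12-byte nonce and a 4-byte block counter.
nonce = os.urandom(12)
iv = nonce + struct.pack(">I", 0)
assert len(iv) == 16

# In CTR mode (and GCM), encryption is plaintext XOR keystream, where the
# keystream is derived from (key, IV). Reusing the same (key, IV) pair
# therefore reuses the keystream (modeled here with random bytes):
keystream = os.urandom(16)
p1 = b"attack at dawn!!"
p2 = b"retreat at nine!"
c1, c2 = xor(p1, keystream), xor(p2, keystream)

# The keystream cancels out: an attacker learns p1 XOR p2 without the key.
assert xor(c1, c2) == xor(p1, p2)
```

With known or guessable structure in either plaintext, p1 XOR p2 is usually enough to recover both messages, which is why IV uniqueness is non-negotiable.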

Implementation in DuckDB

Before diving deeper into how we actually implemented encryption in DuckDB, we’ll explain some things about the DuckDB file format.

DuckDB has one main database header, which stores data that enables it to correctly load and verify a DuckDB database. At the start of the main database header, the magic bytes (“DUCKDB”) are stored and read upon initialization to verify whether the file is a valid DuckDB database file. The magic bytes are followed by four 8-byte flags that can be set for different purposes.

When a database is encrypted in DuckDB, the main database header remains plaintext at all times, since it contains no sensitive data about the contents of the database file. Upon initializing an encrypted database, DuckDB sets the first bit in the first flag to indicate that the database is encrypted. After setting this bit, additional metadata necessary for encryption is stored. This metadata comprises (1) the database identifier, (2) 8 bytes of additional metadata describing, e.g., the encryption cipher used, and (3) the encrypted canary.

The database identifier is used as a “salt” and consists of 16 randomly generated bytes created upon initialization of each database. The salt is often used to ensure uniqueness, i.e., it makes sure that identical input keys or passwords are transformed into different derived keys. The 8 bytes of metadata comprise the key derivation function (first byte), the usage of additional authenticated data (second byte), the encryption cipher (third byte), and the key length (fifth byte). After the metadata, the main header uses the encrypted canary to check whether the input key is correct.

Encryption Key Management

To encrypt data in DuckDB, you can use practically any plaintext or base64-encoded string, but we recommend using a secure 32-byte base64 key. Users themselves are responsible for key management and thus for using a secure key. Instead of directly using the plain key provided by the user, DuckDB always derives a more secure key by means of a key derivation function (KDF), which reduces or extends the input key to a 32-byte secure key. Once the correctness of the input key has been verified by deriving the secure key and decrypting the canary, the derived key is managed in a secure encryption key cache. This cache manages encryption keys for the current DuckDB context and ensures that the derived encryption keys are never swapped to disk by locking their memory. To strengthen security even more, the original input keys are immediately wiped from memory once they are transformed into secure derived keys.
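As a rough model of this flow: a KDF turns an input key of any length into a fixed 32-byte key, salted so that the same passphrase yields different derived keys in different databases. The sketch below uses PBKDF2 from the Python standard library purely as a stand-in KDF (DuckDB's actual KDF is internal and may differ); the salt plays the role of the 16-byte database identifier:

```python
import hashlib
import os

salt = os.urandom(16)       # plays the role of the database identifier
user_key = b"asdf"          # whatever the user passed as ENCRYPTION_KEY

# Derive a fixed 32-byte secure key from an input key of any length.
derived = hashlib.pbkdf2_hmac("sha256", user_key, salt, 100_000)
assert len(derived) == 32

# The same input key + salt always yields the same derived key ...
assert derived == hashlib.pbkdf2_hmac("sha256", user_key, salt, 100_000)
# ... while a different salt yields a different key for the same input.
assert derived != hashlib.pbkdf2_hmac("sha256", user_key, os.urandom(16), 100_000)
```

The salting step is what makes identical passphrases across databases non-interchangeable, as described above.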

DuckDB Block Structure

After the main database header, DuckDB stores two 4 KB database headers that contain more information about, e.g., the block (header) size and the storage version used. Apart from the main database header, which remains plaintext, all headers and blocks are encrypted when encryption is used.

Blocks in DuckDB are by default 256KB, but their size is configurable. At the start of each plaintext block there is an 8-byte block header, which stores an 8-byte checksum. The checksum is a simple calculation that is often used in database systems to check for any corrupted data.

Plaintext block

For encrypted blocks, however, the block header consists of 40 bytes instead of 8. The block header for encrypted blocks contains a 16-byte nonce/IV and, optionally, a 16-byte tag, depending on which encryption cipher is used. The nonce and tag are stored in plaintext, but the checksum is encrypted for better security. Note that the block header always needs to be 8-byte aligned to calculate the checksum.

Encrypted block
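The 40-byte encrypted block header can be sketched with struct as follows (the sizes come from the text above; the exact on-disk field order is an assumption for illustration):

```python
import os
import struct

NONCE_LEN, TAG_LEN, CHECKSUM_LEN = 16, 16, 8

def pack_encrypted_block_header(nonce, tag, encrypted_checksum):
    """16-byte nonce/IV + 16-byte tag + 8-byte encrypted checksum = 40 bytes.

    Field order is illustrative; DuckDB's actual layout may differ.
    """
    return struct.pack(f"{NONCE_LEN}s{TAG_LEN}s{CHECKSUM_LEN}s",
                       nonce, tag, encrypted_checksum)

header = pack_encrypted_block_header(os.urandom(16), os.urandom(16), os.urandom(8))
assert len(header) == 40    # vs. the 8-byte plaintext block header
```

Note that the 8-byte checksum field keeps the header 8-byte aligned, as required for the checksum calculation.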

Write-Ahead-Log Encryption

The write-ahead log (WAL) in database systems is a crash-recovery mechanism that ensures durability. It is an append-only file that is used when the database crashes or is abruptly closed before all changes have been written to the main database file. The WAL makes sure these changes can be replayed up to the last checkpoint, which is a consistent snapshot of the database at a certain point in time. When a checkpoint is enforced, which happens in DuckDB by either (1) closing the database or (2) reaching a certain storage threshold, the WAL gets written into the main database file.

In DuckDB, you can force the creation of a WAL by setting

PRAGMA disable_checkpoint_on_shutdown;
PRAGMA wal_autocheckpoint = '1TB';

This disables checkpointing on closing the database, meaning that the WAL does not get merged into the main database file. In addition, setting wal_autocheckpoint to a high threshold avoids intermediate checkpoints, so the WAL will persist. For example, we can create a persistent WAL file by first setting the above PRAGMAs, then attaching an encrypted database, and then creating a table into which we insert 3 values.

ATTACH 'encrypted.db' AS enc (
    ENCRYPTION_KEY 'asdf',
    ENCRYPTION_CIPHER 'GCM'
);
CREATE TABLE enc.test (a INTEGER, b INTEGER);
INSERT INTO enc.test VALUES (11, 22), (13, 22), (12, 21);

If we now close the DuckDB process, we can see that a .wal file appears: encrypted.db.wal. But how is the WAL created internally?

Before new entries (inserts, updates, deletes) are written to the database, they are logged and appended to the WAL. Only after the logged entries are flushed to disk is a transaction considered committed. A plaintext WAL entry has the following structure:

Plaintext WAL entry

Since the WAL is append-only, we encrypt the WAL one entry at a time. For AES-GCM this means that we append a nonce and a tag to each entry, in the structure depicted below. When we serialize an encrypted entry to the encrypted WAL, we first store the length in plaintext, because we need to know how many bytes to decrypt. The length is followed by a nonce, which in turn is followed by the encrypted checksum and the encrypted entry itself. After the entry, a 16-byte tag is stored for verification.

Encrypted WAL entry

Encrypting the WAL is triggered by default when an encryption key is given for any (un)encrypted database.
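The encrypted WAL entry layout described above can be sketched as a byte layout (the nonce, tag, and checksum sizes come from the text; the width of the plaintext length field is an assumption, shown here as 8 bytes):

```python
import os
import struct

def serialize_encrypted_wal_entry(nonce, enc_checksum, enc_entry, tag):
    """Plaintext length, then nonce, encrypted checksum, encrypted entry, tag.

    Sketch of the layout only; DuckDB's actual serialization may differ.
    """
    assert len(nonce) == 16 and len(tag) == 16 and len(enc_checksum) == 8
    length = len(enc_checksum) + len(enc_entry)     # bytes to decrypt
    return struct.pack("<Q", length) + nonce + enc_checksum + enc_entry + tag

entry = serialize_encrypted_wal_entry(
    nonce=os.urandom(16),
    enc_checksum=os.urandom(8),
    enc_entry=b"encrypted payload",     # 17-byte stand-in ciphertext
    tag=os.urandom(16),
)
# 8 (length) + 16 (nonce) + 8 (checksum) + 17 (payload) + 16 (tag) = 65
assert len(entry) == 65
```

Keeping the length in plaintext is what lets the reader know how many of the following bytes belong to the ciphertext before any decryption happens.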

Temporary File Encryption

Temporary files are used to store intermediate data that is often necessary for large, out-of-core operations such as sorting, large joins and window functions. This data could contain sensitive information and can, in case of a crash, remain on disk. To protect this leftover data, DuckDB automatically encrypts temporary files too.

The Structure of Temporary Files

There are three different types of temporary files in DuckDB: (1) temporary files that have the same layout as a regular 256 KB block, (2) compressed temporary files, and (3) temporary files that exceed the standard 256 KB block size. The former two carry the suffix .tmp, while the latter is distinguished by the suffix .block. To keep track of the size of .block temporary files, they are always prefixed with their length. As opposed to regular database blocks, temporary files do not contain a checksum to check for data corruption, since calculating a checksum is somewhat expensive.
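The length prefix on .block temporary files can be sketched as a generic length-prefixed file layout (the 8-byte prefix width and file name are assumptions for illustration, not DuckDB's exact on-disk format):

```python
import os
import struct
import tempfile

def write_block_file(path, payload: bytes):
    """Write an oversized temporary block, prefixed with its length."""
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(payload)))    # 8-byte length prefix
        f.write(payload)

def read_block_file(path) -> bytes:
    """Read the length prefix, then exactly that many payload bytes."""
    with open(path, "rb") as f:
        (length,) = struct.unpack("<Q", f.read(8))
        return f.read(length)

path = os.path.join(tempfile.gettempdir(), "duckdb_demo.block")
write_block_file(path, b"x" * 300_000)              # larger than a 256 KB block
assert len(read_block_file(path)) == 300_000
os.remove(path)
```

Because the payload size varies per file, the prefix is what tells the reader how many bytes to consume, whereas fixed-size .tmp blocks need no such prefix.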

Encrypting Temporary Files

Temporary files are encrypted (1) automatically when you attach an encrypted database or (2) when you use the setting SET temp_file_encryption = true. In the latter case, the main database file is plaintext, but the temporary files will be encrypted. For the encryption of temporary files, DuckDB internally generates temporary keys. This means that when the database crashes, the temporary keys are lost: temporary files cannot be decrypted in this case and are essentially garbage.

To force DuckDB to produce temporary files, you can use a simple trick: just set the memory limit low. Temporary files are created once the memory limit is exceeded. For example, we can create a new encrypted database, load it with TPC-H data (SF 1), and then set the memory limit to 1 GB. If we then perform a large join, we force DuckDB to spill intermediate data to disk. For example:

SET memory_limit = '1GB';
ATTACH 'tpch_encrypted.db' AS enc (
    ENCRYPTION_KEY 'asdf',
    ENCRYPTION_CIPHER 'cipher'
);
USE enc;
CALL dbgen(sf = 1);

ALTER TABLE lineitem
    RENAME TO lineitem1;
CREATE TABLE lineitem2 AS
    FROM lineitem1;
CREATE OR REPLACE TABLE ans AS
    SELECT l1.* , l2.*
    FROM lineitem1 l1
    JOIN lineitem2 l2 USING (l_orderkey , l_linenumber);

This sequence of commands will result in encrypted temporary files being written to disk. Once the query completes or when the DuckDB shell is exited, the temporary files are automatically cleaned up. In case of a crash however, it may happen that temporary files will be left on disk and need to be cleaned up manually.

How to Use Encryption in DuckDB

In DuckDB, you can (1) encrypt an existing database, (2) initialize a new, empty encrypted database or (3) reencrypt a database. For example, let's create a new database, load this database with TPC-H data of scale factor 1 and then encrypt this database.

INSTALL tpch;
LOAD tpch;
ATTACH 'encrypted.duckdb' AS encrypted (ENCRYPTION_KEY 'asdf');
ATTACH 'unencrypted.duckdb' AS unencrypted;
USE unencrypted;
CALL dbgen(sf = 1);
COPY FROM DATABASE unencrypted TO encrypted;

There is no trivial way to prove that a database is encrypted, but correctly encrypted data should look like random noise and have high entropy. So, to check whether a database is actually encrypted, we can use tools that calculate the entropy or visualize the binary, such as ent and binocle.

Running ent on the database created by the SQL above, i.e., ent encrypted.duckdb, results in an entropy of 7.99999 bits per byte. Doing the same for the plaintext (unencrypted) database results in 7.65876 bits per byte. Note that the plaintext database also has high entropy, but this is due to compression.
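The bits-per-byte figure that ent reports is the Shannon entropy of the file's byte histogram; a minimal standard-library version:

```python
import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    counts = Counter(data)
    n = len(data)
    return sum(c / n * math.log2(n / c) for c in counts.values())

print(entropy_bits_per_byte(bytes(256)))         # all-zero bytes -> 0.0
print(entropy_bits_per_byte(bytes(range(256))))  # uniform bytes  -> 8.0
```

A well-encrypted file approaches the 8.0 maximum; compressed-but-plaintext data sits high as well, which is why the plaintext DuckDB file above still scores 7.65876.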

Let’s now visualize both the plaintext and encrypted data with binocle. For the visualization we created both a plaintext DuckDB database with scale factor of 0.001 of TPC-H data and an encrypted one:

Entropy visualization of a plaintext database
Entropy visualization of an encrypted database

In these figures, we can clearly observe that the encrypted database file seems completely random, while the plaintext database file shows some clear structure in its binary data.

To decrypt an encrypted database, we can use the following SQL:

ATTACH 'encrypted.duckdb' AS encrypted (ENCRYPTION_KEY 'asdf');
ATTACH 'new_unencrypted.duckdb' AS unencrypted;
COPY FROM DATABASE encrypted TO unencrypted;

And to re-encrypt an existing database, we can simply copy the old encrypted database to a new one:

ATTACH 'encrypted.duckdb' AS encrypted (ENCRYPTION_KEY 'asdf');
ATTACH 'new_encrypted.duckdb' AS new_encrypted (ENCRYPTION_KEY 'xxxx');
COPY FROM DATABASE encrypted TO new_encrypted;

The default encryption algorithm is AES GCM. This is the recommended option, since it also authenticates the data by calculating a tag. Depending on your use case, you can also use AES CTR, which is faster than AES GCM since it skips calculating the tag after encrypting the data. You can specify the CTR cipher as follows:

ATTACH 'encrypted.duckdb' AS encrypted (
    ENCRYPTION_KEY 'asdf',
    ENCRYPTION_CIPHER 'CTR'
);

To keep track of which databases are encrypted, you can run:

FROM duckdb_databases();

This will show which databases are encrypted, and which cipher is used:

| database_name | database_oid | path               | encrypted | cipher |
|---------------|--------------|--------------------|-----------|--------|
| encrypted     | 2103         | encrypted.duckdb   | true      | GCM    |
| unencrypted   | 2050         | unencrypted.duckdb | false     | NULL   |
| memory        | 592          | NULL               | false     | NULL   |
| system        | 0            | NULL               | false     | NULL   |
| temp          | 1995         | NULL               | false     | NULL   |

5 rows — 10 columns (5 shown)
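If you only need the encrypted catalogs, the same function can be filtered; for example, to list each encrypted database together with its cipher:

```sql
SELECT database_name, cipher
FROM duckdb_databases()
WHERE encrypted;
```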

Implementation and Performance

Here at DuckDB, we strive for a good out-of-the-box experience with zero external dependencies and a small footprint. Encryption and decryption, however, are usually handled by rather heavyweight external libraries such as OpenSSL. We would much prefer not to rely on external libraries or statically link huge codebases just so that people can use encryption in DuckDB without additional steps. This is why we actually implemented encryption twice in DuckDB: once with the (excellent) Mbed TLS library and once with the ubiquitous OpenSSL library.

DuckDB already shipped parts of Mbed TLS because we use it to verify RSA extension signatures. However, for maximum compatibility we disabled Mbed TLS's hardware acceleration, which has a performance impact. Furthermore, Mbed TLS is not particularly hardened against side channels such as timing attacks. OpenSSL, on the other hand, contains heavily vetted, hardware-accelerated code for AES operations, which is why we can also use it for encryption.

In DuckDB Land, OpenSSL is part of the httpfs extension. Once you load that extension, encryption automatically switches to using OpenSSL. After we shipped encryption in DuckDB 1.4.0, security experts found issues with the random number generator we used in Mbed TLS mode. Even though this would be difficult to actually exploit, we disabled writing to encrypted databases in Mbed TLS mode as of DuckDB 1.4.1. Instead, DuckDB now (version 1.4.2+) tries to auto-install and auto-load the httpfs extension whenever a write is attempted. We might revisit this in the future, but for now this seems the safest path forward that still allows high compatibility for reading. In OpenSSL mode, we always used a cryptographically secure random number generator, so that mode is unaffected.
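If you want to be sure the OpenSSL implementation is used from the very first operation, you can load httpfs explicitly before attaching an encrypted database, rather than relying on the auto-loading behavior:

```sql
INSTALL httpfs;
LOAD httpfs; -- encryption now uses OpenSSL
ATTACH 'encrypted.duckdb' AS encrypted (ENCRYPTION_KEY 'asdf');
```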

Encrypting and decrypting database files is an additional step in writing tables to disk, so we would naturally assume that there is some performance impact. Let’s investigate the performance impact of DuckDB’s new encryption feature with a very basic experiment.

We first create two DuckDB database files, one encrypted and one unencrypted. We use the TPC-H benchmark generator again to create the table data, particularly the (somewhat tired) lineitem table.

INSTALL httpfs;
INSTALL tpch;
LOAD tpch;

ATTACH 'unencrypted.duckdb' AS unencrypted;
CALL dbgen(sf = 10, catalog = 'unencrypted');

ATTACH 'encrypted.duckdb' AS encrypted (ENCRYPTION_KEY 'asdf');
CREATE TABLE encrypted.lineitem AS FROM unencrypted.lineitem;

Now we use DuckDB’s neat SUMMARIZE command three times: once on the unencrypted database, once on the encrypted database using Mbed TLS, and once on the encrypted database using OpenSSL. We set a very low memory limit to force more reading from and writing to disk.

SET memory_limit = '200MB';
.timer on

SUMMARIZE unencrypted.lineitem;
SUMMARIZE encrypted.lineitem;

LOAD httpfs; -- use OpenSSL
SUMMARIZE encrypted.lineitem;

Here are the results on a fairly recent MacBook: SUMMARIZE on the unencrypted table took ca. 5.4 seconds. Using Mbed TLS, this went up to around 6.2 s. However, when enabling OpenSSL, the end-to-end time went straight back to 5.4 s. How is this possible? Is decryption not expensive? Well, first, there is a lot more happening in query processing than reading blocks from storage, so the impact of decryption is not all that large, even with a slow implementation. Second, with hardware acceleration in OpenSSL, the overall overhead of encryption and decryption becomes almost negligible.

But just running summarization is overly simplistic. Real™ database workloads include modifications to data: insertions of new rows, updates, deletions, etc. Also, multiple clients will be updating and querying at the same time. So we resurrected the full TPC-H “Power” test from our previous blog post “Changing Data with Confidence and ACID” and slightly tweaked the benchmark script to enable the new database encryption. For this experiment, we used the OpenSSL encryption implementation due to the issues outlined above. We report “Power@Size” and “Throughput@Size”: the former measures raw sequential query performance, while the latter measures multiple parallel query streams in the presence of updates.

When running on the same MacBook with DuckDB 1.4.1 and a “scale factor” of 100, we get a Power@Size metric of 624,296 and a Throughput@Size metric of 450,409 without encryption.

When we enable encryption, the results are almost unchanged, confirming the observation from the small microbenchmark above. However, the relationship between available memory and the benchmark size means that we’re not stressing temporary file encryption. So we re-ran everything with an 8 GB memory limit. We confirmed constant reading and writing to and from disk in this configuration by observing operating system statistics. For the unencrypted case, the Power@Size metric predictably went down to 591,841 and Throughput@Size went down to 153,690. With encryption enabled, we observed a slight performance decrease, with a Power@Size of 571,985 and a Throughput@Size of 145,353. That difference is not very large either, and likely not relevant in real operational scenarios.
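For context, the relative slowdowns implied by these numbers are small; computing them from the figures above:

```sql
SELECT
    round((591841 - 571985) / 591841.0 * 100, 1) AS power_drop_pct,      -- 3.4
    round((153690 - 145353) / 153690.0 * 100, 1) AS throughput_drop_pct; -- 5.4
```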

Conclusion

With the new encrypted database feature, we can now safely pass around DuckDB database files with all information inside them completely opaque to prying eyes. This allows for some interesting new deployment models for DuckDB. For example, we could now put an encrypted DuckDB database file on a Content Delivery Network (CDN), and a fleet of DuckDB instances could attach to this file read-only using the decryption key. This elegantly allows efficient distribution of private background data, similarly to encrypted Parquet files, but of course with many more features, such as multi-table storage. Encrypted storage also simplifies threat modeling when, for example, using DuckDB on cloud providers. While in the past access to DuckDB storage would have been enough to leak data, we can now relax the paranoia regarding storage a little, especially since temporary files and the WAL are also encrypted. And the best part of all this: there is almost no performance overhead to using encryption in DuckDB, especially with the OpenSSL implementation.
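A sketch of that CDN deployment model (the URL here is hypothetical, and attaching files over HTTPS requires the httpfs extension, which also provides the OpenSSL encryption implementation):

```sql
INSTALL httpfs;
LOAD httpfs;

-- Read-only attach over HTTPS using the shared decryption key.
ATTACH 'https://cdn.example.com/shared.duckdb' AS shared (
    READ_ONLY,
    ENCRYPTION_KEY 'shared-secret'
);
-- Query the private background data as usual,
-- assuming the shared database contains a lineitem table:
FROM shared.lineitem LIMIT 10;
```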

We are very much looking forward to what you are going to do with this feature, and please let us know if you run into any issues.

]]>
Lotte Felius, Hannes Mühleisen