DuckDB.ExtensionKit: Building DuckDB Extensions in C#

2026-03-20 · https://duckdb.org/2026/03/20/duckdb-extensionkit-csharp

Introduction

DuckDB has a flexible extension mechanism that allows extensions to be loaded dynamically at runtime. This makes it easy to extend DuckDB’s main feature set without adding everything to the main binary. Extensions can add support for new file formats, introduce custom types, or provide new scalar and table functions. A significant part of DuckDB’s functionality is actually implemented using this extension mechanism in the form of core extensions, which are developed alongside the engine itself by the DuckDB team. For example, DuckDB can read and write JSON files via the json extension and integrate with PostgreSQL using the postgres extension.

DuckDB also has a thriving ecosystem of community extensions: third-party extensions, maintained by community members, that cover a wide range of use cases and integrations. For example, the crypto community extension exposes additional cryptographic functionality.

How Extensions Are Built Today

Today, developers can build extensions using the same C++ API that the core extensions use. A template for creating extensions is available in the extension-template repository. While powerful, the C++ extension API is tightly coupled to DuckDB's internals, so it can (and often will) change between DuckDB versions. Additionally, using it requires building the whole DuckDB engine, and its documentation is less complete than that of the C API.

To solve these issues, DuckDB also provides an experimental template for C/C++ based extensions that link with the C Extension API of DuckDB. This API provides a stable, backwards-compatible interface for developing extensions and is designed to allow extensions to work across different DuckDB versions. Because it is a C-based API, it can also be used from other programming languages such as Rust.

Even with the C API, writing extensions still means working at a low level: performing manual memory management and writing a lot of boilerplate code. While the C API solves stability and compatibility, it doesn't improve the developer experience for higher-level ecosystems. This is where DuckDB.ExtensionKit comes in, aiming to make extension development more accessible to developers working in the .NET ecosystem. By building on top of the DuckDB C Extension API and compiling extensions with .NET Native AOT (ahead-of-time) compilation, DuckDB.ExtensionKit offers the best of both worlds: native DuckDB extensions that integrate like any other extension, combined with the productivity and rich library ecosystem of C# and .NET.

DuckDB.ExtensionKit

DuckDB.ExtensionKit provides a set of C# APIs and build tooling for implementing DuckDB extensions. It exposes the low-level DuckDB C Extension API as C# methods, and also provides type-safe, higher-level APIs for defining scalar and table functions, while still producing native DuckDB extensions. The toolkit also includes a source generator that automatically generates the required boilerplate code, including the native entry point and API initialization.

With DuckDB.ExtensionKit, building an extension closely resembles building a regular C# library. Extension authors create a C# project that references the ExtensionKit runtime and implements functions using the provided, type-safe APIs that expose DuckDB concepts.

At build time, the source generator emits the required boilerplate, including the native entry point and extension initialization. The project is then compiled using .NET Native AOT, producing a native DuckDB extension binary that can be loaded and used by DuckDB like any other extension, without requiring a .NET runtime.

To make this process concrete, the following snippet shows a small DuckDB extension implemented with DuckDB.ExtensionKit that exposes both a scalar function and a table function for working with JWTs (JSON Web Tokens). At a high level, writing an extension with DuckDB.ExtensionKit involves defining a C# type that represents the extension and registering functions explicitly. In the example below, this is done by creating a partial class annotated with the [DuckDBExtension] attribute and implementing the RegisterFunctions method. The implementation uses the System.IdentityModel.Tokens.Jwt NuGet package, illustrating how extensions can easily take advantage of existing .NET libraries.

We'll add two functions: a scalar function for extracting a single claim from a JWT and a table function for extracting all of its claims.

public static partial class JwtExtension
{
  private static void RegisterFunctions(DuckDBConnection connection)
  {
    connection.RegisterScalarFunction<string, string, string?>("extract_claim_from_jwt", ExtractClaimFromJwt);

    connection.RegisterTableFunction("extract_claims_from_jwt", (string jwt) => ExtractClaimsFromJwt(jwt),
                                     c => new { claim_name = c.Key, claim_value = c.Value });
  }

  private static string? ExtractClaimFromJwt(string jwt, string claim)
  {
    var jwtHandler = new JwtSecurityTokenHandler();
    var token = jwtHandler.ReadJwtToken(jwt);
    return token.Claims.FirstOrDefault(c => c.Type == claim)?.Value;
  }

  private static Dictionary<string, string> ExtractClaimsFromJwt(string jwt)
  {
    var jwtHandler = new JwtSecurityTokenHandler();
    var token = jwtHandler.ReadJwtToken(jwt);
    return token.Claims.ToDictionary(c => c.Type, c => c.Value);
  }
}

In just 25 lines, we have built an extension that adds the extract_claim_from_jwt and extract_claims_from_jwt functions to DuckDB. We can call them just like any other function. For example, to extract the name claim from a token, we can run:

SELECT extract_claim_from_jwt(
    'eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsImtpZCI6ImExZmIyY2NjN2FiMjBiMDYyNzJmNGUxMjIwZDEwZmZlIn0.eyJpc3MiOiJodHRwczovL2lkcC5sb2NhbCIsImF1ZCI6Im15X2NsaWVudF9hcHAiLCJuYW1lIjoiR2lvcmdpIERhbGFraXNodmlsaSIsInN1YiI6IjViZTg2MzU5MDczYzQzNGJhZDJkYTM5MzIyMjJkYWJlIiwiYWRtaW4iOnRydWUsImV4cCI6MTc2NjU5MTI2NywiaWF0IjoxNzY2NTkwOTY3fQ.N7h2xc4rgS4oPo8IO9wyG1lnr2wqTUC80YudWTXp7rXmU2JdsUiweKmuYVVbygdJAR4PJmbQtak4_VuZg2fZFILVpzDyLvGITfUW_18XuDQ_SIm3VlfAuHOVHfruuvvSAfjUkTW2Jlrv3ihFYgusV58vjhcVFHssOGMEbtMNo10Jf62dczVVGNZXh_OOLS0nTLffhY94sZddqQIE56W8xhLK5YMO4gO8voMzhUwDwucnVvyNfui38MPDNdTSKjn3Ab0hG8jzOVhbYSCHf0eQsbxPzGtXUCJobScWDb78IphFWec6W4ugIYp5CMh3C_noQi94NYjQg2P-AJ5FLCKzKA',
    'name'
);

This returns Giorgi Dalakishvili. Let's test the table function:

SELECT *
FROM extract_claims_from_jwt(
    'eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsImtpZCI6ImExZmIyY2NjN2FiMjBiMDYyNzJmNGUxMjIwZDEwZmZlIn0.eyJpc3MiOiJodHRwczovL2lkcC5sb2NhbCIsImF1ZCI6Im15X2NsaWVudF9hcHAiLCJuYW1lIjoiR2lvcmdpIERhbGFraXNodmlsaSIsInN1YiI6IjViZTg2MzU5MDczYzQzNGJhZDJkYTM5MzIyMjJkYWJlIiwiYWRtaW4iOnRydWUsImV4cCI6MTc2NjU5MTI2NywiaWF0IjoxNzY2NTkwOTY3fQ.N7h2xc4rgS4oPo8IO9wyG1lnr2wqTUC80YudWTXp7rXmU2JdsUiweKmuYVVbygdJAR4PJmbQtak4_VuZg2fZFILVpzDyLvGITfUW_18XuDQ_SIm3VlfAuHOVHfruuvvSAfjUkTW2Jlrv3ihFYgusV58vjhcVFHssOGMEbtMNo10Jf62dczVVGNZXh_OOLS0nTLffhY94sZddqQIE56W8xhLK5YMO4gO8voMzhUwDwucnVvyNfui38MPDNdTSKjn3Ab0hG8jzOVhbYSCHf0eQsbxPzGtXUCJobScWDb78IphFWec6W4ugIYp5CMh3C_noQi94NYjQg2P-AJ5FLCKzKA'
);

This returns:

claim_name  claim_value
iss         https://idp.local
aud         my_client_app
name        Giorgi Dalakishvili
sub         5be86359073c434bad2da3932222dabe
admin       true
exp         1766591267
iat         1766590967
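For readers unfamiliar with JWTs: the claim extraction performed by the example above can be sketched in plain Python. This is an illustration only, not the extension's actual implementation: extract_claim and b64url are hypothetical helpers, and no signature verification is performed (just as JwtSecurityTokenHandler.ReadJwtToken does not verify signatures).

```python
import base64
import json

def extract_claim(jwt: str, claim: str):
    """Decode the (unverified) payload segment of a JWT and pull out one claim."""
    payload_b64 = jwt.split(".")[1]
    # base64url may omit padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get(claim)

def b64url(obj) -> str:
    """Encode a dict as an unpadded base64url JSON segment."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

# Build a sample JWT-shaped token (header.payload.signature) for illustration.
token = ".".join([
    b64url({"alg": "none", "typ": "JWT"}),
    b64url({"iss": "https://idp.local", "name": "Alice", "admin": True}),
    "",  # empty signature segment
])

print(extract_claim(token, "name"))  # prints "Alice"
```

The table function in the example does the same decoding, but returns every claim of the payload as one row.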

How DuckDB.ExtensionKit Works

DuckDB.ExtensionKit relies on several modern C# language and runtime features to efficiently bridge DuckDB’s C extension API to managed code. These features make it possible to build native extensions in C# without introducing a managed runtime dependency at load time.

Function Pointers

DuckDB’s C extension API is exposed as a versioned function table: a large struct (duckdb_ext_api_v1) whose fields are C function pointers (e.g., duckdb_open, duckdb_register_scalar_function, duckdb_vector_get_data, and so on). DuckDB.ExtensionKit mirrors this mechanism in C#. It defines a C# representation of the struct (DuckDBExtApiV1), where each field is declared as a C# function pointer (delegate* unmanaged[Cdecl]<...>). This maps the C ABI directly: calling into DuckDB becomes a simple indirect call through a function pointer field, rather than a delegate invocation with runtime marshaling.
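The function-table pattern described above can be mimicked in any FFI layer. The sketch below uses Python's ctypes purely as an illustration: ToyApiV1, add, and mul are made-up stand-ins, not the layout of the real duckdb_ext_api_v1 struct.

```python
import ctypes

# A C-style function pointer type: int (*)(int, int), cdecl.
BINOP = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int, ctypes.c_int)

# Toy stand-in for a versioned API table: a struct whose fields
# are function pointers, handed across the ABI boundary in one piece.
class ToyApiV1(ctypes.Structure):
    _fields_ = [
        ("add", BINOP),
        ("mul", BINOP),
    ]

api = ToyApiV1(
    add=BINOP(lambda a, b: a + b),
    mul=BINOP(lambda a, b: a * b),
)

# Calling a field is an indirect call through a function pointer --
# the same mechanism an extension uses to invoke the host's API.
print(api.add(2, 3))  # 5
print(api.mul(4, 5))  # 20
```

Versioning the struct (v1, v2, ...) lets the host grow the API by appending fields without breaking extensions compiled against an older table.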

Entrypoint

A DuckDB extension needs to expose an entrypoint function following the C calling convention; the symbol exported from the binary is the name of the extension followed by _init_c_api. This way, DuckDB can locate the entrypoint when the extension is loaded. In the C extension template, this is handled with macros that generate the exported function and the surrounding boilerplate.

DuckDB.ExtensionKit follows the same model, but generates the boilerplate from C# instead of C macros. The source generator emits a native-compatible entrypoint that retrieves the API table (via the access object) and performs the required initialization, just like the C template does. The generated method is annotated with [UnmanagedCallersOnly(EntryPoint = "...")], which instructs the .NET toolchain to export a real native symbol with that name and make it callable from C. With .NET Native AOT, this becomes an actual exported function in the produced binary – allowing DuckDB to load and call into the extension exactly as it would for a C implementation.
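The naming convention makes the lookup purely mechanical. The sketch below derives the symbol name and then, as a stand-in for DuckDB's loader, resolves a function from a shared library by name via ctypes (using the C math library's cos, since we cannot load a real extension here; entrypoint_symbol is a hypothetical helper):

```python
import ctypes
import ctypes.util

def entrypoint_symbol(extension_name: str) -> str:
    # DuckDB derives the exported symbol from the extension name.
    return f"{extension_name}_init_c_api"

print(entrypoint_symbol("jwt"))  # jwt_init_c_api

# Resolving an exported symbol by name from a loaded binary is the
# same dlopen/dlsym-style mechanism a host uses to find an entrypoint.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)
cos = libm.cos
cos.restype = ctypes.c_double
cos.argtypes = [ctypes.c_double]
print(cos(0.0))  # 1.0
```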

Native AOT

Finally, Native AOT is what makes this approach practical for DuckDB extensions. Once the extension code and generated sources are compiled, the project is published using .NET Native AOT. This step produces a native binary with no dependency on a managed runtime at load time. The resulting artifact is a native DuckDB extension that can be loaded and executed in the same way as extensions written in C or C++. From DuckDB’s perspective, there is no difference between an extension built with DuckDB.ExtensionKit and one implemented in a traditional native language.

Current Status and Limitations

DuckDB.ExtensionKit, just like the C extension template, is currently experimental. The APIs are still evolving, and not all extension features supported by DuckDB are exposed yet.

The toolkit relies on .NET Native AOT, which means extensions need to be built for specific target platforms (for example, linux-x64, osx-arm64, or win-x64). As with other native extensions, binaries are platform-specific and need to be built accordingly.

Build Your Own Extension in C#

DuckDB.ExtensionKit is available as an open-source project on GitHub under the MIT license. The project includes example extensions that demonstrate how to define and build DuckDB extensions in C#. The repository contains a JWT-based example extension that showcases both scalar functions and table functions, as well as the full build and publishing workflow using .NET Native AOT.

Feedback, bug reports, and contributions are welcome through GitHub issues.

Closing Thoughts

DuckDB’s extension mechanism has proven to be a flexible foundation for extending the system without complicating the core engine. DuckDB.ExtensionKit explores how this mechanism can be made accessible to a broader audience by leveraging the .NET ecosystem, while still producing native extensions that integrate directly with DuckDB.

Although C# is typically viewed as a high-level language, this project demonstrates that it can also be used to implement low-level, ABI-compatible components when needed. By combining modern C# features with DuckDB’s existing extension interface, it is possible to write extensions in a high-level language without giving up control over native boundaries.

Giorgi Dalakishvili
Big Data on the Cheapest MacBook

2026-03-11 · https://duckdb.org/2026/03/11/big-data-on-the-cheapest-macbook

Apple released the MacBook Neo today and there is no shortage of tech reviews explaining whether it's the right device for you if you are a student, a photographer or a writer. What they don't tell you is whether it fits into our Big Data on Your Laptop ethos. We wanted to answer this using a data-driven approach, so we went to the nearest Apple Store, picked one up and took it for a spin.

What's in the Box?

Well, not much! If you buy this machine in the EU, there isn't even a charging brick included. All you get is the laptop and a braided USB-C cable. But you likely already have a few USB-C bricks lying around – let's move on to the laptop itself!

The only part of the hardware specification that you can select is the disk: you can pick either 256 or 512 GB. As our mission is to deal with alleged “Big Data”, we picked the larger option, which brings the price to $700 in the US or €800 in the EU. The amount of memory is fixed at 8 GB. And while there is only a single CPU option, it is quite an interesting one: this laptop is powered by the 6-core Apple A18 Pro, originally built for the iPhone 16 Pro.

It turns out that we have already tested this phone under some unusual circumstances. Back in 2024, with DuckDB v1.2-dev, we found that the iPhone 16 Pro could complete all TPC-H queries at scale factor 100 in about 10 minutes when air-cooled and in less than 8 minutes while lying in a box of dry ice. The MacBook Neo should definitely be able to handle this workload – but maybe it can even handle a bit more. Cue the inevitable benchmarks!

ClickBench

For our first experiment, we used ClickBench, an analytical database benchmark. ClickBench has 43 queries that focus on aggregation and filtering operations. The operations run on a single wide table with 100M rows, which uses about 14 GB when serialized to Parquet and 75 GB when stored in CSV format.

Benchmark Environment

We ported ClickBench's DuckDB implementation to macOS and ran it on the MacBook Neo using the freshly minted v1.5.0 release. We only applied a small tweak: as suggested in our performance guide, we slightly lowered the memory limit to 5 GB, to reduce relying on the OS' swapping and to let DuckDB handle memory management for larger-than-memory workloads. This is a common trick in memory-constrained environments where other processes are likely using more than 20% of the total system memory.

We also re-ran ClickBench with DuckDB v1.5.0 on two cloud instances, yielding the following lineup:

  • MacBook Neo (6-core Apple A18 Pro, 8 GB RAM, local NVMe SSD)
  • c6a.4xlarge (16 vCPUs, 32 GB RAM, network-attached storage)
  • c8g.metal-48xl (192 vCPUs, 384 GB RAM, network-attached storage)

The benchmark script first loaded the Parquet file into the database. Then, as per ClickBench's rules, it ran each query three times to capture both cold runs (the first run when caches are cold) and hot runs (when the system has a chance to exploit e.g. file system caching).

Results and Analysis

Our experiments produced the following aggregate runtimes, in seconds:

Machine          Cold run (median)   Cold run (total)   Hot run (median)   Hot run (total)
MacBook Neo                  0.57              59.73               0.41              54.27
c6a.4xlarge                  1.34             145.08               0.50              47.86
c8g.metal-48xl               1.54             169.67               0.05               4.35

Cold run. The results start with a big surprise: in the cold run, the MacBook Neo is the clear winner with a sub-second median runtime, completing all queries in under a minute! Of course, if we dig deeper into the setups, there is an explanation for this. The cloud instances have network-attached disks, and accessing the database on these dominates the overall query runtimes. The MacBook Neo has a local NVMe SSD, which is far from best-in-class, but still provides relatively quick access on the first read.

Hot run. In the hot runs, the MacBook's total runtime only improves by approximately 10%, while the cloud machines come into their own, with the c8g.metal-48xl winning by an order of magnitude. However, it's worth noting that on median query runtimes the MacBook Neo can still beat the c6a.4xlarge, a mid-sized cloud instance. And the laptop's total runtime is only about 13% slower despite the cloud box having 10 more CPU threads and 4 times as much RAM.
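The percentages quoted above follow directly from the results table; a quick sanity check:

```python
# Total runtimes (seconds) from the ClickBench results table.
cold_total = {"MacBook Neo": 59.73, "c6a.4xlarge": 145.08, "c8g.metal-48xl": 169.67}
hot_total  = {"MacBook Neo": 54.27, "c6a.4xlarge": 47.86,  "c8g.metal-48xl": 4.35}

# Hot vs. cold improvement on the MacBook: roughly 10%.
macbook_improvement = 1 - hot_total["MacBook Neo"] / cold_total["MacBook Neo"]
print(f"{macbook_improvement:.1%}")  # 9.1%

# Hot-run gap between the laptop and the mid-sized cloud box: about 13%.
slowdown_vs_c6a = hot_total["MacBook Neo"] / hot_total["c6a.4xlarge"] - 1
print(f"{slowdown_vs_c6a:.1%}")  # 13.4%
```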

TPC-DS

For our second experiment, we picked the queries of the TPC-DS benchmark. Compared to the ubiquitous TPC-H benchmark, which has 8 tables and 22 queries, TPC-DS has 24 tables and 99 queries, many of which are more complex and include features such as window functions. And while TPC-H has been optimized to death, there is still some semblance of value in TPC-DS results. Let's see whether the cheapest MacBook can handle these queries!

For this round, we used DuckDB's LTS version, v1.4.4. We generated the datasets using DuckDB's tpcds extension and set the memory limit to 6 GB.

At SF100, the laptop breezed through most queries with a median query runtime of 1.63 seconds and a total runtime of 15.5 minutes.

At SF300, the memory constraint started to show. While the median query runtime was still quite good at 6.90 seconds, DuckDB occasionally used up to 80 GB of space for spilling to disk and it was clear that some queries were going to take a long time. Most notably, query 67 took 51 minutes to complete. But hardware and software continued to work together tirelessly, and they ultimately passed the test, completing all queries in 79 minutes.

Should You Buy One?

Here's the thing: if you are running Big Data workloads on your laptop every day, you probably shouldn't get the MacBook Neo. Yes, DuckDB runs on it, and can handle a lot of data by leveraging out-of-core processing. But the MacBook Neo's disk I/O is lackluster compared to the Air and Pro models (about 1.5 GB/s compared to 3–6 GB/s), and the 8 GB memory will be limiting in the long run. If you need to process Big Data on the move and can pay up a bit, the other MacBook models will serve your needs better and there are also good options for Linux and Windows.

All that said, if you run DuckDB in the cloud and primarily use your laptop as a client, this is a great device. And you can rest assured that if you occasionally need to crunch some data locally, DuckDB on the MacBook Neo will be up to the challenge.

Gábor Szárnyás
Announcing DuckDB 1.5.0

2026-03-09 · https://duckdb.org/2026/03/09/announcing-duckdb-150

We are proud to release DuckDB v1.5.0, codenamed “Variegata” after the paradise shelduck (Tadorna variegata), endemic to New Zealand.

In this blog post, we cover the most important updates for this release around support, features and extensions. As always, there is more: for the complete release notes, see the release page on GitHub.

To install the new version, please visit the installation page. Note that it can take a few days to release some extensions (e.g., the UI) and client libraries (e.g., Go, R, Java) due to the extra changes and review rounds required.

With this release, we will have two DuckDB releases available: v1.4 (LTS) and v1.5 (current). The next release – planned for September – will ship a major version, DuckDB 2.0.

New Features

Command Line Client

For users who use DuckDB through the terminal, the highlight of the new release is a rework of the CLI client with a new color scheme, dynamic prompts, a pager and many other convenience features.

Color Scheme

We shipped a new color palette and harmonized it with the documentation. The color palette is available in both dark mode and light mode. Both use two shades of gray, and five colors for keywords, strings, errors, functions and numbers. You can find the color palette in the Design Manual.

You can customize the color scheme using the .highlight_colors dot command:

.highlight_colors column_name darkgreen bold_underline
.highlight_colors numeric_value red bold
.highlight_colors string_value purple2
FROM ducks;

(Screenshots: the DuckDB CLI in light mode and in dark mode.)

Dynamic Prompts in the CLI

DuckDB v1.5.0 introduces dynamic prompts for the CLI (PR #19579). By default, these show the database and schema that you are currently connected to:

duckdb
memory D ATTACH 'my_database.duckdb';
memory D USE my_database;
my_database D CREATE SCHEMA my_schema;
my_database D USE my_schema;
my_database.my_schema D ...

These prompts can be configured using bracket codes to have a maximum length, run a custom query, use different colors, etc. (#19579).

.tables and DESCRIBE

To show the columns of an individual table, use the DESCRIBE statement:

memory D ATTACH 'https://blobs.duckdb.org/data/animals.db' AS animals_db;
memory D USE animals_db;
animals_db D DESCRIBE ducks;
┌──────────────────────┐
│        ducks         │
│                      │
│ id           integer │
│ name         varchar │
│ extinct_year integer │
└──────────────────────┘

The .tables dot command lists the attached catalogs, the schemas and tables in them, and the columns in each table.

memory D ATTACH 'https://blobs.duckdb.org/data/animals.db' AS animals_db;
memory D ATTACH 'https://blobs.duckdb.org/data/numbers1.db';
memory D .tables
 ────────────── animals_db ───────────────
 ───────────────── main ──────────────────
┌─────────────────┐┌──────────────────────┐
│      swans      ││        ducks         │
│                 ││                      │
│ id      integer ││ id           integer │
│ name    varchar ││ name         varchar │
│ species varchar ││ extinct_year integer │
│ color   varchar ││                      │
│ habitat varchar ││        5 rows        │
│                 │└──────────────────────┘
│     3 rows      │
└─────────────────┘
  numbers1
 ── main ──
┌──────────┐
│   tbl    │
│          │
│ i bigint │
│          │
│  2 rows  │
└──────────┘

Accessing the Last Result Using _

You can access the last result of a query inline using the underscore character _. This is not only convenient but also makes it unnecessary to re-run potentially long-running queries:

memory D ATTACH 'https://blobs.duckdb.org/data/animals.db' AS animals_db;
memory D USE animals_db;
animals_db D FROM ducks WHERE extinct_year IS NOT NULL;
┌───────┬──────────────────┬──────────────┐
│  id   │       name       │ extinct_year │
│ int32 │     varchar      │    int32     │
├───────┼──────────────────┼──────────────┤
│     1 │ Labrador Duck    │         1878 │
│     3 │ Crested Shelduck │         1964 │
│     5 │ Pink-headed Duck │         1949 │
└───────┴──────────────────┴──────────────┘
animals_db D FROM _;
┌───────┬──────────────────┬──────────────┐
│  id   │       name       │ extinct_year │
│ int32 │     varchar      │    int32     │
├───────┼──────────────────┼──────────────┤
│     1 │ Labrador Duck    │         1878 │
│     3 │ Crested Shelduck │         1964 │
│     5 │ Pink-headed Duck │         1949 │
└───────┴──────────────────┴──────────────┘

Pager

Last but not least, the CLI now has a pager! It is triggered when there are more than 50 rows in the results.

memory D .maxrows 100
memory D FROM range(0, 100);

You can navigate on Linux and Windows using Page Up / Page Down. On macOS, use Fn + Up / Down. To exit the pager, press Q.

The initial implementation of the pager was provided by tobwen in #19004.

PEG Parser

DuckDB v1.5 ships an experimental parser based on PEG (Parsing Expression Grammars). The new parser enables better suggestions, improved error messages, and allows extensions to extend the grammar. The PEG parser is currently disabled by default but you can opt in using:

CALL enable_peg_parser();

The PEG parser is already used for generating suggestions. You can cycle through the options using TAB.

animals_db D FROM ducks WHERE habitat IS 
IS           ISNULL       ILIKE        IN           INTERSECT    LIKE

We are planning to make the switch to the new parser in the upcoming DuckDB release.

As a tradeoff, the new parser has a slight performance overhead; however, this is in the range of milliseconds and thus negligible for analytical queries. For more details on the rationale for using a PEG parser and for benchmark results, please refer to the CIDR 2026 paper by Hannes and Mark, or their blog post summarizing the paper.

VARIANT Type

DuckDB now natively supports the VARIANT type, inspired by Snowflake's semi-structured VARIANT data type and available in Parquet since 2025. Unlike the JSON type, which is physically stored as text, VARIANT stores typed, binary data. Each row in a VARIANT column is self-contained with its own type information. This leads to better compression and query performance. Here are a few examples of using VARIANT.

Store different types in the same column:

CREATE TABLE events (id INTEGER, data VARIANT);
INSERT INTO events VALUES
    (1, 42::VARIANT),
    (2, 'hello world'::VARIANT),
    (3, [1, 2, 3]::VARIANT),
    (4, {'name': 'Alice', 'age': 30}::VARIANT);

SELECT * FROM events;
┌───────┬────────────────────────────┐
│  id   │            data            │
│ int32 │          variant           │
├───────┼────────────────────────────┤
│     1 │ 42                         │
│     2 │ hello world                │
│     3 │ [1, 2, 3]                  │
│     4 │ {'name': Alice, 'age': 30} │
└───────┴────────────────────────────┘

Check the underlying type of each row:

SELECT id, data, variant_typeof(data) AS vtype
FROM events;
┌───────┬────────────────────────────┬───────────────────┐
│  id   │            data            │       vtype       │
│ int32 │          variant           │      varchar      │
├───────┼────────────────────────────┼───────────────────┤
│     1 │ 42                         │ INT32             │
│     2 │ hello world                │ VARCHAR           │
│     3 │ [1, 2, 3]                  │ ARRAY(3)          │
│     4 │ {'name': Alice, 'age': 30} │ OBJECT(name, age) │
└───────┴────────────────────────────┴───────────────────┘

You can extract fields from nested variants using the dot notation or the variant_extract function:

SELECT data.name FROM events WHERE id = 4;
-- or 
SELECT variant_extract(data, 'name') AS name FROM events WHERE id = 4;
┌─────────┐
│  name   │
│ variant │
├─────────┤
│ Alice   │
└─────────┘

DuckDB also supports reading VARIANT types from Parquet files, including shredded variants (where parts of the variant are stored as regular typed columns).
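To build intuition for “typed, binary, self-contained per row”, here is a toy tagged encoding in Python. This is emphatically NOT DuckDB's on-disk VARIANT format; encode/decode and the tag values are invented for illustration only.

```python
import struct
import json

# Each value carries a one-byte type tag followed by its payload,
# so every encoded value is self-describing.
TAG_INT, TAG_STR, TAG_JSON = 0, 1, 2

def encode(value) -> bytes:
    if isinstance(value, int) and not isinstance(value, bool):
        return bytes([TAG_INT]) + struct.pack("<q", value)
    if isinstance(value, str):
        raw = value.encode()
        return bytes([TAG_STR]) + struct.pack("<I", len(raw)) + raw
    raw = json.dumps(value).encode()  # lists/objects: JSON bytes as a fallback
    return bytes([TAG_JSON]) + struct.pack("<I", len(raw)) + raw

def decode(buf: bytes):
    tag, payload = buf[0], buf[1:]
    if tag == TAG_INT:
        return struct.unpack("<q", payload)[0]
    if tag == TAG_STR:
        return payload[4:].decode()
    return json.loads(payload[4:])

# Mixed types round-trip through the same "column".
for v in [42, "hello world", [1, 2, 3], {"name": "Alice", "age": 30}]:
    assert decode(encode(v)) == v
```

Because the type tag travels with the value, a reader can dispatch on it per row, which is what makes functions like variant_typeof possible.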

read_duckdb Function

The read_duckdb table function can read DuckDB databases without first attaching them. This can make reading from DuckDB databases more ergonomic – for example, you can use globbing. You can read the example numbers databases as follows:

SELECT min(i), max(i)
FROM read_duckdb('numbers*.db');
┌────────┬────────┐
│ min(i) │ max(i) │
│ int64  │ int64  │
├────────┼────────┤
│      1 │      5 │
└────────┴────────┘

Azure Writes

You can now write to Azure Blob Storage or ADLSv2 storage using the COPY statement:

-- Write query results to a Parquet file on Blob Storage
COPY (SELECT * FROM my_table)
TO 'az://my_container/path/output.parquet';

-- Write a table to a CSV file on ADLSv2 Storage
COPY my_table
TO 'abfss://my_container/path/output.csv';

ODBC Scanner

We are now shipping an ODBC scanner extension. This allows you to query a remote endpoint as follows:

LOAD odbc_scanner;
SET VARIABLE conn = odbc_connect('Driver={Oracle Driver};DBQ=//127.0.0.1:1521/XE;UID=scott;PWD=tiger;');
SELECT * FROM odbc_query(getvariable('conn'), 'SELECT SYSTIMESTAMP FROM dual;');

In the coming weeks, we'll publish the documentation page and release a follow-up post on the ODBC scanner. In the meantime, please refer to the project's README.

Major Changes

Lakehouse Updates

All of DuckDB’s supported Lakehouse formats have received some updates for v1.5.

DuckLake

The main DuckLake change for DuckDB v1.5 is updating the DuckLake specification to v0.4. We are aiming for this to be the same specification that ships with DuckLake 1.0, which will be released in April. Its main highlights include:

  • Macro support.
  • Sorted tables.
  • Deletion inlining and addition of partial delete files.
  • Internal rework of DuckLake options.

We'll announce more details about these features in the blog post for DuckLake v1.

Delta Lake

For the Delta Lake extension, the team has focused on improving support for writes via Unity Catalog, Delta idempotent writes and table CHECKPOINTs.

Iceberg

For the Iceberg extension, the team is working on a larger release for v1.5.1. For v1.5.0, the main feature is the addition of table properties in the CREATE TABLE statement:

CREATE TABLE test_create_table (a INTEGER)
WITH (
    'format-version' = '2', -- recognized properties like format-version are elevated to table metadata
    'location' = 's3://path/to/data', -- location determines where the table data is stored
    'property1' = 'value1', -- other properties are stored as plain table properties
    'property2' = 'value2'
);

Other minor additions have been made to enable passing EXTRA_HTTP_HEADERS when attaching to an Iceberg catalog, which has unlocked Google’s BigLake.

Both Delta and DuckLake have implemented the VARIANT type. Iceberg’s VARIANT type will ship in the v1.5.1 release with some other features that are specific to the Iceberg v3 specification.

Network Stack

The default backend for the httpfs extension has changed from httplib to curl. As one of the most popular and well-tested open-source projects, we expect curl to provide long-standing stability and security for DuckDB. Regardless of the http library used, openssl is still the backing SSL library and options such as http_timeout, http_retries, etc. are still the same.

Our community has been testing the new network stack for the last few weeks. Still, if you encounter any issues, please submit them to the duckdb-httpfs repository.


Due to technical reasons, httplib is still the library we use for downloading the httpfs extension itself. Once httpfs is loaded with the (now default) curl backend, subsequent extension installations go through https://, with the default endpoint for core extensions pointing to https://extensions.duckdb.org.

All core and community extensions are cryptographically signed, so installing them through http:// does not pose a security risk. However, some users reported issues with http:// extension installs in environments with firewalls, which the move to https:// should resolve.

Lambda Syntax

Up to DuckDB v1.2, the syntax for defining lambda expressions used the arrow notation x -> x + 1. While this was a nice syntax, it clashed with the JSON extract operator (->) due to operator precedence and led to error messages that some users found difficult to troubleshoot. To work around this, we introduced a new, Python-style lambda syntax in v1.3, lambda x: x + 1.

While DuckDB v1.5 supports both styles of writing lambda expressions, using the deprecated arrow syntax will now throw a warning:

SELECT list_transform([1, 2, 3], x -> x + 1);
WARNING:
Deprecated lambda arrow (->) detected. Please transition to the new lambda syntax, i.e., lambda x, i: x + i, before DuckDB's next release.

You can use the lambda_syntax configuration option to change this behavior to suppress the warning or to behave more strictly:

-- Suppress the warning
SET lambda_syntax = 'ENABLE_SINGLE_ARROW';
-- Turn the deprecation warning into an error
SET lambda_syntax = 'DISABLE_SINGLE_ARROW';

DuckDB 2.0 will disable the single arrow syntax by default and it will only be available if you opt-in explicitly.

Spatial Extension

The spatial extension ships several important changes.

Breaking Change: Flipping of Axis Order

Most functions in spatial operate in Cartesian space and are unaffected by axis order, e.g., whether the X and Y axes represent “longitude” and “latitude” or the other way around. But there are some functions where this matters, and where the assumption, counterintuitively, is that all input geometries use (x = latitude, y = longitude). These are:

  • ST_Distance_Spheroid
  • ST_Perimeter_Spheroid
  • ST_Area_Spheroid
  • ST_Distance_Sphere
  • ST_DWithin_Spheroid

Additionally, ST_Transform also expects the input geometries to be in the axis order defined by the source coordinate reference system, which for e.g. EPSG:4326 is also (x = latitude, y = longitude).

This has been a long-standing source of confusion and numerous issues, as other databases, formats and GIS systems tend to always treat X as “easting”, “left-right” or “longitude”, and Y as “northing”, “up-down” or “latitude”.

We are changing how this currently works in DuckDB to be consistent with how other systems operate, and hopefully cause less confusion for new users in the future. However, to avoid silently breaking existing workflows that have adapted to this quirk (e.g., by using ST_FlipCoordinates), we are rolling out this change gradually via a new geometry_always_xy setting:

  • In DuckDB v1.5, setting geometry_always_xy = true enables the new behavior (x = longitude, y = latitude). Without it, affected functions emit a warning.
  • In DuckDB v2.0, the warning will become an error. Set geometry_always_xy = false to preserve the old behavior.
  • In DuckDB v2.1, geometry_always_xy = true will become the default.

So to summarize, nothing is changing by default in this release, but to avoid being affected by this change in the future, set geometry_always_xy explicitly now. Set it to true to opt into the new behavior, or false to keep the existing one.
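For example, a sketch of opting in today; the coordinates below are illustrative, and with the new behavior points are written as (longitude, latitude):

```sql
-- Opt in to the new axis order now
SET geometry_always_xy = true;

-- With x = longitude and y = latitude, as in most other GIS systems
SELECT ST_Distance_Sphere(
    ST_Point(4.3571, 52.0116),   -- illustrative lon/lat pair
    ST_Point(4.8952, 52.3702)    -- illustrative lon/lat pair
) AS distance_m;
```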

Geometry Rework

GEOMETRY Becomes a Built-In Type

The GEOMETRY type has been moved from the spatial extension into core DuckDB!

Geospatial data is no longer niche. The Parquet standard now treats GEOMETRY as a first-class column type, and open table formats like Apache Iceberg and DuckLake are moving in the same direction. Many widely used data formats and systems also have geospatial counterparts—GeoJSON, PostGIS, GeoPandas, GeoPackage/Spatialite, and more.

DuckDB already offers extensions that integrate with many of these formats and systems. But there’s a structural problem: as long as GEOMETRY lives inside the spatial extension, other extensions that want to read or write geospatial data must either depend on spatial, implement their own incompatible geometry representation, or force users to handle the conversions themselves.

By moving GEOMETRY into DuckDB’s core, extensions can now produce and consume geometry values natively, without depending on spatial. While the spatial extension still provides most of the functions for working with geometries, the type itself becomes a shared foundation that the entire ecosystem can build on. We’ve already added GEOMETRY support to the Postgres scanner and GeoArrow conversion for Arrow import and export. Geometry support in additional extensions is coming soon.

This change also enables deeper integration with DuckDB’s storage engine and query optimizer, unlocking new compression techniques, query optimizations, and CRS awareness capabilities that were not possible when GEOMETRY only existed as an extension type. This is all documented in the new geometry page in the documentation, but we will highlight some below.

Improved Storage: WKB and Shredding

Geometry values are now stored using the industry-standard little-endian Well-Known Binary (WKB) encoding, replacing the custom format used by the spatial extension. However, we are still experimenting with the in-memory representation we want to use in the execution engine, so you should still use the conversion functions (e.g., ST_AsWKT, ST_AsWKB, ST_GeomFromText, ST_GeomFromWKB) when moving data in and out of DuckDB.
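For example, a minimal round trip through the conversion functions named above:

```sql
-- Text in, WKB out: keep explicit conversions at the boundaries
SELECT ST_AsWKB(ST_GeomFromText('POINT (1 2)')) AS wkb;

-- And back from WKB to text
SELECT ST_AsWKT(ST_GeomFromWKB(ST_AsWKB(ST_GeomFromText('POINT (1 2)')))) AS wkt;
```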

We’ve also implemented a new storage technique specialized for GEOMETRY. When a geometry column contains values that all share the same type and vertex dimensions, DuckDB can additionally apply "shredding": rather than storing opaque blobs, the column is decomposed into primitive STRUCT, LIST, and DOUBLE segments that compress far more efficiently. This can reduce on-disk size by roughly 3x for uniform geometry columns such as point clouds. Shredding is applied automatically for uniform row groups of a certain size, but can be configured via the geometry_minimum_shredding_size configuration option.
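If you want to tune when shredding kicks in, the configuration option can be set explicitly. The threshold value below is purely illustrative; consult the geometry documentation for the actual default:

```sql
-- Illustrative threshold value, not the documented default
SET geometry_minimum_shredding_size = 1024;
```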

Geometry Statistics and Query Optimization

Geometry columns now track per-row-group statistics - including the bounding box and the set of geometry types and vertex dimensions present. The query optimizer can use these to skip row groups that cannot match a query's spatial predicates, similar to min/max pruning for numeric columns. The && (bounding box intersection) operator is the first to benefit; broader support across spatial functions is in progress.
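A sketch of a query that can benefit from this pruning, assuming a table geoms with a GEOMETRY column geom and using ST_MakeEnvelope from the spatial extension:

```sql
-- Row groups whose bounding boxes cannot intersect the window are skipped
SELECT count(*)
FROM geoms
WHERE geom && ST_MakeEnvelope(4.0, 52.0, 5.0, 53.0);
```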

Coordinate Reference System Support

The GEOMETRY type now accepts an optional CRS parameter (e.g., GEOMETRY('OGC:CRS84')), making CRS part of the type system rather than implicit metadata. Spatial functions enforce CRS consistency across their inputs, catching a common class of silent errors that arises when mixing geometries from different coordinate systems. Only a couple of CRSs are built in by default, but loading the spatial extension registers over 7,000 CRSs from the EPSG dataset. While CRS support is still a bit experimental, we are planning to develop it further to support e.g., custom CRS definitions.
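For example, declaring the CRS as part of a column's type, using the CRS parameter shown above:

```sql
-- The CRS is now part of the type, not implicit metadata
CREATE TABLE places (
    name VARCHAR,
    geom GEOMETRY('OGC:CRS84')
);
```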

Optimizations

Non-Blocking Checkpointing

During checkpointing, it's now possible to run concurrent reads (#19867), writes (#20052), insertions with indexes (#20160) and deletes (#20286). The rework of checkpointing benefits concurrent RW workloads and increases the TPC-H throughput score on SF100 from 246,115.60 to 287,122.97, a 17% improvement.

Aggregates

Aggregate functions received several optimizations. For example, the last aggregate function was optimized by community member xe-nvdk to iterate from the end of each vector batch instead of the beginning. In synthetic benchmarks, this results in a 40% speedup.

Distribution

Python Pip

You can install the DuckDB CLI on any platform where pip is available:

pip install duckdb-cli

You can then launch DuckDB in your virtual environment using:

duckdb

Both DuckDB v1.4 and v1.5 are supported. We are working on shipping extensions as extras using the duckdb[extension_name] syntax – stay tuned!

Windows Install Script (Beta)

On Windows, you can now use an install script:

powershell -NoExit iex (iwr "https://install.duckdb.org/install.ps1").Content

Please note that this is currently in the beta stage. If you have any feedback, please let us know.

CLI for Linux with musl libc

We are distributing CLI clients that work with musl libc (e.g., for Alpine Linux, commonly used in Docker images). The archives are available on GitHub.

Note that the musl libc CLI client requires libstdc++. To install this package, run:

apk add libstdc++

Extension Sizes

We reworked our build system to make the extension binaries smaller! The DuckLake extension's size was reduced by ~30%, from 17 MB to 12 MB. For smaller extensions such as Excel, the reduction is more than 60%, from 9 MB to 3 MB.

Summary

These were a few highlights – but there are many more features and improvements in this release. There have been over 6500 commits by close to 100 contributors since v1.4. The full release notes can be found on GitHub. We would like to thank our community for providing detailed issue reports and feedback. And again, our special thanks go to external contributors!

PS: If you visited this blog post through a direct link – we also rolled out a new landing page!

Appendix: Example Dataset

See the code that creates the example databases.
ATTACH 'numbers1.db';
ATTACH 'numbers2.db';
ATTACH 'animals.db';

CREATE TABLE numbers1.tbl AS FROM range(1, 3) t(i);

CREATE TABLE numbers2.tbl AS FROM range(2, 6) t(i);

CREATE TABLE animals.ducks AS
FROM (VALUES
    (1, 'Labrador Duck', 1878),
    (2, 'Mallard', NULL),
    (3, 'Crested Shelduck', 1964),
    (4, 'Wood Duck', NULL),
    (5, 'Pink-headed Duck', 1949)
) t(id, name, extinct_year);

CREATE TABLE animals.swans AS
FROM (VALUES
    (1, 'Aurora', 'Mute Swan', 'White', 'European lakes and rivers'),
    (2, 'Midnight', 'Black Swan', 'Black', 'Australian wetlands'),
    (3, 'Tundra', 'Tundra Swan', 'White', 'Arctic and subarctic regions')
) t(id, name, species, color, habitat);

DETACH numbers1;
DETACH numbers2;
DETACH animals;
]]>
The DuckDB team
Announcing DuckDB 1.4.4 LTS2026-01-26T00:00:00+00:002026-01-26T00:00:00+00:00https://duckdb.org/2026/01/26/announcing-duckdb-144In this blog post, we highlight a few important fixes in DuckDB v1.4.4, the fourth patch release in DuckDB's 1.4 LTS line. The release ships bugfixes, performance improvements and security patches. You can find the complete release notes on GitHub.

To install the new version, please visit the installation page.

Fixes

This version ships a number of performance improvements and bugfixes.

Correctness

Crashes and Internal Errors

Performance

Miscellaneous

Conclusion

This post was a short summary of the changes in v1.4.4. As usual, you can find the full release notes on GitHub. We would like to thank our contributors for providing detailed issue reports and patches. In the coming month, we'll release DuckDB v1.5.0. We'll also keep v1.4 LTS updated until mid-September. We'll announce the release date of v1.4.5 in the release calendar in the coming months.

Earlier today, we pushed an incorrect tag that was visible for a few minutes. No binaries or extensions were available under this tag and we replaced it as soon as we noticed the issue. Our apologies for the erroneous release.

]]>
The DuckDB team
Announcing Vortex Support in DuckDB2026-01-23T00:00:00+00:002026-01-23T00:00:00+00:00https://duckdb.org/2026/01/23/duckdb-vortex-extensionI think it is worth starting this intro by talking a little bit about the established format for columnar data. Parquet has done some amazing things for analytics. If you go back to the times when CSV was the main alternative, then you know how important Parquet is. However, even though the specification has evolved over time, Parquet has some design constraints. A particular limitation is that it is block-compressed: engines need to decompress pages before they can do further operations like filtering, decoding values, etc. For a while, researchers and private companies have been working on alternatives to Parquet that could improve on some of these shortcomings. Vortex, from the SpiralDB team, is one of them.

What is Vortex?

Vortex is an extensible, open source format for columnar data. It was created to handle heterogeneous compute patterns and different data modalities. But what does this mean?

The project was donated to the Linux Foundation by the SpiralDB team in August 2025.

Vortex provides different layouts and encodings for different data types. Some of the most notable are ALP for floating-point encoding and FSST for string encoding. This lightweight compression strategy keeps data sizes down while enabling one of Vortex’s most important features: compute functions. By knowing the encoded layout of the data, Vortex is able to run arbitrary expressions on compressed data. This allows a Vortex reader to execute, for example, filter expressions within storage segments without decompressing data.

We mentioned heterogeneous compute to emphasize that Vortex was designed with the idea of having optimized layouts for different data types, including vectors, large text or even image or audio, but also to maximize CPU or GPU saturation. The idea is that decompression is deferred all the way to the GPU or CPU, enabling what Vortex calls “late materialization”. The FastLanes encoding, a project originating at CWI (like DuckDB), is one of the main drivers behind this feature.

Vortex also supports dynamically loaded libraries (similar to DuckDB extensions) to provide new encodings for specific types as well as specific compute functions, e.g. for geospatial data. Another very interesting feature is encoding WebAssembly into the file, which can allow the reader to benefit from specific compute kernels applied to the file.

Besides DuckDB, other engines such as DataFusion, Spark and Arrow already offer integration with Vortex.

For more information, check out the Vortex documentation.

The DuckDB Vortex Extension

DuckDB is a database, as the name says, but it is also widely used as an engine to query many different data sources. Through core or community extensions, DuckDB can integrate with:

  • Databases like Snowflake, BigQuery or PostgreSQL.
  • Lakehouse formats like Delta, Iceberg or DuckLake.
  • File formats, most notably JSON, CSV, Parquet and most recently Vortex.

The community has gotten very creative, though, so these days you can even read YAML and Markdown with DuckDB using community extensions.

All this is possible due to the DuckDB extension system, which makes it relatively easy to implement logic to interact with different file formats or external systems.

The SpiralDB team built a DuckDB extension for Vortex. Together with the DuckDB Labs team, we have made the extension available as a core DuckDB extension, so that the community can enjoy Vortex as a first-class citizen in DuckDB.

Example Usage

Installing and using the Vortex extension is very simple:

INSTALL vortex;
LOAD vortex;

Then, you can easily use it to read and write, similar to other extensions such as Parquet.

SELECT * FROM read_vortex('my.vortex');

COPY (SELECT * FROM generate_series(0, 3) t(i))
TO 'my.vortex' (FORMAT vortex);

Why Vortex and DuckDB?

Vortex claims to do well primarily at three use cases:

  • Traditional SQL analytics: Through late decompression and compute expressions on compressed data, Vortex can filter down data within the storage segment, reducing IO and memory consumption.
  • Machine learning pre-processing pipelines: By supporting a wide variety of encodings for different data types, Vortex claims to be effective at reading and writing data, whether it is audio, text, images or vectors.
  • AI model training: Encodings such as FastLanes allow for very efficient copying of data to the GPU. Vortex aims to be able to copy data directly from S3 object storage to the GPU.

The promise of more efficient IO and memory use through late decompression is a good reason to try DuckDB and Vortex for SQL analytics. On another note, if you are looking at running analytics on unified datasets that are used for multiple use cases, including pre-processing pipelines and AI training, then Vortex may be a good candidate since it is designed to fit all of these use cases well.

Performance Experiment

For those who are number hungry, we decided to run the TPC-H benchmark at scale factor 100 with DuckDB to understand how Vortex performs as a storage format compared to Parquet. We tried to make the benchmark as fair as possible. These are the parameters:

  • Run on Mac M1 with 10 cores & 32 GB of memory.
  • The benchmark runs each query 5 times and the average is used for the final report.
  • The DuckDB connection is closed after each query to keep runs “colder” and to prevent DuckDB's caching (particularly with Parquet) from influencing the results. OS page caching does influence subsequent runs, but we decided to acknowledge this factor and still keep the first run.
  • Each TPC-H table is a single file, which means that lineitem files for Parquet and Vortex are quite large (both around 20 GB). This allows us to ignore the effect of globbing and having many small files.
  • Data files used for the benchmark are generated with tpchgen-rs and are copied out using DuckDB’s Parquet and Vortex extensions.
  • We compared Vortex against Parquet v1 and v2. The v2 specification allows for considerably faster reading than the v1 specification but many writers do not support this, so we thought it was worth including both.

The results are very good: with Vortex, the TPC-H benchmark runs 18% faster than with Parquet v2 and 35% faster than with Parquet v1 (comparing geometric means, which is the recommended approach).
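The quoted percentages follow directly from the geometric means in the summary table below; a quick check in Python:

```python
# Geometric mean runtimes in seconds, taken from the benchmark summary table
gmean = {"parquet_v1": 2.324712, "parquet_v2": 1.839171, "vortex": 1.507675}

def speedup(baseline: float, contender: float) -> float:
    """Fractional runtime reduction of contender relative to baseline."""
    return 1.0 - contender / baseline

print(f"{speedup(gmean['parquet_v2'], gmean['vortex']):.0%}")  # prints 18%
print(f"{speedup(gmean['parquet_v1'], gmean['vortex']):.0%}")  # prints 35%
```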

Another interesting result is the standard deviation across runs. There was a considerable difference between the first (and coldest) run of each query and subsequent runs in Parquet, while Vortex performed very well across all runs with a much smaller standard deviation.


Format Geometric Mean (s) Arithmetic Mean (s) Avg Std Dev (s) Total Time (s)
parquet_v1 2.324712 2.875722 0.145914 63.265881
parquet_v2 1.839171 2.288013 0.182962 50.336281
vortex 1.507675 1.991289 0.078893 43.808349

The times did vary across different runs of the same benchmark, and subsequent runs have yielded similar results but with slight variations. The differences between Parquet v2 and Vortex have always been around 12-18% in geometric means and around 8-14% in total times. Benchmarking is very hard!

Click here to see a more detailed breakdown of the benchmark results.

This figure shows the results per query, including the standard deviation error bar.
The following is the summary of the sizes of the datasets. Note that both Parquet v1 and v2 are using the default compression used by the DuckDB Parquet writer, which is Snappy. In this case, Vortex is not using any general-purpose compression but still keeps the data sizes competitive.

Table parquet_v1 parquet_v2 vortex
customer 1.15 0.99 1.06
lineitem 21.15 16.02 18.14
nation 0.00 0.00 0.00
orders 6.02 4.54 5.03
part 0.59 0.47 0.54
partsupp 4.07 3.33 3.72
region 0.00 0.00 0.00
supplier 0.07 0.06 0.07
total 33.06 25.40 28.57

Conclusion

Vortex is a very interesting alternative to established columnar formats like Parquet. Its focus on lightweight compression encodings, late decompression and running compute expressions on compressed data makes it a compelling option for a wide range of use cases. With regard to DuckDB, we see that Vortex is already very performant for analytical queries, where it is on par with or better than Parquet v2 on the TPC-H benchmark queries.

Vortex has been backwards compatible since version 0.36.0, which was released more than 6 months ago. Vortex is now at version 0.56.0.

]]>
Guillermo Sanchez, SpiralDB Team
DuckDB on LoongArch2026-01-06T00:00:00+00:002026-01-06T00:00:00+00:00https://duckdb.org/2026/01/06/duckdb-on-loongarch-morefineIt’s not every day that a new CPU architecture arrives on your desk. I grew up on the Intel 486 back in the early 90s. I also still remember AMD releasing its 64-bit x86 extension in 2000. Then not a lot happened until Apple released the ARM-based M1 architecture in 2020. But today is the day again (for me), with the long-awaited arrival of the “MOREFINE M700S” in our office.

The M700S contains a Loongson CPU. Also called “LoongArch” or “Godson”, these processors were developed in China based on the (somewhat esoteric) MIPS architecture, as part of the country's push to become technologically self-sufficient under the government-funded Made in China 2025 plan.

It is probably safe to assume that – given the ongoing trade shenanigans – the Loongson will become much more popular in China as time goes on. DuckDB already sees quite a lot of usage from China, so naturally we want to make sure that DuckDB runs well on the Loongson. Thankfully, one of our community members has already opened a pull request with two minimal changes to allow DuckDB to compile. We became curious.

We purchased the M700S on (where else?) AliExpress for around 500 EUR. Besides the Loongson 8-core 3A6000 CPU it contains 16 GB of main memory and a 256 GB solid-state disk.

Once plugged in and booted up, things feel pretty normal besides the loud fan that seems to be always on. On the screen, a variant of Debian called Loongnix boots up. The GUI seems to be KDE-based and comes with a custom browser “LBrowser” which is a fork of Chromium. Just because it was not obvious we document it here: the default root password is M700S. There is also a user account m700s with the same password.

Overall, the software seems a little dated, even after running apt upgrade: the Linux kernel seems to be version 4.19, which was released back in 2018, and which has been EOL for a year now. The GCC version is 8.3, which similarly came out in 2019.

With the aforementioned patch, we managed to compile DuckDB 1.4.3 on Loongnix. There was one small issue where the CMake file append_metadata.cmake was not compatible with the older CMake version (3.13.4) available on Loongnix. But simply replacing that file with an empty one allowed us to complete the build. Of course we could also have updated CMake, but life is short. Once completed, we ran DuckDB’s extensive unit test suite (make allunit) to confirm that our build runs correctly on the Loongson CPU. Results looked good.

For performance comparison, we re-used the methodology from our previous blog post that ran DuckDB on a Raspberry Pi. In short, we run the 22 TPC-H benchmark queries on “Scale Factor” 100 and 300, which in DuckDB format is a 25 GB and 78 GB database file, respectively. We compare those numbers with the nearest computer, which is my day-to-day MacBook Pro with an M3 Max CPU. For fairness, we limit DuckDB to 14 GB of RAM on both platforms. The reported timings are “hot” runs, meaning we re-ran the query set and took the timings from the second run.

Here are the results, and they are not great. We start with aggregated timings:

SF System Geometric mean Sum
SF100 MacBook 0.6 16.9
SF100 MOREFINE 6.1 192.8
SF300 MacBook 2.8 78.8
SF300 MOREFINE 27.3 791.6

We can see that the MacBook is around ten times faster than the MOREFINE on this benchmark, both in the geometric mean of runtimes as well as in the sum. If you are interested in the individual query runtimes, you can find them below.
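The “around ten times” figure is simply the ratio of the aggregated timings above, for instance of the summed runtimes:

```python
# Ratio of MOREFINE to MacBook total runtimes (seconds, from the table above)
sf100_ratio = 192.8 / 16.9   # ~11.4x
sf300_ratio = 791.6 / 78.8   # ~10.0x
print(round(sf100_ratio, 1), round(sf300_ratio, 1))
```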

Click here to see the individual query runtimes.
Q SF100/MacBook SF100/MOREFINE SF300/MacBook SF300/MOREFINE
1 1.247 7.363 4.528 26.475
2 0.117 1.058 0.474 4.101
3 0.697 8.563 2.759 32.432
4 0.570 7.348 2.331 27.185
5 0.631 8.498 3.217 34.462
6 0.180 1.236 1.395 13.225
7 0.620 7.702 3.119 37.411
8 0.640 5.593 3.611 29.914
9 1.906 30.560 6.670 99.884
10 0.923 11.755 4.036 40.412
11 0.102 1.037 0.709 4.444
12 0.535 6.422 2.918 31.501
13 1.847 21.185 6.394 74.081
14 0.408 5.616 3.240 26.613
15 0.252 2.652 1.906 17.454
16 0.273 3.108 0.879 11.480
17 0.805 5.184 4.655 28.469
18 1.538 15.492 7.619 71.845
19 0.779 9.143 4.379 39.111
20 0.441 4.993 3.234 25.967
21 1.996 23.231 9.503 96.452
22 0.441 5.036 1.237 18.709

It is always exciting to get DuckDB running on a new platform. Of course, we have built DuckDB to be ultra-portable and agnostic to hardware environments while still delivering excellent performance. So it was not that surprising that getting DuckDB running on the MOREFINE with its new-ish CPU was not very difficult. However, performance on the standard TPC-H benchmark was not that impressive, with the MacBook being around ten times faster than the MOREFINE.

Of course, there are many opportunities for improvement. For starters, the GCC toolchain on LoongArch is likely far less mature than its x86/ARM counterparts, so advances there could make a big difference. The same applies to IO performance, which we have not measured separately. But hey, the “glass half full” department could also rightfully claim that the Loongson CPU can complete TPC-H SF300!

One could also argue that a MacBook Pro is much more expensive than the 500 EUR MOREFINE. However, a recent M4 Mac Mini with the same memory and storage specs costs around 700 EUR, not that much more all things considered. It will run circles around the MOREFINE. And it will not constantly annoy you with its fan.

]]>
Hannes Mühleisen
Iceberg in the Browser2025-12-16T00:00:00+00:002025-12-16T00:00:00+00:00https://duckdb.org/2025/12/16/iceberg-in-the-browserIn this post, we describe the current patterns for interacting with Iceberg Catalogs, and pose the question: could it be done from a browser? After elaborating on the DuckDB ecosystem changes required to unlock this capability, we demonstrate our approach to interacting with an Iceberg REST Catalog. It's browser-only, no extra setup required.

Interaction Models for Iceberg Catalogs

Iceberg analytics today

Iceberg is an open table format, which allows you to capture a mutable database table as a set of static files on object storage (such as AWS S3). Iceberg catalogs allow you to track and organize Iceberg tables. For example, Iceberg REST Catalogs provide these functionalities through a REST API.

There are two common ways to interact with Iceberg catalogs:

  • The client–server model, where the compute part of the operation is delegated to a managed infrastructure (such as the cloud). Users can interact with the server by installing a local client or using a lightweight client such as a browser.
  • The client-is-the-server model, where the user first installs the relevant libraries, and then performs queries directly on their machine.

Iceberg engines follow these interaction models: they are either run natively in managed compute infrastructure or they are run locally by the user. Let's see how things look with DuckDB in the mix!

Iceberg with DuckDB

Iceberg with DuckDB

DuckDB supports both Iceberg interaction models. In the client–server model, DuckDB runs on the server to read the Iceberg datasets. From the user's point of view, the choice of engine is transparent, and DuckDB is just one of many engines that the server could use in the background. The client-is-the-server model is more interesting: here, users install a DuckDB client locally and use it through its SQL interface to query Iceberg catalogs. For example:

CREATE SECRET test_secret (
    TYPE S3, 
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
);

ATTACH 'warehouse' AS db (
    TYPE ICEBERG,
    ENDPOINT_URL 'https://your-iceberg-endpoint'
);

SELECT sum(value)
FROM db.table
WHERE other_column = 'some_value';

The client-is-the-server model unlocks empowered clients, which can operate directly on the data.

You can discover the full DuckDB-Iceberg extension feature set, including insert and update capabilities, in our earlier blog post.

Iceberg with DuckDB in the Browser

While setting up a local DuckDB installation is quite simple, opening a browser tab is even quicker. Therefore, we asked ourselves: could we support the client-is-the-server model directly from within a browser tab? This could provide a zero-setup, no-infrastructure, properly serverless option for interacting with Iceberg catalogs.

Iceberg with DuckDB-Wasm

Luckily, DuckDB has a client that can run in any browser! DuckDB-Wasm is a WebAssembly port of DuckDB, which supports loading of extensions.

Interacting with an Iceberg REST Catalog requires a number of functionalities: the ability to talk to a REST API over HTTP(S), the ability to read and write Avro and Parquet files on object storage, and finally, the ability to negotiate authentication to access those resources on behalf of the user. All of this must be done from within a browser, without calling any native components.

To support these functionalities, we implemented the following high-level changes:

  • In the core duckdb codebase, we redesigned HTTP interactions, so that extensions and clients have a uniform interface to the networking stack. (PR)
  • In duckdb-wasm, we implemented such an interface, which in this case is a wrapper around the available JavaScript network stack. (PR)
  • In duckdb-iceberg, we routed all networking through the common HTTP interface, so that native DuckDB and DuckDB-Wasm execute the same logic. (PR)

The result is that you can now query Iceberg with DuckDB running directly in a browser! The same Iceberg catalog is now accessible via the client–server model, the client-is-the-server model, or properly serverless, from the isolation of a browser tab!

Welcome to Serverless Iceberg Analytics

Check out our demo of serverless Iceberg analytics using the DuckDB Table Visualizer

The current credentials in the demo are provided via a throwaway account with minimal permissions. If you enter your own credentials and share a link, you will be sharing your credentials.

Access Your Own Data

By substituting your own S3 Tables bucket ARN and credentials (with the AmazonS3TablesReadOnlyAccess policy), you can also access your own catalog, metadata and data. Computations are fully local, and the credentials and warehouse ID are only sent to the catalog endpoint specified in your ATTACH command. Inputs are translated to SQL and added to the hash segment of the URL.

This means that:

  • no sensitive data is handled or sent to duckdb.org
  • computations are local, fully in your browser
  • you can use the familiar SQL interface with the same code snippets that can run everywhere DuckDB runs
  • if you edit the credentials and share the resulting link, you will be sharing the new credentials

As of today, this works with Amazon S3 Tables. This has been implemented through a collaboration with the Amazon S3 Tables team. To learn more about S3 Tables, how to get started and their feature set, you can take a look at their product page or documentation. A demo of DuckDB querying S3 Tables from a browser was presented at AWS re:Invent 2025 – see the presentation.
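An ATTACH along these lines points DuckDB at an S3 Tables catalog; the ARN below is a placeholder, and the exact option names follow the DuckDB-Iceberg documentation for S3 Tables:

```sql
-- Placeholder ARN: substitute your own S3 Tables bucket ARN
ATTACH 'arn:aws:s3tables:us-east-1:111122223333:bucket/my-bucket'
    AS s3_tables_db (
        TYPE iceberg,
        ENDPOINT_TYPE s3_tables
    );
```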

Conclusion

The DuckDB-Iceberg extension is now supported in DuckDB-Wasm and it can read and edit Iceberg REST Catalogs. Users can now access Iceberg data from within a browser, without having to install or manage any compute nodes!

If you would like to provide feedback or file issues, please reach out to us on either the DuckDB-Wasm or DuckDB-Iceberg repository. If you are interested in using any part of this within your organization, feel free to reach out.

]]>
Carlo Piovesan, Tom Ebergen, Gábor Szárnyas
Announcing DuckDB 1.4.3 LTS2025-12-09T00:00:00+00:002025-12-09T00:00:00+00:00https://duckdb.org/2025/12/09/announcing-duckdb-143In this blog post, we highlight a few important fixes in DuckDB v1.4.3, the third patch release in DuckDB's 1.4 LTS line. You can find the complete release notes on GitHub.

To install the new version, please visit the installation page.

Fixes

This version ships a number of performance improvements and bugfixes.

Correctness

Crashes and Internal Errors

Performance

Miscellaneous

Azure Blob Storage Writes

The azure extension can now write to Azure Blob Storage. This unlocks several other Azure and Fabric features, including the use of OneLake instances.

Windows Arm64

With this release, we are introducing beta support for Windows Arm64 by distributing native DuckDB extensions and Python wheels.

Extension Distribution

On Windows Arm64, you can now natively install core extensions, including complex ones like spatial:

duckdb
PRAGMA platform;
┌───────────────┐
│   platform    │
│    varchar    │
├───────────────┤
│ windows_arm64 │
└───────────────┘
INSTALL spatial;
LOAD spatial;
SELECT ST_Area(ST_GeomFromText(
        'POLYGON((0 0, 4 0, 4 3, 0 3, 0 0))'
    )) AS area;
┌────────┐
│  area  │
│ double │
├────────┤
│  12.0  │
└────────┘

Python Wheel Distribution

We now distribute Python wheels for Windows Arm64 for Python 3.11+. This means that you can take, e.g., a Copilot+ PC, install the native Python interpreter, and run:

pip install duckdb

This installs the duckdb package using the binary distributed through PyPI. Then, you can use it as follows:

python
Python 3.13.9
    (tags/v3.13.9:8183fa5, Oct 14 2025, 14:51:39)
    [MSC v.1944 64 bit (ARM64)] on win32

>>> import duckdb
>>> duckdb.__version__
'1.4.3'

Currently, many Python installations that you'll find on Windows Arm64 computers use the x86_64 (AMD64) Python distribution and run through Microsoft's Prism emulator. For example, if you install Python through the Windows Store, you will get the Python AMD64 installation. To understand which platform your Python installation is using, observe the Python CLI's first line (e.g., Python 3.13.9 ... (ARM64)).
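Besides reading the banner, you can also query the interpreter itself; a small sketch:

```python
# Report the architecture of the running Python interpreter.
# On Windows Arm64, a native build reports 'ARM64', while an emulated
# x86_64 build reports 'AMD64'.
import platform

print(platform.machine())
```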

ODBC Driver

We are now shipping a native ODBC driver for Windows Arm64. Head to the ODBC Windows installation page to try it out!

Conclusion

This post was a short summary of the changes in v1.4.3. As usual, you can find the full release notes on GitHub. We would like to thank our contributors for providing detailed issue reports and patches. Stay tuned for DuckDB v1.4.4 and v1.5.0, both scheduled for release early next year!

The DuckDB team
Writes in DuckDB-Iceberg2025-11-28T00:00:00+00:002025-11-28T00:00:00+00:00https://duckdb.org/2025/11/28/iceberg-writes-in-duckdbOver the past several months, the DuckDB Labs team has been hard at work on the DuckDB-Iceberg extension, with full read support and initial write support released in v1.4.0. Today, we are happy to announce that delete and update support for Iceberg v2 tables is available in v1.4.2!

The Iceberg open table format has become extremely popular in the past two years, with many databases announcing support for the format, which was originally developed at Netflix. This past year, the DuckDB team has made Iceberg integration a priority, and today we are happy to announce another step in that direction. In this blog post, we describe the current feature set of DuckDB-Iceberg in DuckDB v1.4.2.

Getting Started

To experiment with the new DuckDB-Iceberg features, you will need to connect to your favorite Iceberg REST Catalog. There are many ways to do so: please have a look at the Connecting to REST Catalogs page for catalogs like Apache Polaris or Lakekeeper, and at the Connecting to S3 Tables page if you would like to connect to Amazon S3 Tables.

ATTACH 'warehouse_name' AS iceberg_catalog (
    TYPE iceberg,
    other options
);

Inserts, Deletes and Updates

Support for creating tables and inserting into tables was already added in DuckDB v1.4.0: you can use standard DuckDB SQL syntax to insert data into your Iceberg table.

CREATE TABLE iceberg_catalog.default.simple_table (
    col1 INTEGER,
    col2 VARCHAR
);
INSERT INTO iceberg_catalog.default.simple_table
    VALUES (1, 'hello'), (2, 'world'), (3, 'duckdb is great');

You can also use any DuckDB table scan function to insert data into an Iceberg table:

INSERT INTO iceberg_catalog.default.more_data
    SELECT * FROM read_parquet('path/to/parquet');

Starting with v1.4.2, the standard SQL syntax also works for deletes and updates:

DELETE FROM iceberg_catalog.default.simple_table
WHERE col1 = 2;

UPDATE iceberg_catalog.default.simple_table
SET col1 = col1 + 5
WHERE col1 = 1;

SELECT *
FROM iceberg_catalog.default.simple_table;
┌───────┬─────────────────┐
│ col1  │      col2       │
│ int32 │     varchar     │
├───────┼─────────────────┤
│     3 │ duckdb is great │
│     6 │ hello           │
└───────┴─────────────────┘

Iceberg write support currently has two limitations:

Write support is limited to tables that are not partitioned and not sorted. Attempting to perform update, insert, or delete operations on partitioned or sorted tables with DuckDB-Iceberg will result in an error.

DuckDB-Iceberg only writes positional deletes for DELETE and UPDATE statements. Copy-on-write functionality is not yet supported.
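To give an intuition for positional deletes: instead of rewriting data files, merge-on-read records the positions of deleted rows in a separate delete file, and readers filter those positions out at scan time. A simplified Python illustration of the idea (not DuckDB-Iceberg code; data files are modeled as plain lists, mirroring the DELETE and UPDATE from the example above):

```python
# Merge-on-read with positional deletes, modeled on plain Python lists.
# A data file is never rewritten; deletes are recorded as row positions.

data_file = [(1, "hello"), (2, "world"), (3, "duckdb is great")]

# DELETE FROM ... WHERE col1 = 2  -> record position 1 in a delete file
positional_deletes = {1}

# An UPDATE is a positional delete of the old row plus an appended new row:
# UPDATE ... SET col1 = col1 + 5 WHERE col1 = 1
positional_deletes.add(0)
data_file_2 = [(6, "hello")]

def scan(files, deletes_per_file):
    """Merge-on-read scan: skip positions listed in the delete files."""
    for file_id, rows in enumerate(files):
        deleted = deletes_per_file.get(file_id, set())
        for pos, row in enumerate(rows):
            if pos not in deleted:
                yield row

rows = list(scan([data_file, data_file_2], {0: positional_deletes}))
print(rows)  # [(3, 'duckdb is great'), (6, 'hello')]
```

This also shows why updates to partitioned or sorted tables are harder: the appended replacement row would have to land in the right partition or sort position, not simply at the end.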

Functions for Table Properties

Currently, DuckDB-Iceberg only supports merge-on-read semantics. Within the Iceberg table metadata, table properties describe which forms of deletes or updates are allowed. DuckDB-Iceberg respects the write.update.mode and write.delete.mode table properties for updates and deletes. If a table has these properties set to anything other than merge-on-read, DuckDB will throw an error and the UPDATE or DELETE will not be committed. v1.4.2 introduces three new functions to add, remove, and view table properties of an Iceberg table:

  • set_iceberg_table_properties
  • iceberg_table_properties
  • remove_iceberg_table_properties

You can use them as follows:

-- to set table properties
CALL set_iceberg_table_properties(iceberg_catalog.default.simple_table, {
    'write.update.mode': 'merge-on-read',
    'write.file.size': '100000kb'
});
-- to read table properties
SELECT * FROM iceberg_table_properties(iceberg_catalog.default.simple_table);
┌───────────────────┬───────────────┐
│        key        │     value     │
│      varchar      │    varchar    │
├───────────────────┼───────────────┤
│ write.update.mode │ merge-on-read │
│ write.file.size   │ 100000kb      │
└───────────────────┴───────────────┘
-- to remove table properties
CALL remove_iceberg_table_properties(
    iceberg_catalog.default.simple_table,
    ['some.other.property']
);

Iceberg Table Metadata

DuckDB-Iceberg also allows you to view the metadata of your Iceberg tables using the iceberg_metadata() and iceberg_snapshots() functions.

SELECT * FROM iceberg_metadata(iceberg_catalog.default.table_1);
┌──────────────────────┬──────────────────────┬──────────────────┬─────────┬──────────────────┬─────────────────────────────────────────────────────────────┬─────────────┬──────────────┐
│    manifest_path     │ manifest_sequence_…  │ manifest_content │ status  │     content      │                         file_path                           │ file_format │ record_count │
│       varchar        │        int64         │     varchar      │ varchar │     varchar      │                          varchar                            │   varchar   │    int64     │
├──────────────────────┼──────────────────────┼──────────────────┼─────────┼──────────────────┼─────────────────────────────────────────────────────────────┼─────────────┼──────────────┤
│ s3://warehouse/def…  │                    1 │ DATA             │ ADDED   │ EXISTING         │ s3://<storage_location>/simple_table/data/019a6ecc-9e9e-7…  │ parquet     │            3 │
│ s3://warehouse/def…  │                    2 │ DELETE           │ ADDED   │ POSITION_DELETES │ s3://<storage_location>/simple_table/data/d65b1db8-9fa8-4…  │ parquet     │            1 │
│ s3://warehouse/def…  │                    3 │ DELETE           │ ADDED   │ POSITION_DELETES │ s3://<storage_location>/simple_table/data/8d1b92dc-5f6e-4…  │ parquet     │            1 │
│ s3://warehouse/def…  │                    3 │ DATA             │ ADDED   │ EXISTING         │ s3://<storage_location>/simple_table/data/019a6ecf-5261-7…  │ parquet     │            1 │
└──────────────────────┴──────────────────────┴──────────────────┴─────────┴──────────────────┴─────────────────────────────────────────────────────────────┴─────────────┴──────────────┘
SELECT * FROM iceberg_snapshots(iceberg_catalog.default.simple_table);
┌─────────────────┬─────────────────────┬─────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ sequence_number │     snapshot_id     │      timestamp_ms       │                                                manifest_list                                                 │
│     uint64      │       uint64        │        timestamp        │                                                   varchar                                                    │
├─────────────────┼─────────────────────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│               1 │ 1790528822676766947 │ 2025-11-10 17:24:55.075 │ s3://<storage_location>/simple_table/data/snap-1790528822676766947-f09658c4-ca52-4305-943f-6a8073529fef.avro │
│               2 │ 6333537230056014119 │ 2025-11-10 17:27:35.602 │ s3://<storage_location>/simple_table/data/snap-6333537230056014119-316d09bc-549d-46bc-ae13-a9fab5cbf09b.avro │
│               3 │ 7452040077415501383 │ 2025-11-10 17:27:52.169 │ s3://<storage_location>/simple_table/data/snap-7452040077415501383-93dee94e-9ec1-45fa-aec2-13ef434e50eb.avro │
└─────────────────┴─────────────────────┴─────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Time Travel

Time travel is also possible via snapshot ids or timestamps using the AT (VERSION => ...) or AT (TIMESTAMP => ...) syntax.

-- via snapshot id
SELECT *
FROM iceberg_catalog.default.simple_table AT (
    VERSION => snapshot_id
);
┌───────┬─────────────────┐
│ col1  │      col2       │
│ int32 │     varchar     │
├───────┼─────────────────┤
│     1 │ hello           │
│     3 │ duckdb is great │
└───────┴─────────────────┘
-- via timestamp
SELECT *
FROM iceberg_catalog.default.simple_table AT (
    TIMESTAMP => '2025-11-10 17:27:45.602'
);
┌───────┬─────────────────┐
│ col1  │      col2       │
│ int32 │     varchar     │
├───────┼─────────────────┤
│     1 │ hello           │
│     3 │ duckdb is great │
└───────┴─────────────────┘

Viewing Requests to the Iceberg REST Catalog

You may also be curious about what requests DuckDB makes to the Iceberg REST Catalog. To see them, enable HTTP logging, run your workload, then select from the HTTP logs.

CALL enable_logging('HTTP');
SELECT * FROM iceberg_catalog.default.simple_table;
SELECT request.type, request.url, response.status
FROM duckdb_logs_parsed('HTTP');
┌─────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────┐
│  type   │                                                                             url                          │       status       │
│ varchar │                                                                           varchar                        │      varchar       │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────┤
│ GET     │ https://<catalog_endpoint>/iceberg/v1/<warehouse>/iceberg-testing/namespaces/default                     │ NULL               │
│ HEAD    │ https://<catalog_endpoint>/iceberg/v1/<warehouse>/iceberg-testing/namespaces/default/tables/simple_table │ NULL               │
│ GET     │ https://<catalog_endpoint>/iceberg/v1/<warehouse>/iceberg-testing/namespaces/default/tables/simple_table │ NULL               │
│ GET     │ https://<storage_endpoint>/data/snap-5943683398986255948-c2217dde-6036-4e07-88f2-…                       │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/f8c95b93-7b6b-4a24-8557-b98b553723d4-m0.avro                             │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/214a7988-da39-4dac-aa3a-4a73d3ead405-m0.avro                             │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/019a7244-c6e8-7bc9-9dd4-7249fcb04959.parquet                             │ PartialContent_206 │
│ GET     │ https://<storage_endpoint>/data/019a7244-fcb5-7308-96ec-1c9e32509eab.parquet                             │ PartialContent_206 │
│ GET     │ https://<storage_endpoint>/data/7f14bb06-f57a-42b4-ba7f-053a65152759-m0.avro                             │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/71f8b43d-51e7-40e7-be88-e8d869836ecd-deletes.parq…                       │ PartialContent_206 │
│ GET     │ https://<storage_endpoint>/data/64f6c6e2-2f54-470e-b990-b201bc615042-m0.avro                             │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/4e54afed-6dd8-4ba0-88fb-16f972ac1d91-deletes.parq…                       │ PartialContent_206 │
├─────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────┤
│ 12 rows                                                                                                                       3 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Here we can see calls to the Iceberg REST Catalog, followed by calls to the storage endpoint. The first three calls to the Iceberg REST Catalog verify that the schema still exists and fetch the latest metadata.json of the Iceberg table. Next, DuckDB queries the manifest list, the manifest files, and eventually the files containing data and deletes. The data and delete files are cached locally to speed up subsequent reads.

Transactions

DuckDB is an ACID-compliant database that supports transactions, and DuckDB-Iceberg has been built with this in mind. Within a transaction, the following conditions hold for Iceberg tables.

  1. The first time a table is read in a transaction, its snapshot information is stored in the transaction and remains consistent within that transaction.
  2. Updates, inserts, and deletes are only committed to an Iceberg table when the transaction is committed (i.e., on COMMIT).

Point #1 is important for read performance. If you wish to do analytics on an Iceberg table and you do not need to get the latest version of the table every time, running your analytics in a transaction will prevent fetching the latest version for every query.

-- truncate the logs
CALL truncate_duckdb_logs();
CALL enable_logging('HTTP');
BEGIN;
-- first read gets latest snapshot information
SELECT * FROM iceberg_catalog.default.simple_table;
-- subsequent read reads from local cached data
SELECT * FROM iceberg_catalog.default.simple_table;
-- get logs
SELECT request.type, request.url, response.status
FROM duckdb_logs_parsed('HTTP');
┌─────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────┐
│  type   │                                                  url                                                        │       status       │
│ varchar │                                                varchar                                                      │      varchar       │
├─────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────┤
│ GET     │ https://<catalog_endpoint>/iceberg/v1/<warehouse>/iceberg-testing/namespaces/default                        │ NULL               │
│ HEAD    │ https://<catalog_endpoint>/iceberg/v1/<warehouse>/iceberg-testing/namespaces/default/tables/simple_table    │ NULL               │
│ GET     │ https://<catalog_endpoint>/iceberg/v1/<warehouse>/iceberg-testing/namespaces/default/tables/simple_table    │ NULL               │
│ GET     │ https://<storage_endpoint>/data/snap-5943683398986255948-c2217dde-6036-4e07-88f2-1…                         │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/f8c95b93-7b6b-4a24-8557-b98b553723d4-m0.avro                                │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/214a7988-da39-4dac-aa3a-4a73d3ead405-m0.avro                                │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/019a7244-c6e8-7bc9-9dd4-7249fcb04959.parquet                                │ PartialContent_206 │
│ GET     │ https://<storage_endpoint>/data/019a7244-fcb5-7308-96ec-1c9e32509eab.parquet                                │ PartialContent_206 │
│ GET     │ https://<storage_endpoint>/data/7f14bb06-f57a-42b4-ba7f-053a65152759-m0.avro                                │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/71f8b43d-51e7-40e7-be88-e8d869836ecd-deletes.parquet                        │ PartialContent_206 │
│ GET     │ https://<storage_endpoint>/data/64f6c6e2-2f54-470e-b990-b201bc615042-m0.avro                                │ OK_200             │
│ GET     │ https://<storage_endpoint>/data/4e54afed-6dd8-4ba0-88fb-16f972ac1d91-deletes.parquet                        │ PartialContent_206 │
├─────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────┤
│ 12 rows                                                                                                                          3 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Here we see all the same requests we saw in the previous section. However, now we are in a transaction, which means the second time we read from iceberg_catalog.default.simple_table, we do not need to query the REST Catalog for table updates. This means DuckDB-Iceberg performs no extra requests when reading a table a second time, significantly improving performance.
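This snapshot-pinning behaviour can be modeled as a per-transaction cache: the catalog is consulted once per table, and subsequent reads reuse the pinned snapshot. A conceptual Python sketch (not the extension's actual implementation; the snapshot id is taken from the iceberg_snapshots() output above):

```python
# Conceptual model of snapshot pinning inside a transaction: the REST
# catalog is consulted only on the first read of each table.
class IcebergTransaction:
    def __init__(self, catalog):
        self.catalog = catalog      # table name -> latest snapshot id
        self.pinned = {}            # snapshots pinned by this transaction
        self.requests = 0           # simulated catalog round-trips

    def read(self, table):
        if table not in self.pinned:
            self.requests += 1      # fetch the latest snapshot once
            self.pinned[table] = self.catalog[table]
        return self.pinned[table]   # later reads reuse the pinned snapshot

catalog = {"simple_table": 7452040077415501383}
txn = IcebergTransaction(catalog)
txn.read("simple_table")
txn.read("simple_table")            # no extra catalog request
assert txn.requests == 1
```

The trade-off is the usual one for snapshot isolation: reads are fast and consistent, but changes committed by other writers only become visible to a new transaction.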

Conclusion and Future Work

With these features, DuckDB-Iceberg now has strong base support for Iceberg tables, which enables users to unlock the analytical power of DuckDB on their Iceberg data. There is still more work to come, and the Iceberg table specification has many more features the DuckDB team would like to support in DuckDB-Iceberg. If any feature is a priority for your analytical workloads, please reach out to us in the DuckDB-Iceberg GitHub repository or get in touch with our engineers.

Below is a list of improvements planned for the near future (in no particular order):

  • Performance improvements
  • Updates / deletes / inserts to partitioned tables
  • Updates / deletes / inserts to sorted tables
  • Schema evolution
  • Support for Iceberg v3 tables, focusing on binary deletion vectors and row lineage tracking
{"twitter" => "the_Tmonster", "picture" => "/images/blog/authors/tom_ebergen.jpg"}
Data-at-Rest Encryption in DuckDB2025-11-19T00:00:00+00:002025-11-19T00:00:00+00:00https://duckdb.org/2025/11/19/encryption-in-duckdb

If you would like to use encryption in DuckDB, we recommend using the latest stable version, v1.4.2. For more details, see the latest release blog post.

Many years ago, we read the excellent “Code Book” by Simon Singh. Did you know that Mary, Queen of Scots, used an encryption method harking back to Julius Caesar to encrypt her more saucy letters? But alas: the cipher was broken and the contents of the letters got her executed.

These days, strong encryption software and hardware is a commodity. Modern CPUs come with specialized cryptography instructions, and operating systems small and big contain mostly-robust cryptography software like OpenSSL.

Databases store arbitrary information, and it is clear that many if not most datasets of any value should not be plainly available to everyone. Even if stored on tightly controlled hardware like a cloud virtual machine, there have been many cases of files being lost through various privilege escalations. Unsurprisingly, compliance frameworks like the common SOC 2 “highly recommend” encrypting data when it is stored on media like hard drives.

However, database systems and encryption have a somewhat problematic track record. Even PostgreSQL, the self-proclaimed “World's Most Advanced Open Source Relational Database”, has very limited options for data encryption. SQLite, the world's “Most Widely Deployed and Used Database Engine”, does not support data encryption out of the box: its encryption extension is a $2000 add-on.

DuckDB has supported Parquet Modular Encryption for a while. This feature allows reading and writing Parquet files with encrypted columns. However, while Parquet files are great and reports of their impending death are greatly exaggerated, they cannot – for example – be updated in place, a pretty basic feature of a database management system.

Starting with DuckDB 1.4.0, DuckDB supports transparent data encryption of data-at-rest using industry-standard AES encryption.

DuckDB's encryption does not yet meet the official NIST requirements. Please follow issue #20162 “Store and verify tag for canary encryption” to track our progress towards NIST-compliance.

Some Basics of Encryption

There are many different ways to encrypt data, some more secure than others. In database systems and elsewhere, the standard is the Advanced Encryption Standard (AES), which is a block cipher algorithm standardized by US NIST. AES is a symmetric encryption algorithm, meaning that the same key is used for both encryption and decryption of data.

For this reason, most systems choose to only support randomized encryption, meaning that identical plaintexts will always yield different ciphertexts (if used correctly!). The most commonly used industry standard and recommended encryption algorithm is AES – Galois Counter Mode (AES-GCM). This is because on top of its ability to randomize encryption, it also authenticates data by calculating a tag to ensure data has not been tampered with.

DuckDB v1.4 supports encryption at rest using the AES-GCM-256 and AES-CTR-256 (counter mode) ciphers. AES-CTR is a simpler and faster variant of AES-GCM, but less secure, since it does not provide authentication by calculating a tag. The 256 refers to the size of the key in bits, meaning that DuckDB currently only supports 32-byte keys.

GCM and CTR both require as input (1) a plaintext, (2) an initialization vector (IV), and (3) an encryption key. The plaintext is the data that a user wants to encrypt. The IV is a unique bytestream, usually 16 bytes, that ensures that identical plaintexts get encrypted into different ciphertexts. A number used once (nonce) is a bytestream of usually 12 bytes that, together with a 4-byte counter, constructs the IV. Note that the IV needs to be unique for every encrypted block, but it does not necessarily have to be random. Reusing the same IV is problematic, since an attacker could XOR the two ciphertexts and extract both messages. The tag in AES-GCM is calculated after all blocks are encrypted, much like a checksum, but it adds an integrity check that securely authenticates the entire ciphertext.
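The IV-reuse pitfall is easy to demonstrate with the keystream model of a counter-mode cipher: if two messages are encrypted under the same key and IV, XOR-ing the ciphertexts cancels the keystream and leaks the XOR of the plaintexts. A self-contained sketch (random bytes stand in for the AES keystream; this is an illustration, not real AES):

```python
import os
import struct

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A 16-byte IV built from a 12-byte nonce and a 4-byte block counter.
nonce = os.urandom(12)
iv = nonce + struct.pack(">I", 0)
assert len(iv) == 16

# In CTR mode (and GCM), encryption is plaintext XOR keystream, where the
# keystream is derived from (key, IV). Reusing the same (key, IV) pair
# therefore reuses the keystream (modeled here with random bytes):
keystream = os.urandom(16)
p1 = b"attack at dawn!!"
p2 = b"retreat at nine!"
c1, c2 = xor(p1, keystream), xor(p2, keystream)

# The keystream cancels out: an attacker learns p1 XOR p2 without the key.
assert xor(c1, c2) == xor(p1, p2)
```

With known or guessable structure in either plaintext, p1 XOR p2 is usually enough to recover both messages, which is why IV uniqueness is non-negotiable.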

Implementation in DuckDB

Before diving deeper into how we actually implemented encryption in DuckDB, we’ll explain some things about the DuckDB file format.

DuckDB has one main database header, which stores data that enables it to correctly load and verify a DuckDB database. At the start of the main database header, the magic bytes (“DUCKDB”) are stored and read upon initialization to verify whether the file is a valid DuckDB database file. The magic bytes are followed by four 8-byte flags that can be set for different purposes.

When a database is encrypted in DuckDB, the main database header remains plaintext at all times, since it contains no sensitive data about the contents of the database file. Upon initializing an encrypted database, DuckDB sets the first bit in the first flag to indicate that the database is encrypted. After setting this bit, additional metadata necessary for encryption is stored. This metadata comprises (1) the database identifier, (2) 8 bytes of additional metadata describing, e.g., the encryption cipher used, and (3) the encrypted canary.

The database identifier is used as a “salt” and consists of 16 randomly generated bytes created upon initialization of each database. The salt is often used to ensure uniqueness, i.e., it makes sure that identical input keys or passwords are transformed into different derived keys. The 8 bytes of metadata comprise the key derivation function (first byte), the usage of additional authenticated data (second byte), the encryption cipher (third byte), and the key length (fifth byte). After the metadata, the main header uses the encrypted canary to check whether the input key is correct.

Encryption Key Management

To encrypt data in DuckDB, you can use practically any plaintext or base64-encoded string, but we recommend using a secure 32-byte base64 key. Users themselves are responsible for key management and thus for using a secure key. Instead of directly using the plain key provided by the user, DuckDB always derives a more secure key by means of a key derivation function (KDF), which reduces or extends the input key to a 32-byte secure key. Once the correctness of the input key has been verified by deriving the secure key and decrypting the canary, the derived key is managed in a secure encryption key cache. This cache manages encryption keys for the current DuckDB context and ensures that the derived encryption keys are never swapped to disk by locking their memory. To strengthen security even more, the original input keys are immediately wiped from memory once they are transformed into secure derived keys.
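As a rough model of this flow: a KDF turns an input key of any length into a fixed 32-byte key, salted so that the same passphrase yields different derived keys in different databases. The sketch below uses PBKDF2 from the Python standard library purely as a stand-in KDF (DuckDB's actual KDF is internal and may differ); the salt plays the role of the 16-byte database identifier:

```python
import hashlib
import os

salt = os.urandom(16)       # plays the role of the database identifier
user_key = b"asdf"          # whatever the user passed as ENCRYPTION_KEY

# Derive a fixed 32-byte secure key from an input key of any length.
derived = hashlib.pbkdf2_hmac("sha256", user_key, salt, 100_000)
assert len(derived) == 32

# The same input key + salt always yields the same derived key ...
assert derived == hashlib.pbkdf2_hmac("sha256", user_key, salt, 100_000)
# ... while a different salt yields a different key for the same input.
assert derived != hashlib.pbkdf2_hmac("sha256", user_key, os.urandom(16), 100_000)
```

The salting step is what makes identical passphrases across databases non-interchangeable, as described above.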

DuckDB Block Structure

After the main database header, DuckDB stores two 4 KB database headers that contain more information about, e.g., the block (header) size and the storage version used. Apart from the main database header, which remains plaintext, all headers and blocks are encrypted when encryption is used.

Blocks in DuckDB are by default 256KB, but their size is configurable. At the start of each plaintext block there is an 8-byte block header, which stores an 8-byte checksum. The checksum is a simple calculation that is often used in database systems to check for any corrupted data.

Plaintext block

For encrypted blocks, however, the block header consists of 40 bytes instead of 8. The block header for encrypted blocks contains a 16-byte nonce/IV and, optionally, a 16-byte tag, depending on which encryption cipher is used. The nonce and tag are stored in plaintext, but the checksum is encrypted for better security. Note that the block header always needs to be 8-byte aligned to calculate the checksum.

Encrypted block
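The 40-byte encrypted block header can be sketched with struct as follows (the sizes come from the text above; the exact on-disk field order is an assumption for illustration):

```python
import os
import struct

NONCE_LEN, TAG_LEN, CHECKSUM_LEN = 16, 16, 8

def pack_encrypted_block_header(nonce, tag, encrypted_checksum):
    """16-byte nonce/IV + 16-byte tag + 8-byte encrypted checksum = 40 bytes.

    Field order is illustrative; DuckDB's actual layout may differ.
    """
    return struct.pack(f"{NONCE_LEN}s{TAG_LEN}s{CHECKSUM_LEN}s",
                       nonce, tag, encrypted_checksum)

header = pack_encrypted_block_header(os.urandom(16), os.urandom(16), os.urandom(8))
assert len(header) == 40    # vs. the 8-byte plaintext block header
```

Note that the 8-byte checksum field keeps the header 8-byte aligned, as required for the checksum calculation.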

Write-Ahead-Log Encryption

The write-ahead log (WAL) in database systems is a crash-recovery mechanism that ensures durability. It is an append-only file that is used when the database crashes or is abruptly closed before all changes have been written to the main database file. The WAL makes sure these changes can be replayed up to the last checkpoint, which is a consistent snapshot of the database at a certain point in time. When a checkpoint is enforced, which happens in DuckDB by either (1) closing the database or (2) reaching a certain storage threshold, the WAL gets written into the main database file.

In DuckDB, you can force the creation of a WAL by setting

PRAGMA disable_checkpoint_on_shutdown;
PRAGMA wal_autocheckpoint = '1TB';

This disables checkpointing on closing the database, meaning that the WAL does not get merged into the main database file. In addition, setting wal_autocheckpoint to a high threshold avoids intermediate checkpoints, so the WAL will persist. For example, we can create a persistent WAL file by first setting the above PRAGMAs, then attaching an encrypted database, and then creating a table into which we insert 3 values.

ATTACH 'encrypted.db' AS enc (
    ENCRYPTION_KEY 'asdf',
    ENCRYPTION_CIPHER 'GCM'
);
CREATE TABLE enc.test (a INTEGER, b INTEGER);
INSERT INTO enc.test VALUES (11, 22), (13, 22), (12, 21);

If we now close the DuckDB process, we can see that a .wal file appears: encrypted.db.wal. But how is the WAL created internally?

Before new entries (inserts, updates, deletes) are written to the database, they are logged and appended to the WAL. Only after the logged entries are flushed to disk is a transaction considered committed. A plaintext WAL entry has the following structure:

Plaintext WAL entry

Since the WAL is append-only, we encrypt the WAL one entry at a time. For AES-GCM this means that we append a nonce and a tag to each entry, in the structure depicted below. When we serialize an encrypted entry to the encrypted WAL, we first store the length in plaintext, because we need to know how many bytes to decrypt. The length is followed by a nonce, which in turn is followed by the encrypted checksum and the encrypted entry itself. After the entry, a 16-byte tag is stored for verification.

Encrypted WAL entry

Encrypting the WAL is triggered by default when an encryption key is given for any (un)encrypted database.
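The encrypted WAL entry layout described above can be sketched as a byte layout (the nonce, tag, and checksum sizes come from the text; the width of the plaintext length field is an assumption, shown here as 8 bytes):

```python
import os
import struct

def serialize_encrypted_wal_entry(nonce, enc_checksum, enc_entry, tag):
    """Plaintext length, then nonce, encrypted checksum, encrypted entry, tag.

    Sketch of the layout only; DuckDB's actual serialization may differ.
    """
    assert len(nonce) == 16 and len(tag) == 16 and len(enc_checksum) == 8
    length = len(enc_checksum) + len(enc_entry)     # bytes to decrypt
    return struct.pack("<Q", length) + nonce + enc_checksum + enc_entry + tag

entry = serialize_encrypted_wal_entry(
    nonce=os.urandom(16),
    enc_checksum=os.urandom(8),
    enc_entry=b"encrypted payload",     # 17-byte stand-in ciphertext
    tag=os.urandom(16),
)
# 8 (length) + 16 (nonce) + 8 (checksum) + 17 (payload) + 16 (tag) = 65
assert len(entry) == 65
```

Keeping the length in plaintext is what lets the reader know how many of the following bytes belong to the ciphertext before any decryption happens.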

Temporary File Encryption

Temporary files are used to store intermediate data that is often necessary for large, out-of-core operations such as sorting, large joins and window functions. This data could contain sensitive information and can, in case of a crash, remain on disk. To protect this leftover data, DuckDB automatically encrypts temporary files too.

The Structure of Temporary Files

There are three different types of temporary files in DuckDB: (1) temporary files that have the same layout as a regular 256 KB block, (2) compressed temporary files, and (3) temporary files that exceed the standard 256 KB block size. The former two carry the suffix .tmp, while the latter is distinguished by the suffix .block. To keep track of the size of .block temporary files, they are always prefixed with their length. As opposed to regular database blocks, temporary files do not contain a checksum to check for data corruption, since calculating a checksum is somewhat expensive.
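The length prefix on .block temporary files can be sketched as a generic length-prefixed file layout (the 8-byte prefix width and file name are assumptions for illustration, not DuckDB's exact on-disk format):

```python
import os
import struct
import tempfile

def write_block_file(path, payload: bytes):
    """Write an oversized temporary block, prefixed with its length."""
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(payload)))    # 8-byte length prefix
        f.write(payload)

def read_block_file(path) -> bytes:
    """Read the length prefix, then exactly that many payload bytes."""
    with open(path, "rb") as f:
        (length,) = struct.unpack("<Q", f.read(8))
        return f.read(length)

path = os.path.join(tempfile.gettempdir(), "duckdb_demo.block")
write_block_file(path, b"x" * 300_000)              # larger than a 256 KB block
assert len(read_block_file(path)) == 300_000
os.remove(path)
```

Because the payload size varies per file, the prefix is what tells the reader how many bytes to consume, whereas fixed-size .tmp blocks need no such prefix.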

Encrypting Temporary Files

Temporary files are encrypted (1) automatically when you attach an encrypted database or (2) when you use the setting SET temp_file_encryption = true. In the latter case, the main database file is plaintext, but the temporary files will be encrypted. For the encryption of temporary files, DuckDB internally generates temporary keys. This means that when the database crashes, the temporary keys are lost: temporary files cannot be decrypted in this case and are essentially garbage.

To force DuckDB to produce temporary files, you can use a simple trick: just set the memory limit low. Temporary files are created once the memory limit is exceeded. For example, we can create a new encrypted database, load it with TPC-H data (SF 1), and then set the memory limit to 1 GB. If we then perform a large join, we force DuckDB to spill intermediate data to disk. For example:

SET memory_limit = '1GB';
ATTACH 'tpch_encrypted.db' AS enc (
    ENCRYPTION_KEY 'asdf',
    ENCRYPTION_CIPHER 'cipher'
);
USE enc;
CALL dbgen(sf = 1);

ALTER TABLE lineitem
    RENAME TO lineitem1;
CREATE TABLE lineitem2 AS
    FROM lineitem1;
CREATE OR REPLACE TABLE ans AS
    SELECT l1.* , l2.*
    FROM lineitem1 l1
    JOIN lineitem2 l2 USING (l_orderkey , l_linenumber);

This sequence of commands will result in encrypted temporary files being written to disk. Once the query completes or when the DuckDB shell is exited, the temporary files are automatically cleaned up. In case of a crash however, it may happen that temporary files will be left on disk and need to be cleaned up manually.

How to Use Encryption in DuckDB

In DuckDB, you can (1) encrypt an existing database, (2) initialize a new, empty encrypted database or (3) reencrypt a database. For example, let's create a new database, load this database with TPC-H data of scale factor 1 and then encrypt this database.

INSTALL tpch;
LOAD tpch;
ATTACH 'encrypted.duckdb' AS encrypted (ENCRYPTION_KEY 'asdf');
ATTACH 'unencrypted.duckdb' AS unencrypted;
USE unencrypted;
CALL dbgen(sf = 1);
COPY FROM DATABASE unencrypted TO encrypted;

There is no trivial way to prove that a database is encrypted, but correctly encrypted data should look like random noise and have high entropy. So, to check whether a database is actually encrypted, we can use tools that calculate the entropy or visualize the binary, such as ent and binocle.

Running ent on the database created by the SQL above, i.e., ent encrypted.duckdb, results in an entropy of 7.99999 bits per byte. Doing the same for the plaintext (unencrypted) database results in 7.65876 bits per byte. Note that the plaintext database also has high entropy, but this is due to compression.
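The bits-per-byte figure that ent reports is the Shannon entropy of the file's byte histogram; a minimal standard-library version:

```python
import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    counts = Counter(data)
    n = len(data)
    return sum(c / n * math.log2(n / c) for c in counts.values())

print(entropy_bits_per_byte(bytes(256)))         # all-zero bytes -> 0.0
print(entropy_bits_per_byte(bytes(range(256))))  # uniform bytes  -> 8.0
```

A well-encrypted file approaches the 8.0 maximum; compressed-but-plaintext data sits high as well, which is why the plaintext DuckDB file above still scores 7.65876.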

Let’s now visualize both the plaintext and encrypted data with binocle. For the visualization we created both a plaintext DuckDB database with scale factor of 0.001 of TPC-H data and an encrypted one:

Entropy visualization of a plaintext database
Entropy visualization of an encrypted database

In these figures, we can clearly observe that the encrypted database file seems completely random, while the plaintext database file shows some clear structure in its binary data.

To decrypt an encrypted database, we can use the following SQL:

ATTACH 'encrypted.duckdb' AS encrypted (ENCRYPTION_KEY 'asdf');
ATTACH 'new_unencrypted.duckdb' AS unencrypted;
COPY FROM DATABASE encrypted TO unencrypted;

And to re-encrypt an existing database, we can simply copy the old encrypted database to a new one:

ATTACH 'encrypted.duckdb' AS encrypted (ENCRYPTION_KEY 'asdf');
ATTACH 'new_encrypted.duckdb' AS new_encrypted (ENCRYPTION_KEY 'xxxx');
COPY FROM DATABASE encrypted TO new_encrypted;

The default encryption algorithm is AES GCM. This is the recommended option, since it also authenticates the data by calculating a tag. Depending on your use case, you can also use AES CTR, which is faster than AES GCM since it skips calculating the tag after encrypting the data. You can specify the CTR cipher as follows:

ATTACH 'encrypted.duckdb' AS encrypted (
    ENCRYPTION_KEY 'asdf',
    ENCRYPTION_CIPHER 'CTR'
);

To keep track of which databases are encrypted, you can run:

FROM duckdb_databases();

This will show which databases are encrypted, and which cipher is used:

| database_name | database_oid | path               | encrypted | cipher |
|---------------|--------------|--------------------|-----------|--------|
| encrypted     | 2103         | encrypted.duckdb   | true      | GCM    |
| unencrypted   | 2050         | unencrypted.duckdb | false     | NULL   |
| memory        | 592          | NULL               | false     | NULL   |
| system        | 0            | NULL               | false     | NULL   |
| temp          | 1995         | NULL               | false     | NULL   |

5 rows — 10 columns (5 shown)
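If you only need the encrypted catalogs, the same function can be filtered; for example, to list each encrypted database together with its cipher:

```sql
SELECT database_name, cipher
FROM duckdb_databases()
WHERE encrypted;
```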

Implementation and Performance

Here at DuckDB, we strive for a good out-of-the-box experience with zero external dependencies and a small footprint. Encryption and decryption, however, are usually handled by rather heavyweight external libraries such as OpenSSL. We would much prefer not to rely on external libraries or statically link huge codebases just so that people can use encryption in DuckDB without additional steps. This is why we actually implemented encryption twice in DuckDB: once with the (excellent) Mbed TLS library and once with the ubiquitous OpenSSL library.

DuckDB already shipped parts of Mbed TLS because we use it to verify RSA extension signatures. However, for maximum compatibility we disabled Mbed TLS's hardware acceleration, which has a performance impact. Furthermore, Mbed TLS is not particularly hardened against side channels such as timing attacks. OpenSSL, on the other hand, contains heavily vetted, hardware-accelerated code for AES operations, which is why we can also use it for encryption.

In DuckDB Land, OpenSSL is part of the httpfs extension. Once you load that extension, encryption automatically switches to using OpenSSL. After we shipped encryption in DuckDB 1.4.0, security experts found issues with the random number generator we used in Mbed TLS mode. Even though this would be difficult to actually exploit, we disabled writing to encrypted databases in Mbed TLS mode as of DuckDB 1.4.1. Instead, DuckDB now (version 1.4.2+) tries to auto-install and auto-load the httpfs extension whenever a write is attempted. We might revisit this in the future, but for now this seems the safest path forward that still allows high compatibility for reading. In OpenSSL mode, we always used a cryptographically secure random number generator, so that mode is unaffected.
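If you want to be sure the OpenSSL implementation is used from the very first operation, you can load httpfs explicitly before attaching an encrypted database, rather than relying on the auto-loading behavior:

```sql
INSTALL httpfs;
LOAD httpfs; -- encryption now uses OpenSSL
ATTACH 'encrypted.duckdb' AS encrypted (ENCRYPTION_KEY 'asdf');
```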

Encrypting and decrypting database files is an additional step in writing tables to disk, so we would naturally assume that there is some performance impact. Let’s investigate the performance impact of DuckDB’s new encryption feature with a very basic experiment.

We first create two DuckDB database files, one encrypted and one unencrypted. We use the TPC-H benchmark generator again to create the table data, particularly the (somewhat tired) lineitem table.

INSTALL httpfs;
INSTALL tpch;
LOAD tpch;

ATTACH 'unencrypted.duckdb' AS unencrypted;
CALL dbgen(sf = 10, catalog = 'unencrypted');

ATTACH 'encrypted.duckdb' AS encrypted (ENCRYPTION_KEY 'asdf');
CREATE TABLE encrypted.lineitem AS FROM unencrypted.lineitem;

Now we use DuckDB’s neat SUMMARIZE command three times: once on the unencrypted database, once on the encrypted database using Mbed TLS, and once on the encrypted database using OpenSSL. We set a very low memory limit to force more reading from and writing to disk.

SET memory_limit = '200MB';
.timer on

SUMMARIZE unencrypted.lineitem;
SUMMARIZE encrypted.lineitem;

LOAD httpfs; -- use OpenSSL
SUMMARIZE encrypted.lineitem;

Here are the results on a fairly recent MacBook: SUMMARIZE on the unencrypted table took ca. 5.4 seconds. Using Mbed TLS, this went up to around 6.2 s. However, when enabling OpenSSL, the end-to-end time went straight back to 5.4 s. How is this possible? Is decryption not expensive? Well, first, there is a lot more happening in query processing than reading blocks from storage, so the impact of decryption is not all that large, even with a slow implementation. Second, with hardware acceleration in OpenSSL, the overall overhead of encryption and decryption becomes almost negligible.

But just running summarization is overly simplistic. Real™ database workloads include modifications to data: insertions of new rows, updates, deletions, etc. Also, multiple clients will be updating and querying at the same time. So we resurrected the full TPC-H “Power” test from our previous blog post “Changing Data with Confidence and ACID” and slightly tweaked the benchmark script to enable the new database encryption. For this experiment, we used the OpenSSL encryption implementation due to the issues outlined above. We report “Power@Size” and “Throughput@Size”: the former measures raw sequential query performance, while the latter measures multiple parallel query streams in the presence of updates.

When running on the same MacBook with DuckDB 1.4.1 and a “scale factor” of 100, we get a Power@Size metric of 624,296 and a Throughput@Size metric of 450,409 without encryption.

When we enable encryption, the results are almost unchanged, confirming the observation from the small microbenchmark above. However, the relationship between available memory and the benchmark size means that we’re not stressing temporary file encryption. So we re-ran everything with an 8 GB memory limit. We confirmed constant reading and writing to and from disk in this configuration by observing operating system statistics. For the unencrypted case, the Power@Size metric predictably went down to 591,841 and Throughput@Size went down to 153,690. With encryption enabled, we observed a slight performance decrease, with a Power@Size of 571,985 and a Throughput@Size of 145,353. That difference is not very large either, and likely not relevant in real operational scenarios.
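For context, the relative slowdowns implied by these numbers are small; computing them from the figures above:

```sql
SELECT
    round((591841 - 571985) / 591841.0 * 100, 1) AS power_drop_pct,      -- 3.4
    round((153690 - 145353) / 153690.0 * 100, 1) AS throughput_drop_pct; -- 5.4
```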

Conclusion

With the new encrypted database feature, we can now safely pass around DuckDB database files with all information inside them completely opaque to prying eyes. This allows for some interesting new deployment models for DuckDB. For example, we could now put an encrypted DuckDB database file on a Content Delivery Network (CDN), and a fleet of DuckDB instances could attach to this file read-only using the decryption key. This elegantly allows efficient distribution of private background data, similarly to encrypted Parquet files, but of course with many more features, such as multi-table storage. Encrypted storage also simplifies threat modeling when, for example, using DuckDB on cloud providers. While in the past access to DuckDB storage would have been enough to leak data, we can now relax the paranoia regarding storage a little, especially since temporary files and the WAL are also encrypted. And the best part of all this: there is almost no performance overhead to using encryption in DuckDB, especially with the OpenSSL implementation.
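A sketch of that CDN deployment model (the URL here is hypothetical, and attaching files over HTTPS requires the httpfs extension, which also provides the OpenSSL encryption implementation):

```sql
INSTALL httpfs;
LOAD httpfs;

-- Read-only attach over HTTPS using the shared decryption key.
ATTACH 'https://cdn.example.com/shared.duckdb' AS shared (
    READ_ONLY,
    ENCRYPTION_KEY 'shared-secret'
);
-- Query the private background data as usual,
-- assuming the shared database contains a lineitem table:
FROM shared.lineitem LIMIT 10;
```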

We are very much looking forward to what you are going to do with this feature, and please let us know if you run into any issues.

]]>
Lotte Felius, Hannes Mühleisen