Editor’s note: This post was originally written by Sylvain Lesage, one of the primary contributors to HighTable, whose work we’ve been sponsoring as part of our broader effort to plan for the data-scale problems that are emerging with LLMs. We’re republishing it here with his permission because it captures a core technical challenge we’ve been working through—how to scroll through billions of rows in the browser.
Stay tuned for a future post on what this innovation means for Hyperparam users.
TL;DR: In this post, I present five techniques related to vertical scrolling used in <HighTable>, a React component that can display billions of rows in a table while keeping good performance and accessibility.
It’s a long post, which reflects the complexity of rendering billions of rows in a table, and the amount of work we put into building the React component.
Showing data in a table is one of the first exercises you’ll find in HTML 101 courses.
<table>
<thead>
<tr><th>Name</th><th>Age</th></tr>
</thead>
<tbody>
<tr><td>Alice</td><td>64</td></tr>
<tr><td>Bob</td><td>37</td></tr>
</tbody>
</table>
But, as often in data science, what works for simple cases breaks when the size increases.
In this post, I’ll showcase five techniques we use to solve challenges related to vertical scrolling in the <HighTable> React component to handle billions of rows.
The component also provides features for columns (sort, hide, resize), rows (select), cells (keyboard navigation, pointer interactions, custom rendering). Feel free to ask and look at the code if you’re interested in knowing more.
The <HighTable> component is developed at hyparam/hightable. It was created by Kenny Daniel for Hyperparam, and I’ve had the chance to contribute to its development for one year now.
This blog post was sponsored by Hyperparam. Thanks for the support and for challenging me to solve the fascinating problem of rendering billions of rows in the browser!
Try the hightable demo:
HighTable is also used in the Parquet viewer, on source.coop and in Hyperparam:
Before diving into the techniques, let’s describe how scrolling works using a standard HTML table.
The HTML structure is composed of a scrollable container, that we call the viewport, and a table element inside it:
<div class="viewport" style="overflow-y: auto;">
<table class="table">
...
</table>
</div>
In this structure, the viewport is a div with a fixed height, and the CSS property overflow-y: auto enables a vertical scrollbar when the table is taller than the viewport.
In the following widget, scroll the left box up and down to see how the right box mimics the scrolling effect.
If you use a keyboard, you can focus the left box with Tab, and scroll with the arrow keys ⏶ and ⏷. Otherwise, you can use the mouse wheel, drag the scroll bar, or slide on a touch screen.
The component is delimited by its fixed-size viewport (blue border). The table (golden border) is rendered inside the container. As its height is larger than the viewport's, only part of the table is visible, and a vertical scrollbar lets the user change the visible part. The inner table element moves up and down within the viewport, creating the scrolling effect.
On the right side, we mimic the scrolling effect, showing the position of the table relative to the viewport.
Let’s settle some definitions and formulas that will be useful later:
In this post, we assume viewport.clientHeight, the height of the visible area, is constant. In HighTable, we measure it and react to resizing.
viewport.scrollHeight, the total height of the scrollable content, is equal to table.clientHeight. Both are equal to the number of rows in the table multiplied by the row height:
const rowHeight = 33 // in pixels
const numRows = data.numRows // total number of rows in the table
const height = numRows * rowHeight
In this post, we assume the row height and the number of rows are constant. In HighTable, we react to changes in data.numRows (the number of rows in the data frame, the data structure holding the table data), for example when filtering; but we assume the row height is fixed (see issue #395 to support variable row heights).
viewport.scrollTop is the number of pixels between the top of the scrolled table and the top of the viewport. The minimum value 0px shows the top of the table, while the bottom of the table is reached at the maximum value viewport.scrollHeight - viewport.clientHeight.
The visible pixels can be computed from the scroll top position:
const firstVisiblePixel = viewport.scrollTop
const lastVisiblePixel = viewport.scrollTop + viewport.clientHeight
// firstVisiblePixel is inclusive, lastVisiblePixel is exclusive
Now that we have the basics, let’s see how to handle large datasets.
The first challenge when working with a large dataset is that it will not fit in the browser's memory. The good news: you won't want to look at every row either, and certainly not all at the same time. So, instead of loading the whole data file at start, we only load the visible cells.
Note that lazy loading the data does not change the HTML structure of the table.
The following widget shows how lazy loading works. Scroll the left box up and down to see how the cells are loaded on demand on the right side:
In the table, only the visible cells are loaded. When scrolling, newly visible cells are requested and loaded in the background, and rendered when available.
To do so, we compute the visible rows, and only load them:
const rowStart = Math.floor(firstVisiblePixel / rowHeight)
const rowEnd = Math.ceil(lastVisiblePixel / rowHeight)
// rowStart is inclusive, rowEnd is exclusive
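To make these formulas concrete, here is a small worked example. The 33px row height is HighTable's default; the 600px viewport height and 1,000px scroll position are assumptions for illustration:

```javascript
// Assumed geometry: 33px rows, 600px-tall viewport, scrolled down 1,000px.
const rowHeight = 33
const clientHeight = 600
const scrollTop = 1_000

const firstVisiblePixel = scrollTop               // 1000 (inclusive)
const lastVisiblePixel = scrollTop + clientHeight // 1600 (exclusive)

const rowStart = Math.floor(firstVisiblePixel / rowHeight) // 30
const rowEnd = Math.ceil(lastVisiblePixel / rowHeight)     // 49
// Rows 30 to 48 are visible, so only those 19 rows need to be loaded.
```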
In HighTable, the data loading logic is handled in a data frame, passed to the React component as the data prop:
<HighTable data={data} />
The data frame is an object that defines how to load (i.e. fetch and cache) the data on demand, and how to get the loaded data for rendering. See the DataFrame TypeScript definition in types.ts.
Here is a simplified DataFrame implementation that generates random data for one column, applying some delay to simulate fetching data over the network, and persists the values in memory:
const cache = new Map()
const eventTarget = new EventTarget()
const numRows = 1_000_000
const data = {
numRows,
eventTarget,
// Synchronously return the cached value (if any)
getCell({ row }) {
return cache.get(row);
},
// Load missing values for the given rows, and cache them
async fetch({ rowStart, rowEnd }) {
// Simulate network delay
await new Promise((resolve) => setTimeout(resolve, 100));
for (let row = rowStart; row < rowEnd; row++) {
// Skip already cached rows
if (cache.has(row)) continue;
// Generate a random value for the cell, and cache it
cache.set(row, {value: Math.random()});
}
// Emit an event to tell <HighTable> to re-render the visible cells
eventTarget.dispatchEvent(new Event('resolve'));
},
}
The data frame loads the data from the source using the asynchronous data.fetch() method. It must cache the results, and dispatch a resolve event when new data is available. The source can be anything. In our example, the data was randomly generated. It can also be obtained from a local file, an in-memory array, a remote file (using HTTP range requests), or a REST API, to name a few examples.
The data frame must also provide a synchronous data.getCell() method to get the cached data for a given cell, or undefined if the data is not loaded yet.
On every scroll move, the table is rendered, calling data.getCell() for the visible rows, as well as data.fetch() to load them in the background if necessary (it’s the responsibility of the data frame to return fast if the data is already cached). Every time new data is fetched and reported (on resolve events), the table will be re-rendered.
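Putting the pieces together, the render pass can be sketched as follows. This is a simplified illustration, not HighTable's actual code: the data frame mirrors the simplified example above (minus the simulated network delay), and renderVisibleRows is a hypothetical helper:

```javascript
const cache = new Map()
const eventTarget = new EventTarget()

// Simplified data frame: fetch() caches values and signals 'resolve';
// a real implementation would await network requests here.
const data = {
  eventTarget,
  getCell({ row }) { return cache.get(row) },
  async fetch({ rowStart, rowEnd }) {
    let loaded = false
    for (let row = rowStart; row < rowEnd; row++) {
      if (cache.has(row)) continue // return fast for cached rows
      cache.set(row, { value: row * 10 })
      loaded = true
    }
    if (loaded) eventTarget.dispatchEvent(new Event('resolve'))
  },
}

// Hypothetical render pass: request the visible rows in the background,
// then render whatever is cached, falling back to a placeholder.
function renderVisibleRows(rowStart, rowEnd) {
  data.fetch({ rowStart, rowEnd })
  const cells = []
  for (let row = rowStart; row < rowEnd; row++) {
    cells.push(data.getCell({ row }) ?? { value: 'loading…' })
  }
  return cells
}

// Re-render the current slice whenever new data arrives.
eventTarget.addEventListener('resolve', () => renderVisibleRows(0, 3))
```

With a real network delay, the first render shows placeholders, and the resolve listener triggers the re-render once the data lands.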
You can find a more complete example of a data frame that loads a remote Parquet file (using HTTP range requests) in the hyparquet demo.
The data frame structure is not oriented towards rows or columns, and allows loading and accessing the data by cell. Currently, in HighTable, we load full rows, but we could improve by computing the visible columns and loading them lazily as well. Join the pending discussion if you’re interested in this feature.
Impact of lazy loading
If we assume 10 billion rows at 100 bytes per row, the total data size is 1 TB. Loading it all in memory is not possible, but with lazy loading, we only load about 3 KB for the visible part (about 30 rows at a time), and keep good performance.
Lazy loading the data is the first step, required to handle large datasets in the browser. The next step is to avoid rendering too many HTML elements at once.
In software engineering, when you try to optimize, the first step is to remove work whose result is never used. In our case, if the table has one million rows and we can see only 30 at a time, why render one million <tr> HTML elements? As a reference, Chrome recommends creating or updating fewer than 300 HTML elements at a time for optimal responsiveness.
In the <HighTable> component, only the visible slice of the table is rendered. The other row elements simply don’t exist.
To achieve this, the HTML structure must be adapted, by adding an intermediate div element, that we call the canvas, between the viewport and the table:
<div class="viewport" style="overflow-y: auto;">
<div class="canvas" style="position: relative; height: 30000px;">
<table class="table" style="position: absolute; top: 3000px;">
<!-- the table only renders the visible rows -->
...
</table>
</div>
</div>
The HTML structure will remain the same for the rest of the blog post, including techniques 3, 4 and 5.
The canvas div is not related at all to the <canvas> HTML element. I'm open to suggestions for better naming if it's confusing.
The canvas is sized so that it could contain all the rows:
canvas.style.height = `${data.numRows * rowHeight}px`
It sets the scrollbar to the expected size. As shown in the scrolling basics section, viewport.scrollHeight is equal to canvas.clientHeight.
The canvas serves as a reference for absolutely positioning the table slice.
The following widget shows how table slicing works. Scroll the left box up and down to see how the right box mimics the scrolling effect, while rendering only the visible rows. Toggle the full table button to see how the rendered rows fit in the full table:
On the right side, you see that only the visible rows are rendered. The table slice contains 6 rows instead of 10 (or 7, depending on the scroll position).
The HTML structure inside the table slice is:
<table>
<tbody>
<!-- Rows 0 to 99 are not rendered -->
<!-- Visible rows -->
<tr>...row 100...</tr>
<tr>...row 101...</tr>
...
<tr>...row 119...</tr>
<!-- Rows 120 to 999 are not rendered -->
</tbody>
</table>
Let’s assume the data has 1,000 rows, each row in the table is 30px tall, and the viewport height is 600px (so that about 20 rows are visible at once). If the user has scrolled down 3,000px, <HighTable> only renders rows 100 to 119 in the actual <table> element.
The HTML above is a simplification. In hightable, we render a table header and add some padding rows before and after the visible rows to improve the scrolling experience.
The table top position is adjusted to fit in the full table (toggle the Show / Hide button to render the full table). It equals the position of the first visible row inside the virtual full table. It’s nearly equal to viewport.scrollTop, but differs by the number of hidden pixels at the top of the first visible row. So:
table.style.top = `${
viewport.scrollTop - (viewport.scrollTop % rowHeight)
}px`;
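A quick worked example with the numbers above (30px rows), assuming the user has scrolled 10px past the top of row 100:

```javascript
const rowHeight = 30
const scrollTop = 3_010 // 10px below the 3,000px mark

const firstVisibleRow = Math.floor(scrollTop / rowHeight) // 100
const tableTop = scrollTop - (scrollTop % rowHeight)      // 3000

// Row 100 starts at 100 * 30 = 3,000px in the virtual full table,
// so top: 3000px aligns the rendered slice with its virtual position.
```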
These computations are done on every scroll event (and on every other change: when the height changes, or when the number of rows is updated). Once computed, the table slice is re-rendered with the new visible rows, the table position is updated with the new top value, and the data frame is queried to load the new visible cells if needed.
A detail worth mentioning is the sticky header. In <HighTable>, the header with column names is rendered as part of the table element, in <thead>, not as a separate element. It helps with accessibility, as screen readers can easily associate the header cells with each data cell, and with column resizing, as the header and data cells are aligned automatically by the browser. Thanks to the CSS property position: sticky (see sticky on MDN), the header row remains visible at the top of the viewport when scrolling. We take it into account to compute the first visible row.
Note that the table slicing technique is not specific to vertical scrolling. The same approach can be used for horizontal scrolling (rendering only the visible columns). It’s less critical, as tables generally have fewer columns than rows. Join the pending discussion on virtual columns if you’re interested in this feature.
Impact of table slicing
If we assume 10 billion rows, and 30 rows are visible at a time, we only render 30 HTML elements instead of 10 billion. This keeps performance good with any number of rows, as the number of rendered elements is constant.
Until now, everything is pretty standard. The next techniques are more specific to hightable, and address challenges that arise when dealing with billions of rows.
Technique 2 works perfectly, until it breaks… As Eric Meyer explains in his blog post Infinite Pixels, HTML elements have a maximum height, and the exact value depends on the browser. The worst case is Firefox: about 17 million pixels. As the canvas height increases with the number of rows, if the row height is 33px (the default in HighTable), we cannot render more than 500K rows.
Our approach to this issue in HighTable is to set a maximum height for the canvas and downscale the scrollbar resolution above this limit. In HighTable, the threshold is set to 8 million pixels.
Concretely, above the threshold, one scrolled pixel corresponds to multiple pixels in the full table. The downscaling factor is the ratio between the theoretical height of the full table and the maximum height of the canvas. Thanks to that factor, if you scroll half the scrollbar, you reach the middle of the full table, no matter how big it is.
Below the threshold, the downscaling factor is 1, so everything works as before: one scrolled pixel corresponds to one pixel in the full table.
The downscale factor is computed as:
const fullTableHeight = data.numRows * rowHeight
const maxCanvasHeight = 8_000_000
if (fullTableHeight <= maxCanvasHeight) {
downscaleFactor = 1
} else {
downscaleFactor =
(fullTableHeight - viewport.clientHeight) /
(maxCanvasHeight - viewport.clientHeight)
}
Now, the first visible row is computed with:
firstVisibleRow = Math.floor(
(viewport.scrollTop * downscaleFactor) / rowHeight
)
and the table top position is set to align the first visible row with the top of the viewport:
table.style.top = `${viewport.scrollTop}px`;
This lets the user navigate through the whole table, even with billions of rows.
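Here is a worked example of the downscaling, assuming 33px rows, and ignoring the viewport.clientHeight correction terms of the exact formula for simplicity:

```javascript
const numRows = 10_000_000_000 // ten billion rows
const rowHeight = 33
const maxCanvasHeight = 8_000_000

const fullTableHeight = numRows * rowHeight // 330 billion pixels
// Simplified factor (the exact formula also subtracts clientHeight).
const downscaleFactor = fullTableHeight / maxCanvasHeight // 41,250

// Dragging the scrollbar to the middle of the canvas…
const scrollTop = 4_000_000
const firstVisibleRow = Math.floor(scrollTop * downscaleFactor / rowHeight)
// …lands on row 5,000,000,000: the middle of the data, as expected.
```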
The following widget shows how scrollbar downscaling works. Scroll the left box up and down to see how the right box mimics the scrolling effect, letting you navigate through ten billion rows.
But there is a drawback. The native scroll bar precision is limited to 1 physical pixel. On “high-resolution” screens, the apparent precision is a fraction of a CSS pixel (1 / devicePixelRatio). But let’s keep one pixel for simplicity.
As an anecdote, setting the scroll value programmatically is hard to predict. It depends on the device pixel ratio, which itself depends on the zoom level, and maybe other factors. For example, element.scrollTo({top: 100}) might result in scrollTop = 100, scrollTop = 100.23, or scrollTop = 99.89. You cannot know exactly, but it stays within a margin of one pixel. The scrollTop value can even fall outside of the expected range, for example negative or larger than the maximum value scrollHeight - clientHeight. To prevent such browser-specific over-scroll effects, when reacting to a scroll event, hightable always clamps the scrollTop value within the expected range, and applies the CSS rule overflow-y: clip. clip, unlike hidden, keeps the sticky header visible, even if I’m not sure why, to be honest.
So, when the downscale factor is big, like in the example above (2,189,781,021), the minimal scroll move (1px) corresponds to 2,189,781,021 pixels in the full table. With a row height of 30px, it means that the minimal scroll move corresponds to about 72,992,701 rows. It creates gaps in the reachable rows:
At viewport.scrollTop = 0, the visible rows are 0 to 5; at viewport.scrollTop = 1, they are 72,992,700 to 72,992,705; at viewport.scrollTop = 2, they are 145,985,401 to 145,985,406. There is no way to navigate to rows 6 to 10, for example: setting viewport.scrollTop = 0.00000000274 to reach them is impossible, because the browser rounds the scroll position to the nearest integer pixel.
Impact of infinite pixels
If we assume 10 billion rows, the infinite pixels technique lets the user navigate through the whole row span. There is no limit to the number of rows, as we can always increase the downscale factor to fit within the maximum canvas height.
But due to the limited scrollbar precision, if the row height is 30px and the canvas is 8Mpx, each scrolled pixel moves the table by 1,250 rows. It means that only one row (and its neighbors) out of 1,250 is reachable.
The infinite pixels technique thus provides global navigation through billions of rows. But it does not allow fine scrolling, and some rows are unreachable. Technique 4 addresses this issue.
The previous technique allows to scroll globally through the file, but prevents users from scrolling locally because any scroll gesture will jump over gaps of unreachable rows.
To fix that, we implement two scrolling modes: local and global scrolling. Local scrolling means scrolling the table slice pixel by pixel (i.e. even more precisely than row by row), while global scrolling means jumping to the position given by the scrollbar.
The logic requires a state with three values: { scrollTop, globalAnchor, localOffset }
The first visible row is computed from the global anchor and the local offset:
const firstVisibleRow = Math.floor((
state.globalAnchor * downscaleFactor + state.localOffset
) / rowHeight)
The absolute positioning of the table is now:
table.style.top = `${viewport.scrollTop + state.localOffset}px`;
On every scroll event, we compute the magnitude of the scroll move (the difference between the new viewport.scrollTop and the previous one, stored in the state) and decide which mode to apply: for a large move (global scrolling), we reset the localOffset to 0 and sync the globalAnchor with the real scrollTop value; for a small move (local scrolling), we keep the globalAnchor value unchanged (i.e. not sync’ed anymore with the real scrollTop value) and adjust the localOffset so that the move appears local (for example, 3 rows downwards). Represented as code, the logic looks like this (simplified pseudo-code):
const state = getState()
const delta = viewport.scrollTop - state.scrollTop
if (Math.abs(delta) > localThreshold) {
// global scroll
state.localOffset = 0
state.globalAnchor = viewport.scrollTop
} else {
// local scroll
state.localOffset += delta
}
setState(state)
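A runnable version of this logic, with an illustrative localThreshold (the actual threshold used by hightable may differ):

```javascript
const localThreshold = 600 // assumed: moves larger than this are "jumps"

// Pure state-update function: returns the next scroll state.
function onScroll(state, newScrollTop) {
  const delta = newScrollTop - state.scrollTop
  if (Math.abs(delta) > localThreshold) {
    // Global scroll: resync the anchor, drop the local offset.
    return { scrollTop: newScrollTop, globalAnchor: newScrollTop, localOffset: 0 }
  }
  // Local scroll: keep the anchor, accumulate a pixel-precise offset.
  return { ...state, scrollTop: newScrollTop, localOffset: state.localOffset + delta }
}

let state = { scrollTop: 1_000, globalAnchor: 1_000, localOffset: 0 }
state = onScroll(state, 1_066)   // small wheel move: local (offset grows by 66)
state = onScroll(state, 500_000) // scrollbar drag: global (anchor resynced)
```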
Now, the user can navigate around the current row, but also jump to any part of the data.
The following widget shows the dual scrolling mode. Scroll the left box up and down to see how the right box mimics the scrolling effect, allowing to navigate both locally and globally through ten billion rows.
With this approach, small scroll moves appear local, while large scroll moves jump to the expected global position. The user can navigate through the whole table, and reach every row. The user can scroll as expected in the browser, with their mouse wheel, touchpad, keyboard (when the table is focused) or scrollbar.
Impact of pixel-precise scroll
If we assume 10 billion rows, the dual scrolling mode allows access to any pixel of the full table using the native scrollbar. The user can scroll locally with the mouse wheel, and globally by dragging the scrollbar.
This works as long as the full table height is less than the square of the maximum canvas height (8Mpx in hightable), which corresponds to about 64 trillion pixels. So, 1px fidelity is guaranteed up to about 2 trillion rows with a row height of 30px.
Above that limit, the minimal step is greater than 1px, but every row is still reachable up to 64 trillion rows! Above, some rows become unreachable.
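The arithmetic behind the 1px-fidelity limit, assuming a 30px row height (the reasoning for the squared bound is my interpretation of the technique):

```javascript
const maxCanvasHeight = 8_000_000 // hightable's canvas cap, in pixels
const rowHeight = 30

// One 1px scrollbar step jumps downscaleFactor pixels, and local scrolling
// can bridge at most ~maxCanvasHeight pixels of offset. So every pixel stays
// reachable while downscaleFactor <= maxCanvasHeight, i.e. while the full
// table height is at most maxCanvasHeight squared.
const maxPixelPerfectHeight = maxCanvasHeight ** 2            // 64 trillion px
const maxPixelPerfectRows = maxPixelPerfectHeight / rowHeight // ≈ 2.13 trillion
```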
The last challenge is to move to any cell programmatically (i.e. random access to any part of the table), be it using the keyboard or through a “jump to row” input, without worrying about the local vs global scrolling mode. Random access requires decoupling vertical and horizontal scrolling. We explain it in the next section.
One of the HighTable requirements is to allow keyboard navigation (e.g. ↓ to go to the next row). Fortunately, the Web Accessibility Initiative (WAI) provides guidance through the Grid Pattern and the Data Grid Examples. We use tabindex roving to handle the focus, providing all the expected keyboard interactions.
The browser provides a useful default when calling cell.focus(): it automatically scrolls to the cell and focuses it. But in HighTable, we don’t use the default behavior. Indeed, it positions the cell at the center of the viewport, which does not feel natural. To get the expected behavior, we first scroll by the minimal amount to show the next row and column, by calling cell.scrollIntoView({block: 'nearest', inline: 'nearest'}). Then we set the focus with no scroll action using cell.focus({preventScroll: true}).
Unfortunately, the keyboard navigation techniques explained in the WAI resources are designed for full tables. But due to the techniques 2 (table slice), 3 (infinite pixels) and 4 (pixel-precise scroll), multiple steps are required. In particular, to let the user move the active cell with the keyboard, we separate the vertical scrolling logic from the horizontal one.
When the user moves the active cell, the final position can be anywhere in the table: ↓ moves to the next row, while Ctrl+↓ moves to the last row. If the move is big, we might have to scroll vertically to have the required cell in the DOM.
The same issue arises whenever we access a random row in the table, for example if an app embedding <HighTable> provides a “jump to row” feature. The table should programmatically scroll to the expected row, and focus the cell in the expected column, without worrying about the local vs global scrolling mode, or the horizontal scroll position.
The process is as follows: first, compute the next state (the vertical scroll position and the visible rows); then render the table slice and, if needed, scroll the viewport programmatically to the new vertical position; finally, once the cell is in the DOM, scroll horizontally with cell.scrollIntoView({inline: 'nearest'}) and focus it with cell.focus({preventScroll: true}).
Note that, for point 1 (computing the next state), we respect the block: nearest behavior by minimizing the scroll move. If the next row is below the current viewport, it will be the last visible row in the next viewport. If it is above, it will be the first visible row. If it is already visible, no vertical scroll is applied.
The pseudo-code for decoupling vertical and horizontal scrolling requires a flag to prevent horizontal scrolling and focus during the programmatic vertical scroll:
/* in the cell navigation code */
const shouldScroll = state.update()
renderTableSlice()
if (shouldScroll) {
// set a flag to prevent horizontal scrolling + focus
// during programmatic scroll
setFlag('programmaticScroll')
viewport.scrollTo({top: state.globalAnchor, behavior: 'instant'})
}
/* in the scroll event handler */
if (isFlagSet('programmaticScroll')) {
// allow horizontal scrolling + focus,
// once the programmatic scroll is done
clearFlag('programmaticScroll')
}
/* in the cell rendering code */
if (!isFlagSet('programmaticScroll')) {
// horizontal scrolling + focus allowed
cell.scrollIntoView({inline: 'nearest'})
cell.focus({preventScroll: true})
}
We set behavior: 'instant' when scrolling programmatically to ensure we only receive one scroll event. The alternative, behavior: 'smooth', would trigger multiple scroll events, clearing the flag too early, and generating conflicts with the internal state due to intermediate unexpected scrollTop positions (see the open issue).
Impact of two-step random access
With this technique, the user can access any random cell in the table with the keyboard, and the table will scroll to the expected position, even with billions of rows. The vertical and horizontal scrolling are decoupled, so that the user can move to the next column with → without triggering a vertical scroll, and vice versa with ↓.
No need for a fake scroll bar. No need to render the table in a canvas. We use the Web platform. Thanks to these five techniques that rely on native HTML elements, hightable lets you navigate seamlessly through billions of rows of a remote data file, in the browser.
Give a star ⭐ to the GitHub repo if you liked the article!
Most dissatisfied users don’t complain. They churn.
“I asked your chatbot how I could talk with a live customer service agent and it gave me a nonsensical answer, so I never used it again.”
That complaint is unusually helpful. Most users don’t send feedback. They just bounce.
This post walks through a real-world workflow for inspecting LLM logs to debug chatbot failures, identify systemic issues, and validate fixes using real production data.
For Darryl, the engineer debugging the issue, the immediate question wasn’t why this one chat failed. It was:
If one person noticed and complained, how many others hit the same failure and churned?
The team had shipped the chatbot a month earlier. In that time, the LLM logs had already exploded to multiple gigabytes in Parquet and were still growing fast. Reading them manually wasn’t an option. Spot-checking wasn’t an option either: this was a trust failure in a support channel, and you can’t restore trust by guessing.
Darryl needed a workflow that could answer two things quickly:
The first step was to stop thinking in terms of individual conversations and start reasoning over the LLM logs as a dataset.
Darryl loaded the Parquet logs into Hyperparam and started by scanning raw rows to understand what was captured per turn (messages, tool calls, metadata).
The first goal was simple: locate conversations that matched the complaint, such as “trying to reach a human,” “live agent,” “customer service,” etc. That’s awkward in SQL because the query is semantic: the same intent shows up across many different phrasings.
Instead of writing brittle keyword filters, he used an AI agent to filter the dataset down to conversations that likely matched the reported intent. Then he pulled up the specific user’s interaction to review the full conversational context.
Darryl soon discovered that the issue wasn’t just that the chatbot wrote something wrong. In the failing chat, the model attempted a tool call to answer a factual question about support availability—but it called the wrong tool.
That’s an important distinction, because it changes the fix:
After fixing the single root cause, Darryl ran a broader review across the full dataset to look for other cases where users asked factual questions and received low-quality or nonsensical answers. Once he zoomed out to look across the full dataset, individual failures stopped being useful on their own. The questions he needed to ask were:
The agent surfaced multiple similar issues. What initially appeared to be a single complaint turned out to be a recurring failure that could be costing customers.
At this stage, Darryl needed to validate behavior by asking:
Given the exact same user inputs from the original version (V1), does the updated version (V2) reliably choose the right tool and produce a correct answer?
With Hyperparam, Darryl replayed historical conversations under different configurations (prompts, tooling, model), then compared outputs across variants using LLM-as-a-judge to score improvements at scale.
This made it possible to see whether fixes held up across the full replay dataset, not just a few handpicked samples.
After iterating, he exported a concrete set of changes for the next chatbot version: which tool call behavior to adjust, what prompt or tool constraints to add, and which configuration produced the best outcomes on the replay dataset.
“I found the right setup within a couple of hours without pulling in an entire team. Being able to compare V1 and V2 across real inputs made it obvious which changes actually worked.”
If you’re debugging chatbot failures using real LLM logs, this is the kind of workflow Hyperparam is designed to support.
Now comes the challenge. Hyperparam’s users end up with massive datasets with large text-blobby columns such as chat logs and structured columns with labels, scores, or other, more classical, structured information. Users need the ability to query over this data in the browser in an AI native manner. The only AI-friendly language to do this with is SQL, but there’s no SQL engine built natively for the browser that’s fast enough, low memory enough, or async enough to meet Hyperparam’s standards for interactivity.
So I did what all engineers do: I built Squirreling, a ~9 KB (minified and gzipped) SQL engine with zero external dependencies. It achieves instant startup and constant memory usage for streaming queries.
I made my first commit on November 15th, open-sourced it on November 22nd, and had it live in Hyperparam on November 26th. Shipping this with one person in such a short timeframe would never have been possible without AI.
To understand why existing browser-based SQL engines struggle with interactive data exploration, it helps to examine how they’re built. Tools like DuckDB-Wasm compile a full analytical SQL engine to WebAssembly so it can run inside the browser. But database engines relying on WebAssembly to run in the browser face inherent limitations:
In practice, this shows up as noticeable startup times before queries can run, delayed time-to-first-result, and execution behavior that prioritizes throughput over interactivity. Queries run to completion before yielding results. And if you wanted to have derived columns or user-defined functions (UDFs), there’s no way to connect DuckDB-Wasm with async API calls such as LLMs. This makes existing SQL engines less than optimal for exploratory workflows that depend on fast, incremental feedback.
Squirreling emerged as a response to the simple question: what happens if you design a SQL engine for the browser first, instead of adapting a server-oriented database to run there?
Starting from that premise leads to a different set of design choices than those made by existing solutions:
These design choices are reflected throughout Squirreling’s architecture, shaping how queries execute, how data is retrieved, and how work is scheduled.
Let’s examine the key parts of Squirreling’s architecture that make these design choices concrete.
Squirreling delays computing column values until the query needs them. Expensive operations only run on cells that survive earlier stages such as filtering, sorting, and limiting.

By delaying materialization, Squirreling executes joins over minimal projections and effectively inherits the asymptotically worst-case-optimal behavior of modern join algorithms, only materializing payload columns for rows that survive the join. [1]
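As an illustration of the idea (hypothetical code, not Squirreling’s actual API; the real engine is async and streaming, while this sketch is synchronous for brevity): the expensive column is only evaluated for rows that survive the filter, and LIMIT stops the scan early.

```javascript
let expensiveCalls = 0
// Stand-in for a costly derived column (e.g. an LLM-backed UDF).
const expensive = (row) => { expensiveCalls++; return row.id * 10 }

// Late materialization: filter on cheap columns, compute `score` last.
function* query(rows, { where, limit }) {
  let yielded = 0
  for (const row of rows) {
    if (!where(row)) continue               // cheap predicate first
    yield { ...row, score: expensive(row) } // materialize late
    if (++yielded >= limit) return          // LIMIT short-circuits the scan
  }
}

const rows = [
  { id: 1, ok: true },
  { id: 2, ok: false }, // filtered out: `expensive` never runs on it
  { id: 3, ok: true },
  { id: 4, ok: true },  // never scanned: LIMIT 2 stops before it
]
const results = [...query(rows, { where: (r) => r.ok, limit: 2 })]
// Only two expensive evaluations ran, not four.
```

In the async version, rows stream out one by one as their promises resolve, which is what enables time-to-first-result before the query completes.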
Squirreling distinguishes between streaming and buffered paths based on query characteristics:
Squirreling parses SQL into an AST and executes against the AST directly without a separate planning phase. This simple system avoids heavy planning overhead and allows execution to stay incremental and async. It fits the browser environment, where responsiveness matters more than deep cost-based optimization.
This AST-driven, async, late materialization model applies even to complex queries. Joins are executed directly from the AST, yielding an async and streaming result. They can stop early in the event LIMIT is applied and don’t force evaluation of columns. Columns are treated independently and evaluated individually and lazily, deferring expensive columns until required.
Squirreling is written in pure JavaScript with zero runtime dependencies. The complete library — consisting of the parser, executor, and all built-in functions — compiles to ~9 KB (minified and gzipped). That’s 500x smaller than DuckDB-Wasm’s 4.5 MB binary.
This small footprint offers the following benefits:
These architectural choices follow logically from treating the browser as the primary execution environment for interactive exploration. The result is a browser-native SQL engine with a set of properties that shape how queries run:
Squirreling is available as an open-source library here: github.com/hyparam/squirreling
[1] Abadi, D. J., Myers, D. S., DeWitt, D. J., & Madden, S. R. (2007). Materialization Strategies in a Column-Oriented DBMS. In Proceedings of the 23rd International Conference on Data Engineering (ICDE) (pp. 466–475). IEEE.
Debugging issues like sycophancy or tone shifts in large LLM chat logs usually starts the same way. Someone flags a problem, and suddenly you’re the engineer staring at hundreds of thousands of rows trying to figure out what went sideways. Your boss wants answers, and the dataset is huge. So you pull a small sample, send it through another LLM to score for sycophancy, and check to see whether the scoring prompt actually captures what you care about. That quick loop works for the iteration phase, but it never tells you how often the issue appears or what triggers it across the full dataset.
LLM chat logs become harder to reason about at scale because the issues you care about are distributed across tens or even hundreds of thousands of lines of text. Chatbot logs consist of multi-GB text files. In this deluge of unstructured data, what matters is finding the important 1% of failures that are relevant to the challenge you’re working on. Most teams start by sampling because it’s the fastest way to inspect a few examples and test their scoring logic. But sampling only shows fragments of the behavior, and there’s no guarantee those fragments reflect the full picture.
Traditional debugging workflows were built for structured tables, not multi-GB datasets of AI-scale text. They’re designed for predictable schemas, uniform rows, and fields you can sort, filter, or compute against. They usually rely on sampling or slice-based queries because they’re the fastest ways to inspect a few examples.
But the new world we live in consists of huge piles of unstructured text data: LLM chat logs that don’t behave that way. A single row can contain a full conversation, a long reasoning chain, or text that spans hundreds of tokens with no consistent structure. Engineers still catch individual failures like an instance of sycophancy or a strange tone shift, but locating those issues in massive logs often requires digging through isolated rows manually. And because conventional SQL or Python workflows weren’t built to analyze unstructured, conversational text across large datasets, they don’t help you map how often an issue occurs, what triggers it, or whether it’s part of a pattern that repeats across thousands of conversations.
So the question becomes: am I seeing the full picture here, or is my view skewed because traditional methods aren’t built to query massive unstructured datasets?
Understanding the issues in your AI systems almost always comes from actual use, whether that’s using the system yourself (dogfooding) or listening to reports from your users. In many cases, you might have a general sense of the issues that exist, like sycophancy, unexpected tone shifts, or two conversations that answer the same question differently. But with massive logs like these, the underlying behavior often only shows up when issues are viewed across the entire dataset.
Some issues only make sense when you see how often they appear or what triggers them:
These issues aren’t rare, but they’re distributed thinly across the dataset, and that distribution makes them nearly impossible to see without dataset-level querying. A sample shows you symptoms, but only inspecting the entire dataset reveals the scope, frequency, or context that define the real pattern. And you can’t query for this kind of thing with SQL. You need something that understands natural language.
When you can’t query across the full dataset, you lose the ability to judge the scope or conditions of an issue. This can result in:
These blind spots might start out small, but they can expand quickly and slow down debugging, pushing teams into reactive rather than proactive work.
As datasets grow, the limits of slice-based inspection become harder to ignore. Issues in LLM chat logs emerge from patterns that spread across thousands of conversations, across different regions and varying prompts. And with multi-GB datasets now being the norm, finding issues and understanding the patterns behind them requires reasoning across the full dataset, not just the fragments that appear inside a sample.
If you work with LLM chat logs, you can try the Hyperparam app for a faster way to explore and query large datasets. It’s free while in beta.
What makes LLM chat logs harder to analyze than other types of AI data?
LLM chat logs combine long-form text, multi-turn conversations, and inconsistent structures that don’t fit traditional structured data workflows. This makes it difficult to map issues and patterns across the full dataset using query-based or search-based methods.
Why do subtle LLM failures often appear only at a large scale?
No one knows how to make good evals. So issues tend to surface from actually deploying and using AI models. They only become apparent over time, and are often subtle behavioral problems like sycophancy. They don’t form clear patterns until you analyze the dataset broadly.
What happens when issues stay hidden in your LLM logs?
In my experience, inability to look deeply at LLM chat log data has resulted in:
These blind spots might start out small, but they can expand quickly and slow down debugging, pushing teams into reactive rather than proactive work.
How can teams reduce blind spots when working with massive unstructured logs?
Teams can reduce blind spots by shifting from spot-checking to dataset-level reasoning, using AI assistance that allows them to query, compare, and evaluate issues across the entire dataset instead of isolated rows.
AI runs on data. Massive amounts of it. On one side you’re training models on large amounts of text, and once deployed, these models constantly produce mountains of AI text data. The entire lifecycle of AI is massive data in and even more data out. Between April 2024 and April 2025, Google’s AI products alone went from roughly 9.7 trillion tokens to more than 480 trillion tokens. That’s almost a 50x increase in just one year and rapidly approaching 1 quadrillion tokens per month.
However, none of the tools that currently exist are built to work with massive, planet-sized balls of unstructured text. Notebooks, SQL engines, and data visualizers all assume something smaller and more structured than what we actually deal with today.
If we want to keep advancing with AI, we need solutions that let us explore and understand AI data at the speed at which it’s produced. And that’s why Hyperparam exists. The Hyperparam AI tool, a browser-native application built specifically for this environment, lets you explore and transform massive datasets in real time so you can understand and improve your AI datasets.
Every company building with AI now sits on more text than any team can realistically examine. Chat logs, model outputs, product interactions, and support conversations all contain valuable intelligence on how a company’s AI is performing.
But AI data accumulates faster than humans can review or understand. In Q3 2025 alone, Azure’s AI services processed over 100 trillion tokens. Even small teams wind up with tens of thousands of rows overnight, and the rate of growth only accelerates as AI proliferates across more companies and industries.
Traditional tools to help businesses understand their data often rely heavily on the data being structured and accessed via SQL or other structured query languages. But the “signals” in AI data — e.g. did the model hallucinate, did the model ask for clarification, did the user get frustrated — exist fuzzily in text, not in an easy-to-access column. The information to learn from is in the data, but there is no way with traditional tools to access it for any kind of dataset analysis or debugging.
With the pace of AI, that gap only compounds. The more data you produce, the less equipped you are to do anything meaningful with it. The result is a backlog of unknowns that keeps growing while your ability to understand it stays flat.
Ironically, our hypothesis is that AI can help you understand your AI data, but only if the interface makes that possible. AI models can fuzzily extract information, transform, label, and filter for you, but none of that matters when the surrounding tools choke the moment you hit real-world dataset volumes. For example, ChatGPT can help you understand if your AI is hallucinating, but you can’t load more than a few dozen chat logs at a time. Traditional data viewers, even augmented with AI, can’t display more than a few thousand rows instantly. Custom notebooks could be built to use AI, but would require scalable infrastructure to run over the entire data.
The Hyperparam AI tool solves this problem. It’s the first tool that makes AI usable at dataset scale by pairing two things that have never existed side by side:
Because the interface is fast enough to keep up, the AI insights become actionable. You can generate columns, score for sentiment, and filter results in real time, all without waiting or guessing. In short, everything clicks into place: the combination of a high-speed UI and Hyperparam’s AI agents gives us the first tool designed to explore and understand AI-scale data and support real LLM dataset debugging.
Once the interface is fast enough to keep up with the data, the AI layer turns into a genuine workflow upgrade. The browser engine handles the scale, the model does the hard work of reading through the thousands of rows of text data, and you stay in charge of the decisions. The model scores every row, creates new columns, surfaces issues, and points out strange behavior you might not notice on your own. You explore and validate the results in real time because nothing stalls or blocks you.
Take something as simple as triaging chatbot sycophancy and releasing a new prompt to correct sycophantic behavior. In the Hyperparam chat, you can ask Hyperparam to score every conversation for sycophancy, sort the entire dataset, filter to the outliers, and transform sycophantic results into desired behaviors for evaluations. Then you can try out different prompts, check the responses, and iterate until you have a prompt performing well on your corrected evaluation. You can even export this evaluation to use it later. And you can do this all singlehandedly inside one browser tab.
Large language models can help score conversations or pinpoint odd behavior, but they can’t work through AI-scale datasets on their own. Hyperparam overcomes that limitation by pairing a high-speed browser engine with an army of AI agents that support the parts of the workflow where natural language actually adds value. You move through the data instantly, and the model helps you understand what you’re seeing without ever taking over the decisions.
This setup keeps the judgment where it belongs: with you, the human expert. We believe strongly that human-in-the-loop is the only way to work responsibly with AI. You decide how far to trust a score or when a prompt needs refinement. The UI makes the dataset feel lightweight and the AI does the heavy lifting, but every decision runs through your expert eye.
If you work with AI data, try the Hyperparam AI tool for a faster way to inspect, debug, and refine massive datasets. It’s free while it’s in beta.
Hyperparam is an AI-powered data transformation tool that lets users and an army of AI agents look at, transform, score, and filter massive datasets instantly. As Kenny puts it, “It’s like a Swiss Army knife for your data.” It’s built on an ecosystem of open source data transformation libraries that power its paid app, which delivers the full Hyperparam experience.
Unlike most products that start with an enterprise focus and chase a single proof of concept, Kenny, as I’ve always known him to do, chose his own path. His thesis: Starting from open source is a better, faster way to build a product. In this interview, he shares his take on the open source community, product development in the new world of AI, and how Hyperparam took an intentional approach to open and closed source development.
At our previous company, Algorithmia, we didn’t go open source. What made you decide to do it differently this time?
I built Hyperparam as the data transformation tool I wished I had because there wasn’t one that met my criteria. The first version of Hyperparam was a simple browser-based data viewer for Parquet files with some simple data transformation tools. The majority of large datasets for AI training and monitoring are Parquet files, and I just wanted to look at the data and play around with it. But I didn’t think people would pay for a Parquet viewer, so I doubted I’d be shooting myself in the foot by giving it away. If anything, I was going to get all the benefits of usage in the community. So I just put it out there without promoting it.
One of the most compelling arguments for doing open source is that I think it’s fundamentally a better way to build a product and get feedback. If you start building a product straight for the enterprise, it’s a recipe for disaster. You start asking the wrong questions, like, “How do we fit into their workflows?” rather than asking, “How do we build a product that would see organic adoption?” With open source, people use your software if it’s useful, and if it’s not, they don’t. That’s an incredibly valuable signal.
Though you can use Hyperparam to view large datasets, you describe it as a data transformation tool. What makes Hyperparam different from data visualizers?
Hyperparam lets you instantly view, explore, and transform millions of rows of data, all through a user-friendly, chat-based UI built for scale and usability. So it’s much more than a data visualizer; it’s a data transformation tool.
When I went looking for a data tool, I just wanted to open one dataset. Jupyter was frustratingly slow. ChatGPT, VS Code, Copilot, and other assistants weren’t designed for interacting with massive datasets. And I quickly realized there wasn’t a single tool out there that let me look at any scaled dataset.
That led me on this path, and the question became: What does the interface look like for using AI across data? The answer is Hyperparam. It delivers the power of instant data transformation with the ease and nuance of natural language querying.
Hugging Face is just one of the organizations that started using your libraries HyLlama and Hyparquet. What was the significance of that moment?
Hugging Face’s adoption of my libraries validated hugely that there was something to my idea of moving more AI workflows to the browser. It was the strongest market signal I’d had up to that point, and it made me start thinking about what else we could build with these components.
For context, Hugging Face is the world’s repository for open models and open data. They use multi-gigabyte files in the llama.cpp format, and they wanted to enable the user to simply get the metadata instead of downloading the entire file. HyLlama does exactly that: it pulls the metadata and provides the info the user needs, saving bandwidth, time, and disk space.
After Hugging Face integrated HyLlama into their website, they started looking into Hyparquet. When they realized it offered many of the same benefits for data, they started integrating it, as well. And it was a great honor that because they support OSS in general and have adopted our libraries, they gave us a substantial open-source grant.
You’re launching a paid version of Hyperparam soon. What’s open source and what’s part of the product?
I’ve already open sourced the parts of Hyperparam that have shared value to the community, and I’ve kept the full product experience (including the AI workflows) within the paid app.
Hyparquet, HighTable, HyLlama are some libraries we’ve released that are building blocks that help others explore data in the browser and also power what we’re building internally. My belief is that connectors, frontend components, writers, readers, and other “glue” components should be open source. They’re globally useful beyond what I’m building and should be shared by the community. But on their own, they’re not the Hyperparam product.
Beyond just thinking about what is useful for the community, there are a few other upsides of open sourcing components. For one, I get to control how the component is optimized and designed, and I can make sure it’s designed to work well with Hyperparam. Secondly, and probably most importantly, there’s the community of developers invested in these components. Approximately a dozen people contributed code to Hyparquet and HighTable, and even more filed bugs that I subsequently fixed. Giving away components doesn’t diminish value; it amplifies it through feedback, goodwill, and contributions.
Now, when thinking about the paid product, any AI component and the core user experience is proprietary. I care deeply how users flow through my product, and I need to own that experience because I don’t think anyone else can build it correctly. That’s a bit of a cocky statement, but my team is a select group of people obsessed with the overall data experience and how the AI should work.
What kind of AI workflows can you do with Hyperparam?
Hyperparam is a general purpose data transformation tool that enables you to do multiple things with your data, so it’s easiest to give an example.
Let’s say you’re a company that’s deploying a chatbot to your users, and a user files a ticket. Your support team needs to dive into the data to understand what happened, whether it’s sycophancy or some other issue. With Hyperparam, you can apply an LLM-generated score to every row in your dataset, look at the values, filter out the bad ones, transform them into something better, export the results, and just continue with your workflow.
That’s just one example of what Hyperparam can do. In addition to applying AI scores and sorting, filtering, and searching based on those scores, you can ask natural-language questions about your data, for example: “Rate every chatbot conversation for sycophancy,” “Did the user seem satisfied?” or “Was a conclusion reached?” You can categorize, tag, and explore your dataset in ways that were never possible before.
You can also run experiments: import historical data, tweak prompts, compare models, and see how the outputs change. It’s a deep research workflow that lets one person do what used to take an entire team.
What’s your core engineering philosophy and how did open source support that?
One of my fundamental engineering principles is to take no dependencies. I feel very strongly against building a huge stack of dependent software, which is something you see a lot of in JavaScript. Because Hyperparam started as a passion project and it’s open source, I could optimize purely for the function that I cared about. That’s not necessarily how that would have been at a company.
That simplification reminds me of SpaceX’s Raptor engine. Each iteration of the Raptor keeps getting simpler… yet more powerful. I wanted to do the same for software. It’s an aesthetic choice, but it also influences the architecture and engineering. With an obsession over engineering and product, you can build minimal software, and that’s what Hyperparam is. I built it from the ground up depending on nothing else. That’s why it’s as small, light, and fast as it is.
What’s your advice for developers starting out with open source?
Build the product you want to use yourself. If you have to solely rely on other people to tell you if what you are building is useful, your iteration cycles will be slow and painful. This advice might run contrary to conventional startup wisdom, which says you should assume you know nothing, talk to a hundred customers, and then build to solve their problem. That’s viable, but it’s not the only way to create something meaningful.
To build a certain kind of product, you need more vision and aesthetic opinion. In open source, you’ll see these shining monuments to technology, and why? Because someone cared enough to make them both functional and beautiful. Hyperparam is the data transformation tool I wished existed. I’m building for an audience of one: myself. When you build something you want to use yourself, you often end up building something others didn’t realize they needed.
Hyperparam is building an AI-assisted tool for working with large text datasets. The product includes a viewer for parquet (and csv and jsonl) datasets, and a data assistant chat. Before launching the product we wanted to anticipate problems that may crop up.
The fastest way there was to generate simulated data with realistic, diverse, edge-case conversations that exposed how our data agent behaved across user types and intents. Once we could generate this data, we had a quick way to interrogate the simulated dataset in order to slice, flag, and transform the conversations into a usable dataset for follow-on fine-tuning (or continued analysis).
Plan:
Simulate 10 realistic personas and conversations using Snowglobe.
Explore and transform that dataset interactively with Hyperparam.
Isolate conversations where the agent recommended using Python versus general analytical queries.
Prepare the subset for evaluation or fine-tuning.
Snowglobe is a simulation engine for conversational AI teams. You define who your users are, what they want, and how they behave. Snowglobe auto-generates thousands of realistic interactions with your model or API endpoint. Think of it as a load test for reasoning or dialogue, not just latency. Snowglobe uses the information from an application description to create data that’s useful for your specific app. For this blog post, we created an application with the system prompt from Hyperparam. It’s long, but the short version looks like: “This chatbot allows users to chat with their data, pulling out insights and statistics. The data looks like this:
In our example, we’re simulating users of Hyperparam, a data exploration tool. We want personas that mirror the real user base: data engineers, data analysts, and tinkerers with different levels of skill and temperament. To create personas like these, we can start with a “Simulation prompt”. For example: “Users are data engineers, scientists, and analysts asking questions about their data.”
This prompt results in personas like the following, which vary in objective, tone, and style.
personas:
- name: "Hands-On Data Explorer"
description: Loves examples, learns by doing.
- name: "Skeptical Analyst"
description: Double-checks every step, asks 'why'.
- name: "Product Engineer"
description: Wants quick, applied answers.
- name: "Aha Moment Seeker"
description: Prefers conceptual explanations.
...

This provides us with conversation templates tied to our product use cases:
scenarios:
- topic: "data analysis"
prompt: "How can I explore this dataset for outliers?"
- topic: "python vs sql"
prompt: "Should I use Python or write an analytic query?"

Snowglobe will orchestrate multi-turn dialogues between each persona and our model endpoint, generating text logs, metadata, and structured output (JSONL).
When the run completes, we have a dataset with 10 personas × 200 conversations each — 2000 total dialogues, complete with role labels, timestamps, and message-level metadata.
Hyperparam is an interactive, browser-based dataset explorer purpose-built for ML workflows. It opens local or remote Parquet, JSONL, CSV files instantly and lets you filter, transform, and visualize data directly: no heavy Jupyter notebooks required.
Drop file directly in the explorer
Hyperparam renders the dataset in an interactive table. You can scroll through conversations, inspect columns like persona, topic, or assistant_message, and even preview message trees. Conversation view allows for easy visual exploration.

One of the things we noticed right off the bat was that sometimes a user would ask a question and the model would suggest leaving the product and using a Python script instead. This is not what we want. Sometimes, though, the user asks a question that genuinely needs to be done off-platform. What we’d really like to find are the conversations that could have been solved on-platform, but where the model recommended Python anyway.
Now we want to detect when the assistant recommended Python code versus an analytic query in its replies.
In Hyperparam, you can do this with the hyperparam chat: a prompt-based operation that adds a new computed column.
Prompt: “Add columns: flag when a model suggests the user run python themselves, or makes a general analytic-style query instead of a transformation or filtering like we expect.”
The data agent runs across all sample rows, and decides to create two new boolean columns:
This single operation converts raw chat logs into labeled data.
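The shape of that operation can be sketched in a few lines. This is a hedged illustration of the computed-column idea, not Hyperparam’s implementation: the `classify` function here is a hypothetical regex stand-in for the LLM-backed labeling that the data agent actually performs.

```javascript
// Map a row-level classifier over the dataset to add computed columns.
function addComputedColumns(rows, classify) {
  return rows.map((row) => ({ ...row, ...classify(row) }));
}

// Hypothetical stand-in classifier (a real run would prompt an LLM).
const classify = (row) => ({
  suggested_user_python: /\bpython\b/i.test(row.assistant_message),
  is_analytics_query: /\b(select|group by|aggregate)\b/i.test(row.assistant_message),
});

const labeled = addComputedColumns([
  { assistant_message: 'You could run a quick Python script for this.' },
  { assistant_message: 'Try an aggregate query over the column.' },
], classify);

// Keep rows where Python was suggested but an analytic query would have done:
const candidates = labeled.filter(
  (r) => r.suggested_user_python && !r.is_analytics_query
);
```

The filter at the end mirrors the later selection of `suggested_user_python=true && is_analytics_query=false` rows.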
After inspecting the two new columns, we wanted to extract the samples where it suggested using python but not as a general analytics query. This is our proxy for things our data-agent should be able to do but for some reason did not.
By visually inspecting the data we notice:
“Hands-On Data Explorer” and “Skeptical Analyst” personas trigger Python examples more frequently.
“Pragmatic Insight Seekers” get concise analytic answers.
Conversations recommending Python also tend to have longer message chains (higher cognitive load).
The dataset of [suggested_user_python=true && is_analytics_query=false] rows is good for further fine-tuning our data agent: these are examples of requests it should have handled on-platform but did not.
The combination of Snowglobe + Hyperparam closes a crucial loop for conversational AI teams:
| Stage | Tool | Outcome |
|---|---|---|
| Simulation | Snowglobe | Synthetic but realistic data across personas |
| Exploration | Hyperparam | Fast, visual filtering and labeling |
| Transformation | Hyperparam | LLM-assisted column creation |
| Iteration | Both | Repeat, evaluate, fine-tune |
This pipeline lets teams:
Build evaluation datasets before collecting real user data.
Debug reasoning patterns in synthetic interactions.
Scale up diverse conversational contexts without manual labeling.
Quickly explore and interact with the data sets.
Simulation is the new data collection.
When you can generate, label, and filter conversation data quickly, interactively and with precision, you gain the power to test your agent’s reasoning loops and UX outcomes before they ever reach a customer.
Snowglobe gives you the synthetic user base. Hyperparam gives you the interactive microscope.
Next steps:
Try snowglobe.so to generate your own synthetic conversations.
Explore the data instantly with hyperparam.app.
Following the common adage that “data quality determines model quality”, I did what every AI engineer does and tried to look at some training data hosted on Hugging Face. I did not care how, I just needed to see some data and interact with it - search around rows, sort, and otherwise get a feel for the data quality.
This is where my goal started to go off the rails… Most modern AI datasets are 10GB or more and are in parquet format – we’ll talk more about this later – which means you need to parse and open the file. No simple `less` would work. The most common tools to read parquet for easy viewing are pandas/polars and DuckDB. With some ChatGPT help, I was running the command to load the first 5 rows of data. As shown below, I sat there waiting… and waiting.

Modern data viewer tools take anywhere from 5 sec (DuckDB) to 57 sec (Pandas) to load just 10 rows of data. The HCI community largely agrees that the ideal time-to-first-interactivity is 500 ms [Liu, Heer 2014]. Why should that not hold for data? Why is it acceptable for data to take 20x longer to load than a webpage?
The rest of this blog describes my multi-month journey to hyper-optimize time-to-first-data for parquet files. I am still on this journey but, along the way, released Hyparquet, the most conformant browser-based parquet file reader in existence. It’s open source and, most importantly, can load my 10 rows of data in 150 ms.
Let’s take a step back and first understand where the runtime is going in existing data viewer tools. Take an oversimplified version of a simple pandas-backed data viewer and pretend it’s hosted in AWS, reading a parquet file from S3. Before the data even gets loaded, the user’s browser has to first hit CloudFront, go to an ELB, then get redirected to the Node.js frontend server of the data hosting service, go through another ELB, hit the backend server hosting the data, and then finally ping S3 and download the data. In total this takes about 40 sec of latency just to get the request and download the data. The data then gets parsed in the backend server (taking about 1 sec), and finally makes its way back to the user’s browser.
This diagram may feel complicated, but it’s drastically oversimplified compared to most real-world architectures, which add auth, logging, message brokers, and other systems that each contribute additional latency.
When optimizing this pipeline, most engineers only have control of the backend and spend time optimizing parsing. This can speed up the time-to-first-data a lot, but it’s not enough for me. I wanted to completely remove the latency before parsing even started.
Fundamentally, whenever you have a backend, you need layers of tooling on top. And backend servers are generally good ideas - they manage application state, can handle compute-heavy processes, and decouple the viewer from the data models. But I don’t care about any of that. A data viewer isn’t a feature in my application, it is the entire application. So what if I just removed everything backend-related and pointed the browser directly at S3? (Well, you still need CloudFront to optimize the SSL handshake.)
You would be left with just the browser talking straight to cloud storage:
With this architecture, you immediately save latency as you skip having to hit ELB and a backend. As an added bonus, it’s cheaper because you don’t need cloud costs to host the backend server and far simpler for developers to maintain.
This simplified architecture does leave two issues: (a) where does user state live so if they, for example, refresh a page, they don’t lose their location in the viewer and (b) you still need to parse a parquet file.
It turns out, if you use browser cookies and local storage, you can manage user state all in the browser. Sure, if the user clears their browsing history, they’re in trouble, but I’m okay with that. The parsing… well, I was just going to have to use a JavaScript parser instead. Or, as it turns out, build my own.
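A minimal sketch of what browser-only state persistence looks like — hedged: the function names and the state shape are hypothetical, and the `Storage`-like stub stands in for the browser’s real `localStorage`:

```javascript
// Persist viewer state (e.g. scroll position) client-side.
// `storage` is any object with the localStorage get/setItem API.
function saveViewerState(storage, key, state) {
  storage.setItem(key, JSON.stringify(state));
}
function loadViewerState(storage, key, fallback) {
  const raw = storage.getItem(key);
  return raw === null || raw === undefined ? fallback : JSON.parse(raw);
}

// In-memory stub mimicking the localStorage API (for illustration only;
// in the browser you would pass window.localStorage directly).
const stub = {
  map: new Map(),
  setItem(k, v) { this.map.set(k, String(v)); },
  getItem(k) { return this.map.has(k) ? this.map.get(k) : null; },
};

saveViewerState(stub, 'viewer', { scrollRow: 12345, file: 'data.parquet' });
const restored = loadViewerState(stub, 'viewer', { scrollRow: 0 });
// restored.scrollRow is 12345 after a simulated page refresh
```

With this, a page refresh restores the user’s position in the table without any server round-trip.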
At the beginning of 2024, when I started this quest, there were 3 libraries that could load parquet files from cloud storage directly into the browser: ParquetJS, ParquetWASM, and DuckDB WASM. And I had a goal to parse parquet files in under 500 ms. As shown below, none of these were fast, and ParquetJS wasn’t even maintained anymore.

Looking at this waterfall chart, we can see that all libraries take at least 600 ms to make a request and parse a parquet file. But they also show multiple opportunities for optimization. Let’s summarize some of the inefficiencies of the duckdb-wasm library; we’ll go into more detail below.
If I could optimize these pieces, I could achieve my 500 ms time-to-first-data. And I could do it in Kenny style: 100% JavaScript and no dependencies (because who doesn’t want to rebuild everything from scratch). Time to introduce Hyparquet.
Re-writing a parquet parser from scratch, how hard can it be?? It took about a week to be able to parse my first parquet file, which I thought was pretty good. The problem is that I kept finding more parquet files that I couldn’t open. Parquet is a sprawling format, with many features:
It took 6 months to parse ALL the parquet files.

Javascript is not exactly known as a high performance language. I think this reputation is undeserved. I’m not saying it’s going to beat rust in a benchmark. But with careful engineering and tactical use of modern browser apis, we can make decoding parquet in the browser surprisingly performant.
Let’s dive deeper into some of the mistakes made by other parquet libraries, and how we can make it better in the browser:
Engine Size – DuckDB-WASM requires downloading and compiling several megabytes of WebAssembly, incurring seconds of startup delay before queries can run. That’s seconds where your user sees… nothing.
Could we get a performance advantage from starting with less? Every kilobyte of WASM adds startup latency. Hyparquet’s core engine is only 10KB (minified, gzipped), dramatically reducing startup latency, and is substantially easier to bundle. By narrowing the focus strictly to Parquet parsing with pushdown filters, we achieve near-instant initialization.
We also save an entire round-trip loading the wasm blob:

Smart Metadata Fetching – In parquet, the metadata is stored in the footer of the file. So in order to fetch the metadata, you might naively make at least three requests:
This is what parquet-wasm and parquetjs do:
1. HEAD request to get the file size
2. Fetch the last 8 bytes to get the metadata_length field
3. Fetch the metadata
But with hyparquet we actually do a little better: we can skip the second step. Rather than make an 8 byte round-trip fetch request, we optimistically fetch 512kb of the footer of the file. 99% of the time that includes the entire metadata. In the rare cases where this initial request fails to include all the metadata, we use the metadata length in the footer and make another request for just the remainder of the missing metadata. On http over the internet, an 8 byte fetch takes almost the same amount of time as a 512 kb request.
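The range planning described above can be sketched as two pure functions (names are illustrative, not hyparquet's actual API). A parquet file ends with the metadata, then a 4-byte `metadata_length`, then the 4-byte `PAR1` magic, so given the file size we can plan the optimistic first request and decide whether a follow-up is needed:

```javascript
// Sketch of the optimistic footer fetch.
const FOOTER_GUESS = 512 * 1024 // optimistic first request size: 512 KB

// Plan the first request: the last min(fileSize, 512 KB) bytes.
function planInitialRange (fileSize) {
  const start = Math.max(0, fileSize - FOOTER_GUESS)
  return { start, end: fileSize - 1 } // inclusive HTTP Range bounds
}

// After the first response, decide if a follow-up request is needed.
// metadataLength is read from the 4 bytes before the trailing PAR1 magic.
function planRemainder (fileSize, metadataLength, fetchedBytes) {
  const metadataStart = fileSize - 8 - metadataLength
  const fetchedStart = Math.max(0, fileSize - fetchedBytes)
  if (metadataStart >= fetchedStart) return undefined // common case: we already have it all
  return { start: metadataStart, end: fetchedStart - 1 } // fetch only the missing prefix
}
```

Most of the time `planRemainder` returns `undefined` and the metadata came back in a single round-trip.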
Parallelization – Traditional databases fetch data sequentially. But browsers can handle 6+ concurrent HTTP connections. Hyparquet leverages parallel HTTP range requests, retrieving only needed portions of the Parquet file (specific columns or row groups) in parallel. This overlap of I/O helps reduce wall-clock latency for data access.
Duckdb uses a different (and much worse) algorithm: it does a sequence of exponentially increasing request sizes, all in series (not parallel). This is fine when you’re reading from local disk but is pathological when loading over the network:

Use the Metadata – Hyparquet employs predicate pushdown by analyzing Parquet metadata (schema, column statistics). This allows it to identify and skip irrelevant row groups entirely, reducing network load and improving speed. This isn’t new—every modern columnar database does this. But when network latency is your enemy, skipping even one unnecessary 25MB column chunk can save seconds.
It’s worth mentioning that by default parquetjs does NOT do this. In fact, neither does python! The default pyarrow and pandas parquet readers WILL READ THE ENTIRE FILE. I had to tweak parquetjs to make it load partial data at all. [2]
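The row-group pruning idea can be sketched as follows (the data shapes here are simplified for illustration, not hyparquet's internal types). Each row group carries per-column min/max statistics, and a range filter can rule out whole row groups before any data is fetched:

```javascript
// Sketch of predicate pushdown: keep only row groups whose column statistics
// could possibly satisfy the filter; skip the rest entirely (no fetch, no decode).
function selectRowGroups (rowGroups, filter) {
  return rowGroups.filter(rg => {
    const stats = rg.columns[filter.column]
    if (!stats) return true // no statistics: must read the row group
    if (filter.min !== undefined && stats.max < filter.min) return false
    if (filter.max !== undefined && stats.min > filter.max) return false
    return true // min/max ranges overlap: row group may contain matches
  })
}
```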
Async Everything – JavaScript might be the world’s most async-friendly language. We utilize this to return whatever data is ready first. Parquet is a column-oriented format, so if rows are being emitted from a cursor object, you’re making users wait for ALL the columns to load before returning any data to the user. Hyparquet can return data asynchronously whenever it’s ready (but provides helpers for row-oriented data if that’s needed).
Compression That Doesn’t Suck – Standard JavaScript Snappy decompression was too slow, so we implemented HySnappy, a WebAssembly-based decoder that’s 40% faster yet adds minimal size (<4KB). This ensures decompression never becomes the performance bottleneck.
The problem with WASM is that it normally adds an extra round-trip fetch request for the wasm file. We improved this using a little-known browser trick: you can synchronously load wasm if and only if it is less than 4kb! So we wrote our own snappy decompression library, with no dependencies, not even memcpy, and definitely no emscripten. This makes hysnappy super easy to bundle, deploy, and load.
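The synchronous-load trick looks like this. The bytes below are a minimal hand-written wasm module exporting `add(a, b)` (a stand-in for illustration, not hysnappy itself); because the buffer is tiny, the synchronous `WebAssembly.Module` constructor is allowed even on the main thread:

```javascript
// Sketch: synchronous WebAssembly instantiation, no async fetch round-trip.
// Browsers only permit this on the main thread for small buffers (Chrome caps
// it at 4 KB), which is why keeping the decoder under 4 KB matters.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // function section: one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: i32.add
])

const module = new WebAssembly.Module(wasmBytes) // synchronous: no await needed
const instance = new WebAssembly.Instance(module)
```

In practice the wasm bytes can be inlined into the javascript bundle, so the decoder is ready the instant the script runs.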
This obsession with latency has real-world implications. Where DuckDB-Wasm might take several seconds just to initialize its query engine, Hyparquet can produce visible results on multi-gigabyte datasets in under a second.

Hyparquet demonstrates a shift toward treating the browser as a fully capable query processor operating directly on data stored in cloud storage, suggesting a new paradigm for database research:
This inverts traditional assumptions:
Hyparquet’s extreme minimalism sets a new benchmark for browser-native analytics. But what are the broader implications?
Hyparquet enables ML researchers and data analysts to interactively explore large datasets directly in the browser, eliminating the need for traditional backend setup or data infrastructure management.
Hyparquet also allows data analysis over a much-simplified infrastructure: By removing backend databases, there’s less infrastructure to maintain, simpler developer experience, and faster user experience.
If you remember where this started, I wanted users to see the first few rows of a large AI parquet dataset in under a second. But I started looking at data because I wanted to train a model. I obsessed over the first step of the AI data curation pipeline because it was painful. But I didn’t stop there. I founded Hyperparam, a company built on this paradigm of hyper optimization of browser native applications for data curation. Hyperparam’s goal is for users to build their training, evaluation, or RAG datasets with the seamless interactivity of the browser they are accustomed to for non data-intensive tasks. Our motto - “javascript can do it too”
Want to see Hyparquet in action? Head to https://hyperparam.app, drop any Parquet file or url, and watch your data appear instantly.

Everyone wants better AI models: smarter, cheaper, and with style. How does one achieve that? Whether you’re a mega-scale AI company, or a small enterprise team, the only real lever for making better models is to construct a better training set.
How do you build a better training set? This is a question that has always been one of the most challenging, and labor-intensive parts of the data science process.

Why is data cleaning and data understanding so time-consuming? Because current tools often miss three key capabilities: 1) very fast free-form data exploration by the user, which is key to finding insights in your data, 2) AI models that assist in looking at volumes of data that would be impractical for a person, and 3) the ability to run simply and locally in the browser, without depending on complex services and data pipelines. Instead, most tools are built around Python, arguably the worst language for creating modern, compelling UIs and tools. This might seem controversial, but think about the most common interface for python: Jupyter Notebooks. Notebooks are great for iteration and experimentation, but they are extremely weak when it comes to interactive data exploration. If you’ve ever tried to open a parquet file (the most common format for modern ML datasets) in a notebook, it looks like this:

This table is practically useless. You can’t paginate to the next set of rows. You can’t even see the entire data in a cell (which in this case is an entire github source file). So how are you supposed to get an intuitive sense of your data if you can’t even see it?
Can we do better? If you want to build a highly performant user interface, there is only one choice: JavaScript. The browser is the only place for building modern UIs.
The problem is that ML datasets are massive (often multiple gigabytes of compressed text data), so it’s not obvious if it’s even possible to work with large scale datasets in the browser. However, by using modern data formats like Apache Parquet, and clever frontend engineering, it is in fact possible to work with massive datasets directly in the browser.
Aside: Apache Parquet is a column-oriented data format with a built-in index. This allows tools like hadoop and duckdb to efficiently query parquet datasets without having to retrieve all the data. Furthermore, it allows running these queries without a server, simply by putting the parquet files in a storage service like S3. What if you could do this same trick in the browser, and pull in just the data needed to render the current view? Hello Hyparquet.
Hyparquet is a new JavaScript parquet parser which can efficiently query against parquet files stored in the cloud. This enables the creation of a new type of client-side only parquet data viewer which is significantly faster than anything that could be done with a server.
The goal here is to get data engineers to look at their data 👀 Anyone who has worked with data for a model before knows that looking at your data is the key to understanding the domain you’re trying to model, and it is virtually impossible to do good data science without looking at your data. Looking at your data is the easiest way to find data and model issues, and is a constant source of ideas of how to improve them.
This is one of the core workflows in data science: build a model, see what data was correctly or incorrectly modeled, fix the data and/or the model, and repeat. This is a repeatable, teachable process! And if it can be taught to a human data scientist, why can’t it be taught to a model to assist?
Can you use a model to assist with dataset curation? The challenges are two-fold: 1) How do you leverage human expertise to express what you want from the model? 2) These datasets are huge, so the cost of running a model across all the data is expensive.
You need the human in the loop to express their intent for the data. There is not just one definition of “good” versus “bad” data. What matters is the question “is this data useful for the model I’m trying to build?” This is where the UI comes in as a way to allow the user to look at the data, and use the data to express their intent.
As for the cost, we are entering a new era of LLMs where for the first time it is affordable to do dataset-scale inference in which you run an entire dataset through a model to help filter and label data. In 2023 it cost $5,000,000 USD to process 1 trillion input tokens with a sota model (gpt-4-turbo). In 2024 it cost $75,000 USD to process 1 trillion input tokens with a similar model (gpt-4o-mini). This trend will continue to make dataset-scale inference accessible to model builders. Model-based quality filtering has already been used by Meta to filter the training set for llama3 using labels generated by llama2 [1].
We’re entering a new era in which dataset-scale inference and interactive, browser-based data exploration will define how AI models are built and refined. By combining efficient data formats, high-performance JavaScript interfaces, and affordable AI-based annotations, teams can finally put data quality front and center without prohibitively high costs or clunky workflows.
The future belongs to those who seamlessly blend human expertise with AI-assisted insights—an approach that makes data cleaning faster, more intuitive, and ultimately, far more effective in powering the next generation of advanced AI models.
Ready to explore your machine learning data? Visit Hyperparam to start viewing and analyzing your datasets in seconds.
“Model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else. When you refer to “Lambda”, “ChatGPT”, “Bard”, or “Claude”, it’s not the model weights that you are referring to. It’s the dataset.” – jbetker @ openai
Machine learning models are only as good as the data they’re trained on. Our mission is simple: create the best training sets to build the world’s best models.
Everyone agrees that data quality is critical for building state-of-the-art models. But how do you build a great training dataset?
At Hyperparam we believe that it is impossible to do good data science without being intimately familiar with your training data. But where do you even start? Modern LLMs depend on terabytes of unstructured text data. Most data tools cannot handle this scale of data interactively, or require sampling to show only a tiny slice of your data.
If you want to build a highly interactive tool for working with data, the browser is the only tool for building modern UIs. The question is: can the browser handle massive text datasets interactively? Yes. By leveraging modern web APIs, and with an obsessive focus on speed and architecture, we are building the world’s most scalable UI for data.
Building a UI for machine learning data is a necessary first step, but does not solve the problem of finding good vs bad quality data within massive datasets. To find the “needle in a haystack” we use machine learning models to reflect back on their own training set. Everyone evaluates models – we evaluate data.
Combine this new scalable UI with methods for evaluating ML data, and you have a powerful engine for iteratively developing the world’s best quality models.
Ready to explore your machine learning data? Visit Hyperparam to start viewing and analyzing your datasets in seconds.