
Vince's CSV Parser


Motivation

I wanted a CSV library that was fast and reliable without forcing you into either:

  • A 1990s C-style API
  • A high-level wrapper that murders malloc() and your cache

This library tries to be fast for developers and fast for your computer.

Performance and Memory Requirements

This library combines SIMD-accelerated parsing, memory-mapped I/O, careful memory layout, minimal allocation, and background parsing to process large CSV files quickly, even when they exceed available RAM.

According to Visual Studio's profiler, this CSV parser spends almost 90% of its CPU cycles actually parsing your data, rather than waiting on hard disk I/O or shuffling memory around.

Show me the numbers

On my computer (12th Gen Intel(R) Core(TM) i5-12400 @ 2.50 GHz; Samsung 990 EVO), this parser can read

All benchmarks shown are warm cache runs to focus on parser/CPU performance rather than disk I/O variability.

Chunk Size Tuning

By default, the parser reads CSV data in 10MB chunks. 10MB was chosen after empirical testing to optimize throughput while minimizing memory and thread synchronization costs, but feel free to experiment with different numbers yourself.

A custom CSVFormat with chunk_size() can be passed to:

  • Shrink the chunk size (down to a minimum of 500KB)
  • Expand the chunk size (necessary if you encounter very large rows)
CSVFormat fmt;
fmt.chunk_size(100 * 1024 * 1024);  // 100MB chunks
CSVReader reader("massive_rows.csv", fmt);

Robust Yet Flexible

RFC 4180 and Beyond

This CSV parser is much more than a fancy string splitter, and parses all files following RFC 4180.

However, in reality we know that RFC 4180 is just a suggestion, so this library has:

  • Automatic delimiter guessing
  • Ability to ignore comments in leading rows and elsewhere
  • Ability to handle rows of different lengths
  • Ability to handle Windows, Unix, and old Mac newlines seamlessly

By default, rows of variable length are silently ignored, although you may elect to keep them or throw an error.

Encoding

This CSV parser is encoding-agnostic and will handle ANSI and UTF-8 encoded files. It does not try to decode UTF-8, except for detecting and stripping UTF-8 byte order marks (BOM).

Well Tested

This CSV parser has:

  • An extensive Catch2 test suite
  • Tests of various CMake and non-CMake builds across g++, clang, MSVC, and MinGW
  • Address, thread safety, and undefined behavior checks with ASan, TSan, and Valgrind (see GitHub Actions)

Bug Reports

I welcome genuine bug reports brought in good faith. This includes:

  • Crashes, memory leaks, data corruption, or race conditions
  • Incorrect parsing of valid CSV files
  • Performance regressions on real-world data
  • API issues that affect practical use cases

When reporting compiler or integration issues, please mention which form of the library you're using:

  • Single-header
  • Regular headers + your own build system
  • CMake

Note: Please keep reports focused on real-world problems. Questions about extremely edge-case behavior (e.g. "what should ,,, return?") do not belong in the issue tracker.

Documentation

In addition to the Features & Examples below, an extensive documentation site contains more examples, details, interesting features, and instructions for less common use cases.

Sponsors

If you use this library for work, please become a sponsor. Your donation will fund continued maintenance and development of the project.

Shameless plug: If you like this library, check out my side project experiencer — a WYSIWYG resume editor with clean HTML/CSS output.

Integration

This library was developed with Microsoft Visual Studio and is compatible with g++ 7.5+ and clang. All of the code required to build this library, aside from the C++ standard library, is contained under include/.

C++ Version

While C++20 is recommended, C++11 is the minimum version required. This library makes extensive use of string views, and uses Martin Moene's string view library if std::string_view is not available.

This library requires C++ exceptions to be enabled (for example, do not compile with -fno-exceptions).

SIMD acceleration is enabled by default when the build/compiler flags support it. If needed, you can force scalar-only parsing with CSV_NO_SIMD=ON in CMake or by defining CSV_NO_SIMD 1 before including the library headers.
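For non-CMake builds, the macro is defined before any csv-parser header, mirroring the threading switch below (macro name taken from the paragraph above):

```cpp
// Force scalar-only parsing in non-CMake builds
#define CSV_NO_SIMD 1
#include "csv.hpp"
```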

Threading Modes

By default, csv-parser uses a background thread to parse file-based input. If CMake cannot find a thread library, threading is disabled automatically.

You can also disable it explicitly:

CMake

set(CSV_ENABLE_THREADS OFF)
add_subdirectory(csv-parser)

Non-CMake (define the macro before any csv-parser header)

#define CSV_ENABLE_THREADS 0
#include "csv.hpp"

Single-threaded mode is useful for embedded targets, environments where std::thread is unavailable, and WebAssembly builds without pthreads. The public API is unchanged; parsing simply runs synchronously on the caller's thread.

Emscripten / WebAssembly

On Emscripten, CSV_ENABLE_THREADS is forced off and memory-mapped parsing is replaced by the stream-based parser. The filename constructor (CSVReader("file.csv")) still works—it opens an std::ifstream internally instead of using mmap.

Emscripten builds must keep C++ exceptions enabled. In practice, compile/link with exception support (for example, -fexceptions) and do not disable exception catching.

Single Header

📥 Download csv.hpp — Available on GitHub Pages

Or copy the URL:

https://vincentlaucsb.github.io/csv-parser/csv.hpp

The file is automatically generated and deployed on every commit to master, ensuring you always have the latest version.

CMake Instructions

If you're including this in another CMake project, you can simply clone this repo into your project directory, and add the following to your CMakeLists.txt:

# Optional: Defaults to C++ 17
# set(CSV_CXX_STANDARD 11)

# Optional: disable background parsing threads
# set(CSV_ENABLE_THREADS OFF)

add_subdirectory(csv-parser)

# ...

add_executable(<your program> ...)
target_link_libraries(<your program> csv)

Avoid cloning with FetchContent

Don't want to clone? No problem. There's also a simple example documenting how to use CMake's FetchContent module to integrate this library.

Features & Examples

Reading an Arbitrarily Large File (with Iterators)

With this library, you can easily stream over a large file without reading its entirety into memory.

C++ Style

#include "csv.hpp"

using namespace csv;

...

CSVReader reader("very_big_file.csv");

for (CSVRow& row: reader) { // Input iterator
    for (CSVField& field: row) {
        // By default, get<>() produces a std::string.
        // A more efficient get<string_view>() is also available, where the resulting
        // string_view is valid as long as the parent CSVRow is alive
        std::cout << field.get<>() << ...
    }
}

...

Old-Fashioned C Style Loop

...

CSVReader reader("very_big_file.csv");
CSVRow row;
 
while (reader.read_row(row)) {
    // Do stuff with row here
}

...

⚠️ IMPORTANT - Iterator Type and Memory Safety:
CSVReader::iterator is an input iterator (std::input_iterator_tag), NOT a forward iterator. This design enables streaming large CSV files (50+ GB) without loading them entirely into memory, but may fail with some standard algorithms that require forward iterators.

If you need to get around this, I suggest either loading all rows into an STL container, e.g. std::vector<CSVRow>, or using the DataFrame class which supports row and column random access.

Memory-Mapped I/O and Streams

When given a file path, CSVReader uses memory-mapped I/O, which is the most performant option.

However, most streams implementing std::istream are also supported, including std::stringstream, std::ifstream, and non-seekable streams. CSVReader can take a stream by reference, although passing an owning std::unique_ptr<std::istream> is recommended for memory safety.

Both memory-mapped and std::istream paths benefit from having a background parsing thread, unless disabled.

CSVFormat format;
// custom formatting options go here

CSVReader mmap("some_file.csv", format);

std::ifstream infile("some_file.csv", std::ios::binary);
CSVReader ifstream_reader(infile, format);

std::stringstream my_csv;
CSVReader sstream_reader(my_csv, format);

Indexing by Column Names

Retrieving values using a column name string is a cheap, constant-time operation. With EXACT matching (the default), the key must match the column name exactly; with CASE_INSENSITIVE, the key is normalized before lookup.

#include "csv.hpp"

using namespace csv;

...

// Optional: pass in a format to customize lookup behavior
// Defaults to EXACT matching
CSVFormat format;
format.column_names_policy(ColumnNamePolicy::CASE_INSENSITIVE);

CSVReader reader("very_big_file.csv", format);
double sum = 0;

for (auto& row: reader) {
    // Note: Can also use index of column with [] operator
    sum += row["Total Salary"].get<double>();
}

...

Numeric Conversions

If your CSV has lots of numeric values, you can also have this parser (lazily) convert them to the proper data type.

  • try_get<T>() is a non-throwing version of get<T>() which returns a bool indicating whether the conversion succeeded
  • Type checking is performed on conversions to prevent undefined behavior and integer overflow
    • Negative numbers cannot be blindly converted to unsigned integer types
  • get<float>(), get<double>(), and get<long double>() are capable of parsing numbers written in scientific notation.
  • Note: Conversions to floating point types are not currently checked for loss of precision.
#include "csv.hpp"

using namespace csv;

...

CSVReader reader("very_big_file.csv");

for (auto& row: reader) {
    int timestamp = 0;
    if (row["timestamp"].try_get(timestamp)) {
        // Non-throwing conversion
        std::cout << "Timestamp: " << timestamp << std::endl;
    }

    if (row["timestamp"].is_int()) {
        // Can use get<>() with any integer type, but negative
        // numbers cannot be converted to unsigned types
        row["timestamp"].get<int>();
        
        // You can also attempt to parse hex values
        long long value;
        if (row["hexValue"].try_parse_hex(value)) {
            std::cout << "Hex value is " << value << std::endl;
        }

        // Or specify a different integer type
        int smallValue;
        if (row["smallHex"].try_parse_hex<int>(smallValue)) {
            std::cout << "Small hex value is " << smallValue << std::endl;
        }

        // Numbers with a non-'.' decimal separator (e.g. a comma) can be handled this way
        long double decimalValue;
        if (row["decimalNumber"].try_parse_decimal(decimalValue, ',')) {
            std::cout << "Decimal value is " << decimalValue << std::endl;
        }

        // ..
    }
}

Converting to JSON

You can serialize individual rows as JSON objects, where the keys are column names, or as JSON arrays (which don't contain column names). The outputted JSON contains properly escaped strings with minimal whitespace and no quoting for numeric values. How these JSON fragments are assembled into a larger JSON document is an exercise left for the user.

#include <sstream>
#include "csv.hpp"

using namespace csv;

...

CSVReader reader("very_big_file.csv");
std::stringstream my_json;

for (auto& row: reader) {
    my_json << row.to_json() << std::endl;
    my_json << row.to_json_array() << std::endl;

    // You can pass in a vector of column names to
    // slice or rearrange the outputted JSON
    my_json << row.to_json({ "A", "B", "C" }) << std::endl;
    my_json << row.to_json_array({ "C", "B", "A" }) << std::endl;
}

Specifying the CSV Format

Although the CSV parser has a decent guessing mechanism, in some cases it is preferable to specify the exact parameters of a file.

#include "csv.hpp"
#include ...

using namespace csv;

CSVFormat format;
format.delimiter('\t')
      .quote('~')
      .header_row(2);   // Header is on 3rd row (zero-indexed)
      // .no_header();  // Parse CSVs without a header row
      // .quote(false); // Turn off quoting 

// Alternatively, we can use format.delimiter({ '\t', ',', ... })
// to tell the CSV guesser which delimiters to try out

CSVReader reader("weird_csv_dialect.csv", format);

for (auto& row: reader) {
    // Do stuff with rows here
}

Trimming Whitespace

This parser can efficiently trim off leading and trailing whitespace. Of course, make sure you don't include your intended delimiter or newlines in the list of characters to trim.

CSVFormat format;
format.trim({ ' ', '\t' });

Handling Variable Numbers of Columns and Empty Rows

Sometimes, the rows in a CSV are not all of the same length. Whether this was intentional or not, this library is built to handle all use cases.

CSVFormat format;

// Default: Silently ignore rows with zero, missing, or extraneous columns
format.variable_columns(false); // Short-hand
format.variable_columns(VariableColumnPolicy::IGNORE_ROW);

// Case 2a: Keeping variable-length rows and empty rows
format.variable_columns(true); // Short-hand
format.variable_columns(VariableColumnPolicy::KEEP);

// Case 2b: Keeping variable-length rows but dropping empty rows
format.variable_columns(VariableColumnPolicy::KEEP_NON_EMPTY);

// Case 3: Throwing an error if variable-length rows are encountered
format.variable_columns(VariableColumnPolicy::THROW);

Setting Column Names

If a CSV file does not have column names, you can specify your own:

std::vector<std::string> col_names = { ... };
CSVFormat format;
format.column_names(col_names);

Parsing an In-Memory String

#include "csv.hpp"

using namespace csv;

...

// Method 1: Using parse()
std::string csv_string = "Actor,Character\r\n"
    "Will Ferrell,Ricky Bobby\r\n"
    "John C. Reilly,Cal Naughton Jr.\r\n"
    "Sacha Baron Cohen,Jean Giard\r\n";

auto rows = parse(csv_string);
for (auto& r: rows) {
    // Do stuff with row here
}
    
// Method 2: Using the _csv operator
auto more_rows = "Actor,Character\r\n"
    "Will Ferrell,Ricky Bobby\r\n"
    "John C. Reilly,Cal Naughton Jr.\r\n"
    "Sacha Baron Cohen,Jean Giard\r\n"_csv;

for (auto& r: more_rows) {
    // Do stuff with row here
}

DataFrames for Random Access and Updates

For files that fit comfortably in memory, DataFrame provides fast and powerful keyed access, in-place updates, and grouping operations—all built on the same high-performance parser. It uses the same parsing pipeline as CSVReader but retains the results in memory for both row-wise and column-wise random access.

Creating a DataFrame with Keyed Access

#include "csv.hpp"

using namespace csv;

...

// Shortest form: pass a filename directly with DataFrameOptions
DataFrame<int> df("employees.csv",
    DataFrameOptions().set_key_column("employee_id"));

// Or construct from an existing CSVReader (e.g. when you need a custom format)
CSVReader reader("employees.csv");
DataFrame<int> df2(reader, "employee_id");

// O(1) lookups by key
auto salary = df[12345]["salary"].get<double>();

// Positional access: operator[](size_t) is disabled when KeyType is an integer
// type to prevent ambiguity with operator[](const KeyType&). Use iloc() instead.
auto first_row = df.iloc(0);
auto name = first_row["name"].get<std::string>();

// Check if a key exists
if (df.contains(99999)) {
    std::cout << "Employee exists" << std::endl;
}

Using DataFrameOptions for Fine-Grained Control

// Configure key column, duplicate-key policy, and missing-key behaviour
DataFrameOptions opts;
opts.set_key_column("employee_id")
    .set_duplicate_key_policy(
        DataFrameOptions::DuplicateKeyPolicy::KEEP_FIRST)  // or OVERWRITE / THROW
    .set_throw_on_missing_key(false);  // silently skip rows with no key value

DataFrame<int> df("employees.csv", opts);

Creating a DataFrame with a Custom Key Function

CSVReader reader("employees.csv");

// Build a composite key from two columns
auto make_key = [](const CSVRow& row) {
    return row["first_name"].get<std::string>() + "_" +
           row["last_name"].get<std::string>();
};

DataFrame<std::string> by_name(reader, make_key);

// Lookups by composite key
auto employee = by_name["Ada_Lovelace"]["department"].get<std::string>();

Updating Values

// Updates are stored in an efficient overlay without copying the entire dataset
df.set(12345, "salary", "95000");
df.set(67890, "department", "Engineering");

// Access methods return updated values transparently
std::cout << df[12345]["salary"].get<std::string>(); // "95000"

// Iterate with edits visible
for (auto& row : df) {
    std::cout << row["salary"].get<std::string>(); // Shows edited values
}

Grouping and Analysis

// Group by department
auto groups = df.group_by("department");
for (auto& [dept, row_indices] : groups) {
    double total_salary = 0;
    for (size_t i : row_indices) {
        total_salary += df[i]["salary"].get<double>();
    }
    std::cout << dept << " total: $" << total_salary << std::endl;
}

// Group using a custom function
auto by_salary_range = df.group_by([](const CSVRow& row) {
    double salary = row["salary"].get<double>();
    return salary < 50000 ? "junior" : salary < 100000 ? "mid" : "senior";
});

Writing Back to CSV

Each DataFrameRow has an implicit conversion to std::vector<std::string>, which is convenient when using CSVWriter.

// DataFrameRow has implicit conversion for CSVWriter compatibility
auto writer = make_csv_writer(std::cout);
for (auto& row : df) {
    writer << row;  // Outputs edited values
}

Writing CSV Files

Writing CSVs is powered by the generic DelimWriter, with helpful factory functions like make_csv_writer() and make_tsv_writer() that cut down on boilerplate.

#include "csv.hpp"
#include ...

using namespace csv;
using namespace std;

...

stringstream ss; // Can also use ofstream, etc.

auto writer = make_csv_writer(ss);
// auto writer = make_tsv_writer(ss);               // For tab-separated files
// DelimWriter<stringstream, '|', '"'> writer(ss);  // Your own custom format
// set_decimal_places(2);                           // How many places after the decimal will be written for floats

writer << vector<string>({ "A", "B", "C" })
    << deque<string>({ "I'm", "too", "tired" })
    << list<string>({ "to", "write", "documentation." });

writer << array<string, 3>({ "The quick brown", "fox", "jumps over the lazy dog" });
writer << make_tuple(1, 2.0, "Three");
...

You can pass arbitrary types to DelimWriter by defining a conversion function from that type to std::string.

C++20 Ranges: Efficient writing for CSVRow, DataFrameRow, and STL containers

If compiling with C++20 or later, DelimWriter uses std::ranges over string views for efficient, zero-copy writing.

You can still serialize CSVRow or DataFrameRow under older standards, but you will have to use the std::vector<std::string> conversion operator.