NoneBack
https://noneback.github.io/
Recent content on NoneBack · Hugo -- gohugo.io · en · @NoneBack All rights reserved · Mon, 10 Mar 2025 14:46:54 +0800
CPU Profiling: What, How, and When
https://noneback.github.io/blog/cpu-profiling-what-how-and-when/
Mon, 10 Mar 2025 14:46:54 +0800
<h2 id="what-what-is-cpu-profiling">What: What is CPU Profiling</h2>
<p>CPU profiling is a technique for analyzing a program&rsquo;s CPU performance. By collecting detailed data during execution (function call frequency, time spent, call stacks, etc.), it helps developers locate performance bottlenecks and optimize code. It is typically used for performance analysis and root-cause diagnosis.</p>
<h2 id="how-how-profiling-data-is-collected">How: How Profiling Data is Collected</h2>
<p>Common tools such as <code>perf</code> collect process stack information by sampling: they periodically capture the stack currently executing on the CPU, and the aggregated samples are then used for performance analysis.</p>
<pre tabindex="0"><code class="language-mermaid" data-lang="mermaid">graph TD
A[Sampling Trigger] -->|Interrupt| B[Sampling]
B -->|perf_event/ebpf| C[Process Stack Addresses]
C -->|Address Translation| D[ELF, OFFSET]
D -->|Symbol Resolution| E[Call Stack]
E -->|Formatting| F[pprof/perf script]
F --> |Visualization| G[Flame Graph/Call Graph]
</code></pre><h3 id="trigger-mechanisms">Trigger Mechanisms</h3>
<p>Sampling is generally triggered either by timer interrupts or by event-counter thresholds.</p>
<h4 id="timer-interrupts">Timer Interrupts</h4>
<p>A clock interrupt (e.g., <code>SIGPROF</code>) fires at a fixed frequency; shorter intervals increase precision but also overhead. Linux perf is commonly run at 99Hz (≈10.1ms intervals).</p>
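<p>The timer-interrupt idea can be illustrated in miniature with a process-local profiling timer. The sketch below is an illustration, not how perf itself is implemented: it installs a <code>SIGPROF</code> handler and arms <code>ITIMER_PROF</code> at roughly 99Hz, so each signal delivery corresponds to one sample of consumed CPU time. The function name is made up for this example.</p>

```cpp
#include <atomic>
#include <csignal>
#include <sys/time.h>

static std::atomic<int> g_samples{0};

// Each SIGPROF delivery is one "sample". A real profiler would capture
// the interrupted instruction pointer / stack here instead of counting.
static void on_sigprof(int) { g_samples.fetch_add(1, std::memory_order_relaxed); }

// Install a ~99Hz CPU-time timer, burn some CPU, return the sample count.
int count_sigprof_samples(long iterations) {
  g_samples.store(0);
  std::signal(SIGPROF, on_sigprof);

  itimerval tv{};
  tv.it_interval.tv_usec = 10101;  // 1s / 99 ≈ 10.1ms
  tv.it_value.tv_usec = 10101;
  setitimer(ITIMER_PROF, &tv, nullptr);  // fires as CPU time is consumed

  volatile double sink = 0;  // volatile so the loop is not optimized away
  for (long i = 0; i < iterations; ++i) sink += i;

  itimerval off{};  // disarm the timer before returning
  setitimer(ITIMER_PROF, &off, nullptr);
  return g_samples.load();
}
```

<p>99Hz (rather than a round 100Hz) is commonly chosen so sampling does not run in lockstep with periodic 100Hz activity in the system.</p>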
<h4 id="event-counter-sampling">Event-Counter Sampling</h4>
<p>Sampling is triggered when a hardware performance counter (e.g., <code>PERF_COUNT_HW_CPU_CYCLES</code>) reaches a configured threshold. This is useful for analyzing hardware-level events such as cache misses.</p>
<h3 id="sampling-methods">Sampling Methods</h3>
<p>Stack sampling typically uses OS-kernel-provided interfaces such as eBPF or perf_event.</p>
<h4 id="ebpf-approach">eBPF Approach</h4>
<p>Using eBPF programs (e.g., <code>bpf_get_stackid</code>), both user-space and kernel-space call stacks can be captured directly, without a separate user-space unwinding step. This method retrieves the complete stack&rsquo;s instruction-pointer (IP) information.</p>
<h4 id="perf_event-approach">perf_event Approach</h4>
<p>The <code>perf_event_open</code> interface (used by the <code>perf record</code> command) captures the instruction pointer (RIP) at each sample. By default it records only the address currently executing, not the full call stack, so only the function hit by the sample can be resolved.</p>
<p>Example perf record output:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>node <span style="color:#ae81ff">3236535</span> 34397396.208842: <span style="color:#ae81ff">250000</span> cpu-clock:pppH: 110c800 v8::internal::Heap_CombinedGenerationalAndSharedBarrierSlow+0x0 <span style="color:#f92672">(</span>/root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span>node <span style="color:#ae81ff">3236535</span> 34397396.354632: <span style="color:#ae81ff">250000</span> cpu-clock:pppH: 7f7d63e87ef4 Builtins_LoadIC+0x574 <span style="color:#f92672">(</span>/root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node<span style="color:#f92672">)</span>
</span></span></code></pre></div><p>To obtain a full call stack, tools like libunwind perform stack unwinding. For example, <code>perf record -g</code> generates a full stack trace by unwinding the stack frames.</p>
<p>Example perf record -g output:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>node <span style="color:#ae81ff">3236535</span> 34397238.259753: <span style="color:#ae81ff">250000</span> cpu-clock:pppH:
</span></span><span style="display:flex;"><span> 7f7d44339100 <span style="color:#f92672">[</span>unknown<span style="color:#f92672">]</span> <span style="color:#f92672">(</span>/tmp/perf-3236535.map<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span> 18ea0dc Builtins_JSEntryTrampoline+0x5c <span style="color:#f92672">(</span>/root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span> 18e9e03 Builtins_JSEntry+0x83 <span style="color:#f92672">(</span>...<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> c7d43f node::Start+0x58f <span style="color:#f92672">(</span>...<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span> 7f7d6ba14d90 __libc_start_call_main+0x80 <span style="color:#f92672">(</span>/usr/lib/x86_64-linux-gnu/libc.so.6<span style="color:#f92672">)</span>
</span></span></code></pre></div><h3 id="address-translation">Address Translation</h3>
<p>The sampled address information corresponds to the process’s virtual addresses, such as:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>7f7d44339100
</span></span><span style="display:flex;"><span>18ea0dc
</span></span><span style="display:flex;"><span>18e9e03
</span></span><span style="display:flex;"><span>106692b
</span></span><span style="display:flex;"><span>10679c4
</span></span><span style="display:flex;"><span>f2a090d
</span></span><span style="display:flex;"><span>c1c738
</span></span><span style="display:flex;"><span>...
</span></span></code></pre></div><p>To resolve these addresses into ELF + OFFSET for symbol translation, we use the memory mapping information from <code>/proc/[pid]/maps</code>. The key fields in the maps file include:</p>
<p>Each maps entry lists the address range, permissions, file offset, device, inode, and the backing file path.</p>
<p>Example /proc/[pid]/maps entries:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>00400000-00b81000 r--p <span style="color:#ae81ff">00000000</span> fc:03 <span style="color:#ae81ff">550055</span> /root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node
</span></span><span style="display:flex;"><span>7f7d6bf3c000-7f7d6bf3d000 ---p 0021a000 fc:03 <span style="color:#ae81ff">67</span> /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
</span></span><span style="display:flex;"><span>7f7d6bf61000-7f7d6bf63000 r--p <span style="color:#ae81ff">00000000</span> fc:03 <span style="color:#ae81ff">2928</span> /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
</span></span></code></pre></div><h4 id="translation-process">Translation Process</h4>
<ol>
<li>Match the virtual address to the appropriate memory segment in <code>/proc/[pid]/maps</code>.</li>
<li>Calculate the offset within the ELF file using:
<code>offset = virtual_address - segment_start + file_offset</code></li>
</ol>
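<p>As a sketch of these two steps (with a hand-rolled parser for the maps-line layout shown above; the struct and function names are invented for this example), the following matches a sampled virtual address against a parsed <code>/proc/[pid]/maps</code> segment and computes the ELF file offset:</p>

```cpp
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct Mapping {
  uint64_t start, end, file_offset;
  std::string path;  // ELF file backing the segment
};

// Parse one /proc/[pid]/maps line, e.g.
// "00400000-00b81000 r--p 00000000 fc:03 550055 /path/to/node"
Mapping parse_maps_line(const std::string& line) {
  Mapping m;
  std::istringstream in(line);
  char dash;
  std::string perms, dev, inode;
  in >> std::hex >> m.start >> dash >> m.end;           // address range
  in >> perms >> std::hex >> m.file_offset;             // perms, file offset
  in >> dev >> inode >> m.path;                         // device, inode, path
  return m;
}

// Step 1: find the segment containing the address.
// Step 2: offset = virtual_address - segment_start + file_offset.
std::optional<std::pair<std::string, uint64_t>> translate(
    uint64_t vaddr, const std::vector<Mapping>& maps) {
  for (const auto& m : maps)
    if (vaddr >= m.start && vaddr < m.end)
      return std::make_pair(m.path, vaddr - m.start + m.file_offset);
  return std::nullopt;  // address not backed by any mapping
}
```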
<h3 id="symbol-resolution">Symbol Resolution</h3>
<p>After translating virtual addresses into <code>ELF + OFFSET</code> pairs, the next step is resolving these offsets into human-readable function symbols. This involves leveraging symbol tables or debugging information embedded in the ELF files.</p>
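<p>Conceptually, resolution is a range lookup: the symbol table maps each function&rsquo;s start offset to its name, and a sampled offset resolves to the nearest preceding symbol plus a delta, which is exactly the <code>name+0x...</code> form perf prints. A minimal sketch (the symbol values here are hypothetical, not from a real binary):</p>

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// symbol start offset -> name, as extracted from .symtab / .dynsym
using SymbolTable = std::map<uint64_t, std::string>;

// Resolve an ELF offset to "name+0xdelta", like perf's output.
std::string resolve(const SymbolTable& syms, uint64_t offset) {
  auto it = syms.upper_bound(offset);  // first symbol starting AFTER offset
  if (it == syms.begin()) return "[unknown]";
  --it;                                // nearest symbol at or before offset
  char buf[32];
  std::snprintf(buf, sizeof(buf), "+0x%llx",
                (unsigned long long)(offset - it->first));
  return it->second + buf;
}
```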
<h4 id="methods-for-symbol-resolution">Methods for Symbol Resolution</h4>
<ol>
<li>Using symbol tables: tools like <code>nm</code> can extract symbol information from the <code>.dynsym</code> (dynamic symbol table) or <code>.symtab</code> (static symbol table) sections of an ELF file.</li>
</ol>
<p>Example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Extract malloc-related symbols from a Node.js binary</span>
</span></span><span style="display:flex;"><span>nm -D /path/to/node | grep malloc
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Output:</span>
</span></span><span style="display:flex;"><span>00000000055f9d18 D ares_malloc
</span></span><span style="display:flex;"><span>0000000001f1a2a0 T ares_malloc_data
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> U malloc@GLIBC_2.2.5
</span></span></code></pre></div><ol start="2">
<li>Using DWARF debugging information: DWARF debug data provides richer details, including source file locations and variable scopes. Tools like <code>readelf</code> or <code>addr2line</code> can parse this information.</li>
</ol>
<p>Example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Extract function names and source locations from DWARF info</span>
</span></span><span style="display:flex;"><span>readelf --debug-dump<span style="color:#f92672">=</span>info /path/to/node | grep <span style="color:#e6db74">"DW_AT_name"</span> -A3
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Output:</span>
</span></span><span style="display:flex;"><span><1><1980>: DW_AT_name: uv__make_close_pending
</span></span><span style="display:flex;"><span> DW_AT_decl_file: <span style="color:#ae81ff">19</span>
</span></span><span style="display:flex;"><span> DW_AT_decl_line: <span style="color:#ae81ff">247</span>
</span></span></code></pre></div><ol start="3">
<li>Demangling C++ symbols: C++ symbols are mangled (encoded) for uniqueness. Tools like <code>c++filt</code> restore human-readable names.</li>
</ol>
<p>Example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Demangle a mangled symbol</span>
</span></span><span style="display:flex;"><span>echo <span style="color:#e6db74">"_ZN4node14ThreadPoolWork12ScheduleWorkEv"</span> | c++filt
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Output:</span>
</span></span><span style="display:flex;"><span>node::ThreadPoolWork::ScheduleWork<span style="color:#f92672">()</span>
</span></span></code></pre></div><h3 id="stack-output-formatting">Stack Output Formatting</h3>
<p>Resolved stack traces are formatted for analysis tools like pprof or perf script. Additional metadata (e.g., container ID, service type) may be included for aggregation.</p>
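<p>For instance, the widely used &ldquo;folded&rdquo; stack format that flame-graph tooling consumes is simply each semicolon-joined stack with its sample count. A sketch of that aggregation step (the frame names are illustrative):</p>

```cpp
#include <map>
#include <string>
#include <vector>

// Collapse resolved stacks (root..leaf) into folded counts:
// "main;parse;malloc" -> 2, i.e. one entry per unique stack.
std::map<std::string, int> fold(
    const std::vector<std::vector<std::string>>& samples) {
  std::map<std::string, int> counts;
  for (const auto& stack : samples) {
    std::string key;
    for (const auto& frame : stack)
      key += (key.empty() ? "" : ";") + frame;  // join frames with ';'
    ++counts[key];                              // one sample for this stack
  }
  return counts;
}
```

<p>Flame-graph width is then proportional to each folded line&rsquo;s count.</p>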
<h3 id="data-visualization">Data Visualization</h3>
<p>All of the data above is eventually rendered as a flame graph or a call-chain graph.</p>
<h2 id="when-when-to-use-cpu-profiling-tools">When: When to Use CPU Profiling Tools</h2>
<p>CPU profiling is most effective when analyzing CPU-bound performance issues. Below are common scenarios and their workflows:</p>
<pre tabindex="0"><code class="language-mermaid" data-lang="mermaid">graph TD
A[Observe anomaly: Unavailability/Performance Jitter] --> B[Identify target process & timeframe]
B --> C[Check core metrics: CPU, memory, disk, QPS]
C --> D{Is CPU the bottleneck?}
D -->|Yes| E[Profile CPU stacks]
D -->|No| F[Use alternative tools e.g., memory profiler, I/O tracer]
E --> G[Analyze flame graphs/call chains]
G --> H[Root cause identified]
</code></pre><table>
<thead>
<tr>
<th>Scenario Category</th>
<th>Typical Symptoms</th>
<th>Tool Choices</th>
<th>Data Collection Strategy</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Sudden CPU Spikes</strong></td>
<td>Sawtooth-shaped CPU peaks in monitoring charts.</td>
<td>Continuous Profiling Systems</td>
<td>Capture 5-minute context before/after spikes + regular sampling.</td>
</tr>
<tr>
<td><strong>Version Performance Regression</strong></td>
<td>QPS/TPS drops post-deployment.</td>
<td>Differential FlameGraph</td>
<td>A/B version comparison sampling under identical loads.</td>
</tr>
<tr>
<td><strong>High CpuSys</strong></td>
<td>Elevated OS kernel CPU usage causing host instability.</td>
<td>FlameGraph/Call-Chain Graph</td>
<td>Regular sampling with kernel stack analysis.</td>
</tr>
</tbody>
</table>
<h3 id="when-cpu-profiling-is-not-suitable">When CPU Profiling Is NOT Suitable</h3>
<p>For non-CPU-bound issues, profiling data may have limited value. Alternative tools are recommended:</p>
<pre tabindex="0"><code class="language-mermaid" data-lang="mermaid">graph TD
A[CPU Profiling Limitations] --> B[Memory Bottlenecks]
A --> C[I/O-Bound Workloads]
A --> D[Lock Contention]
A --> E[Short-lived Processes]
B -->|Signs| B1(High page faults, GC pauses)
B -->|Tools| B2{{Heap profiler: e.g., pprof, vmstat}}
C -->|Signs| C1(High iowait, low CPU utilization)
C -->|Tools| C2{{iostat, blktrace}}
D -->|Signs| D1(High context switches, sys%)
D -->|Tools| D2{{perf lock, lockstat}}
E -->|Signs| E1(Process lifetime < sampling interval)
E -->|Tools| E2{{execsnoop, dynamic tracing: e.g., bpftrace}}
</code></pre><h2 id="references">References</h2>
<ul>
<li>
<p>code example: <a href="https://github.com/noneback/doctor">https://github.com/noneback/doctor</a></p>
</li>
<li>
<p>stack unwind: <a href="https://zhuanlan.zhihu.com/p/460686470">https://zhuanlan.zhihu.com/p/460686470</a></p>
</li>
<li>
<p>proc_pid_maps: <a href="https://man7.org/linux/man-pages/man5/proc_pid_maps.5.html">https://man7.org/linux/man-pages/man5/proc_pid_maps.5.html</a></p>
</li>
<li>
<p>dwarf: <a href="https://www.hitzhangjie.pro/debugger101.io/8-dwarf/">https://www.hitzhangjie.pro/debugger101.io/8-dwarf/</a></p>
</li>
<li>
<p>demangle & mangle: <a href="https://www.cnblogs.com/BloodAndBone/p/7912179.html">https://www.cnblogs.com/BloodAndBone/p/7912179.html</a></p>
</li>
</ul>
LevelDB MVCC
https://noneback.github.io/blog/leveldb-mvcc/
Sat, 08 Feb 2025 14:06:39 +0800
<p>LevelDB implements concurrent sstable read/write operations and snapshot reads through MVCC. Let’s examine its implementation.</p>
<h2 id="sequence-number">Sequence Number</h2>
<p>LevelDB uses Sequence Numbers as logical clocks to maintain a total order of KV write operations. The Sequence Number is encoded in the last few bytes of the InternalKey. This encoding ensures data ordering during memory writes.</p>
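<p>A sketch of that encoding (following LevelDB&rsquo;s InternalKey layout: the user key followed by a little-endian 64-bit tag packing a 56-bit sequence number and an 8-bit value type; the helper names here are illustrative, not LevelDB&rsquo;s own):</p>

```cpp
#include <cstdint>
#include <string>

enum ValueType : uint8_t { kTypeDeletion = 0, kTypeValue = 1 };

// Pack seq (56 bits) and type (8 bits) into a trailing tag,
// appended little-endian after the user key.
std::string make_internal_key(const std::string& user_key,
                              uint64_t seq, ValueType type) {
  uint64_t tag = (seq << 8) | type;
  std::string ikey = user_key;
  for (int i = 0; i < 8; ++i) ikey.push_back(char((tag >> (8 * i)) & 0xff));
  return ikey;
}

// Recover the sequence number from the trailing 8-byte tag.
uint64_t extract_sequence(const std::string& ikey) {
  uint64_t tag = 0;
  size_t n = ikey.size();
  for (int i = 0; i < 8; ++i)
    tag |= uint64_t(uint8_t(ikey[n - 8 + i])) << (8 * i);
  return tag >> 8;
}
```

<p>Because the tag sorts with the key, entries for the same user key order by sequence number, which is what makes snapshot lookups possible.</p>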
<p><img src="https://leveldb-handbook.readthedocs.io/zh/latest/_images/internalkey.jpeg"></p>
<h2 id="versioning">Versioning</h2>
<p>Every change to the sstable file collection triggers a version upgrade in LevelDB. Each Version represents the database state at a specific moment, containing sstable metadata and compaction-related information; a VersionEdit records the change between versions.</p>
<pre tabindex="0"><code>Version1 ---VersionEdit--> Version2
</code></pre><p>The VersionSet is represented as an ordered linked list of Versions, reflecting the database’s current and historical states. The LastSeq (last sequence number) and Version linked list are critical components.</p>
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/202502121042060.png"></p>
<p>The Version linked list tracks all stored Versions and their changes, with reference counting (RC) used for garbage collection. The Version Chain describes the total order of sstable write operations across different times.</p>
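<p>The reference-counting idea can be sketched as follows (simplified: LevelDB&rsquo;s <code>Version::Unref</code> additionally unlinks the version from the list under a mutex; the <code>deleted_flag</code> here is just a test hook standing in for &ldquo;this version&rsquo;s sstables may now be reclaimed&rdquo;):</p>

```cpp
// A simplified ref-counted Version: readers pin the Version they use,
// and it is only reclaimed once no reader holds a reference.
struct Version {
  int refs = 0;
  bool* deleted_flag;  // set when the version is reclaimed (test hook)

  explicit Version(bool* flag) : deleted_flag(flag) {}
  void Ref() { ++refs; }
  void Unref() {
    if (--refs == 0) {
      *deleted_flag = true;  // safe to GC obsolete sstables now
      delete this;
    }
  }
};
```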
<p>WAL + Manifest ensures atomic LastSeq updates. The Manifest file acts as a WAL for version changes, working with Version Commit operations to guarantee atomic updates of the latest version in VersionSet.</p>
<p>During version transitions:</p>
<ol>
<li>Write operations generate VersionEdit records in memory</li>
<li>Changes are written to the Manifest</li>
<li>VersionSet updates to the new Version</li>
</ol>
<p>This process ensures version consistency: compaction-induced version changes won’t affect ongoing read operations, and readers never access intermediate sstable states.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">VersionEdit</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">/** Other code */</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">typedef</span> std<span style="color:#f92672">::</span>set<span style="color:#f92672"><</span>std<span style="color:#f92672">::</span>pair<span style="color:#f92672"><</span><span style="color:#66d9ef">int</span>, <span style="color:#66d9ef">uint64_t</span><span style="color:#f92672">>></span> DeletedFileSet;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>string comparator_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> prev_log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> next_file_number_;
</span></span><span style="display:flex;"><span> SequenceNumber last_sequence_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_comparator_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_prev_log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_next_file_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_last_sequence_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>vector<span style="color:#f92672"><</span>std<span style="color:#f92672">::</span>pair<span style="color:#f92672"><</span><span style="color:#66d9ef">int</span>, InternalKey<span style="color:#f92672">>></span> compact_pointers_;
</span></span><span style="display:flex;"><span> DeletedFileSet deleted_files_;
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>vector<span style="color:#f92672"><</span>std<span style="color:#f92672">::</span>pair<span style="color:#f92672"><</span><span style="color:#66d9ef">int</span>, FileMetaData<span style="color:#f92672">>></span> new_files_;
</span></span><span style="display:flex;"><span>};
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Version</span> {
</span></span><span style="display:flex;"><span> VersionSet<span style="color:#f92672">*</span>vset_;
</span></span><span style="display:flex;"><span> Version<span style="color:#f92672">*</span> next_;
</span></span><span style="display:flex;"><span> Version<span style="color:#f92672">*</span> prev_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> refs_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>vector<span style="color:#f92672"><</span>FileMetaData<span style="color:#f92672">*></span> files_[config<span style="color:#f92672">::</span>kNumLevels];
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> FileMetaData<span style="color:#f92672">*</span> file_to_compact_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> file_to_compact_level_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">double</span> compaction_score_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> compaction_level_;
</span></span><span style="display:flex;"><span>};
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">VersionSet</span> {
</span></span><span style="display:flex;"><span> Env<span style="color:#f92672">*</span><span style="color:#66d9ef">const</span> env_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">const</span> std<span style="color:#f92672">::</span>string dbname_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">const</span> Options<span style="color:#f92672">*</span> <span style="color:#66d9ef">const</span> options_;
</span></span><span style="display:flex;"><span> TableCache<span style="color:#f92672">*</span> <span style="color:#66d9ef">const</span> table_cache_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">const</span> InternalKeyComparator icmp_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> next_file_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> manifest_file_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> last_sequence_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> prev_log_number_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> WritableFile<span style="color:#f92672">*</span> descriptor_file_;
</span></span><span style="display:flex;"><span> log<span style="color:#f92672">::</span>Writer<span style="color:#f92672">*</span> descriptor_log_;
</span></span><span style="display:flex;"><span> Version dummy_versions_;
</span></span><span style="display:flex;"><span> Version<span style="color:#f92672">*</span> current_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>string compact_pointer_[config<span style="color:#f92672">::</span>kNumLevels];
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="mvcc--snapshot-read">MVCC & Snapshot Read</h2>
<p>MVCC primarily resolves concurrent read and write conflicts on SSTables.</p>
<ul>
<li><strong>Memtable operations</strong>: reads and writes share a skip list, where conflicts naturally arise; these are handled by the skip list itself rather than by MVCC.</li>
<li><strong>SSTable operations</strong>: reads and writes (compaction and read operations) do not interfere with each other. Each write operation carries a Sequence Number, SSTables are only appended to, and compaction merely merges SSTables into new files.</li>
</ul>
<p>Each read operation is associated with a Sequence Number and Version. The Sequence Number ensures that subsequent writes are invisible to the read operation, and the associated Version ensures that the SSTables used during the read are not garbage collected. This guarantees that the read operation does not encounter changes to the SSTables it uses.</p>
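<p>The visibility rule itself is simple: a read taken at snapshot sequence <code>S</code> sees, for each user key, the newest entry whose sequence number is ≤ <code>S</code>. A sketch of that rule (the entry layout is simplified for illustration):</p>

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct Entry { uint64_t seq; std::string key, value; };

// Return the newest value for `key` visible at snapshot `snap_seq`:
// the entry with the largest seq that is still <= snap_seq.
std::optional<std::string> snapshot_get(const std::vector<Entry>& entries,
                                        const std::string& key,
                                        uint64_t snap_seq) {
  const Entry* best = nullptr;
  for (const auto& e : entries)
    if (e.key == key && e.seq <= snap_seq && (!best || e.seq > best->seq))
      best = &e;
  if (!best) return std::nullopt;  // key unwritten as of this snapshot
  return best->value;
}
```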
<p>If a read and a compaction operate on the same Version, the compaction completes first and only then installs a new Version, so the Version seen by the read never changes mid-read and the compaction cannot interfere with it. In every other case the read begins before the write, making the write invisible to the read and eliminating conflicts.</p>
<h2 id="reference">Reference</h2>
<p><a href="https://leveldb-handbook.readthedocs.io/en/latest/index.html">https://leveldb-handbook.readthedocs.io/en/latest/index.html</a></p>
<p><a href="https://noneback.github.io/en/blog/en/leveldb-write/">https://noneback.github.io/en/blog/en/leveldb-write/</a></p>
Prometheus--TSDB
https://noneback.github.io/blog/prometheus-tsdb/
Tue, 31 Dec 2024 01:10:28 +0800
<p>Having recently been promoted, I took a moment to summarize some of my previous work. A significant part of my job was building large-scale database observability systems, which are quite different from cloud-native monitoring solutions like Prometheus. Now I’m diving into the standard open-source monitoring system.</p>
<p>This article mainly discusses the built-in single-node time series database (TSDB) of Prometheus, outlining its TSDB design without delving into source code analysis.</p>
<p>Analyzing the source code of such projects is often of limited value unless one specializes in TSDBs: the details are quickly forgotten, and the code itself may not be exceptional.</p>
<h2 id="data--query-model">Data + Query Model</h2>
<ol>
<li>
<p>A <strong>single</strong> monitoring metric is described as a structure of time-dependent data, a timeseries.</p>
<p>$$
{timeseries} = \quad\lbrace \quad metric(attached\ with\ a\ set\ of\ labels) \Rightarrow (t_0,\ v_0),\ (t_1,\ v_1),\ (t_2,\ v_2),\ \ldots,\ (t_n, v_n) \quad\rbrace
$$</p>
</li>
<li>
<p>Queries utilize the ${identifier(metric\ +\ sets\ of\ selected\ labels\ value)}$ to retrieve the corresponding timeseries.</p>
</li>
</ol>
<pre tabindex="0"><code>series
^
│ . . . . . . . . . . . . . . . . . . . . . . {__name__="request_total", method="GET"}
│ . . . . . . . . . . . . . . . . . . . . . . {__name__="request_total", method="POST"}
│ . . . . . . .
│ . . . . . . . . . . . . . . . . . . . ...
│ . . . . . . . . . . . . . . . . . . . . .
│ . . . . . . . . . . . . . . . . . . . . . {__name__="errors_total", method="POST"}
│ . . . . . . . . . . . . . . . . . {__name__="errors_total", method="GET"}
│ . . . . . . . . . . . . . .
│ . . . . . . . . . . . . . . . . . . . ...
│ . . . . . . . . . . . . . . . . . . . .
v
<-------------------- time --------------------->
</code></pre><blockquote>
<p>The choice of identifier is crucial. Poor labeling can cause the number of timeseries to explode (high cardinality), especially in scenarios where containers are frequently rebuilt.</p>
</blockquote>
<h2 id="data-organization">Data Organization</h2>
<p>For cloud-native scenarios, what characteristics do monitoring data have?</p>
<ol>
<li>
<p>Short data lifecycle. The lifespan of individual containers is brief (e.g., in scaling scenarios or during extensive temporary tasks), leading to rapid timeseries growth along certain time dimensions.</p>
</li>
<li>
<p>Vertical writing with horizontal querying.</p>
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/20241231010014.png"></p>
</li>
</ol>
<p>With these issues in mind, let’s look at how the data files are organized to address or sidestep these problems.</p>
<p>First, examine the logical structure:</p>
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/20241231010031.png"></p>
<p>The entire database consists of blocks and a HEAD. Each block can further be broken down into chunks, while the HEAD serves as a read-write buffer area composed of in-memory data and write-ahead logs (WAL). A chunks file holds chunks from many timeseries, though each individual chunk belongs to exactly one timeseries.</p>
<p>The disk directory structure for a single block is as follows:</p>
<pre tabindex="0"><code>├── 01BKGV7JC0RY8A6MACW02A2PJD // the block's ULID
│ ├── chunks
│ │ └── 000001
│ ├── tombstones
│ ├── index
│ └── meta.json
├── chunks_head
│ └── 000001
└── wal
├── 000000002
└── checkpoint.00000001
└── 00000000
</code></pre><ul>
<li><strong>Block</strong>: Contains all data for a given time period (default 2 hours) and is read-only, named using a <a href="https://github.com/oklog/ulid">ULID</a>. Each block includes:
<ul>
<li>Chunks: fixed-size (max 128MB) chunks file</li>
<li>Index: index file mainly containing inverted index information</li>
<li>meta.json: metadata including block’s minTime and maxTime for data skipping during queries.</li>
</ul>
</li>
<li><strong>Chunks Head</strong>: The chunks file corresponding to the block currently being written, read-only, with a maximum of 120 data points and a maximum time span of 2 hours.</li>
<li><strong>WAL</strong>: Guarantees data integrity.</li>
</ul>
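<p>The inverted index maps each label pair to the sorted list of series IDs that carry it, and a selector with several matchers intersects those posting lists. A sketch of that lookup (the label values and index contents are hypothetical):</p>

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// "name=value" label pair -> sorted series IDs (a posting list)
using Index = std::map<std::string, std::vector<int>>;

// Intersect the posting lists of all requested matchers.
std::vector<int> select_series(const Index& idx,
                               const std::vector<std::string>& matchers) {
  if (matchers.empty()) return {};
  auto it = idx.find(matchers[0]);
  if (it == idx.end()) return {};
  std::vector<int> result = it->second;
  for (size_t i = 1; i < matchers.size(); ++i) {
    auto jt = idx.find(matchers[i]);
    if (jt == idx.end()) return {};  // no series carries this label
    std::vector<int> tmp;
    std::set_intersection(result.begin(), result.end(),
                          jt->second.begin(), jt->second.end(),
                          std::back_inserter(tmp));
    result = std::move(tmp);
  }
  return result;
}
```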
<blockquote>
<p>The diagram provides significant insights, such as how the WAL ensures data integrity; the Head acts similarly to a buffer pool in a TSDB, managing memory data for batch flushing to disk. When certain conditions are met (e.g., time threshold, data size threshold), the Head becomes immutable (block) and is flushed to disk.</p>
</blockquote>
<p>Overall, many design concepts in data organization resemble the LSM storage structure, which indeed suits TSDB well.</p>
<p>Prometheus’s design approach can be summarized as follows:</p>
<ul>
<li>Using time-based data partitioning to resolve the issue of short data lifecycles.</li>
<li>Using in-memory batching to handle scenarios where only the latest data is written.</li>
</ul>
<p>Setting aside similar aspects with LevelDB, let’s outline the differences.</p>
<p>First, the underlying models are different. LevelDB is a key-value store, while TSDB focuses on timeseries with a strong temporal connection, where time is monotonically increasing. It rarely writes historical data. Additionally, the query models differ; TSDB provides diverse query options, such as filtering timeseries based on various label set operations, necessitating more metadata for efficient querying.</p>
<p>Due to these requirements, new structures and functions are introduced: inverted indexes, checkpoints, tombstones, retention policies, and a compaction design distinct from the LSM key-value model. These will be analyzed in relation to the corresponding file formats.</p>
<h3 id="file-organization-format">File Organization Format</h3>
<p>Let’s examine the components; the specifics of the organizational method are not the focus of this article.</p>
<h4 id="metajson">meta.json</h4>
<p>This file describes the block; its most useful fields are the compaction record and the <code>minT</code>/<code>maxT</code> timestamps.</p>
<p><code>minT</code> and <code>maxT</code> record the block’s time range, allowing queries to skip blocks outside the queried window.</p>
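<p>Skipping reduces to an interval-overlap test per block. A minimal sketch (treating both the block range and the query range as closed intervals, which simplifies the real boundary conventions):</p>

```cpp
#include <cstdint>
#include <vector>

struct BlockMeta { int64_t minT, maxT; };  // from meta.json

// A block can contain samples for [qmin, qmax] iff the intervals overlap.
bool overlaps(const BlockMeta& b, int64_t qmin, int64_t qmax) {
  return b.maxT >= qmin && b.minT <= qmax;
}

// Keep only the blocks a query actually has to open.
std::vector<BlockMeta> select_blocks(const std::vector<BlockMeta>& blocks,
                                     int64_t qmin, int64_t qmax) {
  std::vector<BlockMeta> out;
  for (const auto& b : blocks)
    if (overlaps(b, qmin, qmax)) out.push_back(b);  // skip the rest
  return out;
}
```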
<p>Compaction records the block’s historical information, such as the number of compaction iterations (level) and its source blocks. The precise utility of this is uncertain but may help during compaction or retention tasks to manage potential duplicates.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"ulid"</span>: <span style="color:#e6db74">"01EM6Q6A1YPX4G9TEB20J22B2R"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"minTime"</span>: <span style="color:#ae81ff">1602237600000</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"maxTime"</span>: <span style="color:#ae81ff">1602244800000</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"stats"</span>: {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"numSamples"</span>: <span style="color:#ae81ff">553673232</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"numSeries"</span>: <span style="color:#ae81ff">1346066</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"numChunks"</span>: <span style="color:#ae81ff">4440437</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"compaction"</span>: {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"level"</span>: <span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"sources"</span>: [
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"01EM65SHSX4VARXBBHBF0M0FDS"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"01EM6GAJSYWSQQRDY782EA5ZPN"</span>
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"version"</span>: <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="chunks">chunks</h4>
<p>These are standard data files, with their indexes stored in the index file. Note that a chunk can only belong to one timeseries, and a timeseries consists of multiple chunks.</p>
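<p>To make the layout concrete, here is a rough, illustrative decoder for a single chunk entry (<code>len</code> as a uvarint, one encoding byte, the data, then a CRC32). Note the real on-disk format uses the Castagnoli CRC-32 polynomial; plain <code>zlib.crc32</code> and the checksum byte order used here are illustrative assumptions:</p>

```python
import zlib

def read_uvarint(buf, pos):
    # Decode an unsigned LEB128 varint starting at pos; return (value, new_pos).
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7

def read_chunk(buf, pos):
    # Parse one <len><encoding><data><crc32> entry from the chunks file.
    length, pos = read_uvarint(buf, pos)
    encoding = buf[pos]
    data = buf[pos + 1 : pos + 1 + length]
    crc = int.from_bytes(buf[pos + 1 + length : pos + 5 + length], "big")
    # Illustrative check only: the actual TSDB CRC uses the Castagnoli table.
    assert crc == zlib.crc32(bytes([encoding]) + data), "chunk corrupted"
    return encoding, data, pos + 5 + length
```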
<pre tabindex="0"><code>┌──────────────────────────────┐
│ magic(0x85BD40DD) <4 byte> │
├──────────────────────────────┤
│ version(1) <1 byte> │
├──────────────────────────────┤
│ padding(0) <3 byte> │
├──────────────────────────────┤
│ ┌──────────────────────────┐ │
│ │ Chunk 1 │ │
│ ├──────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────┤ │
│ │ Chunk N │ │
│ └──────────────────────────┘ │
└──────────────────────────────┘
Every Chunk:
┌───────────────┬───────────────────┬──────────────┬────────────────┐
│ len <uvarint> │ encoding <1 byte> │ data <bytes> │ CRC32 <4 byte> │
└───────────────┴───────────────────┴──────────────┴────────────────┘
</code></pre><h4 id="tombstone">tombstone</h4>
<p>This marks deleted data. A TSDB may see delete operations in scenarios such as transient jobs or container destruction, where business logic requires removal. Tombstones turn deletions into append-only writes instead of in-place modifications; the disk space is reclaimed later when blocks are compacted.</p>
<p>Of course, leaving deleted data in place is not harmful in itself; TTL-based expiration will eventually remove obsolete data.</p>
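<p>Conceptually, applying tombstones at read time is just an interval check per series; a minimal sketch assuming tombstones are (series ref, mint, maxt) triples as in the format below:</p>

```python
def is_deleted(tombstones, series_ref, ts):
    # A sample is masked if any tombstone for its series covers its timestamp.
    return any(ref == series_ref and mint <= ts <= maxt
               for ref, mint, maxt in tombstones)

tombstones = [(42, 1000, 2000)]  # (series ref, mint, maxt)
```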
<pre tabindex="0"><code>┌────────────────────────────┬─────────────────────┐
│ magic(0x0130BA30) <4b> │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │ Tombstone 1 │ │
│ ├──────────────────────────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Tombstone N │ │
│ ├──────────────────────────────────────────────┤ │
│ │ CRC<4b> │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
Every Tombstone:
┌────────────────────────┬─────────────────┬─────────────────┐
│ series ref <uvarint64> │ mint <varint64> │ maxt <varint64> │
└────────────────────────┴─────────────────┴─────────────────┘
</code></pre><h4 id="index-file">index file</h4>
<p>This file contains all information needed for reading, such as inverted indexes and the mapping of timeseries to chunks.</p>
<p>Notable structures include Series and Postings.</p>
<p>The Series section records, for each series, its labels and the references to its chunks within the block.</p>
<p>The Posting Offset Table lists the locations of inverted indexes. The actual inverted index content is stored in the Postings section.</p>
<blockquote>
<p>With set operations (intersection, union) over the inverted index, you can rapidly filter and retrieve the timeseries that match specified label criteria.</p>
</blockquote>
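<p>For instance, evaluating a selector with two label matchers boils down to intersecting two sorted postings lists of series references (a sketch, not the actual implementation; the postings values are hypothetical):</p>

```python
def intersect(a, b):
    # Merge-intersect two sorted postings lists of series references.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical postings for {job="api"} and {env="prod"}:
postings_job_api = [3, 5, 9, 12]
postings_env_prod = [5, 7, 12, 20]
matching_series = intersect(postings_job_api, postings_env_prod)
```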
<pre tabindex="0"><code>┌────────────────────────────┬─────────────────────┐
│ magic(0xBAAAD700) <4b> │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │ Symbol Table │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Series │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Label Index 1 │ │
│ ├──────────────────────────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Label Index N │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Postings 1 │ │
│ ├──────────────────────────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Postings N │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Label Offset Table │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Postings Offset Table │ │
│ ├──────────────────────────────────────────────┤ │
│ │ TOC │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
A Series:
┌──────────────────────────────────────────────────────┐
│ len <uvarint> │
├──────────────────────────────────────────────────────┤
│ ┌──────────────────────────────────────────────────┐ │
│ │ labels count <uvarint64> │ │
│ ├──────────────────────────────────────────────────┤ │
│ │ ┌────────────────────────────────────────────┐ │ │
│ │ │ ref(l_i.name) <uvarint32> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ ref(l_i.value) <uvarint32> │ │ │
│ │ └────────────────────────────────────────────┘ │ │
│ │ ... │ │
│ ├──────────────────────────────────────────────────┤ │
│ │ chunks count <uvarint64> │ │
│ ├──────────────────────────────────────────────────┤ │
│ │ ┌────────────────────────────────────────────┐ │ │
│ │ │ c_0.mint <varint64> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ c_0.maxt - c_0.mint <uvarint64> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ ref(c_0.data) <uvarint64> │ │ │
│ │ └────────────────────────────────────────────┘ │ │
│ │ ┌────────────────────────────────────────────┐ │ │
│ │ │ c_i.mint - c_i-1.maxt <uvarint64> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ c_i.maxt - c_i.mint <uvarint64> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ ref(c_i.data) - ref(c_i-1.data) <varint64> │ │ │
│ │ └────────────────────────────────────────────┘ │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────┤
│ CRC32 <4b> │
└──────────────────────────────────────────────────────┘
A Postings:
┌────────────────────┬────────────────────┐
│ len <4b> │ #entries <4b> │
├────────────────────┴────────────────────┤
│ ┌─────────────────────────────────────┐ │
│ │ ref(series_1) <4b> │ │
│ ├─────────────────────────────────────┤ │
│ │ ... │ │
│ ├─────────────────────────────────────┤ │
│ │ ref(series_n) <4b> │ │
│ └─────────────────────────────────────┘ │
├─────────────────────────────────────────┤
│ CRC32 <4b> │
└─────────────────────────────────────────┘
</code></pre><h2 id="accelerating-disk-queries">Accelerating Disk Queries</h2>
<p>Let’s focus on how a query locates the relevant data:</p>
<ul>
<li>First, it queries the Posting Offset Table to find the position of the corresponding label’s Postings.</li>
<li>Based on the information from the Postings, it identifies the chunk locations via the series reference.</li>
<li>Finally, it locates the corresponding chunks for the timeseries.</li>
</ul>
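<p>The three steps can be sketched end-to-end with toy in-memory stand-ins for the index sections (all structure names here are illustrative, not the on-disk layout):</p>

```python
def query(index, label, value, query_min_t, query_max_t):
    # Postings Offset Table -> Postings -> Series -> chunk references.
    offset = index["postings_offset_table"].get((label, value))
    if offset is None:
        return []
    series_refs = index["postings"][offset]
    chunks = []
    for ref in series_refs:
        series = index["series"][ref]
        # Keep only chunks whose time range intersects the query range.
        chunks.extend(c for c in series["chunks"]
                      if c["mint"] <= query_max_t and query_min_t <= c["maxt"])
    return chunks

index = {
    "postings_offset_table": {("job", "api"): 0},
    "postings": {0: [1]},
    "series": {1: {"chunks": [{"ref": "c0", "mint": 0, "maxt": 100},
                              {"ref": "c1", "mint": 100, "maxt": 200}]}},
}
```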
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/20241231005912.png">
<img src="https://github.com/noneback/images/blob/picgo/image.png?raw=true"></p>
<h2 id="compaction">Compaction</h2>
<p>Similar to LevelDB, Prometheus utilizes both major and minor compaction processes, termed Compaction and Head Compaction.</p>
<p>Head Compaction is essentially the process of persisting the in-memory Head into on-disk chunks, during which tombstoned data is actually dropped from memory.</p>
<p>Compaction is the merging of blocks, accomplishing multiple aims:</p>
<ol>
<li>Reclaiming disk resources used by marked deletions.</li>
<li>Consolidating duplicate information scattered across multiple blocks, such as shared chunks and inverted index records.</li>
<li>Enhancing query speed by resolving data that overlaps across different blocks; merging once during compaction is more efficient than merging in memory after every read.</li>
</ol>
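<p>The bookkeeping side of a block merge can be sketched directly from the meta.json fields shown earlier: the merged block widens the time range, bumps the compaction level, and carries the union of its sources (actual TSDB compaction also merges series, chunks, and indexes, which is omitted here):</p>

```python
def compact_meta(blocks):
    # Merge the metadata of several source blocks into that of the new block.
    return {
        "minTime": min(b["minTime"] for b in blocks),
        "maxTime": max(b["maxTime"] for b in blocks),
        "compaction": {
            "level": max(b["compaction"]["level"] for b in blocks) + 1,
            "sources": [s for b in blocks for s in b["compaction"]["sources"]],
        },
    }

b1 = {"minTime": 0, "maxTime": 7200, "compaction": {"level": 1, "sources": ["A"]}}
b2 = {"minTime": 7200, "maxTime": 14400, "compaction": {"level": 1, "sources": ["B"]}}
merged = compact_meta([b1, b2])
```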
<h3 id="when-does-compaction-occur">When does compaction occur?</h3>
<p>The official blog doesn’t clarify this well, merely mentioning it occurs when data overlaps. However, various triggers exist, including time-based triggers, checks at each minor compaction, tombstone size evaluations, and manual triggers, following strategies observed in LevelDB.</p>
<h2 id="retention">Retention</h2>
<p>This is straightforward: data expires via a time-based or size-based TTL. Integrating it into the compaction process could also be a viable approach.</p>
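<p>A minimal sketch of time-based retention, assuming whole blocks are the unit of deletion (which the minT/maxT metadata makes cheap to decide):</p>

```python
def apply_retention(blocks, now_ms, retention_ms):
    # A block is dropped once even its newest sample (maxTime) is past the TTL.
    cutoff = now_ms - retention_ms
    return [b for b in blocks if b["maxTime"] > cutoff]

blocks = [{"ulid": "old", "maxTime": 100}, {"ulid": "new", "maxTime": 250}]
```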
<h2 id="snapshot">Snapshot</h2>
<p>This process dumps the in-memory data to disk, likely designed to balance the heavy disk writes of metric data against data integrity; otherwise, its functionality would duplicate that of the WAL.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://web.archive.org/web/20210803115658/https://fabxc.org/tsdb/">https://web.archive.org/web/20210803115658/https://fabxc.org/tsdb/</a></li>
<li><a href="https://liujiacai.net/blog/2021/04/11/prometheus-storage-engine/">https://liujiacai.net/blog/2021/04/11/prometheus-storage-engine/</a></li>
<li><a href="https://tech.qimao.com/prometheus-tsdb-de-she-ji-yu-shi-xian-2/">https://tech.qimao.com/prometheus-tsdb-de-she-ji-yu-shi-xian-2/</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/">https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention/#compaction">https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention/#compaction</a></li>
</ul>
Borg: Large-scale Cluster Management at Google with Borg
https://noneback.github.io/blog/borg/
Mon, 19 Feb 2024 11:12:16 +0800https://noneback.github.io/blog/borg/<p>Borg is a cluster management system from Google; it can be viewed as a closed-source predecessor of Kubernetes (k8s).</p>
<ul>
<li>It achieves high utilization through admission control, efficient task packing, overcommitment, machine sharing, and process-level performance isolation.</li>
<li>It provides runtime features to reduce failure recovery time for high-availability applications and scheduling policies that reduce the probability of correlated failures.</li>
<li>It offers a declarative job description language, DNS integration, real-time job monitoring, and tools for analyzing and simulating system behavior, simplifying usage for end-users.</li>
</ul>
<p>The paper aims to introduce the system design and share the experiences Google has gained behind it. This blog mainly focuses on system design, specifically the services Borg offers in terms of SLA, its abstraction of workloads, resources, and scheduling.</p>
<h2 id="system-abstraction">System Abstraction</h2>
<p>Borg manages two primary workloads: long-running services and batch jobs, corresponding to two types of jobs (prod/non-prod). A job consists of several tasks, and different jobs have different priorities.</p>
<p>In terms of deployment architecture, a Borg cluster consists of several cells, each containing multiple machines.</p>
<p>For task scheduling, all physical or logical units on machines are treated as resources, including CPU, memory, IO, etc.</p>
<h2 id="system-architecture">System Architecture</h2>
<p><img alt="System Architecture" src="https://raw.githubusercontent.com/noneback/images/picgo/202401291404127.png"></p>
<p>Borg uses a master-slave architecture, consisting of a BorgMaster and several Borglet nodes. The scheduler is an independent service.</p>
<p><strong>BorgMaster</strong> is a logical node responsible for interacting with both external components and Borglets, as well as maintaining the internal state of the cluster. It uses Paxos to achieve multi-replication and high availability.</p>
<p><strong>Borglet</strong> is the Borg proxy on each machine in the cell. It is responsible for starting/stopping tasks, managing node physical resources, and reporting status.</p>
<p><strong>Scheduler</strong> is the service responsible for task scheduling. It uses the state recorded by the master to asynchronously handle task scheduling and informs the master for a secondary check.</p>
<h2 id="resource-scheduling">Resource Scheduling</h2>
<p>The scheduler is a key service in Borg. The quality of the scheduling algorithm directly affects resource utilization and is closely related to cost efficiency.</p>
<h3 id="basic-process">Basic Process</h3>
<p>The scheduling algorithm has two parts:</p>
<ul>
<li><strong>Feasibility Check</strong>: Finds a set of machines capable of running the task.</li>
<li><strong>Scoring</strong>: Selects the most suitable machine from that set.</li>
</ul>
<p>During the feasibility check, the scheduler finds a set of machines that meet task constraints and have enough available resources. Available resources include those already allocated to lower-priority tasks that can be preempted.</p>
<p>During the scoring phase, the scheduler determines the suitability of each feasible machine. Scoring considers user-specific preferences but primarily depends on built-in criteria, such as minimizing the number and priority of preempted tasks, selecting machines that already have the task package, distributing tasks across different power and failure domains, and optimizing packing quality (mixing high- and low-priority tasks on a single machine to allow high-priority tasks to expand during load spikes).</p>
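<p>The two-stage flow can be sketched as follows; the resource model and the scoring function are drastically simplified stand-ins for Borg&rsquo;s real criteria:</p>

```python
def schedule(task, machines):
    # Feasibility check: machines with enough free resources for the task.
    feasible = [m for m in machines
                if m["free_cpu"] >= task["cpu"] and m["free_mem"] >= task["mem"]]
    if not feasible:
        return None

    def score(m):
        # Toy scoring: prefer machines that already cache the task's package,
        # then the tightest fit (least stranded CPU).
        return (task["package"] in m["packages"], -(m["free_cpu"] - task["cpu"]))

    return max(feasible, key=score)

machines = [
    {"name": "m1", "free_cpu": 8, "free_mem": 16, "packages": set()},
    {"name": "m2", "free_cpu": 4, "free_mem": 8, "packages": {"web"}},
    {"name": "m3", "free_cpu": 1, "free_mem": 8, "packages": {"web"}},
]
```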
<p>The scheduler uses a cached copy of the cell state and performs the following steps repeatedly:</p>
<ol>
<li>Retrieves state changes (including assigned and pending jobs) from the elected master and updates its local copy.</li>
<li>Runs a round of scheduling to assign tasks and sends assignment information to the master.</li>
<li>The master accepts and applies the assignments, but if they are unsuitable (e.g., based on outdated state), it waits for the scheduler’s next round.</li>
</ol>
<h3 id="additional-aspects">Additional Aspects</h3>
<p>The paper also discusses how to provide oversubscription and handle performance contention, though these are not the focus of this blog. Readers can refer to the original paper for more details.</p>
<h2 id="references">References</h2>
<p><a href="https://www.cnblogs.com/hellojamest/p/16526159.html">https://www.cnblogs.com/hellojamest/p/16526159.html</a></p>
Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications
https://noneback.github.io/blog/percolator/
Thu, 28 Sep 2023 10:43:23 +0800https://noneback.github.io/blog/percolator/<p>It has been a while since I last studied, and I wanted to learn something interesting. This time, I’ll be covering Percolator, a distributed transaction system. I won’t translate the paper or delve into detailed algorithms; I’ll just document my understanding.</p>
<h2 id="percolator-and-2pc">Percolator and 2PC</h2>
<h3 id="2pc">2PC</h3>
<p>The Two-Phase Commit (2PC) protocol involves two types of roles: <strong>Coordinator and Participant</strong>. The coordinator manages the entire process to ensure multiple participants reach a unanimous decision. Participants respond to the coordinator’s requests, completing prepare operations and commit/abort operations based on those requests.</p>
<blockquote>
<p><strong>The 2PC protocol ensures the atomicity, consistency, and durability (ACD) of a transaction</strong> but does not implement <strong>isolation (I)</strong>; the ACD properties themselves rely on each participant&rsquo;s single-node transactions. The coordinator is clearly a single point of failure, which can become a bottleneck or block the protocol if it crashes.</p>
</blockquote>
<pre tabindex="0"><code> Coordinator Participant
QUERY TO COMMIT
-------------------------------->
VOTE YES/NO prepare*/abort*
<-------------------------------
commit*/abort* COMMIT/ROLLBACK
-------------------------------->
ACKNOWLEDGMENT commit*/abort*
<--------------------------------
end
</code></pre><h3 id="percolator">Percolator</h3>
<p>Percolator can be seen as an optimized version of 2PC, with some improvements such as:</p>
<ul>
<li>Optimizing the use of locks by introducing primary-secondary dual-level locks, which eliminates the reliance on a <strong>coordinator</strong>.</li>
<li>Providing full ACID semantics and supporting MVCC (Multi-Version Concurrency Control) through a timestamp service.</li>
</ul>
<h2 id="percolator-protocol-details">Percolator Protocol Details</h2>
<p>The Percolator system consists of three main components:</p>
<ul>
<li>
<p><strong>Client</strong>: The client initiating a transaction. It acts as the control center for the entire protocol and is the coordinator of the two-phase commit process.</p>
</li>
<li>
<p><strong>TO (Timestamp Oracle)</strong>: Responsible for assigning timestamps, providing unique, monotonically increasing timestamps to implement MVCC.</p>
</li>
<li>
<p><strong>Bigtable</strong>: Provides single-row transactions, storing data as well as some attributes for transactional control.</p>
<blockquote>
<p><code>lock + write + data</code>: for transactions, where <code>lock</code> indicates that a cell is held by a transaction, and <code>write</code> represents the data visibility.</p>
<p><code>notify + ack</code>: for watcher or notifier mechanisms.</p>
<p><img alt="https://raw.githubusercontent.com/noneback/images/picgo/20230927163910.png" src="https://raw.githubusercontent.com/noneback/images/picgo/20230927163910.png"></p>
</blockquote>
</li>
</ul>
<p>Externally, Percolator is provided to businesses through an SDK, offering transactions and R/W operations. The model is similar to <code>Begin Txn → Sets of RW Operations → Commit or Abort or Rollback</code>. Bigtable acts as the persistent component, hiding details about Tablet Server data sharding. Each write operation (including read-then-write) in the transaction is treated as a participant in a distributed transaction and may be dispatched to multiple Tablet Server nodes.</p>
<h3 id="algorithm-workflow">Algorithm Workflow</h3>
<p>All writes in a transaction are cached on the client before being written during the commit phase. The commit phase itself is a standard two-phase commit consisting of prewrite and commit stages.</p>
<h4 id="prewrite">Prewrite</h4>
<ol>
<li>Obtain a timestamp from TO as the start time of the transaction.</li>
<li>Lock the data, marking it as held by the current transaction. If locking fails, it means the data is held by another transaction, and the current transaction fails.</li>
</ol>
<blockquote>
<p>The locking process utilizes the primary-secondary mechanism, where one write is chosen as the <strong>primary</strong> and all others as <strong>secondary</strong>. The secondary locks point to the primary.</p>
<p><img alt="https://raw.githubusercontent.com/noneback/images/picgo/202309271613141.png" src="https://raw.githubusercontent.com/noneback/images/picgo/202309271613141.png"></p>
</blockquote>
<p>Clearly, data in the prewrite phase is invisible to other transactions.</p>
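<p>A toy model of the prewrite phase, with each cell holding the <code>lock</code>/<code>write</code>/<code>data</code> columns described above (error handling and cleanup of partially taken locks are omitted in this illustrative sketch):</p>

```python
def prewrite(store, writes, start_ts, primary):
    # Lock every written cell; secondaries point back at the primary key.
    for key, value in writes.items():
        cell = store.setdefault(key, {"lock": None, "write": {}, "data": {}})
        if cell["lock"] is not None:
            return False  # held by another transaction: write-write conflict
        if any(commit_ts > start_ts for commit_ts in cell["write"]):
            return False  # a newer committed version already exists
        cell["data"][start_ts] = value
        cell["lock"] = {"primary": primary, "start_ts": start_ts}
    return True
```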
<h4 id="commit">Commit</h4>
<ol>
<li>Attempt to commit the prewritten data. The commit starts with the primary record, whose commit time serves as the commit time for the entire transaction. First, the lock record is checked: if the lock no longer exists, the lock from the prewrite phase has been cleaned up by another transaction, and the current transaction fails. If the lock exists, the <code>write</code> column is updated to make the data visible to the system.</li>
</ol>
<blockquote>
<p>In an asynchronous network, single-node failures and network delays are common. The algorithm must detect and clean up these locks to avoid deadlocks. Therefore, in the commit phase, if a lock is found to be missing, it means that an issue occurred with a participant, and the current transaction must be cleaned.</p>
</blockquote>
<ol start="2">
<li>After successfully committing, clean up the lock record. Lock cleanup can be done asynchronously.</li>
</ol>
<p>These designs eliminate the dependency on a centralized <strong>coordinator</strong>. Previously, a centralized service was required to maintain information about all transaction participants. In this algorithm, the primary-secondary lock and the <code>write</code> column achieve the same goal. The <code>write</code> column indicates the visibility and version chain of the data, while the <code>lock</code> column shows which transaction holds the data. The primary-secondary locks record the logical relationship among participants. Thus, committing the primary record becomes the commit point for the entire transaction. Once the primary is committed, all secondary records can be asynchronously committed by checking the corresponding primary record’s <code>write</code> column.</p>
<h3 id="snapshot-isolation">Snapshot Isolation</h3>
<p>Two-phase commit ensures the atomicity of a transaction. On top of that, Percolator also provides <strong>snapshot isolation</strong>. In simple terms, snapshot isolation requires that committed transactions do not cause data conflicts and that read operations within a transaction satisfy snapshot reads. By leveraging the transaction start time and the primary commit time, a total ordering among transactions can be maintained, solving these issues naturally.</p>
<h3 id="deadlock-issues-in-asynchronous-networks">Deadlock Issues in Asynchronous Networks</h3>
<p>As mentioned earlier, in an asynchronous network, single-node failures and network delays are common. The algorithm must clean up locks to prevent deadlocks when such failures are detected. The failure detection strategy can be as simple as a timeout, causing the current transaction to fail. When a node fails and then recovers, its previous transaction has already failed, and the relevant lock records must be cleaned up. Lock cleanup can be asynchronous; for example, during the prewrite phase, if a record’s lock column is found to be non-empty, its primary lock can be checked. If the primary lock is not empty, it means the transaction is incomplete, and the lock can be cleaned up; if empty, the transaction has committed, and the data should be committed and the lock cleaned (RollForward).</p>
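<p>The rollback-versus-rollforward decision can be sketched as follows, modeling each cell as a dict with <code>lock</code>, <code>write</code> (mapping commit_ts to start_ts), and <code>data</code> (mapping start_ts to value) columns; this is an illustrative toy, not Percolator&rsquo;s actual code:</p>

```python
def resolve_lock(store, key, primary_key):
    # Decide the fate of a leftover lock by inspecting the primary.
    cell = store[key]
    lock = cell["lock"]
    primary = store[primary_key]
    if primary["lock"] is not None and primary["lock"]["start_ts"] == lock["start_ts"]:
        # Primary lock still present: the transaction never committed. Roll back.
        primary["data"].pop(lock["start_ts"], None)
        primary["lock"] = None
        cell["data"].pop(lock["start_ts"], None)
        cell["lock"] = None
        return "rolled_back"
    # Primary lock gone: the transaction committed. Roll this cell forward.
    commit_ts = next(ts for ts, start in primary["write"].items()
                     if start == lock["start_ts"])
    cell["write"][commit_ts] = lock["start_ts"]
    cell["lock"] = None
    return "rolled_forward"
```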
<h3 id="notification-mechanism">Notification Mechanism</h3>
<p>A notification mechanism is crucial for state observation and linkage in asynchronous systems, but it is not the focus of this article.</p>
<h2 id="percolator-in-tidb">Percolator in TiDB</h2>
<p>Based on our analysis above, Percolator is an optimized 2PC distributed transaction implementation, relying on a storage engine that supports single-node transactions.</p>
<p>Let’s briefly look at how TiDB uses Percolator to implement distributed transactions.</p>
<p><img alt="https://download.pingcap.com/images/docs-cn/tidb-architecture-v6.png" src="https://download.pingcap.com/images/docs-cn/tidb-architecture-v6.png"></p>
<p>The architecture of TiDB and TiKV is shown above. Data from relational tables in TiDB is ultimately mapped to KV pairs in TiKV. TiKV is a distributed KV store based on Raft and RocksDB. RocksDB supports transactional operations on KV pairs.</p>
<p><img alt="https://download.pingcap.com/images/docs/tikv-rocksdb.png" src="https://download.pingcap.com/images/docs/tikv-rocksdb.png"></p>
<p>Thus, the transaction path in TiDB is as follows: a relational table transaction is converted into a set of KV transactions, which are executed based on Percolator to achieve relational table transaction operations.</p>
<blockquote>
<p>Of course, it cannot provide the same transactional semantics and performance guarantees as a single-node TP database. However, a shared-nothing architecture has its own advantages, which may make this trade-off acceptable.</p>
</blockquote>
<h2 id="references">References</h2>
<p><a href="https://zhuanlan.zhihu.com/p/22594180">Engineering Practice of Two-Phase Commit</a></p>
<p><a href="http://mysql.taobao.org/monthly/2018/11/02/">PolarDB Database Kernel Monthly Report</a></p>
<p><a href="https://karellincoln.github.io/2018/04/05/percolator-translate/">Percolator: Online Incremental Processing System (Chinese Translation)</a></p>
<p><a href="https://www.notion.so/percolator-879c8f72f80b4966a2ec1e41edc74560?pvs=21">Percolator: Online Incremental Processing System (Chinese Translation) | A Small Bird</a></p>
<p><a href="https://zh.wikipedia.org/zh-hans/%E4%BA%8C%E9%98%B6%E6%AE%B5%E6%8F%90%E4%BA%A4">Two-Phase Commit - Wikipedia</a></p>
<p><a href="https://cn.pingcap.com/blog/percolator-and-txn">Percolator and TiDB Transaction Algorithm</a></p>
<p><a href="http://www.oceanbase.wiki/concept/transaction-management/transactions/distributed-transactions/two-phase-commit">Two-Phase Commit | OceanBase Learning Guide</a></p>
<p><a href="https://docs.pingcap.com/zh/tidb/stable/tidb-architecture">TiDB Architecture Overview</a></p>
Dynamo: Amazon’s Highly Available Key-value Store
https://noneback.github.io/blog/dynamo/
Tue, 01 Aug 2023 16:15:29 +0800https://noneback.github.io/blog/dynamo/<p>An old paper by AWS, Dynamo has been in the market for a long time, and the architecture has likely evolved since the paper’s publication. Despite this, the paper was selected as one of the SIGMOD best papers of the year, and there are still many valuable lessons to learn.</p>
<h2 id="design">Design</h2>
<p>Dynamo is a NoSQL product that provides a key-value storage interface. It emphasizes high availability rather than consistency, which leads to differences in architectural design and technical choices compared to other systems.</p>
<h2 id="technical-details">Technical Details</h2>
<p>Dynamo has many aspects that may be considered problematic from a technical perspective, such as the NWR (N-W-R) approach. However, given Dynamo’s long track record in production, these issues may have been resolved over time, though the paper is not explicit about this. For now, let’s discuss some of the aspects I found noteworthy:</p>
<h3 id="data-partitioning">Data Partitioning</h3>
<p>Dynamo uses a <strong>consistent hashing algorithm</strong>. Traditional consistent hashing employs a hash ring to address the problem of extensive rehashing when nodes are added or removed, but it cannot avoid issues like data skew and performance imbalance caused by heterogeneous machines. In practice, Dynamo introduces <strong>virtual nodes</strong> into the hash ring, which elegantly solves these problems.</p>
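<p>A minimal sketch of a hash ring with virtual nodes (MD5 is used here only for illustration; Dynamo&rsquo;s actual hashing and placement logic differ):</p>

```python
import bisect
import hashlib

def h(key):
    # Map a string onto the ring's key space.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=100):
    # Each physical node contributes `vnodes` points on the ring, smoothing out
    # data skew; heterogeneous machines can be given proportionally more points.
    return sorted((h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))

def lookup(ring, key):
    # Walk clockwise to the first virtual node at or after the key's hash.
    hashes = [point for point, _ in ring]
    idx = bisect.bisect_right(hashes, h(key)) % len(ring)
    return ring[idx][1]
```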
<h3 id="data-write-challenges">Data Write Challenges</h3>
<p>Most storage systems ensure a certain level of consistency during writes, trading off lower write performance for reduced read complexity. However, Dynamo takes a different approach.</p>
<p>Dynamo’s design goal is to provide a highly available key-value store that ensures <strong>always writable</strong> operations while only guaranteeing <strong>eventual consistency</strong>. To achieve this, Dynamo pushes data conflict resolution to the read operation, <strong>ensuring that writes are never rejected</strong>.</p>
<p>There are two key issues to consider here:</p>
<ol>
<li>
<p><strong>Data Conflict Resolution</strong>: Concurrent reads and writes to the same key by multiple clients can easily lead to data conflicts. Since Dynamo only provides eventual consistency, data on different nodes in the Dynamo ring might be inconsistent.</p>
<ul>
<li>Dynamo uses <strong>vector clocks</strong> to keep track of data versions and merges them during reads to resolve conflicts.</li>
</ul>
</li>
<li>
<p><strong>Replica Data Gaps</strong>: Since Dynamo uses NWR quorums with sloppy quorums and hinted handoff, it is theoretically possible that no single replica holds the complete data set, so replicas must be synchronized.</p>
<ul>
<li>Dynamo uses an <strong>anti-entropy process</strong> to address this, employing <strong>Merkle Trees</strong> to efficiently detect inconsistencies between replicas and minimize the amount of data transferred.</li>
</ul>
</li>
</ol>
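<p>The vector-clock reconciliation mentioned above can be sketched as two small operations, representing a clock as a node-to-counter map (a simplified model of what the paper describes):</p>

```python
def descends(vc_a, vc_b):
    # True if version a already incorporates everything in b; two versions
    # conflict when neither descends from the other.
    return all(vc_a.get(node, 0) >= count for node, count in vc_b.items())

def merge(vc_a, vc_b):
    # Reconciliation on read: take the per-node maximum of both clocks.
    return {node: max(vc_a.get(node, 0), vc_b.get(node, 0))
            for node in vc_a.keys() | vc_b.keys()}
```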
<p><img alt="Dynamo Design Considerations" src="https://raw.githubusercontent.com/noneback/images/picgo/20230801162353.png"></p>
<p>The table in the paper clearly shows the aspects considered during Dynamo’s development and the corresponding technical choices. For more information, refer to the original paper.</p>
<h2 id="references">References</h2>
<p><a href="https://www.raychase.net/2396">Dynamo’s Implementation and Decentralization</a></p>
<p><a href="https://timyang.net/data/dynamo-flawed-architecture-chinese/">Dynamo’s Flawed Architecture (Translation) by Tim Yang</a></p>
<p><a href="https://news.ycombinator.com/item?id=915212">Dynamo: A Flawed Architecture | Hacker News</a></p>
MIT6.824 AuroraDB
https://noneback.github.io/blog/mit6.824-auroradb/
Tue, 01 Aug 2023 16:11:54 +0800https://noneback.github.io/blog/mit6.824-auroradb/<p>This article introduces the design considerations of AWS’s database product, Aurora, including storage-compute separation, single-writer multi-reader architecture, and quorum-based NRW consistency protocol. The article also mentions how PolarDB was inspired by Aurora, with differences in addressing network bottlenecks and system call overhead.</p>
<hr>
<p>Aurora is a database product provided by AWS, primarily aimed at OLTP business scenarios.</p>
<p>In terms of design, there are several aspects worth noting:</p>
<ul>
<li>The design premise of Aurora is that with databases moving to the cloud, thanks to advancements in cloud infrastructure, the biggest bottleneck for databases has shifted from compute and storage to the network. This was an important premise for AWS when designing Aurora. Based on this premise, Aurora revisits the concept of “Log is Database”, pushing only the RedoLog down to the storage layer.</li>
<li><strong>Storage-compute separation</strong>: The database storage layer interfaces with a distributed storage system, which provides reliability and security guarantees. The compute and storage layers can scale independently. The storage system provides a unified data view to the upper layers, significantly improving the efficiency of core functions and operations (such as backup, data recovery, and high availability).</li>
<li><strong>Interesting reliability guarantees</strong>: For example, the quorum-based NRW consistency protocol, where read and write operations on storage nodes require quorum agreement, ensures AZ-level fault tolerance. Sharding is used to reduce failure recovery time, improving the SLA. Quorum reads mostly occur during database recovery, when the current state needs to be restored.</li>
<li><strong>Single-writer multi-reader</strong>: Unlike NewSQL products with a shared-nothing architecture, Aurora provides only a single write node. This simplifies data consistency guarantees since the single write node can use the RedoLog LSN as a logical clock to maintain the partial order of data updates. By pushing the RedoLog to all nodes and applying these operations in order, consistency can be achieved.</li>
<li><strong>Transaction implementation</strong>: Since the storage system provides a unified file view to the upper layer, Aurora’s transaction implementation is almost the same as that of a single-node transaction algorithm and can provide similar transaction semantics. NewSQL transactions are generally implemented via distributed transactions based on 2PC.</li>
<li><strong>Background acceleration for foreground processing</strong>: Similar to the approach in LevelDB, storage nodes try to make some operations asynchronous (such as log apply) to improve user-perceived performance. These asynchronous operations maintain progress using various LSNs, such as VLSN, commit-LSN, etc.</li>
</ul>
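<p>The quorum arithmetic behind this is compact enough to state directly; the Aurora paper uses six replicas across three AZs with a 4/6 write quorum and a 3/6 read quorum:</p>

```python
def quorum_ok(n, w, r):
    # A read quorum must overlap every write quorum (r + w > n), and two write
    # quorums must overlap so a write sees the latest committed write (w > n/2).
    return r + w > n and w > n / 2
```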
<p><img alt="Aurora Architecture Overview" src="https://raw.githubusercontent.com/noneback/images/picgo/20230412094745.png"></p>
<p><img alt="Aurora Write Path" src="https://raw.githubusercontent.com/noneback/images/picgo/20230412094928.png"></p>
<p><img alt="Aurora Read Path" src="https://raw.githubusercontent.com/noneback/images/picgo/20230412094941.png"></p>
<p>Interestingly, although PolarDB’s design was inspired by Aurora, its architecture starts from the observation that the network is no longer the bottleneck; instead, the many system calls that go through the OS slow down the overall path. Given the instability of Alibaba Cloud’s storage system at the time, PolarStore was introduced, using dedicated hardware and FUSE-based storage techniques to bypass or optimize system calls. Now that Pangu has improved significantly in both stability and performance, it makes sense to weaken the role of PolarStore, and I find that reasoning convincing.</p>
<p>Additionally, why did they choose to use NRW instead of a consensus protocol like Raft? For now, it seems that NRW has one less round of network communication compared to Raft, which might be the reason.</p>
<p><img alt="Aurora Storage-Compute Separation" src="https://raw.githubusercontent.com/noneback/images/picgo/20230412094918.png"></p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://zhuanlan.zhihu.com/p/319806107">https://zhuanlan.zhihu.com/p/319806107</a></li>
<li><a href="http://nil.csail.mit.edu/6.824/2020/notes/l-aurora.txt">http://nil.csail.mit.edu/6.824/2020/notes/l-aurora.txt</a></li>
<li><a href="https://keys961.github.io/2020/05/05/%E8%AE%BA%E6%96%87%E9%98%85%E8%AF%BB-Aurora/">Paper Reading - Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Database - keys961 | keys961 Blog</a></li>
</ul>
MIT6.824 Chain Replication
https://noneback.github.io/blog/mit6.824-chainreplication/
Wed, 08 Feb 2023 23:05:57 +0800https://noneback.github.io/blog/mit6.824-chainreplication/<p>This post provides a brief overview of the Chain Replication (CR) paper, which introduces a simple but effective algorithm for providing linearizable consistency in storage services. For those interested in the detailed design, it’s best to refer directly to the original paper.</p>
<h2 id="introduction">Introduction</h2>
<p>In short, the Chain Replication (CR) paper presents a replicated state machine algorithm designed for storage services that require linearizable consistency. It uses a chain replication method to improve throughput and relies on multiple replicas to ensure service availability.</p>
<p><img alt="Chain Replication" src="https://raw.githubusercontent.com/noneback/images/picgo/20230215135829.png"></p>
<p>The design of the algorithm is both simple and elegant. CR splits the replication workload across all nodes in the chain, with each node responsible for forwarding updates to its successor. Write requests are propagated from the head node to the tail, while read requests are served by the tail node.</p>
<p>To maintain relationships between nodes in the chain, Chain Replication introduces a Master service responsible for managing node configurations and handling node failures.</p>
<h2 id="failure-handling">Failure Handling</h2>
<ol>
<li>
<p><strong>Head Failure</strong>: If the head node fails, any pending or unprocessed requests are lost, but linearizable consistency remains unaffected. The second node in the chain is promoted to the new head.</p>
</li>
<li>
<p><strong>Tail Failure</strong>: If the tail node fails, the second-to-last node becomes the new tail, and pending requests from the original tail are committed.</p>
</li>
<li>
<p><strong>Middle Node Failure</strong>: When a middle node fails, the chain is reconnected in a manner similar to linked list operations. The previous node (<code>Node_pre</code>) is linked directly to the next node (<code>Node_next</code>). To ensure that no requests are lost during this failure, each CR node maintains a <code>SendReqList</code> that records all requests forwarded to its successor. Since requests are propagated from head to tail, <code>Node_pre</code> only needs to send <code>Node_next</code> any missing data. When the tail node receives a request, it marks it as committed, and an acknowledgment (<code>Ack(req)</code>) is sent back from the tail to the head, removing the request from each node’s <code>SendReqList</code> as the acknowledgment propagates.</p>
</li>
</ol>
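The <code>SendReqList</code> bookkeeping in the middle-node case can be sketched as a toy model. This is illustrative Java only; the paper describes the protocol abstractly, and all names here are hypothetical. The model is synchronous, so anything still in a node's <code>SendReqList</code> was never delivered downstream:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of a CR node's SendReqList bookkeeping (illustrative only).
class ChainNode {
    final Set<String> seen = new HashSet<>();             // makes delivery idempotent
    final LinkedHashSet<String> sendReqList = new LinkedHashSet<>(); // forwarded, not yet acked
    final List<String> committed = new ArrayList<>();
    ChainNode predecessor, successor;

    // Updates flow head -> tail; the tail commits and starts the ack wave.
    void handleUpdate(String req) {
        if (!seen.add(req)) return;                       // duplicate resend: ignore
        if (successor != null) {
            sendReqList.add(req);
            successor.handleUpdate(req);
        } else {
            committed.add(req);                           // tail: commit
            if (predecessor != null) predecessor.onAck(req);
        }
    }

    // Acks flow tail -> head, pruning each SendReqList along the way.
    void onAck(String req) {
        sendReqList.remove(req);
        committed.add(req);
        if (predecessor != null) predecessor.onAck(req);
    }

    // Middle-successor failure: splice it out and resend anything the new
    // successor may have missed (safe because delivery is idempotent).
    void repairSuccessorFailure() {
        ChainNode next = successor.successor;             // assumes a middle node failed
        successor = next;
        next.predecessor = this;
        for (String pending : new ArrayList<>(sendReqList)) {
            next.handleUpdate(pending);
        }
    }
}
```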
<h2 id="pros-and-cons">Pros and Cons</h2>
<p>The main advantages of Chain Replication include:</p>
<ul>
<li><strong>High Throughput</strong>: By distributing the workload across all nodes, CR effectively increases the throughput of a single node.</li>
<li><strong>Balanced Load</strong>: Each node has a similar workload, resulting in balanced utilization.</li>
<li><strong>Simplicity</strong>: The overall design is clean and straightforward, making it easier to implement.</li>
</ul>
<p>However, there are some clear disadvantages:</p>
<ol>
<li><strong>Bottlenecks</strong>: If a node in the chain processes requests slowly, it will delay the entire chain’s processing.</li>
<li><strong>Read Limitations</strong>: Only the tail serves reads, while the head accepts writes; the data in the middle nodes exists purely for replication and fault tolerance rather than for serving requests. However, the CRAQ (Chain Replication with Apportioned Queries) variant allows middle nodes to serve read-only requests, similar to Raft’s Read Index, which can help alleviate this limitation.</li>
</ol>
<h2 id="references">References</h2>
<ul>
<li><a href="https://tanxinyu.work/chain-replication-thesis/">Chain Replication Paper Summary</a></li>
<li><a href="http://nil.csail.mit.edu/6.824/2021/papers/cr-osdi04.pdf">Original CR Paper</a></li>
</ul>
MIT6.824-ZooKeeper
https://noneback.github.io/blog/mit6.824-zookeeper/
Tue, 03 Jan 2023 23:49:41 +0800https://noneback.github.io/blog/mit6.824-zookeeper/<p>This article mainly discusses the design and practical considerations of the ZooKeeper system, such as wait-free and lock mechanisms, consistency choices, system-provided APIs, and specific semantic decisions. These trade-offs are the most insightful aspects of this article.</p>
<h2 id="positioning">Positioning</h2>
<p>ZooKeeper is a wait-free, high-performance coordination service for distributed applications. It supports the coordination needs of distributed applications by providing coordination primitives (specific APIs and data models).</p>
<h2 id="design">Design</h2>
<h3 id="keywords">Keywords</h3>
<p>There are two key phrases in ZooKeeper’s positioning: <strong>high performance</strong> and <strong>distributed application coordination service</strong>.</p>
<p>ZooKeeper’s high performance is achieved through wait-free design, local reads from multiple replicas, and the watch mechanism:</p>
<ul>
<li>Wait-free: requests are handled asynchronously, which may reorder them relative to real time, so the state machine can diverge from the wall-clock sequence; ZooKeeper compensates by guaranteeing FIFO client order. Asynchronous handling is also conducive to batching and pipelining, further improving performance.</li>
<li>The watch mechanism notifies clients of updates when a znode changes, reducing the overhead of clients querying local caches.</li>
<li>Local reads from multiple replicas: ZooKeeper uses the ZAB protocol to reach consensus on writes, making write operations linearizable. Read requests, however, are served locally from replicas without going through ZAB; they may return stale data, but each client still observes its own operations in FIFO order, trading strict freshness for performance.</li>
</ul>
<p>The distributed application coordination service refers to the data model and API semantics provided by ZooKeeper, allowing distributed applications to freely use them to fulfill coordination needs such as group membership and distributed locking.</p>
<h3 id="data-model-and-api">Data Model and API</h3>
<p>ZooKeeper provides an abstraction of data nodes called znodes, which are organized through a hierarchical namespace. ZooKeeper offers two types of znodes: regular and ephemeral. Each znode stores data and is accessed using standard UNIX filesystem paths.</p>
<p>In practice, znodes are not designed for general data storage. Instead, znodes map to abstractions in client applications, often corresponding to <strong>metadata</strong> used for coordination.</p>
<blockquote>
<p>In other words, when coordinating through ZooKeeper, utilize the metadata associated with znodes instead of treating them as mere data storage. For example, znodes associate metadata with timestamps and version counters, allowing clients to track changes to the znodes and perform conditional updates based on the znode version.</p>
</blockquote>
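The version-guarded update the quote describes is what ZooKeeper exposes through <code>setData(path, data, expectedVersion)</code>, which fails if the version no longer matches. A minimal in-memory sketch of those semantics (not the real client API):

```java
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of the version-guarded update semantics ZooKeeper exposes
// via setData(path, data, expectedVersion); not the real client library.
class MiniZnodeStore {
    static final class Znode {
        byte[] data;
        int version;                 // bumped on every successful setData
    }

    private final Map<String, Znode> nodes = new HashMap<>();

    void create(String path, byte[] data) {
        Znode z = new Znode();
        z.data = data;
        z.version = 0;
        nodes.put(path, z);
    }

    int getVersion(String path) { return nodes.get(path).version; }

    // Conditional update: succeeds only if the caller's expected version
    // matches, mirroring ZooKeeper's BadVersion failure. -1 skips the check.
    boolean setData(String path, byte[] data, int expectedVersion) {
        Znode z = nodes.get(path);
        if (expectedVersion != -1 && expectedVersion != z.version) {
            return false;            // another client updated the znode first
        }
        z.data = data;
        z.version++;
        return true;
    }
}
```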
<p>Essentially, this data model is a simplified file system API that supports full data reads and writes. Users implement distributed application coordination using the semantics provided by ZooKeeper.</p>
<blockquote>
<p>The difference between regular and ephemeral znodes is that ephemeral nodes are automatically deleted when the session ends.</p>
</blockquote>
<p><img alt="img" src="https://s3.us-west-2.amazonaws.com/secure.notion-static.com/c9c4c039-a334-4c00-946c-743e6ab984d9/Untitled.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20230103%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20230103T155342Z&X-Amz-Expires=86400&X-Amz-Signature=7b1041157b56fe404023a2303762de9bb599c57d116bc10b9f46e1733f67bbc2&X-Amz-SignedHeaders=host&response-content-disposition=filename%3D\"Untitled.png\"&x-id=GetObject"></p>
<p>Clients interact with ZooKeeper through its API, and ZooKeeper manages client connections through sessions. In a session, clients can observe state changes that reflect their operations.</p>
<h2 id="cap-guarantees">CAP Guarantees</h2>
<p>ZooKeeper provides CP (Consistency and Partition Tolerance) guarantees. For instance, during leader election, ZooKeeper will stop serving requests until a new leader is elected, ensuring consistency.</p>
<h2 id="implementation">Implementation</h2>
<p><img alt="img" src="https://s3.us-west-2.amazonaws.com/secure.notion-static.com/cb5e3866-1ce2-4897-aa47-c486c10aba12/Untitled.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20230103%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20230103T155414Z&X-Amz-Expires=86400&X-Amz-Signature=35715be3617f7544fc7fcc05705f99a32d46e0ca9c31af2d51f383148f316f32&X-Amz-SignedHeaders=host&response-content-disposition=filename%3D\"Untitled.png\"&x-id=GetObject"></p>
<p>ZooKeeper uses multiple replicas to achieve high availability.</p>
<p>In simple terms, ZooKeeper’s upper layer uses the ZAB protocol to handle write requests, ensuring linearizability across replicas, while reads are processed locally with sequential consistency. The state machine lives in an in-memory replicated database backed by a Write-Ahead Log (WAL) on each cluster machine, with periodic snapshots for durability. Crash safety and fast recovery come from combining fuzzy snapshots with WAL replay.</p>
<blockquote>
<p>The advantage of fuzzy snapshots is that they do not block online requests.</p>
</blockquote>
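Why a fuzzy snapshot is safe to use can be sketched in a few lines: the snapshot may already reflect some logged transactions, so recovery re-applies the WAL from the start and relies on idempotent, "set to absolute value" transactions, as ZAB uses. Names and shapes here are illustrative, not ZooKeeper's actual classes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of fuzzy-snapshot recovery: re-applying a transaction that the
// snapshot already absorbed is harmless because each txn sets an absolute
// value. Illustrative model, not ZooKeeper internals.
class FuzzyRecovery {
    static final class Txn {
        final long zxid; final String key; final String value;
        Txn(long zxid, String key, String value) {
            this.zxid = zxid; this.key = key; this.value = value;
        }
    }

    static Map<String, String> recover(Map<String, String> fuzzySnapshot, List<Txn> wal) {
        Map<String, String> state = new HashMap<>(fuzzySnapshot);
        for (Txn t : wal) {
            state.put(t.key, t.value);   // idempotent: applying twice = applying once
        }
        return state;
    }
}
```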
<h3 id="interaction-with-clients">Interaction with Clients</h3>
<ul>
<li>Update operations will notify and clear the relevant znode’s watch.</li>
<li>Read requests are processed locally, and the partial order of write requests is defined by <code>zxid</code>. Sequential consistency is ensured, but reads may be stale. ZooKeeper provides the <code>sync</code> operation, which can mitigate this to some extent.</li>
<li>When a client connects to a new ZooKeeper server, the client’s last-seen <code>zxid</code> is compared with the server’s; a server whose state lags behind the client will not establish the session.</li>
<li>Clients maintain sessions through heartbeats, and the server handles requests idempotently.</li>
</ul>
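The reconnect rule in the third bullet can be sketched as a tiny client-side model: clients remember the highest <code>zxid</code> they have observed and refuse to attach to a server whose state lags behind it. Illustrative only; real clients receive the <code>zxid</code> piggybacked on every response:

```java
// Illustrative model of the reconnect rule: a client never attaches to a
// server that has applied fewer transactions than the client has seen,
// since that would violate the client's ordering guarantees.
class ZkClient {
    private long lastSeenZxid = 0;

    // Record the zxid carried on each server response.
    void observe(long zxid) {
        lastSeenZxid = Math.max(lastSeenZxid, zxid);
    }

    // Reject servers whose state is older than what this client has seen.
    boolean canConnectTo(long serverLastZxid) {
        return serverLastZxid >= lastSeenZxid;
    }

    long lastSeenZxid() { return lastSeenZxid; }
}
```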
<h2 id="references">References</h2>
<p><a href="https://pdos.csail.mit.edu/6.824/papers/zookeeper.pdf">ZooKeeper Paper</a></p>
<p><a href="https://pdos.csail.mit.edu/6.824/papers/zookeeper-faq.txt">MIT6.824-ZooKeeper FAQ</a></p>
Flink-Iceberg-Connector Write Process
https://noneback.github.io/blog/flinkicebergconnector%E5%86%99%E5%85%A5%E6%B5%81%E7%A8%8B/
Mon, 10 Oct 2022 10:43:38 +0800https://noneback.github.io/blog/flinkicebergconnector%E5%86%99%E5%85%A5%E6%B5%81%E7%A8%8B/<p>The Iceberg community provides an official Flink Connector, and this chapter’s source code analysis is based on that.</p>
<h2 id="overview-of-the-write-submission-process">Overview of the Write Submission Process</h2>
<p>Flink writes data through <code>RowData -> distributeStream -> WriterStream -> CommitterStream</code>. Before data is committed, it is stored as intermediate files, which become visible to the system after being committed (through writing manifest, snapshot, and metadata files).</p>
<p><img alt="Flink-Iceberg Write Flow" src="https://intranetproxy.alipay.com/skylark/lark/0/2022/png/59256351/1655962006990-826460c7-b6fc-4efe-a8e0-65cc080ffea9.png"></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">private</span> <span style="color:#f92672"><</span>T<span style="color:#f92672">></span> DataStreamSink<span style="color:#f92672"><</span>T<span style="color:#f92672">></span> <span style="color:#a6e22e">chainIcebergOperators</span>() {
</span></span><span style="display:flex;"><span> Preconditions.<span style="color:#a6e22e">checkArgument</span>(inputCreator <span style="color:#f92672">!=</span> <span style="color:#66d9ef">null</span>,
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"Please use forRowData() or forMapperOutputType() to initialize the input DataStream."</span>);
</span></span><span style="display:flex;"><span> Preconditions.<span style="color:#a6e22e">checkNotNull</span>(tableLoader, <span style="color:#e6db74">"Table loader shouldn't be null"</span>);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> DataStream<span style="color:#f92672"><</span>RowData<span style="color:#f92672">></span> rowDataInput <span style="color:#f92672">=</span> inputCreator.<span style="color:#a6e22e">apply</span>(uidPrefix);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (table <span style="color:#f92672">==</span> <span style="color:#66d9ef">null</span>) {
</span></span><span style="display:flex;"><span> tableLoader.<span style="color:#a6e22e">open</span>();
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">try</span> (TableLoader loader <span style="color:#f92672">=</span> tableLoader) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">this</span>.<span style="color:#a6e22e">table</span> <span style="color:#f92672">=</span> loader.<span style="color:#a6e22e">loadTable</span>();
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">catch</span> (IOException e) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">throw</span> <span style="color:#66d9ef">new</span> UncheckedIOException(<span style="color:#e6db74">"Failed to load iceberg table from table loader: "</span> <span style="color:#f92672">+</span> tableLoader, e);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> List<span style="color:#f92672"><</span>Integer<span style="color:#f92672">></span> equalityFieldIds <span style="color:#f92672">=</span> checkAndGetEqualityFieldIds();
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> RowType flinkRowType <span style="color:#f92672">=</span> toFlinkRowType(table.<span style="color:#a6e22e">schema</span>(), tableSchema);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> DataStream<span style="color:#f92672"><</span>RowData<span style="color:#f92672">></span> distributeStream <span style="color:#f92672">=</span> distributeDataStream(
</span></span><span style="display:flex;"><span> rowDataInput, table.<span style="color:#a6e22e">properties</span>(), equalityFieldIds, table.<span style="color:#a6e22e">spec</span>(), table.<span style="color:#a6e22e">schema</span>(), flinkRowType);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> SingleOutputStreamOperator<span style="color:#f92672"><</span>WriteResult<span style="color:#f92672">></span> writerStream <span style="color:#f92672">=</span> appendWriter(distributeStream, flinkRowType,
</span></span><span style="display:flex;"><span> equalityFieldIds);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> SingleOutputStreamOperator<span style="color:#f92672"><</span>Void<span style="color:#f92672">></span> committerStream <span style="color:#f92672">=</span> appendCommitter(writerStream);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> appendDummySink(committerStream);
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="write-process-source-code-analysis">Write Process Source Code Analysis</h2>
<h3 id="writestream">WriteStream</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">private</span> SingleOutputStreamOperator<span style="color:#f92672"><</span>WriteResult<span style="color:#f92672">></span> <span style="color:#a6e22e">appendWriter</span>(DataStream<span style="color:#f92672"><</span>RowData<span style="color:#f92672">></span> input, RowType flinkRowType,
</span></span><span style="display:flex;"><span> List<span style="color:#f92672"><</span>Integer<span style="color:#f92672">></span> equalityFieldIds) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">boolean</span> upsertMode <span style="color:#f92672">=</span> upsert <span style="color:#f92672">||</span> PropertyUtil.<span style="color:#a6e22e">propertyAsBoolean</span>(table.<span style="color:#a6e22e">properties</span>(),
</span></span><span style="display:flex;"><span> UPSERT_ENABLED, UPSERT_ENABLED_DEFAULT);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (upsertMode) {
</span></span><span style="display:flex;"><span> Preconditions.<span style="color:#a6e22e">checkState</span>(<span style="color:#f92672">!</span>overwrite,
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"OVERWRITE mode shouldn't be enabled when configuring to use UPSERT data stream."</span>);
</span></span><span style="display:flex;"><span> Preconditions.<span style="color:#a6e22e">checkState</span>(<span style="color:#f92672">!</span>equalityFieldIds.<span style="color:#a6e22e">isEmpty</span>(),
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"Equality field columns shouldn't be empty when configuring to use UPSERT data stream."</span>);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (<span style="color:#f92672">!</span>table.<span style="color:#a6e22e">spec</span>().<span style="color:#a6e22e">isUnpartitioned</span>()) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> (PartitionField partitionField : table.<span style="color:#a6e22e">spec</span>().<span style="color:#a6e22e">fields</span>()) {
</span></span><span style="display:flex;"><span> Preconditions.<span style="color:#a6e22e">checkState</span>(equalityFieldIds.<span style="color:#a6e22e">contains</span>(partitionField.<span style="color:#a6e22e">sourceId</span>()),
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"In UPSERT mode, partition field '%s' should be included in equality fields: '%s'"</span>,
</span></span><span style="display:flex;"><span> partitionField, equalityFieldColumns);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> IcebergStreamWriter<span style="color:#f92672"><</span>RowData<span style="color:#f92672">></span> streamWriter <span style="color:#f92672">=</span> createStreamWriter(table, flinkRowType, equalityFieldIds, upsertMode);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> parallelism <span style="color:#f92672">=</span> writeParallelism <span style="color:#f92672">==</span> <span style="color:#66d9ef">null</span> <span style="color:#f92672">?</span> input.<span style="color:#a6e22e">getParallelism</span>() : writeParallelism;
</span></span><span style="display:flex;"><span> SingleOutputStreamOperator<span style="color:#f92672"><</span>WriteResult<span style="color:#f92672">></span> writerStream <span style="color:#f92672">=</span> input
</span></span><span style="display:flex;"><span> .<span style="color:#a6e22e">transform</span>(operatorName(ICEBERG_STREAM_WRITER_NAME), TypeInformation.<span style="color:#a6e22e">of</span>(WriteResult.<span style="color:#a6e22e">class</span>), streamWriter)
</span></span><span style="display:flex;"><span> .<span style="color:#a6e22e">setParallelism</span>(parallelism);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (uidPrefix <span style="color:#f92672">!=</span> <span style="color:#66d9ef">null</span>) {
</span></span><span style="display:flex;"><span> writerStream <span style="color:#f92672">=</span> writerStream.<span style="color:#a6e22e">uid</span>(uidPrefix <span style="color:#f92672">+</span> <span style="color:#e6db74">"-writer"</span>);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> writerStream;
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The <code>WriterStream</code> operator is transformed from the <code>distributeStream</code>, with <code>RowData</code> as input and <code>WriteResult</code> as output. The transformation logic is encapsulated in the <code>IcebergStreamWriter</code>, which processes each element using <code>processElement</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">private</span> <span style="color:#66d9ef">transient</span> TaskWriter<span style="color:#f92672"><</span>T<span style="color:#f92672">></span> writer;
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@Override</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">processElement</span>(StreamRecord<span style="color:#f92672"><</span>T<span style="color:#f92672">></span> element) <span style="color:#66d9ef">throws</span> Exception {
</span></span><span style="display:flex;"><span> writer.<span style="color:#a6e22e">write</span>(element.<span style="color:#a6e22e">getValue</span>());
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><code>IcebergStreamWriter</code> delegates the writing to a <code>TaskWriter</code> created by <code>TaskWriterFactory</code>. The specific type could be <code>PartitionedDeltaWriter</code> or <code>UnpartitionedWriter</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">public</span> TaskWriter<span style="color:#f92672"><</span>RowData<span style="color:#f92672">></span> <span style="color:#a6e22e">create</span>() {
</span></span><span style="display:flex;"><span> Preconditions.<span style="color:#a6e22e">checkNotNull</span>(outputFileFactory,
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"The outputFileFactory shouldn't be null if we have invoked the initialize()."</span>);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (equalityFieldIds <span style="color:#f92672">==</span> <span style="color:#66d9ef">null</span> <span style="color:#f92672">||</span> equalityFieldIds.<span style="color:#a6e22e">isEmpty</span>()) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (spec.<span style="color:#a6e22e">isUnpartitioned</span>()) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">new</span> UnpartitionedWriter<span style="color:#f92672"><></span>(spec, format, appenderFactory, outputFileFactory, io, targetFileSizeBytes);
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">new</span> RowDataPartitionedFanoutWriter(spec, format, appenderFactory, outputFileFactory,
</span></span><span style="display:flex;"><span> io, targetFileSizeBytes, schema, flinkSchema);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (spec.<span style="color:#a6e22e">isUnpartitioned</span>()) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">new</span> UnpartitionedDeltaWriter(spec, format, appenderFactory, outputFileFactory, io,
</span></span><span style="display:flex;"><span> targetFileSizeBytes, schema, flinkSchema, equalityFieldIds, upsert);
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">new</span> PartitionedDeltaWriter(spec, format, appenderFactory, outputFileFactory, io,
</span></span><span style="display:flex;"><span> targetFileSizeBytes, schema, flinkSchema, equalityFieldIds, upsert);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h3 id="committerstream">CommitterStream</h3>
<p>The <code>CommitterStream</code> receives <code>WriteResult</code> as input with no output. <code>WriteResult</code> contains the data files produced by <code>WriteStream</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">class</span> <span style="color:#a6e22e">WriteResult</span> <span style="color:#66d9ef">implements</span> Serializable {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> DataFile<span style="color:#f92672">[]</span> dataFiles;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> DeleteFile<span style="color:#f92672">[]</span> deleteFiles;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> CharSequence<span style="color:#f92672">[]</span> referencedDataFiles;
</span></span><span style="display:flex;"><span> ...
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The core logic for processing data file submissions is encapsulated in <code>IcebergFilesCommitter</code>. The <code>IcebergFilesCommitter</code> maintains a list of files that need to be committed for each checkpoint. Once a checkpoint completes, it tries to commit those files to Iceberg.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">IcebergFilesCommitter</span> <span style="color:#66d9ef">extends</span> AbstractStreamOperator<span style="color:#f92672"><</span>Void<span style="color:#f92672">></span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">implements</span> OneInputStreamOperator<span style="color:#f92672"><</span>WriteResult, Void<span style="color:#f92672">></span>, BoundedOneInput {
</span></span><span style="display:flex;"><span> ...
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">final</span> NavigableMap<span style="color:#f92672"><</span>Long, <span style="color:#66d9ef">byte</span><span style="color:#f92672">[]></span> dataFilesPerCheckpoint <span style="color:#f92672">=</span> Maps.<span style="color:#a6e22e">newTreeMap</span>();
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">final</span> List<span style="color:#f92672"><</span>WriteResult<span style="color:#f92672">></span> writeResultsOfCurrentCkpt <span style="color:#f92672">=</span> Lists.<span style="color:#a6e22e">newArrayList</span>();
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">transient</span> ListState<span style="color:#f92672"><</span>SortedMap<span style="color:#f92672"><</span>Long, <span style="color:#66d9ef">byte</span><span style="color:#f92672">[]>></span> checkpointsState;
</span></span><span style="display:flex;"><span> ...
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The <code>processElement</code> method stores <code>WriteResult</code> from upstream in <code>writeResultsOfCurrentCkpt</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#a6e22e">@Override</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">processElement</span>(StreamRecord<span style="color:#f92672"><</span>WriteResult<span style="color:#f92672">></span> element) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">this</span>.<span style="color:#a6e22e">writeResultsOfCurrentCkpt</span>.<span style="color:#a6e22e">add</span>(element.<span style="color:#a6e22e">getValue</span>());
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>During checkpointing (<code>snapshotState</code>), it saves the current checkpoint’s data in <code>dataFilesPerCheckpoint</code>. Later, once the checkpoint is completed (<code>notifyCheckpointComplete</code>), it commits the files:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">snapshotState</span>(StateSnapshotContext context) <span style="color:#66d9ef">throws</span> Exception {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">long</span> checkpointId <span style="color:#f92672">=</span> context.<span style="color:#a6e22e">getCheckpointId</span>();
</span></span><span style="display:flex;"><span> LOG.<span style="color:#a6e22e">info</span>(<span style="color:#e6db74">"Start to flush snapshot state to state backend, table: {}, checkpointId: {}"</span>, table, checkpointId);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> dataFilesPerCheckpoint.<span style="color:#a6e22e">put</span>(checkpointId, writeToManifest(checkpointId));
</span></span><span style="display:flex;"><span> checkpointsState.<span style="color:#a6e22e">clear</span>();
</span></span><span style="display:flex;"><span> checkpointsState.<span style="color:#a6e22e">add</span>(dataFilesPerCheckpoint);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> jobIdState.<span style="color:#a6e22e">clear</span>();
</span></span><span style="display:flex;"><span> jobIdState.<span style="color:#a6e22e">add</span>(flinkJobId);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> writeResultsOfCurrentCkpt.<span style="color:#a6e22e">clear</span>();
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@Override</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">notifyCheckpointComplete</span>(<span style="color:#66d9ef">long</span> checkpointId) <span style="color:#66d9ef">throws</span> Exception {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (checkpointId <span style="color:#f92672">></span> maxCommittedCheckpointId) {
</span></span><span style="display:flex;"><span> commitUpToCheckpoint(dataFilesPerCheckpoint, flinkJobId, checkpointId);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">this</span>.<span style="color:#a6e22e">maxCommittedCheckpointId</span> <span style="color:#f92672">=</span> checkpointId;
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The commit logic is handled by <code>commitUpToCheckpoint</code>, which generates a new snapshot and adds it to Iceberg’s metadata:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">private</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">commitUpToCheckpoint</span>(NavigableMap<span style="color:#f92672"><</span>Long, <span style="color:#66d9ef">byte</span><span style="color:#f92672">[]></span> deltaManifestsMap,
</span></span><span style="display:flex;"><span> String newFlinkJobId,
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">long</span> checkpointId) <span style="color:#66d9ef">throws</span> IOException {
</span></span><span style="display:flex;"><span> NavigableMap<span style="color:#f92672"><</span>Long, <span style="color:#66d9ef">byte</span><span style="color:#f92672">[]></span> pendingMap <span style="color:#f92672">=</span> deltaManifestsMap.<span style="color:#a6e22e">headMap</span>(checkpointId, <span style="color:#66d9ef">true</span>);
</span></span><span style="display:flex;"><span> List<span style="color:#f92672"><</span>ManifestFile<span style="color:#f92672">></span> manifests <span style="color:#f92672">=</span> Lists.<span style="color:#a6e22e">newArrayList</span>();
</span></span><span style="display:flex;"><span> NavigableMap<span style="color:#f92672"><</span>Long, WriteResult<span style="color:#f92672">></span> pendingResults <span style="color:#f92672">=</span> Maps.<span style="color:#a6e22e">newTreeMap</span>();
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> (Map.<span style="color:#a6e22e">Entry</span><span style="color:#f92672"><</span>Long, <span style="color:#66d9ef">byte</span><span style="color:#f92672">[]></span> e : pendingMap.<span style="color:#a6e22e">entrySet</span>()) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (Arrays.<span style="color:#a6e22e">equals</span>(EMPTY_MANIFEST_DATA, e.<span style="color:#a6e22e">getValue</span>())) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">continue</span>;
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> DeltaManifests deltaManifests <span style="color:#f92672">=</span> SimpleVersionedSerialization
</span></span><span style="display:flex;"><span> .<span style="color:#a6e22e">readVersionAndDeSerialize</span>(DeltaManifestsSerializer.<span style="color:#a6e22e">INSTANCE</span>, e.<span style="color:#a6e22e">getValue</span>());
</span></span><span style="display:flex;"><span> pendingResults.<span style="color:#a6e22e">put</span>(e.<span style="color:#a6e22e">getKey</span>(), FlinkManifestUtil.<span style="color:#a6e22e">readCompletedFiles</span>(deltaManifests, table.<span style="color:#a6e22e">io</span>()));
</span></span><span style="display:flex;"><span> manifests.<span style="color:#a6e22e">addAll</span>(deltaManifests.<span style="color:#a6e22e">manifests</span>());
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> totalFiles <span style="color:#f92672">=</span> pendingResults.<span style="color:#a6e22e">values</span>().<span style="color:#a6e22e">stream</span>()
</span></span><span style="display:flex;"><span> .<span style="color:#a6e22e">mapToInt</span>(r <span style="color:#f92672">-></span> r.<span style="color:#a6e22e">dataFiles</span>().<span style="color:#a6e22e">length</span> <span style="color:#f92672">+</span> r.<span style="color:#a6e22e">deleteFiles</span>().<span style="color:#a6e22e">length</span>).<span style="color:#a6e22e">sum</span>();
</span></span><span style="display:flex;"><span> continuousEmptyCheckpoints <span style="color:#f92672">=</span> totalFiles <span style="color:#f92672">==</span> 0 <span style="color:#f92672">?</span> continuousEmptyCheckpoints <span style="color:#f92672">+</span> 1 : 0;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (totalFiles <span style="color:#f92672">!=</span> 0 <span style="color:#f92672">||</span> continuousEmptyCheckpoints <span style="color:#f92672">%</span> maxContinuousEmptyCommits <span style="color:#f92672">==</span> 0) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (replacePartitions) {
</span></span><span style="display:flex;"><span> replacePartitions(pendingResults, newFlinkJobId, checkpointId);
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> commitDeltaTxn(pendingResults, newFlinkJobId, checkpointId);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> continuousEmptyCheckpoints <span style="color:#f92672">=</span> 0;
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> pendingMap.<span style="color:#a6e22e">clear</span>();
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> (ManifestFile manifest : manifests) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">try</span> {
</span></span><span style="display:flex;"><span> table.<span style="color:#a6e22e">io</span>().<span style="color:#a6e22e">deleteFile</span>(manifest.<span style="color:#a6e22e">path</span>());
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">catch</span> (Exception e) {
</span></span><span style="display:flex;"><span> LOG.<span style="color:#a6e22e">warn</span>(<span style="color:#e6db74">"The iceberg transaction has been committed, but we failed to clean the temporary flink manifests: {}"</span>,
</span></span><span style="display:flex;"><span> manifest.<span style="color:#a6e22e">path</span>(), e);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The snapshot produced by <code>commitDeltaTxn</code> or <code>replacePartitions</code> is ultimately persisted through the table operations&rsquo; <code>commit</code> method, which throws <code>CommitFailedException</code> on a stale base so that the caller can refresh and retry:</p><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">commit</span>(TableMetadata base, TableMetadata metadata) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (base <span style="color:#f92672">!=</span> current()) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (base <span style="color:#f92672">!=</span> <span style="color:#66d9ef">null</span>) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">throw</span> <span style="color:#66d9ef">new</span> CommitFailedException(<span style="color:#e6db74">"Cannot commit: stale table metadata"</span>);
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">throw</span> <span style="color:#66d9ef">new</span> AlreadyExistsException(<span style="color:#e6db74">"Table already exists: %s"</span>, tableName());
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (base <span style="color:#f92672">==</span> metadata) {
</span></span><span style="display:flex;"><span> LOG.<span style="color:#a6e22e">info</span>(<span style="color:#e6db74">"Nothing to commit."</span>);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span>;
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">long</span> start <span style="color:#f92672">=</span> System.<span style="color:#a6e22e">currentTimeMillis</span>();
</span></span><span style="display:flex;"><span> doCommit(base, metadata);
</span></span><span style="display:flex;"><span> deleteRemovedMetadataFiles(base, metadata);
</span></span><span style="display:flex;"><span> requestRefresh();
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> LOG.<span style="color:#a6e22e">info</span>(<span style="color:#e6db74">"Successfully committed to table {} in {} ms"</span>,
</span></span><span style="display:flex;"><span> tableName(),
</span></span><span style="display:flex;"><span> System.<span style="color:#a6e22e">currentTimeMillis</span>() <span style="color:#f92672">-</span> start);
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="write-issues">Write Issues</h2>
<h3 id="1-lots-of-small-files">1. Lots of Small Files</h3>
<p>Streaming writes generate new files on every checkpoint, producing many small files. Although object storage handles small files well, Iceberg metadata must track every data file, so manifest and metadata files grow over time and degrade both query planning and commit performance.</p>
<p><strong>Solution:</strong></p>
<ul>
<li><strong>Iceberg Rewrite Action</strong>: Iceberg supports rewriting data and metadata files via Flink or Spark actions, which need to be triggered separately.</li>
<li><strong>Snapshot Expiry</strong>: Configure snapshot expiration to periodically delete old snapshots.</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#f92672">import</span> org.apache.iceberg.flink.actions.Actions;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>TableLoader tableLoader <span style="color:#f92672">=</span> TableLoader.<span style="color:#a6e22e">fromHadoopTable</span>(<span style="color:#e6db74">"hdfs://nn:8020/warehouse/path"</span>);
</span></span><span style="display:flex;"><span>Table table <span style="color:#f92672">=</span> tableLoader.<span style="color:#a6e22e">loadTable</span>();
</span></span><span style="display:flex;"><span>RewriteDataFilesActionResult result <span style="color:#f92672">=</span> Actions.<span style="color:#a6e22e">forTable</span>(table)
</span></span><span style="display:flex;"><span> .<span style="color:#a6e22e">rewriteDataFiles</span>()
</span></span><span style="display:flex;"><span> .<span style="color:#a6e22e">execute</span>();
</span></span></code></pre></div><p><a href="https://iceberg.apache.org/docs/latest/flink/">Iceberg Flink Documentation</a>
<a href="https://iceberg.apache.org/docs/latest/maintenance/">Iceberg Maintenance Documentation</a></p>
<h3 id="2-performance-issues-with-high-concurrency">2. Performance Issues with High Concurrency</h3>
<p>Iceberg’s writing process creates a new snapshot for each commit and uses optimistic concurrency control to handle conflicts. In high-concurrency scenarios, this can lead to many commits being retried, impacting performance.</p>
<p><strong>Solution:</strong></p>
<ul>
<li><strong>Batch Commit</strong>: Introduce a caching layer or additional service to batch commits to the data lake, reducing the number of concurrent commit operations. This cache layer can also compact multiple data files before committing.</li>
</ul>
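<p>The batching idea can be sketched as a stand-alone class: write results are buffered as they arrive, and a table commit is issued only once a batch fills up, so many files share one snapshot. This is an illustration of the pattern, not part of the Flink Iceberg connector; the names <code>BatchingCommitter</code> and <code>commitFn</code> are hypothetical.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch of a batching commit layer: callers hand over write
// results as they arrive, but an actual table commit is issued only once
// per full batch, reducing snapshot churn under high concurrency.
// BatchingCommitter and commitFn are hypothetical names, not Iceberg APIs.
class BatchingCommitter {
    private final List<String> pending = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<String>> commitFn;
    private int commits = 0;

    BatchingCommitter(int batchSize, Consumer<List<String>> commitFn) {
        this.batchSize = batchSize;
        this.commitFn = commitFn;
    }

    void add(String writeResult) {
        pending.add(writeResult);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    void flush() {
        if (pending.isEmpty()) {
            return; // mirror the skipping of empty commits seen above
        }
        commitFn.accept(new ArrayList<>(pending)); // one snapshot for many files
        pending.clear();
        commits++;
    }

    int commitCount() {
        return commits;
    }
}
```

<p>A cache layer like this can also compact the buffered files before the single commit, addressing the small-file problem at the same time.</p>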
<blockquote>
<p>References:
<a href="https://zhuanlan.zhihu.com/p/472617094">Optimizing Iceberg Writes for High Concurrency</a>
<a href="https://www.infoq.cn/article/hfft7c7ahoomgayjsouz">InfoQ Article on Iceberg Optimization</a></p>
</blockquote>
<h3 id="3-flink-iceberg-connector-limitations">3. Flink Iceberg Connector Limitations</h3>
<p>The Flink Iceberg Connector does not support hidden partitions or preprocessing of partition fields.</p>
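<p>Hidden partitioning means the partition value is derived from a source column by a transform rather than stored as an explicit column, so a writer that lacks the feature must preprocess the field itself. The sketch below shows what a <code>day(timestamp)</code>-style transform computes (days since the Unix epoch); it is a self-contained illustration, not Iceberg&rsquo;s transform implementation, and the names are hypothetical.</p>

```java
import java.time.LocalDate;

// Sketch of a "day" partition transform like the one hidden partitioning
// applies internally: a timestamp column is mapped to days since the Unix
// epoch, so neither writers nor queries handle the partition column directly.
class DayTransform {
    // epochMillis -> ordinal day; floorDiv keeps pre-1970 timestamps correct
    static long toDayOrdinal(long epochMillis) {
        return Math.floorDiv(epochMillis, 86_400_000L);
    }

    // Human-readable partition path segment, e.g. "ts_day=2022-10-05"
    static String toPartitionPath(long epochMillis) {
        LocalDate day = LocalDate.ofEpochDay(toDayOrdinal(epochMillis));
        return "ts_day=" + day;
    }
}
```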
Apache-ORC Quick Investigation
https://noneback.github.io/blog/apacheorc%E8%B0%83%E7%A0%94/
Wed, 05 Oct 2022 19:56:01 +0800https://noneback.github.io/blog/apacheorc%E8%B0%83%E7%A0%94/<p>Iceberg supports both ORC and Parquet columnar formats. Compared to Parquet, ORC offers advantages in query performance and ACID support. Considering the future data lakehouse requirements for query performance and ACID compliance, we are researching ORC to support a future demo involving Flink, Iceberg, and ORC.</p>
<p>Research Focus: ORC file encoding, file organization, and indexing support.</p>
<h2 id="file-layout">File Layout</h2>
<p>An ORC file can be divided into three main sections:</p>
<ul>
<li><strong>Header</strong>: Identifies the file type.</li>
<li><strong>Body</strong>: Contains row data and indexes, as shown below.</li>
<li><strong>Tail</strong>: Contains top-level file information.</li>
</ul>
<blockquote>
<p>ORC Specification v1</p>
</blockquote>
<p><img alt="File Layout" src="https://intranetproxy.alipay.com/skylark/lark/0/2022/png/59256351/1654164675197-b3513a38-dee1-4fea-a582-1e800542dc06.png#clientId=ubff13205-800f-4&crop=0&crop=0&crop=1&crop=1&from=paste&id=udfb74091&margin=%5Bobject%20Object%5D&name=image.png&originHeight=567&originWidth=580&originalType=binary&ratio=1&rotation=0&showTitle=false&size=134693&status=done&style=none&taskId=u3792c89f-94b1-497c-81db-a5f9ae97297&title="></p>
<h3 id="file-tail">File Tail</h3>
<p>Since distributed storage generally supports only append-only semantics, the ORC file maintains a tail section for top-level file information.</p>
<p>The tail contains:</p>
<ul>
<li><strong>Postscript</strong>: Contains essential information for parsing the footer and metadata, such as the length of each section and compression method.</li>
<li><strong>Footer</strong>: Stores schema information, row count, column statistics, and more.</li>
<li><strong>Stripe Statistics and Metadata</strong>: Includes column-level statistics.</li>
</ul>
<h4 id="postscript">Postscript</h4>
<p>The postscript is uncompressed and contains:</p>
<ul>
<li>Footer length</li>
<li>Compression type</li>
<li>Metadata length</li>
<li>File identifier (“ORC”)</li>
</ul>
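<p>Because the file is written front-to-back, a reader bootstraps from the end: per the ORC specification, the last byte of the file stores the postscript length, which lets the reader slice out the postscript and, from the lengths inside it, locate the footer. A simplified sketch of that first step over an in-memory byte array (real postscripts are protobuf-encoded; this only shows the slicing mechanics):</p>

```java
import java.util.Arrays;

// Simplified sketch of the ORC tail bootstrap: the last byte of the file
// gives the postscript length, so the reader can slice the postscript and
// then use lengths stored inside it to find the footer. Real postscripts
// are protobuf messages; here we slice raw bytes to show the mechanics.
class TailReader {
    // Returns the postscript bytes (excluding the trailing length byte).
    static byte[] readPostscript(byte[] file) {
        int psLen = file[file.length - 1] & 0xFF; // last byte = postscript length
        int end = file.length - 1;                // postscript sits just before it
        return Arrays.copyOfRange(file, end - psLen, end);
    }
}
```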
<h4 id="footer">Footer</h4>
<p>The footer includes the schema, row count, column-level statistics, and a list of stripes that make up the file body.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-protobuf" data-lang="protobuf"><span style="display:flex;"><span><span style="color:#66d9ef">message</span> <span style="color:#a6e22e">Footer</span> {<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> headerLength <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> contentLength <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> StripeInformation stripes <span style="color:#f92672">=</span> <span style="color:#ae81ff">3</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> Type types <span style="color:#f92672">=</span> <span style="color:#ae81ff">4</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> UserMetadataItem metadata <span style="color:#f92672">=</span> <span style="color:#ae81ff">5</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> numberOfRows <span style="color:#f92672">=</span> <span style="color:#ae81ff">6</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> ColumnStatistics statistics <span style="color:#f92672">=</span> <span style="color:#ae81ff">7</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint32</span> rowIndexStride <span style="color:#f92672">=</span> <span style="color:#ae81ff">8</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint32</span> writer <span style="color:#f92672">=</span> <span style="color:#ae81ff">9</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> Encryption encryption <span style="color:#f92672">=</span> <span style="color:#ae81ff">10</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> stripeStatisticsLength <span style="color:#f92672">=</span> <span style="color:#ae81ff">11</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>}<span style="color:#960050;background-color:#1e0010">
</span></span></span></code></pre></div><ul>
<li><strong>Stripe Information</strong>: Data in the body is organized into multiple <strong>stripes</strong>. Each stripe contains a row index, row data (stored column-wise), and a stripe footer.</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-protobuf" data-lang="protobuf"><span style="display:flex;"><span><span style="color:#66d9ef">message</span> <span style="color:#a6e22e">StripeInformation</span> {<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> offset <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> indexLength <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> dataLength <span style="color:#f92672">=</span> <span style="color:#ae81ff">3</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> footerLength <span style="color:#f92672">=</span> <span style="color:#ae81ff">4</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> numberOfRows <span style="color:#f92672">=</span> <span style="color:#ae81ff">5</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>}<span style="color:#960050;background-color:#1e0010">
</span></span></span></code></pre></div><ul>
<li><strong>Type Information</strong>: ORC uses a tree structure to represent nested data types; every row in the file must conform to this single schema, as in the example below:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#66d9ef">create</span> <span style="color:#66d9ef">table</span> Foobar (
</span></span><span style="display:flex;"><span> myInt int,
</span></span><span style="display:flex;"><span> myMap <span style="color:#66d9ef">map</span><span style="color:#f92672"><</span>string, struct<span style="color:#f92672"><</span>myString : string, myDouble: double<span style="color:#f92672">>></span>,
</span></span><span style="display:flex;"><span> myTime <span style="color:#66d9ef">timestamp</span>
</span></span><span style="display:flex;"><span>);
</span></span></code></pre></div><ul>
<li><strong>Column Statistics</strong>: Simple statistics for each column are available to support coarse-grained filtering.</li>
</ul>
<h2 id="stripes">Stripes</h2>
<p>The body of an ORC file is split into <strong>stripes</strong>, which are large chunks of data (typically ~200MB) that contain:</p>
<ul>
<li><strong>Index Data</strong></li>
<li><strong>Row Data</strong></li>
<li><strong>Stripe Footer</strong></li>
</ul>
<p>The <strong>Stripe Footer</strong> holds column encoding details and stream-related information, such as compression and encryption methods.</p>
<h2 id="index-support">Index Support</h2>
<p>ORC supports three levels of indexing:</p>
<table>
<thead>
<tr>
<th>Level</th>
<th>Location</th>
<th>Data Content</th>
</tr>
</thead>
<tbody>
<tr>
<td>File Level</td>
<td>File Footer</td>
<td>Column-level statistics for the entire file</td>
</tr>
<tr>
<td>Stripe Level</td>
<td>File Footer</td>
<td>Column-level statistics for each stripe</td>
</tr>
<tr>
<td>Row Level</td>
<td>Beginning of Stripe</td>
<td>Statistics for each row group and their start position</td>
</tr>
</tbody>
</table>
<h3 id="row-level-index">Row Level Index</h3>
<p>The row-level index contains <strong>Row Group Index</strong> and <strong>Bloom Filter Index</strong>.</p>
<h4 id="row-group-index">Row Group Index</h4>
<p>Indexes for primitive types are represented by <strong>ROW_INDEX</strong> streams, with each row group containing a <strong>RowIndexEntry</strong>.</p>
<blockquote>
<p>Default row group size: 10,000 rows</p>
</blockquote>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-protobuf" data-lang="protobuf"><span style="display:flex;"><span><span style="color:#66d9ef">message</span> <span style="color:#a6e22e">RowIndex</span> {<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> RowIndexEntry entry <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>}<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span><span style="color:#66d9ef">message</span> <span style="color:#a6e22e">RowIndexEntry</span> {<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> <span style="color:#66d9ef">uint64</span> positions <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span> [<span style="color:#66d9ef">packed</span><span style="color:#f92672">=</span><span style="color:#66d9ef">true</span>];<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> ColumnStatistics statistics <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>}<span style="color:#960050;background-color:#1e0010">
</span></span></span></code></pre></div><h4 id="bloom-filter-index">Bloom Filter Index</h4>
<p>Each column has a <strong>BLOOM_FILTER</strong> stream to help speed up searches.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-protobuf" data-lang="protobuf"><span style="display:flex;"><span><span style="color:#66d9ef">message</span> <span style="color:#a6e22e">BloomFilter</span> {<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint32</span> numHashFunctions <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> <span style="color:#66d9ef">fixed64</span> bitset <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>}<span style="color:#960050;background-color:#1e0010">
</span></span></span></code></pre></div><h2 id="data-access-path">Data Access Path</h2>
<ul>
<li><strong>Postscript</strong> -> <strong>Footer</strong> -> Retrieve Stripe Information -> <strong>Stripe Footer</strong> -> <strong>Stripe Index</strong> -> <strong>Row Group</strong> -> <strong>Column</strong></li>
</ul>
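<p>The payoff of this layered layout is that a reader can prune with column statistics at every level before touching row data. The following self-contained sketch shows min/max pruning at the row-group level; it illustrates the idea and is not the ORC reader API.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of stats-based pruning along the ORC read path: each row group
// carries min/max column statistics, and a predicate like "col = v" only
// needs to scan the groups whose [min, max] range can contain v.
class RowGroupPruner {
    static class Stats {
        final long min, max;
        Stats(long min, long max) { this.min = min; this.max = max; }
    }

    // Returns indices of row groups that might contain `value`.
    static List<Integer> candidateGroups(List<Stats> groups, long value) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < groups.size(); i++) {
            Stats s = groups.get(i);
            if (value >= s.min && value <= s.max) {
                hits.add(i); // cannot be ruled out; must be scanned
            }
        }
        return hits;
    }
}
```

<p>The same min/max check applies at the file and stripe levels, and the bloom filter streams refine it further for point lookups.</p>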
<h2 id="references">References</h2>
<ul>
<li><a href="https://webcdn.nexla.com/n3x_ctx/uploads/2018/05/An-Introduction-to-Big-Data-Formats-Nexla.pdf">An Introduction to Big Data Formats - Nexla</a></li>
<li><a href="https://orc.apache.org/docs/">ORC Documentation</a></li>
<li><a href="https://orc.apache.org/specification/ORCv1/">ORC Specification</a></li>
<li><a href="https://cloud.tencent.com/developer/article/1757862">ORC Article by Tencent</a></li>
</ul>
Apache-Iceberg Quick Investigation
https://noneback.github.io/blog/apacheiceberg%E8%B0%83%E7%A0%94/
Wed, 05 Oct 2022 19:55:54 +0800https://noneback.github.io/blog/apacheiceberg%E8%B0%83%E7%A0%94/<ul>
<li>A table format for large-scale analysis of datasets.</li>
<li>A specification for organizing data files and metadata files.</li>
<li>A schema semantic abstraction between storage and computation.</li>
<li>Developed and open-sourced by Netflix to enhance scalability, reliability, and usability.</li>
</ul>
<h2 id="background">Background</h2>
<p>Issues encountered when migrating HIVE to the cloud:</p>
<ul>
<li>Hive&rsquo;s dependency on directory List and Rename semantics makes it impossible to replace HDFS with cheaper object storage (OSS).</li>
<li>Scalability issues: Schema information in Hive is centrally stored in metastore, which can become a performance bottleneck.</li>
<li>Other issues: unsafe metadata operations, unfriendliness to cost-based optimization (CBO), etc.</li>
</ul>
<h2 id="features">Features</h2>
<ul>
<li>Supports safe and efficient schema and partition changes and evolution, self-defined schemas, and hidden partitioning.
<ul>
<li>Iceberg abstracts its own schema, not tied to any compute engine&rsquo;s schema; partitioning is maintained at the schema level. Partition and sort-order fields take transform functions, such as <code>date(timestamp)</code>.</li>
</ul>
</li>
<li>Supports object storage with minimal dependency on FS semantics.</li>
<li>ACID semantics, with parallel reads and serialized write operations:
<ul>
<li>Separation of read and write snapshots.</li>
<li>Write conflicts are handled optimistically: the conflicting writer retries so its commit eventually succeeds.</li>
</ul>
</li>
<li>Snapshot support:
<ul>
<li>Data rollback and time travel.</li>
<li>Supports snapshot expiration (by default, data files are not deleted, but customizable deletion behavior is available) (<a href="https://iceberg.apache.org/javadoc/0.13.1/org/apache/iceberg/ExpireSnapshots.html">related API doc</a>).</li>
<li>Incremental reading can be achieved by comparing snapshot differences.</li>
</ul>
</li>
<li>Query optimization-friendly: predicate pushdown, data file statistics. Currently, compaction is not supported, but invalid files can be deleted during snapshot expiration (<a href="https://iceberg.apache.org/javadoc/0.13.1/org/apache/iceberg/ExpireSnapshots.html#deleteWith-java.util.function.Consumer-">deleteWith</a>).</li>
<li>High level of abstraction, easy to modify, optimize, and extend. The catalog, read/write paths, file formats, and storage dependencies are all pluggable. Iceberg&rsquo;s design goal is to define a standard, open, and general data-organization format that hides differences in underlying storage formats and provides a unified operational API through which different engines connect.</li>
<li>Others: file-level encryption and decryption.</li>
</ul>
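<p>Hidden partitioning from the feature list can be illustrated with a tiny stdlib sketch: the partition spec stores a transform (e.g. <code>day</code>) applied to a source column, so users filter on the raw timestamp and the engine derives the partition value itself. This is conceptual only, not the Iceberg Java API:</p>

```python
from datetime import datetime, timezone

# Conceptual sketch of Iceberg's hidden partitioning: a transform
# declared in the partition spec maps a source column to a partition
# value; queries filter on the raw column, never the derived value.
def day_transform(ts: datetime) -> str:
    """Like Iceberg's day(ts) transform: timestamp -> day partition value."""
    return ts.strftime("%Y-%m-%d")

ts = datetime(2022, 10, 5, 19, 55, tzinfo=timezone.utc)
print(day_transform(ts))  # 2022-10-05
```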
<h2 id="ecosystem">Ecosystem</h2>
<ul>
<li>Community support for OSS, Flink, Spark, and Presto:
<ul>
<li>Flink (<a href="https://iceberg.apache.org/docs/latest/flink/">detail</a>): Supports streaming reads and writes, incremental reads (based on snapshot), upsert write (<a href="https://iceberg.apache.org/releases/#0130-release-notes">0.13.0-release-notes</a>).</li>
<li>Presto: <a href="https://prestodb.io/docs/current/connector/iceberg.html">Iceberg connector</a>.</li>
<li>Aliyun OSS: <a href="https://github.com/apache/iceberg/pull/3686/files"># pr 3689</a>.</li>
</ul>
</li>
<li>Integration with other components:
<ul>
<li>Integration with lower storage layers: Only relies on three semantics: In-place write, Seekable reads, Deletes, supports AliOSS (<a href="https://github.com/apache/iceberg/pull/3686/files"># pr 3689</a>).</li>
<li>Integration with other file formats: High abstraction level, currently supports Avro, Parquet, ORC.</li>
<li>Catalog: Customizable (<a href="https://iceberg.apache.org/docs/latest/custom-catalog/">Doc: Custom Catalog Implementation</a>), currently supports JDBC, Hive Metastore, Hadoop, etc.</li>
<li>Integration with the computation layer: provides native Java & Python APIs with a high level of abstraction, supporting most computation engines.</li>
</ul>
</li>
<li>An open, vendor-neutral community, where contributions can build influence.</li>
</ul>
<h2 id="table-specification">Table Specification</h2>
<p>Specification for organizing data files and metadata files.</p>
<p><img alt="img" src="https://iceberg.apache.org/img/iceberg-metadata.png"></p>
<h3 id="case-spark--iceberg--local-fs">Case: Spark + Iceberg + Local FS</h3>
<p>Iceberg supports Parquet, Avro, ORC file formats.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">#</span> <span style="color:#960050;background-color:#1e0010">Storage</span> <span style="color:#960050;background-color:#1e0010">organization</span>
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">test</span><span style="color:#ae81ff">2</span>
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">├──</span> <span style="color:#960050;background-color:#1e0010">data</span>
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">│</span> <span style="color:#960050;background-color:#1e0010">├──</span> <span style="color:#ae81ff">00000-1</span><span style="color:#960050;background-color:#1e0010">-ccff</span><span style="color:#ae81ff">6767-12</span><span style="color:#960050;background-color:#1e0010">cc</span><span style="color:#ae81ff">-481</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">-93</span><span style="color:#960050;background-color:#1e0010">fc-db</span><span style="color:#ae81ff">9</span><span style="color:#960050;background-color:#1e0010">f</span><span style="color:#ae81ff">1</span><span style="color:#960050;background-color:#1e0010">a</span><span style="color:#ae81ff">57438</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">-00001</span><span style="color:#960050;background-color:#1e0010">.parquet</span>
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">│</span> <span style="color:#960050;background-color:#1e0010">└──</span> <span style="color:#ae81ff">00001-2-6</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">1e5</span><span style="color:#960050;background-color:#1e0010">a</span><span style="color:#ae81ff">0</span><span style="color:#960050;background-color:#1e0010">b</span><span style="color:#ae81ff">-89</span><span style="color:#960050;background-color:#1e0010">fe</span><span style="color:#ae81ff">-4e77</span><span style="color:#960050;background-color:#1e0010">-b</span><span style="color:#ae81ff">90</span><span style="color:#960050;background-color:#1e0010">a</span><span style="color:#ae81ff">-1773</span><span style="color:#960050;background-color:#1e0010">a</span><span style="color:#ae81ff">7</span><span style="color:#960050;background-color:#1e0010">fbbcc</span><span style="color:#ae81ff">8-00001</span><span style="color:#960050;background-color:#1e0010">.parquet</span>
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">└──</span> <span style="color:#960050;background-color:#1e0010">metadata</span>
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">├──</span> <span style="color:#ae81ff">2</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">1</span><span style="color:#960050;background-color:#1e0010">dc</span><span style="color:#ae81ff">0e8</span><span style="color:#ae81ff">-1843-4</span><span style="color:#960050;background-color:#1e0010">cb</span><span style="color:#ae81ff">9-9</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">55</span><span style="color:#960050;background-color:#1e0010">-ae</span><span style="color:#ae81ff">43</span><span style="color:#960050;background-color:#1e0010">f</span><span style="color:#ae81ff">800</span><span style="color:#960050;background-color:#1e0010">bf</span><span style="color:#ae81ff">3</span><span style="color:#960050;background-color:#1e0010">f-m</span><span style="color:#ae81ff">0</span><span style="color:#960050;background-color:#1e0010">.avro</span> <span style="color:#75715e">// manifest file
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#960050;background-color:#1e0010">├──</span> <span style="color:#960050;background-color:#1e0010">snap</span><span style="color:#ae81ff">-8512048775051875497-1-2</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">1</span><span style="color:#960050;background-color:#1e0010">dc</span><span style="color:#ae81ff">0e8</span><span style="color:#ae81ff">-1843-4</span><span style="color:#960050;background-color:#1e0010">cb</span><span style="color:#ae81ff">9-9</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">55</span><span style="color:#960050;background-color:#1e0010">-ae</span><span style="color:#ae81ff">43</span><span style="color:#960050;background-color:#1e0010">f</span><span style="color:#ae81ff">800</span><span style="color:#960050;background-color:#1e0010">bf</span><span style="color:#ae81ff">3</span><span style="color:#960050;background-color:#1e0010">f.avro</span> <span style="color:#75715e">// manifest list file
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#960050;background-color:#1e0010">├──</span> <span style="color:#960050;background-color:#1e0010">v</span><span style="color:#ae81ff">1</span><span style="color:#960050;background-color:#1e0010">.metadata.json</span> <span style="color:#75715e">// metadata file
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#960050;background-color:#1e0010">├──</span> <span style="color:#960050;background-color:#1e0010">v</span><span style="color:#ae81ff">2</span><span style="color:#960050;background-color:#1e0010">.metadata.json</span>
</span></span><span style="display:flex;"><span> <span style="color:#960050;background-color:#1e0010">└──</span> <span style="color:#960050;background-color:#1e0010">version-hint.text</span> <span style="color:#75715e">// catalog
</span></span></span></code></pre></div><h4 id="datafile">DataFile</h4>
<p>Data files in columnar format: Parquet, ORC.</p>
<p>There are three types of data files: data files, position delete files, and equality delete files.</p>
<h4 id="manifest-file">Manifest File</h4>
<p>Indexes data files, including statistics and partition information.</p>
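<p>These statistics are what make file pruning possible: an engine compares a predicate value against the per-column lower/upper bounds recorded in each manifest entry and skips files that cannot match. A hedged sketch over the Avro-as-JSON field names used in the dump below (illustrative only; real readers use Iceberg&rsquo;s metrics evaluators):</p>

```python
# Sketch: pruning a data file using manifest column bounds.
# `entry` mimics a manifest entry as dumped in this section; bounds
# are keyed by field id. Illustrative only, not Iceberg's API.
def file_may_match(entry: dict, field_id: int, value) -> bool:
    df = entry["data_file"]
    lower = {b["key"]: b["value"] for b in df["lower_bounds"]["array"]}
    upper = {b["key"]: b["value"] for b in df["upper_bounds"]["array"]}
    if field_id not in lower or field_id not in upper:
        return True  # no stats recorded -> cannot prune, must read the file
    return lower[field_id] <= value <= upper[field_id]

entry = {"data_file": {
    "lower_bounds": {"array": [{"key": 2, "value": "a"}]},
    "upper_bounds": {"array": [{"key": 2, "value": "m"}]},
}}
print(file_may_match(entry, 2, "c"))  # True  (within [a, m])
print(file_may_match(entry, 2, "z"))  # False (outside bounds -> skip file)
```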
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"status"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"snapshot_id"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"long"</span>:<span style="color:#ae81ff">1274364374047997583</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"data_file"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"file_path"</span>:<span style="color:#e6db74">"/tmp/warehouse/db/test3/data/id=1/00000-31-401a9d2e-d501-434c-a38f-5df5f08ebbd7-00001.parquet"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"file_format"</span>:<span style="color:#e6db74">"PARQUET"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"partition"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"id"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"long"</span>:<span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"record_count"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"file_size_in_bytes"</span>:<span style="color:#ae81ff">643</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"block_size_in_bytes"</span>:<span style="color:#ae81ff">67108864</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"column_sizes"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#ae81ff">46</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">2</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#ae81ff">48</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value_counts"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">2</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"null_value_counts"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">2</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"nan_value_counts"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"lower_bounds"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#e6db74">"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">2</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#e6db74">"a"</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"upper_bounds"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#e6db74">"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">2</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#e6db74">"a"</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key_metadata"</span>:<span style="color:#66d9ef">null</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"split_offsets"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> <span style="color:#ae81ff">4</span>
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"sort_order_id"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"int"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// another data file meta
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> }
</span></span><span style="display:flex;"><span>]
</span></span></code></pre></div><h4 id="snapshot">Snapshot</h4>
<ul>
<li>Represents the state of a Table at a specific point in time, saved via a Manifest List File.</li>
<li><strong>A new Snapshot is generated every time a data change is made to the Table.</strong></li>
</ul>
<h4 id="manifest-list-file">Manifest List File</h4>
<ul>
<li>Contains information about all Manifest files in a Snapshot, as well as partition stats and data file count.</li>
<li>One Snapshot corresponds to one Manifest List File, and each submission generates a manifest list file.</li>
<li>Optimistic concurrency: when concurrent Snapshot commits conflict, the later commit <strong>retries</strong> until it succeeds.</li>
</ul>
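<p>That retry behavior amounts to a compare-and-swap loop over the current-metadata pointer. A toy in-memory stand-in (the class and function names here are illustrative; Iceberg&rsquo;s real commit path goes through the catalog&rsquo;s atomic swap of the metadata file location):</p>

```python
import threading

# Toy model of Iceberg-style optimistic commits: the catalog holds a
# single "current metadata version" pointer that writers swap atomically;
# a writer that loses the race rebases and retries.
class InMemoryCatalog:
    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0

    def compare_and_swap(self, expected: int, new: int) -> bool:
        """Advance the pointer only if no other writer committed first."""
        with self._lock:
            if self.version != expected:
                return False  # another commit won; caller must retry
            self.version = new
            return True

def commit(catalog: InMemoryCatalog, max_retries: int = 3) -> int:
    """Optimistic commit loop: read base version, attempt CAS, retry on conflict."""
    for _ in range(max_retries):
        base = catalog.version  # read the current pointer
        if catalog.compare_and_swap(base, base + 1):
            return base + 1
    raise RuntimeError("commit failed after repeated conflicts")

cat = InMemoryCatalog()
print(commit(cat))  # 1
print(commit(cat))  # 2
```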
<blockquote>
<p>Each manifest list stores metadata about manifests, including partition stats and data file counts.</p>
</blockquote>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"manifest_path"</span>:<span style="color:#e6db74">"/tmp/warehouse/db/test3/metadata/f22b748f-a7bc-4e4c-ad6c-3e335c1c0c2b-m0.avro"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"manifest_length"</span>:<span style="color:#ae81ff">6019</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"partition_spec_id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"added_snapshot_id"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"long"</span>:<span style="color:#ae81ff">1274364374047997583</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"added_data_files_count"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"int"</span>:<span style="color:#ae81ff">2</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"existing_data_files_count"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"int"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"deleted_data_files_count"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"int"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"partitions"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"contains_null"</span>:<span style="color:#66d9ef">false</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"contains_nan"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"boolean"</span>:<span style="color:#66d9ef">false</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"lower_bound"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"bytes"</span>:<span style="color:#e6db74">"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"upper_bound"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"bytes"</span>:<span style="color:#e6db74">"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"added_rows_count"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"long"</span>:<span style="color:#ae81ff">2</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"existing_rows_count"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"long"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"deleted_rows_count"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"long"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// another manifest file
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> }
</span></span><span style="display:flex;"><span>]
</span></span></code></pre></div><h4 id="metadata-file">Metadata File</h4>
<p>Tracks the state of the table. When the state changes, a new metadata file is generated and replaces the previous one, <strong>ensuring atomicity</strong>.</p>
<blockquote>
<p>The table metadata file tracks the <strong>table schema</strong>, <strong>partitioning config</strong>, custom properties, and <strong>snapshots</strong> of the table contents.</p>
</blockquote>
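<p>Time travel falls out of this structure: the engine resolves a snapshot from the metadata file&rsquo;s <code>snapshots</code> list by timestamp. A minimal sketch over the JSON layout shown below (illustrative; engines surface this through query syntax or reader options):</p>

```python
# Sketch: resolving a snapshot for time travel from table metadata.
# `metadata` mirrors the metadata file JSON dumped in this section.
def snapshot_as_of(metadata: dict, ts_ms: int):
    """Return the latest snapshot committed at or before ts_ms, else None."""
    eligible = [s for s in metadata["snapshots"] if s["timestamp-ms"] <= ts_ms]
    return max(eligible, key=lambda s: s["timestamp-ms"], default=None)

metadata = {"snapshots": [
    {"snapshot-id": 1, "timestamp-ms": 1000},
    {"snapshot-id": 2, "timestamp-ms": 2000},
]}
print(snapshot_as_of(metadata, 1500)["snapshot-id"])  # 1
print(snapshot_as_of(metadata, 500))                  # None
```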
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"format-version"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"table-uuid"</span>:<span style="color:#e6db74">"175d0b61-8507-40b2-9c19-3338b05f3d48"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"location"</span>:<span style="color:#e6db74">"/tmp/warehouse/db/test3"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"last-updated-ms"</span>:<span style="color:#ae81ff">1653387947819</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"last-column-id"</span>:<span style="color:#ae81ff">2</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"schema"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"type"</span>:<span style="color:#e6db74">"struct"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"schema-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"fields"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"id"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"name"</span>:<span style="color:#e6db74">"id"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"required"</span>:<span style="color:#66d9ef">false</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"type"</span>:<span style="color:#e6db74">"long"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#960050;background-color:#1e0010">Object</span>{<span style="color:#960050;background-color:#1e0010">...</span>}
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"current-schema-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"schemas"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"type"</span>:<span style="color:#e6db74">"struct"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"schema-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"fields"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"id"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"name"</span>:<span style="color:#e6db74">"id"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"required"</span>:<span style="color:#66d9ef">false</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"type"</span>:<span style="color:#e6db74">"long"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#960050;background-color:#1e0010">Object</span>{<span style="color:#960050;background-color:#1e0010">...</span>}
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ],
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"partition-spec"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"name"</span>:<span style="color:#e6db74">"id"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"transform"</span>:<span style="color:#e6db74">"identity"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"source-id"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"field-id"</span>:<span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ],
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"default-spec-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"partition-specs"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"spec-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"fields"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"name"</span>:<span style="color:#e6db74">"id"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"transform"</span>:<span style="color:#e6db74">"identity"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"source-id"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"field-id"</span>:<span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ],
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"last-partition-id"</span>:<span style="color:#ae81ff">1000</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"default-sort-order-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"sort-orders"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"order-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"fields"</span>:[
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ],
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"properties"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"owner"</span>:<span style="color:#e6db74">"chenlan"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"current-snapshot-id"</span>:<span style="color:#ae81ff">1274364374047997700</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"snapshots"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"snapshot-id"</span>:<span style="color:#ae81ff">1274364374047997700</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"timestamp-ms"</span>:<span style="color:#ae81ff">1653387947819</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"summary"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"operation"</span>:<span style="color:#e6db74">"append"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"spark.app.id"</span>:<span style="color:#e6db74">"local-1653381214613"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"added-data-files"</span>:<span style="color:#e6db74">"2"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"added-records"</span>:<span style="color:#e6db74">"2"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"added-files-size"</span>:<span style="color:#e6db74">"1286"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"changed-partition-count"</span>:<span style="color:#e6db74">"2"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"total-records"</span>:<span style="color:#e6db74">"2"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"total-files-size"</span>:<span style="color:#e6db74">"1286"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"total-data-files"</span>:<span style="color:#e6db74">"2"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"total-delete-files"</span>:<span style="color:#e6db74">"0"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"total-position-deletes"</span>:<span style="color:#e6db74">"0"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"total-equality-deletes"</span>:<span style="color:#e6db74">"0"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"manifest-list"</span>:<span style="color:#e6db74">"/tmp/warehouse/db/test3/metadata/snap-1274364374047997583-1-f22b748f-a7bc-4e4c-ad6c-3e335c1c0c2b.avro"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"schema-id"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ],
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"snapshot-log"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"timestamp-ms"</span>:<span style="color:#ae81ff">1653387947819</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"snapshot-id"</span>:<span style="color:#ae81ff">1274364374047997700</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> ],
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"metadata-log"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"timestamp-ms"</span>:<span style="color:#ae81ff">1653387937345</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"metadata-file"</span>:<span style="color:#e6db74">"/tmp/warehouse/db/test3/metadata/v1.metadata.json"</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="catalog">Catalog</h4>
<p>Records the latest metadata file path.</p>
<p><img alt="img" src="https://pic3.zhimg.com/80/v2-27edee80bbac03b462898a0564722a56_1440w.jpg"></p>
<h3 id="features-1">Features</h3>
<ul>
<li>ACID semantics guarantee: Atomic table state changes + snapshot-based reads and writes.</li>
<li>Flexible partition management: hidden partition, seamless partition changes.</li>
<li>Supports incremental reads: incremental read of each change using snapshots.</li>
<li>Multi-version data: beneficial for data rollback.</li>
<li>No side effects, safe schema, and partition changes.</li>
</ul>
<h3 id="data-types">Data Types</h3>
<p>Iceberg defines its own type system, which is mapped onto the concrete types of each underlying data file format (Parquet, ORC, Avro).</p>
<ul>
<li>
<p>Nested Types:</p>
<ul>
<li>struct: A tuple of typed values.</li>
<li>list: A collection of values with an element type.</li>
<li>map: A collection of key-value pairs with a key type and a value type.</li>
</ul>
</li>
<li>
<p>Primitive Types:</p>
</li>
</ul>
<table>
<thead>
<tr>
<th>Primitive type</th>
<th>Description</th>
<th>Requirements</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>boolean</strong></td>
<td>True or false</td>
<td></td>
</tr>
<tr>
<td><strong>int</strong></td>
<td>32-bit signed integers</td>
<td>Can promote to <code>long</code></td>
</tr>
<tr>
<td><strong>long</strong></td>
<td>64-bit signed integers</td>
<td></td>
</tr>
<tr>
<td><strong>float</strong></td>
<td><a href="https://en.wikipedia.org/wiki/IEEE_754">32-bit IEEE 754</a> floating point</td>
<td>Can promote to double</td>
</tr>
<tr>
<td><strong>double</strong></td>
<td><a href="https://en.wikipedia.org/wiki/IEEE_754">64-bit IEEE 754</a> floating point</td>
<td></td>
</tr>
<tr>
<td><strong>decimal(P,S)</strong></td>
<td>Fixed-point decimal; precision P, scale S</td>
<td>Scale is fixed [1], precision must be 38 or less</td>
</tr>
<tr>
<td><strong>date</strong></td>
<td>Calendar date without timezone or time</td>
<td></td>
</tr>
<tr>
<td><strong>time</strong></td>
<td>Time of day without date, timezone</td>
<td>Microsecond precision [2]</td>
</tr>
<tr>
<td><strong>timestamp</strong></td>
<td>Timestamp without timezone</td>
<td>Microsecond precision [2]</td>
</tr>
<tr>
<td><strong>timestamptz</strong></td>
<td>Timestamp with timezone</td>
<td>Stored as UTC [2]</td>
</tr>
<tr>
<td><strong>string</strong></td>
<td>Arbitrary-length character sequences</td>
<td>Encoded with UTF-8 [3]</td>
</tr>
<tr>
<td><strong>uuid</strong></td>
<td>Universally unique identifiers</td>
<td>Should use 16-byte fixed</td>
</tr>
<tr>
<td><strong>fixed(L)</strong></td>
<td>Fixed-length byte array of length L</td>
<td></td>
</tr>
<tr>
<td><strong>binary</strong></td>
<td>Arbitrary-length byte array</td>
<td></td>
</tr>
</tbody>
</table>
<h3 id="read--write-paths">Read & Write Paths</h3>
<p>select: catalog -> metadata file -> manifest list file -> manifest file -> data file -> row group.</p>
<p>insert: the reverse order. Data files are written first, then manifest files and the manifest list, and finally the catalog pointer is swapped to the new metadata file.</p>
<p>update: implemented as delete plus insert, producing data files along with position delete files and equality delete files.</p>
<p>Position delete files handle the case where the same row is inserted and then deleted within a single transaction.</p>
<p>delete: row-level delete.</p>
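<p>The select path above can be sketched end to end. The following Python sketch is illustrative only: the dict layout loosely mirrors the metadata JSON shown earlier, and <code>plan_scan</code> is a made-up name, not an Iceberg API.</p>

```python
# Illustrative sketch (not Iceberg's real API): resolving a read along the
# select path -- catalog -> metadata -> manifest list -> manifests -> data files.

def plan_scan(catalog, table_name):
    """Walk the metadata chain and return the data files a scan would read."""
    metadata = catalog[table_name]                  # catalog stores the latest metadata
    snap_id = metadata["current-snapshot-id"]
    snapshot = next(s for s in metadata["snapshots"]
                    if s["snapshot-id"] == snap_id)
    manifest_list = snapshot["manifest-list"]       # one list per snapshot
    data_files = []
    for manifest in manifest_list:                  # each manifest tracks data files
        data_files.extend(manifest["data-files"])
    return data_files

catalog = {
    "db.test3": {
        "current-snapshot-id": 1274364374047997700,
        "snapshots": [{
            "snapshot-id": 1274364374047997700,
            "manifest-list": [
                {"data-files": ["data/a.parquet"]},
                {"data-files": ["data/b.parquet"]},
            ],
        }],
    }
}
print(plan_scan(catalog, "db.test3"))   # -> ['data/a.parquet', 'data/b.parquet']
```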
<h2 id="references">References</h2>
<p><a href="https://iceberg.apache.org/spec/">Iceberg Spec</a></p>
<p><a href="https://www.datatong.net/thread-39745-1-1.html">Flink+Iceberg Data Lake Construction</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/347660549">Construction Practice of Real-time Data Warehouse with Flink + Iceberg (Chinese)</a></p>
<p><a href="https://github.com/apache/iceberg/pull/3686">Iceberg Aliyun OSS</a></p>
<p><a href="https://iceberg.apache.org/docs/latest/flink/">Iceberg Flink Support</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/305746643">Building Enterprise-grade Real-time Data Lake with Flink + Iceberg</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/353030161">How Flink Analyzes CDC Data in Iceberg Real-time Data Lake</a></p>
<p><a href="https://github.com/apache/iceberg">Iceberg GitHub</a></p>
<p><a href="https://docs.alluxio.io/os/user/stable/en/api/POSIX-API.html">Alluxio POSIX API</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/110748218">Comparison of Delta, Iceberg, and Hudi Open-source Data Lake Solutions</a></p>
LevelDB Write
https://noneback.github.io/blog/leveldb-write/
Tue, 10 May 2022 17:14:14 +0800https://noneback.github.io/blog/leveldb-write/<p>This is the second chapter of my notes on reading the LevelDB source code, focusing on the write flow of LevelDB. This article is not a step-by-step source code tutorial, but rather a learning note that records my questions and thoughts.</p>
<h2 id="main-process">Main Process</h2>
<p>The main write logic of LevelDB is relatively simple. First, the write operation is encapsulated into a <code>WriteBatch</code>, and then it is executed.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span>Status DB<span style="color:#f92672">::</span>Put(<span style="color:#66d9ef">const</span> WriteOptions<span style="color:#f92672">&</span> opt, <span style="color:#66d9ef">const</span> Slice<span style="color:#f92672">&</span> key, <span style="color:#66d9ef">const</span> Slice<span style="color:#f92672">&</span> value) {
</span></span><span style="display:flex;"><span> WriteBatch batch;
</span></span><span style="display:flex;"><span> batch.Put(key, value);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">Write</span>(opt, <span style="color:#f92672">&</span>batch);
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h3 id="writebatch">WriteBatch</h3>
<p><code>WriteBatch</code> is an encapsulation of a group of update operations, which are applied <strong>atomically</strong> to the state machine. A block of memory is used to save the user’s update operations.</p>
<blockquote>
<p>InMemory Format:
<code>| seq_num: 8 bytes | count: 4 bytes | list of records{ type + key + value}</code></p>
</blockquote>
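<p>The <code>rep_</code> layout in that comment can be reproduced in a few lines. The sketch below is a toy Python re-implementation of the encoding (fixed64 sequence, fixed32 count, varint-length-prefixed records); it mirrors LevelDB's field names but is not its real code.</p>

```python
import struct

# Toy model of WriteBatch::rep_:
# | seq (fixed64) | count (fixed32) | records... |, each record being a
# type byte + length-prefixed key (+ length-prefixed value for puts).

kTypeDeletion, kTypeValue = 0, 1

def put_varint32(buf, n):
    # LevelDB's varint encoding: 7 payload bits per byte, MSB = continuation
    while n >= 0x80:
        buf.append((n & 0x7F) | 0x80)
        n >>= 7
    buf.append(n)

def batch_put(rep, key, value):
    count = struct.unpack_from("<I", rep, 8)[0]
    struct.pack_into("<I", rep, 8, count + 1)   # bump the fixed32 count in place
    rep.append(kTypeValue)
    put_varint32(rep, len(key)); rep.extend(key)
    put_varint32(rep, len(value)); rep.extend(value)

rep = bytearray(12)          # 8-byte sequence + 4-byte count, both zero
batch_put(rep, b"k1", b"v1")
batch_put(rep, b"k2", b"v2")
print(struct.unpack_from("<I", rep, 8)[0])   # -> 2
```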
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">WriteBatch</span> {
</span></span><span style="display:flex;"><span> ...
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// See comment in write_batch.cc for the format of rep_;
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// WriteBatch::rep_ :=
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// sequence: fixed64
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// count: fixed32
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// data: record[count]
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// record :=
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// kTypeValue varstring varstring |
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// kTypeDeletion varstring
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// varstring :=
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// len: varint32
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// data: uint8[len]
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> std<span style="color:#f92672">::</span>string rep_;
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span><span style="color:#75715e">// some opt
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#66d9ef">void</span> WriteBatch<span style="color:#f92672">::</span>Put(<span style="color:#66d9ef">const</span> Slice<span style="color:#f92672">&</span> key, <span style="color:#66d9ef">const</span> Slice<span style="color:#f92672">&</span> value) {
</span></span><span style="display:flex;"><span> WriteBatchInternal<span style="color:#f92672">::</span>SetCount(<span style="color:#66d9ef">this</span>, WriteBatchInternal<span style="color:#f92672">::</span>Count(<span style="color:#66d9ef">this</span>) <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>);
</span></span><span style="display:flex;"><span> rep_.push_back(<span style="color:#66d9ef">static_cast</span><span style="color:#f92672"><</span><span style="color:#66d9ef">char</span><span style="color:#f92672">></span>(kTypeValue));
</span></span><span style="display:flex;"><span> PutLengthPrefixedSlice(<span style="color:#f92672">&</span>rep_, key);
</span></span><span style="display:flex;"><span> PutLengthPrefixedSlice(<span style="color:#f92672">&</span>rep_, value);
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="write-flow">Write Flow</h2>
<p>The write operation mainly consists of four steps:</p>
<h3 id="initializing-writer">Initializing Writer</h3>
<p><code>Writer</code> actually contains all the information needed for a write operation.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#75715e">// Information kept for every waiting writer
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">DBImpl</span><span style="color:#f92672">::</span>Writer {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">explicit</span> <span style="color:#a6e22e">Writer</span>(port<span style="color:#f92672">::</span>Mutex<span style="color:#f92672">*</span> mu)
</span></span><span style="display:flex;"><span> <span style="color:#f92672">:</span> batch(<span style="color:#66d9ef">nullptr</span>), sync(false), done(false), cv(mu) {}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> Status status;
</span></span><span style="display:flex;"><span> WriteBatch<span style="color:#f92672">*</span> batch;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> sync;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> done;
</span></span><span style="display:flex;"><span> port<span style="color:#f92672">::</span>CondVar cv;
</span></span><span style="display:flex;"><span>};
</span></span></code></pre></div><h3 id="writer-scheduling">Writer Scheduling</h3>
<p>LevelDB’s write process is a multi-producer, single-consumer model: multiple threads enqueue writers, but at any given moment only one thread (the one at the head of the queue) consumes them. A writer is therefore produced by exactly one thread, yet its operation may end up being executed by a different thread on its behalf.</p>
<p>Internally, LevelDB maintains a writer queue. Each thread’s write, delete, or update operation appends a writer to the end of the queue, with a lock ensuring data safety. Once a writer is added to the queue, the thread waits until it either reaches the head of the queue (is scheduled) or its operation is completed by another thread.</p>
<blockquote>
<p>When consuming writers and executing the actual update operations, LevelDB optimizes the write task by merging writers of the same type (based on the sync flag).</p>
</blockquote>
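<p>The merging step can be sketched as follows. This is an illustrative Python model of <code>BuildBatchGroup</code> with locks and condition variables omitted; the stop condition mirrors LevelDB's rule that a sync writer is never merged into a non-sync group.</p>

```python
from collections import deque

# Sketch of the scheduling idea: the writer at the head of the queue merges
# in queued writers with a compatible sync flag, so one WAL write serves
# many producers. Names mirror LevelDB's but this is not its real code.

class Writer:
    def __init__(self, batch, sync):
        self.batch, self.sync, self.done = batch, sync, False

def build_batch_group(queue):
    head = queue[0]
    group, grouped = list(head.batch), [head]
    for w in list(queue)[1:]:
        if w.sync and not head.sync:
            break                    # never promote a non-sync group to sync
        group.extend(w.batch)
        grouped.append(w)
    return group, grouped

q = deque([Writer(["put a"], False),
           Writer(["put b"], False),
           Writer(["del c"], True)])   # sync writer stays for the next group
group, grouped = build_batch_group(q)
print(group)   # -> ['put a', 'put b']
```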
<h3 id="writing-writer-batches">Writing Writer Batches</h3>
<ul>
<li>First, <code>MakeRoomForWrite</code> ensures that the <code>memtable</code> has enough space and that the Write-Ahead Log (WAL) can guarantee a successful write. If the current <code>memtable</code> has enough space, it is reused. Otherwise, a new <code>memtable</code> and WAL are created, and the previous <code>memtable</code> is converted to an immutable <code>memtable</code>, awaiting compaction (minor compaction is serialized).</li>
<li>The function also checks the number of Level 0 files and decides whether to throttle the write rate based on configurations and triggers. There are two main configurations:
<ul>
<li><strong>Slowdown Trigger</strong>: This trigger causes write threads to sleep, slowing down writes so that compaction tasks can proceed. It also limits the number of Level 0 files to ensure read efficiency.</li>
<li><strong>Stop Writes Trigger</strong>: Stops write threads entirely until compaction brings the Level 0 file count back down.</li>
</ul>
</li>
<li><code>BuildBatchGroup</code> merges batches from writers of the same type starting from the head of the queue into a single batch.</li>
<li>The merged batch is then written first to the WAL and then to the <code>memtable</code>.</li>
</ul>
<h4 id="write-ahead-log-wal">Write-Ahead Log (WAL)</h4>
<p>The WAL file is split into blocks, and each block consists of records. The format is as follows:</p>
<blockquote>
<p><code>| Header{ checksum(4 bytes) + len(2 bytes) + type(1 byte)} | Data |</code></p>
</blockquote>
<p>The record types are <code>Zero</code>, <code>Full</code>, <code>First</code>, <code>Middle</code>, and <code>Last</code>. <code>Zero</code> is reserved for preallocated files. Since a key-value pair might be too large and needs to be recorded in several chunks, the other four types are used accordingly.</p>
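<p>The chunking rule can be made concrete. Below is a hypothetical Python sketch of how a large payload is split into First/Middle/Last fragments across 32 KB blocks; checksums and trailer padding are omitted.</p>

```python
kBlockSize, kHeaderSize = 32768, 7   # 4-byte checksum + 2-byte len + 1-byte type
FULL, FIRST, MIDDLE, LAST = 1, 2, 3, 4

def fragment(payload, block_offset=0):
    """Return (type, chunk_len) fragments for one logical record."""
    frags, begin = [], True
    while True:
        leftover = kBlockSize - block_offset
        if leftover < kHeaderSize:       # no room for a header: move to new block
            block_offset = 0
            continue
        avail = leftover - kHeaderSize
        chunk, payload = payload[:avail], payload[avail:]
        end = len(payload) == 0
        if begin and end:   rtype = FULL
        elif begin:         rtype = FIRST
        elif end:           rtype = LAST
        else:               rtype = MIDDLE
        frags.append((rtype, len(chunk)))
        block_offset += kHeaderSize + len(chunk)
        begin = False
        if end:
            return frags

print(fragment(b"x" * 70000))   # -> [(2, 32761), (3, 32761), (4, 4478)]
```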
<h5 id="write-flow-1">Write Flow</h5>
<p><img alt="Write Flow" src="https://s2.loli.net/2022/08/15/3TOkslnhjuMIg9Y.jpg"></p>
<h4 id="memtable">Memtable</h4>
<p>The core design of <code>memtable</code> involves two parts: the skip list and the encoding of key-value pairs in the skip list. The skip list ensures the sorted nature of inserted data. For more details, you can refer to my other blog post.</p>
<p>The key encoded in <code>memtable</code> is called the <code>Internal Key</code>, which is encoded as follows:</p>
<blockquote>
<p><code>| Original Key(varstring) + seq num(7 bytes) + type(1 byte) |</code></p>
</blockquote>
<p><img alt="Internal Key" src="https://leveldb-handbook.readthedocs.io/zh/latest/_images/internalkey.jpeg"></p>
<blockquote>
<p><code>SeqNum</code> is a monotonically increasing sequence number generated for each update operation, serving as a logical clock to indicate the recency of operations. Based on <code>SeqNum</code>, snapshot-based reads (versioned reads) can be implemented.</p>
</blockquote>
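<p>The trailer packing and its unusual sort order can be sketched like this. Assumptions in the sketch: the 8-byte little-endian trailer packs the sequence number shifted left by 8 bits OR-ed with the type byte, and the comparator orders by user key ascending, then sequence number descending, so the newest record comes first.</p>

```python
import struct

kTypeDeletion, kTypeValue = 0, 1

def make_internal_key(user_key, seq, vtype):
    # 8-byte trailer: 56-bit sequence number in the high bits, type in the low byte
    return user_key + struct.pack("<Q", (seq << 8) | vtype)

def internal_compare(a, b):
    """User key ascending, then sequence number DESCENDING,
    so the newest version of a key is encountered first."""
    ka, kb = a[:-8], b[:-8]
    if ka != kb:
        return -1 if ka < kb else 1
    ta = struct.unpack("<Q", a[-8:])[0]
    tb = struct.unpack("<Q", b[-8:])[0]
    return 0 if ta == tb else (-1 if ta > tb else 1)

old = make_internal_key(b"k", 5, kTypeValue)
new = make_internal_key(b"k", 9, kTypeValue)
print(internal_compare(new, old))   # -> -1: the newer record sorts first
```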
<h2 id="reference">Reference</h2>
<ul>
<li><a href="https://leveldb-handbook.readthedocs.io/zh/latest/index.html">LevelDB handbook</a></li>
</ul>
MIT6.824-RaftKV
https://noneback.github.io/blog/mit6.824-raftkv/
Fri, 15 Apr 2022 10:49:57 +0800https://noneback.github.io/blog/mit6.824-raftkv/<p>Earlier, I looked at the code of Casbin-Mesh because I wanted to try GSOC. Casbin-Mesh is a distributed Casbin application based on Raft. This RaftKV in MIT6.824 is quite similar, so I took the opportunity to write this blog.</p>
<h2 id="lab-overview">Lab Overview</h2>
<p>Lab 03 involves building a distributed KV service based on Raft. We need to implement the server and client for this service.</p>
<p>The structure of RaftKV and the interaction between its modules are shown below:</p>
<p><img alt="image-20220429211429808" src="https://s2.loli.net/2022/04/29/xuQMp28PRH7rheb.png"></p>
<p>Compared to the previous lab, the difficulty is significantly lower. For implementation, you can refer to this excellent <a href="https://github.com/OneSizeFitsQuorum/MIT6.824-2021/blob/master/docs/lab3.md">implementation</a>, so I won’t elaborate too much.</p>
<h2 id="raft-related-topics">Raft-Related Topics</h2>
<p>Let’s talk about Raft and its interactions with clients.</p>
<h3 id="routing-and-linearizability">Routing and Linearizability</h3>
<p>To build a service that allows client access on top of Raft, the issues of <strong>routing</strong> and <strong>linearizability</strong> must first be addressed.</p>
<h4 id="routing">Routing</h4>
<p>Raft is a <strong>Strong Leader</strong> consensus algorithm, and read and write requests usually need to be executed by the Leader. When a client queries the Raft cluster, it typically randomly selects a node. If that node is not the Leader, it returns the Leader information to the client, and the client redirects the request to the Leader.</p>
<h4 id="linearizability">Linearizability</h4>
<p>Out of the box, Raft only provides <strong>At-Least-Once</strong> semantics: for a single client request, the state machine may apply the same command multiple times (for example, after a client retry), which breaks the linearizability such systems are expected to provide.</p>
<p>To achieve linearizability, it is clear that requests need to be made idempotent.</p>
<p>A basic approach is for the client to assign a unique UID to each request, and the server maintains a session using this <code>UID</code> to cache the response of successful requests. When a duplicate request arrives at the server, it can respond directly using the cached response, thus achieving idempotency.</p>
<p>Of course, this introduces the issue of session management, but that is not the focus of this article.</p>
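<p>A minimal sketch of this dedup table follows. Names like <code>KVServer</code> are illustrative, not from the lab skeleton; real code would also handle per-client request ordering and session expiry.</p>

```python
# The client tags each request with (client_id, request_seq); the server caches
# the last response per client and replays it for duplicates instead of
# re-applying the command, making retried requests idempotent.

class KVServer:
    def __init__(self):
        self.kv = {}
        self.sessions = {}   # client_id -> (last_seq, last_response)

    def apply(self, client_id, seq, op, key, value=None):
        last = self.sessions.get(client_id)
        if last is not None and last[0] == seq:
            return last[1]                  # duplicate request: replay cached reply
        if op == "put":
            self.kv[key] = value
            resp = "ok"
        else:
            resp = self.kv.get(key)
        self.sessions[client_id] = (seq, resp)
        return resp

s = KVServer()
s.apply("c1", 1, "put", "x", "1")
s.apply("c1", 1, "put", "x", "2")   # retry of request 1: deduplicated
print(s.kv["x"])                    # prints 1, not 2
```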
<h3 id="read-only-optimization">Read-Only Optimization</h3>
<p>After solving the above two problems, we have a usable Raft-based service.</p>
<p>However, we notice that whether it’s a read or write request, our application needs to go through a round of <code>AppendEntries</code> communication initiated by the Leader. It also requires successful quorum ACKs and additional disk write operations before the log is committed, after which the result can be returned to the client.</p>
<p>Write operations change the state machine, so these are necessary steps for write requests. However, read operations do not change the state machine, and we can optimize read requests to bypass the Raft log, reducing the overhead of synchronous write operations on disk IO.</p>
<p>The problem is that without additional measures, read-only query results that bypass the Raft log may become stale.</p>
<blockquote>
<p>For example, if a network partition separates the old Leader from the majority that has elected a new Leader, queries served by the old Leader may return stale data.</p>
</blockquote>
<p>The Raft paper mentions two methods to bypass the Raft log and optimize read-only requests: <strong>Read Index</strong> and <strong>Lease Read</strong>.</p>
<h4 id="read-index">Read Index</h4>
<p>The <strong>Read Index</strong> approach needs to address several issues:</p>
<ul>
<li>Committed logs from the old term</li>
</ul>
<blockquote>
<p>For example, if the old Leader commits a log but crashes before sending heartbeats, other nodes will elect a new Leader. According to the Raft paper, the new Leader does not proactively commit logs from the old Leader.</p>
<p>To solve this, the new Leader commits a no-op entry immediately after election; committing it indirectly commits all earlier entries as well.</p>
</blockquote>
<ul>
<li>Gap between <code>commitIndex</code> and <code>appliedIndex</code></li>
</ul>
<blockquote>
<p>Introduce a <code>readIndex</code> variable, where the Leader saves the current <code>commitIndex</code> in a local variable called <code>readIndex</code>. This acts as a boundary for applying the log, and when a read-only request arrives, the log must be applied up to the position recorded by <code>readIndex</code> before the Leader can query the state machine to provide read services.</p>
</blockquote>
<ul>
<li>Ensure no Leader change when providing read-only services</li>
</ul>
<blockquote>
<p>To achieve this, after receiving a read request, the Leader first sends a heartbeat and needs to receive quorum ACKs to ensure there is no other Leader with a higher term, thus ensuring that <code>readIndex</code> is the highest committed index in the cluster.</p>
</blockquote>
<p>For the specific process and optimizations like Batch and Follower Read, refer to the author’s PhD dissertation on Raft.</p>
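<p>The three requirements above can be condensed into a sketch. The quorum heartbeat check is stubbed out as a boolean and all names are illustrative:</p>

```python
# Read Index flow: confirm leadership via a heartbeat quorum, record
# commitIndex as readIndex, apply the log up to readIndex, then read the
# state machine. This is a single-node model, not a real Raft implementation.

class Leader:
    def __init__(self):
        self.commit_index = 0
        self.applied_index = 0
        self.state_machine = {}
        self.log = [None]          # 1-based log of (key, value) entries

    def apply_up_to(self, index):
        while self.applied_index < index:
            self.applied_index += 1
            key, value = self.log[self.applied_index]
            self.state_machine[key] = value

    def read_index_query(self, key, heartbeat_quorum_ok):
        if not heartbeat_quorum_ok:    # leadership not confirmed: must retry
            return None
        read_index = self.commit_index  # boundary for this read
        self.apply_up_to(read_index)    # wait until appliedIndex >= readIndex
        return self.state_machine.get(key)

l = Leader()
l.log += [("x", 1), ("x", 2)]
l.commit_index = 2                  # both committed, none applied yet
print(l.read_index_query("x", True))   # -> 2
```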
<h4 id="lease-read">Lease Read</h4>
<p>The <strong>Read Index</strong> approach only optimizes the overhead of disk IO, but still requires a round of network communication. However, this overhead can also be optimized, leading to the <strong>Lease Read</strong> approach.</p>
<p>The <strong>core idea</strong> of <strong>Lease Read</strong> is to use the fact that a Leader Election requires at least one <code>ElectionTimeout</code> time period. During this period, the system will not conduct a new election, thereby avoiding Leader changes when providing read-only services. We can use clocks to optimize network IO.</p>
<h5 id="implementation">Implementation</h5>
<p>To let the clock replace network communication, we need an additional lease mechanism. Once the Leader’s <code>Heartbeat</code> is approved by a quorum, the Leader can assume that no other node can become Leader during the <code>ElectionTimeout</code> period, and it can extend its lease accordingly. While holding the lease, the Leader can directly serve read-only queries without extra network communication.</p>
<p>However, there may be <strong>clock drift</strong> among servers, which means Followers cannot ensure that the Leader will not time out during the lease. This introduces the critical design for <code>Lease Read</code>: <strong>what strategy should be used to extend the lease?</strong></p>
<p>The paper assumes that $ClockDrift$ is bounded, and when a heartbeat successfully updates the lease, the lease is extended to $start + \frac{ElectionTimeout}{ClockDriftBound}$.</p>
<p>$ClockDriftBound$ represents the limit of clock drift in the cluster, but discovering and maintaining this limit is challenging due to many real-time factors that cause clock drift.</p>
<blockquote>
<p>For instance, garbage collection (GC), virtual machine scheduling, cloud machine scaling, etc.</p>
</blockquote>
<p>In practice, some safety is usually sacrificed for <code>Lease Read</code> performance. Generally, the lease is extended to $StartTime + ElectionTimeout - \Delta{t}$, where $\Delta{t}$ is a positive value. This reduces the lease extension time compared to <code>ElectionTimeout</code>, trading off between network IO overhead and safety.</p>
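<p>Both extension strategies are easy to compare numerically. A sketch with illustrative constants (all times in milliseconds):</p>

```python
# The dissertation formula divides ElectionTimeout by the clock-drift bound;
# the common engineering variant subtracts a safety margin delta_t instead.

ELECTION_TIMEOUT = 1000.0

def lease_end_paper(start, clock_drift_bound):
    # start + ElectionTimeout / ClockDriftBound
    return start + ELECTION_TIMEOUT / clock_drift_bound

def lease_end_practice(start, delta_t):
    # start + ElectionTimeout - delta_t
    return start + ELECTION_TIMEOUT - delta_t

def can_serve_read(now, lease_end):
    return now < lease_end

end = lease_end_practice(0.0, delta_t=100.0)      # lease valid for 900 ms
print(can_serve_read(850.0, end), can_serve_read(950.0, end))   # -> True False
```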
<h2 id="summary">Summary</h2>
<p>When building a Raft-based service, it is crucial to design routing and idempotency mechanisms for accessing the service.</p>
<p>For read-only operations, there are two main optimization methods: <strong>Read Index</strong> and <strong>Lease Read</strong>. The former optimizes disk IO during read operations, while the latter uses clocks to optimize network IO.</p>
<h2 id="references">References</h2>
<p><a href="https://github.com/OneSizeFitsQuorum/MIT6.824-2021/blob/master/docs/lab3.md">Implementation Doc</a></p>
<p><a href="https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf">Raft Paper</a></p>
<p><a href="https://pdos.csail.mit.edu/6.824/index.html">MIT6.824 Official</a></p>
<p><a href="https://github.com/OneSizeFitsQuorum/raft-thesis-zh_cn">Consensus: Bridging Theory and Practice - zh</a></p>
<p><a href="https://pingcap.com/zh/blog/lease-read">Tikv Lease-Read</a></p>
LevelDB Startup
https://noneback.github.io/blog/leveldb-%E5%90%AF%E5%8A%A8/
Sat, 09 Apr 2022 14:43:25 +0800https://noneback.github.io/blog/leveldb-%E5%90%AF%E5%8A%A8/<p>This is the first chapter of my notes on reading the LevelDB source code, focusing on the startup process of LevelDB. This article is not a step-by-step source code tutorial, but rather a learning note that records my questions and thoughts.</p>
<p>A code repository with annotations will be shared on GitHub later for those interested in studying it.</p>
<h2 id="prerequisites">Prerequisites</h2>
<h3 id="database-files">Database Files</h3>
<p>For now, I won’t delve into the encoding and naming details of these files (as I haven’t reached that part yet). I’ll focus on the meaning and role of each file.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>├── 000005.ldb
</span></span><span style="display:flex;"><span>├── 000008.ldb // sst or ldb are both sst files
</span></span><span style="display:flex;"><span>├── 000009.log // WAL
</span></span><span style="display:flex;"><span>├── CURRENT // Records the name of the manifest file in use, also indicates the presence of the database
</span></span><span style="display:flex;"><span>├── LOCK // Empty file, ensures only one DB instance operates on the database
</span></span><span style="display:flex;"><span>├── LOG // Logs printed by LevelDB
</span></span><span style="display:flex;"><span>├── LOG.old
</span></span><span style="display:flex;"><span>└── MANIFEST-000007 // descriptor_file, metadata file
</span></span></code></pre></div><p>Some questions worth exploring, which I may write about later:</p>
<ul>
<li>How does LOCK ensure only one DB instance holds the database?</li>
</ul>
<blockquote>
<p>Essentially, it uses the <code>fcntl</code> system call to set a write lock on the LOCK file.</p>
</blockquote>
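<p>That <code>fcntl</code> trick can be demonstrated in a few lines. The sketch below is not LevelDB's actual <code>env_posix</code> code; note that POSIX <code>fcntl</code> locks are per-process, so the conflict only shows up when a second process attempts the same lock.</p>

```python
import fcntl, os, tempfile

# Take an exclusive, non-blocking lock on the LOCK file. A second process
# attempting the same lock gets an error instead of blocking, which is how
# only one DB instance can hold the database at a time.

def lock_file(path):
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd               # keep the fd open for the lifetime of the DB
    except OSError:
        os.close(fd)
        return None             # another process already owns the database

path = os.path.join(tempfile.mkdtemp(), "LOCK")
fd = lock_file(path)
print(fd is not None)   # -> True: this process now owns the database
```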
<ul>
<li>Encoding issues of various files</li>
</ul>
<blockquote>
<p>I’ll discuss LevelDB’s encoding design in a future blog.</p>
</blockquote>
<h3 id="db-state">DB State</h3>
<p>LevelDB is an embedded database, often used as a component in other applications (e.g., for metadata nodes in distributed storage systems). These applications may crash or exit gracefully, leaving LevelDB data files behind. Thus, it’s necessary to restore the previous database state during startup to ensure data integrity.</p>
<p>So, what should the DB state include? LevelDB is an LSM-based storage engine, essentially an <code>LSM Tree data structure + various read/write and storage optimizations</code>. Based on this and LevelDB’s documentation, the DB state includes at least the following persistent information:</p>
<ul>
<li>The SST files for each level and the key range covered by each SST file
<blockquote>
<p>The key range helps avoid unnecessary I/O.</p>
</blockquote>
</li>
<li>Global logical clock, <code>last_seq_number</code>
<blockquote>
<p>Each data update has a <code>seq_num</code> that marks the recency of the update and is related to ordering.</p>
</blockquote>
</li>
<li>Compaction-related parameters (<code>file_to_compact</code>, <code>score</code>, <code>point</code>)
<blockquote>
<p>Compaction parameters are used to trigger compaction after a crash.</p>
</blockquote>
</li>
<li>Comparator name
<blockquote>
<p>Once the DB is initialized, the data sorting logic is fixed and cannot be changed. The comparator name serves as a credential.</p>
</blockquote>
</li>
<li><code>log_number</code>, <code>next_file_number</code>
<blockquote>
<p>WAL number and the next available file number.</p>
</blockquote>
</li>
<li><code>deleted_files</code> and <code>add_files</code>
<blockquote>
<p>SST files to be deleted or added due to compaction or reference count reaching zero.</p>
</blockquote>
</li>
</ul>
<p>In practice, each metadata change in LevelDB (usually caused by compaction) is recorded in a <code>VersionEdit</code> data structure. Thus, the DB state in LevelDB is essentially <code>initial state + list of applied VersionEdits</code>.</p>
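<p>The replay idea can be sketched directly. The field names below are simplified stand-ins for <code>VersionEdit</code>'s members, and the dict-of-sets layout is illustrative only:</p>

```python
# DB state = initial state + fold(applied VersionEdits): each edit records
# files added/deleted per level plus metadata bumps, and recovery replays
# the edits in order to reconstruct the latest state.

def apply_edit(state, edit):
    for level, f in edit.get("deleted_files", []):
        state["files"][level].discard(f)
    for level, f in edit.get("added_files", []):
        state["files"][level].add(f)
    for k in ("log_number", "next_file_number", "last_seq_number"):
        if k in edit:
            state[k] = edit[k]
    return state

state = {"files": {0: set(), 1: set()}, "log_number": 0,
         "next_file_number": 2, "last_seq_number": 0}
edits = [
    {"added_files": [(0, "000005.ldb")], "next_file_number": 6},
    {"deleted_files": [(0, "000005.ldb")],      # compacted away from level 0
     "added_files": [(1, "000008.ldb")], "next_file_number": 9},
]
for e in edits:
    state = apply_edit(state, e)
print(sorted(state["files"][1]), state["next_file_number"])   # -> ['000008.ldb'] 9
```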
<h3 id="version-control">Version Control</h3>
<p>Since we mentioned <code>VersionEdit</code>, let’s also discuss the version control in LevelDB’s startup process, which mainly involves three data structures: <code>Version</code>, <code>VersionEdit</code>, and <code>VersionSet</code>.</p>
<p>Why is version control needed? In short, LevelDB uses the Multi-Version Concurrency Control (MVCC) mechanism to avoid using a big lock and improve performance.</p>
<p>Snapshot reads at the command level are implemented via <code>sequence_number</code>. Each operation is assigned the current <code>sequence_number</code>, which is used to determine the data visible to that operation. Records with a <code>sequence_number</code> greater than that of the command are invisible to the operation.</p>
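A minimal sketch of that visibility rule (hypothetical Go types, not LevelDB's internal representation): a read started at sequence number <code>s</code> sees, for each key, the newest record whose sequence number does not exceed <code>s</code>:

```go
package main

import "fmt"

// Hypothetical sketch of snapshot visibility: a read carries the
// sequence number current at the time it started, and only record
// versions with Seq no greater than that number are visible to it.

type Record struct {
	Key   string
	Value string
	Seq   uint64
}

// visible returns the newest version of key whose sequence number
// does not exceed the snapshot's sequence number.
func visible(records []Record, key string, snapshotSeq uint64) (string, bool) {
	best := Record{}
	found := false
	for _, r := range records {
		if r.Key == key && r.Seq <= snapshotSeq && (!found || r.Seq > best.Seq) {
			best, found = r, true
		}
	}
	return best.Value, found
}

func main() {
	records := []Record{
		{"k", "v1", 10},
		{"k", "v2", 20}, // written after the snapshot below was taken
	}
	v, _ := visible(records, "k", 15)
	fmt.Println(v) // v1
}
```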
<p>MVCC at the SST file level is implemented using a version chain, primarily to avoid conflicts in the following scenario: when reading a file while a background major compaction tries to delete that file.</p>
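That conflict is avoided with the reference counts kept on each version (<code>refs_</code> in the <code>Version</code> class below). A rough sketch of the idea, using hypothetical helper types rather than LevelDB's real code: files reachable from any still-referenced version are considered live and must not be deleted, even if a compaction has superseded them:

```go
package main

import "fmt"

// Hypothetical sketch of SST-level protection via reference counting:
// a reader pins the Version it reads from; files belonging to a pinned
// Version stay live even after a compaction has produced a newer Version.

type Version struct {
	refs  int
	files []string
}

func (v *Version) Ref()   { v.refs++ }
func (v *Version) Unref() { v.refs-- }

// liveFiles collects files reachable from any still-referenced version.
func liveFiles(versions []*Version) map[string]bool {
	live := map[string]bool{}
	for _, v := range versions {
		if v.refs > 0 {
			for _, f := range v.files {
				live[f] = true
			}
		}
	}
	return live
}

func main() {
	old := &Version{files: []string{"a.sst"}}
	cur := &Version{files: []string{"b.sst"}}
	cur.Ref() // the current version is always pinned
	old.Ref() // a reader is still iterating over the old version

	live := liveFiles([]*Version{old, cur})
	fmt.Println(live["a.sst"], live["b.sst"]) // true true

	old.Unref() // reader finished; a.sst may now be deleted
	live = liveFiles([]*Version{old, cur})
	fmt.Println(live["a.sst"], live["b.sst"]) // false true
}
```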
<h4 id="related-data-structures">Related Data Structures</h4>
<p>The main data structures related to SST-level MVCC are <code>Version</code>, <code>VersionEdit</code>, and <code>VersionSet</code>.</p>
<h5 id="version">Version</h5>
<p>Represents the latest data state after startup or compaction.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Version</span> {
</span></span><span style="display:flex;"><span> VersionSet<span style="color:#f92672">*</span> vset_; <span style="color:#75715e">// VersionSet to which this Version belongs
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> Version<span style="color:#f92672">*</span> next_; <span style="color:#75715e">// Next version in linked list
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> Version<span style="color:#f92672">*</span> prev_; <span style="color:#75715e">// Previous version in linked list
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">int</span> refs_; <span style="color:#75715e">// Number of live refs to this version
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// List of files and metadata per level
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> std<span style="color:#f92672">::</span>vector<span style="color:#f92672"><</span>FileMetaData<span style="color:#f92672">*></span> files_[config<span style="color:#f92672">::</span>kNumLevels];
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Next file to compact based on seek stats (compaction due to allowed_seek exhaustion)
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> FileMetaData<span style="color:#f92672">*</span> file_to_compact_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> file_to_compact_level_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Level that should be compacted next and its compaction score.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#75715e">// Score < 1 means compaction is not strictly needed. These fields
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#75715e">// are initialized by Finalize().
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">double</span> compaction_score_; <span style="color:#75715e">// Score represents data imbalance; higher score indicates greater imbalance and compaction need.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">int</span> compaction_level_;
</span></span><span style="display:flex;"><span>};
</span></span></code></pre></div><h5 id="versionset">VersionSet</h5>
<p>Manages the current runtime state of the entire DB.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">VersionSet</span> {
</span></span><span style="display:flex;"><span> Env<span style="color:#f92672">*</span> <span style="color:#66d9ef">const</span> env_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">const</span> std<span style="color:#f92672">::</span>string dbname_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">const</span> Options<span style="color:#f92672">*</span> <span style="color:#66d9ef">const</span> options_;
</span></span><span style="display:flex;"><span> TableCache<span style="color:#f92672">*</span> <span style="color:#66d9ef">const</span> table_cache_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">const</span> InternalKeyComparator icmp_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> next_file_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> manifest_file_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> last_sequence_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> prev_log_number_; <span style="color:#75715e">// 0 or backing store for memtable being compacted
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Opened lazily
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> WritableFile<span style="color:#f92672">*</span> descriptor_file_; <span style="color:#75715e">// descriptor_ is for manifest file
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> log<span style="color:#f92672">::</span>Writer<span style="color:#f92672">*</span> descriptor_log_; <span style="color:#75715e">// descriptor_ is for manifest file
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> Version dummy_versions_; <span style="color:#75715e">// Head of circular doubly-linked list of versions.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> Version<span style="color:#f92672">*</span> current_; <span style="color:#75715e">// == dummy_versions_.prev_
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Per-level key at which the next compaction at that level should start.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#75715e">// Either an empty string, or a valid InternalKey.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> std<span style="color:#f92672">::</span>string compact_pointer_[config<span style="color:#f92672">::</span>kNumLevels];
</span></span><span style="display:flex;"><span>};
</span></span></code></pre></div><h5 id="versionedit">VersionEdit</h5>
<p>Encapsulates a batch of metadata changes. Batching the changes into a single edit keeps the critical section for switching to a new version short.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">VersionEdit</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">/** other code */</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">typedef</span> std<span style="color:#f92672">::</span>set<span style="color:#f92672"><</span>std<span style="color:#f92672">::</span>pair<span style="color:#f92672"><</span><span style="color:#66d9ef">int</span>, <span style="color:#66d9ef">uint64_t</span><span style="color:#f92672">>></span> DeletedFileSet;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>string comparator_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> prev_log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> next_file_number_;
</span></span><span style="display:flex;"><span> SequenceNumber last_sequence_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_comparator_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_prev_log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_next_file_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_last_sequence_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>vector<span style="color:#f92672"><</span>std<span style="color:#f92672">::</span>pair<span style="color:#f92672"><</span><span style="color:#66d9ef">int</span>, InternalKey<span style="color:#f92672">>></span> compact_pointers_;
</span></span><span style="display:flex;"><span> DeletedFileSet deleted_files_;
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>vector<span style="color:#f92672"><</span>std<span style="color:#f92672">::</span>pair<span style="color:#f92672"><</span><span style="color:#66d9ef">int</span>, FileMetaData<span style="color:#f92672">>></span> new_files_;
</span></span><span style="display:flex;"><span>};
</span></span></code></pre></div><h4 id="manifest-content">Manifest Content</h4>
<p>As mentioned earlier, the manifest is LevelDB’s metadata file that stores the persistent state of the database. During startup, LevelDB may need to restore the previous DB state using existing data files. Additionally, when a version changes, LevelDB generates a <code>VersionEdit</code>. The metadata changes recorded by <code>VersionEdit</code> need to be persisted to the manifest to ensure LevelDB’s MVCC multi-version state is crash-safe. Thus, the encoding layout inside the manifest is crucial.</p>
<p>Internally, metadata is encoded as <code>SnapshotSessionRecord + list of SessionRecords</code>, essentially <code>initial state + list of applied VersionEdits</code>.</p>
<p><img alt="Manifest Structure" src="https://s2.loli.net/2022/04/11/AUusMjdYROz874v.jpg"></p>
<p>A manifest contains several session records. <strong>The first session record</strong> stores the <em>full version information</em> of LevelDB at that time, while subsequent session records only record incremental changes.</p>
<blockquote>
<p>A session record may contain the following fields:</p>
<ul>
<li>Comparator name</li>
<li>Latest WAL file number</li>
<li>Next available file number</li>
<li>The largest <code>sequence number</code> among the data persisted by the DB</li>
<li>Information on new files</li>
<li>Information on deleted files</li>
<li>Compaction record information</li>
</ul>
</blockquote>
<h5 id="writing-version-changes-to-the-manifest">Writing Version Changes to the Manifest</h5>
<p><img alt="Writing to Manifest" src="https://s2.loli.net/2022/04/14/7tlwEMPGgXHIp6s.jpg"></p>
<p>For LevelDB, adding or deleting some SSTable files needs to be an <strong>atomic operation</strong> to maintain <strong>database consistency</strong> before and after the state change.</p>
<h6 id="atomicity">Atomicity</h6>
<p><strong>Atomicity</strong> means that the operation takes effect only once a session record is fully written to the manifest. If the process crashes before that, the database is restored to the last consistent state on restart: orphaned SSTable files are deleted and compaction is re-triggered.</p>
<h6 id="consistency">Consistency</h6>
<p><strong>Consistency</strong> is ensured by marking state changes with version updates, which occur at the very end of the process. Thus, the database always transitions from one consistent state to another.</p>
<h5 id="restoring-db-from-the-manifest">Restoring DB from the Manifest</h5>
<p><img alt="Restoring DB from Manifest" src="https://s2.loli.net/2022/04/14/Jk5eyRzUWowi4YH.jpg"></p>
<p>As LevelDB runs, the number of session records in a manifest grows. Therefore, each time LevelDB restarts, a new manifest is created, and the first session record captures a snapshot of the current version state.</p>
<p>Outdated manifests are deleted during the recovery process at the next startup.</p>
<blockquote>
<p>LevelDB uses this method to control the size of the manifest file. However, if the database is not restarted, the manifest will keep growing.</p>
</blockquote>
<h2 id="db-state-recovery-process">DB State Recovery Process</h2>
<ol>
<li>Check lock status and create data directory.</li>
<li>Check <code>lockfile</code> to determine if another DB instance exists.</li>
<li>Check if the <code>CURRENT</code> file exists.</li>
<li>Restore metadata from the manifest.</li>
<li>Recover <code>last_seq_number</code> and <code>file_number</code> from the WAL.</li>
</ol>
<h2 id="main-open-process">Main Open Process</h2>
<ol>
<li>Create default DB and <code>VersionEdit</code> instances.</li>
<li>Acquire lock.</li>
<li>Restore metadata from manifest and WAL.</li>
<li>If the new DB instance does not have a <code>memtable</code>, create one along with a WAL file.</li>
<li>Apply <code>VersionEdit</code> and persist it to the manifest.</li>
<li>Attempt to delete obsolete files.</li>
<li>Attempt to compact data.</li>
</ol>
<h2 id="references">References</h2>
<ul>
<li><a href="https://zhaox.github.io/leveldb/2015/12/23/leveldb-files">LevelDB files</a></li>
<li><a href="https://leveldb-handbook.readthedocs.io/zh/latest/">LevelDB handbook</a></li>
<li><a href="https://github.com/google/leveldb/blob/main/doc/impl.md">LevelDB documentation</a></li>
<li><a href="https://github.com/1Feng/decode-leveldb/blob/master/doc/leveldb%E5%AE%9E%E7%8E%B0%E8%A7%A3%E6%9E%90.pdf">LevelDB implementation analysis</a></li>
<li><a href="https://github.com/noneback/leveldb_annotated">My notes on LevelDB</a></li>
</ul>
MIT6.824-Raft
https://noneback.github.io/blog/mit6.824-raft/
Mon, 21 Feb 2022 01:26:46 +0800https://noneback.github.io/blog/mit6.824-raft/<p>Finally, I managed to complete Lab 02 during this winter break, which had been on hold for quite some time. I was stuck on one of the cases in Test 2B for a while. During the winter break, I revisited the implementations from experts, and finally completed all the tasks, so I decided to document them briefly.</p>
<h2 id="algorithm-overview">Algorithm Overview</h2>
<p>The basis of consensus algorithms is the replicated state machine, which means that <strong>executing the same deterministic commands in the same order will eventually lead to a consistent state</strong>. Raft is a distributed consensus algorithm that serves as an alternative to Paxos, making it easier to learn and understand compared to Paxos.</p>
<p>The core content of the Raft algorithm can be divided into three parts: <em>Leader Election + Log Replication + Safety</em>.</p>
<p><img alt="img" src="https://s2.loli.net/2022/02/19/9mGfndCtDHzMqe4.png"></p>
<p>Initially, all nodes in the cluster start as Followers. If a Follower does not receive a heartbeat from the Leader within a certain period, it becomes a Candidate and triggers an election, requesting votes from the other Followers. The Candidate that receives a majority of votes becomes the Leader.</p>
<p>Raft is a <strong>strong-leader</strong>, strongly consistent distributed consensus algorithm. It uses Terms as a logical clock, and at most one Leader can exist in each term. The Leader sends heartbeats periodically to maintain its authority and to drive <strong>log replication</strong>.</p>
<p>When replicating logs, the Leader first replicates the log to other Followers. Once a majority of the Followers successfully replicate the log, the Leader commits the log.</p>
<p>Safety consists of five rules in the paper, two of which are central to the implementation. One is Leader Append-Only: a leader never overwrites or deletes entries in its own log, it only appends. The other is the election restriction, which prevents split-brain scenarios and guarantees that a newly elected Leader's log is at least as up-to-date as that of a majority of the cluster.</p>
<p>For more details, please refer to the original paper.</p>
<h2 id="implementation-ideas">Implementation Ideas</h2>
<p>The implementation largely follows an excellent blog post (see references), and many algorithm details are also provided in Figure 2 of the original paper, so I will only focus on aspects that need attention when implementing each function.</p>
<h3 id="leader-election">Leader Election</h3>
<h4 id="triggering-election--handling-election-results">Triggering Election + Handling Election Results</h4>
<p>The election is initiated by launching multiple goroutines to send RPC requests to other nodes in the background. Therefore, when handling RPC responses, it is necessary to confirm that the current node is a Candidate and that the request is not outdated, i.e., <code>rf.state == Candidate && req.Term == rf.currentTerm</code>. If the election is successful, the node should immediately send heartbeats to notify other nodes of the election result.</p>
<p>If a failed response is received with <code>resp.Term > rf.currentTerm</code>, the node should switch to the Follower state, update the term, and <strong>reset voting information</strong>.</p>
<blockquote>
<p>In fact, whenever the term is updated, the voting information needs to be reset. If the <code>votedFor</code> information is not reset, some tests will fail.</p>
</blockquote>
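That rule can be sketched as follows (hypothetical field names, not the lab's exact code): stepping down on a higher term must clear <code>votedFor</code> together with updating the term, otherwise a node may vote twice in the new term:

```go
package main

import "fmt"

// Hypothetical sketch: whenever a Raft node observes a higher term,
// it adopts the term, clears votedFor, and steps down to Follower.
// Forgetting the votedFor reset lets a node grant two votes in one term.

const NoVote = -1

type Raft struct {
	currentTerm int
	votedFor    int
	state       string
}

func (rf *Raft) observeTerm(term int) {
	if term > rf.currentTerm {
		rf.currentTerm = term
		rf.votedFor = NoVote // reset voting information
		rf.state = "Follower"
	}
}

func main() {
	rf := &Raft{currentTerm: 3, votedFor: 2, state: "Candidate"}
	rf.observeTerm(5) // e.g. a response arrived with resp.Term > rf.currentTerm
	fmt.Println(rf.currentTerm, rf.votedFor, rf.state) // 5 -1 Follower
}
```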
<h4 id="request-vote-rpc">Request Vote RPC</h4>
<p>First, filter outdated requests with <code>req.Term < rf.currentTerm</code> and ignore duplicate voting requests for the current term. Then, follow the algorithm’s logic to process the request. Note that if the node successfully grants the vote, it should reset the election timer.</p>
<blockquote>
<p>Resetting the election timeout only when granting a vote helps with liveness in leader elections under unstable network conditions.</p>
</blockquote>
<h4 id="state-transition">State Transition</h4>
<p>When switching roles, be mindful of handling the state of different timers (stop or reset). When switching to Leader, reset the values of <code>matchIndex</code> and <code>nextIndex</code>.</p>
<h3 id="log-replication">Log Replication</h3>
<p>Log replication is the core of the Raft algorithm, and it requires careful attention.</p>
<p>My implementation uses multiple replicator and applier threads for asynchronous replication and application.</p>
<h4 id="log-replication-rpc">Log Replication RPC</h4>
<p>First, filter outdated requests with <code>req.Term < rf.currentTerm</code>. Then, handle log inconsistencies, log truncation, and duplicate log entries before replicating logs and processing <code>commitIndex</code>.</p>
<h4 id="trigger-log-replication--handle-request-results">Trigger Log Replication + Handle Request Results</h4>
<p>Determine whether to replicate logs directly or send a snapshot before initiating replication.</p>
<p>The key point in handling request results is how to update <code>matchIndex</code>, <code>nextIndex</code>, and <code>commitIndex</code>.</p>
<p><code>matchIndex</code> is used to record the latest log successfully replicated on other nodes, while <code>nextIndex</code> records the next log to be sent to other nodes. <code>commitIndex</code> is updated by sorting <code>matchIndex</code> and determining whether to trigger the applier to update <code>appliedIndex</code>.</p>
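The <code>commitIndex</code> update can be sketched like this: sort a copy of <code>matchIndex</code> (with the leader counted at its own last log index) and take the median, which is the highest index replicated on a majority. Note that the real algorithm additionally requires the entry at that index to belong to the current term before committing it:

```go
package main

import (
	"fmt"
	"sort"
)

// Sketch of the commitIndex rule: an index is committed once a majority
// of nodes have replicated it, i.e. the median of matchIndex (counting
// the leader itself at its own last log index).

func commitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Ints(sorted)
	// with n entries, sorted[(n-1)/2] is replicated on a majority
	return sorted[(len(sorted)-1)/2]
}

func main() {
	// 5 nodes: leader at index 7, followers at 7, 5, 4, 2;
	// index 5 is held by 3 of 5 nodes, a majority
	fmt.Println(commitIndex([]int{7, 7, 5, 4, 2})) // 5
}
```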
<p>If a replication request fails because of log inconsistency, decrement <code>nextIndex</code> and retry; if it fails because the responder's term is higher, the node should switch to the Follower state.</p>
<h4 id="asynchronous-apply">Asynchronous Apply</h4>
<p>This is essentially a background goroutine controlled by condition variables and uses channels for communication. Each time it is triggered, it sends <code>log[lastApplied:commitIndex]</code> to the upper layer and updates <code>appliedIndex</code>.</p>
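A sketch of such an applier (simplified, with hypothetical types): the goroutine sleeps on a condition variable until <code>commitIndex</code> passes <code>lastApplied</code>, copies the newly committed entries, and sends them on the apply channel outside the lock:

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch of the applier goroutine: it waits on a condition variable
// until commitIndex advances past lastApplied, then pushes
// log[lastApplied+1 .. commitIndex] to the service layer.

type Applier struct {
	mu          sync.Mutex
	cond        *sync.Cond
	log         []string
	commitIndex int
	lastApplied int
	applyCh     chan string
}

func NewApplier(log []string) *Applier {
	a := &Applier{log: log, commitIndex: -1, lastApplied: -1, applyCh: make(chan string, len(log))}
	a.cond = sync.NewCond(&a.mu)
	go a.run()
	return a
}

func (a *Applier) run() {
	for {
		a.mu.Lock()
		for a.lastApplied >= a.commitIndex {
			a.cond.Wait()
		}
		// copy the committed entries so the lock is not held while sending
		entries := append([]string(nil), a.log[a.lastApplied+1:a.commitIndex+1]...)
		a.lastApplied = a.commitIndex
		a.mu.Unlock()
		for _, e := range entries {
			a.applyCh <- e
		}
	}
}

func (a *Applier) Commit(index int) {
	a.mu.Lock()
	a.commitIndex = index
	a.mu.Unlock()
	a.cond.Signal()
}

func main() {
	a := NewApplier([]string{"set x=1", "set y=2"})
	a.Commit(1)
	fmt.Println(<-a.applyCh, "/", <-a.applyCh) // set x=1 / set y=2
}
```

Checking the condition in a loop under the lock avoids lost wakeups: a <code>Commit</code> that lands before the goroutine reaches <code>Wait</code> is still observed.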
<h3 id="persistence">Persistence</h3>
<p>Whenever the persistent state (<code>currentTerm</code>, <code>votedFor</code>, and the log) changes, write it to stable storage in a timely manner, before responding to the RPC that changed it.</p>
<h3 id="install-snapshot">Install Snapshot</h3>
<p>The main components are the Snapshot triggered by the Leader and the corresponding RPC. When applying a Snapshot, determine its freshness and update <code>log[0]</code>, <code>appliedIndex</code>, and <code>commitIndex</code>.</p>
<h2 id="pitfalls">Pitfalls</h2>
<h3 id="defer">Defer</h3>
<p>The first pitfall is related to the <strong>defer</strong> keyword in Go. I like to use <code>defer</code> at the beginning of an RPC handler to print some of the node's state, e.g. <code>defer Dprintf("%+v", raft.currentTerm)</code>, so the log is printed when the call returns. However, the arguments of a deferred call are evaluated at the moment the <code>defer</code> statement executes, not when the deferred function runs. To print the state as it is at return time, defer a closure that reads the variable when it runs: <code>defer func() { Dprintf("%+v", raft.currentTerm) }()</code>.</p>
<h3 id="log-dummy-header">Log Dummy Header</h3>
<p>It is best to reserve a dummy entry at the head of the log to store the last included index and term of the snapshot; this avoids a painful refactor of the Snapshot section later.</p>
<h3 id="lock">Lock</h3>
<p>Follow the course's locking guidance: use a single coarse-grained lock rather than multiple fine-grained locks, since correctness of the algorithm matters more than performance here. Avoid holding the lock while sending RPCs or blocking on channels, as this may lead to timeouts.</p>
<h2 id="references">References</h2>
<p><a href="https://zh.wikipedia.org/wiki/Raft">Raft Wikipedia</a></p>
<p><a href="https://raft.github.io/">Raft Official Website</a></p>
<p><a href="https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf">Raft Paper</a></p>
<p><a href="https://pdos.csail.mit.edu/6.824/index.html">MIT6.824 Official</a></p>
<p><a href="https://github.com/OneSizeFitsQuorum/MIT6.824-2021/blob/master/docs/lab2.md">Potato’s Implementation Doc</a></p>
Arch + DWM Setup Attempt
https://noneback.github.io/blog/arch+dwm%E5%A5%97%E9%A4%90/
Sat, 15 Jan 2022 23:13:16 +0800https://noneback.github.io/blog/arch+dwm%E5%A5%97%E9%A4%90/<p>Originally, I wanted to replace Manjaro KDE with DWM, but I got stuck at the boot screen, and while trying to fix it, I ended up corrupting the bootloader. So, I decided to go all in, format the entire disk, and try setting up an Arch + DWM development environment. Here, I’m documenting the process to assist with future repairs and device migrations.</p>
<p>This is not a step-by-step guide, but rather a concise record of my journey.</p>
<h2 id="installing-arch-linux">Installing Arch Linux</h2>
<h3 id="preparation">Preparation</h3>
<h4 id="environment-for-installing-arch">Environment for Installing Arch</h4>
<p>To create the installation USB, you’ll need:</p>
<ul>
<li><strong>16GB+ USB drive</strong></li>
<li><strong>Rufus</strong></li>
<li><strong>Windows machine</strong></li>
<li><strong>Arch Linux ISO</strong></li>
</ul>
<p>After creating the bootable USB, boot from it to start Arch Linux.</p>
<h4 id="network-and-mirrors">Network and Mirrors</h4>
<p>Connect to WiFi using <code>iwctl</code>, then update the system clock and modify the Pacman mirror list.</p>
<h3 id="installing-arch-linux-1">Installing Arch Linux</h3>
<h4 id="disk-partitioning">Disk Partitioning</h4>
<p>The disk should be divided into three main parts: Boot, Swap, and Root partitions.</p>
<table>
<thead>
<tr>
<th>Mount Point</th>
<th>Partition</th>
<th><a href="https://en.wikipedia.org/wiki/GUID_Partition_Table#Partition_type_GUIDs">Partition Type</a></th>
<th>Suggested Size</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>/mnt/boot</code> or <code>/mnt/efi</code></td>
<td><code>/dev/*efi_system_partition*</code></td>
<td><a href="https://wiki.archlinux.org/title/EFI_system_partition_%28%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87%29">EFI System Partition</a></td>
<td>At least 260 MiB</td>
</tr>
<tr>
<td><code>[SWAP]</code></td>
<td><code>/dev/*swap_partition*</code></td>
<td>Linux swap</td>
<td>More than 512 MiB</td>
</tr>
<tr>
<td><code>/mnt</code></td>
<td><code>/dev/*root_partition*</code></td>
<td>Linux x86-64 Root (/)</td>
<td>Remaining Space</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>fdisk -l <span style="color:#75715e"># View disk information</span>
</span></span><span style="display:flex;"><span>cfdisk /dev/nvme <span style="color:#75715e"># Partition the disk</span>
</span></span></code></pre></div><h4 id="formatting-partitions">Formatting Partitions</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>mkfs.ext4 <span style="color:#e6db74">${</span>root<span style="color:#e6db74">}</span>
</span></span><span style="display:flex;"><span>mkswap <span style="color:#e6db74">${</span>swap<span style="color:#e6db74">}</span>
</span></span><span style="display:flex;"><span>mkfs.fat -F <span style="color:#ae81ff">32</span> <span style="color:#e6db74">${</span>efi<span style="color:#e6db74">}</span>
</span></span></code></pre></div><h4 id="configuring-partitions-and-installing-the-system">Configuring Partitions and Installing the System</h4>
<ul>
<li>Mount Root: <code>mount /dev/${root_partition} /mnt</code></li>
<li>Mount EFI: <code>mount /dev/${efi_partition} /mnt/boot/efi</code></li>
<li>Activate Swap: <code>swapon /dev/${swap_partition}</code></li>
<li>Install Kernel and Essential Packages: <code>pacstrap /mnt base linux linux-firmware</code></li>
<li>Generate <code>fstab</code> Config: <code>genfstab -U /mnt >> /mnt/etc/fstab</code> (check for correctness)</li>
</ul>
<p>The system should now be installed, but there is no bootloader, so we need to install GRUB.</p>
<h4 id="other-configurations-before-booting">Other Configurations Before Booting</h4>
<ul>
<li>Change root to the new system:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>mount /dev/<span style="color:#e6db74">${</span>root_partition<span style="color:#e6db74">}</span> /mnt
</span></span><span style="display:flex;"><span>arch-chroot /mnt
</span></span></code></pre></div><ul>
<li>
<p>Set timezone and sync time.</p>
</li>
<li>
<p>Configure language by editing <strong>locale.gen</strong> and <strong>locale.conf</strong>.</p>
</li>
<li>
<p>Network configuration: set <strong>hostname</strong> and <strong>hosts</strong>.</p>
</li>
<li>
<p>Set the root password.</p>
</li>
<li>
<p>Install the GRUB bootloader and EFI tools, then run <code>grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=GRUB</code> (with the ESP mounted at <code>/boot/efi</code> inside the chroot).</p>
</li>
<li>
<p>Install and start <strong>iwd</strong> to connect to WiFi.</p>
</li>
<li>
<p>Boot into Arch Linux.</p>
</li>
</ul>
<h3 id="post-boot-configuration">Post-Boot Configuration</h3>
<h4 id="install-essential-software">Install Essential Software</h4>
<table>
<thead>
<tr>
<th>Purpose</th>
<th>Software</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bluetooth</td>
<td>bluetoothctl</td>
</tr>
<tr>
<td>Network</td>
<td>iwd</td>
</tr>
<tr>
<td>Daily Use</td>
<td>nvim, ranger, zsh</td>
</tr>
<tr>
<td>Sound</td>
<td>alsamixer</td>
</tr>
<tr>
<td>Input Method</td>
<td>fcitx5-im, fcitx5-chinese-addons</td>
</tr>
<tr>
<td>Proxy</td>
<td>clash</td>
</tr>
</tbody>
</table>
<h3 id="installing-the-desktop-environment">Installing the Desktop Environment</h3>
<h4 id="install-xorg">Install Xorg</h4>
<p><a href="https://wiki.archlinux.org/title/Xorg_%28%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87%29">Xorg</a> provides an open-source implementation of the X window system, which is the basis for graphical user interfaces.</p>
<p>Install: <strong>xorg-server</strong>, <strong>xorg-apps</strong>, <strong>xrandr</strong>, <strong>xinit</strong>.</p>
<h4 id="install-desktop-companion-software">Install Desktop Companion Software</h4>
<p>I used the Suckless tiling window management suite: <strong>dwm</strong>, <strong>slock</strong>, <strong>st</strong>, <strong>dmenu</strong>, <strong>slim</strong>, <strong>slstatus</strong>.</p>
<h4 id="configure-xinitc-and-xprofile">Configure <code>.xinitrc</code> and <code>.xprofile</code></h4>
<p>Add to <code>.xinitrc</code> and <code>.xprofile</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span><span style="color:#75715e"># .xinitrc</span>
</span></span><span style="display:flex;"><span>fcitx5 &
</span></span><span style="display:flex;"><span>xautolock -time <span style="color:#ae81ff">10</span> -locker slock &
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>autorandr -l home
</span></span><span style="display:flex;"><span>picom -b
</span></span><span style="display:flex;"><span>feh --bg-fill --randomize /home/noneback/Picture/wallpaper/*.jpg
</span></span><span style="display:flex;"><span>slstatus &
</span></span><span style="display:flex;"><span>exec dwm
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># .xprofile</span>
</span></span><span style="display:flex;"><span>export INPUT_METHOD<span style="color:#f92672">=</span>fcitx5
</span></span><span style="display:flex;"><span>export GTK_IM_MODULE<span style="color:#f92672">=</span>fcitx5
</span></span><span style="display:flex;"><span>export QT_IM_MODULE<span style="color:#f92672">=</span>fcitx5
</span></span><span style="display:flex;"><span>export XMODIFIERS<span style="color:#f92672">=</span>@im<span style="color:#f92672">=</span>fcitx5
</span></span></code></pre></div><h4 id="customization-and-usability">Customization and Usability</h4>
<table>
<thead>
<tr>
<th>Purpose</th>
<th>Software</th>
</tr>
</thead>
<tbody>
<tr>
<td>Wallpaper</td>
<td>feh</td>
</tr>
<tr>
<td>Window Effects</td>
<td>picom</td>
</tr>
<tr>
<td>Screen Lock</td>
<td>xautolock</td>
</tr>
<tr>
<td>Multi-Screen</td>
<td>autorandr</td>
</tr>
<tr>
<td>Power Saving</td>
<td>tlp</td>
</tr>
</tbody>
</table>
<h2 id="additional-notes">Additional Notes</h2>
<p>For more detailed instructions, please refer to the official installation documentation.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://wiki.archlinux.org/title/Installation_guide_(%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87)#%E5%BB%BA%E7%AB%8B%E7%A1%AC%E7%9B%98%E5%88%86%E5%8C%BA">Arch Linux Install Wiki</a></li>
<li><a href="https://wiki.archlinux.org/title/Category:X_server_(%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87)">X Server Wiki</a></li>
<li><a href="https://github.com/noneback/dwm-releated">Personal DWM Desktop</a></li>
</ul>
How to Implement SkipList
https://noneback.github.io/blog/how-to-implement-skiplist/
Sun, 21 Nov 2021 15:28:42 +0800https://noneback.github.io/blog/how-to-implement-skiplist/<p>Some time ago, I decided to implement a simple LSM storage engine model. As part of that, I implemented a basic SkipList and BloomFilter with BitSet. However, due to work demands and after-hours laziness, the project was put on hold. Now that I’m thinking about it again, I realize I’ve forgotten some of the details, so I’m writing it down for future reference.</p>
<h2 id="what-is-skiplist">What is SkipList?</h2>
<p><strong>SkipList</strong> is an ordered data structure that can be seen as an alternative to balanced trees. It essentially uses <strong>sparse indexing</strong> to accelerate searches in a linked list structure. It combines both <strong>data and index</strong> into a single structure, allowing efficient insertions and deletions.</p>
<blockquote>
<p>Self-balancing trees, such as AVL trees and Red-Black Trees, solve the problem of tree imbalance but introduce rotations, recoloring, and other complexity. In concurrent scenarios, these rebalancing operations can force larger lock granularity and hurt performance. SkipList avoids these problems.</p>
</blockquote>
<p><img alt="SkipList diagram" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/Skip_list.svg/600px-Skip_list.svg.png"></p>
<h2 id="implementation">Implementation</h2>
<p>SkipLists are usually implemented on top of ordered linked lists. The key challenge with ordered linked lists is figuring out how to insert a new node while maintaining order.</p>
<p>For arrays, binary search can quickly locate the insert position, after which elements are shifted to make room. Linked lists avoid the cost of shifting elements, but they do not support random access, so locating the insertion point is the hard part.</p>
<p>The essence of SkipLists is to <strong>maintain a linked list of nodes with multiple layers of sparse indices</strong> that can be used to efficiently locate nodes.</p>
<blockquote>
<p>In the base level, all nodes are present. On the next level, approximately every other node is present, and so on. This approach reduces the average time complexity for search, insertion, and deletion.</p>
</blockquote>
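<p>The halving described above can be made concrete with a quick sketch (an illustration only, not part of the implementation below): for an ideal SkipList over <code>n</code> base nodes, the number of levels grows as log2(n) while the total number of extra index nodes stays below <code>n</code>.</p>

```go
package main

import "fmt"

// idealIndex models an ideal SkipList over n base nodes where each
// level keeps every other node of the level below. It returns the
// number of levels and the total count of index (non-base) nodes.
func idealIndex(n int) (levels, indexNodes int) {
	levels = 1
	for m := n / 2; m >= 1; m /= 2 {
		levels++
		indexNodes += m
	}
	return
}

func main() {
	levels, indexNodes := idealIndex(16)
	fmt.Println(levels, indexNodes) // 16 base nodes -> 5 levels, 15 index nodes
}
```

So the index roughly doubles the node count while cutting the expected search path to O(log n).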
<h3 id="using-randomization">Using Randomization</h3>
<p>To efficiently maintain these index nodes, randomization is used to decide whether a newly added node should be part of the index.</p>
<blockquote>
<p>For a SkipList with maximum level <code>X</code>, each time a new node is added, a random process decides how high it is promoted: a node reaches index level <code>n</code> with probability <code>1/(2^n)</code>. On average, each level therefore contains about half as many nodes as the level below it, approximating an evenly partitioned index.</p>
</blockquote>
<h3 id="data-structures">Data Structures</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">SkipListNode</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">data</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">codec</span>.<span style="color:#a6e22e">Entry</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">nextPtrs</span> []<span style="color:#f92672">*</span><span style="color:#a6e22e">SkipListNode</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">SkipList</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">header</span>, <span style="color:#a6e22e">tail</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">SkipListNode</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">level</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">size</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">rwmtx</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">sync</span>.<span style="color:#a6e22e">RWMutex</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">maxSize</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h3 id="operations">Operations</h3>
<p>The two most notable operations are <strong>search</strong> and <strong>insertion</strong>.</p>
<h4 id="search">Search</h4>
<p>The key step here is to use the sparse indices in the SkipList, moving from the top level downwards to efficiently locate the required position.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#75715e">// findPreNode finds the node before the node with the given key
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#66d9ef">func</span> (<span style="color:#a6e22e">sl</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">SkipList</span>) <span style="color:#a6e22e">findPreNode</span>(<span style="color:#a6e22e">key</span> []<span style="color:#66d9ef">byte</span>) (<span style="color:#f92672">*</span><span style="color:#a6e22e">SkipListNode</span>, <span style="color:#66d9ef">bool</span>) {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Start from the highest level
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#a6e22e">h</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">header</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">i</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">level</span> <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>; <span style="color:#a6e22e">i</span> <span style="color:#f92672">>=</span> <span style="color:#ae81ff">0</span>; <span style="color:#a6e22e">i</span><span style="color:#f92672">--</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>] <span style="color:#f92672">!=</span> <span style="color:#66d9ef">nil</span> <span style="color:#f92672">&&</span> <span style="color:#a6e22e">bytes</span>.<span style="color:#a6e22e">Compare</span>(<span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>].<span style="color:#a6e22e">data</span>.<span style="color:#a6e22e">Key</span>, <span style="color:#a6e22e">key</span>) <span style="color:#f92672">!=</span> <span style="color:#ae81ff">1</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> <span style="color:#a6e22e">bytes</span>.<span style="color:#a6e22e">Equal</span>(<span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>].<span style="color:#a6e22e">data</span>.<span style="color:#a6e22e">Key</span>, <span style="color:#a6e22e">key</span>) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">h</span>, <span style="color:#66d9ef">true</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">h</span> = <span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>]
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">nil</span>, <span style="color:#66d9ef">false</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="insertion">Insertion</h4>
<p>First, locate the position to insert, then perform the insertion, and finally add indices as determined by randomization.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">func</span> (<span style="color:#a6e22e">sl</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">SkipList</span>) <span style="color:#a6e22e">randomLevel</span>() <span style="color:#66d9ef">int</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">ans</span> <span style="color:#f92672">:=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">rand</span>.<span style="color:#a6e22e">Seed</span>(<span style="color:#a6e22e">time</span>.<span style="color:#a6e22e">Now</span>().<span style="color:#a6e22e">Unix</span>())
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">rand</span>.<span style="color:#a6e22e">Intn</span>(<span style="color:#ae81ff">2</span>) <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span> <span style="color:#f92672">&&</span> <span style="color:#a6e22e">ans</span> <span style="color:#f92672"><=</span> <span style="color:#a6e22e">defaultMaxLevel</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">ans</span><span style="color:#f92672">++</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">ans</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">func</span> (<span style="color:#a6e22e">sl</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">SkipList</span>) <span style="color:#a6e22e">Insert</span>(<span style="color:#a6e22e">data</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">codec</span>.<span style="color:#a6e22e">Entry</span>) <span style="color:#f92672">*</span><span style="color:#a6e22e">SkipListNode</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">rwmtx</span>.<span style="color:#a6e22e">Lock</span>()
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">defer</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">rwmtx</span>.<span style="color:#a6e22e">Unlock</span>()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">h</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">header</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">updateNode</span> <span style="color:#f92672">:=</span> make([]<span style="color:#f92672">*</span><span style="color:#a6e22e">SkipListNode</span>, <span style="color:#a6e22e">defaultMaxLevel</span>) <span style="color:#75715e">// stores the node before newNode
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#75715e">// Search from the top level
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">i</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">level</span> <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>; <span style="color:#a6e22e">i</span> <span style="color:#f92672">>=</span> <span style="color:#ae81ff">0</span>; <span style="color:#a6e22e">i</span><span style="color:#f92672">--</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Loop while the current nextPtrs is not empty and data is smaller than the inserted one
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>] <span style="color:#f92672">!=</span> <span style="color:#66d9ef">nil</span> <span style="color:#f92672">&&</span> <span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>].<span style="color:#a6e22e">data</span>.<span style="color:#a6e22e">Less</span>(<span style="color:#a6e22e">data</span>) {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">h</span> = <span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>]
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">updateNode</span>[<span style="color:#a6e22e">i</span>] = <span style="color:#a6e22e">h</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Choose the level to insert
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#a6e22e">lvl</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">randomLevel</span>()
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> <span style="color:#a6e22e">lvl</span> > <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">level</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Insert into higher levels, we need to create header -> tail
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">i</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">level</span>; <span style="color:#a6e22e">i</span> < <span style="color:#a6e22e">lvl</span>; <span style="color:#a6e22e">i</span><span style="color:#f92672">++</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">updateNode</span>[<span style="color:#a6e22e">i</span>] = <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">header</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">level</span> = <span style="color:#a6e22e">lvl</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Insert after the updated node
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#a6e22e">n</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">NewSkipListNode</span>(<span style="color:#a6e22e">lvl</span>, <span style="color:#a6e22e">data</span>)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">i</span> <span style="color:#f92672">:=</span> <span style="color:#ae81ff">0</span>; <span style="color:#a6e22e">i</span> < <span style="color:#a6e22e">lvl</span>; <span style="color:#a6e22e">i</span><span style="color:#f92672">++</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">n</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>] = <span style="color:#a6e22e">updateNode</span>[<span style="color:#a6e22e">i</span>].<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>]
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">updateNode</span>[<span style="color:#a6e22e">i</span>].<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>] = <span style="color:#a6e22e">n</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">size</span><span style="color:#f92672">++</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">n</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="references">References</h2>
<ul>
<li><a href="https://www.jianshu.com/p/09c3b0835ba6#%E4%BB%8Ezset%E5%88%B0zskiplist">Skip List and its implementation in Redis</a></li>
<li><a href="https://github.com/redis/redis/blob/91e77a0cfb5c7e4bc6473ae04353e48ad9e8697b/src/t_zset.c">Redis source code: zskiplist</a></li>
<li><a href="https://en.wikipedia.org/wiki/Skip_list">Wikipedia: Skip List</a></li>
<li><a href="https://spongecaptain.cool/post/datastracture/skiplist/">SpongeCaptain’s blog on Data Structures</a></li>
</ul>
Kylin Overview
https://noneback.github.io/blog/kylin%E6%A6%82%E8%BF%B0/
Wed, 10 Nov 2021 23:45:27 +0800https://noneback.github.io/blog/kylin%E6%A6%82%E8%BF%B0/<p>I had been hoping to work on an interesting thesis, but couldn’t find a suitable advisor nearby. Before the college opened topic selection, I did find a promising advisor, but it turned out they couldn’t take me on; since I wasn’t especially interested in that field anyway, I kept looking. When the college’s thesis selection recently started, I found an interesting topic on the list, reached out to the professor, and took on the project.</p>
<p>The topic I chose is <strong>“Design and Implementation of Database Query Algorithms Based on Differential Privacy”</strong>, focusing on Differential Privacy + OLAP. Specifically, it’s about adding Differential Privacy as a feature to Kylin.</p>
<p>That’s the overall gist; as for the details, I might write about them in future blog posts. This is the first in this series of blog posts.</p>
<h2 id="introduction">Introduction</h2>
<p>Kylin is a distributed OLAP data warehouse based on columnar storage systems like HBase and Parquet, and computational frameworks like Hadoop and Spark. It supports multidimensional analysis of massive datasets.</p>
<p>Kylin uses a cube pre-computation method, transforming real-time queries into queries against precomputed results, utilizing idle computation resources and storage space to optimize query times. This can significantly reduce query latency.</p>
<h2 id="background">Background</h2>
<p>Before Kylin, Hadoop was commonly used for large-scale data batch processing, with results stored in columnar storage systems like HBase. The OLAP-related technologies of that era were <strong>massively parallel processing</strong> and <strong>columnar storage</strong>.</p>
<ul>
<li>
<p><strong>Massively Parallel Processing (MPP)</strong>: Leverages multiple machines to process computational tasks in parallel, essentially trading a linear growth in computing resources for a linear decrease in processing time.</p>
</li>
<li>
<p><strong>Columnar Storage</strong>: Stores data in columns instead of rows. This approach is particularly effective for OLAP queries, which typically involve aggregations of specific columns. Columnar storage allows querying only the necessary columns and makes effective use of sequential I/O, thus improving performance.</p>
</li>
</ul>
<p>These technologies enabled minute-level SQL query performance on platforms like Hadoop. However, even this is insufficient for interactive analysis, as the latency is still too high.</p>
<p>The core issue is that <strong>neither parallel computing nor columnar storage changes the fundamental time complexity of querying; they do not break the linear relationship between query time and data volume</strong>. Therefore, the only optimization comes from increasing computing resources and exploiting locality principles, both of which have scalability and theoretical bottlenecks as data grows.</p>
<p>To address this, Kylin introduced a <strong>pre-computation strategy</strong>, building multidimensional <strong>cubes</strong> for different dimensions and storing them as data tables. Future queries are made directly against these precomputed results. With pre-computation, the size of the materialized views is determined only by the cardinality of the dimensions and is no longer linearly proportional to the size of the dataset.</p>
<p>Essentially, this strategy <strong>uses idle computational resources and additional storage to improve response times during queries, breaking the linear relationship between query time and data size</strong>.</p>
<h2 id="core-concepts">Core Concepts</h2>
<p>The core working principle of Apache Kylin is <strong>MOLAP (Multidimensional Online Analytical Processing) Cube</strong> technology.</p>
<h3 id="dimensions-and-measures">Dimensions and Measures</h3>
<p><strong>Dimensions</strong> refer to perspectives used for aggregating data, typically attributes of data records. <strong>Measures</strong> are numerical values calculated based on data. Using dimensions, you can aggregate measures, e.g., $$D_1,D_2,D_3,… \rightarrow S_1,S_2,…$$</p>
<h3 id="cube-theory">Cube Theory</h3>
<p><strong>Data Cube</strong> involves building and querying precomputed, multidimensional data indices.</p>
<ul>
<li><strong>Cuboid</strong>: The data calculated for a particular combination of dimensions.</li>
<li><strong>Cube Segment</strong>: The smallest building block of a cube. A cube can be split into multiple segments.</li>
<li><strong>Incremental Cube Building</strong>: Typically triggered based on time attributes.</li>
<li><strong>Cube Cardinality</strong>: The cardinality of all dimensions in a cube determines the cube’s complexity. Higher cardinality often leads to cube expansion (amplified I/O and storage).</li>
</ul>
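<p>Since a cuboid is the aggregate for one subset of the cube's dimensions, a cube over <code>d</code> dimensions has <code>2^d</code> cuboids, which is why cube complexity and storage expand quickly with dimensionality. A tiny illustration (not Kylin code):</p>

```go
package main

import "fmt"

// cuboidCount returns the number of cuboids for a cube with the given
// number of dimensions: one per subset of dimensions, including the
// empty combination (the grand total), i.e. 2^d.
func cuboidCount(dimensions int) int {
	return 1 << dimensions
}

func main() {
	fmt.Println(cuboidCount(3))  // dimensions {A,B,C} -> 8 cuboids
	fmt.Println(cuboidCount(10)) // 10 dimensions -> 1024 cuboids
}
```

This is the combinatorial root of the cube-expansion problem noted above: adding one dimension doubles the number of cuboids to precompute and store.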
<h2 id="architecture-design">Architecture Design</h2>
<p>Kylin consists of two parts: <strong>online querying</strong> and <strong>offline building</strong>.</p>
<p><img alt="Kylin Architecture" src="https://i.loli.net/2021/11/10/AoxY4POJHdqLheb.png"></p>
<ul>
<li><strong>Offline Building</strong>: Involves three main components: the data source, the build engine, and the storage engine. Data is fetched from the data source, cubes are built, and they are stored in the columnar storage engine.</li>
<li><strong>Online Querying</strong>: Consists of an interface layer and a query engine, abstracting away concepts like cubes from the user. External applications use the REST API to submit queries, which are processed by the query engine and returned.</li>
</ul>
<h2 id="summary">Summary</h2>
<p>As an OLAP engine, Kylin leverages <strong>parallel computing, columnar storage, and pre-computation</strong> techniques to improve both online query and offline build performance. This has the following notable pros and cons:</p>
<h3 id="advantages">Advantages</h3>
<ul>
<li><strong>Standard SQL Interface</strong>: Supports BI tools and makes integration easy.</li>
<li><strong>High Query Speed</strong>: Queries against precomputed results are very fast.</li>
<li><strong>Scalable Architecture</strong>: Easily scales to handle increasing data volumes.</li>
</ul>
<h3 id="disadvantages">Disadvantages</h3>
<ul>
<li><strong>Complex Dependencies</strong>: Kylin relies on many external systems, which can make operations and maintenance challenging.</li>
<li><strong>I/O and Storage Overhead</strong>: Pre-computation and cube building can lead to amplified I/O and storage needs.</li>
<li><strong>Limited by Data Models</strong>: The complexity of data models and cube cardinality can impose limitations on scalability.</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li><a href="https://tech.meituan.com/2020/11/19/apache-kylin-practice-in-meituan.html">Meituan: Apache Kylin’s Practice and Optimization</a></li>
<li><a href="https://kylin.apache.org/">Kylin Official Documentation</a></li>
</ul>
DFS-Haystack
https://noneback.github.io/blog/dfs-haystack/
Wed, 06 Oct 2021 22:44:01 +0800https://noneback.github.io/blog/dfs-haystack/<p>The primary project in my group is a distributed file system (DFS) that provides POSIX file system semantics. The approach to handle “lots of small files” (LOSF) is inspired by Haystack, which is specifically designed for small files. I decided to read through the Haystack paper and take some notes as a learning exercise.</p>
<p>These notes are not an in-depth analysis of specific details but rather a record of my thoughts on the problem and design approach.</p>
<h2 id="introduction">Introduction</h2>
<p>Haystack is a storage system designed by Facebook for small files. In traditional DFS, file addressing typically involves using caches to store metadata, reducing disk interaction and improving lookup efficiency. For each file, a separate set of metadata must be maintained, with the volume of metadata depending on the number of files. In high-concurrency scenarios, metadata is cached in memory to reduce disk I/O.</p>
<p>With a large number of small files, the volume of metadata becomes significant. Considering the maintenance overhead of in-memory metadata, this approach becomes impractical. Therefore, Haystack was developed specifically for small files, with the core idea of aggregating multiple small files into a larger one to reduce metadata.</p>
<h2 id="background">Background</h2>
<p>The “small files” in the paper specifically refer to image data.</p>
<p>Facebook, as a social media company, deals heavily with image uploads and retrieval. As the business scaled, it became necessary to have a dedicated service to handle the massive, high-concurrency requests for image reads and writes.</p>
<p>In the social networking context, this type of data is characterized as <code>written once, read often, never modified, and rarely deleted</code>. Based on this, Facebook developed Haystack to support image sharing services.</p>
<h2 id="design">Design</h2>
<h3 id="traditional-design">Traditional Design</h3>
<p>The paper describes two historical designs: CDN-based and NAS-based solutions.</p>
<h4 id="cdn-based-solution">CDN-based Solution</h4>
<p>The core of this solution is to use CDN (Content Delivery Network) to cache hot image data, reducing network transmission.</p>
<p>This approach optimizes access to hot images but also has some issues. Firstly, CDN is expensive and has limited capacity. Secondly, image sharing includes many <code>less popular</code> images, which leads to the long tail effect, slowing down access.</p>
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/202411011455343.png"></p>
<blockquote>
<p>CDNs are generally used to serve static data and are often pre-warmed before an event, making them unsuitable as an image cache service. Many <code>less popular</code> images do not enter the CDN, leading to the long tail effect.</p>
</blockquote>
<h4 id="nas-based-solution">NAS-based Solution</h4>
<p>This was Facebook’s initial design and is essentially a variation of the CDN-based solution.</p>
<p>They introduced NAS (Network Attached Storage) for horizontal storage expansion, incorporating file system semantics, but disk I/O remained an issue. Similar to local files, reading uncached data requires at least three disk I/O operations:</p>
<ul>
<li>Read directory metadata into memory</li>
<li>Load the inode into memory</li>
<li>Read the content of the file</li>
</ul>
<p>PhotoStore was used as a caching layer to store some metadata like file handles to speed up the addressing process.</p>
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/202411011454979.png"></p>
<p>The NAS-based design did not solve the fundamental issue of excessive metadata that could not be fully cached. When the number of files reaches a certain threshold, disk I/O becomes inevitable.</p>
<blockquote>
<p>The fundamental issue is the <strong>one-to-one relationship between files and addressing metadata</strong>, causing the volume of metadata to change with the number of files.</p>
</blockquote>
<p>Thus, the key to optimization is changing the <strong>one-to-one relationship between files and metadata</strong>, reducing the frequency of disk I/O during addressing.</p>
<h3 id="haystack-based-solution">Haystack-based Solution</h3>
<p>The core idea of Haystack is to <strong>aggregate multiple small files into a larger one</strong>, maintaining a single piece of metadata for the large file. This changes the mapping between metadata and files, making it feasible to keep all metadata in memory.</p>
<blockquote>
<p>Metadata is maintained only for the aggregated file, and the position of small files within the large file is maintained separately.</p>
</blockquote>
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/202411011456020.png"></p>
<h2 id="implementation">Implementation</h2>
<p>Haystack mainly consists of three components: Haystack Directory, Haystack Cache, and Haystack Store.</p>
<h3 id="file-mapping-and-storage">File Mapping and Storage</h3>
<p>File data is ultimately stored on logical volumes, each of which corresponds to multiple physical volumes across machines.</p>
<p>Users first access the Directory to obtain access paths and then use the URL generated by the Directory to access other components to retrieve the required data.</p>
<h3 id="components">Components</h3>
<h4 id="haystack-directory">Haystack Directory</h4>
<p>This is Haystack’s access layer, responsible for <strong>file addressing</strong> and <strong>access control</strong>.</p>
<p>Read and write requests first go through the Directory. For read requests, the Directory generates an access URL containing the path: <code>http://{cdn}/{cache}/{machine id}/{logicalvolume,Photo}</code>. For write requests, it provides a volume to write into.</p>
<p>The Directory has four main functions:</p>
<ol>
<li>Load balancing for read and write requests.</li>
<li>Determine request access paths (e.g., CDN or direct access) and generate access URLs.</li>
<li>Metadata and mapping management, e.g., logical-to-physical volume mappings.</li>
<li>Logical volume read/write management, where volumes can be read-only or write-enabled.</li>
</ol>
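<p>The URL construction described above can be sketched as a simple formatting step; field and function names here are illustrative, not Haystack's actual API:</p>

```go
package main

import "fmt"

// buildReadURL sketches how the Directory might assemble the access
// path http://{cdn}/{cache}/{machine id}/{logical volume, photo}.
// All parameter names are hypothetical.
func buildReadURL(cdn, cache, machineID, logicalVolume, photo string) string {
	return fmt.Sprintf("http://%s/%s/%s/%s,%s", cdn, cache, machineID, logicalVolume, photo)
}

func main() {
	fmt.Println(buildReadURL("cdn.example.com", "cache01", "m42", "lv7", "photo123"))
}
```

For direct access (bypassing the CDN), the Directory would simply omit the CDN component from the generated path.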
<blockquote>
<p>This design follows from the data characteristics: &ldquo;written once, read often.&rdquo; Separating read-only volumes from write-enabled ones improves concurrency.</p>
</blockquote>
<p>The Directory stores metadata such as file-to-volume mappings, logical-to-physical mappings, and volume attributes (size, owner, etc.). It relies on a distributed key-value store and a cache service to ensure low latency and high availability.</p>
<blockquote>
<p><strong>Proxy, Metadata Mapping, Access Control</strong></p>
</blockquote>
<h4 id="haystack-cache">Haystack Cache</h4>
<p>The Cache layer optimizes addressing and image retrieval. The core design is the <strong>Cache Rule</strong>, which determines what data should be cached and how to handle <strong>cache misses</strong>.</p>
<p>Images are cached if they meet these criteria:</p>
<ol>
<li>The request is directly from a user, not from a CDN.</li>
<li>The photo is retrieved from a write-enabled store machine.</li>
</ol>
<p>If a cache miss occurs, the Cache fetches the image from the Store and pushes it to both the user and the CDN.</p>
<blockquote>
<p>The caching policy is based on typical access patterns.</p>
</blockquote>
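<p>The two cache-rule criteria above reduce to a simple predicate; this is a hypothetical helper for illustration, not Haystack's actual code:</p>

```go
package main

import "fmt"

// shouldCache encodes the Cache Rule: cache a photo only when the
// request came directly from a user (not via the CDN) and the photo
// was read from a write-enabled Store machine, i.e. a recently
// written, likely-hot photo.
func shouldCache(directFromUser, storeWriteEnabled bool) bool {
	return directFromUser && storeWriteEnabled
}

func main() {
	fmt.Println(shouldCache(true, true))  // user request, recent photo -> cache
	fmt.Println(shouldCache(false, true)) // CDN request -> do not cache
}
```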
<h4 id="haystack-store">Haystack Store</h4>
<p>The Store layer is responsible for data storage operations.</p>
<p>The addressing abstraction is: <code>filename + offset => logical volume id + offset => data</code>.</p>
<p>Multiple physical volumes constitute a logical volume. In the Store, small files are encapsulated as <strong>Needles</strong> managed by physical volumes.</p>
<p><img alt="Needle Abstraction" src="https://tva1.sinaimg.cn/large/008i3skNly1gv5oo0mltfj60zs0u0q5j02.jpg"></p>
<blockquote>
<p>Needles represent a way to encapsulate small files and manage volume blocks.</p>
</blockquote>
<p>Store data is accessed at the Needle level. To speed up addressing, each volume keeps an in-memory map: <code>key/alternate key => needle's flag/offset/other attributes</code>.</p>
<p>These maps are persisted in <strong>Index Files</strong> on disk to provide a checkpoint for quick metadata recovery after a crash.</p>
<p><img alt="Index File" src="https://tva1.sinaimg.cn/large/008i3skNly1gv5put6m7qj60u40jc0u102.jpg"></p>
<p><img alt="Volume Mapping" src="https://tva1.sinaimg.cn/large/008i3skNly1gv5puqgvgcj60te0dk0ua02.jpg"></p>
<blockquote>
<p>Each volume maintains its own in-memory mapping and index file.</p>
</blockquote>
<p>When updating the in-memory mapping (e.g., adding or modifying a file), the index file is updated asynchronously. Deleted files are only marked as deleted, not removed from the index file.</p>
<blockquote>
<p>The index serves as a lookup aid. Needles without an index can still be addressed, making the asynchronous update and index retention strategy feasible.</p>
</blockquote>
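<p>As a concrete illustration, here is a minimal Go sketch of a per-volume needle map: append-only puts, mark-delete, and lookup. The names (<code>VolumeIndex</code>, <code>NeedleMeta</code>) are invented for this sketch and are not Haystack’s actual structures.</p>

```go
package main

import "fmt"

// NeedleMeta is the in-memory record for one needle: its delete flag and
// its location inside the physical volume file. Illustrative only.
type NeedleMeta struct {
	Deleted bool
	Offset  int64
	Size    int64
}

// VolumeIndex maps (key, alternate key) to needle metadata, mirroring the
// map each Store machine keeps per volume.
type VolumeIndex struct {
	needles map[[2]uint64]NeedleMeta
}

func NewVolumeIndex() *VolumeIndex {
	return &VolumeIndex{needles: make(map[[2]uint64]NeedleMeta)}
}

// Put records a freshly appended needle. Because the volume is append-only,
// a later Put for the same key simply points the map at the newer offset.
func (v *VolumeIndex) Put(key, altKey uint64, offset, size int64) {
	v.needles[[2]uint64{key, altKey}] = NeedleMeta{Offset: offset, Size: size}
}

// Delete only marks the needle; its bytes stay in the volume until a
// compaction pass reclaims them.
func (v *VolumeIndex) Delete(key, altKey uint64) {
	if m, ok := v.needles[[2]uint64{key, altKey}]; ok {
		m.Deleted = true
		v.needles[[2]uint64{key, altKey}] = m
	}
}

// Lookup returns the needle location, or ok=false if it is absent or
// marked deleted.
func (v *VolumeIndex) Lookup(key, altKey uint64) (NeedleMeta, bool) {
	m, ok := v.needles[[2]uint64{key, altKey}]
	if !ok || m.Deleted {
		return NeedleMeta{}, false
	}
	return m, true
}

func main() {
	idx := NewVolumeIndex()
	idx.Put(42, 1, 0, 4096)
	idx.Put(42, 1, 4096, 4096) // overwrite: a new version at a larger offset
	if m, ok := idx.Lookup(42, 1); ok {
		fmt.Println("offset:", m.Offset) // the latest version wins
	}
	idx.Delete(42, 1)
	_, ok := idx.Lookup(42, 1)
	fmt.Println("visible after delete:", ok)
}
```

<p>Because every mutation goes through this map first, checkpointing it to an index file and replaying the volume tail after a crash is enough to rebuild it.</p>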
<h3 id="workloads">Workloads</h3>
<h4 id="read">Read</h4>
<p><code>(Logical Volume ID, key, alternate key, cookies) => photo</code></p>
<p>For a read request, Store queries the in-memory mapping for the corresponding Needle. If found, it fetches the data from the volume and verifies the cookie and integrity; otherwise, it returns an error.</p>
<blockquote>
<p>Cookies are randomly generated strings that prevent malicious attacks.</p>
</blockquote>
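<p>The cookie check on the read path can be sketched as follows: data is returned only if the needle exists and the request cookie matches the stored one. <code>readPhoto</code> and the plain map stand in for the real Store internals; the constant-time compare is a defensive choice of this sketch, not something the paper specifies.</p>

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
)

// photo pairs the stored cookie with the image bytes; in the real Store
// the cookie lives in the needle header on disk. Names are illustrative.
type photo struct {
	cookie []byte
	data   []byte
}

var errNotFound = errors.New("needle not found or cookie mismatch")

// readPhoto models the Store read path: find the needle, then verify the
// request cookie against the stored one before returning data.
func readPhoto(store map[uint64]photo, key uint64, cookie []byte) ([]byte, error) {
	p, ok := store[key]
	if !ok {
		return nil, errNotFound
	}
	if subtle.ConstantTimeCompare(p.cookie, cookie) != 1 {
		return nil, errNotFound
	}
	return p.data, nil
}

func main() {
	store := map[uint64]photo{7: {cookie: []byte("s3cret"), data: []byte("jpeg bytes")}}
	if _, err := readPhoto(store, 7, []byte("guess")); err != nil {
		fmt.Println("rejected:", err) // wrong cookie: request is refused
	}
	data, _ := readPhoto(store, 7, []byte("s3cret"))
	fmt.Printf("ok: %s\n", data)
}
```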
<h4 id="write">Write</h4>
<p><code>(Logical Volume ID, key, alternate key, cookies, data) => result</code></p>
<p>Haystack only supports appending data rather than overwriting. When a write request is received, Store appends the data as a new Needle and updates the in-memory mapping; the on-disk index file is updated asynchronously. If the file already exists, the Directory updates its metadata to point to the latest version.</p>
<blockquote>
<p>Older volumes are frozen as read-only, and new writes are appended, so a larger offset indicates a newer version.</p>
</blockquote>
<h4 id="delete">Delete</h4>
<p>Deletion is handled using <strong>Mark Delete + Compact GC</strong>.</p>
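<p>A toy compaction pass makes the “Mark Delete + Compact GC” idea concrete. It assumes needles are scanned in append order, so a larger offset means a newer version; only the latest live version of each key survives the rewrite. The <code>needle</code> struct is illustrative.</p>

```go
package main

import "fmt"

// needle carries just enough state for the GC sketch: its key, whether it
// was mark-deleted, and its offset (larger offset = newer version).
type needle struct {
	key     uint64
	deleted bool
	offset  int64
}

// compact rewrites a volume, keeping only the newest live version of each
// key; space held by deleted or superseded needles is reclaimed.
func compact(vol []needle) []needle {
	latest := make(map[uint64]needle)
	for _, n := range vol { // needles appear in append order, so later wins
		latest[n.key] = n
	}
	out := make([]needle, 0, len(latest))
	for _, n := range vol { // preserve on-disk order for the survivors
		if m := latest[n.key]; m.offset == n.offset && !n.deleted {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	vol := []needle{
		{key: 1, offset: 0},
		{key: 2, offset: 100},
		{key: 1, offset: 200},                // newer version of key 1
		{key: 2, offset: 300, deleted: true}, // key 2 was deleted
	}
	fmt.Println("live needles after compaction:", len(compact(vol)))
}
```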
<h3 id="fault-tolerance">Fault Tolerance</h3>
<p>Store ensures fault tolerance through <strong>monitoring + hot backup</strong>. Directory and Cache use Raft-like consistency algorithms for data replication and availability.</p>
<h2 id="optimization">Optimization</h2>
<p>The main optimizations include: Compaction, Batch Load, and In-Memory processing.</p>
<h2 id="summary">Summary</h2>
<ul>
<li>Key abstraction optimizations include asynchronous processing, batch operations, and caching.</li>
<li>Identifying the core issues, such as metadata management burden for a large number of small files, is crucial.</li>
</ul>
<h2 id="references">References</h2>
<p><a href="https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf">Finding a needle in Haystack: Facebook’s photo storage</a></p>
MIT6.824 Bigtable
https://noneback.github.io/blog/mit6.824-bigtable/
Thu, 16 Sep 2021 22:54:59 +0800https://noneback.github.io/blog/mit6.824-bigtable/<p>I recently found a translated version of the Bigtable paper online and saved it, but hadn’t gotten around to reading it. Lately, I’ve noticed that Bigtable shares many design similarities with a current project in our group, so I took some time over the weekend to read through it.</p>
<p>This is the last of Google’s three foundational distributed system papers, and although it wasn’t originally part of the MIT6.824 reading list, I’ve categorized it here for consistency.</p>
<p>As with previous notes, I won’t dive deep into the technical details but will instead focus on the design considerations and thoughts on the problem.</p>
<h2 id="introduction">Introduction</h2>
<p>Bigtable is a distributed <strong>structured data</strong> storage system built on top of GFS, designed to store large amounts of structured and semi-structured data. It is a NoSQL data store that emphasizes scalability and performance, as well as reliable fault tolerance through GFS.</p>
<blockquote>
<p>Design Goal: Wide Applicability, Scalability, High Performance, High Availability</p>
</blockquote>
<h2 id="data-model">Data Model</h2>
<p>Bigtable’s data model is schema-less and deliberately simple. It treats all data as uninterpreted strings, with encoding and decoding handled by the application layer.</p>
<p>Bigtable is essentially a <strong>sparse, distributed, persistent multidimensional sorted Map</strong>. The <strong>index</strong> of the Map is composed of <strong>Row Key, Column Key, and TimeStamp</strong>, and the <strong>value</strong> is an unstructured byte array.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#75715e">// Mapping abstraction
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>(<span style="color:#a6e22e">row</span>:<span style="color:#66d9ef">string</span>, <span style="color:#a6e22e">column</span>:<span style="color:#66d9ef">string</span>, <span style="color:#a6e22e">time</span>:<span style="color:#66d9ef">int64</span>) <span style="color:#f92672">-</span>> <span style="color:#66d9ef">string</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">// The map key is the multi-dimensional tuple {Row, Column, Timestamp}; the value is an uninterpreted byte string.
</span></span></span></code></pre></div><p>The paper describes the data model as follows:</p>
<blockquote>
<p>A Bigtable is a sparse, distributed, persistent multidimensional sorted map.</p>
</blockquote>
<p><strong>Sparse</strong> means that columns in the same table can be null, which is quite common.</p>
<table>
<thead>
<tr>
<th>Row</th>
<th>Columns</th>
</tr>
</thead>
<tbody>
<tr>
<td>Row1</td>
<td>{ID, Name, Phone}</td>
</tr>
<tr>
<td>Row2</td>
<td>{ID, Name, Phone, Address}</td>
</tr>
<tr>
<td>Row3</td>
<td>{ID, Name, Phone, Email}</td>
</tr>
</tbody>
</table>
<p><strong>Distributed</strong> refers to scalability and fault tolerance, i.e., <strong>Replication</strong> and <strong>Sharding</strong>. Bigtable leverages GFS replicas for fault tolerance and uses <strong>Tablet</strong> for partitioning data to achieve scalability.</p>
<p><strong>Persistent Multidimensional Sorted</strong> means data is eventually persisted to disk and kept in sorted key order; Bigtable uses a WAL and an LSM-tree-style structure to keep write and read latency low.</p>
<blockquote>
<p>The best-known open-source implementation of Bigtable is HBase, a wide-column store.</p>
</blockquote>
<h3 id="rows">Rows</h3>
<p>Bigtable organizes data using lexicographic order of row keys. A Row Key can be any string, and read and write operations are atomic at the row level.</p>
<blockquote>
<p>Lexicographic ordering helps aggregate related row records.
MySQL achieves atomic row operations using an undo log.</p>
</blockquote>
<h3 id="column-family">Column Family</h3>
<p>A set of column keys forms a Column Family, where the data often shares the same type.</p>
<p>A column key is composed of <code>Column Family : Qualifier</code>. The column family’s name must be a printable string, whereas the qualifier name can be any string.</p>
<blockquote>
<p>The paper mentions:</p>
<blockquote>
<p>Access control and both disk and memory accounting are performed at the column-family level.</p>
</blockquote>
<p>This is because business users tend to retrieve data by columns, e.g., reading webpage content. In practice, column data is often compressed for storage. Thus, the Column Family level is a more suitable level for access control and resource accounting than rows.</p>
</blockquote>
<h3 id="timestamp">TimeStamp</h3>
<p>The timestamp is used to maintain different versions of the same data, serving as a logical clock. It is also used as an index to query data versions.</p>
<blockquote>
<p>Typically, timestamps are sorted in reverse chronological order. When the number of versions is low, a pointer to the previous version is used to maintain data versioning; when the number of versions increases, an index structure is needed.
TimeStamp indexing inherently requires range queries, so a sortable data structure is appropriate for indexing.
Extra version management increases maintenance overhead, usually handled by limiting the number of data versions and garbage collecting outdated versions.</p>
</blockquote>
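<p>A sketch of per-cell version management under the assumptions above: versions kept newest-first, timestamp lookups answered by finding the newest version at or below the requested time, and garbage collection by keeping only the newest <code>n</code> versions. All names are illustrative.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// cellVersion is one timestamped value of a cell; versions are kept
// newest-first, matching Bigtable's reverse-chronological order.
type cellVersion struct {
	ts    int64
	value string
}

// put inserts a version, keeping the slice sorted by descending timestamp.
func put(versions []cellVersion, ts int64, value string) []cellVersion {
	versions = append(versions, cellVersion{ts, value})
	sort.Slice(versions, func(i, j int) bool { return versions[i].ts > versions[j].ts })
	return versions
}

// getAsOf returns the newest version at or before ts — a range-style
// lookup, which is why a sorted structure suits timestamp indexing.
func getAsOf(versions []cellVersion, ts int64) (string, bool) {
	for _, v := range versions { // newest first
		if v.ts <= ts {
			return v.value, true
		}
	}
	return "", false
}

// gcKeepN drops all but the newest n versions, the usual way version
// bloat is bounded.
func gcKeepN(versions []cellVersion, n int) []cellVersion {
	if len(versions) > n {
		return versions[:n]
	}
	return versions
}

func main() {
	var vs []cellVersion
	vs = put(vs, 100, "v1")
	vs = put(vs, 200, "v2")
	vs = put(vs, 300, "v3")
	v, _ := getAsOf(vs, 250)
	fmt.Println("as of t=250:", v) // v2: newest version not after t=250
	vs = gcKeepN(vs, 2)
	fmt.Println("versions kept:", len(vs))
}
```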
<h3 id="tablet">Tablet</h3>
<p>Bigtable uses a <strong>range-based data sharding</strong> strategy, and <strong>Tablet</strong> is the basic unit for data sharding and load balancing.</p>
<p>A tablet is a collection of rows, managed by a Tablet Server. Rows in Bigtable are ultimately stored in a tablet, which is split or merged for load balancing among Tablet Servers.</p>
<blockquote>
<p>Range-based sharding is beneficial for range queries, compared to hash-based sharding.</p>
</blockquote>
<h3 id="sstable">SSTable</h3>
<p>SSTable is a <strong>persistent, sorted, immutable Map</strong>. Both keys and values are arbitrary byte arrays.</p>
<p>A tablet in Bigtable is stored in the form of SSTable files.</p>
<blockquote>
<p>SSTable is organized into data blocks (typically 64KB each), with an index for fast data lookup. Data is read by first reading the index, searching the index, and then reading the data block.</p>
</blockquote>
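<p>That two-step read (index first, then exactly one data block) amounts to a binary search over a block index whose entries record the last key of each block. A sketch, with field names and the 64 KB offsets chosen for illustration:</p>

```go
package main

import (
	"fmt"
	"sort"
)

// blockIndexEntry records the last key of each data block and where the
// block starts in the file — enough to route a point lookup to one block.
type blockIndexEntry struct {
	lastKey string
	offset  int64
}

// findBlock binary-searches the in-memory index for the single block that
// could contain key: the first block whose lastKey >= key.
func findBlock(index []blockIndexEntry, key string) (int64, bool) {
	i := sort.Search(len(index), func(i int) bool { return index[i].lastKey >= key })
	if i == len(index) {
		return 0, false // key is past the end of the SSTable
	}
	return index[i].offset, true
}

func main() {
	// An SSTable of three 64 KB blocks; keys are sorted across blocks.
	index := []blockIndexEntry{
		{lastKey: "fox", offset: 0},
		{lastKey: "lion", offset: 65536},
		{lastKey: "zebra", offset: 131072},
	}
	off, ok := findBlock(index, "goat")
	fmt.Println(off, ok) // "goat" falls in the second block
}
```

<p>Only the matched block is then read and scanned, so a point lookup costs one index search plus one block read.</p>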
<h3 id="api">API</h3>
<p>The paper provides an API that highlights the differences from RDBMS.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#75715e">// Writing to Bigtable
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// Open the table
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>Table <span style="color:#f92672">*</span>T <span style="color:#f92672">=</span> OpenOrDie(<span style="color:#e6db74">"/bigtable/web/webtable"</span>);
</span></span><span style="display:flex;"><span><span style="color:#75715e">// Write a new anchor and delete an old anchor
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>RowMutation <span style="color:#a6e22e">r1</span>(T, <span style="color:#e6db74">"com.cnn.www"</span>);
</span></span><span style="display:flex;"><span>r1.Set(<span style="color:#e6db74">"anchor:www.c-span.org"</span>, <span style="color:#e6db74">"CNN"</span>);
</span></span><span style="display:flex;"><span>r1.Delete(<span style="color:#e6db74">"anchor:www.abc.com"</span>);
</span></span><span style="display:flex;"><span>Operation op;
</span></span><span style="display:flex;"><span>Apply(<span style="color:#f92672">&</span>op, <span style="color:#f92672">&</span>r1);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">// Reading from Bigtable
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>Scanner <span style="color:#a6e22e">scanner</span>(T);
</span></span><span style="display:flex;"><span>ScanStream <span style="color:#f92672">*</span>stream;
</span></span><span style="display:flex;"><span>stream <span style="color:#f92672">=</span> scanner.FetchColumnFamily(<span style="color:#e6db74">"anchor"</span>);
</span></span><span style="display:flex;"><span>stream<span style="color:#f92672">-></span>SetReturnAllVersions();
</span></span><span style="display:flex;"><span>scanner.Lookup(<span style="color:#e6db74">"com.cnn.www"</span>);
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> (; <span style="color:#f92672">!</span>stream<span style="color:#f92672">-></span>Done(); stream<span style="color:#f92672">-></span>Next()) {
</span></span><span style="display:flex;"><span> printf(<span style="color:#e6db74">"%s %s %lld %s</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">"</span>,
</span></span><span style="display:flex;"><span> scanner.RowName(),
</span></span><span style="display:flex;"><span> stream<span style="color:#f92672">-></span>ColumnName(),
</span></span><span style="display:flex;"><span> stream<span style="color:#f92672">-></span>MicroTimestamp(),
</span></span><span style="display:flex;"><span> stream<span style="color:#f92672">-></span>Value());
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="architecture-design">Architecture Design</h2>
<h3 id="external-components">External Components</h3>
<p>Bigtable is built on top of other components in Google’s ecosystem, which significantly simplifies Bigtable’s design.</p>
<h4 id="gfs">GFS</h4>
<p>GFS is Bigtable’s underlying storage, providing replication and fault tolerance.</p>
<blockquote>
<p>Refer to the previous notes for details.</p>
</blockquote>
<h4 id="chubby">Chubby</h4>
<p>Chubby is a highly available distributed lock service that provides a namespace, where directories and files can serve as distributed locks.</p>
<blockquote>
<p>High availability means maintaining multiple service replicas, with consistency ensured via Paxos. A lease mechanism prevents defunct Chubby clients from holding onto locks indefinitely.</p>
</blockquote>
<p>Why Chubby? What is its role?</p>
<ul>
<li>Stores Column Family information</li>
<li>Stores ACL (Access Control List)</li>
<li>Stores root metadata for the Root Tablet location, which is essential for Bigtable startup.
<blockquote>
<p>Bigtable uses a three-layer B+ tree-like structure for metadata. The Root Tablet location is in Chubby, which helps locate other metadata tablets, which in turn store user Tablet locations.</p>
</blockquote>
</li>
<li>Tablet Server lifecycle monitoring
<blockquote>
<p>Each Tablet Server creates a unique file in a designated directory in Chubby and acquires an exclusive lock on it. The server is considered offline if it loses the lock.</p>
</blockquote>
</li>
</ul>
<p>In summary, Chubby’s functionality can be categorized into two parts. One is to store critical metadata as a highly available node, while the other is to manage the lifecycle of storage nodes (Tablet Servers) using distributed locking.</p>
<p>In GFS, these responsibilities are handled by the Master. By offloading them to Chubby, Bigtable simplifies the Master design and reduces its load.</p>
<blockquote>
<p>Conceptually, Chubby can be seen as part of the Master node.</p>
</blockquote>
<h3 id="internal-components">Internal Components</h3>
<h4 id="master">Master</h4>
<p>Bigtable follows a Master-Slave architecture, similar to GFS and MapReduce. However, unlike GFS, Bigtable relies on Chubby and Tablet Servers to store metadata, with the Master only responsible for orchestrating the process and not storing tablet locations.</p>
<blockquote>
<p>Responsibilities include Tablet allocation, garbage collection, monitoring Tablet Server health, load balancing, and metadata updates.
The Master requires:</p>
<ol>
<li>All Tablet information to determine allocation and distribution.</li>
<li>Tablet Server status information to decide on allocations.</li>
</ol>
</blockquote>
<h4 id="tablet-server">Tablet Server</h4>
<p>Tablet Servers manage tablets, handling reads and writes, splitting and merging tablets when necessary.</p>
<blockquote>
<p>Metadata is not stored by the Master. Clients interact directly with Chubby and Tablet Servers for reading data.
Tablets are split by Tablet Servers, and Master may not be notified instantly. WAL+retry mechanisms should be employed to ensure operations aren’t lost.</p>
</blockquote>
<h4 id="client-sdk">Client SDK</h4>
<p>The client SDK is the entry point for businesses to access Bigtable. To minimize metadata lookup overhead, caching and prefetching are used to reduce the frequency of network interactions, making use of temporal and spatial locality.</p>
<blockquote>
<p>Caching may introduce inconsistency issues, which require appropriate solutions, such as retries during inconsistent states.</p>
</blockquote>
<h2 id="storage-design">Storage Design</h2>
<h3 id="mapping-and-addressing">Mapping and Addressing</h3>
<p>Bigtable data is uniquely determined by a <code>(Table, Row, Column)</code> tuple, stored in tablets, which in turn are stored in SSTable format on GFS.</p>
<p>Tablets are logical representations of Bigtable’s on-disk entity, managed by Tablet Servers.</p>
<p>Bigtable uses <code>Root Tablet + METADATA Table</code> for addressing. The Root Tablet location is stored in Chubby, while the METADATA Table is maintained by Tablet Servers.</p>
<p>The Root Tablet stores the location of METADATA Tablets, and each METADATA Tablet contains the location of user tablets.</p>
<blockquote>
<p>METADATA Table Row: <code>(TableID, encoding of last row in Tablet) => Tablet Location</code></p>
</blockquote>
<blockquote>
<p>The system uses a B+ tree-like three-layer structure to maintain tablet location information.</p>
</blockquote>
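<p>Each level of that hierarchy answers the same question: given a row key, which tablet’s range covers it? Because the METADATA row key encodes the <em>last</em> row of each tablet, the answer is the first entry whose end row is not smaller than the target. A sketch of that range routing, with invented names and a <code>"~"</code> sentinel for the final tablet:</p>

```go
package main

import (
	"fmt"
	"sort"
)

// metaRow mirrors a METADATA table row: the key encodes (table id, last
// row of the tablet), the value is the tablet's location.
type metaRow struct {
	endRow   string // last row key the tablet covers
	location string // which tablet server serves it
}

// locateTablet finds the tablet serving row: the first METADATA entry
// whose endRow >= row. The same "last key routes the range" trick is used
// at every level of Bigtable's three-level hierarchy.
func locateTablet(meta []metaRow, row string) (string, bool) {
	i := sort.Search(len(meta), func(i int) bool { return meta[i].endRow >= row })
	if i == len(meta) {
		return "", false
	}
	return meta[i].location, true
}

func main() {
	meta := []metaRow{ // sorted by endRow, as METADATA rows are
		{endRow: "g", location: "ts-1"},
		{endRow: "p", location: "ts-2"},
		{endRow: "~", location: "ts-3"}, // last tablet covers the tail
	}
	loc, _ := locateTablet(meta, "com.cnn.www")
	fmt.Println("row served by:", loc)
}
```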
<h3 id="scheduling-and-monitoring">Scheduling and Monitoring</h3>
<h4 id="scheduling">Scheduling</h4>
<p>Scheduling involves Tablet allocation and load balancing.</p>
<p>A Tablet can only be assigned to one Tablet Server at any given time. The Master maintains Tablet Server states and sends allocation requests as needed.</p>
<blockquote>
<p>The Master does not maintain addressing information but holds Tablet Server states (including tablet count, status, and available resources) for scheduling.</p>
</blockquote>
<h4 id="monitoring">Monitoring</h4>
<p>Monitoring is carried out by Chubby and the Master.</p>
<p>Each Tablet Server creates a unique file in a Chubby directory and acquires an exclusive lock. When the Tablet Server disconnects and loses its lease, the lock is released.</p>
<blockquote>
<p>The unique file determines whether a Tablet Server is active, and the Master may delete the file as needed.
In cases of network disconnection, the Tablet Server will try to re-acquire the exclusive lock if the file still exists.
If the file doesn’t exist, the disconnected Tablet Server should automatically leave the cluster.</p>
</blockquote>
<p>The Master ensures its uniqueness by acquiring an exclusive lock on a unique file in Chubby, and monitors a specific directory for Tablet Server files.</p>
<p>Once it detects a failure, it deletes the Tablet Server’s Chubby file and reallocates its tablets to other Tablet Servers.</p>
<h2 id="compaction">Compaction</h2>
<p>Bigtable provides read and write services and uses an LSM-like structure to optimize write performance. For each write operation, the ACL information is first retrieved from Chubby to verify permissions. The write is then logged in WAL and stored in Memtable before eventually being persisted in SSTable.</p>
<p>When Memtable grows to a certain size, it triggers a <strong>Minor Compaction</strong> to convert Memtable to SSTable and write it to GFS.</p>
<blockquote>
<p>Memtable is first converted into an immutable Memtable before becoming SSTable. This intermediate step ensures that Minor Compaction does not interfere with incoming writes.</p>
</blockquote>
<p>Bigtable uses <strong>Compaction</strong> to accelerate writes, converting random writes into sequential writes and writing data in the background. Compaction occurs in three types:</p>
<ul>
<li><strong>Minor Compaction</strong>: Converts a frozen Memtable into an SSTable and writes it to GFS.</li>
<li><strong>Merging Compaction</strong>: Merges the Memtable and a few SSTables into a single new SSTable.</li>
<li><strong>Major Compaction</strong>: Rewrites all SSTables into exactly one, discarding deleted entries and superseded versions.</li>
</ul>
<p>For reads, results must be aggregated across the Memtable and multiple SSTables, since a key’s data may be spread over these structures. <strong>Two-level caching</strong> and <strong>Bloom filters</strong> are used to speed up reads.</p>
<p>Tablet Servers have two levels of caching:</p>
<ol>
<li><strong>Scan Cache</strong>: Caches frequently read key-value pairs.</li>
<li><strong>Block Cache</strong>: Caches SSTable blocks.</li>
</ol>
<p>Bloom filters are also employed to reduce the number of SSTable lookups by indicating whether a key is not present.</p>
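<p>A minimal Bloom filter sketch shows why this works: a negative answer is definite, so the server can skip an SSTable without touching disk, while a positive answer may be a false positive and still requires the real lookup. The FNV-based double hashing and the parameters here are illustrative, not what Bigtable actually uses.</p>

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a tiny Bloom filter: k hash probes into a bitset.
type bloom struct {
	bits []bool
	k    uint32
}

func newBloom(m int, k uint32) *bloom {
	return &bloom{bits: make([]bool, m), k: k}
}

// positions derives k probe positions from one 64-bit FNV hash via
// double hashing: pos_i = h1 + i*h2 (mod m).
func (b *bloom) positions(key string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	pos := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		pos[i] = (h1 + i*h2) % uint32(len(b.bits))
	}
	return pos
}

func (b *bloom) Add(key string) {
	for _, p := range b.positions(key) {
		b.bits[p] = true
	}
}

func (b *bloom) MayContain(key string) bool {
	for _, p := range b.positions(key) {
		if !b.bits[p] {
			return false // definitely absent: no disk read needed
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 4)
	f.Add("row:com.cnn.www")
	fmt.Println(f.MayContain("row:com.cnn.www")) // true: never a false negative
	fmt.Println(f.MayContain("row:org.example")) // almost certainly false
}
```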
<h2 id="optimization">Optimization</h2>
<h3 id="locality">Locality</h3>
<p>High-frequency columns can be grouped together into one SSTable, reducing the time to fetch related data.</p>
<blockquote>
<p>Space is traded for time, leveraging locality principles.</p>
</blockquote>
<h3 id="compression">Compression</h3>
<p>SSTable blocks are compressed to reduce network bandwidth and latency during transfers.</p>
<blockquote>
<p>Compression is performed in blocks to reduce encoding/decoding time and improve parallelism.</p>
</blockquote>
<h3 id="commitlog-design">CommitLog Design</h3>
<p>Tablet Servers maintain one <strong>Commit Log</strong> each, instead of one per tablet, to minimize disk seeks and enable batched writes. During recovery, log entries are sorted by <code>(Table, Row, Log Seq Num)</code> so each tablet’s mutations can be replayed contiguously and in order.</p>
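<p>A sketch of that recovery-time sort over interleaved log entries (field names are illustrative):</p>

```go
package main

import (
	"fmt"
	"sort"
)

// logEntry is one mutation in the shared commit log; because many tablets
// interleave in a single log, recovery first sorts the entries.
type logEntry struct {
	table  string
	row    string
	seqNum int64
}

// sortForRecovery orders entries by (table, row, seq num) so each row's
// mutations replay contiguously and in the order they were applied.
func sortForRecovery(entries []logEntry) {
	sort.Slice(entries, func(i, j int) bool {
		a, b := entries[i], entries[j]
		if a.table != b.table {
			return a.table < b.table
		}
		if a.row != b.row {
			return a.row < b.row
		}
		return a.seqNum < b.seqNum
	})
}

func main() {
	entries := []logEntry{
		{"webtable", "com.cnn.www", 7},
		{"users", "alice", 3},
		{"webtable", "com.cnn.www", 2},
	}
	sortForRecovery(entries)
	for _, e := range entries {
		fmt.Println(e.table, e.row, e.seqNum)
	}
}
```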
<h2 id="summary">Summary</h2>
<ul>
<li>Keep it simple: Simple is better than complex.</li>
<li>Cluster monitoring is crucial for distributed services. Google’s three papers emphasize cluster monitoring and scheduling.</li>
<li>Do not make assumptions about other systems in your design. Issues may range from common network issues to unexpected operational problems.</li>
<li>Leverage background operations to accelerate user-facing actions, such as making writes fast and using background processes for cleanups.</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li><a href="https://zh.wikipedia.org/wiki/Bigtable">Wikipedia - Bigtable</a></li>
<li><a href="https://static.googleusercontent.com/media/research.google.com/zh-CN//archive/bigtable-osdi06.pdf">Bigtable Paper</a></li>
<li><a href="https://www.cnblogs.com/xybaby/p/9096748.html">Bigtable Analysis</a></li>
<li><a href="https://zhuanlan.zhihu.com/p/181498475">LSM Tree Explained</a></li>
</ul>
MIT6.824 GFS
https://noneback.github.io/blog/mit6.824-gfs/
Thu, 09 Sep 2021 00:44:24 +0800https://noneback.github.io/blog/mit6.824-gfs/<p>This article introduces the Google File System (GFS) paper published in 2003, which proposed a distributed file system designed to store large volumes of data reliably, meeting Google’s data storage needs. This write-up reflects on the design goals, trade-offs, and architectural choices of GFS.</p>
<h2 id="introduction">Introduction</h2>
<p>GFS is a distributed file system developed by Google to meet the needs of data-intensive applications, using commodity hardware to provide a scalable and fault-tolerant solution.</p>
<h3 id="background">Background</h3>
<ol>
<li><strong>Component Failures as the Norm</strong>: In GFS, component failures are treated as normal events rather than exceptions.</li>
</ol>
<blockquote>
<p>GFS uses inexpensive hardware to build a reliable service. Each machine has a certain probability of failure, resulting in a binomial distribution of overall system failures. The key challenge is to ensure the system remains available through redundancy and rapid failover.</p>
</blockquote>
<ol start="2">
<li><strong>Massive Files</strong>: Files in GFS can be extremely large, ranging from several hundred megabytes to tens of gigabytes.</li>
</ol>
<blockquote>
<p>GFS favors large files rather than many small files. Managing a large number of small files in a distributed system can lead to increased metadata overhead, inefficient caching, and greater inode usage.</p>
</blockquote>
<ol start="3">
<li><strong>Sequential Access</strong>: Most file modifications append data to the end of files rather than random modifications, and reads are generally sequential.</li>
</ol>
<blockquote>
<p>GFS is optimized for sequential writes, especially for appending data. Random writes are not well-supported and do not guarantee consistency.</p>
</blockquote>
<ol start="4">
<li><strong>Collaborative Design</strong>: The API and file system are designed collaboratively to improve efficiency and flexibility.</li>
</ol>
<blockquote>
<p>GFS provides an API similar to POSIX but includes additional optimizations to better match Google’s workload.</p>
</blockquote>
<h2 id="design-goals">Design Goals</h2>
<h3 id="storage-capacity">Storage Capacity</h3>
<p>GFS is designed to manage millions of files, most of which are at least 100 MB in size. Files of several gigabytes are common, but GFS also supports smaller files without specific optimization.</p>
<h3 id="workload">Workload</h3>
<h4 id="read-workload">Read Workload</h4>
<ol>
<li><strong>Large-Scale Sequential Reads</strong>: Large-scale sequential data retrieval using disk I/O.</li>
<li><strong>Small-Scale Random Reads</strong>: Small-scale random data retrieval, optimized through techniques such as request batching.</li>
</ol>
<h4 id="write-workload">Write Workload</h4>
<p>Primarily large-scale sequential writes, typically appending data to the end of files. GFS supports <strong>concurrent data appends</strong> from multiple clients, with atomic guarantees and synchronization.</p>
<h3 id="bandwidth-vs-latency">Bandwidth vs. Latency</h3>
<p>High <strong>sustained bandwidth</strong> is prioritized over low latency, given the typical workloads of GFS.</p>
<h3 id="fault-tolerance">Fault Tolerance</h3>
<p>GFS continuously monitors its state to detect and recover from component failures, which are treated as common occurrences.</p>
<h3 id="operations-and-interfaces">Operations and Interfaces</h3>
<p>GFS provides traditional file system operations such as file creation, deletion, and reading, along with features like <strong>snapshots</strong> and <strong>atomic record append</strong>.</p>
<blockquote>
<p>Snapshots create file or directory copies, while atomic record append guarantees that data is appended atomically.</p>
</blockquote>
<h2 id="architecture">Architecture</h2>
<p>The architecture of GFS follows a Master-Slave design, consisting of a single Master node and multiple Chunk Servers.</p>
<blockquote>
<p>The Master and Chunk Servers are logical concepts and do not necessarily refer to specific physical machines.</p>
</blockquote>
<p><img alt="GFS Architecture" src="https://tva1.sinaimg.cn/large/008i3skNly1gu6y6qm5t0j61i40nojuk02.jpg"></p>
<p>GFS provides a client library (SDK) that allows clients to access the system, abstracting the underlying complexity. File data is divided into chunks and stored across multiple Chunk Servers, with replication for reliability. The Master manages metadata such as namespace, chunk locations, and more.</p>
<h3 id="component-overview">Component Overview</h3>
<h4 id="client">Client</h4>
<p>Clients in GFS are application processes that use the GFS SDK for seamless integration. Key functionalities of the client include:</p>
<ul>
<li><strong>Caching</strong>: Cache metadata obtained from the Master to reduce communication overhead.</li>
<li><strong>Encapsulation</strong>: Encapsulate retries, request splitting, and checksum validation.</li>
<li><strong>Optimization</strong>: Perform request batching, load balancing, and caching to enhance efficiency.</li>
<li><strong>Mapping</strong>: Map file operations to chunk-based ones, such as converting <code>(filename, offset)</code> into <code>(chunk index, offset)</code>.</li>
</ul>
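<p>Assuming GFS’s fixed 64 MB chunk size, the client-side mapping from <code>(filename, offset)</code> to <code>(chunk index, offset)</code> is simple integer arithmetic; a sketch:</p>

```go
package main

import "fmt"

// chunkSize is GFS's fixed 64 MB chunk size.
const chunkSize = 64 << 20

// toChunkCoord converts a byte offset within a file into the
// (chunk index, offset-in-chunk) pair the client sends to the Master and
// then to a Chunk Server.
func toChunkCoord(fileOffset int64) (chunkIndex int64, chunkOffset int64) {
	return fileOffset / chunkSize, fileOffset % chunkSize
}

func main() {
	idx, off := toChunkCoord(200 << 20) // byte 200 MB into the file
	fmt.Println(idx, off)               // chunk 3, 8 MB into that chunk
}
```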
<h4 id="master">Master</h4>
<p>The Master maintains all metadata, including the namespace, file-to-chunk mappings, and chunk versioning. Key functionalities include:</p>
<ul>
<li><strong>Monitoring</strong>: Track Chunk Server status and data locations using heartbeats.</li>
<li><strong>Directory Tree Management</strong>: Manage the hierarchical file system structure with efficient locking mechanisms.</li>
<li><strong>Mapping Management</strong>: Maintain mappings between files and chunks for fast lookups.</li>
<li><strong>Fault Tolerance</strong>: Utilize checkpointing and a replicated operation log (with shadow masters) to recover from Master failures.</li>
<li><strong>System Scheduling</strong>: Manage chunk replication, garbage collection, lease distribution, and primary Chunk Server selection.</li>
</ul>
<blockquote>
<p>Metadata is stored in memory for performance reasons, resulting in a simplified design, but making checkpointing and logging crucial to ensure recovery.</p>
</blockquote>
<h4 id="chunk-server">Chunk Server</h4>
<p>Chunk Servers are responsible for storing data, with each file chunk being saved as a Linux file. Chunk Servers also perform data integrity checks and report health information to the Master regularly.</p>
<h2 id="key-concepts-and-mechanisms">Key Concepts and Mechanisms</h2>
<h3 id="chunk-size">Chunk Size</h3>
<p>Chunks are the logical units for storing data in GFS, with each chunk typically sized at 64 MB. The chunk size balances metadata overhead, caching efficiency, data locality, and fault tolerance.</p>
<blockquote>
<p>Small chunks increase metadata load on the Master, whereas larger chunks can create data hot spots and fragmentation.</p>
</blockquote>
<h3 id="lease-mechanism">Lease Mechanism</h3>
<p>GFS uses a <strong>lease mechanism</strong> to ensure consistency between chunk replicas. When concurrent write requests occur, the Master selects a Chunk Server to be the <strong>primary</strong>. The primary node assigns an order to client operations, ensuring concurrent operations are executed consistently.</p>
<blockquote>
<p>This mechanism reduces the coordination load on the Master and allows data to be appended atomically.</p>
</blockquote>
<h3 id="chunk-versioning">Chunk Versioning</h3>
<p>The versioning system is used to ensure that only the latest chunk version is valid. The Master increments the version whenever a lease is granted, and a new version number is committed after acknowledgment from the primary.</p>
<blockquote>
<p>Versioning helps determine the freshness of data during recoveries.</p>
</blockquote>
<h3 id="control-flow-vs-data-flow">Control Flow vs. Data Flow</h3>
<p>GFS separates <strong>control flow</strong> and <strong>data flow</strong> to optimize data transfers. Control commands are issued separately from data transfers, enabling efficient utilization of network topology.</p>
<blockquote>
<p>Data is sent using a <strong>pipeline</strong> approach between Chunk Servers, which minimizes network overhead and uses cache effectively.</p>
</blockquote>
<h3 id="data-integrity">Data Integrity</h3>
<p>Chunks are split into 64 KB blocks, each with a corresponding checksum for data integrity. These checksums are used to verify data during read operations.</p>
<blockquote>
<p>Checksums are stored separately from the data, providing an additional layer of reliability.</p>
</blockquote>
<h3 id="fault-tolerance-and-replication">Fault Tolerance and Replication</h3>
<p>Chunks are stored in multiple replicas across different Chunk Servers for reliability. The Master detects Chunk Server failures via heartbeats and manages replication to meet desired redundancy levels.</p>
<blockquote>
<p>Data integrity failures or Chunk Server disconnections trigger replication to maintain availability.</p>
</blockquote>
<h3 id="consistency">Consistency</h3>
<p>GFS has a relaxed consistency model: file regions may be consistent, defined, or inconsistent depending on the outcome of an operation, and strong consistency is not guaranteed.</p>
<blockquote>
<p>In practice, operations such as <strong>atomic record append</strong> ensure data integrity during appends but may not eliminate duplicate writes. Random writes are not consistently managed.</p>
</blockquote>
<h2 id="summary">Summary</h2>
<p>GFS demonstrates how practical design trade-offs, driven by specific business needs, can lead to an efficient and scalable distributed file system. It focuses on resilience, fault tolerance, and high throughput, making it ideal for Google’s data processing needs.</p>
<p>In distributed systems, scalability is often more important than single-node performance. GFS embraces this principle through large file management, redundancy, and workload distribution.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://spongecaptain.cool/post/paper/googlefilesystem/">Google File System - GFS Paper Reading</a></li>
<li><a href="https://tanxinyu.work/gfs-thesis/">GFS Paper Summary</a></li>
<li><a href="https://nxwz51a5wp.feishu.cn/docs/doccnNYeo3oXj6cWohseo6yB4id">GFS Paper Overview</a></li>
<li><a href="https://static.googleusercontent.com/media/research.google.com/zh-CN//archive/gfs-sosp2003.pdf">GFS Original Paper</a></li>
<li><a href="https://pdos.csail.mit.edu/6.824/schedule.html">MIT6.824 Course</a></li>
</ul>
Epoll and IO Multiplexing
https://noneback.github.io/blog/epoll-and-io%E5%A4%8D%E7%94%A8/
Sun, 15 Aug 2021 21:47:45 +0800https://noneback.github.io/blog/epoll-and-io%E5%A4%8D%E7%94%A8/<p>Let’s start with epoll.</p>
<p>epoll is an I/O event notification mechanism in the Linux kernel, designed to replace select and poll. It handles large numbers of file descriptors efficiently, scaling up to the system-wide limit on open files with excellent performance.</p>
<h2 id="usage">Usage</h2>
<h3 id="api">API</h3>
<p>epoll has three primary system calls:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#75715e">/** epoll_create
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * Creates an epoll instance and returns a file descriptor for it.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * Needs to be closed afterward, as epfd also consumes the system's fd resources.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * size: Historically a hint for the expected number of fds; ignored since Linux 2.6.8 (but must be > 0).
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> */</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">int</span> <span style="color:#a6e22e">epoll_create</span>(<span style="color:#66d9ef">int</span> size);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">/** epoll_ctl
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * Adds or modifies a file descriptor to be monitored by epoll.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * epfd: The epoll file descriptor.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * op: Operation type.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLL_CTL_ADD: Add a new fd to epfd.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLL_CTL_MOD: Modify an already registered fd.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLL_CTL_DEL: Remove an fd from epfd.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * fd: The file descriptor to be monitored.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * event: Specifies the type of event to be monitored.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLIN: Indicates the fd is ready for reading (including when the peer socket is closed).
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLOUT: Indicates the fd is ready for writing.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLPRI: Indicates urgent data can be read.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLERR: Indicates an error occurred on the fd.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLHUP: Indicates the fd has been hung up.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLET: Sets epoll to Edge Triggered (ET) mode, as opposed to Level Triggered (LT).
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLONESHOT: Only listen for the event once. If continued monitoring is required, the socket must be re-added to epfd.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> */</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">int</span> <span style="color:#a6e22e">epoll_ctl</span>(<span style="color:#66d9ef">int</span> epfd, <span style="color:#66d9ef">int</span> op, <span style="color:#66d9ef">int</span> fd, <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epoll_event</span> <span style="color:#f92672">*</span>event);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">/** epoll_wait
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * Collects events that have been triggered and returns the number of triggered events.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * epfd: The epoll file descriptor.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * events: Array of epoll events that will be populated with triggered events.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * maxevents: Indicates the size of the events array.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * timeout: Timeout duration; 0 returns immediately, -1 blocks indefinitely, >0 waits for the specified duration.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> */</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">int</span> <span style="color:#a6e22e">epoll_wait</span>(<span style="color:#66d9ef">int</span> epfd, <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epoll_event</span> <span style="color:#f92672">*</span>events, <span style="color:#66d9ef">int</span> maxevents, <span style="color:#66d9ef">int</span> timeout);
</span></span></code></pre></div><h3 id="processing-flow">Processing Flow</h3>
<h4 id="epoll_create">epoll_create</h4>
<p>When a process calls <code>epoll_create</code>, the Linux kernel creates an <code>eventpoll</code> structure:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">eventpoll</span> {
</span></span><span style="display:flex;"><span> spinlock_t lock;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">mutex</span> mtx;
</span></span><span style="display:flex;"><span> wait_queue_head_t wq;
</span></span><span style="display:flex;"><span> wait_queue_head_t poll_wait;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">list_head</span> rdllist;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">rb_root</span> rbr;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epitem</span> <span style="color:#f92672">*</span>ovflist;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">user_struct</span> <span style="color:#f92672">*</span>user;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">file</span> <span style="color:#f92672">*</span>file;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> visited;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">list_head</span> visited_list_link;
</span></span><span style="display:flex;"><span>};
</span></span></code></pre></div><p>At this point the kernel allocates an <code>eventpoll</code> object and initializes an empty red-black tree (<code>rbr</code>) that will index the monitored file descriptors; the descriptors themselves are inserted into the tree later, by <code>epoll_ctl</code>.</p>
<p>The kernel also initializes a doubly linked ready list (<code>rdllist</code>) that collects the events that have become ready.</p>
<h4 id="epoll_ctl">epoll_ctl</h4>
<p>For each monitored event, an <code>epitem</code> structure is created:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epitem</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">rb_node</span> rbn;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">list_head</span> rdllink;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epitem</span> <span style="color:#f92672">*</span>next;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epoll_filefd</span> ffd;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> nwait;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">list_head</span> pwqlist;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">eventpoll</span> <span style="color:#f92672">*</span>ep;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">list_head</span> fllink;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epoll_event</span> event;
</span></span><span style="display:flex;"><span>};
</span></span></code></pre></div><p><img alt="epitem structure" src="https://i.loli.net/2021/08/15/ZH6Pixq4X5BLc2z.png"></p>
<p>When <code>epoll_ctl</code> is called with <code>EPOLL_CTL_ADD</code>, the socket fd is inserted into the <code>eventpoll</code>’s red-black tree and a callback is registered on the fd’s wait queue. When the device signals readiness (typically from an interrupt handler), the callback moves the corresponding <code>epitem</code> onto the ready list.</p>
<h4 id="epoll_wait">epoll_wait</h4>
<p>When <code>epoll_wait</code> is called, it simply checks if there is data in the list of ready events (<code>epitem</code>). If there is data, it returns immediately; otherwise, it sleeps until either data arrives or the timeout expires.</p>
<h2 id="epoll-usage-model">Epoll Usage Model</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">for</span> (;;) {
</span></span><span style="display:flex;"><span> nfds <span style="color:#f92672">=</span> epoll_wait(epfd, events, <span style="color:#ae81ff">20</span>, <span style="color:#ae81ff">500</span>);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> (i <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>; i <span style="color:#f92672"><</span> nfds; <span style="color:#f92672">++</span>i) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (events[i].data.fd <span style="color:#f92672">==</span> listenfd) {
</span></span><span style="display:flex;"><span> connfd <span style="color:#f92672">=</span> accept(listenfd, (sockaddr <span style="color:#f92672">*</span>)<span style="color:#f92672">&</span>clientaddr, <span style="color:#f92672">&</span>clilen);
</span></span><span style="display:flex;"><span> ev.data.fd <span style="color:#f92672">=</span> connfd;
</span></span><span style="display:flex;"><span> ev.events <span style="color:#f92672">=</span> EPOLLIN <span style="color:#f92672">|</span> EPOLLET;
</span></span><span style="display:flex;"><span> epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, <span style="color:#f92672">&</span>ev);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">else</span> <span style="color:#a6e22e">if</span> (events[i].events <span style="color:#f92672">&</span> EPOLLIN) {
</span></span><span style="display:flex;"><span>        sockfd <span style="color:#f92672">=</span> events[i].data.fd;
</span></span><span style="display:flex;"><span>        n <span style="color:#f92672">=</span> read(sockfd, line, MAXLINE);
</span></span><span style="display:flex;"><span> ev.data.ptr <span style="color:#f92672">=</span> md;
</span></span><span style="display:flex;"><span> ev.events <span style="color:#f92672">=</span> EPOLLOUT <span style="color:#f92672">|</span> EPOLLET;
</span></span><span style="display:flex;"><span> epoll_ctl(epfd, EPOLL_CTL_MOD, sockfd, <span style="color:#f92672">&</span>ev);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">else</span> <span style="color:#a6e22e">if</span> (events[i].events <span style="color:#f92672">&</span> EPOLLOUT) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">myepoll_data</span> <span style="color:#f92672">*</span>md <span style="color:#f92672">=</span> (myepoll_data <span style="color:#f92672">*</span>)events[i].data.ptr;
</span></span><span style="display:flex;"><span> sockfd <span style="color:#f92672">=</span> md<span style="color:#f92672">-></span>fd;
</span></span><span style="display:flex;"><span> send(sockfd, md<span style="color:#f92672">-></span>ptr, strlen((<span style="color:#66d9ef">char</span> <span style="color:#f92672">*</span>)md<span style="color:#f92672">-></span>ptr), <span style="color:#ae81ff">0</span>);
</span></span><span style="display:flex;"><span> ev.data.fd <span style="color:#f92672">=</span> sockfd;
</span></span><span style="display:flex;"><span> ev.events <span style="color:#f92672">=</span> EPOLLIN <span style="color:#f92672">|</span> EPOLLET;
</span></span><span style="display:flex;"><span> epoll_ctl(epfd, EPOLL_CTL_MOD, sockfd, <span style="color:#f92672">&</span>ev);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Other processing
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="blocking-io-non-blocking-io-and-io-multiplexing">Blocking IO, Non-blocking IO, and IO Multiplexing</h2>
<h3 id="blocking-io">Blocking IO</h3>
<p><strong>Blocking IO</strong> means that a thread waits for data to arrive, releasing the CPU until the data is available. When data arrives, the thread is rescheduled to run.</p>
<p>In scenarios with many read/write requests, frequent context switching and thread scheduling can lead to inefficiency.</p>
<h3 id="non-blocking-io">Non-blocking IO</h3>
<p>In <strong>non-blocking IO</strong>, a user thread makes an IO request, and if data is not yet available, it returns immediately. The thread must keep checking until the data is ready, at which point it can proceed.</p>
<p>Non-blocking IO has a significant drawback: the thread must poll repeatedly until the data is ready, which burns CPU cycles doing no useful work.</p>
<h3 id="io-multiplexing">IO Multiplexing</h3>
<p>Blocking IO occupies resources, and excessive context switching can be inefficient. Non-blocking IO can lead to high CPU utilization due to constant polling.</p>
<p><strong>IO multiplexing</strong> manages multiple file descriptors in a single thread, reducing context switching and idle CPU usage. Mechanisms like select, poll, and epoll were developed to implement this concept, with epoll being the most scalable and efficient.</p>
<h2 id="references">References</h2>
<p><a href="https://www.cnblogs.com/lojunren/p/3856290.html">Linux IO Multiplexing and epoll Explained</a></p>
<p><a href="https://www.infoq.cn/article/26lpjzsp9echwgnic7lq">Deep Dive into epoll</a></p>
Linux Cgroups Overview
https://noneback.github.io/blog/linux-cgroups%E7%AE%80%E4%BB%8B/
Tue, 08 Jun 2021 22:26:17 +0800https://noneback.github.io/blog/linux-cgroups%E7%AE%80%E4%BB%8B/<p><strong>Linux Cgroups</strong> (Control Groups) provide the ability to limit, control, and monitor the resources used by a group of processes and their future child processes. These resources include CPU, memory, storage, and network. With Cgroups, it’s easy to limit a process’s resource usage and monitor its metrics in real time.</p>
<h2 id="three-components-of-cgroups">Three Components of Cgroups</h2>
<ul>
<li>
<p><strong>cgroup</strong></p>
<p>A mechanism for managing groups of processes. A cgroup contains a group of processes, and various Linux subsystem parameters can be configured on this cgroup, associating a group of processes with a group of system parameters from subsystems.</p>
</li>
<li>
<p><strong>subsystem</strong></p>
<p>A module that controls a set of resources.</p>
<p><img alt="Subsystem" src="https://i.loli.net/2021/06/08/p4e91XZRFAPBqyW.png"></p>
<p>Each subsystem is attached to a hierarchy; it enforces, on the processes of every cgroup in that hierarchy, the limits configured for that cgroup.</p>
</li>
<li>
<p><strong>hierarchy</strong></p>
<p>A hierarchy is a tree structure that links multiple cgroups. With this tree structure, cgroups can inherit attributes from their parent cgroups.</p>
<blockquote>
<p>Example Scenario:
Suppose there is a group of periodic tasks limited by <code>cgroup1</code> in terms of CPU usage. If one of these tasks is a logging process that also needs to be limited by disk I/O, a new <code>cgroup2</code> can be created that inherits from <code>cgroup1</code>. <code>cgroup2</code> will inherit the CPU limit from <code>cgroup1</code> and add its own disk I/O limitation, without affecting other processes in <code>cgroup1</code>.</p>
</blockquote>
</li>
</ul>
<h2 id="relationships-between-the-three">Relationships Between the Three</h2>
<ul>
<li>When a new hierarchy is created, all processes in the system <strong>join</strong> the <strong>root cgroup</strong> of that hierarchy by default. This root cgroup is created automatically with the hierarchy.</li>
<li>A subsystem can only be attached to one hierarchy.</li>
<li>A hierarchy can have multiple subsystems attached.</li>
<li>A process can belong to multiple cgroups in different hierarchies.</li>
<li>A child process is in the same cgroup as its parent process but can be moved to a different cgroup later.</li>
</ul>
<h2 id="kernel-interface">Kernel Interface</h2>
<p>Hierarchies in cgroups are organized in a <strong>tree</strong> structure. The kernel provides a <strong>virtual tree-like file system</strong> to configure cgroups, making it intuitive to work with them through a hierarchical directory structure.</p>
<ul>
<li>Create a hierarchy and add sub-cgroups:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>mkdir cgroup-test <span style="color:#75715e"># Create mount point</span>
</span></span><span style="display:flex;"><span>sudo mount -t cgroup -o none,name<span style="color:#f92672">=</span>cgroup-test cgroup-test ./cgroup-test <span style="color:#75715e"># Mount hierarchy</span>
</span></span><span style="display:flex;"><span>cd cgroup-test
</span></span><span style="display:flex;"><span>sudo mkdir cgroup-1
</span></span><span style="display:flex;"><span>sudo mkdir cgroup-2
</span></span><span style="display:flex;"><span>tree
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>.
</span></span><span style="display:flex;"><span>├── cgroup-1
</span></span><span style="display:flex;"><span>│ ├── cgroup.clone_children
</span></span><span style="display:flex;"><span>│ ├── cgroup.procs
</span></span><span style="display:flex;"><span>│ ├── notify_on_release
</span></span><span style="display:flex;"><span>│ └── tasks
</span></span><span style="display:flex;"><span>├── cgroup-2
</span></span><span style="display:flex;"><span>│ ├── cgroup.clone_children
</span></span><span style="display:flex;"><span>│ ├── cgroup.procs
</span></span><span style="display:flex;"><span>│ ├── notify_on_release
</span></span><span style="display:flex;"><span>│ └── tasks
</span></span><span style="display:flex;"><span>├── cgroup.clone_children
</span></span><span style="display:flex;"><span>├── cgroup.procs
</span></span><span style="display:flex;"><span>├── cgroup.sane_behavior
</span></span><span style="display:flex;"><span>├── notify_on_release
</span></span><span style="display:flex;"><span>├── release_agent
</span></span><span style="display:flex;"><span>└── tasks
</span></span></code></pre></div><p><strong>Meaning of Different Files</strong></p>
<p><img alt="File Descriptions" src="https://i.loli.net/2021/06/08/LokHKWqXs5SN4cI.png"></p>
<ul>
<li>Add and move processes to a cgroup (move process PID into the corresponding <code>tasks</code> file):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>sudo sh -c <span style="color:#e6db74">"echo </span>$$<span style="color:#e6db74"> >> ./cgroup-1/tasks"</span> <span style="color:#75715e"># Move terminal process to cgroup-1</span>
</span></span><span style="display:flex;"><span>cat /proc/$$/cgroup
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>>>
</span></span><span style="display:flex;"><span>13:name<span style="color:#f92672">=</span>cgroup-test:/cgroup-1
</span></span><span style="display:flex;"><span>12:memory:/user.slice/user-1002.slice/session-12331.scope
</span></span><span style="display:flex;"><span>11:perf_event:/
</span></span><span style="display:flex;"><span>10:cpuset:/
</span></span><span style="display:flex;"><span>9:freezer:/
</span></span><span style="display:flex;"><span>8:blkio:/user.slice
</span></span><span style="display:flex;"><span>7:rdma:/
</span></span><span style="display:flex;"><span>6:hugetlb:/
</span></span><span style="display:flex;"><span>5:pids:/user.slice/user-1002.slice/session-12331.scope
</span></span><span style="display:flex;"><span>4:cpu,cpuacct:/user.slice
</span></span><span style="display:flex;"><span>3:net_cls,net_prio:/
</span></span><span style="display:flex;"><span>2:devices:/user.slice
</span></span><span style="display:flex;"><span>1:name<span style="color:#f92672">=</span>systemd:/user.slice/user-1002.slice/session-12331.scope
</span></span><span style="display:flex;"><span>0::/user.slice/user-1002.slice/session-12331.scope
</span></span></code></pre></div><ul>
<li>
<p>Limit cgroup resource usage via subsystems:</p>
<p>First, the hierarchy must be attached to a subsystem. The example below uses the <code>memory</code> subsystem hierarchy that most systems mount by default (e.g. under <code>/sys/fs/cgroup/memory</code>).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Start a memory-intensive stress process without any limitations</span>
</span></span><span style="display:flex;"><span>stress --vm-bytes 200m --vm-keep -m <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>cd /sys/fs/cgroup/memory <span style="color:#75715e"># Enter the default memory hierarchy</span>
</span></span><span style="display:flex;"><span>sudo mkdir test-limit-memory <span style="color:#f92672">&amp;&amp;</span> cd test-limit-memory <span style="color:#75715e"># Create a cgroup</span>
</span></span><span style="display:flex;"><span>sudo sh -c <span style="color:#e6db74">"echo 100m > memory.limit_in_bytes"</span> <span style="color:#75715e"># Set max memory usage to 100 MB</span>
</span></span><span style="display:flex;"><span>sudo sh -c <span style="color:#e6db74">"echo </span>$$<span style="color:#e6db74"> > tasks"</span> <span style="color:#75715e"># Move current process to cgroup</span>
</span></span><span style="display:flex;"><span>stress --vm-bytes 200m --vm-keep -m <span style="color:#ae81ff">1</span>
</span></span></code></pre></div></li>
</ul>
<h4 id="observation">Observation</h4>
<p>This time the stress process’s memory usage is capped at the configured 100 MB limit.</p>
Distributed Transactions
https://noneback.github.io/blog/%E5%88%86%E5%B8%83%E5%BC%8F%E4%BA%8B%E5%8A%A1/
Thu, 20 May 2021 23:55:11 +0800https://noneback.github.io/blog/%E5%88%86%E5%B8%83%E5%BC%8F%E4%BA%8B%E5%8A%A1/<h1 id="transactions-and-distributed-transactions">Transactions and Distributed Transactions</h1>
<h2 id="transactions">Transactions</h2>
<p>A <strong>transaction</strong> is a logical unit of work in a database, composed of a finite sequence of database operations. The database must ensure the <strong>atomicity</strong> of transaction operations: when a transaction is successful, it means that all operations in the transaction have been fully executed; if the transaction fails, all executed SQL operations are rolled back.</p>
<p>A single-node database transaction has four main properties:</p>
<ul>
<li><strong>Atomicity</strong>: The transaction is executed as a whole. Either all operations within the transaction are executed, or none are executed.</li>
<li><strong>Consistency</strong>: The transaction must ensure that the database moves from one consistent state to another. Consistent states mean that the data in the database must satisfy all integrity constraints.</li>
<li><strong>Isolation</strong>: When multiple transactions are executed concurrently, the execution of one transaction should not affect the execution of others.</li>
<li><strong>Durability</strong>: Changes made by a committed transaction should be permanently stored in the database.</li>
</ul>
<h2 id="distributed-transactions">Distributed Transactions</h2>
<p>A <strong>distributed transaction</strong> is a transaction where the <strong>participants</strong>, <strong>transaction-supporting servers</strong>, <strong>resource servers</strong>, and <strong>transaction manager</strong> are located on different nodes of a distributed system.</p>
<p>With the adoption of microservice architectures, large business domains often involve multiple services, and a business process requires participation from multiple services. In specific business scenarios, data consistency among multiple services must be ensured.</p>
<p>For example, in a large e-commerce system, placing an order typically deducts inventory, applies discounts, and generates an order ID, with ordering, inventory, discounts, and ID generation provided by separate services. The success of the order interface therefore depends not only on local database operations but also on the results of these other systems. Distributed transactions ensure that all of these operations either succeed together or fail together.</p>
<p>In essence, <strong>distributed transactions are used to ensure data consistency across different databases</strong>.</p>
<h1 id="use-cases">Use Cases</h1>
<p>Typical use cases in e-commerce systems include:</p>
<ul>
<li>
<p><strong>Order Inventory Deduction</strong></p>
<p>When placing an order, operations include generating an order record and reducing product inventory. These are handled by separate microservices, so distributed transactions are required to ensure the atomicity of the order operation.</p>
</li>
<li>
<p><strong>Third-Party Payments</strong></p>
<p>In a microservice architecture, payment and orders are independent services. The order payment status depends on a notification from the financial service, which, in turn, depends on notifications from a third-party payment service.</p>
<p>A classic scenario is illustrated below:</p>
<p><img alt="https://xiaomi-info.github.io/2020/01/02/distributed-transaction/notify-message.png" src="https://xiaomi-info.github.io/2020/01/02/distributed-transaction/notify-message.png"></p>
<p>From the diagram, there are two calls: the third-party payment service calling the payment service, and the payment service calling the order service. Both calls can encounter <strong>timeouts</strong>. Without distributed transactions, the actual payment status and the final payment status visible to the user may become <strong>inconsistent</strong>.</p>
</li>
</ul>
<h1 id="implementation-approaches">Implementation Approaches</h1>
<h2 id="two-phase-commit-2pc">Two-Phase Commit (2PC)</h2>
<p><img alt="https://i.loli.net/2021/05/19/MfWzxseBFKaAnhk.png" src="https://i.loli.net/2021/05/19/MfWzxseBFKaAnhk.png"></p>
<p>A transaction commit is divided into two phases:</p>
<ol>
<li>
<p><strong>Preparation Phase</strong>:</p>
<ul>
<li>The transaction manager (TM) initiates the transaction, logs the start of the transaction, and asks the participating resource managers (RMs) whether they can execute the commit operation, then waits for their responses.</li>
<li>RMs execute local transactions, log redo/undo data, and return results to TM, but do not commit.</li>
</ul>
</li>
<li>
<p><strong>Commit/Rollback Phase</strong>:</p>
<ul>
<li>If all participating RMs execute successfully, the transaction proceeds to the <strong>commit phase</strong>:
<ul>
<li>TM logs the commit, sends a commit instruction to all RMs.</li>
<li>RMs commit the local transaction and respond to TM.</li>
<li>TM logs the end of the transaction.</li>
</ul>
</li>
<li>If any RM fails or times out during preparation or commit:
<ul>
<li>TM logs the rollback, sends rollback instructions to all RMs.</li>
<li>RMs rollback the local transaction and respond to TM.</li>
<li>TM logs the end of the transaction.</li>
</ul>
</li>
</ul>
</li>
</ol>
<h3 id="characteristics">Characteristics</h3>
<ul>
<li><strong>Atomicity</strong>: Supported</li>
<li><strong>Consistency</strong>: Strong consistency</li>
<li><strong>Isolation</strong>: Supported</li>
<li><strong>Durability</strong>: Supported</li>
</ul>
<h3 id="disadvantages">Disadvantages</h3>
<ul>
<li><strong>Synchronous Blocking</strong>: When participants occupy shared resources, others can only wait for resource release, leading to blocking.</li>
<li><strong>Single Point of Failure</strong>: If the transaction manager fails, the entire system becomes unavailable.</li>
<li><strong>Data Inconsistency</strong>: If the transaction manager only sends some commit messages, and a network issue occurs, only some participants receive the commit message, leading to inconsistency.</li>
<li><strong>Uncertainty</strong>: If the transaction manager crashes together with a participant after sending a commit message, the surviving participants cannot tell whether that participant committed, so the transaction outcome remains undecidable until recovery.</li>
</ul>
<h2 id="local-message-table">Local Message Table</h2>
<p>The transaction initiator maintains a <strong>local message table</strong>, and operations on the business table and the message table are within the same local transaction. Asynchronously, a <strong>scheduled task</strong> scans the message table and delivers the message downstream.</p>
<p>The broad concept of the local message table also allows downstream notification through methods other than message delivery, such as RPC calls.</p>
<p><img alt="https://i.loli.net/2021/05/19/tmNeiALsdof24PW.png" src="https://i.loli.net/2021/05/19/tmNeiALsdof24PW.png"></p>
<ol>
<li>The initiator executes a local transaction, operating both the business table and the local message table.</li>
<li>A scheduled task scans pending local messages (in the message table) and sends them to the message queue:
<ul>
<li>If successful, mark the local message as sent.</li>
<li>If failed, retry until successful.</li>
</ul>
</li>
<li>The message queue delivers the message downstream.</li>
<li>The downstream transaction participant receives the message and executes a local transaction:
<ul>
<li>If failed, no ACK is returned, and the message queue retries.</li>
<li>If successful, an ACK is returned, marking the end of the global transaction.</li>
<li>If the message or ACK is lost, the message queue retries.</li>
</ul>
</li>
</ol>
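<p>The scheduled-task step can be sketched with a tiny in-memory model. This is an illustrative sketch only: <code>messageRow</code>, the status strings, and the <code>publish</code> callback are assumed names, not a real library API.</p>

```go
package main

import "fmt"

// messageRow models one row of the local message table ("outbox").
type messageRow struct {
	id     int
	body   string
	status string // "pending" or "sent"
}

// scanAndDeliver is what the scheduled task does on each tick: pick
// pending rows, try to publish each one, and mark it sent only on
// success. Failed rows stay "pending" and are retried next tick,
// which is why the downstream consumer must be idempotent.
func scanAndDeliver(table []messageRow, publish func(string) error) []messageRow {
	for i, m := range table {
		if m.status != "pending" {
			continue
		}
		if err := publish(m.body); err == nil {
			table[i].status = "sent"
		}
	}
	return table
}

func main() {
	table := []messageRow{
		{1, "order-created", "pending"},
		{2, "order-paid", "pending"},
	}
	table = scanAndDeliver(table, func(body string) error {
		fmt.Println("published:", body)
		return nil
	})
	fmt.Println(table[0].status, table[1].status) // sent sent
}
```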
<h3 id="exceptional-scenarios">Exceptional Scenarios</h3>
<ul>
<li><strong>Message Loss</strong>: Handled by repeating the scheduled task.</li>
<li><strong>Delivery Failure</strong>: Handled by retries, downstream must ensure idempotency.</li>
<li><strong>ACK Loss</strong>: Handled by retries, downstream must ensure idempotency.</li>
</ul>
<h3 id="advantages-and-challenges">Advantages and Challenges</h3>
<p><strong>Advantages</strong>:</p>
<ul>
<li>High throughput: downstream transactions run asynchronously, decoupled from the initiator by messaging middleware.</li>
<li>Moderate intrusion into business code: the initiator needs a local message table and a scheduled task.</li>
</ul>
<p><strong>Challenges</strong>:</p>
<ul>
<li>Incomplete transaction semantics: downstream transactions cannot be rolled back, only retried until they succeed.</li>
</ul>
<h3 id="characteristics-1">Characteristics</h3>
<ul>
<li><strong>Atomicity</strong>: Supported</li>
<li><strong>Consistency</strong>: Eventual consistency</li>
<li><strong>Isolation</strong>: Not supported (committed branch transactions are visible to other transactions)</li>
<li><strong>Durability</strong>: Supported</li>
</ul>
<h2 id="best-effort-notification">Best-Effort Notification</h2>
<p>Best-effort notification is a simple form of flexible transaction, suitable for business scenarios that are not time-sensitive about reaching eventual consistency and where the passive party&rsquo;s result does not affect the initiator&rsquo;s result.</p>
<p>This approach roughly works as follows:</p>
<ol>
<li>System A completes its local transaction and sends a message to the MQ.</li>
<li>A service consumes the MQ and calls System B’s interface.</li>
<li>If System B succeeds, everything is fine; if it fails, the notification service periodically retries calling System B up to N times. If it still fails, it gives up.</li>
</ol>
<h3 id="advantages-and-challenges-1">Advantages and Challenges</h3>
<p><strong>Advantages</strong>:</p>
<ul>
<li>Simple implementation.</li>
</ul>
<p><strong>Challenges</strong>:</p>
<ul>
<li>No compensation mechanism and no guarantee of delivery.</li>
<li>Downstream interfaces must be idempotent; consistency and atomicity have to be ensured by additional interface design.</li>
</ul>
<h3 id="characteristics-2">Characteristics</h3>
<ul>
<li><strong>Atomicity</strong>: Not supported (requires additional interfaces)</li>
<li><strong>Consistency</strong>: Not supported (requires additional interfaces)</li>
<li><strong>Isolation</strong>: Not supported (committed branch transactions are visible to other transactions)</li>
<li><strong>Durability</strong>: Supported</li>
</ul>
<h3 id="classic-scenario">Classic Scenario</h3>
<p><strong>Payment Callback</strong>:</p>
<p>The payment service receives a successful payment notification from a third-party service, updates the payment status of the order, and synchronously notifies the order service. If this synchronous notification fails, an asynchronous script will keep retrying the order service interface.</p>
<p><img alt="https://xiaomi-info.github.io/2020/01/02/distributed-transaction/try-best-notify.jpg" src="https://xiaomi-info.github.io/2020/01/02/distributed-transaction/try-best-notify.jpg"></p>
<h2 id="references">References</h2>
<p><a href="https://xiaomi-info.github.io/2020/01/02/distributed-transaction/">Distributed Transactions: All You Need to Know</a></p>
CPU False Sharing
https://noneback.github.io/blog/cpu%E4%BC%AA%E5%85%B1%E4%BA%AB/
Sun, 02 May 2021 13:47:30 +0800https://noneback.github.io/blog/cpu%E4%BC%AA%E5%85%B1%E4%BA%AB/<p>The motivation for this post comes from an interview question I was asked: What is CPU false sharing?</p>
<h2 id="cpu-cache">CPU Cache</h2>
<p>Let’s start by discussing CPU cache.</p>
<p>CPU cache is a type of storage medium introduced to bridge the speed gap between the CPU and main memory. In the pyramid-shaped storage hierarchy, it is located just below CPU registers. Its capacity is much smaller than that of main memory, but its speed can be close to the processor’s frequency.</p>
<p>The effectiveness of caching relies on the principle of temporal and spatial locality.</p>
<p>When the processor issues a memory access request, it first checks if the requested data is in the cache. If it is (a cache hit), it directly returns the data without accessing main memory. If it isn’t (a cache miss), it loads the data from main memory into the cache before returning it to the processor.</p>
<h2 id="cpu-cache-architecture">CPU Cache Architecture</h2>
<p>There are usually three levels of cache between the CPU and main memory. The closer the cache is to the CPU, the faster it is but the smaller its capacity. When accessing data, the CPU first checks <strong>L1</strong>, then <strong>L2</strong>, and finally <strong>L3</strong>. If the data isn’t in any of these caches, it must be fetched from main memory.</p>
<p><img alt="Cache Architecture" src="https://i.loli.net/2021/05/12/CSi7FqmcUZk2LTH.png"></p>
<ul>
<li><strong>L1</strong> is close to the CPU core that uses it. L1 and L2 caches can only be used by a single CPU core.</li>
<li><strong>L3</strong> can be shared by all CPU cores in a socket.</li>
</ul>
<h2 id="cpu-cache-line">CPU Cache Line</h2>
<p>Caches operate on the basis of <strong>cache lines</strong>, which are the smallest unit of data transfer between the cache and main memory, typically 64 bytes. A cache line effectively references a block of memory (64 bytes).</p>
<p>Loading a cache line has the advantage that if the required data is located close to each other, it can be accessed without reloading the cache.</p>
<p>However, it can also lead to a problem known as <strong>CPU false sharing</strong>.</p>
<h2 id="cpu-false-sharing">CPU False Sharing</h2>
<p>Consider this scenario:</p>
<ul>
<li>We have a <code>long</code> variable <code>a</code>, which is not part of an array but is a standalone variable, and there’s another <code>long</code> variable <code>b</code> right next to it. When <code>a</code> is loaded, <code>b</code> is also loaded into the cache line for free.</li>
<li>Now, a thread on one CPU core modifies <code>a</code>, while another thread on a different CPU core reads <code>b</code>.</li>
<li>When <code>a</code> is modified, the cache line holding both <code>a</code> and <code>b</code> is loaded into the modifying core&rsquo;s cache, and after the update the cache coherence protocol invalidates every other core&rsquo;s copy of that line, since those copies no longer hold the latest value of <code>a</code>.</li>
<li>When the other core reads <code>b</code>, it finds that the cache line is invalid and must reload it from main memory.</li>
</ul>
<p>Because the cache operates at the level of cache lines, invalidating <code>a</code>’s cache line also invalidates <code>b</code>, and vice versa.</p>
<p><img alt="False Sharing" src="https://pic3.zhimg.com/80/v2-32672c4b2b7fc48437fc951c27497bee_1440w.jpg"></p>
<p>This causes a problem:</p>
<p><code>b</code> and <code>a</code> are completely unrelated, but each time <code>a</code> is updated, <code>b</code> has to be reloaded from main memory due to a cache miss, slowing down the process.</p>
<p><strong>CPU false sharing</strong>: When multiple threads modify independent variables that share the same cache line, they unintentionally affect each other’s performance. This is known as false sharing.</p>
<h2 id="avoiding-cpu-false-sharing">Avoiding CPU False Sharing</h2>
<ul>
<li>Avoid placing independent, frequently written variables adjacently in memory (e.g., pad them onto separate cache lines).</li>
<li>Align variables during compilation to avoid false sharing. See <a href="https://zh.wikipedia.org/wiki/%E6%95%B0%E6%8D%AE%E7%BB%93%E6%9E%84%E5%AF%B9%E9%BD%90">data structure alignment</a>.</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li><a href="https://zhuanlan.zhihu.com/p/65394173">Discussion: What is CPU False Sharing</a></li>
<li><a href="https://en.wikipedia.org/wiki/CPU_cache">Wikipedia - CPU Cache</a></li>
</ul>
MySQL Index Overview
https://noneback.github.io/blog/mysql%E7%B4%A2%E5%BC%95%E6%B5%85%E6%9E%90/
Sun, 21 Mar 2021 20:41:33 +0800https://noneback.github.io/blog/mysql%E7%B4%A2%E5%BC%95%E6%B5%85%E6%9E%90/<p><strong>Database indexes</strong> are sorted data structures in DBMS that help in quickly querying and updating data in a database. Generally, data structures used for building indexes include B-trees, B+ trees, hash tables, etc.</p>
<p>MySQL uses B+ trees to build indexes. A B+ tree node can hold many keys, and only leaf nodes store data while non-leaf nodes store only keys, so the tree stays shallow and most of the index can be kept in memory. This minimizes disk I/O when traversing the index and greatly improves query efficiency.</p>
<h2 id="indexes-in-innodb">Indexes in InnoDB</h2>
<h3 id="clustered-index-and-non-clustered-index">Clustered Index and Non-Clustered Index</h3>
<p>Indexes can be divided into clustered and non-clustered indexes based on the data stored in the leaf nodes.</p>
<ul>
<li><strong>Clustered Index</strong>: The leaf nodes store the data rows directly, allowing direct access to user data.</li>
<li><strong>Non-Clustered Index</strong>: The leaf nodes store the primary key value, and data must be fetched by traversing back to the primary key index (a process known as <strong>index backtracking</strong>).</li>
</ul>
<p>In the InnoDB engine, the table’s data is organized using the primary key index. Each table must have a primary key, which constructs the B+ tree, resulting in a primary key index. <strong>The primary key index is a clustered index</strong>, and all other <strong>secondary indexes are non-clustered indexes</strong>.</p>
<h3 id="composite-index">Composite Index</h3>
<p>A composite index is an index composed of multiple fields.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#66d9ef">create</span> <span style="color:#66d9ef">index</span> index_name <span style="color:#66d9ef">on</span> <span style="color:#66d9ef">table_name</span> (col_1, col_2...)
</span></span></code></pre></div><p>Compared to a single-field index, the main difference is that it follows the <strong>leftmost prefix matching principle</strong>.</p>
<blockquote>
<p><strong>Leftmost Prefix Matching Principle</strong>: When using a composite index, the index values are sorted according to the fields in the index from left to right.</p>
</blockquote>
<h2 id="using-indexes-to-optimize-query-performance">Using Indexes to Optimize Query Performance</h2>
<p>Since indexes are ordered, they can significantly improve query efficiency. When using indexes for query optimization, some principles must be followed.</p>
<h3 id="leftmost-prefix-matching-principle">Leftmost Prefix Matching Principle</h3>
<p>When using a composite index, the index entries are sorted by the indexed fields from left to right. Queries must follow the leftmost prefix matching rule; otherwise the index cannot be used and MySQL falls back to a full table scan.</p>
<blockquote>
<p>Suppose you create an index on <code>col1, col2, col3</code>. Following the leftmost prefix matching principle, the query conditions should be designed in the order <code>col1 -> col2 -> col3</code>.</p>
<p>Example:</p>
<p><code>select * from table_name where col1 = 1 and col2 = 2;</code> This will use the index.</p>
<p><code>select * from table_name where col2 = 1 and col3 = 2;</code> This will not use the index.</p>
<p>Note: <strong>MySQL will continue matching the columns until it encounters a range query (>, <, between, like), after which it stops matching.</strong></p>
</blockquote>
<h3 id="index-coverage-principle">Index Coverage Principle</h3>
<p>Index coverage refers to querying values directly from the index without needing to traverse back to the table. Well-designed indexes can reduce the number of backtracking operations.</p>
<blockquote>
<p>For a composite index <code>(col1, col2, col3)</code>:</p>
<p>A query like <code>select col1, col2, col3 from test where col1=1 and col2=2</code> can directly retrieve values for <code>col1</code>, <code>col2</code>, and <code>col3</code> without needing to traverse back to the table, as their values are already stored in the secondary index.</p>
</blockquote>
HTTPS Introduction
https://noneback.github.io/blog/https%E6%B5%85%E6%9E%90/
Sun, 21 Feb 2021 16:48:55 +0800https://noneback.github.io/blog/https%E6%B5%85%E6%9E%90/<p>HTTPS (HTTP over SSL) was introduced to address the security vulnerabilities of HTTP, such as eavesdropping and identity spoofing. It uses <a href="https://developer.mozilla.org/en-US/docs/Glossary/SSL">SSL</a> or <a href="https://developer.mozilla.org/en-US/docs/Glossary/TLS">TLS</a> to encrypt communication between the client and the server.</p>
<h2 id="problems-with-http">Problems with HTTP</h2>
<ul>
<li>Communication uses plain text, making it susceptible to eavesdropping.</li>
<li>Unable to verify the identity of the communication party, making it vulnerable to impersonation (and leaving the server unable to filter malicious clients, e.g., in Denial of Service attacks).</li>
<li>Cannot guarantee message integrity, making it possible for messages to be altered (e.g., Man-in-the-Middle attacks).</li>
</ul>
<p>To address these issues, we need:</p>
<ul>
<li><strong>Encryption</strong> to prevent eavesdropping.
<ul>
<li>Encrypting either the <strong>content</strong> or the <strong>communication channel</strong> can help secure the communication.</li>
</ul>
</li>
<li><strong>Authentication</strong> to prevent impersonation attacks.
<ul>
<li>Certificates are commonly used for identity verification.</li>
</ul>
</li>
<li><strong>Integrity checks</strong> to prevent tampering.
<ul>
<li>Hash functions like MD5 and SHA-1 are often used to ensure data integrity.</li>
</ul>
</li>
</ul>
<h2 id="https">HTTPS</h2>
<p>To solve the above problems comprehensively, we add encryption, authentication, and integrity protection to HTTP, resulting in <strong>HTTPS</strong>.</p>
<p>HTTP + Encryption + Authentication + Integrity Protection = HTTPS</p>
<h3 id="https-over-ssl">HTTPS over SSL</h3>
<p>HTTPS is not a new protocol; it simply adds SSL or TLS between HTTP and TCP. By doing so, HTTPS provides encryption, certificates, and integrity protection.</p>
<p><img alt="HTTPS Layers" src="https://i.loli.net/2021/02/21/cdQk9AGJUCF4MLI.png"></p>
<blockquote>
<p>SSL is independent of HTTP, and it can be used with other protocols like SMTP and Telnet to provide encryption.</p>
</blockquote>
<h3 id="encryption-mechanism">Encryption Mechanism</h3>
<p>HTTPS uses both symmetric (shared key) and asymmetric (public key) encryption to achieve its goals effectively:</p>
<ul>
<li><strong>Public key encryption</strong> is used to encrypt the <strong>shared key</strong> (Pre-master secret), ensuring it cannot be intercepted.</li>
<li>Once the shared key is established, <strong>symmetric encryption</strong> is used for communication to ensure better performance.</li>
</ul>
<blockquote>
<p>Key differences:</p>
<ul>
<li><strong>Public key encryption</strong>: Asymmetric, secure but computationally expensive.</li>
<li><strong>Shared key encryption</strong>: Symmetric, less secure for key exchange but more efficient for encryption.</li>
</ul>
</blockquote>
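<p>The hybrid scheme can be demonstrated with Go&rsquo;s standard crypto packages. This is only the hybrid-encryption core under simplifying assumptions: real TLS adds certificates, key derivation, and a negotiated cipher suite, none of which appear here.</p>

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"fmt"
)

// hybridRoundTrip: RSA (public key) protects a random 32-byte shared
// key, then AES-GCM (symmetric) protects the actual payload.
func hybridRoundTrip(msg []byte) ([]byte, error) {
	// Server key pair; in real HTTPS the public half arrives in a certificate.
	serverKey, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, err
	}

	// Client: random shared key, wrapped with the server's public key
	// (analogous to the pre-master secret).
	sharedKey := make([]byte, 32)
	if _, err := rand.Read(sharedKey); err != nil {
		return nil, err
	}
	wrapped, err := rsa.EncryptOAEP(sha256.New(), rand.Reader, &serverKey.PublicKey, sharedKey, nil)
	if err != nil {
		return nil, err
	}

	// Server: unwrap the shared key with its private key.
	recovered, err := rsa.DecryptOAEP(sha256.New(), rand.Reader, serverKey, wrapped, nil)
	if err != nil {
		return nil, err
	}

	// Both sides now use cheap symmetric encryption for the payload.
	block, err := aes.NewCipher(recovered)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	rand.Read(nonce)
	ciphertext := gcm.Seal(nil, nonce, msg, nil)
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	out, _ := hybridRoundTrip([]byte("GET / HTTP/1.1"))
	fmt.Println(string(out)) // GET / HTTP/1.1
}
```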
<h3 id="authentication-mechanism">Authentication Mechanism</h3>
<p>Public key encryption requires proof that the public key itself is legitimate and not replaced. HTTPS uses <strong>certificates</strong> to achieve this authentication.</p>
<p>Certificates are issued by <strong>Certificate Authorities (CAs)</strong>, who verify the identity of the party requesting the certificate and sign the public key.</p>
<ul>
<li>The server sends its CA-signed public key certificate to the client during the handshake.</li>
<li>The client uses the CA’s public key to verify the signature. If verified, it proves:
<ul>
<li>The CA is trustworthy.</li>
<li>The server’s public key is legitimate.</li>
</ul>
</li>
<li>Both parties then use the server&rsquo;s public key to exchange a shared key and establish a secure channel.</li>
</ul>
<blockquote>
<p>The CA’s public key is usually pre-installed in browsers.</p>
</blockquote>
<h3 id="integrity-protection">Integrity Protection</h3>
<p>HTTPS ensures message integrity by using <strong>message digest algorithms</strong>.</p>
<h4 id="hash-algorithms">Hash Algorithms</h4>
<p>Hash (digest) functions such as MD5 and SHA-2 convert input data of any length into a fixed-length output string.</p>
<blockquote>
<p>A hash algorithm <strong>is not an encryption algorithm</strong>. It cannot be reversed to obtain the original data, so it can only be used for integrity checking.</p>
</blockquote>
<p>Applications attach a <strong>Message Authentication Code (MAC)</strong> to messages. The MAC helps detect tampering, thus ensuring the integrity of the communication.</p>
<h2 id="https-communication-flow">HTTPS Communication Flow</h2>
<p><img alt="HTTPS Communication Flow" src="https://i.loli.net/2021/02/21/BWwJxbpPETst5ug.png"></p>
<h2 id="other-considerations">Other Considerations</h2>
<ul>
<li>Due to the overhead of encryption, decryption, and SSL handshake, HTTPS is generally slower and requires more CPU resources than HTTP.
<ul>
<li>SSL accelerators (dedicated servers) are sometimes used to mitigate this issue.</li>
</ul>
</li>
<li>When a client repeatedly accesses the same HTTPS server, it may not need to perform a complete TLS handshake each time.
<ul>
<li>The server maintains a session ID for each client and uses it to resume secure sessions, avoiding a full handshake.</li>
</ul>
</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li><a href="https://developer.mozilla.org/zh-CN/docs/Glossary/https">MDN Web Docs</a></li>
<li><a href="https://zh.wikipedia.org/wiki/%E8%B6%85%E6%96%87%E6%9C%AC%E4%BC%A0%E8%BE%93%E5%AE%89%E5%85%A8%E5%8D%8F%E8%AE%AE#%E4%B8%8EHTTP%E7%9A%84%E5%B7%AE%E5%BC%82">Wikipedia</a></li>
<li><a href="https://www.cnblogs.com/cxuanBlog/p/12490862.html">Cxuan’s Blog</a></li>
<li><a href="https://www.oreilly.com/library/view/illustrated-httphttps/9781492031484/">“Illustrated HTTP/HTTPS”</a></li>
</ul>
MIT6.824-MapReduce
https://noneback.github.io/blog/mit6.824-mapreduce/
Fri, 22 Jan 2021 17:02:44 +0800https://noneback.github.io/blog/mit6.824-mapreduce/<p>The third year of university has been quite intense, leaving me with little time to continue my studies on 6.824, so my progress stalled at Lab 1. With a bit more free time during the winter break, I decided to continue. Each paper or experiment will be recorded in this article.</p>
<p>This is the first chapter of my Distributed System study notes.</p>
<hr>
<h2 id="about-the-paper">About the Paper</h2>
<p>The core content of the paper is the proposed MapReduce distributed computing model and the approach to implementing the <strong>Distributed</strong> MapReduce System, including the Master data structure, fault tolerance, and some refinements.</p>
<h3 id="mapreduce-computing-model">MapReduce Computing Model</h3>
<p>The model takes a series of key-value pairs as input and outputs a series of key-value pairs as a result. Users can use the MapReduce System by designing Map and Reduce functions.</p>
<ul>
<li>Map: Takes input data and generates a set of intermediate key-value pairs</li>
<li>Reduce: Takes intermediate key-value pairs as input, combines all data with the same key, and outputs the result.</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span>map(String key, String value)<span style="color:#f92672">:</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// key: document name
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#75715e">// value: document contents
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">for</span> each word w in value:
</span></span><span style="display:flex;"><span> EmitIntermediate(w, <span style="color:#e6db74">"1"</span>);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>reduce(String key, Iterator values)<span style="color:#f92672">:</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// key: a word
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#75715e">// values: a list of counts
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">int</span> result <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> each v in values:
</span></span><span style="display:flex;"><span> result <span style="color:#f92672">+=</span> ParseInt(v);
</span></span><span style="display:flex;"><span> Emit(AsString(result));
</span></span></code></pre></div><h3 id="mapreduce-execution-process">MapReduce Execution Process</h3>
<p>The Distributed MapReduce System adopts a master-slave design. During the MapReduce computation, there is generally one Master and several Workers.</p>
<ul>
<li>Master: Responsible for creating, assigning, and scheduling Map and Reduce tasks</li>
<li>Worker: Responsible for executing Map and Reduce tasks</li>
</ul>
<p><img alt="Screenshot_20210112_125637" src="https://i.loli.net/2021/01/12/UK8yJRHc5DzMg3u.png"></p>
<p>A more detailed description is as follows:</p>
<ol>
<li>
<p>The entire MapReduce execution process includes M Map Tasks and R Reduce Tasks, divided into two phases: Map Phase and Reduce Phase.</p>
</li>
<li>
<p>The input file is split into M splits, and the computation enters the Map Phase. The Master assigns Map Tasks to idle Workers. The assigned Worker reads the corresponding split data and executes the Task. When all Map Tasks are completed, the Map Phase ends. The Partition function (generally <code>hash(key) mod R</code>) is used to generate R sets of intermediate key-value pairs, which are stored in files and reported to the Master for subsequent Reduce Task operations.</p>
</li>
<li>
<p>The computation enters the Reduce Phase. The Master assigns Reduce Tasks, and each Worker reads the corresponding intermediate key-value file and executes the Task. Once all Reduce tasks are completed, the computation is finished, and the results are stored in result files.</p>
</li>
</ol>
<h3 id="mapreduce-fault-tolerance-mechanism">MapReduce Fault Tolerance Mechanism</h3>
<p>Since Google MapReduce heavily relies on the distributed atomic file read/write operations provided by Google File System, the fault tolerance mechanism of the MapReduce cluster is much simpler and primarily focuses on recovering from unexpected task interruptions.</p>
<h4 id="worker-fault-tolerance">Worker Fault Tolerance</h4>
<p>In the cluster, the Master periodically sends Ping signals to each Worker. If a Worker does not respond for a period of time, the Master considers the Worker unavailable.</p>
<p>Any Map task assigned to that Worker, whether running or completed, must be reassigned by the Master to another Worker, as the Worker being unavailable also means the intermediate results stored on that Worker’s local disk are no longer available. The Master will also notify all Reducers about the retry, and Reducers that fail to obtain complete intermediate results from the original Mapper will start fetching data from the new Mapper.</p>
<p>If a Reduce task is assigned to that Worker, the Master will select any unfinished Reduce tasks and reassign them to other Workers. Since the results of completed Reduce tasks are stored in Google File System, the availability of these results is ensured by Google File System, and the MapReduce Master only needs to handle unfinished Reduce tasks.</p>
<p>If there is a Worker in the cluster that takes an unusually long time to complete the last few Map or Reduce tasks, the entire MapReduce computation time will be prolonged, and such a Worker becomes a straggler.</p>
<p>To mitigate this, once the MapReduce computation is close to completion, the Master schedules backup executions of the remaining in-progress tasks on idle Workers; a task is considered completed as soon as either the original or the backup execution finishes.</p>
<h4 id="master-fault-tolerance">Master Fault Tolerance</h4>
<p>There is only one Master node in the entire MapReduce cluster, so Master failures are relatively rare.</p>
<p>During operation, the Master node periodically saves the current state of the cluster as a checkpoint to disk. After the Master process terminates, a restarted Master process can use the data stored on disk to recover to the state of the last checkpoint.</p>
<h3 id="refinement">Refinement</h3>
<h4 id="partition-function">Partition Function</h4>
<p>Used during the Map Phase to assign intermediate key-value pairs to R files according to certain rules.</p>
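<p>The default <code>hash(key) mod R</code> partitioner from the paper is a few lines in Go (the FNV hash here is an arbitrary choice; any deterministic hash works):</p>

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partition maps an intermediate key to one of nReduce buckets.
// Because the hash is deterministic, every pair with the same key
// lands in the same bucket, so one Reduce task sees all of its values.
func partition(key string, nReduce int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(nReduce))
}

func main() {
	fmt.Println(partition("apple", 10) == partition("apple", 10)) // true
	fmt.Println(partition("apple", 10))
}
```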
<h4 id="combiner">Combiner</h4>
<p>In some situations, the user-defined Map task may generate a large number of duplicate intermediate keys. The Combiner function performs a partial merge of the intermediate results to reduce the amount of data that needs to be transmitted between Mapper and Reducer.</p>
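<p>For the word-count example above, a combiner can collapse the many <code>("word", "1")</code> pairs for a hot key into a single count before anything crosses the network. A minimal sketch:</p>

```go
package main

import "fmt"

type KeyValue struct {
	Key, Value string
}

// combine performs the partial merge on the Mapper side: since every
// pair carries the value "1", counting occurrences of each key is
// equivalent to summing, so the Reducer receives one pair per key.
func combine(kvs []KeyValue) map[string]int {
	counts := map[string]int{}
	for _, kv := range kvs {
		counts[kv.Key]++
	}
	return counts
}

func main() {
	kvs := []KeyValue{{"the", "1"}, {"the", "1"}, {"cat", "1"}}
	fmt.Println(combine(kvs)) // map[cat:1 the:2]
}
```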
<h2 id="experiment">Experiment</h2>
<p>The experiment involves designing and implementing the Master and Worker to complete the main functionality of a Simple MapReduce System.</p>
<p>In the experiment, the single Master and multiple Worker model was implemented through RPC calls, and different applications were formed by running Map and Reduce functions via Go Plugins.</p>
<h3 id="master--worker-functionality">Master & Worker Functionality</h3>
<h4 id="master">Master</h4>
<ul>
<li>Task creation and scheduling</li>
<li>Worker registration and task assignment</li>
<li>Receiving the current state of the Worker</li>
<li>Monitoring task status</li>
</ul>
<h4 id="worker">Worker</h4>
<ul>
<li>Registering with the Master</li>
<li>Getting tasks and processing them</li>
<li>Reporting status</li>
</ul>
<blockquote>
<p>Note: The Master provides corresponding functions to Workers via RPC calls</p>
</blockquote>
<h3 id="main-data-structures">Main Data Structures</h3>
<p>Designing the data structures is the main task; a good design makes the functionality straightforward to implement. The relevant code is shown here; for the full implementation, see <a href="https://github.com/noneback/Toys/tree/master/6.824-Lab1-MapReduce">GitHub</a>.</p>
<h4 id="master-1">Master</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">Master</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Your definitions here.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#a6e22e">nReduce</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">taskQueue</span> <span style="color:#66d9ef">chan</span> <span style="color:#a6e22e">Task</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">tasksContext</span> []<span style="color:#a6e22e">TaskContext</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">lock</span> <span style="color:#a6e22e">sync</span>.<span style="color:#a6e22e">Mutex</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">files</span> []<span style="color:#66d9ef">string</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">phase</span> <span style="color:#a6e22e">PhaseKind</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">done</span> <span style="color:#66d9ef">bool</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">workerID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="worker-1">Worker</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">worker</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">ID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">mapf</span> <span style="color:#66d9ef">func</span>(<span style="color:#66d9ef">string</span>, <span style="color:#66d9ef">string</span>) []<span style="color:#a6e22e">KeyValue</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">reducef</span> <span style="color:#66d9ef">func</span>(<span style="color:#66d9ef">string</span>, []<span style="color:#66d9ef">string</span>) <span style="color:#66d9ef">string</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">nReduce</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">nMap</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="task--taskcontext">Task & TaskContext</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">Task</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">ID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">Filename</span> <span style="color:#66d9ef">string</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">Phase</span> <span style="color:#a6e22e">PhaseKind</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">TaskContext</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">t</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">Task</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">state</span> <span style="color:#a6e22e">ContextState</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">workerID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">startTime</span> <span style="color:#a6e22e">time</span>.<span style="color:#a6e22e">Time</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="rpc-args--reply">Rpc Args & Reply</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">RegTaskArgs</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">WorkerID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">RegTaskReply</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">T</span> <span style="color:#a6e22e">Task</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">HasT</span> <span style="color:#66d9ef">bool</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">ReportTaskArgs</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">WorkerID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">TaskID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">State</span> <span style="color:#a6e22e">ContextState</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">ReportTaskReply</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">RegWorkerArgs</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">RegWorkerReply</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">ID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">NReduce</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">NMap</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="constant--type">Constant & Type</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">const</span> (
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">RUNNING</span> <span style="color:#a6e22e">ContextState</span> = <span style="color:#66d9ef">iota</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">FAILED</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">READY</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">IDEL</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">COMPLETE</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> (
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">MAX_PROCESSING_TIME</span> = <span style="color:#a6e22e">time</span>.<span style="color:#a6e22e">Second</span> <span style="color:#f92672">*</span> <span style="color:#ae81ff">5</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">SCHEDULE_INTERVAL</span> = <span style="color:#a6e22e">time</span>.<span style="color:#a6e22e">Second</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> (
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">MAP</span> <span style="color:#a6e22e">PhaseKind</span> = <span style="color:#66d9ef">iota</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">REDUCE</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">ContextState</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">PhaseKind</span> <span style="color:#66d9ef">int</span>
</span></span></code></pre></div><h3 id="running-and-testing">Running and Testing</h3>
<h4 id="running">Running</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># In main directory</span>
</span></span><span style="display:flex;"><span>cd ./src/main
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Master</span>
</span></span><span style="display:flex;"><span>go run ./mrmaster.go pg*.txt
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Worker</span>
</span></span><span style="display:flex;"><span>go build -buildmode<span style="color:#f92672">=</span>plugin ../mrapps/wc.go <span style="color:#f92672">&&</span> go run ./mrworker.go ./wc.so
</span></span></code></pre></div><h4 id="testing">Testing</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>cd ./src/main
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>sh ./test-mr.sh
</span></span></code></pre></div><h2 id="optimization">Optimization</h2>
<p>The following are design improvements I identified while reviewing my code after completing the lab.</p>
<h3 id="hotspot-issue">Hotspot Issue</h3>
<p>The hotspot issue arises when a few data items dominate the dataset: one key then accounts for a disproportionate share of the intermediate key-value pairs produced during the Map phase, concentrating disk IO and network IO on a few machines during the shuffle step.</p>
<blockquote>
<p>The essence of this issue is that the Shuffle step in MapReduce is highly dependent on the data.</p>
<p>The <strong>design purpose</strong> of Shuffle is to aggregate intermediate results to facilitate processing during the Reduce phase. Consequently, if the data is extremely unbalanced, hotspot issues will naturally arise.</p>
</blockquote>
<p>In essence, the core problem is that a huge number of records sharing one key are hashed into a single partition file, which then becomes the input of a single Reduce task.</p>
<p>The hash value for the same key is always identical, so the question becomes: <em>How can we route the same key to different machines?</em></p>
<p>The solution I came up with is to add a random salt to the key in the Shuffle’s hash calculation so that the hash values differ, reducing the probability of all of a key’s records landing on the same machine and mitigating the hotspot.</p>
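<p>A minimal sketch of the salting idea (hypothetical helper names, not part of the lab code): append a random salt to the key before hashing, so a hot key’s records spread over several reduce partitions.</p>
<pre tabindex="0"><code class="language-go" data-lang="go">package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// saltedReduceIdx appends a random salt to a hot key before hashing,
// so the key's records are spread across several reduce partitions.
func saltedReduceIdx(key string, nReduce, salts int) int {
	salted := fmt.Sprintf("%s#%d", key, rand.Intn(salts))
	h := fnv.New32a()
	h.Write([]byte(salted))
	return int(h.Sum32()) % nReduce
}

func main() {
	// The same hot key now lands in several partitions.
	seen := map[int]bool{}
	for i := 0; i &lt; 100; i++ {
		seen[saltedReduceIdx("the", 10, 8)] = true
	}
	fmt.Println("distinct partitions used:", len(seen))
}
</code></pre>
<p>The trade-off is an extra merge: each original key now has up to <code>salts</code> partial results that a follow-up aggregation pass must combine.</p>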
<h3 id="fault-tolerance">Fault Tolerance</h3>
<p>The paper already proposes some solutions for fault tolerance. The scenario in question is: a Worker node crashes unexpectedly and reconnects after a reboot. The Master observes the crash and reassigns its tasks to other nodes, but the reconnected Worker continues executing its original tasks, resulting in duplicate result files.</p>
<p>The potential issue here is that these two files may cause incorrect results. Furthermore, the reconnected Worker continuing to execute its original tasks wastes CPU and IO resources.</p>
<p>Based on this, we need to mark newly generated result files so that only the latest files are used as results, resolving the file conflict. Additionally, the Worker should expose an RPC interface that the Master calls when a Worker reconnects, clearing out its stale original tasks.</p>
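<p>One way to realize the “only the latest file is used” rule (a sketch assuming result files live on a local file system; the helper name is made up): each attempt writes to a private temp file and atomically renames it into the final name, so readers never observe a partially written result and the last committed rename wins.</p>
<pre tabindex="0"><code class="language-go" data-lang="go">package main

import (
	"fmt"
	"os"
)

// commitResult writes the task output to a temp file private to this
// attempt, then atomically renames it to the final name. os.Rename
// replaces the target in one step, so no reader ever sees a partial file.
func commitResult(final string, workerID int, data []byte) error {
	tmp := fmt.Sprintf("%s.tmp-%d", final, workerID)
	if err := os.WriteFile(tmp, data, 0644); err != nil {
		return err
	}
	return os.Rename(tmp, final)
}

func main() {
	if err := commitResult("mr-out-0", 42, []byte("hello\n")); err != nil {
		fmt.Println("commit failed:", err)
	}
}
</code></pre>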
<h3 id="straggler-issue">Straggler Issue</h3>
<p>The straggler issue refers to a Task that takes unusually long to complete, delaying the overall MapReduce job. It is essentially a combination of the hotspot issue and Worker-crash handling, and can be addressed with the approaches described in the sections above.</p>
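<p>The paper’s own remedy is backup (speculative) tasks: once a task has run longer than a threshold, the Master hands the same task to a second Worker and accepts whichever finishes first. A sketch of the check a scheduler loop might run (hypothetical names, not the lab code):</p>
<pre tabindex="0"><code class="language-go" data-lang="go">package main

import (
	"fmt"
	"time"
)

// Hypothetical mirror of the TaskContext/MAX_PROCESSING_TIME above.
type taskCtx struct {
	running   bool
	startTime time.Time
}

const maxProcessingTime = 5 * time.Second

// needsBackup reports whether a running task has exceeded the time
// budget and should be speculatively handed to a second Worker.
func needsBackup(t taskCtx, now time.Time) bool {
	return t.running &amp;&amp; now.Sub(t.startTime) &gt; maxProcessingTime
}

func main() {
	stale := taskCtx{running: true, startTime: time.Now().Add(-10 * time.Second)}
	fmt.Println(needsBackup(stale, time.Now())) // true
}
</code></pre>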
<h2 id="references">References</h2>
<p><a href="https://pdos.csail.mit.edu/6.824/index.html">MIT6.824 Distributed System</a></p>
<p><a href="https://pdos.csail.mit.edu/6.824/labs/lab-mr.html">Lab Official Site</a></p>
<p><a href="http://static.googleusercontent.com/media/research.google.com/zh-CN//archive/mapreduce-osdi04.pdf">MapReduce: Simplified Data Processing on Large Clusters</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/34849261">Detailed Explanation of Google MapReduce Paper</a></p>
Chinese Spam Email Classification Based on Naive Bayes
https://noneback.github.io/blog/%E5%9F%BA%E4%BA%8E%E6%9C%B4%E7%B4%A0%E8%B4%9D%E5%8F%B6%E6%96%AF%E7%9A%84%E4%B8%AD%E6%96%87%E5%9E%83%E5%9C%BE%E7%94%B5%E5%AD%90%E9%82%AE%E4%BB%B6%E5%88%86%E7%B1%BB/
Wed, 06 May 2020 00:00:00 +0000https://noneback.github.io/blog/%E5%9F%BA%E4%BA%8E%E6%9C%B4%E7%B4%A0%E8%B4%9D%E5%8F%B6%E6%96%AF%E7%9A%84%E4%B8%AD%E6%96%87%E5%9E%83%E5%9C%BE%E7%94%B5%E5%AD%90%E9%82%AE%E4%BB%B6%E5%88%86%E7%B1%BB/<h1 id="chinese-spam-email-classification-based-on-naive-bayes">Chinese Spam Email Classification Based on Naive Bayes</h1>
<h2 id="training-and-testing-data">Training and Testing Data</h2>
<p>This project primarily uses <a href="https://github.com/shijing888/BayesSpam">open-source data on GitHub</a>.</p>
<h2 id="data-processing">Data Processing</h2>
<p>First, we use regular expressions to filter the content of Chinese emails in the training set, removing all non-Chinese characters. The remaining content is then tokenized using <a href="https://github.com/fxsjy/jieba">jieba</a> for word segmentation, and stopwords are filtered using a Chinese stopword list. The processed results for spam and normal emails are stored separately.</p>
<p>Two dictionaries, <code>spam_voca</code> and <code>normal_voca</code>, are used to store the word frequencies of different terms in different emails. The data processing is then complete.</p>
<h2 id="training-and-prediction">Training and Prediction</h2>
<p>The training and prediction process involves calculating the probability $P(Spam|word_1, word_2, \dots, word_n)$. When this probability exceeds a certain threshold, the email is classified as spam.</p>
<blockquote>
<p>Based on the conditional independence assumption of Naive Bayes, and assuming the prior probability $P(s) = P(s’) = 0.5$, we have:</p>
<p>$P(s|w_1, w_2, \dots, w_n) = \frac{P(s, w_1, w_2, \dots, w_n)}{P(w_1, w_2, \dots, w_n)}$</p>
<p>$= \frac{P(w_1, w_2, \dots, w_n | s) P(s)}{P(w_1, w_2, \dots, w_n)} = \frac{P(w_1, w_2, \dots, w_n | s) P(s)}{P(w_1, w_2, \dots, w_n | s) \cdot P(s) + P(w_1, w_2, \dots, w_n | s’) \cdot P(s’)} $</p>
<p>Since $P(spam) = P(not\ spam)$, we have</p>
<p>$\frac{\prod\limits_{j=1}^n P(w_j | s)}{\prod\limits_{j=1}^n P(w_j | s) + \prod\limits_{j=1}^n P(w_j | s’)}$</p>
<p>Further, using Bayes’ theorem $P(w_j | s) = \frac{P(s | w_j) \cdot P(w_j)}{P(s)}$, the expression becomes</p>
<p>$\frac{\prod\limits_{j=1}^n P(s | w_j)}{\prod\limits_{j=1}^n P(s | w_j) + \prod\limits_{j=1}^n P(s’ | w_j)}$</p>
</blockquote>
<p>Process details:</p>
<ul>
<li>For each email in the test set, perform the same processing, and calculate the top $n$ words with the highest $P(s|w)$. During calculation, if a word appears only in the spam dictionary, set $P(w | s’) = 0.01$; similarly, if a word appears only in the normal dictionary, set $P(w | s) = 0.01$. If the word appears in neither, set $P(s|w) = 0.4$. These assumptions are based on prior research.</li>
<li>Use the 15 most important words for each email and calculate the probability using the above formulas. If the probability is greater than the threshold $\alpha$ (typically set to 0.9), classify it as spam; otherwise, classify it as a normal email.</li>
</ul>
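<p>The combination step above can be checked numerically with a standalone sketch of the same formula (illustrative only; the full implementation appears in <code>utils.py</code> below):</p>
<pre tabindex="0"><code class="language-python" data-lang="python"># Combine per-word probabilities P(s|w_j) into a single spam score,
# following the product formula derived above.
def spam_score(word_probs):
    p_spam = 1.0
    p_not_spam = 1.0
    for p in word_probs:
        p_spam *= p
        p_not_spam *= 1.0 - p
    return p_spam / (p_spam + p_not_spam)

# Three strongly spam-indicative words push the score past the 0.9 threshold.
print(spam_score([0.99, 0.95, 0.9]))
</code></pre>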
<p>You can refer to the code for further details.</p>
<h2 id="results">Results</h2>
<p>By adjusting the number of words used for prediction, the best result for this dataset is:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>Selected <span style="color:#ae81ff">29</span> words: <span style="color:#ae81ff">0.9642857142857143</span>
</span></span></code></pre></div><h2 id="project-structure">Project Structure</h2>
<ul>
<li><strong>data</strong>
<ul>
<li><code>中文停用词表.txt</code> (Chinese stopword list)</li>
<li><code>normal</code> (folder for normal emails)</li>
<li><code>spam</code> (folder for spam emails)</li>
<li><code>test</code> (folder for test emails)</li>
</ul>
</li>
<li><strong>main.py</strong> (main script)</li>
<li><strong>normal_voca.json</strong> (JSON file for normal email vocabulary)</li>
<li><strong>__pycache__</strong> (cache folder)
<ul>
<li><code>utils.cpython-36.pyc</code></li>
</ul>
</li>
<li><strong>spam_voca.json</strong> (JSON file for spam email vocabulary)</li>
<li><strong>utils.py</strong> (utility functions)</li>
</ul>
<h2 id="code">Code</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># utils.py</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> jieba
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> re
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> os
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> collections <span style="color:#f92672">import</span> defaultdict
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>spam_file_num <span style="color:#f92672">=</span> <span style="color:#ae81ff">7775</span>
</span></span><span style="display:flex;"><span>normal_file_num <span style="color:#f92672">=</span> <span style="color:#ae81ff">7063</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Load stopword list</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_stopwords</span>():
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> [i<span style="color:#f92672">.</span>strip() <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> open(<span style="color:#e6db74">'./data/中文停用词表.txt'</span>, encoding<span style="color:#f92672">=</span><span style="color:#e6db74">'gbk'</span>)]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Read raw email content and process it</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_raw_str_list</span>(path):
</span></span><span style="display:flex;"><span> stop_list <span style="color:#f92672">=</span> get_stopwords()
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">with</span> open(path, encoding<span style="color:#f92672">=</span><span style="color:#e6db74">'gbk'</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span> raw_str <span style="color:#f92672">=</span> f<span style="color:#f92672">.</span>read()
</span></span><span style="display:flex;"><span> pattern <span style="color:#f92672">=</span> <span style="color:#e6db74">'[^</span><span style="color:#ae81ff">\u4E00</span><span style="color:#e6db74">-</span><span style="color:#ae81ff">\u9FA5</span><span style="color:#e6db74">]'</span> <span style="color:#75715e"># Chinese unicode range</span>
</span></span><span style="display:flex;"><span> regex <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>compile(pattern)
</span></span><span style="display:flex;"><span> handled_str <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>sub(pattern, <span style="color:#e6db74">''</span>, raw_str)
</span></span><span style="display:flex;"><span> str_list <span style="color:#f92672">=</span> [word <span style="color:#66d9ef">for</span> word <span style="color:#f92672">in</span> jieba<span style="color:#f92672">.</span>cut(handled_str) <span style="color:#66d9ef">if</span> word <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> stop_list]
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> str_list
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Build vocabulary</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_voca</span>(path, is_file_path<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> is_file_path:
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> read_voca_from_file(path)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> voca <span style="color:#f92672">=</span> defaultdict(int)
</span></span><span style="display:flex;"><span> file_list <span style="color:#f92672">=</span> [file <span style="color:#66d9ef">for</span> file <span style="color:#f92672">in</span> os<span style="color:#f92672">.</span>listdir(path)]
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> file <span style="color:#f92672">in</span> file_list:
</span></span><span style="display:flex;"><span> raw_str_list <span style="color:#f92672">=</span> get_raw_str_list(path <span style="color:#f92672">+</span> str(file))
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> raw_str <span style="color:#f92672">in</span> raw_str_list:
</span></span><span style="display:flex;"><span> voca[raw_str] <span style="color:#f92672">=</span> voca[raw_str] <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> voca
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Save vocabulary to JSON file</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">save_voca2json</span>(voca, path, sort_by_value<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>, indent_<span style="color:#f92672">=</span><span style="color:#ae81ff">4</span>):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> sort_by_value:
</span></span><span style="display:flex;"><span> sorted_by_value(voca)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">with</span> open(path, <span style="color:#e6db74">'w+'</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span> f<span style="color:#f92672">.</span>write(json<span style="color:#f92672">.</span>dumps(voca, ensure_ascii<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>, indent<span style="color:#f92672">=</span>indent_))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Read vocabulary from JSON file</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">read_voca_from_file</span>(path):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">with</span> open(path) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span> voca <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>load(f)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> voca
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Sort dictionary by value</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">sorted_by_value</span>(_dict):
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Sort in place: rebuild the dict ordered by descending frequency</span>
</span></span><span style="display:flex;"><span> items <span style="color:#f92672">=</span> sorted(_dict<span style="color:#f92672">.</span>items(), key<span style="color:#f92672">=</span><span style="color:#66d9ef">lambda</span> x: x[<span style="color:#ae81ff">1</span>], reverse<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span> _dict<span style="color:#f92672">.</span>clear()
</span></span><span style="display:flex;"><span> _dict<span style="color:#f92672">.</span>update(items)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Calculate P(Spam|word)</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_top_words_prob</span>(path, spam_voca, normal_voca, words_size<span style="color:#f92672">=</span><span style="color:#ae81ff">30</span>):
</span></span><span style="display:flex;"><span> critical_words <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> word <span style="color:#f92672">in</span> get_raw_str_list(path):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> word <span style="color:#f92672">in</span> spam_voca<span style="color:#f92672">.</span>keys() <span style="color:#f92672">and</span> word <span style="color:#f92672">in</span> normal_voca<span style="color:#f92672">.</span>keys():
</span></span><span style="display:flex;"><span> p_w_s <span style="color:#f92672">=</span> spam_voca[word] <span style="color:#f92672">/</span> spam_file_num
</span></span><span style="display:flex;"><span> p_w_n <span style="color:#f92672">=</span> normal_voca[word] <span style="color:#f92672">/</span> normal_file_num
</span></span><span style="display:flex;"><span> p_s_w <span style="color:#f92672">=</span> p_w_s <span style="color:#f92672">/</span> (p_w_n <span style="color:#f92672">+</span> p_w_s)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">elif</span> word <span style="color:#f92672">in</span> spam_voca<span style="color:#f92672">.</span>keys() <span style="color:#f92672">and</span> word <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> normal_voca<span style="color:#f92672">.</span>keys():
</span></span><span style="display:flex;"><span> p_w_s <span style="color:#f92672">=</span> spam_voca[word] <span style="color:#f92672">/</span> spam_file_num
</span></span><span style="display:flex;"><span> p_w_n <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.01</span>
</span></span><span style="display:flex;"><span> p_s_w <span style="color:#f92672">=</span> p_w_s <span style="color:#f92672">/</span> (p_w_n <span style="color:#f92672">+</span> p_w_s)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">elif</span> word <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> spam_voca<span style="color:#f92672">.</span>keys() <span style="color:#f92672">and</span> word <span style="color:#f92672">in</span> normal_voca<span style="color:#f92672">.</span>keys():
</span></span><span style="display:flex;"><span> p_w_s <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.01</span>
</span></span><span style="display:flex;"><span> p_w_n <span style="color:#f92672">=</span> normal_voca[word] <span style="color:#f92672">/</span> normal_file_num
</span></span><span style="display:flex;"><span> p_s_w <span style="color:#f92672">=</span> p_w_s <span style="color:#f92672">/</span> (p_w_n <span style="color:#f92672">+</span> p_w_s)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span> p_s_w <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.4</span>
</span></span><span style="display:flex;"><span> critical_words<span style="color:#f92672">.</span>append([word, p_s_w])
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Sort first, then keep the words_size highest-probability words</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> dict(sorted(critical_words, key<span style="color:#f92672">=</span><span style="color:#66d9ef">lambda</span> x: x[<span style="color:#ae81ff">1</span>], reverse<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)[:words_size])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Calculate Bayesian probability</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">caculate_bayes</span>(words_prob, spam_voca, normal_voca):
</span></span><span style="display:flex;"><span> p_s_w <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> p_s_nw <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> word, prob <span style="color:#f92672">in</span> words_prob<span style="color:#f92672">.</span>items():
</span></span><span style="display:flex;"><span> p_s_w <span style="color:#f92672">*=</span> prob
</span></span><span style="display:flex;"><span> p_s_nw <span style="color:#f92672">*=</span> (<span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> prob)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> p_s_w <span style="color:#f92672">/</span> (p_s_w <span style="color:#f92672">+</span> p_s_nw)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">predict</span>(bayes, threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.9</span>):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> bayes <span style="color:#f92672">>=</span> threshold
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Get files and labels</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_files_labels</span>(dir_path, is_spam<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>):
</span></span><span style="display:flex;"><span> raw_files_list <span style="color:#f92672">=</span> os<span style="color:#f92672">.</span>listdir(dir_path)
</span></span><span style="display:flex;"><span> files_list <span style="color:#f92672">=</span> [dir_path <span style="color:#f92672">+</span> file <span style="color:#66d9ef">for</span> file <span style="color:#f92672">in</span> raw_files_list]
</span></span><span style="display:flex;"><span> labels <span style="color:#f92672">=</span> [is_spam <span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(len(files_list))]
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> files_list, labels
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Predict and print results</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">predict_result</span>(file_list, y, spam_voca, normal_voca, word_size<span style="color:#f92672">=</span><span style="color:#ae81ff">30</span>):
</span></span><span style="display:flex;"><span> ret <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span> right <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> file <span style="color:#f92672">in</span> file_list:
</span></span><span style="display:flex;"><span> words_prob <span style="color:#f92672">=</span> get_top_words_prob(file, spam_voca, normal_voca, words_size<span style="color:#f92672">=</span>word_size)
</span></span><span style="display:flex;"><span> bayes <span style="color:#f92672">=</span> caculate_bayes(words_prob, spam_voca, normal_voca)
</span></span><span style="display:flex;"><span> ret<span style="color:#f92672">.</span>append(predict(bayes))
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(len(ret)):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> ret[i] <span style="color:#f92672">==</span> y[i]:
</span></span><span style="display:flex;"><span> right <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> print(right <span style="color:#f92672">/</span> len(y))
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># main.py</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> utils <span style="color:#f92672">import</span> <span style="color:#f92672">*</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">'__main__'</span>:
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Get vocabulary and save for future use</span>
</span></span><span style="display:flex;"><span> spam_voca <span style="color:#f92672">=</span> get_voca(<span style="color:#e6db74">'./spam_voca.json'</span>, is_file_path<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span> normal_voca <span style="color:#f92672">=</span> get_voca(<span style="color:#e6db74">'./normal_voca.json'</span>, is_file_path<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span> save
</span></span></code></pre></div>Java Multithreading Programming
https://noneback.github.io/blog/java%E5%A4%9A%E7%BA%BF%E7%A8%8B%E5%92%8C%E5%B9%B6%E8%A1%8C/
Fri, 01 Nov 2019 00:00:00 +0000https://noneback.github.io/blog/java%E5%A4%9A%E7%BA%BF%E7%A8%8B%E5%92%8C%E5%B9%B6%E8%A1%8C/<p>Yesterday evening, while revisiting the book “Advanced Java: Multithreading and Parallel Programming” by Liang Yung, I decided to take the opportunity to document my understanding.</p>
<h2 id="java-multithreading-programming">Java Multithreading Programming</h2>
<p>Java provides built-in support for multithreading.</p>
<ul>
<li>A <strong>thread</strong> is a single sequential flow of control within a process, and multiple threads can run concurrently within a process, each performing different tasks.</li>
<li><strong>Multithreading</strong> is a specialized form of multitasking that consumes fewer resources.</li>
<li>A <strong>process</strong> contains the memory space allocated by the operating system and includes one or more threads. Threads cannot exist independently but must be part of a process. A process continues running until all non-daemon threads complete execution.</li>
<li>Multithreading allows developers to write efficient programs that fully utilize CPU resources.</li>
</ul>
<h3 id="thread-states">Thread States</h3>
<p>A thread is a dynamic execution entity that has different states throughout its lifecycle.</p>
<p><img alt="Thread States" src="https://www.runoob.com/wp-content/uploads/2014/01/java-thread.jpg"></p>
<ol>
<li>
<p><strong>New</strong>:</p>
<ul>
<li>A thread is in a new state when it is created using the <code>new</code> keyword with the <code>Thread</code> class or its subclass. It remains in this state until the program starts the thread using the <code>start()</code> method.</li>
</ul>
</li>
<li>
<p><strong>Runnable</strong>:</p>
<ul>
<li>After invoking the <code>start()</code> method, the thread enters the runnable state and waits in the ready queue to be allocated CPU resources by the JVM thread scheduler.</li>
</ul>
</li>
<li>
<p><strong>Running</strong>:</p>
<ul>
<li>Once the thread gets CPU resources, it enters the running state and executes the <code>run()</code> method. In the running state, a thread can transition to blocked, runnable, or terminated states.</li>
</ul>
</li>
<li>
<p><strong>Blocked</strong>:</p>
<ul>
<li>When a thread calls methods like <code>sleep()</code> or <code>suspend()</code> (the latter is deprecated) and gives up the CPU, it transitions to the blocked state. Once the sleep time elapses or the blocking condition clears, it can reenter the runnable state.</li>
</ul>
</li>
<li>
<p><strong>Waiting Blocked</strong>:</p>
<ul>
<li>A running thread calling the <code>wait()</code> method enters the waiting blocked state.</li>
</ul>
</li>
<li>
<p><strong>Synchronized Blocked</strong>:</p>
<ul>
<li>A thread trying to acquire a synchronized lock but failing due to another thread owning the lock transitions to the synchronized blocked state.</li>
</ul>
</li>
<li>
<p><strong>Other Blocked</strong>:</p>
<ul>
<li>Through methods like <code>sleep()</code>, <code>join()</code>, or I/O requests, a thread can enter the other blocked state. Once these operations are complete, it can reenter the runnable state.</li>
</ul>
</li>
<li>
<p><strong>Terminated</strong>:</p>
<ul>
<li>A thread enters the terminated state once it has completed its execution or met some terminating conditions.</li>
</ul>
</li>
</ol>
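<p>The states above can be observed at runtime through <code>Thread.getState()</code>. Note that the JVM’s <code>Thread.State</code> enum is coarser than this list: runnable and running are merged into <code>RUNNABLE</code>, and the blocked variants map to <code>BLOCKED</code>, <code>WAITING</code>, or <code>TIMED_WAITING</code>. A minimal sketch:</p>

```java
public class ThreadStateDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(200); // thread is TIMED_WAITING while sleeping
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println(t.getState()); // NEW: created but not started
        t.start();
        Thread.sleep(50);                 // let t reach its sleep() call
        System.out.println(t.getState()); // typically TIMED_WAITING
        t.join();                         // wait for t to finish
        System.out.println(t.getState()); // TERMINATED
    }
}
```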
<h3 id="creating-task-classes-and-threads">Creating Task Classes and Threads</h3>
<ul>
<li>A <strong>task</strong> in Java is an object that implements the <code>Runnable</code> interface (containing the <code>run()</code> method). You need to override the <code>run()</code> method to define the task’s behavior.</li>
<li><strong>Threads</strong> are created through the <code>Thread</code> class, which also contains methods for controlling the thread.</li>
</ul>
<blockquote>
<p>Creating a thread is always based on a task:</p>
</blockquote>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span>Thread thread <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> Thread(<span style="color:#66d9ef">new</span> TaskClass());
</span></span><span style="display:flex;"><span><span style="color:#75715e">// Calling thread.start() schedules the thread; the JVM then invokes run().</span>
</span></span></code></pre></div><h4 id="other-methods-in-the-thread-class">Other Methods in the Thread Class</h4>
<ul>
<li><code>yield()</code>: Temporarily releases the CPU to let other threads execute.</li>
<li><code>sleep()</code>: Makes the thread sleep for a specified period to allow other threads to run.
<blockquote>
<p>Note: <code>sleep()</code> may throw an <code>InterruptedException</code>, which is a checked exception: Java requires you to catch it in a <code>try</code> block or declare it with <code>throws</code>.</p>
</blockquote>
</li>
</ul>
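<p>Because <code>InterruptedException</code> is checked, a task’s <code>run()</code> method (which cannot declare <code>throws</code>) must handle it inline. A common idiom, sketched below, is to restore the interrupt flag so callers can still detect the interruption:</p>

```java
public class SleepDemo {
    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(10_000); // pretend to do slow work
            } catch (InterruptedException e) {
                // sleep() clears the interrupt flag when it throws;
                // re-set it so code further up the stack can observe it.
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        worker.interrupt(); // wakes the sleeping thread early
    }
}
```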
<h4 id="thread-priorities">Thread Priorities</h4>
<p>Threads have priorities. The Java Virtual Machine generally schedules higher-priority threads first, although priority is only a hint to the underlying scheduler. Threads of equal priority are typically scheduled round-robin.</p>
<blockquote>
<p>Use <code>Thread.setPriority()</code> to set a thread’s priority.</p>
</blockquote>
<h3 id="thread-pool">Thread Pool</h3>
<p>Starting a new thread for each of many short-lived tasks limits throughput and degrades performance. Using a <strong>thread pool</strong> is an ideal solution for managing the concurrent execution of tasks.</p>
<p>Java provides the <code>Executor</code> interface to execute tasks in a thread pool, and the <code>ExecutorService</code> interface is used to manage and control those tasks. Executors are created through static methods like <code>newFixedThreadPool(int)</code> (to create a pool with a fixed number of threads) or <code>newCachedThreadPool()</code> (to create a pool with a dynamically managed number of threads).</p>
<blockquote>
<p>Example:</p>
</blockquote>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#f92672">import</span> java.util.concurrent.*;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ExecutorDemo</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">static</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">main</span>(String<span style="color:#f92672">[]</span> args) {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Create a fixed thread pool with a maximum of three threads</span>
</span></span><span style="display:flex;"><span> ExecutorService executor <span style="color:#f92672">=</span> Executors.<span style="color:#a6e22e">newFixedThreadPool</span>(3);
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Submit runnable tasks to the executor</span>
</span></span><span style="display:flex;"><span> executor.<span style="color:#a6e22e">execute</span>(<span style="color:#66d9ef">new</span> PrintChar(<span style="color:#e6db74">'a'</span>, 100));
</span></span><span style="display:flex;"><span> executor.<span style="color:#a6e22e">execute</span>(<span style="color:#66d9ef">new</span> PrintChar(<span style="color:#e6db74">'b'</span>, 100));
</span></span><span style="display:flex;"><span> executor.<span style="color:#a6e22e">execute</span>(<span style="color:#66d9ef">new</span> PrintNum(100));
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Shut down the executor</span>
</span></span><span style="display:flex;"><span> executor.<span style="color:#a6e22e">shutdown</span>();
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Thread pools provide <strong>a better way to manage threads</strong>. They primarily address issues related to the overhead of thread lifecycle and resource limitations:</p>
<ul>
<li><strong>Thread pools reduce the time and system resources spent on creating and destroying threads</strong>. By reusing threads across multiple tasks, the cost of creating threads is amortized. Since threads already exist when new requests come in, they eliminate the latency caused by thread creation, allowing the application to respond faster.</li>
<li><strong>Thread pools allow easy thread management</strong>, e.g., using a <code>ScheduledThreadPool</code> to execute tasks after a delay or on a repeating schedule.</li>
<li><strong>They control concurrency levels</strong>, preventing resource contention when many threads compete for CPU resources.</li>
</ul>
<h3 id="thread-synchronization">Thread Synchronization</h3>
<p>If multiple threads simultaneously access the same resource, it may lead to data corruption. If two tasks interact with a shared resource in a conflicting manner, they are said to be in a <strong>race condition</strong>. Without race conditions, a program is considered <strong>thread-safe</strong>.</p>
<p>To prevent race conditions, threads must be synchronized to prevent multiple threads from accessing a particular section of the program simultaneously.</p>
<h4 id="methods-for-synchronizing-threads">Methods for Synchronizing Threads</h4>
<p>Before executing a synchronized method, a lock must be obtained. Locks provide exclusive access to a shared resource. For instance methods, the object is locked; for static methods, the class is locked.</p>
<ul>
<li><code>synchronized</code> keyword:</li>
</ul>
<blockquote>
<p>You can apply this keyword to methods or blocks of code.</p>
</blockquote>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">synchronized</span> (expr) {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// do something</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">synchronized</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">func</span>() {}
</span></span></code></pre></div><ul>
<li>Lock-based synchronization:
Locks and conditions can be used explicitly for thread synchronization.
<blockquote>
<p>A lock is an instance of the <code>Lock</code> interface, which provides methods to acquire and release locks.
<code>ReentrantLock</code> is an implementation of the lock mechanism for mutual exclusion.</p>
</blockquote>
</li>
</ul>
<p>Example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">deposit</span>(<span style="color:#66d9ef">int</span> amount) {
</span></span><span style="display:flex;"><span> lock.<span style="color:#a6e22e">lock</span>(); <span style="color:#75715e">// Acquire the lock</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">try</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> newBalance <span style="color:#f92672">=</span> balance <span style="color:#f92672">+</span> amount;
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// This delay is deliberately added to magnify the</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// data corruption problem and make it easy to see.</span>
</span></span><span style="display:flex;"><span> Thread.<span style="color:#a6e22e">sleep</span>(5);
</span></span><span style="display:flex;"><span> balance <span style="color:#f92672">=</span> newBalance;
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">catch</span> (InterruptedException ex) {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Handle the exception</span>
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">finally</span> {
</span></span><span style="display:flex;"><span> lock.<span style="color:#a6e22e">unlock</span>(); <span style="color:#75715e">// Release the lock</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="avoiding-deadlocks">Avoiding Deadlocks</h4>
<p>A deadlock may occur when multiple threads need to acquire locks on several shared objects simultaneously. Deadlocks can be avoided by <strong>ordering resource acquisition</strong>.</p>
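<p>A minimal sketch of the lock-ordering idea: if every thread that needs both locks always acquires them in the same global order (here, A before B), a circular wait — and hence a deadlock — cannot form. The lock and method names are illustrative only:</p>

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockOrdering {
    static final ReentrantLock LOCK_A = new ReentrantLock();
    static final ReentrantLock LOCK_B = new ReentrantLock();

    // All callers acquire LOCK_A first, then LOCK_B, and release in
    // reverse order; no thread can ever hold B while waiting for A.
    static void withBothLocks(Runnable critical) {
        LOCK_A.lock();
        try {
            LOCK_B.lock();
            try {
                critical.run();
            } finally {
                LOCK_B.unlock();
            }
        } finally {
            LOCK_A.unlock();
        }
    }
}
```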
<h3 id="thread-collaboration">Thread Collaboration</h3>
<p>Threads can communicate by using <strong>conditions</strong> to specify what actions they should take under certain circumstances.</p>
<blockquote>
<p>A <strong>condition</strong> is an object created through the <code>Lock</code> object’s <code>newCondition()</code> method. Threads can use <code>await()</code>, <code>signal()</code>, or <code>signalAll()</code> to communicate.</p>
</blockquote>
<ul>
<li><code>await()</code>: Causes the current thread to wait until the condition is signaled.</li>
<li><code>signal()</code>/<code>signalAll()</code>: Wakes one or all threads waiting on the condition.</li>
</ul>
<blockquote>
<p>Conditions must be used with locks; invoking their methods without a lock will result in an <code>IllegalMonitorStateException</code>.</p>
</blockquote>
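<p>The <code>await()</code>/<code>signal()</code> pattern can be sketched with a one-slot buffer guarded by a <code>Lock</code> and two conditions (the class and method names here are illustrative):</p>

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class OneSlotBuffer {
    private final Lock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();
    private final Condition notFull = lock.newCondition();
    private Integer slot; // null means the buffer is empty

    public void put(int value) throws InterruptedException {
        lock.lock();
        try {
            while (slot != null) notFull.await(); // wait until slot frees up
            slot = value;
            notEmpty.signal(); // wake a waiting consumer
        } finally {
            lock.unlock();
        }
    }

    public int take() throws InterruptedException {
        lock.lock();
        try {
            while (slot == null) notEmpty.await(); // wait for a value
            int value = slot;
            slot = null;
            notFull.signal(); // wake a waiting producer
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```

Note the <code>while</code> loops around <code>await()</code>: a condition must be rechecked after waking, since another thread may have consumed it first.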
<h3 id="blocking-queues">Blocking Queues</h3>
<p>Java provides <strong>blocking queues</strong> for multithreading, which allow synchronization without needing locks or conditions explicitly. They provide two additional operations:</p>
<ul>
<li>When the queue is empty, a retrieval operation will <strong>block</strong> the thread until elements become available.</li>
<li>When the queue is full, an insert operation will <strong>block</strong> the thread until space becomes available.</li>
</ul>
<p>Blocking queues are commonly used in <strong>producer-consumer</strong> scenarios. Producer threads place results in the queue, while consumer threads retrieve and process those results. Blocking queues <strong>automatically balance the workload</strong> between producers and consumers.</p>
<h4 id="core-methods-of-blockingqueue">Core Methods of BlockingQueue</h4>
<ol>
<li>
<p><strong>Adding Data</strong>:</p>
<ul>
<li><code>put(E e)</code>: Inserts an element at the end of the queue, waiting if the queue is full.</li>
<li><code>offer(E e, long timeout, TimeUnit unit)</code>: Attempts to add an element, waiting up to the specified time if the queue is full. If successful, returns <code>true</code>; otherwise, returns <code>false</code>.</li>
</ul>
</li>
<li>
<p><strong>Retrieving Data</strong>:</p>
<ul>
<li><code>take()</code>: Retrieves and removes the head of the queue, waiting if necessary until an element becomes available.</li>
<li><code>drainTo()</code>: Retrieves and removes all available elements from the queue, improving efficiency by reducing the number of lock/unlock operations.</li>
<li><code>poll(long timeout, TimeUnit unit)</code>: Retrieves and removes the head of the queue, waiting up to the specified time if the queue is empty. If no element is found within the time limit, returns <code>null</code>.</li>
</ul>
</li>
</ol>
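<p>A minimal producer-consumer sketch with a bounded <code>ArrayBlockingQueue</code>: the producer blocks in <code>put()</code> whenever the two-element queue is full, and the consumer blocks in <code>take()</code> whenever it is empty, so no explicit locks or conditions are needed:</p>

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(2);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 5; i++) queue.put(i); // blocks when full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        for (int i = 1; i <= 5; i++)
            System.out.println(queue.take()); // blocks when empty
        producer.join();
    }
}
```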
<h3 id="parallel-programming">Parallel Programming</h3>
<p>Java uses the <strong>Fork/Join framework</strong> to implement parallel programming. In this framework, a <strong>fork</strong> splits a task into subtasks that can be executed by separate threads, and a <strong>join</strong> combines their results.</p>
<blockquote>
<p>Decompose a problem into multiple non-overlapping subproblems that can be solved independently, then combine their solutions to get the overall answer.</p>
</blockquote>
<p>Tasks are defined using the <code>ForkJoinTask</code> class and executed in a <code>ForkJoinPool</code> instance.</p>
<blockquote>
<p><code>ForkJoinTask</code> is the base class for tasks. It’s a lightweight entity, meaning many tasks can be executed by a small number of threads in the <code>ForkJoinPool</code>.</p>
</blockquote>
<p>Example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#f92672">import</span> java.util.concurrent.RecursiveAction;
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> java.util.concurrent.ForkJoinPool;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ParallelMergeSort</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">static</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">main</span>(String<span style="color:#f92672">[]</span> args) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">final</span> <span style="color:#66d9ef">int</span> SIZE <span style="color:#f92672">=</span> 7000000;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> list1 <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[</span>SIZE<span style="color:#f92672">]</span>;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> list2 <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[</span>SIZE<span style="color:#f92672">]</span>;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> (<span style="color:#66d9ef">int</span> i <span style="color:#f92672">=</span> 0; i <span style="color:#f92672"><</span> list1.<span style="color:#a6e22e">length</span>; i<span style="color:#f92672">++</span>)
</span></span><span style="display:flex;"><span> list1<span style="color:#f92672">[</span>i<span style="color:#f92672">]</span> <span style="color:#f92672">=</span> list2<span style="color:#f92672">[</span>i<span style="color:#f92672">]</span> <span style="color:#f92672">=</span> (<span style="color:#66d9ef">int</span>)(Math.<span style="color:#a6e22e">random</span>() <span style="color:#f92672">*</span> 10000000);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">long</span> startTime <span style="color:#f92672">=</span> System.<span style="color:#a6e22e">currentTimeMillis</span>();
</span></span><span style="display:flex;"><span> parallelMergeSort(list1); <span style="color:#75715e">// Invoke parallel merge sort</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">long</span> endTime <span style="color:#f92672">=</span> System.<span style="color:#a6e22e">currentTimeMillis</span>();
</span></span><span style="display:flex;"><span> System.<span style="color:#a6e22e">out</span>.<span style="color:#a6e22e">println</span>(<span style="color:#e6db74">"\nParallel time with "</span> <span style="color:#f92672">+</span>
</span></span><span style="display:flex;"><span> Runtime.<span style="color:#a6e22e">getRuntime</span>().<span style="color:#a6e22e">availableProcessors</span>() <span style="color:#f92672">+</span>
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">" processors is "</span> <span style="color:#f92672">+</span> (endTime <span style="color:#f92672">-</span> startTime) <span style="color:#f92672">+</span> <span style="color:#e6db74">" milliseconds"</span>);
</span></span><span style="display:flex;"><span> startTime <span style="color:#f92672">=</span> System.<span style="color:#a6e22e">currentTimeMillis</span>();
</span></span><span style="display:flex;"><span> MergeSort.<span style="color:#a6e22e">mergeSort</span>(list2); <span style="color:#75715e">// MergeSort is in Listing 23.5</span>
</span></span><span style="display:flex;"><span> endTime <span style="color:#f92672">=</span> System.<span style="color:#a6e22e">currentTimeMillis</span>();
</span></span><span style="display:flex;"><span> System.<span style="color:#a6e22e">out</span>.<span style="color:#a6e22e">println</span>(<span style="color:#e6db74">"\nSequential time is "</span> <span style="color:#f92672">+</span>
</span></span><span style="display:flex;"><span> (endTime <span style="color:#f92672">-</span> startTime) <span style="color:#f92672">+</span> <span style="color:#e6db74">" milliseconds"</span>);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">static</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">parallelMergeSort</span>(<span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> list) {
</span></span><span style="display:flex;"><span> RecursiveAction mainTask <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> SortTask(list);
</span></span><span style="display:flex;"><span> ForkJoinPool pool <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> ForkJoinPool();
</span></span><span style="display:flex;"><span> pool.<span style="color:#a6e22e">invoke</span>(mainTask);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">static</span> <span style="color:#66d9ef">class</span> <span style="color:#a6e22e">SortTask</span> <span style="color:#66d9ef">extends</span> RecursiveAction {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">final</span> <span style="color:#66d9ef">int</span> THRESHOLD <span style="color:#f92672">=</span> 500;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> list;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> SortTask(<span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> list) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">this</span>.<span style="color:#a6e22e">list</span> <span style="color:#f92672">=</span> list;
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">@Override</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">protected</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">compute</span>() {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (list.<span style="color:#a6e22e">length</span> <span style="color:#f92672"><</span> THRESHOLD)
</span></span><span style="display:flex;"><span> java.<span style="color:#a6e22e">util</span>.<span style="color:#a6e22e">Arrays</span>.<span style="color:#a6e22e">sort</span>(list);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Obtain the first half</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> firstHalf <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[</span>list.<span style="color:#a6e22e">length</span> <span style="color:#f92672">/</span> 2<span style="color:#f92672">]</span>;
</span></span><span style="display:flex;"><span> System.<span style="color:#a6e22e">arraycopy</span>(list, 0, firstHalf, 0, list.<span style="color:#a6e22e">length</span> <span style="color:#f92672">/</span> 2);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Obtain the second half</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> secondHalfLength <span style="color:#f92672">=</span> list.<span style="color:#a6e22e">length</span> <span style="color:#f92672">-</span> list.<span style="color:#a6e22e">length</span> <span style="color:#f92672">/</span> 2;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> secondHalf <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[</span>secondHalfLength<span style="color:#f92672">]</span>;
</span></span><span style="display:flex;"><span> System.<span style="color:#a6e22e">arraycopy</span>(list, list.<span style="color:#a6e22e">length</span> <span style="color:#f92672">/</span> 2, secondHalf, 0, secondHalfLength);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Recursively sort the two halves</span>
</span></span><span style="display:flex;"><span> invokeAll(<span style="color:#66d9ef">new</span> SortTask(firstHalf), <span style="color:#66d9ef">new</span> SortTask(secondHalf));
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Merge firstHalf with secondHalf into list</span>
</span></span><span style="display:flex;"><span> MergeSort.<span style="color:#a6e22e">merge</span>(firstHalf, secondHalf, list);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>
https://noneback.github.io/blog/gsoc/
Mon, 01 Jan 0001 00:00:00 +0000https://noneback.github.io/blog/gsoc/<h1 id="an-alternative-tuple-storage-engine-for-casbin-mesh--casbin--gsoc-2022-proposal">An alternative tuple-storage engine for Casbin Mesh / Casbin — GSOC 2022 Proposal</h1>
<h2 id="about-me">About me</h2>
<h3 id="basic-infomation">Basic Information</h3>
<ul>
<li>
<p>First / Last Name: Xie Kai</p>
</li>
<li>
<p>Email: <a href="mailto:[email protected]">[email protected]</a></p>
</li>
<li>
<p>QQ : 1633849228</p>
</li>
<li>
<p>School/University: <a href="https://en.wikipedia.org/wiki/Beijing_University_of_Posts_and_Telecommunications">Beijing University of Posts and Telecommunications</a></p>
</li>
<li>
<p>Graduation Date: July, 2022</p>
</li>
<li>
<p>Major/Focus: Software Engineering</p>
</li>
<li>
<p>Location: Beijing, China</p>
</li>
<li>
<p>Timezone: China Standard Time (CST), UTC +8</p>
</li>
<li>
<p>Github Profile: <a href="https://github.com/noneback">https://github.com/noneback</a></p>
</li>
<li>
<p>Personal Blog: <a href="http://noneback.github.io">http://noneback.github.io</a></p>
</li>
</ul>
<h3 id="open-source-experience">Open Source Experience</h3>
<p>I have contributed to the following open source projects:</p>
<ul>
<li>
<p><a href="https://github.com/matrixorigin/matrixone">MatrixOne</a> : Hyperconverged cloud-edge native database</p>
</li>
<li>
<p><a href="https://github.com/flamego/cache">flame-go: cache</a> : a middleware that provides the cache management for Flamego</p>
</li>
<li>
<p><a href="https://github.com/flamego/session">flame-go: session</a> : a middleware that provides the session management for Flamego</p>
</li>
<li>
<p><a href="https://github.com/casbin/casnode">casnode</a> : An open-source forum (BBS) software developed by Go and React</p>
</li>
<li>
<p><a href="https://github.com/noneback/Toys">Toys</a> : Toys written by myself.</p>
</li>
</ul>
<h3 id="other-information">Other Information</h3>
<ul>
<li>
<p>Currently, I am taking the MIT 6.824 and CMU 15-445 courses and have finished the MapReduce and Raft labs. I understand the basics of page layout, indexing (hash and B+ tree indexes), and multi-version concurrency control.</p>
</li>
<li>
<p>I interned in the Business Department and the Distributed Storage Department at ByteDance.</p>
</li>
</ul>
<h2 id="problem-description">Problem Description</h2>
<p>Currently, Casbin uses Go’s built-in map to maintain policies in main memory and persists them via the adapter abstraction.</p>
<p>As policy data grows, however, the rising cost of main-memory resources and the degraded performance make this memory-management strategy untenable. We need a better way to manage Casbin’s in-memory data at scale.</p>
<h2 id="implementation-plan">Implementation Plan</h2>
<h3 id="breif-design">Brief Design</h3>
<p>From my point of view, our main goals are to reduce memory cost while maintaining good performance for policy read and write requests.</p>
<p>To achieve these goals, we can introduce an experimental tuple storage engine to take charge of storing policies, turning the policy-management strategy from memory-oriented to disk-oriented. We can even build a better abstraction of the storage layer so that different engines (row, column) can serve different workloads.</p>
<p>In general, we can take the following parts into consideration:</p>
<ul>
<li>
<p><strong>API</strong> for upper layer</p>
<blockquote>
<p>Design API for the upper level.</p>
<p>The API is the key design if we want to make the storage engine a plugin.</p>
<p>Deriving it from the adapter interface should be sufficient.</p>
</blockquote>
</li>
<li>
<p>workload <strong>optimizer</strong></p>
<blockquote>
<p>Try to optimize those workloads to improve performance.</p>
<p>Key design:</p>
<ul>
<li>
<p>Estimation of requests cost</p>
</li>
<li>
<p>Strategy to reconstruct data access path.</p>
</li>
</ul>
</blockquote>
</li>
<li>
<p><strong>Buffer Pool</strong> management</p>
<blockquote>
<p>In-memory data structure management.</p>
<p>Key design: replace strategy</p>
</blockquote>
</li>
<li>
<p><strong>Indexing</strong></p>
<blockquote>
<p>Indexes accelerate read and write requests. The key design questions are which index types to use and how to build indexes over policies to improve performance.</p>
<p>Considering Casbin’s common workloads, we can provide B+tree and hash structures for indexing.</p>
</blockquote>
</li>
<li>
<p><strong>Data Storage Structures</strong></p>
<blockquote>
<p>Key design:</p>
<ul>
<li>
<p>File organization, e.g., B+tree sequential file organization.</p>
</li>
<li>
<p>Page and tuple layout. Our policy data is essentially varchar, so our tuples are variable-length records; we can use a slotted-page structure to organize them.</p>
</li>
<li>
<p>Encoding: row-based or column-based.</p>
</li>
</ul>
</blockquote>
</li>
<li>
<p><strong>Transaction</strong> if necessary</p>
<blockquote>
<p>We can use MVCC to improve concurrent performance.</p>
</blockquote>
</li>
</ul>
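<p>The slotted-page idea from the data storage bullet above can be sketched as follows. This is a minimal illustration under assumed names (<code>page</code>, <code>insert</code>, <code>get</code>, a 4&nbsp;KiB page), not the actual on-disk format: a slot directory of (offset, length) pairs grows forward from the front of the page while variable-length tuple bytes grow backward from the end.</p>

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const pageSize = 4096

// page is a minimal slotted-page sketch. The slot directory starts at
// byte offset 4 (the first 4 header bytes are reserved conceptually;
// here the header lives in the struct fields for simplicity). Each slot
// is 4 bytes: a 2-byte tuple offset and a 2-byte tuple length.
type page struct {
	data     [pageSize]byte
	numSlots uint16
	freeEnd  uint16 // tuples occupy data[freeEnd:]
}

func newPage() *page { return &page{freeEnd: pageSize} }

// insert appends one variable-length tuple and returns its slot id.
func (p *page) insert(tuple []byte) (int, error) {
	slotEnd := 4 + int(p.numSlots)*4
	need := len(tuple) + 4 // tuple bytes plus one new slot entry
	if int(p.freeEnd)-slotEnd < need {
		return 0, fmt.Errorf("page full")
	}
	// Tuple bytes grow backward from the end of the page.
	p.freeEnd -= uint16(len(tuple))
	copy(p.data[p.freeEnd:], tuple)
	// The slot directory grows forward from the front.
	off := 4 + int(p.numSlots)*4
	binary.LittleEndian.PutUint16(p.data[off:], p.freeEnd)
	binary.LittleEndian.PutUint16(p.data[off+2:], uint16(len(tuple)))
	p.numSlots++
	return int(p.numSlots) - 1, nil
}

// get reads the tuple stored in the given slot.
func (p *page) get(slot int) []byte {
	off := 4 + slot*4
	start := binary.LittleEndian.Uint16(p.data[off:])
	length := binary.LittleEndian.Uint16(p.data[off+2:])
	return p.data[start : start+length]
}

func main() {
	p := newPage()
	s, _ := p.insert([]byte("alice,data1,read"))
	fmt.Println(string(p.get(s))) // alice,data1,read
}
```

<p>Because each tuple is addressed through its slot entry, tuples can later be compacted within the page without changing their slot ids, which is the usual motivation for the slotted-page layout.</p>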
<h3 id="reference--resource">Reference & Resource</h3>
<h4 id="codebase">Codebase</h4>
<ul>
<li>
<p>Bustub codebase</p>
</li>
<li>
<p>MIT-6.830 Simpledb Codebase</p>
</li>
<li>
<p>Risinglight DB</p>
</li>
<li>
<p>Badger DB</p>
</li>
</ul>
<h4 id="paper">Paper</h4>
<ul>
<li>
<p>An Empirical Evaluation of In-Memory Multi-Version Concurrency Control</p>
</li>
<li>
<p>Column-Stores vs. Row-Stores: How Different Are They Really?</p>
</li>
</ul>
<h4 id="other">Other</h4>
<ul>
<li>
<p>Database System Concepts</p>
</li>
<li>
<p>CMU-15445 DB Course</p>
</li>
<li>
<p>CMU-15721 DB Course</p>
</li>
</ul>
<h2 id="timeline">Timeline</h2>
<h3 id="before-the-official-coding-time">Before the official coding time</h3>
<h4 id="may-1---may-23">May 1 - May 23</h4>
<ul>
<li>
<p>Learn more about Casbin source code and Casbin Community and try to solve some basic issues on the codebase.</p>
</li>
<li>
<p>Have a discussion with the mentor to determine what feature we need to add and make a basic design overview of the project.</p>
</li>
<li>
<p>Do research about how to implement our project best and write a detailed design document about it.</p>
</li>
</ul>
<h4 id="may-24---june-14">May 24 - June 14</h4>
<ul>
<li>
<p>Carefully design and write the basic framework of our whole project.</p>
</li>
<li>
<p>Write UT for framework code.</p>
</li>
</ul>
<h3 id="official-coding-period-starts">Official coding period starts</h3>
<h4 id="june-15---june-28">June 15 - June 28</h4>
<ul>
<li>
<p>Write code about data encoding and page tuple layout (disk manager)</p>
</li>
<li>
<p>Write UT for data storage layer</p>
</li>
</ul>
<h4 id="june-29---july-13">June 29 - July 13</h4>
<ul>
<li>
<p>Implement a module about buffer pool and Index management.</p>
</li>
<li>
<p>Write UT for the buffer pool and indexing.</p>
</li>
</ul>
<h4 id="july-14---july-28">July 14 - July 28</h4>
<ul>
<li>
<p>Implement the API and workload optimizer for the upper layer.</p>
</li>
<li>
<p>Optimize for workloads from the upper level.</p>
</li>
</ul>
<h4 id="july-29---august-5">July 29 - August 5</h4>
<ul>
<li>
<p>Implement Transaction Module</p>
</li>
<li>
<p>Write UT for transaction part</p>
</li>
</ul>
<h4 id="august-6---august-13">August 6 - August 13</h4>
<ul>
<li>
<p>Polish our project</p>
</li>
<li>
<p>CI integration and documentation work; finish polishing the documents.</p>
</li>
</ul>
<h4 id="extra-time">Extra Time</h4>
<ul>
<li>A buffer kept for any unpredictable delay</li>
</ul>
<h2 id="deliverables">Deliverables</h2>
<p>A Casbin built-in embedded disk-oriented tuple storage engine.</p>
<p>The engine should contain:</p>
<ul>
<li>
<p>A carefully designed API for the upper Casbin internal modules.</p>
</li>
<li>
<p>Storage management, including file organization and page layout.</p>
</li>
<li>
<p>Buffer pool management.</p>
</li>
<li>
<p>Index management.</p>
</li>
<li>
<p>A workload optimizer for the upper layer.</p>
</li>
<li>
<p>A transaction module, if necessary.</p>
</li>
</ul>
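<p>Among the deliverables, buffer pool management hinges on the replacement strategy noted as its key design. A minimal LRU replacer sketch follows; all names (<code>lruReplacer</code>, <code>touch</code>, <code>victim</code>) are hypothetical, and a real buffer pool would also track pin counts and dirty flags.</p>

```go
package main

import (
	"container/list"
	"fmt"
)

// lruReplacer tracks unpinned frame ids in least-recently-used order.
// victim() evicts the LRU frame when the buffer pool needs a free slot.
type lruReplacer struct {
	order *list.List            // front = most recently used
	where map[int]*list.Element // frame id -> node in order
}

func newLRUReplacer() *lruReplacer {
	return &lruReplacer{order: list.New(), where: map[int]*list.Element{}}
}

// touch marks a frame as just used (called on every page access).
func (r *lruReplacer) touch(frame int) {
	if el, ok := r.where[frame]; ok {
		r.order.MoveToFront(el)
		return
	}
	r.where[frame] = r.order.PushFront(frame)
}

// victim removes and returns the least recently used frame, if any.
func (r *lruReplacer) victim() (int, bool) {
	back := r.order.Back()
	if back == nil {
		return 0, false
	}
	r.order.Remove(back)
	frame := back.Value.(int)
	delete(r.where, frame)
	return frame, true
}

func main() {
	r := newLRUReplacer()
	r.touch(1)
	r.touch(2)
	r.touch(1) // frame 2 is now least recently used
	v, _ := r.victim()
	fmt.Println(v) // 2
}
```

<p>Other policies (e.g., clock or LRU-K) could be swapped in behind the same two-method contract, which is why the replacement strategy is isolated as its own key design point.</p>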