NoneBack
https://noneback.github.io/
Recent content on NoneBack · Hugo -- gohugo.io · en · @NoneBack All rights reserved · Mon, 10 Mar 2025 14:46:54 +0800
CPU Profiling: What, How, and When
https://noneback.github.io/blog/cpu-profiling-what-how-and-when/
Mon, 10 Mar 2025 14:46:54 +0800
<h2 id="what-what-is-cpu-profiling">What: What is CPU Profiling</h2>
<p>CPU profiling is a technique for analyzing a program&rsquo;s CPU performance. By collecting detailed data during execution (function call frequency, time spent, call stacks, etc.), it helps developers locate performance bottlenecks and optimize code. It is typically used for performance analysis and root-cause diagnosis.</p>
<h2 id="how-how-profiling-data-is-collected">How: How Profiling Data is Collected</h2>
<p>Common tools such as <code>perf</code> collect process stack information by sampling: they periodically capture the stack currently executing on the CPU, and the aggregated samples are then used for performance analysis.</p>
<pre tabindex="0"><code class="language-mermaid" data-lang="mermaid">graph TD
A[Sampling Trigger] -->|Interrupt| B[Sampling]
B -->|perf_event/ebpf| C[Process Stack Addresses]
C -->|Address Translation| D[ELF, OFFSET]
D -->|Symbol Resolution| E[Call Stack]
E -->|Formatting| F[pprof/perf script]
F --> |Visualization| G[Flame Graph/Call Graph]
</code></pre><h3 id="trigger-mechanisms">Trigger Mechanisms</h3>
<p>Sampling is generally triggered either by timer interrupts or by event-counter thresholds.</p>
<h4 id="timer-interrupts">Timer Interrupts</h4>
<p>A clock interrupt (e.g., <code>SIGPROF</code>) fires at a fixed frequency; shorter intervals increase precision but also overhead. Linux perf is commonly run at 99Hz (≈10.1ms intervals).</p>
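<p>The timer-interrupt idea can be illustrated in miniature with a process-local profiling timer. The sketch below is an illustration, not how perf itself is implemented: it installs a <code>SIGPROF</code> handler and arms <code>ITIMER_PROF</code> at roughly 99Hz, so each signal delivery corresponds to one sample of consumed CPU time. The function name is made up for this example.</p>

```cpp
#include <atomic>
#include <csignal>
#include <sys/time.h>

static std::atomic<int> g_samples{0};

// Each SIGPROF delivery is one "sample". A real profiler would capture
// the interrupted instruction pointer / stack here instead of counting.
static void on_sigprof(int) { g_samples.fetch_add(1, std::memory_order_relaxed); }

// Install a ~99Hz CPU-time timer, burn some CPU, return the sample count.
int count_sigprof_samples(long iterations) {
  g_samples.store(0);
  std::signal(SIGPROF, on_sigprof);

  itimerval tv{};
  tv.it_interval.tv_usec = 10101;  // 1s / 99 ≈ 10.1ms
  tv.it_value.tv_usec = 10101;
  setitimer(ITIMER_PROF, &tv, nullptr);  // fires as CPU time is consumed

  volatile double sink = 0;  // volatile so the loop is not optimized away
  for (long i = 0; i < iterations; ++i) sink += i;

  itimerval off{};  // disarm the timer before returning
  setitimer(ITIMER_PROF, &off, nullptr);
  return g_samples.load();
}
```

<p>99Hz (rather than a round 100Hz) is commonly chosen so sampling does not run in lockstep with periodic 100Hz activity in the system.</p>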
<h4 id="event-counter-sampling">Event-Counter Sampling</h4>
<p>Sampling is triggered when a hardware performance counter (e.g., <code>PERF_COUNT_HW_CPU_CYCLES</code>) reaches a configured threshold. This is useful for analyzing hardware-level events such as cache misses.</p>
<h3 id="sampling-methods">Sampling Methods</h3>
<p>Stack sampling typically uses OS-kernel-provided interfaces such as eBPF or perf_event.</p>
<h4 id="ebpf-approach">eBPF Approach</h4>
<p>Using eBPF programs (e.g., <code>bpf_get_stackid</code>), both user-space and kernel-space call stacks can be captured directly, without a separate user-space unwinding step. This method retrieves the complete stack&rsquo;s instruction-pointer (IP) information.</p>
<h4 id="perf_event-approach">perf_event Approach</h4>
<p>The <code>perf_event_open</code> interface (used by the <code>perf record</code> command) captures the instruction pointer (RIP) at each sample. By default it records only the address currently executing, not the full call stack, so only the function hit by the sample can be resolved.</p>
<p>Example perf record output:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>node <span style="color:#ae81ff">3236535</span> 34397396.208842: <span style="color:#ae81ff">250000</span> cpu-clock:pppH: 110c800 v8::internal::Heap_CombinedGenerationalAndSharedBarrierSlow+0x0 <span style="color:#f92672">(</span>/root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span>node <span style="color:#ae81ff">3236535</span> 34397396.354632: <span style="color:#ae81ff">250000</span> cpu-clock:pppH: 7f7d63e87ef4 Builtins_LoadIC+0x574 <span style="color:#f92672">(</span>/root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node<span style="color:#f92672">)</span>
</span></span></code></pre></div><p>To obtain a full call stack, tools like libunwind perform stack unwinding. For example, <code>perf record -g</code> generates a full stack trace by unwinding the stack frames.</p>
<p>Example perf record -g output:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>node <span style="color:#ae81ff">3236535</span> 34397238.259753: <span style="color:#ae81ff">250000</span> cpu-clock:pppH:
</span></span><span style="display:flex;"><span> 7f7d44339100 <span style="color:#f92672">[</span>unknown<span style="color:#f92672">]</span> <span style="color:#f92672">(</span>/tmp/perf-3236535.map<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span> 18ea0dc Builtins_JSEntryTrampoline+0x5c <span style="color:#f92672">(</span>/root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span> 18e9e03 Builtins_JSEntry+0x83 <span style="color:#f92672">(</span>...<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> c7d43f node::Start+0x58f <span style="color:#f92672">(</span>...<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span> 7f7d6ba14d90 __libc_start_call_main+0x80 <span style="color:#f92672">(</span>/usr/lib/x86_64-linux-gnu/libc.so.6<span style="color:#f92672">)</span>
</span></span></code></pre></div><h3 id="address-translation">Address Translation</h3>
<p>The sampled address information corresponds to the process’s virtual addresses, such as:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>7f7d44339100
</span></span><span style="display:flex;"><span>18ea0dc
</span></span><span style="display:flex;"><span>18e9e03
</span></span><span style="display:flex;"><span>106692b
</span></span><span style="display:flex;"><span>10679c4
</span></span><span style="display:flex;"><span>f2a090d
</span></span><span style="display:flex;"><span>c1c738
</span></span><span style="display:flex;"><span>...
</span></span></code></pre></div><p>To resolve these addresses into ELF + OFFSET for symbol translation, we use the memory mapping information from <code>/proc/[pid]/maps</code>. The key fields in the maps file include:</p>
<p>Each maps entry lists the address range, permissions, file offset, device, inode, and the backing file path.</p>
<p>Example /proc/[pid]/maps entries:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>00400000-00b81000 r--p <span style="color:#ae81ff">00000000</span> fc:03 <span style="color:#ae81ff">550055</span> /root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node
</span></span><span style="display:flex;"><span>7f7d6bf3c000-7f7d6bf3d000 ---p 0021a000 fc:03 <span style="color:#ae81ff">67</span> /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
</span></span><span style="display:flex;"><span>7f7d6bf61000-7f7d6bf63000 r--p <span style="color:#ae81ff">00000000</span> fc:03 <span style="color:#ae81ff">2928</span> /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
</span></span></code></pre></div><h4 id="translation-process">Translation Process</h4>
<ol>
<li>Match the virtual address to the appropriate memory segment in <code>/proc/[pid]/maps</code>.</li>
<li>Calculate the offset within the ELF file using:
<code>offset = virtual_address - segment_start + file_offset</code></li>
</ol>
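<p>As a sketch of these two steps (with a hand-rolled parser for the maps-line layout shown above; the struct and function names are invented for this example), the following matches a sampled virtual address against a parsed <code>/proc/[pid]/maps</code> segment and computes the ELF file offset:</p>

```cpp
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct Mapping {
  uint64_t start, end, file_offset;
  std::string path;  // ELF file backing the segment
};

// Parse one /proc/[pid]/maps line, e.g.
// "00400000-00b81000 r--p 00000000 fc:03 550055 /path/to/node"
Mapping parse_maps_line(const std::string& line) {
  Mapping m;
  std::istringstream in(line);
  char dash;
  std::string perms, dev, inode;
  in >> std::hex >> m.start >> dash >> m.end;           // address range
  in >> perms >> std::hex >> m.file_offset;             // perms, file offset
  in >> dev >> inode >> m.path;                         // device, inode, path
  return m;
}

// Step 1: find the segment containing the address.
// Step 2: offset = virtual_address - segment_start + file_offset.
std::optional<std::pair<std::string, uint64_t>> translate(
    uint64_t vaddr, const std::vector<Mapping>& maps) {
  for (const auto& m : maps)
    if (vaddr >= m.start && vaddr < m.end)
      return std::make_pair(m.path, vaddr - m.start + m.file_offset);
  return std::nullopt;  // address not backed by any mapping
}
```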
<h3 id="symbol-resolution">Symbol Resolution</h3>
<p>After translating virtual addresses into <code>ELF + OFFSET</code> pairs, the next step is resolving these offsets into human-readable function symbols. This involves leveraging symbol tables or debugging information embedded in the ELF files.</p>
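<p>Conceptually, resolution is a range lookup: the symbol table maps each function&rsquo;s start offset to its name, and a sampled offset resolves to the nearest preceding symbol plus a delta, which is exactly the <code>name+0x...</code> form perf prints. A minimal sketch (the symbol values here are hypothetical, not from a real binary):</p>

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// symbol start offset -> name, as extracted from .symtab / .dynsym
using SymbolTable = std::map<uint64_t, std::string>;

// Resolve an ELF offset to "name+0xdelta", like perf's output.
std::string resolve(const SymbolTable& syms, uint64_t offset) {
  auto it = syms.upper_bound(offset);  // first symbol starting AFTER offset
  if (it == syms.begin()) return "[unknown]";
  --it;                                // nearest symbol at or before offset
  char buf[32];
  std::snprintf(buf, sizeof(buf), "+0x%llx",
                (unsigned long long)(offset - it->first));
  return it->second + buf;
}
```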
<h4 id="methods-for-symbol-resolution">Methods for Symbol Resolution</h4>
<ol>
<li>Using symbol tables: tools like <code>nm</code> can extract symbol information from the <code>.dynsym</code> (dynamic symbol table) or <code>.symtab</code> (static symbol table) sections of an ELF file.</li>
</ol>
<p>Example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Extract malloc-related symbols from a Node.js binary</span>
</span></span><span style="display:flex;"><span>nm -D /path/to/node | grep malloc
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Output:</span>
</span></span><span style="display:flex;"><span>00000000055f9d18 D ares_malloc
</span></span><span style="display:flex;"><span>0000000001f1a2a0 T ares_malloc_data
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> U malloc@GLIBC_2.2.5
</span></span></code></pre></div><ol start="2">
<li>Using DWARF debugging information: DWARF debug data provides richer details, including source file locations and variable scopes. Tools like <code>readelf</code> or <code>addr2line</code> can parse this information.</li>
</ol>
<p>Example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Extract function names and source locations from DWARF info</span>
</span></span><span style="display:flex;"><span>readelf --debug-dump<span style="color:#f92672">=</span>info /path/to/node | grep <span style="color:#e6db74">"DW_AT_name"</span> -A3
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Output:</span>
</span></span><span style="display:flex;"><span><1><1980>: DW_AT_name: uv__make_close_pending
</span></span><span style="display:flex;"><span> DW_AT_decl_file: <span style="color:#ae81ff">19</span>
</span></span><span style="display:flex;"><span> DW_AT_decl_line: <span style="color:#ae81ff">247</span>
</span></span></code></pre></div><ol start="3">
<li>Demangling C++ symbols: C++ symbols are mangled (encoded) for uniqueness. Tools like <code>c++filt</code> restore human-readable names.</li>
</ol>
<p>Example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Demangle a mangled symbol</span>
</span></span><span style="display:flex;"><span>echo <span style="color:#e6db74">"_ZN4node14ThreadPoolWork12ScheduleWorkEv"</span> | c++filt
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Output:</span>
</span></span><span style="display:flex;"><span>node::ThreadPoolWork::ScheduleWork<span style="color:#f92672">()</span>
</span></span></code></pre></div><h3 id="stack-output-formatting">Stack Output Formatting</h3>
<p>Resolved stack traces are formatted for analysis tools like pprof or perf script. Additional metadata (e.g., container ID, service type) may be included for aggregation.</p>
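<p>For instance, the widely used &ldquo;folded&rdquo; stack format that flame-graph tooling consumes is simply each semicolon-joined stack with its sample count. A sketch of that aggregation step (the frame names are illustrative):</p>

```cpp
#include <map>
#include <string>
#include <vector>

// Collapse resolved stacks (root..leaf) into folded counts:
// "main;parse;malloc" -> 2, i.e. one entry per unique stack.
std::map<std::string, int> fold(
    const std::vector<std::vector<std::string>>& samples) {
  std::map<std::string, int> counts;
  for (const auto& stack : samples) {
    std::string key;
    for (const auto& frame : stack)
      key += (key.empty() ? "" : ";") + frame;  // join frames with ';'
    ++counts[key];                              // one sample for this stack
  }
  return counts;
}
```

<p>Flame-graph width is then proportional to each folded line&rsquo;s count.</p>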
<h3 id="data-visualization">Data Visualization</h3>
<p>All of the data above is eventually rendered as a flame graph or a call-chain graph.</p>
<h2 id="when-when-to-use-cpu-profiling-tools">When: When to Use CPU Profiling Tools</h2>
<p>CPU profiling is most effective when analyzing CPU-bound performance issues. Below are common scenarios and their workflows:</p>
<pre tabindex="0"><code class="language-mermaid" data-lang="mermaid">graph TD
A[Observe anomaly: Unavailability/Performance Jitter] --> B[Identify target process & timeframe]
B --> C[Check core metrics: CPU, memory, disk, QPS]
C --> D{Is CPU the bottleneck?}
D -->|Yes| E[Profile CPU stacks]
D -->|No| F[Use alternative tools e.g., memory profiler, I/O tracer]
E --> G[Analyze flame graphs/call chains]
G --> H[Root cause identified]
</code></pre><table>
<thead>
<tr>
<th>Scenario Category</th>
<th>Typical Symptoms</th>
<th>Tool Choices</th>
<th>Data Collection Strategy</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Sudden CPU Spikes</strong></td>
<td>Sawtooth-shaped CPU peaks in monitoring charts.</td>
<td>Continuous Profiling Systems</td>
<td>Capture 5-minute context before/after spikes + regular sampling.</td>
</tr>
<tr>
<td><strong>Version Performance Regression</strong></td>
<td>QPS/TPS drops post-deployment.</td>
<td>Differential FlameGraph</td>
<td>A/B version comparison sampling under identical loads.</td>
</tr>
<tr>
<td><strong>High CpuSys</strong></td>
<td>Elevated OS kernel CPU usage causing host instability.</td>
<td>FlameGraph/Call-Chain Graph</td>
<td>Regular sampling with kernel stack analysis.</td>
</tr>
</tbody>
</table>
<h3 id="when-cpu-profiling-is-not-suitable">When CPU Profiling Is NOT Suitable</h3>
<p>For non-CPU-bound issues, profiling data may have limited value. Alternative tools are recommended:</p>
<pre tabindex="0"><code class="language-mermaid" data-lang="mermaid">graph TD
A[CPU Profiling Limitations] --> B[Memory Bottlenecks]
A --> C[I/O-Bound Workloads]
A --> D[Lock Contention]
A --> E[Short-lived Processes]
B -->|Signs| B1(High page faults, GC pauses)
B -->|Tools| B2{{Heap profiler: e.g., pprof, vmstat}}
C -->|Signs| C1(High iowait, low CPU utilization)
C -->|Tools| C2{{iostat, blktrace}}
D -->|Signs| D1(High context switches, sys%)
D -->|Tools| D2{{perf lock, lockstat}}
E -->|Signs| E1(Process lifetime < sampling interval)
E -->|Tools| E2{{execsnoop, dynamic tracing: e.g., bpftrace}}
</code></pre><h2 id="references">References</h2>
<ul>
<li>
<p>code example: <a href="https://github.com/noneback/doctor">https://github.com/noneback/doctor</a></p>
</li>
<li>
<p>stack unwind: <a href="https://zhuanlan.zhihu.com/p/460686470">https://zhuanlan.zhihu.com/p/460686470</a></p>
</li>
<li>
<p>proc_pid_maps: <a href="https://man7.org/linux/man-pages/man5/proc_pid_maps.5.html">https://man7.org/linux/man-pages/man5/proc_pid_maps.5.html</a></p>
</li>
<li>
<p>dwarf: <a href="https://www.hitzhangjie.pro/debugger101.io/8-dwarf/">https://www.hitzhangjie.pro/debugger101.io/8-dwarf/</a></p>
</li>
<li>
<p>demangle & mangle: <a href="https://www.cnblogs.com/BloodAndBone/p/7912179.html">https://www.cnblogs.com/BloodAndBone/p/7912179.html</a></p>
</li>
</ul>
LevelDB MVCC
https://noneback.github.io/blog/leveldb-mvcc/
Sat, 08 Feb 2025 14:06:39 +0800
<p>LevelDB implements concurrent sstable read/write operations and snapshot reads through MVCC. Let’s examine its implementation.</p>
<h2 id="sequence-number">Sequence Number</h2>
<p>LevelDB uses Sequence Numbers as logical clocks to maintain a total order of KV write operations. The Sequence Number is encoded in the last few bytes of the InternalKey. This encoding ensures data ordering during memory writes.</p>
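<p>A sketch of that encoding (following LevelDB&rsquo;s InternalKey layout: the user key followed by a little-endian 64-bit tag packing a 56-bit sequence number and an 8-bit value type; the helper names here are illustrative, not LevelDB&rsquo;s own):</p>

```cpp
#include <cstdint>
#include <string>

enum ValueType : uint8_t { kTypeDeletion = 0, kTypeValue = 1 };

// Pack seq (56 bits) and type (8 bits) into a trailing tag,
// appended little-endian after the user key.
std::string make_internal_key(const std::string& user_key,
                              uint64_t seq, ValueType type) {
  uint64_t tag = (seq << 8) | type;
  std::string ikey = user_key;
  for (int i = 0; i < 8; ++i) ikey.push_back(char((tag >> (8 * i)) & 0xff));
  return ikey;
}

// Recover the sequence number from the trailing 8-byte tag.
uint64_t extract_sequence(const std::string& ikey) {
  uint64_t tag = 0;
  size_t n = ikey.size();
  for (int i = 0; i < 8; ++i)
    tag |= uint64_t(uint8_t(ikey[n - 8 + i])) << (8 * i);
  return tag >> 8;
}
```

<p>Because the tag sorts with the key, entries for the same user key order by sequence number, which is what makes snapshot lookups possible.</p>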
<p><img src="https://leveldb-handbook.readthedocs.io/zh/latest/_images/internalkey.jpeg"></p>
<h2 id="versioning">Versioning</h2>
<p>Every change to the sstable file collection triggers a version upgrade in LevelDB. Each Version represents the database state at a specific moment, containing sstable metadata and compaction-related information; a VersionEdit records the change between versions.</p>
<pre tabindex="0"><code>Version1 ---VersionEdit--> Version2
</code></pre><p>The VersionSet is represented as an ordered linked list of Versions, reflecting the database’s current and historical states. The LastSeq (last sequence number) and Version linked list are critical components.</p>
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/202502121042060.png"></p>
<p>The Version linked list tracks all stored Versions and their changes, with reference counting (RC) used for garbage collection. The Version Chain describes the total order of sstable write operations across different times.</p>
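<p>The reference-counting idea can be sketched as follows (simplified: LevelDB&rsquo;s <code>Version::Unref</code> additionally unlinks the version from the list under a mutex; the <code>deleted_flag</code> here is just a test hook standing in for &ldquo;this version&rsquo;s sstables may now be reclaimed&rdquo;):</p>

```cpp
// A simplified ref-counted Version: readers pin the Version they use,
// and it is only reclaimed once no reader holds a reference.
struct Version {
  int refs = 0;
  bool* deleted_flag;  // set when the version is reclaimed (test hook)

  explicit Version(bool* flag) : deleted_flag(flag) {}
  void Ref() { ++refs; }
  void Unref() {
    if (--refs == 0) {
      *deleted_flag = true;  // safe to GC obsolete sstables now
      delete this;
    }
  }
};
```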
<p>WAL + Manifest ensures atomic LastSeq updates. The Manifest file acts as a WAL for version changes, working with Version Commit operations to guarantee atomic updates of the latest version in VersionSet.</p>
<p>During version transitions:</p>
<ol>
<li>Write operations generate VersionEdit records in memory</li>
<li>Changes are written to the Manifest</li>
<li>VersionSet updates to the new Version</li>
</ol>
<p>This process ensures version consistency: compaction-induced version changes won’t affect ongoing read operations, and readers never access intermediate sstable states.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">VersionEdit</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">/** Other code */</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">typedef</span> std<span style="color:#f92672">::</span>set<span style="color:#f92672"><</span>std<span style="color:#f92672">::</span>pair<span style="color:#f92672"><</span><span style="color:#66d9ef">int</span>, <span style="color:#66d9ef">uint64_t</span><span style="color:#f92672">>></span> DeletedFileSet;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>string comparator_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> prev_log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> next_file_number_;
</span></span><span style="display:flex;"><span> SequenceNumber last_sequence_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_comparator_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_prev_log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_next_file_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_last_sequence_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>vector<span style="color:#f92672"><</span>std<span style="color:#f92672">::</span>pair<span style="color:#f92672"><</span><span style="color:#66d9ef">int</span>, InternalKey<span style="color:#f92672">>></span> compact_pointers_;
</span></span><span style="display:flex;"><span> DeletedFileSet deleted_files_;
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>vector<span style="color:#f92672"><</span>std<span style="color:#f92672">::</span>pair<span style="color:#f92672"><</span><span style="color:#66d9ef">int</span>, FileMetaData<span style="color:#f92672">>></span> new_files_;
</span></span><span style="display:flex;"><span>};
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Version</span> {
</span></span><span style="display:flex;"><span> VersionSet<span style="color:#f92672">*</span>vset_;
</span></span><span style="display:flex;"><span> Version<span style="color:#f92672">*</span> next_;
</span></span><span style="display:flex;"><span> Version<span style="color:#f92672">*</span> prev_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> refs_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>vector<span style="color:#f92672"><</span>FileMetaData<span style="color:#f92672">*></span> files_[config<span style="color:#f92672">::</span>kNumLevels];
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> FileMetaData<span style="color:#f92672">*</span> file_to_compact_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> file_to_compact_level_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">double</span> compaction_score_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> compaction_level_;
</span></span><span style="display:flex;"><span>};
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">VersionSet</span> {
</span></span><span style="display:flex;"><span> Env<span style="color:#f92672">*</span><span style="color:#66d9ef">const</span> env_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">const</span> std<span style="color:#f92672">::</span>string dbname_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">const</span> Options<span style="color:#f92672">*</span> <span style="color:#66d9ef">const</span> options_;
</span></span><span style="display:flex;"><span> TableCache<span style="color:#f92672">*</span> <span style="color:#66d9ef">const</span> table_cache_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">const</span> InternalKeyComparator icmp_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> next_file_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> manifest_file_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> last_sequence_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> prev_log_number_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> WritableFile<span style="color:#f92672">*</span> descriptor_file_;
</span></span><span style="display:flex;"><span> log<span style="color:#f92672">::</span>Writer<span style="color:#f92672">*</span> descriptor_log_;
</span></span><span style="display:flex;"><span> Version dummy_versions_;
</span></span><span style="display:flex;"><span> Version<span style="color:#f92672">*</span> current_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>string compact_pointer_[config<span style="color:#f92672">::</span>kNumLevels];
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="mvcc--snapshot-read">MVCC & Snapshot Read</h2>
<p>MVCC primarily resolves concurrent read and write conflicts on SSTables.</p>
<ul>
<li><strong>Memtable operations</strong>: reads and writes share a skip list, where conflicts naturally arise; these are handled by the skip list itself rather than by MVCC.</li>
<li><strong>SSTable operations</strong>: reads and writes (compaction and read operations) do not interfere with each other. Each write operation carries a Sequence Number, SSTables are only appended to, and compaction merely merges SSTables into new files.</li>
</ul>
<p>Each read operation is associated with a Sequence Number and Version. The Sequence Number ensures that subsequent writes are invisible to the read operation, and the associated Version ensures that the SSTables used during the read are not garbage collected. This guarantees that the read operation does not encounter changes to the SSTables it uses.</p>
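<p>The visibility rule itself is simple: a read taken at snapshot sequence <code>S</code> sees, for each user key, the newest entry whose sequence number is ≤ <code>S</code>. A sketch of that rule (the entry layout is simplified for illustration):</p>

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct Entry { uint64_t seq; std::string key, value; };

// Return the newest value for `key` visible at snapshot `snap_seq`:
// the entry with the largest seq that is still <= snap_seq.
std::optional<std::string> snapshot_get(const std::vector<Entry>& entries,
                                        const std::string& key,
                                        uint64_t snap_seq) {
  const Entry* best = nullptr;
  for (const auto& e : entries)
    if (e.key == key && e.seq <= snap_seq && (!best || e.seq > best->seq))
      best = &e;
  if (!best) return std::nullopt;  // key unwritten as of this snapshot
  return best->value;
}
```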
<p>If a read and a compaction operate on the same Version, the compaction completes first and only then installs a new Version, so the Version seen by the read never changes mid-read and the compaction cannot interfere with it. In every other case the read begins before the write, making the write invisible to the read and eliminating conflicts.</p>
<h2 id="reference">Reference</h2>
<p><a href="https://leveldb-handbook.readthedocs.io/en/latest/index.html">https://leveldb-handbook.readthedocs.io/en/latest/index.html</a></p>
<p><a href="https://noneback.github.io/en/blog/en/leveldb-write/">https://noneback.github.io/en/blog/en/leveldb-write/</a></p>
Prometheus--TSDB
https://noneback.github.io/blog/prometheus-tsdb/
Tue, 31 Dec 2024 01:10:28 +0800
<p>Having recently been promoted, I took a moment to summarize some of my previous work. A significant part of my job was building large-scale database observability systems, which are quite different from cloud-native monitoring solutions like Prometheus. Now I’m diving into the standard open-source monitoring system.</p>
<p>This article mainly discusses the built-in single-node time series database (TSDB) of Prometheus, outlining its TSDB design without delving into source code analysis.</p>
<p>Analyzing the source code of such projects is often of limited value unless one specializes in TSDBs: the details are quickly forgotten, and the code itself may not be exceptional.</p>
<h2 id="data--query-model">Data + Query Model</h2>
<ol>
<li>
<p>A <strong>single</strong> monitoring metric is described as a structure of time-dependent data, a timeseries.</p>
<p>$$
{timeseries} = \quad\lbrace \quad metric(attached\ with\ a\ set\ of\ labels) \Rightarrow (t_0,\ v_0),\ (t_1,\ v_1),\ (t_2,\ v_2),\ \ldots,\ (t_n, v_n) \quad\rbrace
$$</p>
</li>
<li>
<p>Queries utilize the ${identifier(metric\ +\ sets\ of\ selected\ labels\ value)}$ to retrieve the corresponding timeseries.</p>
</li>
</ol>
<pre tabindex="0"><code>series
^
│ . . . . . . . . . . . . . . . . . . . . . . {__name__="request_total", method="GET"}
│ . . . . . . . . . . . . . . . . . . . . . . {__name__="request_total", method="POST"}
│ . . . . . . .
│ . . . . . . . . . . . . . . . . . . . ...
│ . . . . . . . . . . . . . . . . . . . . .
│ . . . . . . . . . . . . . . . . . . . . . {__name__="errors_total", method="POST"}
│ . . . . . . . . . . . . . . . . . {__name__="errors_total", method="GET"}
│ . . . . . . . . . . . . . .
│ . . . . . . . . . . . . . . . . . . . ...
│ . . . . . . . . . . . . . . . . . . . .
v
<-------------------- time --------------------->
</code></pre><blockquote>
<p>The choice of identifier is crucial. Poor labeling can cause the number of timeseries to explode (high cardinality), especially in scenarios where containers are frequently rebuilt.</p>
</blockquote>
<h2 id="data-organization">Data Organization</h2>
<p>For cloud-native scenarios, what characteristics do monitoring data have?</p>
<ol>
<li>
<p>Short data lifecycle. The lifespan of individual containers is brief (e.g., in scaling scenarios or during extensive temporary tasks), leading to rapid timeseries growth along certain time dimensions.</p>
</li>
<li>
<p>Vertical writing with horizontal querying.</p>
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/20241231010014.png"></p>
</li>
</ol>
<p>With these issues in mind, let’s look at how the data files are organized to address or sidestep these problems.</p>
<p>First, examine the logical structure:</p>
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/20241231010031.png"></p>
<p>The entire database consists of blocks and a HEAD. Each block can further be broken down into chunks, while the HEAD serves as a read-write buffer area composed of in-memory data and write-ahead logs (WAL). A chunks file holds chunks from many timeseries, though each individual chunk belongs to exactly one timeseries.</p>
<p>The disk directory structure for a single block is as follows:</p>
<pre tabindex="0"><code>├── 01BKGV7JC0RY8A6MACW02A2PJD // the block's ULID
│ ├── chunks
│ │ └── 000001
│ ├── tombstones
│ ├── index
│ └── meta.json
├── chunks_head
│ └── 000001
└── wal
├── 000000002
└── checkpoint.00000001
└── 00000000
</code></pre><ul>
<li><strong>Block</strong>: Contains all data for a given time period (default 2 hours) and is read-only, named using a <a href="https://github.com/oklog/ulid">ULID</a>. Each block includes:
<ul>
<li>Chunks: fixed-size (max 128MB) chunks file</li>
<li>Index: index file mainly containing inverted index information</li>
<li>meta.json: metadata including block’s minTime and maxTime for data skipping during queries.</li>
</ul>
</li>
<li><strong>Chunks Head</strong>: The chunks file corresponding to the block currently being written, read-only, with a maximum of 120 data points and a maximum time span of 2 hours.</li>
<li><strong>WAL</strong>: Guarantees data integrity.</li>
</ul>
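<p>The inverted index maps each label pair to the sorted list of series IDs that carry it, and a selector with several matchers intersects those posting lists. A sketch of that lookup (the label values and index contents are hypothetical):</p>

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// "name=value" label pair -> sorted series IDs (a posting list)
using Index = std::map<std::string, std::vector<int>>;

// Intersect the posting lists of all requested matchers.
std::vector<int> select_series(const Index& idx,
                               const std::vector<std::string>& matchers) {
  if (matchers.empty()) return {};
  auto it = idx.find(matchers[0]);
  if (it == idx.end()) return {};
  std::vector<int> result = it->second;
  for (size_t i = 1; i < matchers.size(); ++i) {
    auto jt = idx.find(matchers[i]);
    if (jt == idx.end()) return {};  // no series carries this label
    std::vector<int> tmp;
    std::set_intersection(result.begin(), result.end(),
                          jt->second.begin(), jt->second.end(),
                          std::back_inserter(tmp));
    result = std::move(tmp);
  }
  return result;
}
```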
<blockquote>
<p>The diagram provides significant insights, such as how the WAL ensures data integrity; the Head acts similarly to a buffer pool in a TSDB, managing memory data for batch flushing to disk. When certain conditions are met (e.g., time threshold, data size threshold), the Head becomes immutable (block) and is flushed to disk.</p>
</blockquote>
<p>Overall, many design concepts in data organization resemble the LSM storage structure, which indeed suits TSDB well.</p>
<p>Prometheus’s design approach can be summarized as follows:</p>
<ul>
<li>Using time-based data partitioning to resolve the issue of short data lifecycles.</li>
<li>Using in-memory batching to handle scenarios where only the latest data is written.</li>
</ul>
<p>Setting aside similar aspects with LevelDB, let’s outline the differences.</p>
<p>First, the underlying models are different. LevelDB is a key-value store, while TSDB focuses on timeseries with a strong temporal connection, where time is monotonically increasing. It rarely writes historical data. Additionally, the query models differ; TSDB provides diverse query options, such as filtering timeseries based on various label set operations, necessitating more metadata for efficient querying.</p>
<p>Due to these requirements, new structures and functions are introduced: inverted indexes, checkpoints, tombstones, retention policies, and a compaction design distinct from the LSM key-value model. These will be analyzed in relation to the corresponding file formats.</p>
<h3 id="file-organization-format">File Organization Format</h3>
<p>Let’s examine the components; the specifics of the organizational method are not the focus of this article.</p>
<h4 id="metajson">meta.json</h4>
<p>This file describes the block; its most useful fields are the compaction record and the <code>minT</code>/<code>maxT</code> timestamps.</p>
<p><code>minT</code> and <code>maxT</code> record the block’s time range, allowing queries to skip blocks outside the queried window.</p>
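<p>Skipping reduces to an interval-overlap test per block. A minimal sketch (treating both the block range and the query range as closed intervals, which simplifies the real boundary conventions):</p>

```cpp
#include <cstdint>
#include <vector>

struct BlockMeta { int64_t minT, maxT; };  // from meta.json

// A block can contain samples for [qmin, qmax] iff the intervals overlap.
bool overlaps(const BlockMeta& b, int64_t qmin, int64_t qmax) {
  return b.maxT >= qmin && b.minT <= qmax;
}

// Keep only the blocks a query actually has to open.
std::vector<BlockMeta> select_blocks(const std::vector<BlockMeta>& blocks,
                                     int64_t qmin, int64_t qmax) {
  std::vector<BlockMeta> out;
  for (const auto& b : blocks)
    if (overlaps(b, qmin, qmax)) out.push_back(b);  // skip the rest
  return out;
}
```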
<p>Compaction records the block’s historical information, such as the number of compaction iterations (level) and its source blocks. The precise utility of this is uncertain but may help during compaction or retention tasks to manage potential duplicates.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"ulid"</span>: <span style="color:#e6db74">"01EM6Q6A1YPX4G9TEB20J22B2R"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"minTime"</span>: <span style="color:#ae81ff">1602237600000</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"maxTime"</span>: <span style="color:#ae81ff">1602244800000</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"stats"</span>: {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"numSamples"</span>: <span style="color:#ae81ff">553673232</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"numSeries"</span>: <span style="color:#ae81ff">1346066</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"numChunks"</span>: <span style="color:#ae81ff">4440437</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"compaction"</span>: {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"level"</span>: <span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"sources"</span>: [
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"01EM65SHSX4VARXBBHBF0M0FDS"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"01EM6GAJSYWSQQRDY782EA5ZPN"</span>
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"version"</span>: <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="chunks">chunks</h4>
<p>These are standard data files, with their indexes stored in the index file. Note that a chunk can only belong to one timeseries, and a timeseries consists of multiple chunks.</p>
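<p>To make the layout concrete, here is a rough, illustrative decoder for a single chunk entry (<code>len</code> as a uvarint, one encoding byte, the data, then a CRC32). Note the real on-disk format uses the Castagnoli CRC-32 polynomial; plain <code>zlib.crc32</code> and the checksum byte order used here are illustrative assumptions:</p>

```python
import zlib

def read_uvarint(buf, pos):
    # Decode an unsigned LEB128 varint starting at pos; return (value, new_pos).
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7

def read_chunk(buf, pos):
    # Parse one <len><encoding><data><crc32> entry from the chunks file.
    length, pos = read_uvarint(buf, pos)
    encoding = buf[pos]
    data = buf[pos + 1 : pos + 1 + length]
    crc = int.from_bytes(buf[pos + 1 + length : pos + 5 + length], "big")
    # Illustrative check only: the actual TSDB CRC uses the Castagnoli table.
    assert crc == zlib.crc32(bytes([encoding]) + data), "chunk corrupted"
    return encoding, data, pos + 5 + length
```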
<pre tabindex="0"><code>┌──────────────────────────────┐
│ magic(0x85BD40DD) <4 byte> │
├──────────────────────────────┤
│ version(1) <1 byte> │
├──────────────────────────────┤
│ padding(0) <3 byte> │
├──────────────────────────────┤
│ ┌──────────────────────────┐ │
│ │ Chunk 1 │ │
│ ├──────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────┤ │
│ │ Chunk N │ │
│ └──────────────────────────┘ │
└──────────────────────────────┘
Every Chunk:
┌───────────────┬───────────────────┬──────────────┬────────────────┐
│ len <uvarint> │ encoding <1 byte> │ data <bytes> │ CRC32 <4 byte> │
└───────────────┴───────────────────┴──────────────┴────────────────┘
</code></pre><h4 id="tombstone">tombstone</h4>
<p>This marks deleted data. A TSDB may see delete operations in scenarios such as transient jobs or container destruction, where business logic requires removal. Tombstones turn deletions into append-only writes instead of in-place modifications; the disk space is reclaimed later when blocks are compacted.</p>
<p>Of course, leaving deleted data in place is not harmful in itself; TTL-based expiration will eventually remove obsolete data.</p>
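<p>Conceptually, applying tombstones at read time is just an interval check per series; a minimal sketch assuming tombstones are (series ref, mint, maxt) triples as in the format below:</p>

```python
def is_deleted(tombstones, series_ref, ts):
    # A sample is masked if any tombstone for its series covers its timestamp.
    return any(ref == series_ref and mint <= ts <= maxt
               for ref, mint, maxt in tombstones)

tombstones = [(42, 1000, 2000)]  # (series ref, mint, maxt)
```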
<pre tabindex="0"><code>┌────────────────────────────┬─────────────────────┐
│ magic(0x0130BA30) <4b> │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │ Tombstone 1 │ │
│ ├──────────────────────────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Tombstone N │ │
│ ├──────────────────────────────────────────────┤ │
│ │ CRC<4b> │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
Every Tombstone:
┌────────────────────────┬─────────────────┬─────────────────┐
│ series ref <uvarint64> │ mint <varint64> │ maxt <varint64> │
└────────────────────────┴─────────────────┴─────────────────┘
</code></pre><h4 id="index-file">index file</h4>
<p>This file contains all information needed for reading, such as inverted indexes and the mapping of timeseries to chunks.</p>
<p>Notable structures include Series and Postings.</p>
<p>The Series section records, for each series, its labels and the references to its chunks within the block.</p>
<p>The Posting Offset Table lists the locations of inverted indexes. The actual inverted index content is stored in the Postings section.</p>
<blockquote>
<p>With set operations (intersection, union) over the inverted index, you can rapidly filter and retrieve the timeseries that match specified label criteria.</p>
</blockquote>
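<p>For instance, evaluating a selector with two label matchers boils down to intersecting two sorted postings lists of series references (a sketch, not the actual implementation; the postings values are hypothetical):</p>

```python
def intersect(a, b):
    # Merge-intersect two sorted postings lists of series references.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical postings for {job="api"} and {env="prod"}:
postings_job_api = [3, 5, 9, 12]
postings_env_prod = [5, 7, 12, 20]
matching_series = intersect(postings_job_api, postings_env_prod)
```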
<pre tabindex="0"><code>┌────────────────────────────┬─────────────────────┐
│ magic(0xBAAAD700) <4b> │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │ Symbol Table │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Series │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Label Index 1 │ │
│ ├──────────────────────────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Label Index N │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Postings 1 │ │
│ ├──────────────────────────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Postings N │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Label Offset Table │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Postings Offset Table │ │
│ ├──────────────────────────────────────────────┤ │
│ │ TOC │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
A Series:
┌──────────────────────────────────────────────────────┐
│ len <uvarint> │
├──────────────────────────────────────────────────────┤
│ ┌──────────────────────────────────────────────────┐ │
│ │ labels count <uvarint64> │ │
│ ├──────────────────────────────────────────────────┤ │
│ │ ┌────────────────────────────────────────────┐ │ │
│ │ │ ref(l_i.name) <uvarint32> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ ref(l_i.value) <uvarint32> │ │ │
│ │ └────────────────────────────────────────────┘ │ │
│ │ ... │ │
│ ├──────────────────────────────────────────────────┤ │
│ │ chunks count <uvarint64> │ │
│ ├──────────────────────────────────────────────────┤ │
│ │ ┌────────────────────────────────────────────┐ │ │
│ │ │ c_0.mint <varint64> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ c_0.maxt - c_0.mint <uvarint64> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ ref(c_0.data) <uvarint64> │ │ │
│ │ └────────────────────────────────────────────┘ │ │
│ │ ┌────────────────────────────────────────────┐ │ │
│ │ │ c_i.mint - c_i-1.maxt <uvarint64> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ c_i.maxt - c_i.mint <uvarint64> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ ref(c_i.data) - ref(c_i-1.data) <varint64> │ │ │
│ │ └────────────────────────────────────────────┘ │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────┤
│ CRC32 <4b> │
└──────────────────────────────────────────────────────┘
A Postings:
┌────────────────────┬────────────────────┐
│ len <4b> │ #entries <4b> │
├────────────────────┴────────────────────┤
│ ┌─────────────────────────────────────┐ │
│ │ ref(series_1) <4b> │ │
│ ├─────────────────────────────────────┤ │
│ │ ... │ │
│ ├─────────────────────────────────────┤ │
│ │ ref(series_n) <4b> │ │
│ └─────────────────────────────────────┘ │
├─────────────────────────────────────────┤
│ CRC32 <4b> │
└─────────────────────────────────────────┘
</code></pre><h2 id="accelerating-disk-queries">Accelerating Disk Queries</h2>
<p>Let’s focus on how a query locates the relevant data:</p>
<ul>
<li>First, it queries the Posting Offset Table to find the position of the corresponding label’s Postings.</li>
<li>Based on the information from the Postings, it identifies the chunk locations via the series reference.</li>
<li>Finally, it locates the corresponding chunks for the timeseries.</li>
</ul>
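<p>The three steps can be sketched end-to-end with toy in-memory stand-ins for the index sections (all structure names here are illustrative, not the on-disk layout):</p>

```python
def query(index, label, value, query_min_t, query_max_t):
    # Postings Offset Table -> Postings -> Series -> chunk references.
    offset = index["postings_offset_table"].get((label, value))
    if offset is None:
        return []
    series_refs = index["postings"][offset]
    chunks = []
    for ref in series_refs:
        series = index["series"][ref]
        # Keep only chunks whose time range intersects the query range.
        chunks.extend(c for c in series["chunks"]
                      if c["mint"] <= query_max_t and query_min_t <= c["maxt"])
    return chunks

index = {
    "postings_offset_table": {("job", "api"): 0},
    "postings": {0: [1]},
    "series": {1: {"chunks": [{"ref": "c0", "mint": 0, "maxt": 100},
                              {"ref": "c1", "mint": 100, "maxt": 200}]}},
}
```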
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/20241231005912.png">
<img src="https://github.com/noneback/images/blob/picgo/image.png?raw=true"></p>
<h2 id="compaction">Compaction</h2>
<p>Similar to LevelDB, Prometheus utilizes both major and minor compaction processes, termed Compaction and Head Compaction.</p>
<p>Head Compaction is essentially the process of persisting the in-memory Head into on-disk chunks, during which tombstoned data is actually dropped from memory.</p>
<p>Compaction is the merging of blocks, accomplishing multiple aims:</p>
<ol>
<li>Reclaiming disk resources used by marked deletions.</li>
<li>Consolidating duplicate information scattered across multiple blocks, such as shared chunks and inverted index records.</li>
<li>Enhancing query speed by resolving data that overlaps across different blocks; merging once during compaction is more efficient than merging in memory after every read.</li>
</ol>
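<p>The bookkeeping side of a block merge can be sketched directly from the meta.json fields shown earlier: the merged block widens the time range, bumps the compaction level, and carries the union of its sources (actual TSDB compaction also merges series, chunks, and indexes, which is omitted here):</p>

```python
def compact_meta(blocks):
    # Merge the metadata of several source blocks into that of the new block.
    return {
        "minTime": min(b["minTime"] for b in blocks),
        "maxTime": max(b["maxTime"] for b in blocks),
        "compaction": {
            "level": max(b["compaction"]["level"] for b in blocks) + 1,
            "sources": [s for b in blocks for s in b["compaction"]["sources"]],
        },
    }

b1 = {"minTime": 0, "maxTime": 7200, "compaction": {"level": 1, "sources": ["A"]}}
b2 = {"minTime": 7200, "maxTime": 14400, "compaction": {"level": 1, "sources": ["B"]}}
merged = compact_meta([b1, b2])
```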
<h3 id="when-does-compaction-occur">When does compaction occur?</h3>
<p>The official blog doesn’t clarify this well, merely mentioning it occurs when data overlaps. However, various triggers exist, including time-based triggers, checks at each minor compaction, tombstone size evaluations, and manual triggers, following strategies observed in LevelDB.</p>
<h2 id="retention">Retention</h2>
<p>This is straightforward: data expires via a time-based or size-based TTL. Integrating it into the compaction process could also be a viable approach.</p>
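<p>A minimal sketch of time-based retention, assuming whole blocks are the unit of deletion (which the minT/maxT metadata makes cheap to decide):</p>

```python
def apply_retention(blocks, now_ms, retention_ms):
    # A block is dropped once even its newest sample (maxTime) is past the TTL.
    cutoff = now_ms - retention_ms
    return [b for b in blocks if b["maxTime"] > cutoff]

blocks = [{"ulid": "old", "maxTime": 100}, {"ulid": "new", "maxTime": 250}]
```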
<h2 id="snapshot">Snapshot</h2>
<p>This process dumps the in-memory data to disk, likely designed to balance the heavy disk writes of metric data against data integrity; otherwise, its functionality would duplicate that of the WAL.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://web.archive.org/web/20210803115658/https://fabxc.org/tsdb/">https://web.archive.org/web/20210803115658/https://fabxc.org/tsdb/</a></li>
<li><a href="https://liujiacai.net/blog/2021/04/11/prometheus-storage-engine/">https://liujiacai.net/blog/2021/04/11/prometheus-storage-engine/</a></li>
<li><a href="https://tech.qimao.com/prometheus-tsdb-de-she-ji-yu-shi-xian-2/">https://tech.qimao.com/prometheus-tsdb-de-she-ji-yu-shi-xian-2/</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/">https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention/#compaction">https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention/#compaction</a></li>
</ul>
Borg: Large-scale Cluster Management at Google with Borg
https://noneback.github.io/blog/borg/
Mon, 19 Feb 2024 11:12:16 +0800https://noneback.github.io/blog/borg/<p>Borg is a cluster management system from Google; it can be viewed as a closed-source predecessor of Kubernetes (k8s).</p>
<ul>
<li>It achieves high utilization through admission control, efficient task packing, overcommitment, machine sharing, and process-level performance isolation.</li>
<li>It provides runtime features to reduce failure recovery time for high-availability applications and scheduling policies that reduce the probability of correlated failures.</li>
<li>It offers a declarative job description language, DNS integration, real-time job monitoring, and tools for analyzing and simulating system behavior, simplifying usage for end-users.</li>
</ul>
<p>The paper aims to introduce the system design and share the experiences Google has gained behind it. This blog mainly focuses on system design, specifically the services Borg offers in terms of SLA, its abstraction of workloads, resources, and scheduling.</p>
<h2 id="system-abstraction">System Abstraction</h2>
<p>Borg manages two primary workloads: long-running services and batch jobs, corresponding to two types of jobs (prod/non-prod). A job consists of several tasks, and different jobs have different priorities.</p>
<p>In terms of deployment architecture, a Borg cluster consists of several cells, each containing multiple machines.</p>
<p>For task scheduling, all physical or logical units on machines are treated as resources, including CPU, memory, IO, etc.</p>
<h2 id="system-architecture">System Architecture</h2>
<p><img alt="System Architecture" src="https://raw.githubusercontent.com/noneback/images/picgo/202401291404127.png"></p>
<p>Borg uses a master-slave architecture, consisting of a BorgMaster and several Borglet nodes. The scheduler is an independent service.</p>
<p><strong>BorgMaster</strong> is a logical node responsible for interacting with both external components and Borglets, as well as maintaining the internal state of the cluster. It uses Paxos to achieve multi-replication and high availability.</p>
<p><strong>Borglet</strong> is the Borg proxy on each machine in the cell. It is responsible for starting/stopping tasks, managing node physical resources, and reporting status.</p>
<p><strong>Scheduler</strong> is the service responsible for task scheduling. It uses the state recorded by the master to asynchronously handle task scheduling and informs the master for a secondary check.</p>
<h2 id="resource-scheduling">Resource Scheduling</h2>
<p>The scheduler is a key service in Borg. The quality of the scheduling algorithm directly affects resource utilization and is closely related to cost efficiency.</p>
<h3 id="basic-process">Basic Process</h3>
<p>The scheduling algorithm has two parts:</p>
<ul>
<li><strong>Feasibility Check</strong>: Finds a set of machines capable of running the task.</li>
<li><strong>Scoring</strong>: Selects the most suitable machine from that set.</li>
</ul>
<p>During the feasibility check, the scheduler finds a set of machines that meet task constraints and have enough available resources. Available resources include those already allocated to lower-priority tasks that can be preempted.</p>
<p>During the scoring phase, the scheduler determines the suitability of each feasible machine. Scoring considers user-specific preferences but primarily depends on built-in criteria, such as minimizing the number and priority of preempted tasks, selecting machines that already have the task package, distributing tasks across different power and failure domains, and optimizing packing quality (mixing high- and low-priority tasks on a single machine to allow high-priority tasks to expand during load spikes).</p>
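<p>The two-stage flow can be sketched as follows; the resource model and the scoring function are drastically simplified stand-ins for Borg&rsquo;s real criteria:</p>

```python
def schedule(task, machines):
    # Feasibility check: machines with enough free resources for the task.
    feasible = [m for m in machines
                if m["free_cpu"] >= task["cpu"] and m["free_mem"] >= task["mem"]]
    if not feasible:
        return None

    def score(m):
        # Toy scoring: prefer machines that already cache the task's package,
        # then the tightest fit (least stranded CPU).
        return (task["package"] in m["packages"], -(m["free_cpu"] - task["cpu"]))

    return max(feasible, key=score)

machines = [
    {"name": "m1", "free_cpu": 8, "free_mem": 16, "packages": set()},
    {"name": "m2", "free_cpu": 4, "free_mem": 8, "packages": {"web"}},
    {"name": "m3", "free_cpu": 1, "free_mem": 8, "packages": {"web"}},
]
```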
<p>The scheduler uses a cached copy of the cell state and performs the following steps repeatedly:</p>
<ol>
<li>Retrieves state changes (including assigned and pending jobs) from the elected master and updates its local copy.</li>
<li>Runs a round of scheduling to assign tasks and sends assignment information to the master.</li>
<li>The master accepts and applies the assignments, but if they are unsuitable (e.g., based on outdated state), it waits for the scheduler’s next round.</li>
</ol>
<h3 id="additional-aspects">Additional Aspects</h3>
<p>The paper also discusses how to provide oversubscription and handle performance contention, though these are not the focus of this blog. Readers can refer to the original paper for more details.</p>
<h2 id="references">References</h2>
<p><a href="https://www.cnblogs.com/hellojamest/p/16526159.html">https://www.cnblogs.com/hellojamest/p/16526159.html</a></p>
Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications
https://noneback.github.io/blog/percolator/
Thu, 28 Sep 2023 10:43:23 +0800https://noneback.github.io/blog/percolator/<p>It has been a while since I last studied, and I wanted to learn something interesting. This time, I’ll be covering Percolator, a distributed transaction system. I won’t translate the paper or delve into detailed algorithms; I’ll just document my understanding.</p>
<h2 id="percolator-and-2pc">Percolator and 2PC</h2>
<h3 id="2pc">2PC</h3>
<p>The Two-Phase Commit (2PC) protocol involves two types of roles: <strong>Coordinator and Participant</strong>. The coordinator manages the entire process to ensure multiple participants reach a unanimous decision. Participants respond to the coordinator’s requests, completing prepare operations and commit/abort operations based on those requests.</p>
<blockquote>
<p><strong>The 2PC protocol ensures the atomicity, consistency, and durability (ACD) of a transaction</strong> but does not implement <strong>isolation (I)</strong>; the ACD properties themselves rely on each participant&rsquo;s single-node transactions. The coordinator is clearly a single point of failure, which can become a bottleneck or block the protocol if it crashes.</p>
</blockquote>
<pre tabindex="0"><code> Coordinator Participant
QUERY TO COMMIT
-------------------------------->
VOTE YES/NO prepare*/abort*
<-------------------------------
commit*/abort* COMMIT/ROLLBACK
-------------------------------->
ACKNOWLEDGMENT commit*/abort*
<--------------------------------
end
</code></pre><h3 id="percolator">Percolator</h3>
<p>Percolator can be seen as an optimized version of 2PC, with some improvements such as:</p>
<ul>
<li>Optimizing the use of locks by introducing primary-secondary dual-level locks, which eliminates the reliance on a <strong>coordinator</strong>.</li>
<li>Providing full ACID semantics and supporting MVCC (Multi-Version Concurrency Control) through a timestamp service.</li>
</ul>
<h2 id="percolator-protocol-details">Percolator Protocol Details</h2>
<p>The Percolator system consists of three main components:</p>
<ul>
<li>
<p><strong>Client</strong>: The client initiating a transaction. It acts as the control center for the entire protocol and is the coordinator of the two-phase commit process.</p>
</li>
<li>
<p><strong>TO (Timestamp Oracle)</strong>: Responsible for assigning timestamps, providing unique, monotonically increasing timestamps to implement MVCC.</p>
</li>
<li>
<p><strong>Bigtable</strong>: Provides single-row transactions, storing data as well as some attributes for transactional control.</p>
<blockquote>
<p><code>lock + write + data</code>: for transactions, where <code>lock</code> indicates that a cell is held by a transaction, and <code>write</code> represents the data visibility.</p>
<p><code>notify + ack</code>: for watcher or notifier mechanisms.</p>
<p><img alt="https://raw.githubusercontent.com/noneback/images/picgo/20230927163910.png" src="https://raw.githubusercontent.com/noneback/images/picgo/20230927163910.png"></p>
</blockquote>
</li>
</ul>
<p>Externally, Percolator is provided to businesses through an SDK, offering transactions and R/W operations. The model is similar to <code>Begin Txn → Sets of RW Operations → Commit or Abort or Rollback</code>. Bigtable acts as the persistent component, hiding details about Tablet Server data sharding. Each write operation (including read-then-write) in the transaction is treated as a participant in a distributed transaction and may be dispatched to multiple Tablet Server nodes.</p>
<h3 id="algorithm-workflow">Algorithm Workflow</h3>
<p>All writes in a transaction are cached on the client before being written during the commit phase. The commit phase itself is a standard two-phase commit consisting of prewrite and commit stages.</p>
<h4 id="prewrite">Prewrite</h4>
<ol>
<li>Obtain a timestamp from TO as the start time of the transaction.</li>
<li>Lock the data, marking it as held by the current transaction. If locking fails, it means the data is held by another transaction, and the current transaction fails.</li>
</ol>
<blockquote>
<p>The locking process utilizes the primary-secondary mechanism, where one write is chosen as the <strong>primary</strong> and all others as <strong>secondary</strong>. The secondary locks point to the primary.</p>
<p><img alt="https://raw.githubusercontent.com/noneback/images/picgo/202309271613141.png" src="https://raw.githubusercontent.com/noneback/images/picgo/202309271613141.png"></p>
</blockquote>
<p>Clearly, data in the prewrite phase is invisible to other transactions.</p>
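<p>A toy model of the prewrite phase, with each cell holding the <code>lock</code>/<code>write</code>/<code>data</code> columns described above (error handling and cleanup of partially taken locks are omitted in this illustrative sketch):</p>

```python
def prewrite(store, writes, start_ts, primary):
    # Lock every written cell; secondaries point back at the primary key.
    for key, value in writes.items():
        cell = store.setdefault(key, {"lock": None, "write": {}, "data": {}})
        if cell["lock"] is not None:
            return False  # held by another transaction: write-write conflict
        if any(commit_ts > start_ts for commit_ts in cell["write"]):
            return False  # a newer committed version already exists
        cell["data"][start_ts] = value
        cell["lock"] = {"primary": primary, "start_ts": start_ts}
    return True
```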
<h4 id="commit">Commit</h4>
<ol>
<li>Attempt to commit the prewritten data. The commit starts with the primary record, whose commit time serves as the commit time for the entire transaction. First, the lock record is checked: if the lock no longer exists, the lock from the prewrite phase has been cleaned up by another transaction, and the current transaction fails. If the lock exists, the <code>write</code> column is updated to make the data visible to the system.</li>
</ol>
<blockquote>
<p>In an asynchronous network, single-node failures and network delays are common. The algorithm must detect and clean up these locks to avoid deadlocks. Therefore, in the commit phase, if a lock is found to be missing, it means that an issue occurred with a participant, and the current transaction must be cleaned.</p>
</blockquote>
<ol start="2">
<li>After successfully committing, clean up the lock record. Lock cleanup can be done asynchronously.</li>
</ol>
<p>These designs eliminate the dependency on a centralized <strong>coordinator</strong>. Previously, a centralized service was required to maintain information about all transaction participants. In this algorithm, the primary-secondary lock and the <code>write</code> column achieve the same goal. The <code>write</code> column indicates the visibility and version chain of the data, while the <code>lock</code> column shows which transaction holds the data. The primary-secondary locks record the logical relationship among participants. Thus, committing the primary record becomes the commit point for the entire transaction. Once the primary is committed, all secondary records can be asynchronously committed by checking the corresponding primary record’s <code>write</code> column.</p>
<h3 id="snapshot-isolation">Snapshot Isolation</h3>
<p>Two-phase commit ensures the atomicity of a transaction. On top of that, Percolator also provides <strong>snapshot isolation</strong>. In simple terms, snapshot isolation requires that committed transactions do not cause data conflicts and that read operations within a transaction satisfy snapshot reads. By leveraging the transaction start time and the primary commit time, a total ordering among transactions can be maintained, solving these issues naturally.</p>
<h3 id="deadlock-issues-in-asynchronous-networks">Deadlock Issues in Asynchronous Networks</h3>
<p>As mentioned earlier, in an asynchronous network, single-node failures and network delays are common. The algorithm must clean up locks to prevent deadlocks when such failures are detected. The failure detection strategy can be as simple as a timeout, causing the current transaction to fail. When a node fails and then recovers, its previous transaction has already failed, and the relevant lock records must be cleaned up. Lock cleanup can be asynchronous; for example, during the prewrite phase, if a record’s lock column is found to be non-empty, its primary lock can be checked. If the primary lock is not empty, it means the transaction is incomplete, and the lock can be cleaned up; if empty, the transaction has committed, and the data should be committed and the lock cleaned (RollForward).</p>
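<p>The rollback-versus-rollforward decision can be sketched as follows, modeling each cell as a dict with <code>lock</code>, <code>write</code> (mapping commit_ts to start_ts), and <code>data</code> (mapping start_ts to value) columns; this is an illustrative toy, not Percolator&rsquo;s actual code:</p>

```python
def resolve_lock(store, key, primary_key):
    # Decide the fate of a leftover lock by inspecting the primary.
    cell = store[key]
    lock = cell["lock"]
    primary = store[primary_key]
    if primary["lock"] is not None and primary["lock"]["start_ts"] == lock["start_ts"]:
        # Primary lock still present: the transaction never committed. Roll back.
        primary["data"].pop(lock["start_ts"], None)
        primary["lock"] = None
        cell["data"].pop(lock["start_ts"], None)
        cell["lock"] = None
        return "rolled_back"
    # Primary lock gone: the transaction committed. Roll this cell forward.
    commit_ts = next(ts for ts, start in primary["write"].items()
                     if start == lock["start_ts"])
    cell["write"][commit_ts] = lock["start_ts"]
    cell["lock"] = None
    return "rolled_forward"
```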
<h3 id="notification-mechanism">Notification Mechanism</h3>
<p>A notification mechanism is crucial for state observation and linkage in asynchronous systems, but it is not the focus of this article.</p>
<h2 id="percolator-in-tidb">Percolator in TiDB</h2>
<p>Based on our analysis above, Percolator is an optimized 2PC distributed transaction implementation, relying on a storage engine that supports single-node transactions.</p>
<p>Let’s briefly look at how TiDB uses Percolator to implement distributed transactions.</p>
<p><img alt="https://download.pingcap.com/images/docs-cn/tidb-architecture-v6.png" src="https://download.pingcap.com/images/docs-cn/tidb-architecture-v6.png"></p>
<p>The architecture of TiDB and TiKV is shown above. Data from relational tables in TiDB is ultimately mapped to KV pairs in TiKV. TiKV is a distributed KV store based on Raft and RocksDB. RocksDB supports transactional operations on KV pairs.</p>
<p><img alt="https://download.pingcap.com/images/docs/tikv-rocksdb.png" src="https://download.pingcap.com/images/docs/tikv-rocksdb.png"></p>
<p>Thus, the transaction path in TiDB is as follows: a relational table transaction is converted into a set of KV transactions, which are executed based on Percolator to achieve relational table transaction operations.</p>
<blockquote>
<p>Of course, it cannot provide the same transactional semantics and performance guarantees as a single-node TP database. However, a shared-nothing architecture has its own advantages, which may make this trade-off acceptable.</p>
</blockquote>
<h2 id="references">References</h2>
<p><a href="https://zhuanlan.zhihu.com/p/22594180">Engineering Practice of Two-Phase Commit</a></p>
<p><a href="http://mysql.taobao.org/monthly/2018/11/02/">PolarDB Database Kernel Monthly Report</a></p>
<p><a href="https://karellincoln.github.io/2018/04/05/percolator-translate/">Percolator: Online Incremental Processing System (Chinese Translation)</a></p>
<p><a href="https://www.notion.so/percolator-879c8f72f80b4966a2ec1e41edc74560?pvs=21">Percolator: Online Incremental Processing System (Chinese Translation) | A Small Bird</a></p>
<p><a href="https://zh.wikipedia.org/zh-hans/%E4%BA%8C%E9%98%B6%E6%AE%B5%E6%8F%90%E4%BA%A4">Two-Phase Commit - Wikipedia</a></p>
<p><a href="https://cn.pingcap.com/blog/percolator-and-txn">Percolator and TiDB Transaction Algorithm</a></p>
<p><a href="http://www.oceanbase.wiki/concept/transaction-management/transactions/distributed-transactions/two-phase-commit">Two-Phase Commit | OceanBase Learning Guide</a></p>
<p><a href="https://docs.pingcap.com/zh/tidb/stable/tidb-architecture">TiDB Architecture Overview</a></p>
Dynamo: Amazon’s Highly Available Key-value Store
https://noneback.github.io/blog/dynamo/
Tue, 01 Aug 2023 16:15:29 +0800https://noneback.github.io/blog/dynamo/<p>An old paper by AWS, Dynamo has been in the market for a long time, and the architecture has likely evolved since the paper’s publication. Despite this, the paper was selected as one of the SIGMOD best papers of the year, and there are still many valuable lessons to learn.</p>
<h2 id="design">Design</h2>
<p>Dynamo is a NoSQL product that provides a key-value storage interface. It emphasizes high availability rather than consistency, which leads to differences in architectural design and technical choices compared to other systems.</p>
<h2 id="technical-details">Technical Details</h2>
<p>Dynamo has many aspects that may be considered problematic from a technical perspective, such as the NWR (N-W-R) approach. However, given Dynamo’s long track record in production, these issues may have been resolved over time, though the paper is not explicit about this. For now, let’s discuss some of the aspects I found noteworthy:</p>
<h3 id="data-partitioning">Data Partitioning</h3>
<p>Dynamo uses a <strong>consistent hashing algorithm</strong>. Traditional consistent hashing employs a hash ring to address the problem of extensive rehashing when nodes are added or removed, but it cannot avoid issues like data skew and performance imbalance caused by heterogeneous machines. In practice, Dynamo introduces <strong>virtual nodes</strong> into the hash ring, which elegantly solves these problems.</p>
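<p>A minimal sketch of a hash ring with virtual nodes (MD5 is used here only for illustration; Dynamo&rsquo;s actual hashing and placement logic differ):</p>

```python
import bisect
import hashlib

def h(key):
    # Map a string onto the ring's key space.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=100):
    # Each physical node contributes `vnodes` points on the ring, smoothing out
    # data skew; heterogeneous machines can be given proportionally more points.
    return sorted((h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))

def lookup(ring, key):
    # Walk clockwise to the first virtual node at or after the key's hash.
    hashes = [point for point, _ in ring]
    idx = bisect.bisect_right(hashes, h(key)) % len(ring)
    return ring[idx][1]
```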
<h3 id="data-write-challenges">Data Write Challenges</h3>
<p>Most storage systems ensure a certain level of consistency during writes, trading off lower write performance for reduced read complexity. However, Dynamo takes a different approach.</p>
<p>Dynamo’s design goal is to provide a highly available key-value store that ensures <strong>always writable</strong> operations while only guaranteeing <strong>eventual consistency</strong>. To achieve this, Dynamo pushes data conflict resolution to the read operation, <strong>ensuring that writes are never rejected</strong>.</p>
<p>There are two key issues to consider here:</p>
<ol>
<li>
<p><strong>Data Conflict Resolution</strong>: Concurrent reads and writes to the same key by multiple clients can easily lead to data conflicts. Since Dynamo only provides eventual consistency, data on different nodes in the Dynamo ring might be inconsistent.</p>
<ul>
<li>Dynamo uses <strong>vector clocks</strong> to keep track of data versions and merges them during reads to resolve conflicts.</li>
</ul>
</li>
<li>
<p><strong>Replica Data Gaps</strong>: Since Dynamo uses NWR quorums with sloppy quorums and hinted handoff, it is theoretically possible that no single replica holds the complete data set, so replicas must be synchronized.</p>
<ul>
<li>Dynamo uses an <strong>anti-entropy process</strong> to address this, employing <strong>Merkle Trees</strong> to efficiently detect inconsistencies between replicas and minimize the amount of data transferred.</li>
</ul>
</li>
</ol>
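<p>The vector-clock reconciliation mentioned above can be sketched as two small operations, representing a clock as a node-to-counter map (a simplified model of what the paper describes):</p>

```python
def descends(vc_a, vc_b):
    # True if version a already incorporates everything in b; two versions
    # conflict when neither descends from the other.
    return all(vc_a.get(node, 0) >= count for node, count in vc_b.items())

def merge(vc_a, vc_b):
    # Reconciliation on read: take the per-node maximum of both clocks.
    return {node: max(vc_a.get(node, 0), vc_b.get(node, 0))
            for node in vc_a.keys() | vc_b.keys()}
```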
<p><img alt="Dynamo Design Considerations" src="https://raw.githubusercontent.com/noneback/images/picgo/20230801162353.png"></p>
<p>The table in the paper clearly shows the aspects considered during Dynamo’s development and the corresponding technical choices. For more information, refer to the original paper.</p>
<h2 id="references">References</h2>
<p><a href="https://www.raychase.net/2396">Dynamo’s Implementation and Decentralization</a></p>
<p><a href="https://timyang.net/data/dynamo-flawed-architecture-chinese/">Dynamo’s Flawed Architecture (Translation) by Tim Yang</a></p>
<p><a href="https://news.ycombinator.com/item?id=915212">Dynamo: A Flawed Architecture | Hacker News</a></p>
MIT6.824 AuroraDB
https://noneback.github.io/blog/mit6.824-auroradb/
Tue, 01 Aug 2023 16:11:54 +0800https://noneback.github.io/blog/mit6.824-auroradb/<p>This article introduces the design considerations of AWS’s database product, Aurora, including storage-compute separation, single-writer multi-reader architecture, and quorum-based NRW consistency protocol. The article also mentions how PolarDB was inspired by Aurora, with differences in addressing network bottlenecks and system call overhead.</p>
<hr>
<p>Aurora is a database product provided by AWS, primarily aimed at OLTP business scenarios.</p>
<p>In terms of design, there are several aspects worth noting:</p>
<ul>
<li>The design premise of Aurora is that with databases moving to the cloud, thanks to advancements in cloud infrastructure, the biggest bottleneck for databases has shifted from compute and storage to the network. This was an important premise for AWS when designing Aurora. Based on this premise, Aurora revisits the concept of “Log is Database”, pushing only the RedoLog down to the storage layer.</li>
<li><strong>Storage-compute separation</strong>: The database storage layer interfaces with a distributed storage system, which provides reliability and security guarantees. The compute and storage layers can scale independently. The storage system provides a unified data view to the upper layers, significantly improving the efficiency of core functions and operations (such as backup, data recovery, and high availability).</li>
<li><strong>Interesting reliability guarantees</strong>: For example, the quorum-based NRW consistency protocol, where read and write operations on storage nodes require quorum agreement, ensures AZ-level fault tolerance. Sharding is used to reduce failure recovery time, improving the SLA. Quorum reads mostly occur during database recovery, when the current state needs to be restored.</li>
<li><strong>Single-writer multi-reader</strong>: Unlike NewSQL products with a shared-nothing architecture, Aurora provides only a single write node. This simplifies data consistency guarantees since the single write node can use the RedoLog LSN as a logical clock to maintain the partial order of data updates. By pushing the RedoLog to all nodes and applying these operations in order, consistency can be achieved.</li>
<li><strong>Transaction implementation</strong>: Since the storage system provides a unified file view to the upper layer, Aurora’s transaction implementation is almost the same as that of a single-node transaction algorithm and can provide similar transaction semantics. NewSQL transactions are generally implemented via distributed transactions based on 2PC.</li>
<li><strong>Background acceleration for foreground processing</strong>: Similar to the approach in LevelDB, storage nodes try to make some operations asynchronous (such as log apply) to improve user-perceived performance. These asynchronous operations maintain progress using various LSNs, such as VLSN, commit-LSN, etc.</li>
</ul>
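<p>The quorum arithmetic behind this is compact enough to state directly; the Aurora paper uses six replicas across three AZs with a 4/6 write quorum and a 3/6 read quorum:</p>

```python
def quorum_ok(n, w, r):
    # A read quorum must overlap every write quorum (r + w > n), and two write
    # quorums must overlap so a write sees the latest committed write (w > n/2).
    return r + w > n and w > n / 2
```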
<p><img alt="Aurora Architecture Overview" src="https://raw.githubusercontent.com/noneback/images/picgo/20230412094745.png"></p>
<p><img alt="Aurora Write Path" src="https://raw.githubusercontent.com/noneback/images/picgo/20230412094928.png"></p>
<p><img alt="Aurora Read Path" src="https://raw.githubusercontent.com/noneback/images/picgo/20230412094941.png"></p>
<p>Interestingly, although PolarDB’s design was inspired by Aurora, its architecture starts from the observation that the network is no longer the bottleneck; instead, the many system calls that go through the OS slow down the overall path. Given the instability of Alibaba Cloud’s storage system at the time, PolarStore was introduced, using dedicated hardware and FUSE-based storage techniques to bypass or optimize system calls. Now that Pangu has improved significantly in both stability and performance, it makes sense to weaken the role of PolarStore, and I find that reasoning convincing.</p>
<p>Additionally, why did they choose to use NRW instead of a consensus protocol like Raft? For now, it seems that NRW has one less round of network communication compared to Raft, which might be the reason.</p>
<p><img alt="Aurora Storage-Compute Separation" src="https://raw.githubusercontent.com/noneback/images/picgo/20230412094918.png"></p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://zhuanlan.zhihu.com/p/319806107">https://zhuanlan.zhihu.com/p/319806107</a></li>
<li><a href="http://nil.csail.mit.edu/6.824/2020/notes/l-aurora.txt">http://nil.csail.mit.edu/6.824/2020/notes/l-aurora.txt</a></li>
<li><a href="https://keys961.github.io/2020/05/05/%E8%AE%BA%E6%96%87%E9%98%85%E8%AF%BB-Aurora/">Paper Reading - Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Database - keys961 | keys961 Blog</a></li>
</ul>
MIT6.824 Chain Replication
https://noneback.github.io/blog/mit6.824-chainreplication/
Wed, 08 Feb 2023 23:05:57 +0800https://noneback.github.io/blog/mit6.824-chainreplication/<p>This post provides a brief overview of the Chain Replication (CR) paper, which introduces a simple but effective algorithm for providing linearizable consistency in storage services. For those interested in the detailed design, it’s best to refer directly to the original paper.</p>
<h2 id="introduction">Introduction</h2>
<p>In short, the Chain Replication (CR) paper presents a replicated state machine algorithm designed for storage services that require linearizable consistency. It uses a chain replication method to improve throughput and relies on multiple replicas to ensure service availability.</p>
<p><img alt="Chain Replication" src="https://raw.githubusercontent.com/noneback/images/picgo/20230215135829.png"></p>
<p>The design of the algorithm is both simple and elegant. CR splits the replication workload across all nodes in the chain, with each node responsible for forwarding updates to its successor. Write requests are propagated from the head node to the tail, while read requests are served by the tail node.</p>
<p>To maintain relationships between nodes in the chain, Chain Replication introduces a Master service responsible for managing node configurations and handling node failures.</p>
<h2 id="failure-handling">Failure Handling</h2>
<ol>
<li>
<p><strong>Head Failure</strong>: If the head node fails, any pending or unprocessed requests are lost, but linearizable consistency remains unaffected. The second node in the chain is promoted to the new head.</p>
</li>
<li>
<p><strong>Tail Failure</strong>: If the tail node fails, the second-to-last node becomes the new tail, and pending requests from the original tail are committed.</p>
</li>
<li>
<p><strong>Middle Node Failure</strong>: When a middle node fails, the chain is reconnected in a manner similar to linked list operations. The previous node (<code>Node_pre</code>) is linked directly to the next node (<code>Node_next</code>). To ensure that no requests are lost during this failure, each CR node maintains a <code>SendReqList</code> that records all requests forwarded to its successor. Since requests are propagated from head to tail, <code>Node_pre</code> only needs to send <code>Node_next</code> any missing data. When the tail node receives a request, it marks it as committed, and an acknowledgment (<code>Ack(req)</code>) is sent back from the tail to the head, removing the request from each node’s <code>SendReqList</code> as the acknowledgment propagates.</p>
</li>
</ol>
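The <code>SendReqList</code> bookkeeping in the middle-node case can be sketched as a toy model. This is illustrative Java only; the paper describes the protocol abstractly, and all names here are hypothetical. The model is synchronous, so anything still in a node's <code>SendReqList</code> was never delivered downstream:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of a CR node's SendReqList bookkeeping (illustrative only).
class ChainNode {
    final Set<String> seen = new HashSet<>();             // makes delivery idempotent
    final LinkedHashSet<String> sendReqList = new LinkedHashSet<>(); // forwarded, not yet acked
    final List<String> committed = new ArrayList<>();
    ChainNode predecessor, successor;

    // Updates flow head -> tail; the tail commits and starts the ack wave.
    void handleUpdate(String req) {
        if (!seen.add(req)) return;                       // duplicate resend: ignore
        if (successor != null) {
            sendReqList.add(req);
            successor.handleUpdate(req);
        } else {
            committed.add(req);                           // tail: commit
            if (predecessor != null) predecessor.onAck(req);
        }
    }

    // Acks flow tail -> head, pruning each SendReqList along the way.
    void onAck(String req) {
        sendReqList.remove(req);
        committed.add(req);
        if (predecessor != null) predecessor.onAck(req);
    }

    // Middle-successor failure: splice it out and resend anything the new
    // successor may have missed (safe because delivery is idempotent).
    void repairSuccessorFailure() {
        ChainNode next = successor.successor;             // assumes a middle node failed
        successor = next;
        next.predecessor = this;
        for (String pending : new ArrayList<>(sendReqList)) {
            next.handleUpdate(pending);
        }
    }
}
```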
<h2 id="pros-and-cons">Pros and Cons</h2>
<p>The main advantages of Chain Replication include:</p>
<ul>
<li><strong>High Throughput</strong>: By distributing the workload across all nodes, CR effectively increases the throughput of a single node.</li>
<li><strong>Balanced Load</strong>: Each node has a similar workload, resulting in balanced utilization.</li>
<li><strong>Simplicity</strong>: The overall design is clean and straightforward, making it easier to implement.</li>
</ul>
<p>However, there are some clear disadvantages:</p>
<ol>
<li><strong>Bottlenecks</strong>: If a node in the chain processes requests slowly, it will delay the entire chain’s processing.</li>
<li><strong>Read Limitations</strong>: Only the tail serves reads, while the head accepts writes; the data in the middle nodes exists purely for replication and fault tolerance rather than for serving requests. However, the CRAQ (Chain Replication with Apportioned Queries) variant allows middle nodes to serve read-only requests, similar to Raft’s Read Index, which can help alleviate this limitation.</li>
</ol>
<h2 id="references">References</h2>
<ul>
<li><a href="https://tanxinyu.work/chain-replication-thesis/">Chain Replication Paper Summary</a></li>
<li><a href="http://nil.csail.mit.edu/6.824/2021/papers/cr-osdi04.pdf">Original CR Paper</a></li>
</ul>
MIT6.824-ZooKeeper
https://noneback.github.io/blog/mit6.824-zookeeper/
Tue, 03 Jan 2023 23:49:41 +0800https://noneback.github.io/blog/mit6.824-zookeeper/<p>This article mainly discusses the design and practical considerations of the ZooKeeper system, such as wait-free and lock mechanisms, consistency choices, system-provided APIs, and specific semantic decisions. These trade-offs are the most insightful aspects of this article.</p>
<h2 id="positioning">Positioning</h2>
<p>ZooKeeper is a wait-free, high-performance coordination service for distributed applications. It supports the coordination needs of distributed applications by providing coordination primitives (specific APIs and data models).</p>
<h2 id="design">Design</h2>
<h3 id="keywords">Keywords</h3>
<p>There are two key phrases in ZooKeeper’s positioning: <strong>high performance</strong> and <strong>distributed application coordination service</strong>.</p>
<p>ZooKeeper’s high performance is achieved through wait-free design, local reads from multiple replicas, and the watch mechanism:</p>
<ul>
<li>Wait-free: requests are handled asynchronously, which may reorder them relative to real time, so the state machine can diverge from the wall-clock sequence; ZooKeeper compensates by guaranteeing FIFO client order. Asynchronous handling is also conducive to batching and pipelining, further improving performance.</li>
<li>The watch mechanism notifies clients of updates when a znode changes, reducing the overhead of clients querying local caches.</li>
<li>Local reads from multiple replicas: ZooKeeper uses the ZAB protocol to reach consensus on writes, making write operations linearizable. Read requests, however, are served locally from replicas without going through ZAB; they may return stale data, but each client still observes its own operations in FIFO order, trading strict freshness for performance.</li>
</ul>
<p>The distributed application coordination service refers to the data model and API semantics provided by ZooKeeper, allowing distributed applications to freely use them to fulfill coordination needs such as group membership and distributed locking.</p>
<h3 id="data-model-and-api">Data Model and API</h3>
<p>ZooKeeper provides an abstraction of data nodes called znodes, which are organized through a hierarchical namespace. ZooKeeper offers two types of znodes: regular and ephemeral. Each znode stores data and is accessed using standard UNIX filesystem paths.</p>
<p>In practice, znodes are not designed for general data storage. Instead, znodes map to abstractions in client applications, often corresponding to <strong>metadata</strong> used for coordination.</p>
<blockquote>
<p>In other words, when coordinating through ZooKeeper, utilize the metadata associated with znodes instead of treating them as mere data storage. For example, znodes associate metadata with timestamps and version counters, allowing clients to track changes to the znodes and perform conditional updates based on the znode version.</p>
</blockquote>
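The version-guarded update the quote describes is what ZooKeeper exposes through <code>setData(path, data, expectedVersion)</code>, which fails if the version no longer matches. A minimal in-memory sketch of those semantics (not the real client API):

```java
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of the version-guarded update semantics ZooKeeper exposes
// via setData(path, data, expectedVersion); not the real client library.
class MiniZnodeStore {
    static final class Znode {
        byte[] data;
        int version;                 // bumped on every successful setData
    }

    private final Map<String, Znode> nodes = new HashMap<>();

    void create(String path, byte[] data) {
        Znode z = new Znode();
        z.data = data;
        z.version = 0;
        nodes.put(path, z);
    }

    int getVersion(String path) { return nodes.get(path).version; }

    // Conditional update: succeeds only if the caller's expected version
    // matches, mirroring ZooKeeper's BadVersion failure. -1 skips the check.
    boolean setData(String path, byte[] data, int expectedVersion) {
        Znode z = nodes.get(path);
        if (expectedVersion != -1 && expectedVersion != z.version) {
            return false;            // another client updated the znode first
        }
        z.data = data;
        z.version++;
        return true;
    }
}
```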
<p>Essentially, this data model is a simplified file system API that supports full data reads and writes. Users implement distributed application coordination using the semantics provided by ZooKeeper.</p>
<blockquote>
<p>The difference between regular and ephemeral znodes is that ephemeral nodes are automatically deleted when the session ends.</p>
</blockquote>
<p><img alt="img" src="https://s3.us-west-2.amazonaws.com/secure.notion-static.com/c9c4c039-a334-4c00-946c-743e6ab984d9/Untitled.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20230103%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20230103T155342Z&X-Amz-Expires=86400&X-Amz-Signature=7b1041157b56fe404023a2303762de9bb599c57d116bc10b9f46e1733f67bbc2&X-Amz-SignedHeaders=host&response-content-disposition=filename%3D\"Untitled.png\"&x-id=GetObject"></p>
<p>Clients interact with ZooKeeper through its API, and ZooKeeper manages client connections through sessions. In a session, clients can observe state changes that reflect their operations.</p>
<h2 id="cap-guarantees">CAP Guarantees</h2>
<p>ZooKeeper provides CP (Consistency and Partition Tolerance) guarantees. For instance, during leader election, ZooKeeper will stop serving requests until a new leader is elected, ensuring consistency.</p>
<h2 id="implementation">Implementation</h2>
<p><img alt="img" src="https://s3.us-west-2.amazonaws.com/secure.notion-static.com/cb5e3866-1ce2-4897-aa47-c486c10aba12/Untitled.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20230103%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20230103T155414Z&X-Amz-Expires=86400&X-Amz-Signature=35715be3617f7544fc7fcc05705f99a32d46e0ca9c31af2d51f383148f316f32&X-Amz-SignedHeaders=host&response-content-disposition=filename%3D\"Untitled.png\"&x-id=GetObject"></p>
<p>ZooKeeper uses multiple replicas to achieve high availability.</p>
<p>In simple terms, ZooKeeper’s upper layer uses the ZAB protocol to handle write requests, ensuring linearizability across replicas, while reads are processed locally with sequential consistency. The state machine lives in an in-memory replicated database backed by a Write-Ahead Log (WAL) on each cluster machine, with periodic snapshots for durability. Crash safety and fast recovery come from combining fuzzy snapshots with WAL replay.</p>
<blockquote>
<p>The advantage of fuzzy snapshots is that they do not block online requests.</p>
</blockquote>
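Why a fuzzy snapshot is safe to use can be sketched in a few lines: the snapshot may already reflect some logged transactions, so recovery re-applies the WAL from the start and relies on idempotent, "set to absolute value" transactions, as ZAB uses. Names and shapes here are illustrative, not ZooKeeper's actual classes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of fuzzy-snapshot recovery: re-applying a transaction that the
// snapshot already absorbed is harmless because each txn sets an absolute
// value. Illustrative model, not ZooKeeper internals.
class FuzzyRecovery {
    static final class Txn {
        final long zxid; final String key; final String value;
        Txn(long zxid, String key, String value) {
            this.zxid = zxid; this.key = key; this.value = value;
        }
    }

    static Map<String, String> recover(Map<String, String> fuzzySnapshot, List<Txn> wal) {
        Map<String, String> state = new HashMap<>(fuzzySnapshot);
        for (Txn t : wal) {
            state.put(t.key, t.value);   // idempotent: applying twice = applying once
        }
        return state;
    }
}
```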
<h3 id="interaction-with-clients">Interaction with Clients</h3>
<ul>
<li>Update operations will notify and clear the relevant znode’s watch.</li>
<li>Read requests are processed locally, and the partial order of write requests is defined by <code>zxid</code>. Sequential consistency is ensured, but reads may be stale. ZooKeeper provides the <code>sync</code> operation, which can mitigate this to some extent.</li>
<li>When a client connects to a new ZooKeeper server, the client’s last-seen <code>zxid</code> is compared with the server’s; a server whose state lags behind the client will not establish the session.</li>
<li>Clients maintain sessions through heartbeats, and the server handles requests idempotently.</li>
</ul>
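The reconnect rule in the third bullet can be sketched as a tiny client-side model: clients remember the highest <code>zxid</code> they have observed and refuse to attach to a server whose state lags behind it. Illustrative only; real clients receive the <code>zxid</code> piggybacked on every response:

```java
// Illustrative model of the reconnect rule: a client never attaches to a
// server that has applied fewer transactions than the client has seen,
// since that would violate the client's ordering guarantees.
class ZkClient {
    private long lastSeenZxid = 0;

    // Record the zxid carried on each server response.
    void observe(long zxid) {
        lastSeenZxid = Math.max(lastSeenZxid, zxid);
    }

    // Reject servers whose state is older than what this client has seen.
    boolean canConnectTo(long serverLastZxid) {
        return serverLastZxid >= lastSeenZxid;
    }

    long lastSeenZxid() { return lastSeenZxid; }
}
```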
<h2 id="references">References</h2>
<p><a href="https://pdos.csail.mit.edu/6.824/papers/zookeeper.pdf">ZooKeeper Paper</a></p>
<p><a href="https://pdos.csail.mit.edu/6.824/papers/zookeeper-faq.txt">MIT6.824-ZooKeeper FAQ</a></p>
Flink-Iceberg-Connector Write Process
https://noneback.github.io/blog/flinkicebergconnector%E5%86%99%E5%85%A5%E6%B5%81%E7%A8%8B/
Mon, 10 Oct 2022 10:43:38 +0800https://noneback.github.io/blog/flinkicebergconnector%E5%86%99%E5%85%A5%E6%B5%81%E7%A8%8B/<p>The Iceberg community provides an official Flink Connector, and this chapter’s source code analysis is based on that.</p>
<h2 id="overview-of-the-write-submission-process">Overview of the Write Submission Process</h2>
<p>Flink writes data through <code>RowData -> distributeStream -> WriterStream -> CommitterStream</code>. Before data is committed, it is stored as intermediate files, which become visible to the system after being committed (through writing manifest, snapshot, and metadata files).</p>
<p><img alt="Flink-Iceberg Write Flow" src="https://intranetproxy.alipay.com/skylark/lark/0/2022/png/59256351/1655962006990-826460c7-b6fc-4efe-a8e0-65cc080ffea9.png"></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">private</span> <span style="color:#f92672"><</span>T<span style="color:#f92672">></span> DataStreamSink<span style="color:#f92672"><</span>T<span style="color:#f92672">></span> <span style="color:#a6e22e">chainIcebergOperators</span>() {
</span></span><span style="display:flex;"><span> Preconditions.<span style="color:#a6e22e">checkArgument</span>(inputCreator <span style="color:#f92672">!=</span> <span style="color:#66d9ef">null</span>,
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"Please use forRowData() or forMapperOutputType() to initialize the input DataStream."</span>);
</span></span><span style="display:flex;"><span> Preconditions.<span style="color:#a6e22e">checkNotNull</span>(tableLoader, <span style="color:#e6db74">"Table loader shouldn't be null"</span>);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> DataStream<span style="color:#f92672"><</span>RowData<span style="color:#f92672">></span> rowDataInput <span style="color:#f92672">=</span> inputCreator.<span style="color:#a6e22e">apply</span>(uidPrefix);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (table <span style="color:#f92672">==</span> <span style="color:#66d9ef">null</span>) {
</span></span><span style="display:flex;"><span> tableLoader.<span style="color:#a6e22e">open</span>();
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">try</span> (TableLoader loader <span style="color:#f92672">=</span> tableLoader) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">this</span>.<span style="color:#a6e22e">table</span> <span style="color:#f92672">=</span> loader.<span style="color:#a6e22e">loadTable</span>();
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">catch</span> (IOException e) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">throw</span> <span style="color:#66d9ef">new</span> UncheckedIOException(<span style="color:#e6db74">"Failed to load iceberg table from table loader: "</span> <span style="color:#f92672">+</span> tableLoader, e);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> List<span style="color:#f92672"><</span>Integer<span style="color:#f92672">></span> equalityFieldIds <span style="color:#f92672">=</span> checkAndGetEqualityFieldIds();
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> RowType flinkRowType <span style="color:#f92672">=</span> toFlinkRowType(table.<span style="color:#a6e22e">schema</span>(), tableSchema);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> DataStream<span style="color:#f92672"><</span>RowData<span style="color:#f92672">></span> distributeStream <span style="color:#f92672">=</span> distributeDataStream(
</span></span><span style="display:flex;"><span> rowDataInput, table.<span style="color:#a6e22e">properties</span>(), equalityFieldIds, table.<span style="color:#a6e22e">spec</span>(), table.<span style="color:#a6e22e">schema</span>(), flinkRowType);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> SingleOutputStreamOperator<span style="color:#f92672"><</span>WriteResult<span style="color:#f92672">></span> writerStream <span style="color:#f92672">=</span> appendWriter(distributeStream, flinkRowType,
</span></span><span style="display:flex;"><span> equalityFieldIds);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> SingleOutputStreamOperator<span style="color:#f92672"><</span>Void<span style="color:#f92672">></span> committerStream <span style="color:#f92672">=</span> appendCommitter(writerStream);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> appendDummySink(committerStream);
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="write-process-source-code-analysis">Write Process Source Code Analysis</h2>
<h3 id="writestream">WriteStream</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">private</span> SingleOutputStreamOperator<span style="color:#f92672"><</span>WriteResult<span style="color:#f92672">></span> <span style="color:#a6e22e">appendWriter</span>(DataStream<span style="color:#f92672"><</span>RowData<span style="color:#f92672">></span> input, RowType flinkRowType,
</span></span><span style="display:flex;"><span> List<span style="color:#f92672"><</span>Integer<span style="color:#f92672">></span> equalityFieldIds) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">boolean</span> upsertMode <span style="color:#f92672">=</span> upsert <span style="color:#f92672">||</span> PropertyUtil.<span style="color:#a6e22e">propertyAsBoolean</span>(table.<span style="color:#a6e22e">properties</span>(),
</span></span><span style="display:flex;"><span> UPSERT_ENABLED, UPSERT_ENABLED_DEFAULT);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (upsertMode) {
</span></span><span style="display:flex;"><span> Preconditions.<span style="color:#a6e22e">checkState</span>(<span style="color:#f92672">!</span>overwrite,
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"OVERWRITE mode shouldn't be enabled when configuring to use UPSERT data stream."</span>);
</span></span><span style="display:flex;"><span> Preconditions.<span style="color:#a6e22e">checkState</span>(<span style="color:#f92672">!</span>equalityFieldIds.<span style="color:#a6e22e">isEmpty</span>(),
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"Equality field columns shouldn't be empty when configuring to use UPSERT data stream."</span>);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (<span style="color:#f92672">!</span>table.<span style="color:#a6e22e">spec</span>().<span style="color:#a6e22e">isUnpartitioned</span>()) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> (PartitionField partitionField : table.<span style="color:#a6e22e">spec</span>().<span style="color:#a6e22e">fields</span>()) {
</span></span><span style="display:flex;"><span> Preconditions.<span style="color:#a6e22e">checkState</span>(equalityFieldIds.<span style="color:#a6e22e">contains</span>(partitionField.<span style="color:#a6e22e">sourceId</span>()),
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"In UPSERT mode, partition field '%s' should be included in equality fields: '%s'"</span>,
</span></span><span style="display:flex;"><span> partitionField, equalityFieldColumns);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> IcebergStreamWriter<span style="color:#f92672"><</span>RowData<span style="color:#f92672">></span> streamWriter <span style="color:#f92672">=</span> createStreamWriter(table, flinkRowType, equalityFieldIds, upsertMode);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> parallelism <span style="color:#f92672">=</span> writeParallelism <span style="color:#f92672">==</span> <span style="color:#66d9ef">null</span> <span style="color:#f92672">?</span> input.<span style="color:#a6e22e">getParallelism</span>() : writeParallelism;
</span></span><span style="display:flex;"><span> SingleOutputStreamOperator<span style="color:#f92672"><</span>WriteResult<span style="color:#f92672">></span> writerStream <span style="color:#f92672">=</span> input
</span></span><span style="display:flex;"><span> .<span style="color:#a6e22e">transform</span>(operatorName(ICEBERG_STREAM_WRITER_NAME), TypeInformation.<span style="color:#a6e22e">of</span>(WriteResult.<span style="color:#a6e22e">class</span>), streamWriter)
</span></span><span style="display:flex;"><span> .<span style="color:#a6e22e">setParallelism</span>(parallelism);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (uidPrefix <span style="color:#f92672">!=</span> <span style="color:#66d9ef">null</span>) {
</span></span><span style="display:flex;"><span> writerStream <span style="color:#f92672">=</span> writerStream.<span style="color:#a6e22e">uid</span>(uidPrefix <span style="color:#f92672">+</span> <span style="color:#e6db74">"-writer"</span>);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> writerStream;
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The <code>WriterStream</code> operator is transformed from the <code>distributeStream</code>, with <code>RowData</code> as input and <code>WriteResult</code> as output. The transformation logic is encapsulated in the <code>IcebergStreamWriter</code>, which processes each element using <code>processElement</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">private</span> <span style="color:#66d9ef">transient</span> TaskWriter<span style="color:#f92672"><</span>T<span style="color:#f92672">></span> writer;
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@Override</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">processElement</span>(StreamRecord<span style="color:#f92672"><</span>T<span style="color:#f92672">></span> element) <span style="color:#66d9ef">throws</span> Exception {
</span></span><span style="display:flex;"><span> writer.<span style="color:#a6e22e">write</span>(element.<span style="color:#a6e22e">getValue</span>());
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><code>IcebergStreamWriter</code> delegates the writing to a <code>TaskWriter</code> created by <code>TaskWriterFactory</code>. The specific type could be <code>PartitionedDeltaWriter</code> or <code>UnpartitionedWriter</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">public</span> TaskWriter<span style="color:#f92672"><</span>RowData<span style="color:#f92672">></span> <span style="color:#a6e22e">create</span>() {
</span></span><span style="display:flex;"><span> Preconditions.<span style="color:#a6e22e">checkNotNull</span>(outputFileFactory,
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"The outputFileFactory shouldn't be null if we have invoked the initialize()."</span>);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (equalityFieldIds <span style="color:#f92672">==</span> <span style="color:#66d9ef">null</span> <span style="color:#f92672">||</span> equalityFieldIds.<span style="color:#a6e22e">isEmpty</span>()) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (spec.<span style="color:#a6e22e">isUnpartitioned</span>()) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">new</span> UnpartitionedWriter<span style="color:#f92672"><></span>(spec, format, appenderFactory, outputFileFactory, io, targetFileSizeBytes);
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">new</span> RowDataPartitionedFanoutWriter(spec, format, appenderFactory, outputFileFactory,
</span></span><span style="display:flex;"><span> io, targetFileSizeBytes, schema, flinkSchema);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (spec.<span style="color:#a6e22e">isUnpartitioned</span>()) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">new</span> UnpartitionedDeltaWriter(spec, format, appenderFactory, outputFileFactory, io,
</span></span><span style="display:flex;"><span> targetFileSizeBytes, schema, flinkSchema, equalityFieldIds, upsert);
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">new</span> PartitionedDeltaWriter(spec, format, appenderFactory, outputFileFactory, io,
</span></span><span style="display:flex;"><span> targetFileSizeBytes, schema, flinkSchema, equalityFieldIds, upsert);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h3 id="committerstream">CommitterStream</h3>
<p>The <code>CommitterStream</code> receives <code>WriteResult</code> as input with no output. <code>WriteResult</code> contains the data files produced by <code>WriteStream</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">class</span> <span style="color:#a6e22e">WriteResult</span> <span style="color:#66d9ef">implements</span> Serializable {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> DataFile<span style="color:#f92672">[]</span> dataFiles;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> DeleteFile<span style="color:#f92672">[]</span> deleteFiles;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> CharSequence<span style="color:#f92672">[]</span> referencedDataFiles;
</span></span><span style="display:flex;"><span> ...
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The core logic for processing data file submissions is encapsulated in <code>IcebergFilesCommitter</code>. The <code>IcebergFilesCommitter</code> maintains a list of files that need to be committed for each checkpoint. Once a checkpoint completes, it tries to commit those files to Iceberg.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">IcebergFilesCommitter</span> <span style="color:#66d9ef">extends</span> AbstractStreamOperator<span style="color:#f92672"><</span>Void<span style="color:#f92672">></span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">implements</span> OneInputStreamOperator<span style="color:#f92672"><</span>WriteResult, Void<span style="color:#f92672">></span>, BoundedOneInput {
</span></span><span style="display:flex;"><span> ...
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">final</span> NavigableMap<span style="color:#f92672"><</span>Long, <span style="color:#66d9ef">byte</span><span style="color:#f92672">[]></span> dataFilesPerCheckpoint <span style="color:#f92672">=</span> Maps.<span style="color:#a6e22e">newTreeMap</span>();
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">final</span> List<span style="color:#f92672"><</span>WriteResult<span style="color:#f92672">></span> writeResultsOfCurrentCkpt <span style="color:#f92672">=</span> Lists.<span style="color:#a6e22e">newArrayList</span>();
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">transient</span> ListState<span style="color:#f92672"><</span>SortedMap<span style="color:#f92672"><</span>Long, <span style="color:#66d9ef">byte</span><span style="color:#f92672">[]>></span> checkpointsState;
</span></span><span style="display:flex;"><span> ...
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The <code>processElement</code> method stores <code>WriteResult</code> from upstream in <code>writeResultsOfCurrentCkpt</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#a6e22e">@Override</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">processElement</span>(StreamRecord<span style="color:#f92672"><</span>WriteResult<span style="color:#f92672">></span> element) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">this</span>.<span style="color:#a6e22e">writeResultsOfCurrentCkpt</span>.<span style="color:#a6e22e">add</span>(element.<span style="color:#a6e22e">getValue</span>());
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>During checkpointing (<code>snapshotState</code>), it saves the current checkpoint’s data in <code>dataFilesPerCheckpoint</code>. Later, once the checkpoint is completed (<code>notifyCheckpointComplete</code>), it commits the files:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">snapshotState</span>(StateSnapshotContext context) <span style="color:#66d9ef">throws</span> Exception {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">long</span> checkpointId <span style="color:#f92672">=</span> context.<span style="color:#a6e22e">getCheckpointId</span>();
</span></span><span style="display:flex;"><span> LOG.<span style="color:#a6e22e">info</span>(<span style="color:#e6db74">"Start to flush snapshot state to state backend, table: {}, checkpointId: {}"</span>, table, checkpointId);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> dataFilesPerCheckpoint.<span style="color:#a6e22e">put</span>(checkpointId, writeToManifest(checkpointId));
</span></span><span style="display:flex;"><span> checkpointsState.<span style="color:#a6e22e">clear</span>();
</span></span><span style="display:flex;"><span> checkpointsState.<span style="color:#a6e22e">add</span>(dataFilesPerCheckpoint);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> jobIdState.<span style="color:#a6e22e">clear</span>();
</span></span><span style="display:flex;"><span> jobIdState.<span style="color:#a6e22e">add</span>(flinkJobId);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> writeResultsOfCurrentCkpt.<span style="color:#a6e22e">clear</span>();
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@Override</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">notifyCheckpointComplete</span>(<span style="color:#66d9ef">long</span> checkpointId) <span style="color:#66d9ef">throws</span> Exception {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (checkpointId <span style="color:#f92672">></span> maxCommittedCheckpointId) {
</span></span><span style="display:flex;"><span> commitUpToCheckpoint(dataFilesPerCheckpoint, flinkJobId, checkpointId);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">this</span>.<span style="color:#a6e22e">maxCommittedCheckpointId</span> <span style="color:#f92672">=</span> checkpointId;
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The commit logic is handled by <code>commitUpToCheckpoint</code>, which generates a new snapshot and adds it to Iceberg’s metadata:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">private</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">commitUpToCheckpoint</span>(NavigableMap<span style="color:#f92672"><</span>Long, <span style="color:#66d9ef">byte</span><span style="color:#f92672">[]></span> deltaManifestsMap,
</span></span><span style="display:flex;"><span> String newFlinkJobId,
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">long</span> checkpointId) <span style="color:#66d9ef">throws</span> IOException {
</span></span><span style="display:flex;"><span> NavigableMap<span style="color:#f92672"><</span>Long, <span style="color:#66d9ef">byte</span><span style="color:#f92672">[]></span> pendingMap <span style="color:#f92672">=</span> deltaManifestsMap.<span style="color:#a6e22e">headMap</span>(checkpointId, <span style="color:#66d9ef">true</span>);
</span></span><span style="display:flex;"><span> List<span style="color:#f92672"><</span>ManifestFile<span style="color:#f92672">></span> manifests <span style="color:#f92672">=</span> Lists.<span style="color:#a6e22e">newArrayList</span>();
</span></span><span style="display:flex;"><span> NavigableMap<span style="color:#f92672"><</span>Long, WriteResult<span style="color:#f92672">></span> pendingResults <span style="color:#f92672">=</span> Maps.<span style="color:#a6e22e">newTreeMap</span>();
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> (Map.<span style="color:#a6e22e">Entry</span><span style="color:#f92672"><</span>Long, <span style="color:#66d9ef">byte</span><span style="color:#f92672">[]></span> e : pendingMap.<span style="color:#a6e22e">entrySet</span>()) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (Arrays.<span style="color:#a6e22e">equals</span>(EMPTY_MANIFEST_DATA, e.<span style="color:#a6e22e">getValue</span>())) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">continue</span>;
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> DeltaManifests deltaManifests <span style="color:#f92672">=</span> SimpleVersionedSerialization
</span></span><span style="display:flex;"><span> .<span style="color:#a6e22e">readVersionAndDeSerialize</span>(DeltaManifestsSerializer.<span style="color:#a6e22e">INSTANCE</span>, e.<span style="color:#a6e22e">getValue</span>());
</span></span><span style="display:flex;"><span> pendingResults.<span style="color:#a6e22e">put</span>(e.<span style="color:#a6e22e">getKey</span>(), FlinkManifestUtil.<span style="color:#a6e22e">readCompletedFiles</span>(deltaManifests, table.<span style="color:#a6e22e">io</span>()));
</span></span><span style="display:flex;"><span> manifests.<span style="color:#a6e22e">addAll</span>(deltaManifests.<span style="color:#a6e22e">manifests</span>());
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> totalFiles <span style="color:#f92672">=</span> pendingResults.<span style="color:#a6e22e">values</span>().<span style="color:#a6e22e">stream</span>()
</span></span><span style="display:flex;"><span> .<span style="color:#a6e22e">mapToInt</span>(r <span style="color:#f92672">-></span> r.<span style="color:#a6e22e">dataFiles</span>().<span style="color:#a6e22e">length</span> <span style="color:#f92672">+</span> r.<span style="color:#a6e22e">deleteFiles</span>().<span style="color:#a6e22e">length</span>).<span style="color:#a6e22e">sum</span>();
</span></span><span style="display:flex;"><span> continuousEmptyCheckpoints <span style="color:#f92672">=</span> totalFiles <span style="color:#f92672">==</span> 0 <span style="color:#f92672">?</span> continuousEmptyCheckpoints <span style="color:#f92672">+</span> 1 : 0;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (totalFiles <span style="color:#f92672">!=</span> 0 <span style="color:#f92672">||</span> continuousEmptyCheckpoints <span style="color:#f92672">%</span> maxContinuousEmptyCommits <span style="color:#f92672">==</span> 0) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (replacePartitions) {
</span></span><span style="display:flex;"><span> replacePartitions(pendingResults, newFlinkJobId, checkpointId);
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> commitDeltaTxn(pendingResults, newFlinkJobId, checkpointId);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> continuousEmptyCheckpoints <span style="color:#f92672">=</span> 0;
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> pendingMap.<span style="color:#a6e22e">clear</span>();
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> (ManifestFile manifest : manifests) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">try</span> {
</span></span><span style="display:flex;"><span> table.<span style="color:#a6e22e">io</span>().<span style="color:#a6e22e">deleteFile</span>(manifest.<span style="color:#a6e22e">path</span>());
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">catch</span> (Exception e) {
</span></span><span style="display:flex;"><span> LOG.<span style="color:#a6e22e">warn</span>(<span style="color:#e6db74">"The iceberg transaction has been committed, but we failed to clean the temporary flink manifests: {}"</span>,
</span></span><span style="display:flex;"><span> manifest.<span style="color:#a6e22e">path</span>(), e);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The snapshot produced by <code>commitDeltaTxn</code> or <code>replacePartitions</code> is ultimately persisted through the table operations&rsquo; <code>commit</code> method, which throws <code>CommitFailedException</code> on a stale base so that the caller can refresh and retry:</p><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">commit</span>(TableMetadata base, TableMetadata metadata) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (base <span style="color:#f92672">!=</span> current()) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (base <span style="color:#f92672">!=</span> <span style="color:#66d9ef">null</span>) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">throw</span> <span style="color:#66d9ef">new</span> CommitFailedException(<span style="color:#e6db74">"Cannot commit: stale table metadata"</span>);
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">throw</span> <span style="color:#66d9ef">new</span> AlreadyExistsException(<span style="color:#e6db74">"Table already exists: %s"</span>, tableName());
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (base <span style="color:#f92672">==</span> metadata) {
</span></span><span style="display:flex;"><span> LOG.<span style="color:#a6e22e">info</span>(<span style="color:#e6db74">"Nothing to commit."</span>);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span>;
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">long</span> start <span style="color:#f92672">=</span> System.<span style="color:#a6e22e">currentTimeMillis</span>();
</span></span><span style="display:flex;"><span> doCommit(base, metadata);
</span></span><span style="display:flex;"><span> deleteRemovedMetadataFiles(base, metadata);
</span></span><span style="display:flex;"><span> requestRefresh();
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> LOG.<span style="color:#a6e22e">info</span>(<span style="color:#e6db74">"Successfully committed to table {} in {} ms"</span>,
</span></span><span style="display:flex;"><span> tableName(),
</span></span><span style="display:flex;"><span> System.<span style="color:#a6e22e">currentTimeMillis</span>() <span style="color:#f92672">-</span> start);
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="write-issues">Write Issues</h2>
<h3 id="1-lots-of-small-files">1. Lots of Small Files</h3>
<p>Streaming writes generate new files on every checkpoint, producing many small files. Although object storage handles small files well, Iceberg metadata must track every data file, so manifest and metadata files grow over time and degrade both query planning and commit performance.</p>
<p><strong>Solution:</strong></p>
<ul>
<li><strong>Iceberg Rewrite Action</strong>: Iceberg supports rewriting data and metadata files via Flink or Spark actions, which need to be triggered separately.</li>
<li><strong>Snapshot Expiry</strong>: Configure snapshot expiration to periodically delete old snapshots.</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#f92672">import</span> org.apache.iceberg.flink.actions.Actions;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>TableLoader tableLoader <span style="color:#f92672">=</span> TableLoader.<span style="color:#a6e22e">fromHadoopTable</span>(<span style="color:#e6db74">"hdfs://nn:8020/warehouse/path"</span>);
</span></span><span style="display:flex;"><span>Table table <span style="color:#f92672">=</span> tableLoader.<span style="color:#a6e22e">loadTable</span>();
</span></span><span style="display:flex;"><span>RewriteDataFilesActionResult result <span style="color:#f92672">=</span> Actions.<span style="color:#a6e22e">forTable</span>(table)
</span></span><span style="display:flex;"><span> .<span style="color:#a6e22e">rewriteDataFiles</span>()
</span></span><span style="display:flex;"><span> .<span style="color:#a6e22e">execute</span>();
</span></span></code></pre></div><p><a href="https://iceberg.apache.org/docs/latest/flink/">Iceberg Flink Documentation</a>
<a href="https://iceberg.apache.org/docs/latest/maintenance/">Iceberg Maintenance Documentation</a></p>
<h3 id="2-performance-issues-with-high-concurrency">2. Performance Issues with High Concurrency</h3>
<p>Iceberg’s writing process creates a new snapshot for each commit and uses optimistic concurrency control to handle conflicts. In high-concurrency scenarios, this can lead to many commits being retried, impacting performance.</p>
<p><strong>Solution:</strong></p>
<ul>
<li><strong>Batch Commit</strong>: Introduce a caching layer or additional service to batch commits to the data lake, reducing the number of concurrent commit operations. This cache layer can also compact multiple data files before committing.</li>
</ul>
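<p>The batching idea can be sketched as a stand-alone class: write results are buffered as they arrive, and a table commit is issued only once a batch fills up, so many files share one snapshot. This is an illustration of the pattern, not part of the Flink Iceberg connector; the names <code>BatchingCommitter</code> and <code>commitFn</code> are hypothetical.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch of a batching commit layer: callers hand over write
// results as they arrive, but an actual table commit is issued only once
// per full batch, reducing snapshot churn under high concurrency.
// BatchingCommitter and commitFn are hypothetical names, not Iceberg APIs.
class BatchingCommitter {
    private final List<String> pending = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<String>> commitFn;
    private int commits = 0;

    BatchingCommitter(int batchSize, Consumer<List<String>> commitFn) {
        this.batchSize = batchSize;
        this.commitFn = commitFn;
    }

    void add(String writeResult) {
        pending.add(writeResult);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    void flush() {
        if (pending.isEmpty()) {
            return; // mirror the skipping of empty commits seen above
        }
        commitFn.accept(new ArrayList<>(pending)); // one snapshot for many files
        pending.clear();
        commits++;
    }

    int commitCount() {
        return commits;
    }
}
```

<p>A cache layer like this can also compact the buffered files before the single commit, addressing the small-file problem at the same time.</p>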
<blockquote>
<p>References:
<a href="https://zhuanlan.zhihu.com/p/472617094">Optimizing Iceberg Writes for High Concurrency</a>
<a href="https://www.infoq.cn/article/hfft7c7ahoomgayjsouz">InfoQ Article on Iceberg Optimization</a></p>
</blockquote>
<h3 id="3-flink-iceberg-connector-limitations">3. Flink Iceberg Connector Limitations</h3>
<p>The Flink Iceberg Connector does not support hidden partitions or preprocessing of partition fields.</p>
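<p>Hidden partitioning means the partition value is derived from a source column by a transform rather than stored as an explicit column, so a writer that lacks the feature must preprocess the field itself. The sketch below shows what a <code>day(timestamp)</code>-style transform computes (days since the Unix epoch); it is a self-contained illustration, not Iceberg&rsquo;s transform implementation, and the names are hypothetical.</p>

```java
import java.time.LocalDate;

// Sketch of a "day" partition transform like the one hidden partitioning
// applies internally: a timestamp column is mapped to days since the Unix
// epoch, so neither writers nor queries handle the partition column directly.
class DayTransform {
    // epochMillis -> ordinal day; floorDiv keeps pre-1970 timestamps correct
    static long toDayOrdinal(long epochMillis) {
        return Math.floorDiv(epochMillis, 86_400_000L);
    }

    // Human-readable partition path segment, e.g. "ts_day=2022-10-05"
    static String toPartitionPath(long epochMillis) {
        LocalDate day = LocalDate.ofEpochDay(toDayOrdinal(epochMillis));
        return "ts_day=" + day;
    }
}
```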
Apache-ORC Quick Investigation
https://noneback.github.io/blog/apacheorc%E8%B0%83%E7%A0%94/
Wed, 05 Oct 2022 19:56:01 +0800https://noneback.github.io/blog/apacheorc%E8%B0%83%E7%A0%94/<p>Iceberg supports both ORC and Parquet columnar formats. Compared to Parquet, ORC offers advantages in query performance and ACID support. Considering the future data lakehouse requirements for query performance and ACID compliance, we are researching ORC to support a future demo involving Flink, Iceberg, and ORC.</p>
<p>Research Focus: ORC file encoding, file organization, and indexing support.</p>
<h2 id="file-layout">File Layout</h2>
<p>An ORC file can be divided into three main sections:</p>
<ul>
<li><strong>Header</strong>: Identifies the file type.</li>
<li><strong>Body</strong>: Contains row data and indexes, as shown below.</li>
<li><strong>Tail</strong>: Contains top-level file information.</li>
</ul>
<blockquote>
<p>ORC Specification v1</p>
</blockquote>
<p><img alt="File Layout" src="https://intranetproxy.alipay.com/skylark/lark/0/2022/png/59256351/1654164675197-b3513a38-dee1-4fea-a582-1e800542dc06.png#clientId=ubff13205-800f-4&crop=0&crop=0&crop=1&crop=1&from=paste&id=udfb74091&margin=%5Bobject%20Object%5D&name=image.png&originHeight=567&originWidth=580&originalType=binary&ratio=1&rotation=0&showTitle=false&size=134693&status=done&style=none&taskId=u3792c89f-94b1-497c-81db-a5f9ae97297&title="></p>
<h3 id="file-tail">File Tail</h3>
<p>Since distributed storage generally supports only append-only semantics, the ORC file maintains a tail section for top-level file information.</p>
<p>The tail contains:</p>
<ul>
<li><strong>Postscript</strong>: Contains essential information for parsing the footer and metadata, such as the length of each section and compression method.</li>
<li><strong>Footer</strong>: Stores schema information, row count, column statistics, and more.</li>
<li><strong>Stripe Statistics and Metadata</strong>: Includes column-level statistics.</li>
</ul>
<h4 id="postscript">Postscript</h4>
<p>The postscript is uncompressed and contains:</p>
<ul>
<li>Footer length</li>
<li>Compression type</li>
<li>Metadata length</li>
<li>File identifier (“ORC”)</li>
</ul>
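<p>Because the file is written front-to-back, a reader bootstraps from the end: per the ORC specification, the last byte of the file stores the postscript length, which lets the reader slice out the postscript and, from the lengths inside it, locate the footer. A simplified sketch of that first step over an in-memory byte array (real postscripts are protobuf-encoded; this only shows the slicing mechanics):</p>

```java
import java.util.Arrays;

// Simplified sketch of the ORC tail bootstrap: the last byte of the file
// gives the postscript length, so the reader can slice the postscript and
// then use lengths stored inside it to find the footer. Real postscripts
// are protobuf messages; here we slice raw bytes to show the mechanics.
class TailReader {
    // Returns the postscript bytes (excluding the trailing length byte).
    static byte[] readPostscript(byte[] file) {
        int psLen = file[file.length - 1] & 0xFF; // last byte = postscript length
        int end = file.length - 1;                // postscript sits just before it
        return Arrays.copyOfRange(file, end - psLen, end);
    }
}
```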
<h4 id="footer">Footer</h4>
<p>The footer includes the schema, row count, column-level statistics, and a list of stripes that make up the file body.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-protobuf" data-lang="protobuf"><span style="display:flex;"><span><span style="color:#66d9ef">message</span> <span style="color:#a6e22e">Footer</span> {<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> headerLength <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> contentLength <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> StripeInformation stripes <span style="color:#f92672">=</span> <span style="color:#ae81ff">3</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> Type types <span style="color:#f92672">=</span> <span style="color:#ae81ff">4</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> UserMetadataItem metadata <span style="color:#f92672">=</span> <span style="color:#ae81ff">5</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> numberOfRows <span style="color:#f92672">=</span> <span style="color:#ae81ff">6</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> ColumnStatistics statistics <span style="color:#f92672">=</span> <span style="color:#ae81ff">7</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint32</span> rowIndexStride <span style="color:#f92672">=</span> <span style="color:#ae81ff">8</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint32</span> writer <span style="color:#f92672">=</span> <span style="color:#ae81ff">9</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> Encryption encryption <span style="color:#f92672">=</span> <span style="color:#ae81ff">10</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> stripeStatisticsLength <span style="color:#f92672">=</span> <span style="color:#ae81ff">11</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>}<span style="color:#960050;background-color:#1e0010">
</span></span></span></code></pre></div><ul>
<li><strong>Stripe Information</strong>: Data in the body is organized into multiple <strong>stripes</strong>. Each stripe contains a row index, row data (stored column-wise), and a stripe footer.</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-protobuf" data-lang="protobuf"><span style="display:flex;"><span><span style="color:#66d9ef">message</span> <span style="color:#a6e22e">StripeInformation</span> {<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> offset <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> indexLength <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> dataLength <span style="color:#f92672">=</span> <span style="color:#ae81ff">3</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> footerLength <span style="color:#f92672">=</span> <span style="color:#ae81ff">4</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint64</span> numberOfRows <span style="color:#f92672">=</span> <span style="color:#ae81ff">5</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>}<span style="color:#960050;background-color:#1e0010">
</span></span></span></code></pre></div><ul>
<li><strong>Type Information</strong>: ORC uses a tree structure to represent nested data types; every row in the file must conform to this single schema, as in the example below:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#66d9ef">create</span> <span style="color:#66d9ef">table</span> Foobar (
</span></span><span style="display:flex;"><span> myInt int,
</span></span><span style="display:flex;"><span> myMap <span style="color:#66d9ef">map</span><span style="color:#f92672"><</span>string, struct<span style="color:#f92672"><</span>myString : string, myDouble: double<span style="color:#f92672">>></span>,
</span></span><span style="display:flex;"><span> myTime <span style="color:#66d9ef">timestamp</span>
</span></span><span style="display:flex;"><span>);
</span></span></code></pre></div><ul>
<li><strong>Column Statistics</strong>: Simple statistics for each column are available to support coarse-grained filtering.</li>
</ul>
<h2 id="stripes">Stripes</h2>
<p>The body of an ORC file is split into <strong>stripes</strong>, which are large chunks of data (typically ~200MB) that contain:</p>
<ul>
<li><strong>Index Data</strong></li>
<li><strong>Row Data</strong></li>
<li><strong>Stripe Footer</strong></li>
</ul>
<p>The <strong>Stripe Footer</strong> holds column encoding details and stream-related information, such as compression and encryption methods.</p>
<h2 id="index-support">Index Support</h2>
<p>ORC supports three levels of indexing:</p>
<table>
<thead>
<tr>
<th>Level</th>
<th>Location</th>
<th>Data Content</th>
</tr>
</thead>
<tbody>
<tr>
<td>File Level</td>
<td>File Footer</td>
<td>Column-level statistics for the entire file</td>
</tr>
<tr>
<td>Stripe Level</td>
<td>File Footer</td>
<td>Column-level statistics for each stripe</td>
</tr>
<tr>
<td>Row Level</td>
<td>Beginning of Stripe</td>
<td>Statistics for each row group and their start position</td>
</tr>
</tbody>
</table>
<h3 id="row-level-index">Row Level Index</h3>
<p>The row-level index contains <strong>Row Group Index</strong> and <strong>Bloom Filter Index</strong>.</p>
<h4 id="row-group-index">Row Group Index</h4>
<p>Indexes for primitive types are represented by <strong>ROW_INDEX</strong> streams, with each row group containing a <strong>RowIndexEntry</strong>.</p>
<blockquote>
<p>Default row group size: 10,000 rows</p>
</blockquote>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-protobuf" data-lang="protobuf"><span style="display:flex;"><span><span style="color:#66d9ef">message</span> <span style="color:#a6e22e">RowIndex</span> {<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> RowIndexEntry entry <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>}<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span><span style="color:#66d9ef">message</span> <span style="color:#a6e22e">RowIndexEntry</span> {<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> <span style="color:#66d9ef">uint64</span> positions <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span> [<span style="color:#66d9ef">packed</span><span style="color:#f92672">=</span><span style="color:#66d9ef">true</span>];<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> ColumnStatistics statistics <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>}<span style="color:#960050;background-color:#1e0010">
</span></span></span></code></pre></div><h4 id="bloom-filter-index">Bloom Filter Index</h4>
<p>Each column has a <strong>BLOOM_FILTER</strong> stream to help speed up searches.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-protobuf" data-lang="protobuf"><span style="display:flex;"><span><span style="color:#66d9ef">message</span> <span style="color:#a6e22e">BloomFilter</span> {<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">optional</span> <span style="color:#66d9ef">uint32</span> numHashFunctions <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> <span style="color:#66d9ef">repeated</span> <span style="color:#66d9ef">fixed64</span> bitset <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span>;<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>}<span style="color:#960050;background-color:#1e0010">
</span></span></span></code></pre></div><h2 id="data-access-path">Data Access Path</h2>
<ul>
<li><strong>Postscript</strong> -> <strong>Footer</strong> -> Retrieve Stripe Information -> <strong>Stripe Footer</strong> -> <strong>Stripe Index</strong> -> <strong>Row Group</strong> -> <strong>Column</strong></li>
</ul>
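<p>The payoff of this layered layout is that a reader can prune with column statistics at every level before touching row data. The following self-contained sketch shows min/max pruning at the row-group level; it illustrates the idea and is not the ORC reader API.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of stats-based pruning along the ORC read path: each row group
// carries min/max column statistics, and a predicate like "col = v" only
// needs to scan the groups whose [min, max] range can contain v.
class RowGroupPruner {
    static class Stats {
        final long min, max;
        Stats(long min, long max) { this.min = min; this.max = max; }
    }

    // Returns indices of row groups that might contain `value`.
    static List<Integer> candidateGroups(List<Stats> groups, long value) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < groups.size(); i++) {
            Stats s = groups.get(i);
            if (value >= s.min && value <= s.max) {
                hits.add(i); // cannot be ruled out; must be scanned
            }
        }
        return hits;
    }
}
```

<p>The same min/max check applies at the file and stripe levels, and the bloom filter streams refine it further for point lookups.</p>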
<h2 id="references">References</h2>
<ul>
<li><a href="https://webcdn.nexla.com/n3x_ctx/uploads/2018/05/An-Introduction-to-Big-Data-Formats-Nexla.pdf">An Introduction to Big Data Formats - Nexla</a></li>
<li><a href="https://orc.apache.org/docs/">ORC Documentation</a></li>
<li><a href="https://orc.apache.org/specification/ORCv1/">ORC Specification</a></li>
<li><a href="https://cloud.tencent.com/developer/article/1757862">ORC Article by Tencent</a></li>
</ul>
Apache-Iceberg Quick Investigation
https://noneback.github.io/blog/apacheiceberg%E8%B0%83%E7%A0%94/
Wed, 05 Oct 2022 19:55:54 +0800https://noneback.github.io/blog/apacheiceberg%E8%B0%83%E7%A0%94/<ul>
<li>A table format for large-scale analysis of datasets.</li>
<li>A specification for organizing data files and metadata files.</li>
<li>A schema semantic abstraction between storage and computation.</li>
<li>Developed and open-sourced by Netflix to enhance scalability, reliability, and usability.</li>
</ul>
<h2 id="background">Background</h2>
<p>Issues encountered when migrating HIVE to the cloud:</p>
<ul>
<li>Hive&rsquo;s dependency on directory List and Rename semantics makes it impossible to replace HDFS with cheaper object storage (OSS).</li>
<li>Scalability issues: Schema information in Hive is centrally stored in metastore, which can become a performance bottleneck.</li>
<li>Other issues: unsafe metadata operations, unfriendliness to cost-based optimization (CBO), etc.</li>
</ul>
<h2 id="features">Features</h2>
<ul>
<li>Supports safe and efficient schema and partition changes and evolution, self-defined schemas, and hidden partitioning.
<ul>
<li>Iceberg abstracts its own schema, not tied to any compute engine&rsquo;s schema; partitioning is maintained at the schema level. Partition and sort-order fields take transform functions, such as <code>date(timestamp)</code>.</li>
</ul>
</li>
<li>Supports object storage with minimal dependency on FS semantics.</li>
<li>ACID semantics, with parallel reads and serialized write operations:
<ul>
<li>Separation of read and write snapshots.</li>
<li>Write conflicts are handled optimistically: the conflicting writer retries so its commit eventually succeeds.</li>
</ul>
</li>
<li>Snapshot support:
<ul>
<li>Data rollback and time travel.</li>
<li>Supports snapshot expiration (by default, data files are not deleted, but customizable deletion behavior is available) (<a href="https://iceberg.apache.org/javadoc/0.13.1/org/apache/iceberg/ExpireSnapshots.html">related API doc</a>).</li>
<li>Incremental reading can be achieved by comparing snapshot differences.</li>
</ul>
</li>
<li>Query optimization-friendly: predicate pushdown, data file statistics. Currently, compaction is not supported, but invalid files can be deleted during snapshot expiration (<a href="https://iceberg.apache.org/javadoc/0.13.1/org/apache/iceberg/ExpireSnapshots.html#deleteWith-java.util.function.Consumer-">deleteWith</a>).</li>
<li>High level of abstraction, easy to modify, optimize, and extend. The catalog, read/write paths, file formats, and storage dependencies are all pluggable. Iceberg&rsquo;s design goal is to define a standard, open, and general data-organization format that hides differences in underlying storage formats and provides a unified operational API through which different engines connect.</li>
<li>Others: file-level encryption and decryption.</li>
</ul>
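<p>Hidden partitioning from the feature list can be illustrated with a tiny stdlib sketch: the partition spec stores a transform (e.g. <code>day</code>) applied to a source column, so users filter on the raw timestamp and the engine derives the partition value itself. This is conceptual only, not the Iceberg Java API:</p>

```python
from datetime import datetime, timezone

# Conceptual sketch of Iceberg's hidden partitioning: a transform
# declared in the partition spec maps a source column to a partition
# value; queries filter on the raw column, never the derived value.
def day_transform(ts: datetime) -> str:
    """Like Iceberg's day(ts) transform: timestamp -> day partition value."""
    return ts.strftime("%Y-%m-%d")

ts = datetime(2022, 10, 5, 19, 55, tzinfo=timezone.utc)
print(day_transform(ts))  # 2022-10-05
```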
<h2 id="ecosystem">Ecosystem</h2>
<ul>
<li>Community support for OSS, Flink, Spark, and Presto:
<ul>
<li>Flink (<a href="https://iceberg.apache.org/docs/latest/flink/">detail</a>): Supports streaming reads and writes, incremental reads (based on snapshot), upsert write (<a href="https://iceberg.apache.org/releases/#0130-release-notes">0.13.0-release-notes</a>).</li>
<li>Presto: <a href="https://prestodb.io/docs/current/connector/iceberg.html">Iceberg connector</a>.</li>
<li>Aliyun OSS: <a href="https://github.com/apache/iceberg/pull/3686/files"># pr 3689</a>.</li>
</ul>
</li>
<li>Integration with other components:
<ul>
<li>Integration with lower storage layers: Only relies on three semantics: In-place write, Seekable reads, Deletes, supports AliOSS (<a href="https://github.com/apache/iceberg/pull/3686/files"># pr 3689</a>).</li>
<li>Integration with other file formats: High abstraction level, currently supports Avro, Parquet, ORC.</li>
<li>Catalog: Customizable (<a href="https://iceberg.apache.org/docs/latest/custom-catalog/">Doc: Custom Catalog Implementation</a>), currently supports JDBC, Hive Metastore, Hadoop, etc.</li>
<li>Integration with the computation layer: provides native Java & Python APIs with a high level of abstraction, supporting most computation engines.</li>
</ul>
</li>
<li>An open, vendor-neutral community, where contributions can build influence.</li>
</ul>
<h2 id="table-specification">Table Specification</h2>
<p>Specification for organizing data files and metadata files.</p>
<p><img alt="img" src="https://iceberg.apache.org/img/iceberg-metadata.png"></p>
<h3 id="case-spark--iceberg--local-fs">Case: Spark + Iceberg + Local FS</h3>
<p>Iceberg supports Parquet, Avro, ORC file formats.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">#</span> <span style="color:#960050;background-color:#1e0010">Storage</span> <span style="color:#960050;background-color:#1e0010">organization</span>
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">test</span><span style="color:#ae81ff">2</span>
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">├──</span> <span style="color:#960050;background-color:#1e0010">data</span>
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">│</span> <span style="color:#960050;background-color:#1e0010">├──</span> <span style="color:#ae81ff">00000-1</span><span style="color:#960050;background-color:#1e0010">-ccff</span><span style="color:#ae81ff">6767-12</span><span style="color:#960050;background-color:#1e0010">cc</span><span style="color:#ae81ff">-481</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">-93</span><span style="color:#960050;background-color:#1e0010">fc-db</span><span style="color:#ae81ff">9</span><span style="color:#960050;background-color:#1e0010">f</span><span style="color:#ae81ff">1</span><span style="color:#960050;background-color:#1e0010">a</span><span style="color:#ae81ff">57438</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">-00001</span><span style="color:#960050;background-color:#1e0010">.parquet</span>
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">│</span> <span style="color:#960050;background-color:#1e0010">└──</span> <span style="color:#ae81ff">00001-2-6</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">1e5</span><span style="color:#960050;background-color:#1e0010">a</span><span style="color:#ae81ff">0</span><span style="color:#960050;background-color:#1e0010">b</span><span style="color:#ae81ff">-89</span><span style="color:#960050;background-color:#1e0010">fe</span><span style="color:#ae81ff">-4e77</span><span style="color:#960050;background-color:#1e0010">-b</span><span style="color:#ae81ff">90</span><span style="color:#960050;background-color:#1e0010">a</span><span style="color:#ae81ff">-1773</span><span style="color:#960050;background-color:#1e0010">a</span><span style="color:#ae81ff">7</span><span style="color:#960050;background-color:#1e0010">fbbcc</span><span style="color:#ae81ff">8-00001</span><span style="color:#960050;background-color:#1e0010">.parquet</span>
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">└──</span> <span style="color:#960050;background-color:#1e0010">metadata</span>
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">├──</span> <span style="color:#ae81ff">2</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">1</span><span style="color:#960050;background-color:#1e0010">dc</span><span style="color:#ae81ff">0e8</span><span style="color:#ae81ff">-1843-4</span><span style="color:#960050;background-color:#1e0010">cb</span><span style="color:#ae81ff">9-9</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">55</span><span style="color:#960050;background-color:#1e0010">-ae</span><span style="color:#ae81ff">43</span><span style="color:#960050;background-color:#1e0010">f</span><span style="color:#ae81ff">800</span><span style="color:#960050;background-color:#1e0010">bf</span><span style="color:#ae81ff">3</span><span style="color:#960050;background-color:#1e0010">f-m</span><span style="color:#ae81ff">0</span><span style="color:#960050;background-color:#1e0010">.avro</span> <span style="color:#75715e">// manifest file
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#960050;background-color:#1e0010">├──</span> <span style="color:#960050;background-color:#1e0010">snap</span><span style="color:#ae81ff">-8512048775051875497-1-2</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">1</span><span style="color:#960050;background-color:#1e0010">dc</span><span style="color:#ae81ff">0e8</span><span style="color:#ae81ff">-1843-4</span><span style="color:#960050;background-color:#1e0010">cb</span><span style="color:#ae81ff">9-9</span><span style="color:#960050;background-color:#1e0010">c</span><span style="color:#ae81ff">55</span><span style="color:#960050;background-color:#1e0010">-ae</span><span style="color:#ae81ff">43</span><span style="color:#960050;background-color:#1e0010">f</span><span style="color:#ae81ff">800</span><span style="color:#960050;background-color:#1e0010">bf</span><span style="color:#ae81ff">3</span><span style="color:#960050;background-color:#1e0010">f.avro</span> <span style="color:#75715e">// manifest list file
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#960050;background-color:#1e0010">├──</span> <span style="color:#960050;background-color:#1e0010">v</span><span style="color:#ae81ff">1</span><span style="color:#960050;background-color:#1e0010">.metadata.json</span> <span style="color:#75715e">// metadata file
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#960050;background-color:#1e0010">├──</span> <span style="color:#960050;background-color:#1e0010">v</span><span style="color:#ae81ff">2</span><span style="color:#960050;background-color:#1e0010">.metadata.json</span>
</span></span><span style="display:flex;"><span> <span style="color:#960050;background-color:#1e0010">└──</span> <span style="color:#960050;background-color:#1e0010">version-hint.text</span> <span style="color:#75715e">// catalog
</span></span></span></code></pre></div><h4 id="datafile">DataFile</h4>
<p>Data files in columnar format: Parquet, ORC.</p>
<p>There are three types of data files: data files, position delete files, and equality delete files.</p>
<h4 id="manifest-file">Manifest File</h4>
<p>Indexes data files, including statistics and partition information.</p>
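<p>These statistics are what make file pruning possible: an engine compares a predicate value against the per-column lower/upper bounds recorded in each manifest entry and skips files that cannot match. A hedged sketch over the Avro-as-JSON field names used in the dump below (illustrative only; real readers use Iceberg&rsquo;s metrics evaluators):</p>

```python
# Sketch: pruning a data file using manifest column bounds.
# `entry` mimics a manifest entry as dumped in this section; bounds
# are keyed by field id. Illustrative only, not Iceberg's API.
def file_may_match(entry: dict, field_id: int, value) -> bool:
    df = entry["data_file"]
    lower = {b["key"]: b["value"] for b in df["lower_bounds"]["array"]}
    upper = {b["key"]: b["value"] for b in df["upper_bounds"]["array"]}
    if field_id not in lower or field_id not in upper:
        return True  # no stats recorded -> cannot prune, must read the file
    return lower[field_id] <= value <= upper[field_id]

entry = {"data_file": {
    "lower_bounds": {"array": [{"key": 2, "value": "a"}]},
    "upper_bounds": {"array": [{"key": 2, "value": "m"}]},
}}
print(file_may_match(entry, 2, "c"))  # True  (within [a, m])
print(file_may_match(entry, 2, "z"))  # False (outside bounds -> skip file)
```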
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"status"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"snapshot_id"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"long"</span>:<span style="color:#ae81ff">1274364374047997583</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"data_file"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"file_path"</span>:<span style="color:#e6db74">"/tmp/warehouse/db/test3/data/id=1/00000-31-401a9d2e-d501-434c-a38f-5df5f08ebbd7-00001.parquet"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"file_format"</span>:<span style="color:#e6db74">"PARQUET"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"partition"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"id"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"long"</span>:<span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"record_count"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"file_size_in_bytes"</span>:<span style="color:#ae81ff">643</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"block_size_in_bytes"</span>:<span style="color:#ae81ff">67108864</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"column_sizes"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#ae81ff">46</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">2</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#ae81ff">48</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value_counts"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">2</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"null_value_counts"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">2</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"nan_value_counts"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"lower_bounds"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#e6db74">"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">2</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#e6db74">"a"</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"upper_bounds"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#e6db74">"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key"</span>:<span style="color:#ae81ff">2</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"value"</span>:<span style="color:#e6db74">"a"</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"key_metadata"</span>:<span style="color:#66d9ef">null</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"split_offsets"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> <span style="color:#ae81ff">4</span>
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"sort_order_id"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"int"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// another data file meta
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> }
</span></span><span style="display:flex;"><span>]
</span></span></code></pre></div><h4 id="snapshot">Snapshot</h4>
<ul>
<li>Represents the state of a Table at a specific point in time, saved via a Manifest List File.</li>
<li><strong>A new Snapshot is generated every time a data change is made to the Table.</strong></li>
</ul>
<h4 id="manifest-list-file">Manifest List File</h4>
<ul>
<li>Contains information about all Manifest files in a Snapshot, as well as partition stats and data file count.</li>
<li>One Snapshot corresponds to one Manifest List File, and each submission generates a manifest list file.</li>
<li>Optimistic concurrency: when concurrent Snapshot commits conflict, the later commit <strong>retries</strong> until it succeeds.</li>
</ul>
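<p>That retry behavior amounts to a compare-and-swap loop over the current-metadata pointer. A toy in-memory stand-in (the class and function names here are illustrative; Iceberg&rsquo;s real commit path goes through the catalog&rsquo;s atomic swap of the metadata file location):</p>

```python
import threading

# Toy model of Iceberg-style optimistic commits: the catalog holds a
# single "current metadata version" pointer that writers swap atomically;
# a writer that loses the race rebases and retries.
class InMemoryCatalog:
    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0

    def compare_and_swap(self, expected: int, new: int) -> bool:
        """Advance the pointer only if no other writer committed first."""
        with self._lock:
            if self.version != expected:
                return False  # another commit won; caller must retry
            self.version = new
            return True

def commit(catalog: InMemoryCatalog, max_retries: int = 3) -> int:
    """Optimistic commit loop: read base version, attempt CAS, retry on conflict."""
    for _ in range(max_retries):
        base = catalog.version  # read the current pointer
        if catalog.compare_and_swap(base, base + 1):
            return base + 1
    raise RuntimeError("commit failed after repeated conflicts")

cat = InMemoryCatalog()
print(commit(cat))  # 1
print(commit(cat))  # 2
```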
<blockquote>
<p>Each manifest list stores metadata about manifests, including partition stats and data file counts.</p>
</blockquote>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"manifest_path"</span>:<span style="color:#e6db74">"/tmp/warehouse/db/test3/metadata/f22b748f-a7bc-4e4c-ad6c-3e335c1c0c2b-m0.avro"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"manifest_length"</span>:<span style="color:#ae81ff">6019</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"partition_spec_id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"added_snapshot_id"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"long"</span>:<span style="color:#ae81ff">1274364374047997583</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"added_data_files_count"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"int"</span>:<span style="color:#ae81ff">2</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"existing_data_files_count"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"int"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"deleted_data_files_count"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"int"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"partitions"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"array"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"contains_null"</span>:<span style="color:#66d9ef">false</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"contains_nan"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"boolean"</span>:<span style="color:#66d9ef">false</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"lower_bound"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"bytes"</span>:<span style="color:#e6db74">"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"upper_bound"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"bytes"</span>:<span style="color:#e6db74">"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"added_rows_count"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"long"</span>:<span style="color:#ae81ff">2</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"existing_rows_count"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"long"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"deleted_rows_count"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"long"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// another manifest file
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> }
</span></span><span style="display:flex;"><span>]
</span></span></code></pre></div><h4 id="metadata-file">Metadata File</h4>
<p>Tracks the state of the table. When the state changes, a new metadata file is generated and replaces the previous one, <strong>ensuring atomicity</strong>.</p>
<blockquote>
<p>The table metadata file tracks the <strong>table schema</strong>, <strong>partitioning config</strong>, custom properties, and <strong>snapshots</strong> of the table contents.</p>
</blockquote>
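<p>Time travel falls out of this structure: the engine resolves a snapshot from the metadata file&rsquo;s <code>snapshots</code> list by timestamp. A minimal sketch over the JSON layout shown below (illustrative; engines surface this through query syntax or reader options):</p>

```python
# Sketch: resolving a snapshot for time travel from table metadata.
# `metadata` mirrors the metadata file JSON dumped in this section.
def snapshot_as_of(metadata: dict, ts_ms: int):
    """Return the latest snapshot committed at or before ts_ms, else None."""
    eligible = [s for s in metadata["snapshots"] if s["timestamp-ms"] <= ts_ms]
    return max(eligible, key=lambda s: s["timestamp-ms"], default=None)

metadata = {"snapshots": [
    {"snapshot-id": 1, "timestamp-ms": 1000},
    {"snapshot-id": 2, "timestamp-ms": 2000},
]}
print(snapshot_as_of(metadata, 1500)["snapshot-id"])  # 1
print(snapshot_as_of(metadata, 500))                  # None
```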
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"format-version"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"table-uuid"</span>:<span style="color:#e6db74">"175d0b61-8507-40b2-9c19-3338b05f3d48"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"location"</span>:<span style="color:#e6db74">"/tmp/warehouse/db/test3"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"last-updated-ms"</span>:<span style="color:#ae81ff">1653387947819</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"last-column-id"</span>:<span style="color:#ae81ff">2</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"schema"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"type"</span>:<span style="color:#e6db74">"struct"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"schema-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"fields"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"id"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"name"</span>:<span style="color:#e6db74">"id"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"required"</span>:<span style="color:#66d9ef">false</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"type"</span>:<span style="color:#e6db74">"long"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#960050;background-color:#1e0010">Object</span>{<span style="color:#960050;background-color:#1e0010">...</span>}
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"current-schema-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"schemas"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"type"</span>:<span style="color:#e6db74">"struct"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"schema-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"fields"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"id"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"name"</span>:<span style="color:#e6db74">"id"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"required"</span>:<span style="color:#66d9ef">false</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"type"</span>:<span style="color:#e6db74">"long"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#960050;background-color:#1e0010">Object</span>{<span style="color:#960050;background-color:#1e0010">...</span>}
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ],
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"partition-spec"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"name"</span>:<span style="color:#e6db74">"id"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"transform"</span>:<span style="color:#e6db74">"identity"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"source-id"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"field-id"</span>:<span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ],
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"default-spec-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"partition-specs"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"spec-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"fields"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"name"</span>:<span style="color:#e6db74">"id"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"transform"</span>:<span style="color:#e6db74">"identity"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"source-id"</span>:<span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"field-id"</span>:<span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ],
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"last-partition-id"</span>:<span style="color:#ae81ff">1000</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"default-sort-order-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"sort-orders"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"order-id"</span>:<span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"fields"</span>:[
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ],
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"properties"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"owner"</span>:<span style="color:#e6db74">"chenlan"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"current-snapshot-id"</span>:<span style="color:#ae81ff">1274364374047997700</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"snapshots"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"snapshot-id"</span>:<span style="color:#ae81ff">1274364374047997700</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"timestamp-ms"</span>:<span style="color:#ae81ff">1653387947819</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"summary"</span>:{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"operation"</span>:<span style="color:#e6db74">"append"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"spark.app.id"</span>:<span style="color:#e6db74">"local-1653381214613"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"added-data-files"</span>:<span style="color:#e6db74">"2"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"added-records"</span>:<span style="color:#e6db74">"2"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"added-files-size"</span>:<span style="color:#e6db74">"1286"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"changed-partition-count"</span>:<span style="color:#e6db74">"2"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"total-records"</span>:<span style="color:#e6db74">"2"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"total-files-size"</span>:<span style="color:#e6db74">"1286"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"total-data-files"</span>:<span style="color:#e6db74">"2"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"total-delete-files"</span>:<span style="color:#e6db74">"0"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"total-position-deletes"</span>:<span style="color:#e6db74">"0"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"total-equality-deletes"</span>:<span style="color:#e6db74">"0"</span>
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"manifest-list"</span>:<span style="color:#e6db74">"/tmp/warehouse/db/test3/metadata/snap-1274364374047997583-1-f22b748f-a7bc-4e4c-ad6c-3e335c1c0c2b.avro"</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"schema-id"</span>:<span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ],
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"snapshot-log"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"timestamp-ms"</span>:<span style="color:#ae81ff">1653387947819</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"snapshot-id"</span>:<span style="color:#ae81ff">1274364374047997700</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> ],
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"metadata-log"</span>:[
</span></span><span style="display:flex;"><span> {
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"timestamp-ms"</span>:<span style="color:#ae81ff">1653387937345</span>,
</span></span><span style="display:flex;"><span> <span style="color:#f92672">"metadata-file"</span>:<span style="color:#e6db74">"/tmp/warehouse/db/test3/metadata/v1.metadata.json"</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="catalog">Catalog</h4>
<p>Records the latest metadata file path.</p>
<p><img alt="img" src="https://pic3.zhimg.com/80/v2-27edee80bbac03b462898a0564722a56_1440w.jpg"></p>
<h3 id="features-1">Features</h3>
<ul>
<li>ACID semantics guarantee: Atomic table state changes + snapshot-based reads and writes.</li>
<li>Flexible partition management: hidden partition, seamless partition changes.</li>
<li>Supports incremental reads: incremental read of each change using snapshots.</li>
<li>Multi-version data: beneficial for data rollback.</li>
<li>No side effects, safe schema, and partition changes.</li>
</ul>
<h3 id="data-types">Data Types</h3>
<p>Iceberg defines its own type system, which is mapped onto the concrete types of each underlying data file format (Parquet, ORC, Avro).</p>
<ul>
<li>
<p>Nested Types:</p>
<ul>
<li>struct: A tuple of typed values.</li>
<li>list: A collection of values with an element type.</li>
<li>map: A collection of key-value pairs with a key type and a value type.</li>
</ul>
</li>
<li>
<p>Primitive Types:</p>
</li>
</ul>
<table>
<thead>
<tr>
<th>Primitive type</th>
<th>Description</th>
<th>Requirements</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>boolean</strong></td>
<td>True or false</td>
<td></td>
</tr>
<tr>
<td><strong>int</strong></td>
<td>32-bit signed integers</td>
<td>Can promote to <code>long</code></td>
</tr>
<tr>
<td><strong>long</strong></td>
<td>64-bit signed integers</td>
<td></td>
</tr>
<tr>
<td><strong>float</strong></td>
<td><a href="https://en.wikipedia.org/wiki/IEEE_754">32-bit IEEE 754</a> floating point</td>
<td>Can promote to double</td>
</tr>
<tr>
<td><strong>double</strong></td>
<td><a href="https://en.wikipedia.org/wiki/IEEE_754">64-bit IEEE 754</a> floating point</td>
<td></td>
</tr>
<tr>
<td><strong>decimal(P,S)</strong></td>
<td>Fixed-point decimal; precision P, scale S</td>
<td>Scale is fixed [1], precision must be 38 or less</td>
</tr>
<tr>
<td><strong>date</strong></td>
<td>Calendar date without timezone or time</td>
<td></td>
</tr>
<tr>
<td><strong>time</strong></td>
<td>Time of day without date, timezone</td>
<td>Microsecond precision [2]</td>
</tr>
<tr>
<td><strong>timestamp</strong></td>
<td>Timestamp without timezone</td>
<td>Microsecond precision [2]</td>
</tr>
<tr>
<td><strong>timestamptz</strong></td>
<td>Timestamp with timezone</td>
<td>Stored as UTC [2]</td>
</tr>
<tr>
<td><strong>string</strong></td>
<td>Arbitrary-length character sequences</td>
<td>Encoded with UTF-8 [3]</td>
</tr>
<tr>
<td><strong>uuid</strong></td>
<td>Universally unique identifiers</td>
<td>Should use 16-byte fixed</td>
</tr>
<tr>
<td><strong>fixed(L)</strong></td>
<td>Fixed-length byte array of length L</td>
<td></td>
</tr>
<tr>
<td><strong>binary</strong></td>
<td>Arbitrary-length byte array</td>
<td></td>
</tr>
</tbody>
</table>
<h3 id="read--write-paths">Read & Write Paths</h3>
<p>select: catalog -> metadata file -> manifest list file -> manifest file -> data file -> row group.</p>
<p>insert: the reverse order. Data files are written first, then manifest files and the manifest list, and finally the catalog pointer is swapped to the new metadata file.</p>
<p>update: implemented as delete plus insert, producing data files along with position delete files and equality delete files.</p>
<p>Position delete files handle the case where the same row is inserted and then deleted within a single transaction.</p>
<p>delete: row-level delete.</p>
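<p>The select path above can be sketched end to end. The following Python sketch is illustrative only: the dict layout loosely mirrors the metadata JSON shown earlier, and <code>plan_scan</code> is a made-up name, not an Iceberg API.</p>

```python
# Illustrative sketch (not Iceberg's real API): resolving a read along the
# select path -- catalog -> metadata -> manifest list -> manifests -> data files.

def plan_scan(catalog, table_name):
    """Walk the metadata chain and return the data files a scan would read."""
    metadata = catalog[table_name]                  # catalog stores the latest metadata
    snap_id = metadata["current-snapshot-id"]
    snapshot = next(s for s in metadata["snapshots"]
                    if s["snapshot-id"] == snap_id)
    manifest_list = snapshot["manifest-list"]       # one list per snapshot
    data_files = []
    for manifest in manifest_list:                  # each manifest tracks data files
        data_files.extend(manifest["data-files"])
    return data_files

catalog = {
    "db.test3": {
        "current-snapshot-id": 1274364374047997700,
        "snapshots": [{
            "snapshot-id": 1274364374047997700,
            "manifest-list": [
                {"data-files": ["data/a.parquet"]},
                {"data-files": ["data/b.parquet"]},
            ],
        }],
    }
}
print(plan_scan(catalog, "db.test3"))   # -> ['data/a.parquet', 'data/b.parquet']
```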
<h2 id="references">References</h2>
<p><a href="https://iceberg.apache.org/spec/">Iceberg Spec</a></p>
<p><a href="https://www.datatong.net/thread-39745-1-1.html">Flink+Iceberg Data Lake Construction</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/347660549">Construction Practice of Real-time Data Warehouse with Flink + Iceberg (Chinese)</a></p>
<p><a href="https://github.com/apache/iceberg/pull/3686">Iceberg Aliyun OSS</a></p>
<p><a href="https://iceberg.apache.org/docs/latest/flink/">Iceberg Flink Support</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/305746643">Building Enterprise-grade Real-time Data Lake with Flink + Iceberg</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/353030161">How Flink Analyzes CDC Data in Iceberg Real-time Data Lake</a></p>
<p><a href="https://github.com/apache/iceberg">Iceberg GitHub</a></p>
<p><a href="https://docs.alluxio.io/os/user/stable/en/api/POSIX-API.html">Alluxio POSIX API</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/110748218">Comparison of Delta, Iceberg, and Hudi Open-source Data Lake Solutions</a></p>
LevelDB Write
https://noneback.github.io/blog/leveldb-write/
Tue, 10 May 2022 17:14:14 +0800https://noneback.github.io/blog/leveldb-write/<p>This is the second chapter of my notes on reading the LevelDB source code, focusing on the write flow of LevelDB. This article is not a step-by-step source code tutorial, but rather a learning note that records my questions and thoughts.</p>
<h2 id="main-process">Main Process</h2>
<p>The main write logic of LevelDB is relatively simple. First, the write operation is encapsulated into a <code>WriteBatch</code>, and then it is executed.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span>Status DB<span style="color:#f92672">::</span>Put(<span style="color:#66d9ef">const</span> WriteOptions<span style="color:#f92672">&</span> opt, <span style="color:#66d9ef">const</span> Slice<span style="color:#f92672">&</span> key, <span style="color:#66d9ef">const</span> Slice<span style="color:#f92672">&</span> value) {
</span></span><span style="display:flex;"><span> WriteBatch batch;
</span></span><span style="display:flex;"><span> batch.Put(key, value);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">Write</span>(opt, <span style="color:#f92672">&</span>batch);
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h3 id="writebatch">WriteBatch</h3>
<p><code>WriteBatch</code> is an encapsulation of a group of update operations, which are applied <strong>atomically</strong> to the state machine. A block of memory is used to save the user’s update operations.</p>
<blockquote>
<p>InMemory Format:
<code>| seq_num: 8 bytes | count: 4 bytes | list of records{ type + key + value}</code></p>
</blockquote>
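<p>The <code>rep_</code> layout in that comment can be reproduced in a few lines. The sketch below is a toy Python re-implementation of the encoding (fixed64 sequence, fixed32 count, varint-length-prefixed records); it mirrors LevelDB's field names but is not its real code.</p>

```python
import struct

# Toy model of WriteBatch::rep_:
# | seq (fixed64) | count (fixed32) | records... |, each record being a
# type byte + length-prefixed key (+ length-prefixed value for puts).

kTypeDeletion, kTypeValue = 0, 1

def put_varint32(buf, n):
    # LevelDB's varint encoding: 7 payload bits per byte, MSB = continuation
    while n >= 0x80:
        buf.append((n & 0x7F) | 0x80)
        n >>= 7
    buf.append(n)

def batch_put(rep, key, value):
    count = struct.unpack_from("<I", rep, 8)[0]
    struct.pack_into("<I", rep, 8, count + 1)   # bump the fixed32 count in place
    rep.append(kTypeValue)
    put_varint32(rep, len(key)); rep.extend(key)
    put_varint32(rep, len(value)); rep.extend(value)

rep = bytearray(12)          # 8-byte sequence + 4-byte count, both zero
batch_put(rep, b"k1", b"v1")
batch_put(rep, b"k2", b"v2")
print(struct.unpack_from("<I", rep, 8)[0])   # -> 2
```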
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">WriteBatch</span> {
</span></span><span style="display:flex;"><span> ...
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// See comment in write_batch.cc for the format of rep_;
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// WriteBatch::rep_ :=
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// sequence: fixed64
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// count: fixed32
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// data: record[count]
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// record :=
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// kTypeValue varstring varstring |
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// kTypeDeletion varstring
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// varstring :=
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// len: varint32
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// data: uint8[len]
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> std<span style="color:#f92672">::</span>string rep_;
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span><span style="color:#75715e">// some opt
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#66d9ef">void</span> WriteBatch<span style="color:#f92672">::</span>Put(<span style="color:#66d9ef">const</span> Slice<span style="color:#f92672">&</span> key, <span style="color:#66d9ef">const</span> Slice<span style="color:#f92672">&</span> value) {
</span></span><span style="display:flex;"><span> WriteBatchInternal<span style="color:#f92672">::</span>SetCount(<span style="color:#66d9ef">this</span>, WriteBatchInternal<span style="color:#f92672">::</span>Count(<span style="color:#66d9ef">this</span>) <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>);
</span></span><span style="display:flex;"><span> rep_.push_back(<span style="color:#66d9ef">static_cast</span><span style="color:#f92672"><</span><span style="color:#66d9ef">char</span><span style="color:#f92672">></span>(kTypeValue));
</span></span><span style="display:flex;"><span> PutLengthPrefixedSlice(<span style="color:#f92672">&</span>rep_, key);
</span></span><span style="display:flex;"><span> PutLengthPrefixedSlice(<span style="color:#f92672">&</span>rep_, value);
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="write-flow">Write Flow</h2>
<p>The write operation mainly consists of four steps:</p>
<h3 id="initializing-writer">Initializing Writer</h3>
<p><code>Writer</code> actually contains all the information needed for a write operation.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#75715e">// Information kept for every waiting writer
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">DBImpl</span><span style="color:#f92672">::</span>Writer {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">explicit</span> <span style="color:#a6e22e">Writer</span>(port<span style="color:#f92672">::</span>Mutex<span style="color:#f92672">*</span> mu)
</span></span><span style="display:flex;"><span> <span style="color:#f92672">:</span> batch(<span style="color:#66d9ef">nullptr</span>), sync(false), done(false), cv(mu) {}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> Status status;
</span></span><span style="display:flex;"><span> WriteBatch<span style="color:#f92672">*</span> batch;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> sync;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> done;
</span></span><span style="display:flex;"><span> port<span style="color:#f92672">::</span>CondVar cv;
</span></span><span style="display:flex;"><span>};
</span></span></code></pre></div><h3 id="writer-scheduling">Writer Scheduling</h3>
<p>LevelDB’s write process is a multi-producer, single-consumer model: multiple threads enqueue writers, but at any given moment only one thread (the one at the head of the queue) consumes them. A writer is therefore produced by exactly one thread, yet its operation may end up being executed by a different thread on its behalf.</p>
<p>Internally, LevelDB maintains a writer queue. Each thread’s write, delete, or update operation appends a writer to the end of the queue, with a lock ensuring data safety. Once a writer is added to the queue, the thread waits until it either reaches the head of the queue (is scheduled) or its operation is completed by another thread.</p>
<blockquote>
<p>When consuming writers and executing the actual update operations, LevelDB optimizes the write task by merging writers of the same type (based on the sync flag).</p>
</blockquote>
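<p>The merging step can be sketched as follows. This is an illustrative Python model of <code>BuildBatchGroup</code> with locks and condition variables omitted; the stop condition mirrors LevelDB's rule that a sync writer is never merged into a non-sync group.</p>

```python
from collections import deque

# Sketch of the scheduling idea: the writer at the head of the queue merges
# in queued writers with a compatible sync flag, so one WAL write serves
# many producers. Names mirror LevelDB's but this is not its real code.

class Writer:
    def __init__(self, batch, sync):
        self.batch, self.sync, self.done = batch, sync, False

def build_batch_group(queue):
    head = queue[0]
    group, grouped = list(head.batch), [head]
    for w in list(queue)[1:]:
        if w.sync and not head.sync:
            break                    # never promote a non-sync group to sync
        group.extend(w.batch)
        grouped.append(w)
    return group, grouped

q = deque([Writer(["put a"], False),
           Writer(["put b"], False),
           Writer(["del c"], True)])   # sync writer stays for the next group
group, grouped = build_batch_group(q)
print(group)   # -> ['put a', 'put b']
```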
<h3 id="writing-writer-batches">Writing Writer Batches</h3>
<ul>
<li>First, <code>MakeRoomForWrite</code> ensures that the <code>memtable</code> has enough space and that the Write-Ahead Log (WAL) can guarantee a successful write. If the current <code>memtable</code> has enough space, it is reused. Otherwise, a new <code>memtable</code> and WAL are created, and the previous <code>memtable</code> is converted to an immutable <code>memtable</code>, awaiting compaction (minor compaction is serialized).</li>
<li>The function also checks the number of Level 0 files and decides whether to throttle the write rate based on configurations and triggers. There are two main configurations:
<ul>
<li><strong>Slowdown Trigger</strong>: This trigger causes write threads to sleep, slowing down writes so that compaction tasks can proceed. It also limits the number of Level 0 files to ensure read efficiency.</li>
<li><strong>Stop Writes Trigger</strong>: Stops write threads entirely until compaction brings the Level 0 file count back down.</li>
</ul>
</li>
<li><code>BuildBatchGroup</code> merges batches from writers of the same type starting from the head of the queue into a single batch.</li>
<li>The merged batch is then written first to the WAL and then to the <code>memtable</code>.</li>
</ul>
<h4 id="write-ahead-log-wal">Write-Ahead Log (WAL)</h4>
<p>The WAL file is split into blocks, and each block consists of records. The format is as follows:</p>
<blockquote>
<p><code>| Header{ checksum(4 bytes) + len(2 bytes) + type(1 byte)} | Data |</code></p>
</blockquote>
<p>The record types are <code>Zero</code>, <code>Full</code>, <code>First</code>, <code>Middle</code>, and <code>Last</code>. <code>Zero</code> is reserved for preallocated files. Since a key-value pair might be too large and needs to be recorded in several chunks, the other four types are used accordingly.</p>
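<p>The chunking rule can be made concrete. Below is a hypothetical Python sketch of how a large payload is split into First/Middle/Last fragments across 32 KB blocks; checksums and trailer padding are omitted.</p>

```python
kBlockSize, kHeaderSize = 32768, 7   # 4-byte checksum + 2-byte len + 1-byte type
FULL, FIRST, MIDDLE, LAST = 1, 2, 3, 4

def fragment(payload, block_offset=0):
    """Return (type, chunk_len) fragments for one logical record."""
    frags, begin = [], True
    while True:
        leftover = kBlockSize - block_offset
        if leftover < kHeaderSize:       # no room for a header: move to new block
            block_offset = 0
            continue
        avail = leftover - kHeaderSize
        chunk, payload = payload[:avail], payload[avail:]
        end = len(payload) == 0
        if begin and end:   rtype = FULL
        elif begin:         rtype = FIRST
        elif end:           rtype = LAST
        else:               rtype = MIDDLE
        frags.append((rtype, len(chunk)))
        block_offset += kHeaderSize + len(chunk)
        begin = False
        if end:
            return frags

print(fragment(b"x" * 70000))   # -> [(2, 32761), (3, 32761), (4, 4478)]
```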
<h5 id="write-flow-1">Write Flow</h5>
<p><img alt="Write Flow" src="https://s2.loli.net/2022/08/15/3TOkslnhjuMIg9Y.jpg"></p>
<h4 id="memtable">Memtable</h4>
<p>The core design of <code>memtable</code> involves two parts: the skip list and the encoding of key-value pairs in the skip list. The skip list ensures the sorted nature of inserted data. For more details, you can refer to my other blog post.</p>
<p>The key encoded in <code>memtable</code> is called the <code>Internal Key</code>, which is encoded as follows:</p>
<blockquote>
<p><code>| Original Key(varstring) + seq num(7 bytes) + type(1 byte) |</code></p>
</blockquote>
<p><img alt="Internal Key" src="https://leveldb-handbook.readthedocs.io/zh/latest/_images/internalkey.jpeg"></p>
<blockquote>
<p><code>SeqNum</code> is a monotonically increasing sequence number generated for each update operation, serving as a logical clock to indicate the recency of operations. Based on <code>SeqNum</code>, snapshot-based reads (versioned reads) can be implemented.</p>
</blockquote>
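<p>The trailer packing and its unusual sort order can be sketched like this. Assumptions in the sketch: the 8-byte little-endian trailer packs the sequence number shifted left by 8 bits OR-ed with the type byte, and the comparator orders by user key ascending, then sequence number descending, so the newest record comes first.</p>

```python
import struct

kTypeDeletion, kTypeValue = 0, 1

def make_internal_key(user_key, seq, vtype):
    # 8-byte trailer: 56-bit sequence number in the high bits, type in the low byte
    return user_key + struct.pack("<Q", (seq << 8) | vtype)

def internal_compare(a, b):
    """User key ascending, then sequence number DESCENDING,
    so the newest version of a key is encountered first."""
    ka, kb = a[:-8], b[:-8]
    if ka != kb:
        return -1 if ka < kb else 1
    ta = struct.unpack("<Q", a[-8:])[0]
    tb = struct.unpack("<Q", b[-8:])[0]
    return 0 if ta == tb else (-1 if ta > tb else 1)

old = make_internal_key(b"k", 5, kTypeValue)
new = make_internal_key(b"k", 9, kTypeValue)
print(internal_compare(new, old))   # -> -1: the newer record sorts first
```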
<h2 id="reference">Reference</h2>
<ul>
<li><a href="https://leveldb-handbook.readthedocs.io/zh/latest/index.html">LevelDB handbook</a></li>
</ul>
MIT6.824-RaftKV
https://noneback.github.io/blog/mit6.824-raftkv/
Fri, 15 Apr 2022 10:49:57 +0800https://noneback.github.io/blog/mit6.824-raftkv/<p>Earlier, I looked at the code of Casbin-Mesh because I wanted to try GSOC. Casbin-Mesh is a distributed Casbin application based on Raft. This RaftKV in MIT6.824 is quite similar, so I took the opportunity to write this blog.</p>
<h2 id="lab-overview">Lab Overview</h2>
<p>Lab 03 involves building a distributed KV service based on Raft. We need to implement the server and client for this service.</p>
<p>The structure of RaftKV and the interaction between its modules are shown below:</p>
<p><img alt="image-20220429211429808" src="https://s2.loli.net/2022/04/29/xuQMp28PRH7rheb.png"></p>
<p>Compared to the previous lab, the difficulty is significantly lower. For implementation, you can refer to this excellent <a href="https://github.com/OneSizeFitsQuorum/MIT6.824-2021/blob/master/docs/lab3.md">implementation</a>, so I won’t elaborate too much.</p>
<h2 id="raft-related-topics">Raft-Related Topics</h2>
<p>Let’s talk about Raft and its interactions with clients.</p>
<h3 id="routing-and-linearizability">Routing and Linearizability</h3>
<p>To build a service that allows client access on top of Raft, the issues of <strong>routing</strong> and <strong>linearizability</strong> must first be addressed.</p>
<h4 id="routing">Routing</h4>
<p>Raft is a <strong>Strong Leader</strong> consensus algorithm, and read and write requests usually need to be executed by the Leader. When a client queries the Raft cluster, it typically randomly selects a node. If that node is not the Leader, it returns the Leader information to the client, and the client redirects the request to the Leader.</p>
<h4 id="linearizability">Linearizability</h4>
<p>Out of the box, Raft only provides <strong>At-Least-Once</strong> semantics: for a single client request, the state machine may apply the same command multiple times (for example, after a client retry), which breaks the linearizability such systems are expected to provide.</p>
<p>To achieve linearizability, it is clear that requests need to be made idempotent.</p>
<p>A basic approach is for the client to assign a unique UID to each request, and the server maintains a session using this <code>UID</code> to cache the response of successful requests. When a duplicate request arrives at the server, it can respond directly using the cached response, thus achieving idempotency.</p>
<p>Of course, this introduces the issue of session management, but that is not the focus of this article.</p>
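<p>A minimal sketch of this dedup table follows. Names like <code>KVServer</code> are illustrative, not from the lab skeleton; real code would also handle per-client request ordering and session expiry.</p>

```python
# The client tags each request with (client_id, request_seq); the server caches
# the last response per client and replays it for duplicates instead of
# re-applying the command, making retried requests idempotent.

class KVServer:
    def __init__(self):
        self.kv = {}
        self.sessions = {}   # client_id -> (last_seq, last_response)

    def apply(self, client_id, seq, op, key, value=None):
        last = self.sessions.get(client_id)
        if last is not None and last[0] == seq:
            return last[1]                  # duplicate request: replay cached reply
        if op == "put":
            self.kv[key] = value
            resp = "ok"
        else:
            resp = self.kv.get(key)
        self.sessions[client_id] = (seq, resp)
        return resp

s = KVServer()
s.apply("c1", 1, "put", "x", "1")
s.apply("c1", 1, "put", "x", "2")   # retry of request 1: deduplicated
print(s.kv["x"])                    # prints 1, not 2
```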
<h3 id="read-only-optimization">Read-Only Optimization</h3>
<p>After solving the above two problems, we have a usable Raft-based service.</p>
<p>However, we notice that whether it’s a read or write request, our application needs to go through a round of <code>AppendEntries</code> communication initiated by the Leader. It also requires successful quorum ACKs and additional disk write operations before the log is committed, after which the result can be returned to the client.</p>
<p>Write operations change the state machine, so these are necessary steps for write requests. However, read operations do not change the state machine, and we can optimize read requests to bypass the Raft log, reducing the overhead of synchronous write operations on disk IO.</p>
<p>The problem is that without additional measures, read-only query results that bypass the Raft log may become stale.</p>
<blockquote>
<p>For example, if a network partition separates the old Leader from the majority that has elected a new Leader, queries served by the old Leader may return stale data.</p>
</blockquote>
<p>The Raft paper mentions two methods to bypass the Raft log and optimize read-only requests: <strong>Read Index</strong> and <strong>Lease Read</strong>.</p>
<h4 id="read-index">Read Index</h4>
<p>The <strong>Read Index</strong> approach needs to address several issues:</p>
<ul>
<li>Committed logs from the old term</li>
</ul>
<blockquote>
<p>For example, if the old Leader commits a log but crashes before sending heartbeats, other nodes will elect a new Leader. According to the Raft paper, the new Leader does not proactively commit logs from the old Leader.</p>
<p>To solve this, the new Leader commits a no-op entry immediately after election; committing it indirectly commits all earlier entries as well.</p>
</blockquote>
<ul>
<li>Gap between <code>commitIndex</code> and <code>appliedIndex</code></li>
</ul>
<blockquote>
<p>Introduce a <code>readIndex</code> variable, where the Leader saves the current <code>commitIndex</code> in a local variable called <code>readIndex</code>. This acts as a boundary for applying the log, and when a read-only request arrives, the log must be applied up to the position recorded by <code>readIndex</code> before the Leader can query the state machine to provide read services.</p>
</blockquote>
<ul>
<li>Ensure no Leader change when providing read-only services</li>
</ul>
<blockquote>
<p>To achieve this, after receiving a read request, the Leader first sends a heartbeat and needs to receive quorum ACKs to ensure there is no other Leader with a higher term, thus ensuring that <code>readIndex</code> is the highest committed index in the cluster.</p>
</blockquote>
<p>For the specific process and optimizations like Batch and Follower Read, refer to the author’s PhD dissertation on Raft.</p>
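<p>The three requirements above can be condensed into a sketch. The quorum heartbeat check is stubbed out as a boolean and all names are illustrative:</p>

```python
# Read Index flow: confirm leadership via a heartbeat quorum, record
# commitIndex as readIndex, apply the log up to readIndex, then read the
# state machine. This is a single-node model, not a real Raft implementation.

class Leader:
    def __init__(self):
        self.commit_index = 0
        self.applied_index = 0
        self.state_machine = {}
        self.log = [None]          # 1-based log of (key, value) entries

    def apply_up_to(self, index):
        while self.applied_index < index:
            self.applied_index += 1
            key, value = self.log[self.applied_index]
            self.state_machine[key] = value

    def read_index_query(self, key, heartbeat_quorum_ok):
        if not heartbeat_quorum_ok:    # leadership not confirmed: must retry
            return None
        read_index = self.commit_index  # boundary for this read
        self.apply_up_to(read_index)    # wait until appliedIndex >= readIndex
        return self.state_machine.get(key)

l = Leader()
l.log += [("x", 1), ("x", 2)]
l.commit_index = 2                  # both committed, none applied yet
print(l.read_index_query("x", True))   # -> 2
```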
<h4 id="lease-read">Lease Read</h4>
<p>The <strong>Read Index</strong> approach only optimizes the overhead of disk IO, but still requires a round of network communication. However, this overhead can also be optimized, leading to the <strong>Lease Read</strong> approach.</p>
<p>The <strong>core idea</strong> of <strong>Lease Read</strong> is to use the fact that a Leader Election requires at least one <code>ElectionTimeout</code> time period. During this period, the system will not conduct a new election, thereby avoiding Leader changes when providing read-only services. We can use clocks to optimize network IO.</p>
<h5 id="implementation">Implementation</h5>
<p>To let the clock replace network communication, we need an additional lease mechanism. Once the Leader’s <code>Heartbeat</code> is approved by a quorum, the Leader can assume that no other node can become Leader during the <code>ElectionTimeout</code> period, and it can extend its lease accordingly. While holding the lease, the Leader can directly serve read-only queries without extra network communication.</p>
<p>However, there may be <strong>clock drift</strong> among servers, which means Followers cannot ensure that the Leader will not time out during the lease. This introduces the critical design for <code>Lease Read</code>: <strong>what strategy should be used to extend the lease?</strong></p>
<p>The paper assumes that $ClockDrift$ is bounded, and when a heartbeat successfully updates the lease, the lease is extended to $start + \frac{ElectionTimeout}{ClockDriftBound}$.</p>
<p>$ClockDriftBound$ represents the limit of clock drift in the cluster, but discovering and maintaining this limit is challenging due to many real-time factors that cause clock drift.</p>
<blockquote>
<p>For instance, garbage collection (GC), virtual machine scheduling, cloud machine scaling, etc.</p>
</blockquote>
<p>In practice, some safety is usually sacrificed for <code>Lease Read</code> performance. Generally, the lease is extended to $StartTime + ElectionTimeout - \Delta{t}$, where $\Delta{t}$ is a positive value. This reduces the lease extension time compared to <code>ElectionTimeout</code>, trading off between network IO overhead and safety.</p>
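<p>Both extension strategies are easy to compare numerically. A sketch with illustrative constants (all times in milliseconds):</p>

```python
# The dissertation formula divides ElectionTimeout by the clock-drift bound;
# the common engineering variant subtracts a safety margin delta_t instead.

ELECTION_TIMEOUT = 1000.0

def lease_end_paper(start, clock_drift_bound):
    # start + ElectionTimeout / ClockDriftBound
    return start + ELECTION_TIMEOUT / clock_drift_bound

def lease_end_practice(start, delta_t):
    # start + ElectionTimeout - delta_t
    return start + ELECTION_TIMEOUT - delta_t

def can_serve_read(now, lease_end):
    return now < lease_end

end = lease_end_practice(0.0, delta_t=100.0)      # lease valid for 900 ms
print(can_serve_read(850.0, end), can_serve_read(950.0, end))   # -> True False
```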
<h2 id="summary">Summary</h2>
<p>When building a Raft-based service, it is crucial to design routing and idempotency mechanisms for accessing the service.</p>
<p>For read-only operations, there are two main optimization methods: <strong>Read Index</strong> and <strong>Lease Read</strong>. The former optimizes disk IO during read operations, while the latter uses clocks to optimize network IO.</p>
<h2 id="references">References</h2>
<p><a href="https://github.com/OneSizeFitsQuorum/MIT6.824-2021/blob/master/docs/lab3.md">Implementation Doc</a></p>
<p><a href="https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf">Raft Paper</a></p>
<p><a href="https://pdos.csail.mit.edu/6.824/index.html">MIT6.824 Official</a></p>
<p><a href="https://github.com/OneSizeFitsQuorum/raft-thesis-zh_cn">Consensus: Bridging Theory and Practice - zh</a></p>
<p><a href="https://pingcap.com/zh/blog/lease-read">Tikv Lease-Read</a></p>
LevelDB Startup
https://noneback.github.io/blog/leveldb-%E5%90%AF%E5%8A%A8/
Sat, 09 Apr 2022 14:43:25 +0800https://noneback.github.io/blog/leveldb-%E5%90%AF%E5%8A%A8/<p>This is the first chapter of my notes on reading the LevelDB source code, focusing on the startup process of LevelDB. This article is not a step-by-step source code tutorial, but rather a learning note that records my questions and thoughts.</p>
<p>A code repository with annotations will be shared on GitHub later for those interested in studying it.</p>
<h2 id="prerequisites">Prerequisites</h2>
<h3 id="database-files">Database Files</h3>
<p>For now, I won’t delve into the encoding and naming details of these files (as I haven’t reached that part yet). I’ll focus on the meaning and role of each file.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>├── 000005.ldb
</span></span><span style="display:flex;"><span>├── 000008.ldb // sst or ldb are both sst files
</span></span><span style="display:flex;"><span>├── 000009.log // WAL
</span></span><span style="display:flex;"><span>├── CURRENT // Records the name of the manifest file in use, also indicates the presence of the database
</span></span><span style="display:flex;"><span>├── LOCK // Empty file, ensures only one DB instance operates on the database
</span></span><span style="display:flex;"><span>├── LOG // Logs printed by LevelDB
</span></span><span style="display:flex;"><span>├── LOG.old
</span></span><span style="display:flex;"><span>└── MANIFEST-000007 // descriptor_file, metadata file
</span></span></code></pre></div><p>Some questions worth exploring, which I may write about later:</p>
<ul>
<li>How does LOCK ensure only one DB instance holds the database?</li>
</ul>
<blockquote>
<p>Essentially, it uses the <code>fcntl</code> system call to set a write lock on the LOCK file.</p>
</blockquote>
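<p>That <code>fcntl</code> trick can be demonstrated in a few lines. The sketch below is not LevelDB's actual <code>env_posix</code> code; note that POSIX <code>fcntl</code> locks are per-process, so the conflict only shows up when a second process attempts the same lock.</p>

```python
import fcntl, os, tempfile

# Take an exclusive, non-blocking lock on the LOCK file. A second process
# attempting the same lock gets an error instead of blocking, which is how
# only one DB instance can hold the database at a time.

def lock_file(path):
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd               # keep the fd open for the lifetime of the DB
    except OSError:
        os.close(fd)
        return None             # another process already owns the database

path = os.path.join(tempfile.mkdtemp(), "LOCK")
fd = lock_file(path)
print(fd is not None)   # -> True: this process now owns the database
```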
<ul>
<li>Encoding issues of various files</li>
</ul>
<blockquote>
<p>I’ll discuss LevelDB’s encoding design in a future blog.</p>
</blockquote>
<h3 id="db-state">DB State</h3>
<p>LevelDB is an embedded database, often used as a component in other applications (e.g., for metadata nodes in distributed storage systems). These applications may crash or exit gracefully, leaving LevelDB data files behind. Thus, it’s necessary to restore the previous database state during startup to ensure data integrity.</p>
<p>So, what should the DB state include? LevelDB is an LSM-based storage engine, essentially an <code>LSM Tree data structure + various read/write and storage optimizations</code>. Based on this and LevelDB’s documentation, the DB state includes at least the following persistent information:</p>
<ul>
<li>The SST files for each level and the key range covered by each SST file
<blockquote>
<p>The key range helps avoid unnecessary I/O.</p>
</blockquote>
</li>
<li>Global logical clock, <code>last_seq_number</code>
<blockquote>
<p>Each data update has a <code>seq_num</code> that marks the recency of the update and is related to ordering.</p>
</blockquote>
</li>
<li>Compaction-related parameters (<code>file_to_compact</code>, <code>score</code>, <code>point</code>)
<blockquote>
<p>Compaction parameters are used to trigger compaction after a crash.</p>
</blockquote>
</li>
<li>Comparator name
<blockquote>
<p>Once the DB is initialized, the data sorting logic is fixed and cannot be changed. The comparator name serves as a credential.</p>
</blockquote>
</li>
<li><code>log_number</code>, <code>next_file_number</code>
<blockquote>
<p>WAL number and the next available file number.</p>
</blockquote>
</li>
<li><code>deleted_files</code> and <code>add_files</code>
<blockquote>
<p>SST files to be deleted or added due to compaction or reference count reaching zero.</p>
</blockquote>
</li>
</ul>
<p>In practice, each metadata change in LevelDB (usually caused by compaction) is recorded in a <code>VersionEdit</code> data structure. Thus, the DB state in LevelDB is essentially <code>initial state + list of applied VersionEdits</code>.</p>
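<p>The replay idea can be sketched directly. The field names below are simplified stand-ins for <code>VersionEdit</code>'s members, and the dict-of-sets layout is illustrative only:</p>

```python
# DB state = initial state + fold(applied VersionEdits): each edit records
# files added/deleted per level plus metadata bumps, and recovery replays
# the edits in order to reconstruct the latest state.

def apply_edit(state, edit):
    for level, f in edit.get("deleted_files", []):
        state["files"][level].discard(f)
    for level, f in edit.get("added_files", []):
        state["files"][level].add(f)
    for k in ("log_number", "next_file_number", "last_seq_number"):
        if k in edit:
            state[k] = edit[k]
    return state

state = {"files": {0: set(), 1: set()}, "log_number": 0,
         "next_file_number": 2, "last_seq_number": 0}
edits = [
    {"added_files": [(0, "000005.ldb")], "next_file_number": 6},
    {"deleted_files": [(0, "000005.ldb")],      # compacted away from level 0
     "added_files": [(1, "000008.ldb")], "next_file_number": 9},
]
for e in edits:
    state = apply_edit(state, e)
print(sorted(state["files"][1]), state["next_file_number"])   # -> ['000008.ldb'] 9
```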
<h3 id="version-control">Version Control</h3>
<p>Since we mentioned <code>VersionEdit</code>, let’s also discuss the version control in LevelDB’s startup process, which mainly involves three data structures: <code>Version</code>, <code>VersionEdit</code>, and <code>VersionSet</code>.</p>
<p>Why is version control needed? In short, LevelDB uses the Multi-Version Concurrency Control (MVCC) mechanism to avoid using a big lock and improve performance.</p>
<p>Snapshot reads at the command level are implemented via <code>sequence_number</code>. Each operation is assigned the current <code>sequence_number</code>, which is used to determine the data visible to that operation. Records with a <code>sequence_number</code> greater than that of the command are invisible to the operation.</p>
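A minimal sketch of that visibility rule (hypothetical Go types, not LevelDB's internal representation): a read started at sequence number <code>s</code> sees, for each key, the newest record whose sequence number does not exceed <code>s</code>:

```go
package main

import "fmt"

// Hypothetical sketch of snapshot visibility: a read carries the
// sequence number current at the time it started, and only record
// versions with Seq no greater than that number are visible to it.

type Record struct {
	Key   string
	Value string
	Seq   uint64
}

// visible returns the newest version of key whose sequence number
// does not exceed the snapshot's sequence number.
func visible(records []Record, key string, snapshotSeq uint64) (string, bool) {
	best := Record{}
	found := false
	for _, r := range records {
		if r.Key == key && r.Seq <= snapshotSeq && (!found || r.Seq > best.Seq) {
			best, found = r, true
		}
	}
	return best.Value, found
}

func main() {
	records := []Record{
		{"k", "v1", 10},
		{"k", "v2", 20}, // written after the snapshot below was taken
	}
	v, _ := visible(records, "k", 15)
	fmt.Println(v) // v1
}
```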
<p>MVCC at the SST file level is implemented using a version chain, primarily to avoid conflicts in the following scenario: when reading a file while a background major compaction tries to delete that file.</p>
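That conflict is avoided with the reference counts kept on each version (<code>refs_</code> in the <code>Version</code> class below). A rough sketch of the idea, using hypothetical helper types rather than LevelDB's real code: files reachable from any still-referenced version are considered live and must not be deleted, even if a compaction has superseded them:

```go
package main

import "fmt"

// Hypothetical sketch of SST-level protection via reference counting:
// a reader pins the Version it reads from; files belonging to a pinned
// Version stay live even after a compaction has produced a newer Version.

type Version struct {
	refs  int
	files []string
}

func (v *Version) Ref()   { v.refs++ }
func (v *Version) Unref() { v.refs-- }

// liveFiles collects files reachable from any still-referenced version.
func liveFiles(versions []*Version) map[string]bool {
	live := map[string]bool{}
	for _, v := range versions {
		if v.refs > 0 {
			for _, f := range v.files {
				live[f] = true
			}
		}
	}
	return live
}

func main() {
	old := &Version{files: []string{"a.sst"}}
	cur := &Version{files: []string{"b.sst"}}
	cur.Ref() // the current version is always pinned
	old.Ref() // a reader is still iterating over the old version

	live := liveFiles([]*Version{old, cur})
	fmt.Println(live["a.sst"], live["b.sst"]) // true true

	old.Unref() // reader finished; a.sst may now be deleted
	live = liveFiles([]*Version{old, cur})
	fmt.Println(live["a.sst"], live["b.sst"]) // false true
}
```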
<h4 id="related-data-structures">Related Data Structures</h4>
<p>The main data structures related to SST-level MVCC are <code>Version</code>, <code>VersionEdit</code>, and <code>VersionSet</code>.</p>
<h5 id="version">Version</h5>
<p>Represents the latest data state after startup or compaction.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Version</span> {
</span></span><span style="display:flex;"><span> VersionSet<span style="color:#f92672">*</span> vset_; <span style="color:#75715e">// VersionSet to which this Version belongs
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> Version<span style="color:#f92672">*</span> next_; <span style="color:#75715e">// Next version in linked list
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> Version<span style="color:#f92672">*</span> prev_; <span style="color:#75715e">// Previous version in linked list
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">int</span> refs_; <span style="color:#75715e">// Number of live refs to this version
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// List of files and metadata per level
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> std<span style="color:#f92672">::</span>vector<span style="color:#f92672"><</span>FileMetaData<span style="color:#f92672">*></span> files_[config<span style="color:#f92672">::</span>kNumLevels];
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Next file to compact based on seek stats (compaction due to allowed_seek exhaustion)
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> FileMetaData<span style="color:#f92672">*</span> file_to_compact_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> file_to_compact_level_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Level that should be compacted next and its compaction score.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#75715e">// Score < 1 means compaction is not strictly needed. These fields
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#75715e">// are initialized by Finalize().
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">double</span> compaction_score_; <span style="color:#75715e">// Score represents data imbalance; higher score indicates greater imbalance and compaction need.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">int</span> compaction_level_;
</span></span><span style="display:flex;"><span>};
</span></span></code></pre></div><h5 id="versionset">VersionSet</h5>
<p>Manages the current runtime state of the entire DB.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">VersionSet</span> {
</span></span><span style="display:flex;"><span> Env<span style="color:#f92672">*</span> <span style="color:#66d9ef">const</span> env_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">const</span> std<span style="color:#f92672">::</span>string dbname_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">const</span> Options<span style="color:#f92672">*</span> <span style="color:#66d9ef">const</span> options_;
</span></span><span style="display:flex;"><span> TableCache<span style="color:#f92672">*</span> <span style="color:#66d9ef">const</span> table_cache_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">const</span> InternalKeyComparator icmp_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> next_file_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> manifest_file_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> last_sequence_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> prev_log_number_; <span style="color:#75715e">// 0 or backing store for memtable being compacted
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Opened lazily
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> WritableFile<span style="color:#f92672">*</span> descriptor_file_; <span style="color:#75715e">// descriptor_ is for manifest file
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> log<span style="color:#f92672">::</span>Writer<span style="color:#f92672">*</span> descriptor_log_; <span style="color:#75715e">// descriptor_ is for manifest file
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> Version dummy_versions_; <span style="color:#75715e">// Head of circular doubly-linked list of versions.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> Version<span style="color:#f92672">*</span> current_; <span style="color:#75715e">// == dummy_versions_.prev_
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Per-level key at which the next compaction at that level should start.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#75715e">// Either an empty string, or a valid InternalKey.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> std<span style="color:#f92672">::</span>string compact_pointer_[config<span style="color:#f92672">::</span>kNumLevels];
</span></span><span style="display:flex;"><span>};
</span></span></code></pre></div><h5 id="versionedit">VersionEdit</h5>
<p>Encapsulates a batch of metadata changes. Batching the changes into a single edit keeps the critical section for switching to a new version short.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">VersionEdit</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">/** other code */</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">typedef</span> std<span style="color:#f92672">::</span>set<span style="color:#f92672"><</span>std<span style="color:#f92672">::</span>pair<span style="color:#f92672"><</span><span style="color:#66d9ef">int</span>, <span style="color:#66d9ef">uint64_t</span><span style="color:#f92672">>></span> DeletedFileSet;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>string comparator_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> prev_log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">uint64_t</span> next_file_number_;
</span></span><span style="display:flex;"><span> SequenceNumber last_sequence_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_comparator_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_prev_log_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_next_file_number_;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">bool</span> has_last_sequence_;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>vector<span style="color:#f92672"><</span>std<span style="color:#f92672">::</span>pair<span style="color:#f92672"><</span><span style="color:#66d9ef">int</span>, InternalKey<span style="color:#f92672">>></span> compact_pointers_;
</span></span><span style="display:flex;"><span> DeletedFileSet deleted_files_;
</span></span><span style="display:flex;"><span> std<span style="color:#f92672">::</span>vector<span style="color:#f92672"><</span>std<span style="color:#f92672">::</span>pair<span style="color:#f92672"><</span><span style="color:#66d9ef">int</span>, FileMetaData<span style="color:#f92672">>></span> new_files_;
</span></span><span style="display:flex;"><span>};
</span></span></code></pre></div><h4 id="manifest-content">Manifest Content</h4>
<p>As mentioned earlier, the manifest is LevelDB’s metadata file that stores the persistent state of the database. During startup, LevelDB may need to restore the previous DB state using existing data files. Additionally, when a version changes, LevelDB generates a <code>VersionEdit</code>. The metadata changes recorded by <code>VersionEdit</code> need to be persisted to the manifest to ensure LevelDB’s MVCC multi-version state is crash-safe. Thus, the encoding layout inside the manifest is crucial.</p>
<p>Internally, metadata is encoded as <code>SnapshotSessionRecord + list of SessionRecords</code>, essentially <code>initial state + list of applied VersionEdits</code>.</p>
<p><img alt="Manifest Structure" src="https://s2.loli.net/2022/04/11/AUusMjdYROz874v.jpg"></p>
<p>A manifest contains several session records. <strong>The first session record</strong> stores the <em>full version information</em> of LevelDB at that time, while subsequent session records only record incremental changes.</p>
<blockquote>
<p>A session record may contain the following fields:</p>
<ul>
<li>Comparator name</li>
<li>Latest WAL file number</li>
<li>Next available file number</li>
<li>The largest <code>sequence number</code> among the data persisted by the DB</li>
<li>Information on new files</li>
<li>Information on deleted files</li>
<li>Compaction record information</li>
</ul>
</blockquote>
<h5 id="writing-version-changes-to-the-manifest">Writing Version Changes to the Manifest</h5>
<p><img alt="Writing to Manifest" src="https://s2.loli.net/2022/04/14/7tlwEMPGgXHIp6s.jpg"></p>
<p>For LevelDB, adding or deleting some SSTable files needs to be an <strong>atomic operation</strong> to maintain <strong>database consistency</strong> before and after the state change.</p>
<h6 id="atomicity">Atomicity</h6>
<p><strong>Atomicity</strong> means that the operation takes effect only once a session record is fully written to the manifest. If the process crashes before that, the database is restored to the last consistent state on restart: orphaned SSTable files are deleted and compaction is re-triggered.</p>
<h6 id="consistency">Consistency</h6>
<p><strong>Consistency</strong> is ensured by marking state changes with version updates, which occur at the very end of the process. Thus, the database always transitions from one consistent state to another.</p>
<h5 id="restoring-db-from-the-manifest">Restoring DB from the Manifest</h5>
<p><img alt="Restoring DB from Manifest" src="https://s2.loli.net/2022/04/14/Jk5eyRzUWowi4YH.jpg"></p>
<p>As LevelDB runs, the number of session records in a manifest grows. Therefore, each time LevelDB restarts, a new manifest is created, and the first session record captures a snapshot of the current version state.</p>
<p>Outdated manifests are deleted during the recovery process at the next startup.</p>
<blockquote>
<p>LevelDB uses this method to control the size of the manifest file. However, if the database is not restarted, the manifest will keep growing.</p>
</blockquote>
<h2 id="db-state-recovery-process">DB State Recovery Process</h2>
<ol>
<li>Check lock status and create data directory.</li>
<li>Check <code>lockfile</code> to determine if another DB instance exists.</li>
<li>Check if the <code>CURRENT</code> file exists.</li>
<li>Restore metadata from the manifest.</li>
<li>Recover <code>last_seq_number</code> and <code>file_number</code> from the WAL.</li>
</ol>
<h2 id="main-open-process">Main Open Process</h2>
<ol>
<li>Create default DB and <code>VersionEdit</code> instances.</li>
<li>Acquire lock.</li>
<li>Restore metadata from manifest and WAL.</li>
<li>If the new DB instance does not have a <code>memtable</code>, create one along with a WAL file.</li>
<li>Apply <code>VersionEdit</code> and persist it to the manifest.</li>
<li>Attempt to delete obsolete files.</li>
<li>Attempt to compact data.</li>
</ol>
<h2 id="references">References</h2>
<ul>
<li><a href="https://zhaox.github.io/leveldb/2015/12/23/leveldb-files">LevelDB files</a></li>
<li><a href="https://leveldb-handbook.readthedocs.io/zh/latest/">LevelDB handbook</a></li>
<li><a href="https://github.com/google/leveldb/blob/main/doc/impl.md">LevelDB documentation</a></li>
<li><a href="https://github.com/1Feng/decode-leveldb/blob/master/doc/leveldb%E5%AE%9E%E7%8E%B0%E8%A7%A3%E6%9E%90.pdf">LevelDB implementation analysis</a></li>
<li><a href="https://github.com/noneback/leveldb_annotated">My notes on LevelDB</a></li>
</ul>
MIT6.824-Raft
https://noneback.github.io/blog/mit6.824-raft/
Mon, 21 Feb 2022 01:26:46 +0800https://noneback.github.io/blog/mit6.824-raft/<p>Finally, I managed to complete Lab 02 during this winter break, which had been on hold for quite some time. I was stuck on one of the cases in Test 2B for a while. During the winter break, I revisited the implementations from experts, and finally completed all the tasks, so I decided to document them briefly.</p>
<h2 id="algorithm-overview">Algorithm Overview</h2>
<p>The basis of consensus algorithms is the replicated state machine, which means that <strong>executing the same deterministic commands in the same order will eventually lead to a consistent state</strong>. Raft is a distributed consensus algorithm that serves as an alternative to Paxos, making it easier to learn and understand compared to Paxos.</p>
<p>The core content of the Raft algorithm can be divided into three parts: <em>Leader Election + Log Replication + Safety</em>.</p>
<p><img alt="img" src="https://s2.loli.net/2022/02/19/9mGfndCtDHzMqe4.png"></p>
<p>Initially, all nodes in the cluster start as Followers. If a Follower does not receive a heartbeat from the Leader within a certain period, it becomes a Candidate and triggers an election, requesting votes from the other Followers. The Candidate that receives a majority of votes becomes the Leader.</p>
<p>Raft is a <strong>strong-leader</strong>, strongly consistent distributed consensus algorithm. It uses Terms as a logical clock, and at most one Leader can exist in each term. The Leader sends heartbeats periodically to maintain its authority and to drive <strong>log replication</strong>.</p>
<p>When replicating logs, the Leader first replicates the log to other Followers. Once a majority of the Followers successfully replicate the log, the Leader commits the log.</p>
<p>Safety consists of five rules in the paper, two of which are central to the implementation. One is Leader Append-Only: a leader never overwrites or deletes entries in its own log, it only appends. The other is the election restriction, which prevents split-brain scenarios and guarantees that a newly elected Leader's log is at least as up-to-date as that of a majority of the cluster.</p>
<p>For more details, please refer to the original paper.</p>
<h2 id="implementation-ideas">Implementation Ideas</h2>
<p>The implementation largely follows an excellent blog post (see references), and many algorithm details are also provided in Figure 2 of the original paper, so I will only focus on aspects that need attention when implementing each function.</p>
<h3 id="leader-election">Leader Election</h3>
<h4 id="triggering-election--handling-election-results">Triggering Election + Handling Election Results</h4>
<p>The election is initiated by launching multiple goroutines to send RPC requests to other nodes in the background. Therefore, when handling RPC responses, it is necessary to confirm that the current node is a Candidate and that the request is not outdated, i.e., <code>rf.state == Candidate && req.Term == rf.currentTerm</code>. If the election is successful, the node should immediately send heartbeats to notify other nodes of the election result.</p>
<p>If a failed response is received with <code>resp.Term > rf.currentTerm</code>, the node should switch to the Follower state, update the term, and <strong>reset voting information</strong>.</p>
<blockquote>
<p>In fact, whenever the term is updated, the voting information needs to be reset. If the <code>votedFor</code> information is not reset, some tests will fail.</p>
</blockquote>
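That rule can be sketched as follows (hypothetical field names, not the lab's exact code): stepping down on a higher term must clear <code>votedFor</code> together with updating the term, otherwise a node may vote twice in the new term:

```go
package main

import "fmt"

// Hypothetical sketch: whenever a Raft node observes a higher term,
// it adopts the term, clears votedFor, and steps down to Follower.
// Forgetting the votedFor reset lets a node grant two votes in one term.

const NoVote = -1

type Raft struct {
	currentTerm int
	votedFor    int
	state       string
}

func (rf *Raft) observeTerm(term int) {
	if term > rf.currentTerm {
		rf.currentTerm = term
		rf.votedFor = NoVote // reset voting information
		rf.state = "Follower"
	}
}

func main() {
	rf := &Raft{currentTerm: 3, votedFor: 2, state: "Candidate"}
	rf.observeTerm(5) // e.g. a response arrived with resp.Term > rf.currentTerm
	fmt.Println(rf.currentTerm, rf.votedFor, rf.state) // 5 -1 Follower
}
```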
<h4 id="request-vote-rpc">Request Vote RPC</h4>
<p>First, filter outdated requests with <code>req.Term < rf.currentTerm</code> and ignore duplicate voting requests for the current term. Then, follow the algorithm’s logic to process the request. Note that if the node successfully grants the vote, it should reset the election timer.</p>
<blockquote>
<p>Resetting the election timeout only when granting a vote helps with liveness in leader elections under unstable network conditions.</p>
</blockquote>
<h4 id="state-transition">State Transition</h4>
<p>When switching roles, be mindful of handling the state of different timers (stop or reset). When switching to Leader, reset the values of <code>matchIndex</code> and <code>nextIndex</code>.</p>
<h3 id="log-replication">Log Replication</h3>
<p>Log replication is the core of the Raft algorithm, and it requires careful attention.</p>
<p>My implementation uses multiple replicator and applier threads for asynchronous replication and application.</p>
<h4 id="log-replication-rpc">Log Replication RPC</h4>
<p>First, filter outdated requests with <code>req.Term < rf.currentTerm</code>. Then, handle log inconsistencies, log truncation, and duplicate log entries before replicating logs and processing <code>commitIndex</code>.</p>
<h4 id="trigger-log-replication--handle-request-results">Trigger Log Replication + Handle Request Results</h4>
<p>Determine whether to replicate logs directly or send a snapshot before initiating replication.</p>
<p>The key point in handling request results is how to update <code>matchIndex</code>, <code>nextIndex</code>, and <code>commitIndex</code>.</p>
<p><code>matchIndex</code> is used to record the latest log successfully replicated on other nodes, while <code>nextIndex</code> records the next log to be sent to other nodes. <code>commitIndex</code> is updated by sorting <code>matchIndex</code> and determining whether to trigger the applier to update <code>appliedIndex</code>.</p>
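The <code>commitIndex</code> update can be sketched like this: sort a copy of <code>matchIndex</code> (with the leader counted at its own last log index) and take the median, which is the highest index replicated on a majority. Note that the real algorithm additionally requires the entry at that index to belong to the current term before committing it:

```go
package main

import (
	"fmt"
	"sort"
)

// Sketch of the commitIndex rule: an index is committed once a majority
// of nodes have replicated it, i.e. the median of matchIndex (counting
// the leader itself at its own last log index).

func commitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Ints(sorted)
	// with n entries, sorted[(n-1)/2] is replicated on a majority
	return sorted[(len(sorted)-1)/2]
}

func main() {
	// 5 nodes: leader at index 7, followers at 7, 5, 4, 2;
	// index 5 is held by 3 of 5 nodes, a majority
	fmt.Println(commitIndex([]int{7, 7, 5, 4, 2})) // 5
}
```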
<p>If a replication request fails because of log inconsistency, decrement <code>nextIndex</code> and retry; if it fails because the responder's term is higher, the node should switch to the Follower state.</p>
<h4 id="asynchronous-apply">Asynchronous Apply</h4>
<p>This is essentially a background goroutine controlled by condition variables and uses channels for communication. Each time it is triggered, it sends <code>log[lastApplied:commitIndex]</code> to the upper layer and updates <code>appliedIndex</code>.</p>
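A sketch of such an applier (simplified, with hypothetical types): the goroutine sleeps on a condition variable until <code>commitIndex</code> passes <code>lastApplied</code>, copies the newly committed entries, and sends them on the apply channel outside the lock:

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch of the applier goroutine: it waits on a condition variable
// until commitIndex advances past lastApplied, then pushes
// log[lastApplied+1 .. commitIndex] to the service layer.

type Applier struct {
	mu          sync.Mutex
	cond        *sync.Cond
	log         []string
	commitIndex int
	lastApplied int
	applyCh     chan string
}

func NewApplier(log []string) *Applier {
	a := &Applier{log: log, commitIndex: -1, lastApplied: -1, applyCh: make(chan string, len(log))}
	a.cond = sync.NewCond(&a.mu)
	go a.run()
	return a
}

func (a *Applier) run() {
	for {
		a.mu.Lock()
		for a.lastApplied >= a.commitIndex {
			a.cond.Wait()
		}
		// copy the committed entries so the lock is not held while sending
		entries := append([]string(nil), a.log[a.lastApplied+1:a.commitIndex+1]...)
		a.lastApplied = a.commitIndex
		a.mu.Unlock()
		for _, e := range entries {
			a.applyCh <- e
		}
	}
}

func (a *Applier) Commit(index int) {
	a.mu.Lock()
	a.commitIndex = index
	a.mu.Unlock()
	a.cond.Signal()
}

func main() {
	a := NewApplier([]string{"set x=1", "set y=2"})
	a.Commit(1)
	fmt.Println(<-a.applyCh, "/", <-a.applyCh) // set x=1 / set y=2
}
```

Checking the condition in a loop under the lock avoids lost wakeups: a <code>Commit</code> that lands before the goroutine reaches <code>Wait</code> is still observed.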
<h3 id="persistence">Persistence</h3>
<p>Whenever the persistent state (<code>currentTerm</code>, <code>votedFor</code>, and the log) changes, write it to stable storage in a timely manner, before responding to the RPC that changed it.</p>
<h3 id="install-snapshot">Install Snapshot</h3>
<p>The main components are the Snapshot triggered by the Leader and the corresponding RPC. When applying a Snapshot, determine its freshness and update <code>log[0]</code>, <code>appliedIndex</code>, and <code>commitIndex</code>.</p>
<h2 id="pitfalls">Pitfalls</h2>
<h3 id="defer">Defer</h3>
<p>The first pitfall is related to the <strong>defer</strong> keyword in Go. I like to use <code>defer</code> at the beginning of an RPC handler to print some of the node's state, e.g. <code>defer Dprintf("%+v", raft.currentTerm)</code>, so the log is printed when the call returns. However, the arguments of a deferred call are evaluated at the moment the <code>defer</code> statement executes, not when the deferred function runs. To print the state as it is at return time, defer a closure that reads the variable when it runs: <code>defer func() { Dprintf("%+v", raft.currentTerm) }()</code>.</p>
<h3 id="log-dummy-header">Log Dummy Header</h3>
<p>It is best to reserve a dummy entry at the head of the log to store the last included index and term of the snapshot; this avoids a painful refactor of the Snapshot section later.</p>
<h3 id="lock">Lock</h3>
<p>Follow the course's locking guidance: use a single coarse-grained lock rather than multiple fine-grained locks, since correctness of the algorithm matters more than performance here. Avoid holding the lock while sending RPCs or blocking on channels, as this may lead to timeouts.</p>
<h2 id="references">References</h2>
<p><a href="https://zh.wikipedia.org/wiki/Raft">Raft Wikipedia</a></p>
<p><a href="https://raft.github.io/">Raft Official Website</a></p>
<p><a href="https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf">Raft Paper</a></p>
<p><a href="https://pdos.csail.mit.edu/6.824/index.html">MIT6.824 Official</a></p>
<p><a href="https://github.com/OneSizeFitsQuorum/MIT6.824-2021/blob/master/docs/lab2.md">Potato’s Implementation Doc</a></p>
Arch + DWM Setup Attempt
https://noneback.github.io/blog/arch+dwm%E5%A5%97%E9%A4%90/
Sat, 15 Jan 2022 23:13:16 +0800https://noneback.github.io/blog/arch+dwm%E5%A5%97%E9%A4%90/<p>Originally, I wanted to replace Manjaro KDE with DWM, but I got stuck at the boot screen, and while trying to fix it, I ended up corrupting the bootloader. So, I decided to go all in, format the entire disk, and try setting up an Arch + DWM development environment. Here, I’m documenting the process to assist with future repairs and device migrations.</p>
<p>This is not a step-by-step guide, but rather a concise record of my journey.</p>
<h2 id="installing-arch-linux">Installing Arch Linux</h2>
<h3 id="preparation">Preparation</h3>
<h4 id="environment-for-installing-arch">Environment for Installing Arch</h4>
<p>To create the installation USB, you’ll need:</p>
<ul>
<li><strong>16GB+ USB drive</strong></li>
<li><strong>Rufus</strong></li>
<li><strong>Windows machine</strong></li>
<li><strong>Arch Linux ISO</strong></li>
</ul>
<p>After creating the bootable USB, boot from it to start Arch Linux.</p>
<h4 id="network-and-mirrors">Network and Mirrors</h4>
<p>Connect to WiFi using <code>iwctl</code>, then update the system clock and modify the Pacman mirror list.</p>
<h3 id="installing-arch-linux-1">Installing Arch Linux</h3>
<h4 id="disk-partitioning">Disk Partitioning</h4>
<p>The disk should be divided into three main parts: Boot, Swap, and Root partitions.</p>
<table>
<thead>
<tr>
<th>Mount Point</th>
<th>Partition</th>
<th><a href="https://en.wikipedia.org/wiki/GUID_Partition_Table#Partition_type_GUIDs">Partition Type</a></th>
<th>Suggested Size</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>/mnt/boot</code> or <code>/mnt/efi</code></td>
<td><code>/dev/*efi_system_partition*</code></td>
<td><a href="https://wiki.archlinux.org/title/EFI_system_partition_%28%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87%29">EFI System Partition</a></td>
<td>At least 260 MiB</td>
</tr>
<tr>
<td><code>[SWAP]</code></td>
<td><code>/dev/*swap_partition*</code></td>
<td>Linux swap</td>
<td>More than 512 MiB</td>
</tr>
<tr>
<td><code>/mnt</code></td>
<td><code>/dev/*root_partition*</code></td>
<td>Linux x86-64 Root (/)</td>
<td>Remaining Space</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>fdisk -l <span style="color:#75715e"># View disk information</span>
</span></span><span style="display:flex;"><span>cfdisk /dev/nvme <span style="color:#75715e"># Partition the disk</span>
</span></span></code></pre></div><h4 id="formatting-partitions">Formatting Partitions</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>mkfs.ext4 <span style="color:#e6db74">${</span>root<span style="color:#e6db74">}</span>
</span></span><span style="display:flex;"><span>mkswap <span style="color:#e6db74">${</span>swap<span style="color:#e6db74">}</span>
</span></span><span style="display:flex;"><span>mkfs.fat -F <span style="color:#ae81ff">32</span> <span style="color:#e6db74">${</span>efi<span style="color:#e6db74">}</span>
</span></span></code></pre></div><h4 id="configuring-partitions-and-installing-the-system">Configuring Partitions and Installing the System</h4>
<ul>
<li>Mount Root: <code>mount /dev/${root_partition} /mnt</code></li>
<li>Mount EFI: <code>mount /dev/${efi_partition} /mnt/boot/efi</code></li>
<li>Activate Swap: <code>swapon /dev/${swap_partition}</code></li>
<li>Install Kernel and Essential Packages: <code>pacstrap /mnt base linux linux-firmware</code></li>
<li>Generate <code>fstab</code> Config: <code>genfstab -U /mnt >> /mnt/etc/fstab</code> (check for correctness)</li>
</ul>
<p>The system should now be installed, but there is no bootloader, so we need to install GRUB.</p>
<h4 id="other-configurations-before-booting">Other Configurations Before Booting</h4>
<ul>
<li>Change root to the new system:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>mount /dev/<span style="color:#e6db74">${</span>root_partition<span style="color:#e6db74">}</span> /mnt
</span></span><span style="display:flex;"><span>arch-chroot /mnt
</span></span></code></pre></div><ul>
<li>
<p>Set timezone and sync time.</p>
</li>
<li>
<p>Configure language by editing <strong>locale.gen</strong> and <strong>locale.conf</strong>.</p>
</li>
<li>
<p>Network configuration: set <strong>hostname</strong> and <strong>hosts</strong>.</p>
</li>
<li>
<p>Set the root password.</p>
</li>
<li>
<p>Install the GRUB bootloader and EFI tools, then run <code>grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=GRUB</code> (with the ESP mounted at <code>/boot/efi</code> inside the chroot).</p>
</li>
<li>
<p>Install and start <strong>iwd</strong> to connect to WiFi.</p>
</li>
<li>
<p>Boot into Arch Linux.</p>
</li>
</ul>
<h3 id="post-boot-configuration">Post-Boot Configuration</h3>
<h4 id="install-essential-software">Install Essential Software</h4>
<table>
<thead>
<tr>
<th>Purpose</th>
<th>Software</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bluetooth</td>
<td>bluetoothctl</td>
</tr>
<tr>
<td>Network</td>
<td>iwd</td>
</tr>
<tr>
<td>Daily Use</td>
<td>nvim, ranger, zsh</td>
</tr>
<tr>
<td>Sound</td>
<td>alsamixer</td>
</tr>
<tr>
<td>Input Method</td>
<td>fcitx5-im, fcitx5-chinese-addons</td>
</tr>
<tr>
<td>Proxy</td>
<td>clash</td>
</tr>
</tbody>
</table>
<h3 id="installing-the-desktop-environment">Installing the Desktop Environment</h3>
<h4 id="install-xorg">Install Xorg</h4>
<p><a href="https://wiki.archlinux.org/title/Xorg_%28%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87%29">Xorg</a> provides an open-source implementation of the X window system, which is the basis for graphical user interfaces.</p>
<p>Install: <strong>xorg-server</strong>, <strong>xorg-apps</strong>, <strong>xrandr</strong>, <strong>xinit</strong>.</p>
<h4 id="install-desktop-companion-software">Install Desktop Companion Software</h4>
<p>I used the Suckless tiling window management suite: <strong>dwm</strong>, <strong>slock</strong>, <strong>st</strong>, <strong>dmenu</strong>, <strong>slim</strong>, <strong>slstatus</strong>.</p>
<h4 id="configure-xinitc-and-xprofile">Configure <code>.xinitrc</code> and <code>.xprofile</code></h4>
<p>Add to <code>.xinitrc</code> and <code>.xprofile</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span><span style="color:#75715e"># .xinitrc</span>
</span></span><span style="display:flex;"><span>fcitx5 &
</span></span><span style="display:flex;"><span>xautolock -time <span style="color:#ae81ff">10</span> -locker slock &
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>autorandr -l home
</span></span><span style="display:flex;"><span>picom -b
</span></span><span style="display:flex;"><span>feh --bg-fill --randomize /home/noneback/Picture/wallpaper/*.jpg
</span></span><span style="display:flex;"><span>slstatus &
</span></span><span style="display:flex;"><span>exec dwm
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># .xprofile</span>
</span></span><span style="display:flex;"><span>export INPUT_METHOD<span style="color:#f92672">=</span>fcitx5
</span></span><span style="display:flex;"><span>export GTK_IM_MODULE<span style="color:#f92672">=</span>fcitx5
</span></span><span style="display:flex;"><span>export QT_IM_MODULE<span style="color:#f92672">=</span>fcitx5
</span></span><span style="display:flex;"><span>export XMODIFIERS<span style="color:#f92672">=</span>@im<span style="color:#f92672">=</span>fcitx5
</span></span></code></pre></div><h4 id="customization-and-usability">Customization and Usability</h4>
<table>
<thead>
<tr>
<th>Purpose</th>
<th>Software</th>
</tr>
</thead>
<tbody>
<tr>
<td>Wallpaper</td>
<td>feh</td>
</tr>
<tr>
<td>Window Effects</td>
<td>picom</td>
</tr>
<tr>
<td>Screen Lock</td>
<td>xautolock</td>
</tr>
<tr>
<td>Multi-Screen</td>
<td>autorandr</td>
</tr>
<tr>
<td>Power Saving</td>
<td>tlp</td>
</tr>
</tbody>
</table>
<h2 id="additional-notes">Additional Notes</h2>
<p>For more detailed instructions, please refer to the official installation documentation.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://wiki.archlinux.org/title/Installation_guide_(%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87)#%E5%BB%BA%E7%AB%8B%E7%A1%AC%E7%9B%98%E5%88%86%E5%8C%BA">Arch Linux Install Wiki</a></li>
<li><a href="https://wiki.archlinux.org/title/Category:X_server_(%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87)">X Server Wiki</a></li>
<li><a href="https://github.com/noneback/dwm-releated">Personal DWM Desktop</a></li>
</ul>
How to Implement SkipList
https://noneback.github.io/blog/how-to-implement-skiplist/
Sun, 21 Nov 2021 15:28:42 +0800https://noneback.github.io/blog/how-to-implement-skiplist/<p>Some time ago, I decided to implement a simple LSM storage engine model. As part of that, I implemented a basic SkipList and BloomFilter with BitSet. However, due to work demands and after-hours laziness, the project was put on hold. Now that I’m thinking about it again, I realize I’ve forgotten some of the details, so I’m writing it down for future reference.</p>
<h2 id="what-is-skiplist">What is SkipList?</h2>
<p><strong>SkipList</strong> is an ordered data structure that can be seen as an alternative to balanced trees. It essentially uses <strong>sparse indexing</strong> to accelerate searches in a linked list structure. It combines both <strong>data and index</strong> into a single structure, allowing efficient insertions and deletions.</p>
<blockquote>
<p>Self-balancing trees, such as AVL trees and Red-Black Trees, solve the problem of tree imbalance but introduce rotations, recoloring, and other complexity. In concurrent scenarios, these rebalancing operations can force larger lock granularity and hurt performance. SkipList avoids these problems.</p>
</blockquote>
<p><img alt="SkipList diagram" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/Skip_list.svg/600px-Skip_list.svg.png"></p>
<h2 id="implementation">Implementation</h2>
<p>SkipLists are usually implemented on top of ordered linked lists. The key challenge with ordered linked lists is figuring out how to insert a new node while maintaining order.</p>
<p>For arrays, binary search can quickly locate the insert position, after which elements are shifted to make room. Linked lists avoid the cost of shifting elements, but they do not support random access, so locating the insertion point is the hard part.</p>
<p>The essence of SkipLists is to <strong>maintain a linked list of nodes with multiple layers of sparse indices</strong> that can be used to efficiently locate nodes.</p>
<blockquote>
<p>In the base level, all nodes are present. On the next level, approximately every other node is present, and so on. This approach reduces the average time complexity for search, insertion, and deletion.</p>
</blockquote>
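<p>The halving described above can be made concrete with a quick sketch (an illustration only, not part of the implementation below): for an ideal SkipList over <code>n</code> base nodes, the number of levels grows as log2(n) while the total number of extra index nodes stays below <code>n</code>.</p>

```go
package main

import "fmt"

// idealIndex models an ideal SkipList over n base nodes where each
// level keeps every other node of the level below. It returns the
// number of levels and the total count of index (non-base) nodes.
func idealIndex(n int) (levels, indexNodes int) {
	levels = 1
	for m := n / 2; m >= 1; m /= 2 {
		levels++
		indexNodes += m
	}
	return
}

func main() {
	levels, indexNodes := idealIndex(16)
	fmt.Println(levels, indexNodes) // 16 base nodes -> 5 levels, 15 index nodes
}
```

So the index roughly doubles the node count while cutting the expected search path to O(log n).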
<h3 id="using-randomization">Using Randomization</h3>
<p>To efficiently maintain these index nodes, randomization is used to decide whether a newly added node should be part of the index.</p>
<blockquote>
<p>For a SkipList with maximum level <code>X</code>, each time a new node is added, a random process decides how high it is promoted: a node reaches index level <code>n</code> with probability <code>1/(2^n)</code>. On average, each level therefore contains about half as many nodes as the level below it, approximating an evenly partitioned index.</p>
</blockquote>
<h3 id="data-structures">Data Structures</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">SkipListNode</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">data</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">codec</span>.<span style="color:#a6e22e">Entry</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">nextPtrs</span> []<span style="color:#f92672">*</span><span style="color:#a6e22e">SkipListNode</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">SkipList</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">header</span>, <span style="color:#a6e22e">tail</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">SkipListNode</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">level</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">size</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">rwmtx</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">sync</span>.<span style="color:#a6e22e">RWMutex</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">maxSize</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h3 id="operations">Operations</h3>
<p>The two most notable operations are <strong>search</strong> and <strong>insertion</strong>.</p>
<h4 id="search">Search</h4>
<p>The key step here is to use the sparse indices in the SkipList, moving from the top level downwards to efficiently locate the required position.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#75715e">// findPreNode finds the node before the node with the given key
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#66d9ef">func</span> (<span style="color:#a6e22e">sl</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">SkipList</span>) <span style="color:#a6e22e">findPreNode</span>(<span style="color:#a6e22e">key</span> []<span style="color:#66d9ef">byte</span>) (<span style="color:#f92672">*</span><span style="color:#a6e22e">SkipListNode</span>, <span style="color:#66d9ef">bool</span>) {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Start from the highest level
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#a6e22e">h</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">header</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">i</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">level</span> <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>; <span style="color:#a6e22e">i</span> <span style="color:#f92672">>=</span> <span style="color:#ae81ff">0</span>; <span style="color:#a6e22e">i</span><span style="color:#f92672">--</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>] <span style="color:#f92672">!=</span> <span style="color:#66d9ef">nil</span> <span style="color:#f92672">&&</span> <span style="color:#a6e22e">bytes</span>.<span style="color:#a6e22e">Compare</span>(<span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>].<span style="color:#a6e22e">data</span>.<span style="color:#a6e22e">Key</span>, <span style="color:#a6e22e">key</span>) <span style="color:#f92672">!=</span> <span style="color:#ae81ff">1</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> <span style="color:#a6e22e">bytes</span>.<span style="color:#a6e22e">Equal</span>(<span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>].<span style="color:#a6e22e">data</span>.<span style="color:#a6e22e">Key</span>, <span style="color:#a6e22e">key</span>) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">h</span>, <span style="color:#66d9ef">true</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">h</span> = <span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>]
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">nil</span>, <span style="color:#66d9ef">false</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="insertion">Insertion</h4>
<p>First, locate the position to insert, then perform the insertion, and finally add indices as determined by randomization.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">func</span> (<span style="color:#a6e22e">sl</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">SkipList</span>) <span style="color:#a6e22e">randomLevel</span>() <span style="color:#66d9ef">int</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">ans</span> <span style="color:#f92672">:=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">rand</span>.<span style="color:#a6e22e">Seed</span>(<span style="color:#a6e22e">time</span>.<span style="color:#a6e22e">Now</span>().<span style="color:#a6e22e">Unix</span>())
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">rand</span>.<span style="color:#a6e22e">Intn</span>(<span style="color:#ae81ff">2</span>) <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span> <span style="color:#f92672">&&</span> <span style="color:#a6e22e">ans</span> <span style="color:#f92672"><=</span> <span style="color:#a6e22e">defaultMaxLevel</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">ans</span><span style="color:#f92672">++</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">ans</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">func</span> (<span style="color:#a6e22e">sl</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">SkipList</span>) <span style="color:#a6e22e">Insert</span>(<span style="color:#a6e22e">data</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">codec</span>.<span style="color:#a6e22e">Entry</span>) <span style="color:#f92672">*</span><span style="color:#a6e22e">SkipListNode</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">rwmtx</span>.<span style="color:#a6e22e">Lock</span>()
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">defer</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">rwmtx</span>.<span style="color:#a6e22e">Unlock</span>()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">h</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">header</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">updateNode</span> <span style="color:#f92672">:=</span> make([]<span style="color:#f92672">*</span><span style="color:#a6e22e">SkipListNode</span>, <span style="color:#a6e22e">defaultMaxLevel</span>) <span style="color:#75715e">// stores the node before newNode
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#75715e">// Search from the top level
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">i</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">level</span> <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>; <span style="color:#a6e22e">i</span> <span style="color:#f92672">>=</span> <span style="color:#ae81ff">0</span>; <span style="color:#a6e22e">i</span><span style="color:#f92672">--</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Loop while the current nextPtrs is not empty and data is smaller than the inserted one
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>] <span style="color:#f92672">!=</span> <span style="color:#66d9ef">nil</span> <span style="color:#f92672">&&</span> <span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>].<span style="color:#a6e22e">data</span>.<span style="color:#a6e22e">Less</span>(<span style="color:#a6e22e">data</span>) {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">h</span> = <span style="color:#a6e22e">h</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>]
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">updateNode</span>[<span style="color:#a6e22e">i</span>] = <span style="color:#a6e22e">h</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Choose the level to insert
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#a6e22e">lvl</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">randomLevel</span>()
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> <span style="color:#a6e22e">lvl</span> > <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">level</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Insert into higher levels, we need to create header -> tail
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">i</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">level</span>; <span style="color:#a6e22e">i</span> < <span style="color:#a6e22e">lvl</span>; <span style="color:#a6e22e">i</span><span style="color:#f92672">++</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">updateNode</span>[<span style="color:#a6e22e">i</span>] = <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">header</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">level</span> = <span style="color:#a6e22e">lvl</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Insert after the updated node
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#a6e22e">n</span> <span style="color:#f92672">:=</span> <span style="color:#a6e22e">NewSkipListNode</span>(<span style="color:#a6e22e">lvl</span>, <span style="color:#a6e22e">data</span>)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> <span style="color:#a6e22e">i</span> <span style="color:#f92672">:=</span> <span style="color:#ae81ff">0</span>; <span style="color:#a6e22e">i</span> < <span style="color:#a6e22e">lvl</span>; <span style="color:#a6e22e">i</span><span style="color:#f92672">++</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">n</span>.<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>] = <span style="color:#a6e22e">updateNode</span>[<span style="color:#a6e22e">i</span>].<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>]
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">updateNode</span>[<span style="color:#a6e22e">i</span>].<span style="color:#a6e22e">nextPtrs</span>[<span style="color:#a6e22e">i</span>] = <span style="color:#a6e22e">n</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">sl</span>.<span style="color:#a6e22e">size</span><span style="color:#f92672">++</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">n</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="references">References</h2>
<ul>
<li><a href="https://www.jianshu.com/p/09c3b0835ba6#%E4%BB%8Ezset%E5%88%B0zskiplist">Skip List and its implementation in Redis</a></li>
<li><a href="https://github.com/redis/redis/blob/91e77a0cfb5c7e4bc6473ae04353e48ad9e8697b/src/t_zset.c">Redis source code: zskiplist</a></li>
<li><a href="https://en.wikipedia.org/wiki/Skip_list">Wikipedia: Skip List</a></li>
<li><a href="https://spongecaptain.cool/post/datastracture/skiplist/">SpongeCaptain’s blog on Data Structures</a></li>
</ul>
Kylin Overview
https://noneback.github.io/blog/kylin%E6%A6%82%E8%BF%B0/
Wed, 10 Nov 2021 23:45:27 +0800https://noneback.github.io/blog/kylin%E6%A6%82%E8%BF%B0/<p>I had been hoping to work on an interesting thesis, but couldn’t find a suitable advisor nearby. Before the college opened topic selection, I did find a promising advisor, but it turned out they couldn’t take me on; since I wasn’t especially interested in that field anyway, I kept looking. When the college’s thesis selection recently started, I found an interesting topic on the list, reached out to the professor, and took on the project.</p>
<p>The topic I chose is <strong>“Design and Implementation of Database Query Algorithms Based on Differential Privacy”</strong>, focusing on Differential Privacy + OLAP. Specifically, it’s about adding Differential Privacy as a feature to Kylin.</p>
<p>That’s the overall gist; as for the details, I might write about them in future blog posts. This is the first in this series of blog posts.</p>
<h2 id="introduction">Introduction</h2>
<p>Kylin is a distributed OLAP data warehouse based on columnar storage systems like HBase and Parquet, and computational frameworks like Hadoop and Spark. It supports multidimensional analysis of massive datasets.</p>
<p>Kylin uses a cube pre-computation method, transforming real-time queries into queries against precomputed results, utilizing idle computation resources and storage space to optimize query times. This can significantly reduce query latency.</p>
<h2 id="background">Background</h2>
<p>Before Kylin, Hadoop was commonly used for large-scale data batch processing, with results stored in columnar storage systems like HBase. The OLAP-related technologies of that era were <strong>massively parallel processing</strong> and <strong>columnar storage</strong>.</p>
<ul>
<li>
<p><strong>Massively Parallel Processing (MPP)</strong>: Leverages multiple machines to process computational tasks in parallel, essentially trading a linear growth in computing resources for a linear decrease in processing time.</p>
</li>
<li>
<p><strong>Columnar Storage</strong>: Stores data in columns instead of rows. This approach is particularly effective for OLAP queries, which typically involve aggregations of specific columns. Columnar storage allows querying only the necessary columns and makes effective use of sequential I/O, thus improving performance.</p>
</li>
</ul>
<p>These technologies enabled minute-level SQL query performance on platforms like Hadoop. However, even this is insufficient for interactive analysis, as the latency is still too high.</p>
<p>The core issue is that <strong>neither parallel computing nor columnar storage changes the fundamental time complexity of querying; they do not break the linear relationship between query time and data volume</strong>. Therefore, the only optimization comes from increasing computing resources and exploiting locality principles, both of which have scalability and theoretical bottlenecks as data grows.</p>
<p>To address this, Kylin introduced a <strong>pre-computation strategy</strong>, building multidimensional <strong>cubes</strong> for different dimensions and storing them as data tables. Future queries are made directly against these precomputed results. With pre-computation, the size of the materialized views is determined only by the cardinality of the dimensions and is no longer linearly proportional to the size of the dataset.</p>
<p>Essentially, this strategy <strong>uses idle computational resources and additional storage to improve response times during queries, breaking the linear relationship between query time and data size</strong>.</p>
<h2 id="core-concepts">Core Concepts</h2>
<p>The core working principle of Apache Kylin is <strong>MOLAP (Multidimensional Online Analytical Processing) Cube</strong> technology.</p>
<h3 id="dimensions-and-measures">Dimensions and Measures</h3>
<p><strong>Dimensions</strong> refer to perspectives used for aggregating data, typically attributes of data records. <strong>Measures</strong> are numerical values calculated based on data. Using dimensions, you can aggregate measures, e.g., $$D_1,D_2,D_3,… \rightarrow S_1,S_2,…$$</p>
<h3 id="cube-theory">Cube Theory</h3>
<p><strong>Data Cube</strong> involves building and querying precomputed, multidimensional data indices.</p>
<ul>
<li><strong>Cuboid</strong>: The data calculated for a particular combination of dimensions.</li>
<li><strong>Cube Segment</strong>: The smallest building block of a cube. A cube can be split into multiple segments.</li>
<li><strong>Incremental Cube Building</strong>: Typically triggered based on time attributes.</li>
<li><strong>Cube Cardinality</strong>: The cardinality of all dimensions in a cube determines the cube’s complexity. Higher cardinality often leads to cube expansion (amplified I/O and storage).</li>
</ul>
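<p>Since a cuboid is the aggregate for one subset of the cube's dimensions, a cube over <code>d</code> dimensions has <code>2^d</code> cuboids, which is why cube complexity and storage expand quickly with dimensionality. A tiny illustration (not Kylin code):</p>

```go
package main

import "fmt"

// cuboidCount returns the number of cuboids for a cube with the given
// number of dimensions: one per subset of dimensions, including the
// empty combination (the grand total), i.e. 2^d.
func cuboidCount(dimensions int) int {
	return 1 << dimensions
}

func main() {
	fmt.Println(cuboidCount(3))  // dimensions {A,B,C} -> 8 cuboids
	fmt.Println(cuboidCount(10)) // 10 dimensions -> 1024 cuboids
}
```

This is the combinatorial root of the cube-expansion problem noted above: adding one dimension doubles the number of cuboids to precompute and store.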
<h2 id="architecture-design">Architecture Design</h2>
<p>Kylin consists of two parts: <strong>online querying</strong> and <strong>offline building</strong>.</p>
<p><img alt="Kylin Architecture" src="https://i.loli.net/2021/11/10/AoxY4POJHdqLheb.png"></p>
<ul>
<li><strong>Offline Building</strong>: Involves three main components: the data source, the build engine, and the storage engine. Data is fetched from the data source, cubes are built, and they are stored in the columnar storage engine.</li>
<li><strong>Online Querying</strong>: Consists of an interface layer and a query engine, abstracting away concepts like cubes from the user. External applications use the REST API to submit queries, which are processed by the query engine and returned.</li>
</ul>
<h2 id="summary">Summary</h2>
<p>As an OLAP engine, Kylin leverages <strong>parallel computing, columnar storage, and pre-computation</strong> techniques to improve both online query and offline build performance. This has the following notable pros and cons:</p>
<h3 id="advantages">Advantages</h3>
<ul>
<li><strong>Standard SQL Interface</strong>: Supports BI tools and makes integration easy.</li>
<li><strong>High Query Speed</strong>: Queries against precomputed results are very fast.</li>
<li><strong>Scalable Architecture</strong>: Easily scales to handle increasing data volumes.</li>
</ul>
<h3 id="disadvantages">Disadvantages</h3>
<ul>
<li><strong>Complex Dependencies</strong>: Kylin relies on many external systems, which can make operations and maintenance challenging.</li>
<li><strong>I/O and Storage Overhead</strong>: Pre-computation and cube building can lead to amplified I/O and storage needs.</li>
<li><strong>Limited by Data Models</strong>: The complexity of data models and cube cardinality can impose limitations on scalability.</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li><a href="https://tech.meituan.com/2020/11/19/apache-kylin-practice-in-meituan.html">Meituan: Apache Kylin’s Practice and Optimization</a></li>
<li><a href="https://kylin.apache.org/">Kylin Official Documentation</a></li>
</ul>
DFS-Haystack
https://noneback.github.io/blog/dfs-haystack/
Wed, 06 Oct 2021 22:44:01 +0800https://noneback.github.io/blog/dfs-haystack/<p>The primary project in my group is a distributed file system (DFS) that provides POSIX file system semantics. The approach to handle “lots of small files” (LOSF) is inspired by Haystack, which is specifically designed for small files. I decided to read through the Haystack paper and take some notes as a learning exercise.</p>
<p>These notes are not an in-depth analysis of specific details but rather a record of my thoughts on the problem and design approach.</p>
<h2 id="introduction">Introduction</h2>
<p>Haystack is a storage system designed by Facebook for small files. In traditional DFS, file addressing typically involves using caches to store metadata, reducing disk interaction and improving lookup efficiency. For each file, a separate set of metadata must be maintained, with the volume of metadata depending on the number of files. In high-concurrency scenarios, metadata is cached in memory to reduce disk I/O.</p>
<p>With a large number of small files, the volume of metadata becomes significant. Considering the maintenance overhead of in-memory metadata, this approach becomes impractical. Therefore, Haystack was developed specifically for small files, with the core idea of aggregating multiple small files into a larger one to reduce metadata.</p>
<h2 id="background">Background</h2>
<p>The “small files” in the paper specifically refer to image data.</p>
<p>Facebook, as a social media company, deals heavily with image uploads and retrieval. As the business scaled, it became necessary to have a dedicated service to handle the massive, high-concurrency requests for image reads and writes.</p>
<p>In the social networking context, this type of data is characterized as <code>written once, read often, never modified, and rarely deleted</code>. Based on this, Facebook developed Haystack to support image sharing services.</p>
<h2 id="design">Design</h2>
<h3 id="traditional-design">Traditional Design</h3>
<p>The paper describes two historical designs: CDN-based and NAS-based solutions.</p>
<h4 id="cdn-based-solution">CDN-based Solution</h4>
<p>The core of this solution is to use CDN (Content Delivery Network) to cache hot image data, reducing network transmission.</p>
<p>This approach optimizes access to hot images but also has some issues. Firstly, CDN is expensive and has limited capacity. Secondly, image sharing includes many <code>less popular</code> images, which leads to the long tail effect, slowing down access.</p>
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/202411011455343.png"></p>
<blockquote>
<p>CDNs are generally used to serve static data and are often pre-warmed before an event, making them unsuitable as an image cache service. Many <code>less popular</code> images do not enter the CDN, leading to the long tail effect.</p>
</blockquote>
<h4 id="nas-based-solution">NAS-based Solution</h4>
<p>This was Facebook’s initial design and is essentially a variation of the CDN-based solution.</p>
<p>They introduced NAS (Network Attached Storage) for horizontal storage expansion, incorporating file system semantics, but disk I/O remained an issue. Similar to local files, reading uncached data requires at least three disk I/O operations:</p>
<ul>
<li>Read directory metadata into memory</li>
<li>Load the inode into memory</li>
<li>Read the content of the file</li>
</ul>
<p>PhotoStore was used as a caching layer to store some metadata like file handles to speed up the addressing process.</p>
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/202411011454979.png"></p>
<p>The NAS-based design did not solve the fundamental issue of excessive metadata that could not be fully cached. When the number of files reaches a certain threshold, disk I/O becomes inevitable.</p>
<blockquote>
<p>The fundamental issue is the <strong>one-to-one relationship between files and addressing metadata</strong>, causing the volume of metadata to change with the number of files.</p>
</blockquote>
<p>Thus, the key to optimization is changing the <strong>one-to-one relationship between files and metadata</strong>, reducing the frequency of disk I/O during addressing.</p>
<h3 id="haystack-based-solution">Haystack-based Solution</h3>
<p>The core idea of Haystack is to <strong>aggregate multiple small files into a larger one</strong>, maintaining a single piece of metadata for the large file. This changes the mapping between metadata and files, making it feasible to keep all metadata in memory.</p>
<blockquote>
<p>Metadata is maintained only for the aggregated file, and the position of small files within the large file is maintained separately.</p>
</blockquote>
<p><img src="https://raw.githubusercontent.com/noneback/images/picgo/202411011456020.png"></p>
<h2 id="implementation">Implementation</h2>
<p>Haystack mainly consists of three components: Haystack Directory, Haystack Cache, and Haystack Store.</p>
<h3 id="file-mapping-and-storage">File Mapping and Storage</h3>
<p>File data is ultimately stored on logical volumes, each of which corresponds to multiple physical volumes across machines.</p>
<p>Users first access the Directory to obtain access paths and then use the URL generated by the Directory to access other components to retrieve the required data.</p>
<h3 id="components">Components</h3>
<h4 id="haystack-directory">Haystack Directory</h4>
<p>This is Haystack’s access layer, responsible for <strong>file addressing</strong> and <strong>access control</strong>.</p>
<p>Read and write requests first go through the Directory. For read requests, the Directory generates an access URL containing the path: <code>http://{cdn}/{cache}/{machine id}/{logicalvolume,Photo}</code>. For write requests, it provides a volume to write into.</p>
<p>The Directory has four main functions:</p>
<ol>
<li>Load balancing for read and write requests.</li>
<li>Determine request access paths (e.g., CDN or direct access) and generate access URLs.</li>
<li>Metadata and mapping management, e.g., logical-to-physical volume mappings.</li>
<li>Logical volume read/write management, where volumes can be read-only or write-enabled.</li>
</ol>
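<p>The URL construction described above can be sketched as a simple formatting step; field and function names here are illustrative, not Haystack's actual API:</p>

```go
package main

import "fmt"

// buildReadURL sketches how the Directory might assemble the access
// path http://{cdn}/{cache}/{machine id}/{logical volume, photo}.
// All parameter names are hypothetical.
func buildReadURL(cdn, cache, machineID, logicalVolume, photo string) string {
	return fmt.Sprintf("http://%s/%s/%s/%s,%s", cdn, cache, machineID, logicalVolume, photo)
}

func main() {
	fmt.Println(buildReadURL("cdn.example.com", "cache01", "m42", "lv7", "photo123"))
}
```

For direct access (bypassing the CDN), the Directory would simply omit the CDN component from the generated path.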
<blockquote>
<p>This design follows from the data characteristics: &ldquo;written once, read often.&rdquo; Separating read-only volumes from write-enabled ones improves concurrency.</p>
</blockquote>
<p>The Directory stores metadata such as file-to-volume mappings, logical-to-physical mappings, and volume attributes (size, owner, etc.). It relies on a distributed key-value store and a cache service to ensure low latency and high availability.</p>
<blockquote>
<p><strong>Proxy, Metadata Mapping, Access Control</strong></p>
</blockquote>
<h4 id="haystack-cache">Haystack Cache</h4>
<p>The Cache layer optimizes addressing and image retrieval. The core design is the <strong>Cache Rule</strong>, which determines what data should be cached and how to handle <strong>cache misses</strong>.</p>
<p>Images are cached if they meet these criteria:</p>
<ol>
<li>The request is directly from a user, not from a CDN.</li>
<li>The photo is retrieved from a write-enabled store machine.</li>
</ol>
<p>If a cache miss occurs, the Cache fetches the image from the Store and pushes it to both the user and the CDN.</p>
<blockquote>
<p>The caching policy is based on typical access patterns.</p>
</blockquote>
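<p>The two cache-rule criteria above reduce to a simple predicate; this is a hypothetical helper for illustration, not Haystack's actual code:</p>

```go
package main

import "fmt"

// shouldCache encodes the Cache Rule: cache a photo only when the
// request came directly from a user (not via the CDN) and the photo
// was read from a write-enabled Store machine, i.e. a recently
// written, likely-hot photo.
func shouldCache(directFromUser, storeWriteEnabled bool) bool {
	return directFromUser && storeWriteEnabled
}

func main() {
	fmt.Println(shouldCache(true, true))  // user request, recent photo -> cache
	fmt.Println(shouldCache(false, true)) // CDN request -> do not cache
}
```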
<h4 id="haystack-store">Haystack Store</h4>
<p>The Store layer is responsible for data storage operations.</p>
<p>The addressing abstraction is: <code>filename + offset => logical volume id + offset => data</code>.</p>
<p>Multiple physical volumes constitute a logical volume. In the Store, small files are encapsulated as <strong>Needles</strong> managed by physical volumes.</p>
<p><img alt="Needle Abstraction" src="https://tva1.sinaimg.cn/large/008i3skNly1gv5oo0mltfj60zs0u0q5j02.jpg"></p>
<blockquote>
<p>Needles represent a way to encapsulate small files and manage volume blocks.</p>
</blockquote>
<p>Store data is accessed at the Needle level. To speed up addressing, each volume keeps an in-memory map: <code>key/alternate key => needle's flag/offset/other attributes</code>.</p>
<p>These maps are persisted in <strong>Index Files</strong> on disk to provide a checkpoint for quick metadata recovery after a crash.</p>
<p><img alt="Index File" src="https://tva1.sinaimg.cn/large/008i3skNly1gv5put6m7qj60u40jc0u102.jpg"></p>
<p><img alt="Volume Mapping" src="https://tva1.sinaimg.cn/large/008i3skNly1gv5puqgvgcj60te0dk0ua02.jpg"></p>
<blockquote>
<p>Each volume maintains its own in-memory mapping and index file.</p>
</blockquote>
<p>When updating the in-memory mapping (e.g., adding or modifying a file), the index file is updated asynchronously. Deleted files are only marked as deleted, not removed from the index file.</p>
<blockquote>
<p>The index serves as a lookup aid. Needles without an index can still be addressed, making the asynchronous update and index retention strategy feasible.</p>
</blockquote>
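<p>As a concrete illustration, here is a minimal Go sketch of a per-volume needle map: append-only puts, mark-delete, and lookup. The names (<code>VolumeIndex</code>, <code>NeedleMeta</code>) are invented for this sketch and are not Haystack’s actual structures.</p>

```go
package main

import "fmt"

// NeedleMeta is the in-memory record for one needle: its delete flag and
// its location inside the physical volume file. Illustrative only.
type NeedleMeta struct {
	Deleted bool
	Offset  int64
	Size    int64
}

// VolumeIndex maps (key, alternate key) to needle metadata, mirroring the
// map each Store machine keeps per volume.
type VolumeIndex struct {
	needles map[[2]uint64]NeedleMeta
}

func NewVolumeIndex() *VolumeIndex {
	return &VolumeIndex{needles: make(map[[2]uint64]NeedleMeta)}
}

// Put records a freshly appended needle. Because the volume is append-only,
// a later Put for the same key simply points the map at the newer offset.
func (v *VolumeIndex) Put(key, altKey uint64, offset, size int64) {
	v.needles[[2]uint64{key, altKey}] = NeedleMeta{Offset: offset, Size: size}
}

// Delete only marks the needle; its bytes stay in the volume until a
// compaction pass reclaims them.
func (v *VolumeIndex) Delete(key, altKey uint64) {
	if m, ok := v.needles[[2]uint64{key, altKey}]; ok {
		m.Deleted = true
		v.needles[[2]uint64{key, altKey}] = m
	}
}

// Lookup returns the needle location, or ok=false if it is absent or
// marked deleted.
func (v *VolumeIndex) Lookup(key, altKey uint64) (NeedleMeta, bool) {
	m, ok := v.needles[[2]uint64{key, altKey}]
	if !ok || m.Deleted {
		return NeedleMeta{}, false
	}
	return m, true
}

func main() {
	idx := NewVolumeIndex()
	idx.Put(42, 1, 0, 4096)
	idx.Put(42, 1, 4096, 4096) // overwrite: a new version at a larger offset
	if m, ok := idx.Lookup(42, 1); ok {
		fmt.Println("offset:", m.Offset) // the latest version wins
	}
	idx.Delete(42, 1)
	_, ok := idx.Lookup(42, 1)
	fmt.Println("visible after delete:", ok)
}
```

<p>Because every mutation goes through this map first, checkpointing it to an index file and replaying the volume tail after a crash is enough to rebuild it.</p>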
<h3 id="workloads">Workloads</h3>
<h4 id="read">Read</h4>
<p><code>(Logical Volume ID, key, alternate key, cookies) => photo</code></p>
<p>For a read request, Store queries the in-memory mapping for the corresponding Needle. If found, it fetches the data from the volume and verifies the cookie and integrity; otherwise, it returns an error.</p>
<blockquote>
<p>Cookies are randomly generated strings that prevent malicious attacks.</p>
</blockquote>
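<p>The cookie check on the read path can be sketched as follows: data is returned only if the needle exists and the request cookie matches the stored one. <code>readPhoto</code> and the plain map stand in for the real Store internals; the constant-time compare is a defensive choice of this sketch, not something the paper specifies.</p>

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
)

// photo pairs the stored cookie with the image bytes; in the real Store
// the cookie lives in the needle header on disk. Names are illustrative.
type photo struct {
	cookie []byte
	data   []byte
}

var errNotFound = errors.New("needle not found or cookie mismatch")

// readPhoto models the Store read path: find the needle, then verify the
// request cookie against the stored one before returning data.
func readPhoto(store map[uint64]photo, key uint64, cookie []byte) ([]byte, error) {
	p, ok := store[key]
	if !ok {
		return nil, errNotFound
	}
	if subtle.ConstantTimeCompare(p.cookie, cookie) != 1 {
		return nil, errNotFound
	}
	return p.data, nil
}

func main() {
	store := map[uint64]photo{7: {cookie: []byte("s3cret"), data: []byte("jpeg bytes")}}
	if _, err := readPhoto(store, 7, []byte("guess")); err != nil {
		fmt.Println("rejected:", err) // wrong cookie: request is refused
	}
	data, _ := readPhoto(store, 7, []byte("s3cret"))
	fmt.Printf("ok: %s\n", data)
}
```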
<h4 id="write">Write</h4>
<p><code>(Logical Volume ID, key, alternate key, cookies, data) => result</code></p>
<p>Haystack only supports appending data rather than overwriting. When a write request is received, Store appends the data as a new Needle and updates the in-memory mapping; the on-disk index file is updated asynchronously. If the file already exists, the Directory updates its metadata to point to the latest version.</p>
<blockquote>
<p>Older volumes are frozen as read-only, and new writes are appended, so a larger offset indicates a newer version.</p>
</blockquote>
<h4 id="delete">Delete</h4>
<p>Deletion is handled using <strong>Mark Delete + Compact GC</strong>.</p>
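<p>A toy compaction pass makes the “Mark Delete + Compact GC” idea concrete. It assumes needles are scanned in append order, so a larger offset means a newer version; only the latest live version of each key survives the rewrite. The <code>needle</code> struct is illustrative.</p>

```go
package main

import "fmt"

// needle carries just enough state for the GC sketch: its key, whether it
// was mark-deleted, and its offset (larger offset = newer version).
type needle struct {
	key     uint64
	deleted bool
	offset  int64
}

// compact rewrites a volume, keeping only the newest live version of each
// key; space held by deleted or superseded needles is reclaimed.
func compact(vol []needle) []needle {
	latest := make(map[uint64]needle)
	for _, n := range vol { // needles appear in append order, so later wins
		latest[n.key] = n
	}
	out := make([]needle, 0, len(latest))
	for _, n := range vol { // preserve on-disk order for the survivors
		if m := latest[n.key]; m.offset == n.offset && !n.deleted {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	vol := []needle{
		{key: 1, offset: 0},
		{key: 2, offset: 100},
		{key: 1, offset: 200},                // newer version of key 1
		{key: 2, offset: 300, deleted: true}, // key 2 was deleted
	}
	fmt.Println("live needles after compaction:", len(compact(vol)))
}
```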
<h3 id="fault-tolerance">Fault Tolerance</h3>
<p>Store ensures fault tolerance through <strong>monitoring + hot backup</strong>. Directory and Cache use Raft-like consistency algorithms for data replication and availability.</p>
<h2 id="optimization">Optimization</h2>
<p>The main optimizations include: Compaction, Batch Load, and In-Memory processing.</p>
<h2 id="summary">Summary</h2>
<ul>
<li>Key abstraction optimizations include asynchronous processing, batch operations, and caching.</li>
<li>Identifying the core issues, such as metadata management burden for a large number of small files, is crucial.</li>
</ul>
<h2 id="references">References</h2>
<p><a href="https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf">Finding a needle in Haystack: Facebook’s photo storage</a></p>
MIT6.824 Bigtable
https://noneback.github.io/blog/mit6.824-bigtable/
Thu, 16 Sep 2021 22:54:59 +0800https://noneback.github.io/blog/mit6.824-bigtable/<p>I recently found a translated version of the Bigtable paper online and saved it, but hadn’t gotten around to reading it. Lately, I’ve noticed that Bigtable shares many design similarities with a current project in our group, so I took some time over the weekend to read through it.</p>
<p>This is the last of Google’s three foundational distributed system papers, and although it wasn’t originally part of the MIT6.824 reading list, I’ve categorized it here for consistency.</p>
<p>As with previous notes, I won’t dive deep into the technical details but will instead focus on the design considerations and thoughts on the problem.</p>
<h2 id="introduction">Introduction</h2>
<p>Bigtable is a distributed <strong>structured data</strong> storage system built on top of GFS, designed to store large amounts of structured and semi-structured data. It is a NoSQL data store that emphasizes scalability and performance, as well as reliable fault tolerance through GFS.</p>
<blockquote>
<p>Design Goal: Wide Applicability, Scalability, High Performance, High Availability</p>
</blockquote>
<h2 id="data-model">Data Model</h2>
<p>Bigtable’s data model is schema-less and deliberately simple. It treats all data as uninterpreted strings, with encoding and decoding handled by the application layer.</p>
<p>Bigtable is essentially a <strong>sparse, distributed, persistent multidimensional sorted Map</strong>. The <strong>index</strong> of the Map is composed of <strong>Row Key, Column Key, and TimeStamp</strong>, and the <strong>value</strong> is an unstructured byte array.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#75715e">// Mapping abstraction
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>(<span style="color:#a6e22e">row</span>:<span style="color:#66d9ef">string</span>, <span style="color:#a6e22e">column</span>:<span style="color:#66d9ef">string</span>, <span style="color:#a6e22e">time</span>:<span style="color:#66d9ef">int64</span>) <span style="color:#f92672">-</span>> <span style="color:#66d9ef">string</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">// The map key is the multi-dimensional tuple {Row, Column, Timestamp}; the value is an uninterpreted byte string.
</span></span></span></code></pre></div><p>The paper describes the data model as follows:</p>
<blockquote>
<p>A Bigtable is a sparse, distributed, persistent multidimensional sorted map.</p>
</blockquote>
<p><strong>Sparse</strong> means that columns in the same table can be null, which is quite common.</p>
<table>
<thead>
<tr>
<th>Row</th>
<th>Columns</th>
</tr>
</thead>
<tbody>
<tr>
<td>Row1</td>
<td>{ID, Name, Phone}</td>
</tr>
<tr>
<td>Row2</td>
<td>{ID, Name, Phone, Address}</td>
</tr>
<tr>
<td>Row3</td>
<td>{ID, Name, Phone, Email}</td>
</tr>
</tbody>
</table>
<p><strong>Distributed</strong> refers to scalability and fault tolerance, i.e., <strong>Replication</strong> and <strong>Sharding</strong>. Bigtable leverages GFS replicas for fault tolerance and uses <strong>Tablet</strong> for partitioning data to achieve scalability.</p>
<p><strong>Persistent Multidimensional Sorted</strong> means data is eventually persisted to disk and kept in sorted key order; Bigtable uses a WAL and an LSM-tree-style structure to keep write and read latency low.</p>
<blockquote>
<p>The best-known open-source implementation of Bigtable is HBase, a wide-column store.</p>
</blockquote>
<h3 id="rows">Rows</h3>
<p>Bigtable organizes data using lexicographic order of row keys. A Row Key can be any string, and read and write operations are atomic at the row level.</p>
<blockquote>
<p>Lexicographic ordering helps aggregate related row records.
MySQL achieves atomic row operations using an undo log.</p>
</blockquote>
<h3 id="column-family">Column Family</h3>
<p>A set of column keys forms a Column Family, where the data often shares the same type.</p>
<p>A column key is composed of <code>Column Family : Qualifier</code>. The column family’s name must be a printable string, whereas the qualifier name can be any string.</p>
<blockquote>
<p>The paper mentions:</p>
<blockquote>
<p>Access control and both disk and memory accounting are performed at the column-family level.</p>
</blockquote>
<p>This is because business users tend to retrieve data by columns, e.g., reading webpage content. In practice, column data is often compressed for storage. Thus, the Column Family level is a more suitable level for access control and resource accounting than rows.</p>
</blockquote>
<h3 id="timestamp">TimeStamp</h3>
<p>The timestamp is used to maintain different versions of the same data, serving as a logical clock. It is also used as an index to query data versions.</p>
<blockquote>
<p>Typically, timestamps are sorted in reverse chronological order. When the number of versions is low, a pointer to the previous version is used to maintain data versioning; when the number of versions increases, an index structure is needed.
TimeStamp indexing inherently requires range queries, so a sortable data structure is appropriate for indexing.
Extra version management increases maintenance overhead, usually handled by limiting the number of data versions and garbage collecting outdated versions.</p>
</blockquote>
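<p>A sketch of per-cell version management under the assumptions above: versions kept newest-first, timestamp lookups answered by finding the newest version at or below the requested time, and garbage collection by keeping only the newest <code>n</code> versions. All names are illustrative.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// cellVersion is one timestamped value of a cell; versions are kept
// newest-first, matching Bigtable's reverse-chronological order.
type cellVersion struct {
	ts    int64
	value string
}

// put inserts a version, keeping the slice sorted by descending timestamp.
func put(versions []cellVersion, ts int64, value string) []cellVersion {
	versions = append(versions, cellVersion{ts, value})
	sort.Slice(versions, func(i, j int) bool { return versions[i].ts > versions[j].ts })
	return versions
}

// getAsOf returns the newest version at or before ts — a range-style
// lookup, which is why a sorted structure suits timestamp indexing.
func getAsOf(versions []cellVersion, ts int64) (string, bool) {
	for _, v := range versions { // newest first
		if v.ts <= ts {
			return v.value, true
		}
	}
	return "", false
}

// gcKeepN drops all but the newest n versions, the usual way version
// bloat is bounded.
func gcKeepN(versions []cellVersion, n int) []cellVersion {
	if len(versions) > n {
		return versions[:n]
	}
	return versions
}

func main() {
	var vs []cellVersion
	vs = put(vs, 100, "v1")
	vs = put(vs, 200, "v2")
	vs = put(vs, 300, "v3")
	v, _ := getAsOf(vs, 250)
	fmt.Println("as of t=250:", v) // v2: newest version not after t=250
	vs = gcKeepN(vs, 2)
	fmt.Println("versions kept:", len(vs))
}
```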
<h3 id="tablet">Tablet</h3>
<p>Bigtable uses a <strong>range-based data sharding</strong> strategy, and <strong>Tablet</strong> is the basic unit for data sharding and load balancing.</p>
<p>A tablet is a collection of rows, managed by a Tablet Server. Rows in Bigtable are ultimately stored in a tablet, which is split or merged for load balancing among Tablet Servers.</p>
<blockquote>
<p>Range-based sharding is beneficial for range queries, compared to hash-based sharding.</p>
</blockquote>
<h3 id="sstable">SSTable</h3>
<p>SSTable is a <strong>persistent, sorted, immutable Map</strong>. Both keys and values are arbitrary byte arrays.</p>
<p>A tablet in Bigtable is stored in the form of SSTable files.</p>
<blockquote>
<p>SSTable is organized into data blocks (typically 64KB each), with an index for fast data lookup. Data is read by first reading the index, searching the index, and then reading the data block.</p>
</blockquote>
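<p>That two-step read (index first, then exactly one data block) amounts to a binary search over a block index whose entries record the last key of each block. A sketch, with field names and the 64 KB offsets chosen for illustration:</p>

```go
package main

import (
	"fmt"
	"sort"
)

// blockIndexEntry records the last key of each data block and where the
// block starts in the file — enough to route a point lookup to one block.
type blockIndexEntry struct {
	lastKey string
	offset  int64
}

// findBlock binary-searches the in-memory index for the single block that
// could contain key: the first block whose lastKey >= key.
func findBlock(index []blockIndexEntry, key string) (int64, bool) {
	i := sort.Search(len(index), func(i int) bool { return index[i].lastKey >= key })
	if i == len(index) {
		return 0, false // key is past the end of the SSTable
	}
	return index[i].offset, true
}

func main() {
	// An SSTable of three 64 KB blocks; keys are sorted across blocks.
	index := []blockIndexEntry{
		{lastKey: "fox", offset: 0},
		{lastKey: "lion", offset: 65536},
		{lastKey: "zebra", offset: 131072},
	}
	off, ok := findBlock(index, "goat")
	fmt.Println(off, ok) // "goat" falls in the second block
}
```

<p>Only the matched block is then read and scanned, so a point lookup costs one index search plus one block read.</p>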
<h3 id="api">API</h3>
<p>The paper provides an API that highlights the differences from RDBMS.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#75715e">// Writing to Bigtable
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// Open the table
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>Table <span style="color:#f92672">*</span>T <span style="color:#f92672">=</span> OpenOrDie(<span style="color:#e6db74">"/bigtable/web/webtable"</span>);
</span></span><span style="display:flex;"><span><span style="color:#75715e">// Write a new anchor and delete an old anchor
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>RowMutation <span style="color:#a6e22e">r1</span>(T, <span style="color:#e6db74">"com.cnn.www"</span>);
</span></span><span style="display:flex;"><span>r1.Set(<span style="color:#e6db74">"anchor:www.c-span.org"</span>, <span style="color:#e6db74">"CNN"</span>);
</span></span><span style="display:flex;"><span>r1.Delete(<span style="color:#e6db74">"anchor:www.abc.com"</span>);
</span></span><span style="display:flex;"><span>Operation op;
</span></span><span style="display:flex;"><span>Apply(<span style="color:#f92672">&</span>op, <span style="color:#f92672">&</span>r1);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">// Reading from Bigtable
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>Scanner <span style="color:#a6e22e">scanner</span>(T);
</span></span><span style="display:flex;"><span>ScanStream <span style="color:#f92672">*</span>stream;
</span></span><span style="display:flex;"><span>stream <span style="color:#f92672">=</span> scanner.FetchColumnFamily(<span style="color:#e6db74">"anchor"</span>);
</span></span><span style="display:flex;"><span>stream<span style="color:#f92672">-></span>SetReturnAllVersions();
</span></span><span style="display:flex;"><span>scanner.Lookup(<span style="color:#e6db74">"com.cnn.www"</span>);
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> (; <span style="color:#f92672">!</span>stream<span style="color:#f92672">-></span>Done(); stream<span style="color:#f92672">-></span>Next()) {
</span></span><span style="display:flex;"><span> printf(<span style="color:#e6db74">"%s %s %lld %s</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">"</span>,
</span></span><span style="display:flex;"><span> scanner.RowName(),
</span></span><span style="display:flex;"><span> stream<span style="color:#f92672">-></span>ColumnName(),
</span></span><span style="display:flex;"><span> stream<span style="color:#f92672">-></span>MicroTimestamp(),
</span></span><span style="display:flex;"><span> stream<span style="color:#f92672">-></span>Value());
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="architecture-design">Architecture Design</h2>
<h3 id="external-components">External Components</h3>
<p>Bigtable is built on top of other components in Google’s ecosystem, which significantly simplifies Bigtable’s design.</p>
<h4 id="gfs">GFS</h4>
<p>GFS is Bigtable’s underlying storage, providing replication and fault tolerance.</p>
<blockquote>
<p>Refer to the previous notes for details.</p>
</blockquote>
<h4 id="chubby">Chubby</h4>
<p>Chubby is a highly available distributed lock service that provides a namespace, where directories and files can serve as distributed locks.</p>
<blockquote>
<p>High availability means maintaining multiple service replicas, with consistency ensured via Paxos. A lease mechanism prevents defunct Chubby clients from holding onto locks indefinitely.</p>
</blockquote>
<p>Why Chubby? What is its role?</p>
<ul>
<li>Stores Column Family information</li>
<li>Stores ACL (Access Control List)</li>
<li>Stores root metadata for the Root Tablet location, which is essential for Bigtable startup.
<blockquote>
<p>Bigtable uses a three-layer B+ tree-like structure for metadata. The Root Tablet location is in Chubby, which helps locate other metadata tablets, which in turn store user Tablet locations.</p>
</blockquote>
</li>
<li>Tablet Server lifecycle monitoring
<blockquote>
<p>Each Tablet Server creates a unique file in a designated directory in Chubby and acquires an exclusive lock on it. The server is considered offline if it loses the lock.</p>
</blockquote>
</li>
</ul>
<p>In summary, Chubby’s functionality can be categorized into two parts. One is to store critical metadata as a highly available node, while the other is to manage the lifecycle of storage nodes (Tablet Servers) using distributed locking.</p>
<p>In GFS, these responsibilities are handled by the Master. By offloading them to Chubby, Bigtable simplifies the Master design and reduces its load.</p>
<blockquote>
<p>Conceptually, Chubby can be seen as part of the Master node.</p>
</blockquote>
<h3 id="internal-components">Internal Components</h3>
<h4 id="master">Master</h4>
<p>Bigtable follows a Master-Slave architecture, similar to GFS and MapReduce. However, unlike GFS, Bigtable relies on Chubby and Tablet Servers to store metadata, with the Master only responsible for orchestrating the process and not storing tablet locations.</p>
<blockquote>
<p>Responsibilities include Tablet allocation, garbage collection, monitoring Tablet Server health, load balancing, and metadata updates.
The Master requires:</p>
<ol>
<li>All Tablet information to determine allocation and distribution.</li>
<li>Tablet Server status information to decide on allocations.</li>
</ol>
</blockquote>
<h4 id="tablet-server">Tablet Server</h4>
<p>Tablet Servers manage tablets, handling reads and writes, splitting and merging tablets when necessary.</p>
<blockquote>
<p>Metadata is not stored by the Master. Clients interact directly with Chubby and Tablet Servers for reading data.
Tablets are split by Tablet Servers, and Master may not be notified instantly. WAL+retry mechanisms should be employed to ensure operations aren’t lost.</p>
</blockquote>
<h4 id="client-sdk">Client SDK</h4>
<p>The client SDK is the entry point for businesses to access Bigtable. To minimize metadata lookup overhead, caching and prefetching are used to reduce the frequency of network interactions, making use of temporal and spatial locality.</p>
<blockquote>
<p>Caching may introduce inconsistency issues, which require appropriate solutions, such as retries during inconsistent states.</p>
</blockquote>
<h2 id="storage-design">Storage Design</h2>
<h3 id="mapping-and-addressing">Mapping and Addressing</h3>
<p>Bigtable data is uniquely determined by a <code>(Table, Row, Column)</code> tuple, stored in tablets, which in turn are stored in SSTable format on GFS.</p>
<p>Tablets are logical representations of Bigtable’s on-disk entity, managed by Tablet Servers.</p>
<p>Bigtable uses <code>Root Tablet + METADATA Table</code> for addressing. The Root Tablet location is stored in Chubby, while the METADATA Table is maintained by Tablet Servers.</p>
<p>The Root Tablet stores the location of METADATA Tablets, and each METADATA Tablet contains the location of user tablets.</p>
<blockquote>
<p>METADATA Table Row: <code>(TableID, encoding of last row in Tablet) => Tablet Location</code></p>
</blockquote>
<blockquote>
<p>The system uses a B+ tree-like three-layer structure to maintain tablet location information.</p>
</blockquote>
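<p>Each level of that hierarchy answers the same question: given a row key, which tablet’s range covers it? Because the METADATA row key encodes the <em>last</em> row of each tablet, the answer is the first entry whose end row is not smaller than the target. A sketch of that range routing, with invented names and a <code>"~"</code> sentinel for the final tablet:</p>

```go
package main

import (
	"fmt"
	"sort"
)

// metaRow mirrors a METADATA table row: the key encodes (table id, last
// row of the tablet), the value is the tablet's location.
type metaRow struct {
	endRow   string // last row key the tablet covers
	location string // which tablet server serves it
}

// locateTablet finds the tablet serving row: the first METADATA entry
// whose endRow >= row. The same "last key routes the range" trick is used
// at every level of Bigtable's three-level hierarchy.
func locateTablet(meta []metaRow, row string) (string, bool) {
	i := sort.Search(len(meta), func(i int) bool { return meta[i].endRow >= row })
	if i == len(meta) {
		return "", false
	}
	return meta[i].location, true
}

func main() {
	meta := []metaRow{ // sorted by endRow, as METADATA rows are
		{endRow: "g", location: "ts-1"},
		{endRow: "p", location: "ts-2"},
		{endRow: "~", location: "ts-3"}, // last tablet covers the tail
	}
	loc, _ := locateTablet(meta, "com.cnn.www")
	fmt.Println("row served by:", loc)
}
```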
<h3 id="scheduling-and-monitoring">Scheduling and Monitoring</h3>
<h4 id="scheduling">Scheduling</h4>
<p>Scheduling involves Tablet allocation and load balancing.</p>
<p>A Tablet can only be assigned to one Tablet Server at any given time. The Master maintains Tablet Server states and sends allocation requests as needed.</p>
<blockquote>
<p>The Master does not maintain addressing information but holds Tablet Server states (including tablet count, status, and available resources) for scheduling.</p>
</blockquote>
<h4 id="monitoring">Monitoring</h4>
<p>Monitoring is carried out by Chubby and the Master.</p>
<p>Each Tablet Server creates a unique file in a Chubby directory and acquires an exclusive lock. When the Tablet Server disconnects and loses its lease, the lock is released.</p>
<blockquote>
<p>The unique file determines whether a Tablet Server is active, and the Master may delete the file as needed.
In cases of network disconnection, the Tablet Server will try to re-acquire the exclusive lock if the file still exists.
If the file doesn’t exist, the disconnected Tablet Server should automatically leave the cluster.</p>
</blockquote>
<p>The Master ensures its uniqueness by acquiring an exclusive lock on a unique file in Chubby, and monitors a specific directory for Tablet Server files.</p>
<p>Once it detects a failure, it deletes the Tablet Server’s Chubby file and reallocates its tablets to other Tablet Servers.</p>
<h2 id="compaction">Compaction</h2>
<p>Bigtable provides read and write services and uses an LSM-like structure to optimize write performance. For each write operation, the ACL information is first retrieved from Chubby to verify permissions. The write is then logged in WAL and stored in Memtable before eventually being persisted in SSTable.</p>
<p>When Memtable grows to a certain size, it triggers a <strong>Minor Compaction</strong> to convert Memtable to SSTable and write it to GFS.</p>
<blockquote>
<p>Memtable is first converted into an immutable Memtable before becoming SSTable. This intermediate step ensures that Minor Compaction does not interfere with incoming writes.</p>
</blockquote>
<p>Bigtable uses <strong>Compaction</strong> to accelerate writes, converting random writes into sequential writes and writing data in the background. Compaction occurs in three types:</p>
<ul>
<li><strong>Minor Compaction</strong>: Converts a frozen Memtable into an SSTable and writes it to GFS.</li>
<li><strong>Merging Compaction</strong>: Merges the Memtable and a few SSTables into a single new SSTable.</li>
<li><strong>Major Compaction</strong>: Rewrites all SSTables into exactly one, discarding deleted entries and superseded versions.</li>
</ul>
<p>For reads, results must be aggregated across the Memtable and multiple SSTables, since a key’s data may be spread over these structures. <strong>Two-level caching</strong> and <strong>Bloom filters</strong> are used to speed up reads.</p>
<p>Tablet Servers have two levels of caching:</p>
<ol>
<li><strong>Scan Cache</strong>: Caches frequently read key-value pairs.</li>
<li><strong>Block Cache</strong>: Caches SSTable blocks.</li>
</ol>
<p>Bloom filters are also employed to reduce the number of SSTable lookups by indicating whether a key is not present.</p>
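<p>A minimal Bloom filter sketch shows why this works: a negative answer is definite, so the server can skip an SSTable without touching disk, while a positive answer may be a false positive and still requires the real lookup. The FNV-based double hashing and the parameters here are illustrative, not what Bigtable actually uses.</p>

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a tiny Bloom filter: k hash probes into a bitset.
type bloom struct {
	bits []bool
	k    uint32
}

func newBloom(m int, k uint32) *bloom {
	return &bloom{bits: make([]bool, m), k: k}
}

// positions derives k probe positions from one 64-bit FNV hash via
// double hashing: pos_i = h1 + i*h2 (mod m).
func (b *bloom) positions(key string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	pos := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		pos[i] = (h1 + i*h2) % uint32(len(b.bits))
	}
	return pos
}

func (b *bloom) Add(key string) {
	for _, p := range b.positions(key) {
		b.bits[p] = true
	}
}

func (b *bloom) MayContain(key string) bool {
	for _, p := range b.positions(key) {
		if !b.bits[p] {
			return false // definitely absent: no disk read needed
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 4)
	f.Add("row:com.cnn.www")
	fmt.Println(f.MayContain("row:com.cnn.www")) // true: never a false negative
	fmt.Println(f.MayContain("row:org.example")) // almost certainly false
}
```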
<h2 id="optimization">Optimization</h2>
<h3 id="locality">Locality</h3>
<p>High-frequency columns can be grouped together into one SSTable, reducing the time to fetch related data.</p>
<blockquote>
<p>Space is traded for time, leveraging locality principles.</p>
</blockquote>
<h3 id="compression">Compression</h3>
<p>SSTable blocks are compressed to reduce network bandwidth and latency during transfers.</p>
<blockquote>
<p>Compression is performed in blocks to reduce encoding/decoding time and improve parallelism.</p>
</blockquote>
<h3 id="commitlog-design">CommitLog Design</h3>
<p>Tablet Servers maintain one <strong>Commit Log</strong> each, instead of one per tablet, to minimize disk seeks and enable batched writes. During recovery, log entries are sorted by <code>(Table, Row, Log Seq Num)</code> so each tablet’s mutations can be replayed contiguously and in order.</p>
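<p>A sketch of that recovery-time sort over interleaved log entries (field names are illustrative):</p>

```go
package main

import (
	"fmt"
	"sort"
)

// logEntry is one mutation in the shared commit log; because many tablets
// interleave in a single log, recovery first sorts the entries.
type logEntry struct {
	table  string
	row    string
	seqNum int64
}

// sortForRecovery orders entries by (table, row, seq num) so each row's
// mutations replay contiguously and in the order they were applied.
func sortForRecovery(entries []logEntry) {
	sort.Slice(entries, func(i, j int) bool {
		a, b := entries[i], entries[j]
		if a.table != b.table {
			return a.table < b.table
		}
		if a.row != b.row {
			return a.row < b.row
		}
		return a.seqNum < b.seqNum
	})
}

func main() {
	entries := []logEntry{
		{"webtable", "com.cnn.www", 7},
		{"users", "alice", 3},
		{"webtable", "com.cnn.www", 2},
	}
	sortForRecovery(entries)
	for _, e := range entries {
		fmt.Println(e.table, e.row, e.seqNum)
	}
}
```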
<h2 id="summary">Summary</h2>
<ul>
<li>Keep it simple: Simple is better than complex.</li>
<li>Cluster monitoring is crucial for distributed services. Google’s three papers emphasize cluster monitoring and scheduling.</li>
<li>Do not make assumptions about other systems in your design. Issues may range from common network issues to unexpected operational problems.</li>
<li>Leverage background operations to accelerate user-facing actions, such as making writes fast and using background processes for cleanups.</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li><a href="https://zh.wikipedia.org/wiki/Bigtable">Wikipedia - Bigtable</a></li>
<li><a href="https://static.googleusercontent.com/media/research.google.com/zh-CN//archive/bigtable-osdi06.pdf">Bigtable Paper</a></li>
<li><a href="https://www.cnblogs.com/xybaby/p/9096748.html">Bigtable Analysis</a></li>
<li><a href="https://zhuanlan.zhihu.com/p/181498475">LSM Tree Explained</a></li>
</ul>
MIT6.824 GFS
https://noneback.github.io/blog/mit6.824-gfs/
Thu, 09 Sep 2021 00:44:24 +0800https://noneback.github.io/blog/mit6.824-gfs/<p>This article introduces the Google File System (GFS) paper published in 2003, which proposed a distributed file system designed to store large volumes of data reliably, meeting Google’s data storage needs. This write-up reflects on the design goals, trade-offs, and architectural choices of GFS.</p>
<h2 id="introduction">Introduction</h2>
<p>GFS is a distributed file system developed by Google to meet the needs of data-intensive applications, using commodity hardware to provide a scalable and fault-tolerant solution.</p>
<h3 id="background">Background</h3>
<ol>
<li><strong>Component Failures as the Norm</strong>: In GFS, component failures are treated as normal events rather than exceptions.</li>
</ol>
<blockquote>
<p>GFS uses inexpensive hardware to build a reliable service. Each machine has a certain probability of failure, resulting in a binomial distribution of overall system failures. The key challenge is to ensure the system remains available through redundancy and rapid failover.</p>
</blockquote>
<ol start="2">
<li><strong>Massive Files</strong>: Files in GFS can be extremely large, ranging from several hundred megabytes to tens of gigabytes.</li>
</ol>
<blockquote>
<p>GFS favors large files rather than many small files. Managing a large number of small files in a distributed system can lead to increased metadata overhead, inefficient caching, and greater inode usage.</p>
</blockquote>
<ol start="3">
<li><strong>Sequential Access</strong>: Most file modifications append data to the end of files rather than random modifications, and reads are generally sequential.</li>
</ol>
<blockquote>
<p>GFS is optimized for sequential writes, especially for appending data. Random writes are not well-supported and do not guarantee consistency.</p>
</blockquote>
<ol start="4">
<li><strong>Collaborative Design</strong>: The API and file system are designed collaboratively to improve efficiency and flexibility.</li>
</ol>
<blockquote>
<p>GFS provides an API similar to POSIX but includes additional optimizations to better match Google’s workload.</p>
</blockquote>
<h2 id="design-goals">Design Goals</h2>
<h3 id="storage-capacity">Storage Capacity</h3>
<p>GFS is designed to manage millions of files, most of which are at least 100 MB in size. Files of several gigabytes are common, but GFS also supports smaller files without specific optimization.</p>
<h3 id="workload">Workload</h3>
<h4 id="read-workload">Read Workload</h4>
<ol>
<li><strong>Large-Scale Sequential Reads</strong>: Large-scale sequential data retrieval using disk I/O.</li>
<li><strong>Small-Scale Random Reads</strong>: Small-scale random data retrieval, optimized through techniques such as request batching.</li>
</ol>
<h4 id="write-workload">Write Workload</h4>
<p>Primarily large-scale sequential writes, typically appending data to the end of files. GFS supports <strong>concurrent data appends</strong> from multiple clients, with atomic guarantees and synchronization.</p>
<h3 id="bandwidth-vs-latency">Bandwidth vs. Latency</h3>
<p>High <strong>sustained bandwidth</strong> is prioritized over low latency, given the typical workloads of GFS.</p>
<h3 id="fault-tolerance">Fault Tolerance</h3>
<p>GFS continuously monitors its state to detect and recover from component failures, which are treated as common occurrences.</p>
<h3 id="operations-and-interfaces">Operations and Interfaces</h3>
<p>GFS provides traditional file system operations such as file creation, deletion, and reading, along with features like <strong>snapshots</strong> and <strong>atomic record append</strong>.</p>
<blockquote>
<p>Snapshots create file or directory copies, while atomic record append guarantees that data is appended atomically.</p>
</blockquote>
<h2 id="architecture">Architecture</h2>
<p>The architecture of GFS follows a Master-Slave design, consisting of a single Master node and multiple Chunk Servers.</p>
<blockquote>
<p>The Master and Chunk Servers are logical concepts and do not necessarily refer to specific physical machines.</p>
</blockquote>
<p><img alt="GFS Architecture" src="https://tva1.sinaimg.cn/large/008i3skNly1gu6y6qm5t0j61i40nojuk02.jpg"></p>
<p>GFS provides a client library (SDK) that allows clients to access the system, abstracting the underlying complexity. File data is divided into chunks and stored across multiple Chunk Servers, with replication for reliability. The Master manages metadata such as namespace, chunk locations, and more.</p>
<h3 id="component-overview">Component Overview</h3>
<h4 id="client">Client</h4>
<p>Clients in GFS are application processes that use the GFS SDK for seamless integration. Key functionalities of the client include:</p>
<ul>
<li><strong>Caching</strong>: Cache metadata obtained from the Master to reduce communication overhead.</li>
<li><strong>Encapsulation</strong>: Encapsulate retries, request splitting, and checksum validation.</li>
<li><strong>Optimization</strong>: Perform request batching, load balancing, and caching to enhance efficiency.</li>
<li><strong>Mapping</strong>: Map file operations to chunk-based ones, such as converting <code>(filename, offset)</code> into <code>(chunk index, offset)</code>.</li>
</ul>
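<p>Assuming GFS’s fixed 64 MB chunk size, the client-side mapping from <code>(filename, offset)</code> to <code>(chunk index, offset)</code> is simple integer arithmetic; a sketch:</p>

```go
package main

import "fmt"

// chunkSize is GFS's fixed 64 MB chunk size.
const chunkSize = 64 << 20

// toChunkCoord converts a byte offset within a file into the
// (chunk index, offset-in-chunk) pair the client sends to the Master and
// then to a Chunk Server.
func toChunkCoord(fileOffset int64) (chunkIndex int64, chunkOffset int64) {
	return fileOffset / chunkSize, fileOffset % chunkSize
}

func main() {
	idx, off := toChunkCoord(200 << 20) // byte 200 MB into the file
	fmt.Println(idx, off)               // chunk 3, 8 MB into that chunk
}
```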
<h4 id="master">Master</h4>
<p>The Master maintains all metadata, including the namespace, file-to-chunk mappings, and chunk versioning. Key functionalities include:</p>
<ul>
<li><strong>Monitoring</strong>: Track Chunk Server status and data locations using heartbeats.</li>
<li><strong>Directory Tree Management</strong>: Manage the hierarchical file system structure with efficient locking mechanisms.</li>
<li><strong>Mapping Management</strong>: Maintain mappings between files and chunks for fast lookups.</li>
<li><strong>Fault Tolerance</strong>: Utilize checkpointing and a replicated operation log (with shadow masters) to recover from Master failures.</li>
<li><strong>System Scheduling</strong>: Manage chunk replication, garbage collection, lease distribution, and primary Chunk Server selection.</li>
</ul>
<blockquote>
<p>Metadata is stored in memory for performance reasons, resulting in a simplified design, but making checkpointing and logging crucial to ensure recovery.</p>
</blockquote>
<h4 id="chunk-server">Chunk Server</h4>
<p>Chunk Servers are responsible for storing data, with each file chunk being saved as a Linux file. Chunk Servers also perform data integrity checks and report health information to the Master regularly.</p>
<h2 id="key-concepts-and-mechanisms">Key Concepts and Mechanisms</h2>
<h3 id="chunk-size">Chunk Size</h3>
<p>Chunks are the logical units for storing data in GFS, with each chunk typically sized at 64 MB. The chunk size balances metadata overhead, caching efficiency, data locality, and fault tolerance.</p>
<blockquote>
<p>Small chunks increase metadata load on the Master, whereas larger chunks can create data hot spots and fragmentation.</p>
</blockquote>
<h3 id="lease-mechanism">Lease Mechanism</h3>
<p>GFS uses a <strong>lease mechanism</strong> to ensure consistency between chunk replicas. When concurrent write requests occur, the Master selects a Chunk Server to be the <strong>primary</strong>. The primary node assigns an order to client operations, ensuring concurrent operations are executed consistently.</p>
<blockquote>
<p>This mechanism reduces the coordination load on the Master and allows data to be appended atomically.</p>
</blockquote>
<h3 id="chunk-versioning">Chunk Versioning</h3>
<p>The versioning system is used to ensure that only the latest chunk version is valid. The Master increments the version whenever a lease is granted, and a new version number is committed after acknowledgment from the primary.</p>
<blockquote>
<p>Versioning helps determine the freshness of data during recoveries.</p>
</blockquote>
<h3 id="control-flow-vs-data-flow">Control Flow vs. Data Flow</h3>
<p>GFS separates <strong>control flow</strong> and <strong>data flow</strong> to optimize data transfers. Control commands are issued separately from data transfers, enabling efficient utilization of network topology.</p>
<blockquote>
<p>Data is sent using a <strong>pipeline</strong> approach between Chunk Servers, which minimizes network overhead and uses cache effectively.</p>
</blockquote>
<h3 id="data-integrity">Data Integrity</h3>
<p>Chunks are split into 64 KB blocks, each with a corresponding checksum for data integrity. These checksums are used to verify data during read operations.</p>
<blockquote>
<p>Checksums are stored separately from the data, providing an additional layer of reliability.</p>
</blockquote>
<h3 id="fault-tolerance-and-replication">Fault Tolerance and Replication</h3>
<p>Chunks are stored in multiple replicas across different Chunk Servers for reliability. The Master detects Chunk Server failures via heartbeats and manages replication to meet desired redundancy levels.</p>
<blockquote>
<p>Data integrity failures or Chunk Server disconnections trigger replication to maintain availability.</p>
</blockquote>
<h3 id="consistency">Consistency</h3>
<p>GFS has a relaxed consistency model: file regions may be consistent, defined, or inconsistent depending on the outcome of an operation, and strong consistency is not guaranteed.</p>
<blockquote>
<p>In practice, operations such as <strong>atomic record append</strong> ensure data integrity during appends but may not eliminate duplicate writes. Random writes are not consistently managed.</p>
</blockquote>
<h2 id="summary">Summary</h2>
<p>GFS demonstrates how practical design trade-offs, driven by specific business needs, can lead to an efficient and scalable distributed file system. It focuses on resilience, fault tolerance, and high throughput, making it ideal for Google’s data processing needs.</p>
<p>In distributed systems, scalability is often more important than single-node performance. GFS embraces this principle through large file management, redundancy, and workload distribution.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://spongecaptain.cool/post/paper/googlefilesystem/">Google File System - GFS Paper Reading</a></li>
<li><a href="https://tanxinyu.work/gfs-thesis/">GFS Paper Summary</a></li>
<li><a href="https://nxwz51a5wp.feishu.cn/docs/doccnNYeo3oXj6cWohseo6yB4id">GFS Paper Overview</a></li>
<li><a href="https://static.googleusercontent.com/media/research.google.com/zh-CN//archive/gfs-sosp2003.pdf">GFS Original Paper</a></li>
<li><a href="https://pdos.csail.mit.edu/6.824/schedule.html">MIT6.824 Course</a></li>
</ul>
Epoll and IO Multiplexing
https://noneback.github.io/blog/epoll-and-io%E5%A4%8D%E7%94%A8/
Sun, 15 Aug 2021 21:47:45 +0800https://noneback.github.io/blog/epoll-and-io%E5%A4%8D%E7%94%A8/<p>Let’s start with epoll.</p>
<p>epoll is an I/O event notification mechanism in the Linux kernel, designed to replace select and poll. It handles large numbers of file descriptors efficiently, scaling up to the system-wide limit on open files with excellent performance.</p>
<h2 id="usage">Usage</h2>
<h3 id="api">API</h3>
<p>epoll has three primary system calls:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#75715e">/** epoll_create
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * Creates an epoll instance and returns a file descriptor for it.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * Needs to be closed afterward, as epfd also consumes the system's fd resources.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * size: Historically a hint for the expected number of fds; ignored since Linux 2.6.8 (but must be > 0).
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> */</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">int</span> <span style="color:#a6e22e">epoll_create</span>(<span style="color:#66d9ef">int</span> size);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">/** epoll_ctl
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * Adds or modifies a file descriptor to be monitored by epoll.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * epfd: The epoll file descriptor.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * op: Operation type.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLL_CTL_ADD: Add a new fd to epfd.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLL_CTL_MOD: Modify an already registered fd.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLL_CTL_DEL: Remove an fd from epfd.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * fd: The file descriptor to be monitored.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * event: Specifies the type of event to be monitored.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLIN: Indicates the fd is ready for reading (including when the peer socket is closed).
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLOUT: Indicates the fd is ready for writing.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLPRI: Indicates urgent data can be read.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLERR: Indicates an error occurred on the fd.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLHUP: Indicates the fd has been hung up.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLET: Sets epoll to Edge Triggered (ET) mode, as opposed to Level Triggered (LT).
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * EPOLLONESHOT: Only listen for the event once. If continued monitoring is required, the socket must be re-added to epfd.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> */</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">int</span> <span style="color:#a6e22e">epoll_ctl</span>(<span style="color:#66d9ef">int</span> epfd, <span style="color:#66d9ef">int</span> op, <span style="color:#66d9ef">int</span> fd, <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epoll_event</span> <span style="color:#f92672">*</span>event);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">/** epoll_wait
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * Collects events that have been triggered and returns the number of triggered events.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * epfd: The epoll file descriptor.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * events: Array of epoll events that will be populated with triggered events.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * maxevents: Indicates the size of the events array.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> * timeout: Timeout duration; 0 returns immediately, -1 blocks indefinitely, >0 waits for the specified duration.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> */</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">int</span> <span style="color:#a6e22e">epoll_wait</span>(<span style="color:#66d9ef">int</span> epfd, <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epoll_event</span> <span style="color:#f92672">*</span>events, <span style="color:#66d9ef">int</span> maxevents, <span style="color:#66d9ef">int</span> timeout);
</span></span></code></pre></div><h3 id="processing-flow">Processing Flow</h3>
<h4 id="epoll_create">epoll_create</h4>
<p>When a process calls <code>epoll_create</code>, the Linux kernel creates an <code>eventpoll</code> structure:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">eventpoll</span> {
</span></span><span style="display:flex;"><span> spinlock_t lock;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">mutex</span> mtx;
</span></span><span style="display:flex;"><span> wait_queue_head_t wq;
</span></span><span style="display:flex;"><span> wait_queue_head_t poll_wait;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">list_head</span> rdllist;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">rb_root</span> rbr;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epitem</span> <span style="color:#f92672">*</span>ovflist;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">user_struct</span> <span style="color:#f92672">*</span>user;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">file</span> <span style="color:#f92672">*</span>file;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> visited;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">list_head</span> visited_list_link;
</span></span><span style="display:flex;"><span>};
</span></span></code></pre></div><p>At this point the kernel allocates an <code>eventpoll</code> object and initializes an empty red-black tree (<code>rbr</code>) that will index the monitored file descriptors; the descriptors themselves are inserted into the tree later, by <code>epoll_ctl</code>.</p>
<p>The kernel also initializes a doubly linked ready list (<code>rdllist</code>) that collects the events that have become ready.</p>
<h4 id="epoll_ctl">epoll_ctl</h4>
<p>For each monitored event, an <code>epitem</code> structure is created:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epitem</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">rb_node</span> rbn;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">list_head</span> rdllink;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epitem</span> <span style="color:#f92672">*</span>next;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epoll_filefd</span> ffd;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> nwait;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">list_head</span> pwqlist;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">eventpoll</span> <span style="color:#f92672">*</span>ep;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">list_head</span> fllink;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">epoll_event</span> event;
</span></span><span style="display:flex;"><span>};
</span></span></code></pre></div><p><img alt="epitem structure" src="https://i.loli.net/2021/08/15/ZH6Pixq4X5BLc2z.png"></p>
<p>When <code>epoll_ctl</code> is called with <code>EPOLL_CTL_ADD</code>, the socket fd is inserted into the <code>eventpoll</code>’s red-black tree and a callback is registered on the fd’s wait queue. When the device signals readiness (typically from an interrupt handler), the callback moves the corresponding <code>epitem</code> onto the ready list.</p>
<h4 id="epoll_wait">epoll_wait</h4>
<p>When <code>epoll_wait</code> is called, it simply checks if there is data in the list of ready events (<code>epitem</code>). If there is data, it returns immediately; otherwise, it sleeps until either data arrives or the timeout expires.</p>
<h2 id="epoll-usage-model">Epoll Usage Model</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span><span style="color:#66d9ef">for</span> (;;) {
</span></span><span style="display:flex;"><span> nfds <span style="color:#f92672">=</span> epoll_wait(epfd, events, <span style="color:#ae81ff">20</span>, <span style="color:#ae81ff">500</span>);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> (i <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>; i <span style="color:#f92672"><</span> nfds; <span style="color:#f92672">++</span>i) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (events[i].data.fd <span style="color:#f92672">==</span> listenfd) {
</span></span><span style="display:flex;"><span> connfd <span style="color:#f92672">=</span> accept(listenfd, (sockaddr <span style="color:#f92672">*</span>)<span style="color:#f92672">&</span>clientaddr, <span style="color:#f92672">&</span>clilen);
</span></span><span style="display:flex;"><span> ev.data.fd <span style="color:#f92672">=</span> connfd;
</span></span><span style="display:flex;"><span> ev.events <span style="color:#f92672">=</span> EPOLLIN <span style="color:#f92672">|</span> EPOLLET;
</span></span><span style="display:flex;"><span> epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, <span style="color:#f92672">&</span>ev);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">else</span> <span style="color:#a6e22e">if</span> (events[i].events <span style="color:#f92672">&</span> EPOLLIN) {
</span></span><span style="display:flex;"><span>        sockfd <span style="color:#f92672">=</span> events[i].data.fd;
</span></span><span style="display:flex;"><span>        n <span style="color:#f92672">=</span> read(sockfd, line, MAXLINE);
</span></span><span style="display:flex;"><span> ev.data.ptr <span style="color:#f92672">=</span> md;
</span></span><span style="display:flex;"><span> ev.events <span style="color:#f92672">=</span> EPOLLOUT <span style="color:#f92672">|</span> EPOLLET;
</span></span><span style="display:flex;"><span> epoll_ctl(epfd, EPOLL_CTL_MOD, sockfd, <span style="color:#f92672">&</span>ev);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">else</span> <span style="color:#a6e22e">if</span> (events[i].events <span style="color:#f92672">&</span> EPOLLOUT) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">struct</span> <span style="color:#a6e22e">myepoll_data</span> <span style="color:#f92672">*</span>md <span style="color:#f92672">=</span> (myepoll_data <span style="color:#f92672">*</span>)events[i].data.ptr;
</span></span><span style="display:flex;"><span> sockfd <span style="color:#f92672">=</span> md<span style="color:#f92672">-></span>fd;
</span></span><span style="display:flex;"><span> send(sockfd, md<span style="color:#f92672">-></span>ptr, strlen((<span style="color:#66d9ef">char</span> <span style="color:#f92672">*</span>)md<span style="color:#f92672">-></span>ptr), <span style="color:#ae81ff">0</span>);
</span></span><span style="display:flex;"><span> ev.data.fd <span style="color:#f92672">=</span> sockfd;
</span></span><span style="display:flex;"><span> ev.events <span style="color:#f92672">=</span> EPOLLIN <span style="color:#f92672">|</span> EPOLLET;
</span></span><span style="display:flex;"><span> epoll_ctl(epfd, EPOLL_CTL_MOD, sockfd, <span style="color:#f92672">&</span>ev);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Other processing
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="blocking-io-non-blocking-io-and-io-multiplexing">Blocking IO, Non-blocking IO, and IO Multiplexing</h2>
<h3 id="blocking-io">Blocking IO</h3>
<p><strong>Blocking IO</strong> means that a thread waits for data to arrive, releasing the CPU until the data is available. When data arrives, the thread is rescheduled to run.</p>
<p>In scenarios with many read/write requests, frequent context switching and thread scheduling can lead to inefficiency.</p>
<h3 id="non-blocking-io">Non-blocking IO</h3>
<p>In <strong>non-blocking IO</strong>, a user thread makes an IO request, and if data is not yet available, it returns immediately. The thread must keep checking until the data is ready, at which point it can proceed.</p>
<p>Non-blocking IO has a significant drawback: the thread must poll repeatedly until the data is ready, which burns CPU cycles doing no useful work.</p>
<h3 id="io-multiplexing">IO Multiplexing</h3>
<p>Blocking IO occupies resources, and excessive context switching can be inefficient. Non-blocking IO can lead to high CPU utilization due to constant polling.</p>
<p><strong>IO multiplexing</strong> manages multiple file descriptors in a single thread, reducing context switching and idle CPU usage. Mechanisms like select, poll, and epoll were developed to implement this concept, with epoll being the most scalable and efficient.</p>
<h2 id="references">References</h2>
<p><a href="https://www.cnblogs.com/lojunren/p/3856290.html">Linux IO Multiplexing and epoll Explained</a></p>
<p><a href="https://www.infoq.cn/article/26lpjzsp9echwgnic7lq">Deep Dive into epoll</a></p>
Linux Cgroups Overview
https://noneback.github.io/blog/linux-cgroups%E7%AE%80%E4%BB%8B/
Tue, 08 Jun 2021 22:26:17 +0800https://noneback.github.io/blog/linux-cgroups%E7%AE%80%E4%BB%8B/<p><strong>Linux Cgroups</strong> (Control Groups) provide the ability to limit, control, and monitor the resources used by a group of processes and their future child processes. These resources include CPU, memory, storage, and network. With Cgroups, it’s easy to limit a process’s resource usage and monitor its metrics in real time.</p>
<h2 id="three-components-of-cgroups">Three Components of Cgroups</h2>
<ul>
<li>
<p><strong>cgroup</strong></p>
<p>A mechanism for managing groups of processes. A cgroup contains a group of processes, and various Linux subsystem parameters can be configured on this cgroup, associating a group of processes with a group of system parameters from subsystems.</p>
</li>
<li>
<p><strong>subsystem</strong></p>
<p>A module that controls a set of resources.</p>
<p><img alt="Subsystem" src="https://i.loli.net/2021/06/08/p4e91XZRFAPBqyW.png"></p>
<p>Each subsystem is attached to a hierarchy; it enforces, on the processes of every cgroup in that hierarchy, the limits configured for that cgroup.</p>
</li>
<li>
<p><strong>hierarchy</strong></p>
<p>A hierarchy is a tree structure that links multiple cgroups. With this tree structure, cgroups can inherit attributes from their parent cgroups.</p>
<blockquote>
<p>Example Scenario:
Suppose there is a group of periodic tasks limited by <code>cgroup1</code> in terms of CPU usage. If one of these tasks is a logging process that also needs to be limited by disk I/O, a new <code>cgroup2</code> can be created that inherits from <code>cgroup1</code>. <code>cgroup2</code> will inherit the CPU limit from <code>cgroup1</code> and add its own disk I/O limitation, without affecting other processes in <code>cgroup1</code>.</p>
</blockquote>
</li>
</ul>
<h2 id="relationships-between-the-three">Relationships Between the Three</h2>
<ul>
<li>When a new hierarchy is created, all processes in the system <strong>join</strong> the <strong>root cgroup</strong> of that hierarchy by default. This root cgroup is created automatically with the hierarchy.</li>
<li>A subsystem can only be attached to one hierarchy.</li>
<li>A hierarchy can have multiple subsystems attached.</li>
<li>A process can belong to multiple cgroups in different hierarchies.</li>
<li>A child process is in the same cgroup as its parent process but can be moved to a different cgroup later.</li>
</ul>
<h2 id="kernel-interface">Kernel Interface</h2>
<p>Hierarchies in cgroups are organized in a <strong>tree</strong> structure. The kernel provides a <strong>virtual tree-like file system</strong> to configure cgroups, making it intuitive to work with them through a hierarchical directory structure.</p>
<ul>
<li>Create a hierarchy and add sub-cgroups:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>mkdir cgroup-test <span style="color:#75715e"># Create mount point</span>
</span></span><span style="display:flex;"><span>sudo mount -t cgroup -o none,name<span style="color:#f92672">=</span>cgroup-test cgroup-test ./cgroup-test <span style="color:#75715e"># Mount hierarchy</span>
</span></span><span style="display:flex;"><span>cd cgroup-test
</span></span><span style="display:flex;"><span>sudo mkdir cgroup-1
</span></span><span style="display:flex;"><span>sudo mkdir cgroup-2
</span></span><span style="display:flex;"><span>tree
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>.
</span></span><span style="display:flex;"><span>├── cgroup-1
</span></span><span style="display:flex;"><span>│ ├── cgroup.clone_children
</span></span><span style="display:flex;"><span>│ ├── cgroup.procs
</span></span><span style="display:flex;"><span>│ ├── notify_on_release
</span></span><span style="display:flex;"><span>│ └── tasks
</span></span><span style="display:flex;"><span>├── cgroup-2
</span></span><span style="display:flex;"><span>│ ├── cgroup.clone_children
</span></span><span style="display:flex;"><span>│ ├── cgroup.procs
</span></span><span style="display:flex;"><span>│ ├── notify_on_release
</span></span><span style="display:flex;"><span>│ └── tasks
</span></span><span style="display:flex;"><span>├── cgroup.clone_children
</span></span><span style="display:flex;"><span>├── cgroup.procs
</span></span><span style="display:flex;"><span>├── cgroup.sane_behavior
</span></span><span style="display:flex;"><span>├── notify_on_release
</span></span><span style="display:flex;"><span>├── release_agent
</span></span><span style="display:flex;"><span>└── tasks
</span></span></code></pre></div><p><strong>Meaning of Different Files</strong></p>
<p><img alt="File Descriptions" src="https://i.loli.net/2021/06/08/LokHKWqXs5SN4cI.png"></p>
<ul>
<li>Add and move processes to a cgroup (move process PID into the corresponding <code>tasks</code> file):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>sudo sh -c <span style="color:#e6db74">"echo </span>$$<span style="color:#e6db74"> >> ./cgroup-1/tasks"</span> <span style="color:#75715e"># Move terminal process to cgroup-1</span>
</span></span><span style="display:flex;"><span>cat /proc/$$/cgroup
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>>>
</span></span><span style="display:flex;"><span>13:name<span style="color:#f92672">=</span>cgroup-test:/cgroup-1
</span></span><span style="display:flex;"><span>12:memory:/user.slice/user-1002.slice/session-12331.scope
</span></span><span style="display:flex;"><span>11:perf_event:/
</span></span><span style="display:flex;"><span>10:cpuset:/
</span></span><span style="display:flex;"><span>9:freezer:/
</span></span><span style="display:flex;"><span>8:blkio:/user.slice
</span></span><span style="display:flex;"><span>7:rdma:/
</span></span><span style="display:flex;"><span>6:hugetlb:/
</span></span><span style="display:flex;"><span>5:pids:/user.slice/user-1002.slice/session-12331.scope
</span></span><span style="display:flex;"><span>4:cpu,cpuacct:/user.slice
</span></span><span style="display:flex;"><span>3:net_cls,net_prio:/
</span></span><span style="display:flex;"><span>2:devices:/user.slice
</span></span><span style="display:flex;"><span>1:name<span style="color:#f92672">=</span>systemd:/user.slice/user-1002.slice/session-12331.scope
</span></span><span style="display:flex;"><span>0::/user.slice/user-1002.slice/session-12331.scope
</span></span></code></pre></div><ul>
<li>
<p>Limit cgroup resource usage via subsystems:</p>
<p>First, the hierarchy must be attached to a subsystem. The example below uses the <code>memory</code> subsystem hierarchy that most systems mount by default (e.g. under <code>/sys/fs/cgroup/memory</code>).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Start a memory-intensive stress process without any limitations</span>
</span></span><span style="display:flex;"><span>stress --vm-bytes 200m --vm-keep -m <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>cd /sys/fs/cgroup/memory <span style="color:#75715e"># Enter the default memory hierarchy</span>
</span></span><span style="display:flex;"><span>sudo mkdir test-limit-memory <span style="color:#f92672">&amp;&amp;</span> cd test-limit-memory <span style="color:#75715e"># Create a cgroup</span>
</span></span><span style="display:flex;"><span>sudo sh -c <span style="color:#e6db74">"echo 100m > memory.limit_in_bytes"</span> <span style="color:#75715e"># Set max memory usage to 100 MB</span>
</span></span><span style="display:flex;"><span>sudo sh -c <span style="color:#e6db74">"echo </span>$$<span style="color:#e6db74"> > tasks"</span> <span style="color:#75715e"># Move current process to cgroup</span>
</span></span><span style="display:flex;"><span>stress --vm-bytes 200m --vm-keep -m <span style="color:#ae81ff">1</span>
</span></span></code></pre></div></li>
</ul>
<h4 id="observation">Observation</h4>
<p>This time the stress process’s memory usage is capped at the configured 100 MB limit.</p>
Distributed Transactions
https://noneback.github.io/blog/%E5%88%86%E5%B8%83%E5%BC%8F%E4%BA%8B%E5%8A%A1/
Thu, 20 May 2021 23:55:11 +0800https://noneback.github.io/blog/%E5%88%86%E5%B8%83%E5%BC%8F%E4%BA%8B%E5%8A%A1/<h1 id="transactions-and-distributed-transactions">Transactions and Distributed Transactions</h1>
<h2 id="transactions">Transactions</h2>
<p>A <strong>transaction</strong> is a logical unit of work in a database, composed of a finite sequence of database operations. The database must ensure the <strong>atomicity</strong> of transaction operations: when a transaction is successful, it means that all operations in the transaction have been fully executed; if the transaction fails, all executed SQL operations are rolled back.</p>
<p>A single-node database transaction has four main properties:</p>
<ul>
<li><strong>Atomicity</strong>: The transaction is executed as a whole. Either all operations within the transaction are executed, or none are executed.</li>
<li><strong>Consistency</strong>: The transaction must ensure that the database moves from one consistent state to another. Consistent states mean that the data in the database must satisfy all integrity constraints.</li>
<li><strong>Isolation</strong>: When multiple transactions are executed concurrently, the execution of one transaction should not affect the execution of others.</li>
<li><strong>Durability</strong>: Changes made by a committed transaction should be permanently stored in the database.</li>
</ul>
<h2 id="distributed-transactions">Distributed Transactions</h2>
<p>A <strong>distributed transaction</strong> is a transaction where the <strong>participants</strong>, <strong>transaction-supporting servers</strong>, <strong>resource servers</strong>, and <strong>transaction manager</strong> are located on different nodes of a distributed system.</p>
<p>With the adoption of microservice architectures, large business domains often involve multiple services, and a business process requires participation from multiple services. In specific business scenarios, data consistency among multiple services must be ensured.</p>
<p>For example, in a large e-commerce system, placing an order typically deducts inventory, applies discounts, and generates an order ID, with ordering, inventory, discounts, and ID generation provided by separate services. The success of the order interface therefore depends not only on local database operations but also on the results of these other systems. Distributed transactions ensure that all of these operations either succeed together or fail together.</p>
<p>In essence, <strong>distributed transactions are used to ensure data consistency across different databases</strong>.</p>
<h1 id="use-cases">Use Cases</h1>
<p>Typical use cases in e-commerce systems include:</p>
<ul>
<li>
<p><strong>Order Inventory Deduction</strong></p>
<p>When placing an order, operations include generating an order record and reducing product inventory. These are handled by separate microservices, so distributed transactions are required to ensure the atomicity of the order operation.</p>
</li>
<li>
<p><strong>Third-Party Payments</strong></p>
<p>In a microservice architecture, payment and orders are independent services. The order payment status depends on a notification from the financial service, which, in turn, depends on notifications from a third-party payment service.</p>
<p>A classic scenario is illustrated below:</p>
<p><img alt="https://xiaomi-info.github.io/2020/01/02/distributed-transaction/notify-message.png" src="https://xiaomi-info.github.io/2020/01/02/distributed-transaction/notify-message.png"></p>
<p>From the diagram, there are two calls: the third-party payment service calling the payment service, and the payment service calling the order service. Both calls can encounter <strong>timeouts</strong>. Without distributed transactions, the actual payment status and the final payment status visible to the user may become <strong>inconsistent</strong>.</p>
</li>
</ul>
<h1 id="implementation-approaches">Implementation Approaches</h1>
<h2 id="two-phase-commit-2pc">Two-Phase Commit (2PC)</h2>
<p><img alt="https://i.loli.net/2021/05/19/MfWzxseBFKaAnhk.png" src="https://i.loli.net/2021/05/19/MfWzxseBFKaAnhk.png"></p>
<p>A transaction commit is divided into two phases:</p>
<ol>
<li>
<p><strong>Preparation Phase</strong>:</p>
<ul>
<li>The transaction manager (TM) initiates the transaction, logs the start of the transaction, and asks the participating resource managers (RMs) whether they can execute the commit operation, then waits for their responses.</li>
<li>RMs execute local transactions, log redo/undo data, and return results to TM, but do not commit.</li>
</ul>
</li>
<li>
<p><strong>Commit/Rollback Phase</strong>:</p>
<ul>
<li>If all participating RMs execute successfully, the transaction proceeds to the <strong>commit phase</strong>:
<ul>
<li>TM logs the commit, sends a commit instruction to all RMs.</li>
<li>RMs commit the local transaction and respond to TM.</li>
<li>TM logs the end of the transaction.</li>
</ul>
</li>
<li>If any RM fails or times out during preparation or commit:
<ul>
<li>TM logs the rollback, sends rollback instructions to all RMs.</li>
<li>RMs rollback the local transaction and respond to TM.</li>
<li>TM logs the end of the transaction.</li>
</ul>
</li>
</ul>
</li>
</ol>
<h3 id="characteristics">Characteristics</h3>
<ul>
<li><strong>Atomicity</strong>: Supported</li>
<li><strong>Consistency</strong>: Strong consistency</li>
<li><strong>Isolation</strong>: Supported</li>
<li><strong>Durability</strong>: Supported</li>
</ul>
<h3 id="disadvantages">Disadvantages</h3>
<ul>
<li><strong>Synchronous Blocking</strong>: When participants occupy shared resources, others can only wait for resource release, leading to blocking.</li>
<li><strong>Single Point of Failure</strong>: If the transaction manager fails, the entire system becomes unavailable.</li>
<li><strong>Data Inconsistency</strong>: If the transaction manager only sends some commit messages, and a network issue occurs, only some participants receive the commit message, leading to inconsistency.</li>
<li><strong>Uncertainty</strong>: If the transaction manager crashes together with a participant after sending a commit message, the surviving participants cannot tell whether that participant committed, so the transaction outcome remains undecidable until recovery.</li>
</ul>
<h2 id="local-message-table">Local Message Table</h2>
<p>The transaction initiator maintains a <strong>local message table</strong>, and operations on the business table and the message table are within the same local transaction. Asynchronously, a <strong>scheduled task</strong> scans the message table and delivers the message downstream.</p>
<p>The broad concept of the local message table also allows downstream notification through methods other than message delivery, such as RPC calls.</p>
<p><img alt="https://i.loli.net/2021/05/19/tmNeiALsdof24PW.png" src="https://i.loli.net/2021/05/19/tmNeiALsdof24PW.png"></p>
<ol>
<li>The initiator executes a local transaction, operating both the business table and the local message table.</li>
<li>A scheduled task scans pending local messages (in the message table) and sends them to the message queue:
<ul>
<li>If successful, mark the local message as sent.</li>
<li>If failed, retry until successful.</li>
</ul>
</li>
<li>The message queue delivers the message downstream.</li>
<li>The downstream transaction participant receives the message and executes a local transaction:
<ul>
<li>If failed, no ACK is returned, and the message queue retries.</li>
<li>If successful, an ACK is returned, marking the end of the global transaction.</li>
<li>If the message or ACK is lost, the message queue retries.</li>
</ul>
</li>
</ol>
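<p>The scheduled-task step can be sketched with a tiny in-memory model. This is an illustrative sketch only: <code>messageRow</code>, the status strings, and the <code>publish</code> callback are assumed names, not a real library API.</p>

```go
package main

import "fmt"

// messageRow models one row of the local message table ("outbox").
type messageRow struct {
	id     int
	body   string
	status string // "pending" or "sent"
}

// scanAndDeliver is what the scheduled task does on each tick: pick
// pending rows, try to publish each one, and mark it sent only on
// success. Failed rows stay "pending" and are retried next tick,
// which is why the downstream consumer must be idempotent.
func scanAndDeliver(table []messageRow, publish func(string) error) []messageRow {
	for i, m := range table {
		if m.status != "pending" {
			continue
		}
		if err := publish(m.body); err == nil {
			table[i].status = "sent"
		}
	}
	return table
}

func main() {
	table := []messageRow{
		{1, "order-created", "pending"},
		{2, "order-paid", "pending"},
	}
	table = scanAndDeliver(table, func(body string) error {
		fmt.Println("published:", body)
		return nil
	})
	fmt.Println(table[0].status, table[1].status) // sent sent
}
```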
<h3 id="exceptional-scenarios">Exceptional Scenarios</h3>
<ul>
<li><strong>Message Loss</strong>: Handled by repeating the scheduled task.</li>
<li><strong>Delivery Failure</strong>: Handled by retries, downstream must ensure idempotency.</li>
<li><strong>ACK Loss</strong>: Handled by retries, downstream must ensure idempotency.</li>
</ul>
<h3 id="advantages-and-challenges">Advantages and Challenges</h3>
<p><strong>Advantages</strong>:</p>
<ul>
<li>High throughput: downstream transactions run asynchronously, decoupled from the initiator by messaging middleware.</li>
<li>Moderate intrusion into business code: the initiator needs a local message table and a scheduled task.</li>
</ul>
<p><strong>Challenges</strong>:</p>
<ul>
<li>Incomplete transaction semantics: downstream transactions cannot be rolled back, only retried until they succeed.</li>
</ul>
<h3 id="characteristics-1">Characteristics</h3>
<ul>
<li><strong>Atomicity</strong>: Supported</li>
<li><strong>Consistency</strong>: Eventual consistency</li>
<li><strong>Isolation</strong>: Not supported (committed branch transactions are visible to other transactions)</li>
<li><strong>Durability</strong>: Supported</li>
</ul>
<h2 id="best-effort-notification">Best-Effort Notification</h2>
<p>Best-effort notification is a simple form of flexible transaction, suitable for business scenarios that are not time-sensitive about reaching eventual consistency and where the passive party&rsquo;s result does not affect the initiator&rsquo;s result.</p>
<p>This approach roughly works as follows:</p>
<ol>
<li>System A completes its local transaction and sends a message to the MQ.</li>
<li>A service consumes the MQ and calls System B’s interface.</li>
<li>If System B succeeds, everything is fine; if it fails, the notification service periodically retries calling System B up to N times. If it still fails, it gives up.</li>
</ol>
<h3 id="advantages-and-challenges-1">Advantages and Challenges</h3>
<p><strong>Advantages</strong>:</p>
<ul>
<li>Simple implementation.</li>
</ul>
<p><strong>Challenges</strong>:</p>
<ul>
<li>No compensation mechanism and no guarantee of delivery.</li>
<li>Downstream interfaces must be idempotent; consistency and atomicity have to be ensured by additional interface design.</li>
</ul>
<h3 id="characteristics-2">Characteristics</h3>
<ul>
<li><strong>Atomicity</strong>: Not supported (requires additional interfaces)</li>
<li><strong>Consistency</strong>: Not supported (requires additional interfaces)</li>
<li><strong>Isolation</strong>: Not supported (committed branch transactions are visible to other transactions)</li>
<li><strong>Durability</strong>: Supported</li>
</ul>
<h3 id="classic-scenario">Classic Scenario</h3>
<p><strong>Payment Callback</strong>:</p>
<p>The payment service receives a successful payment notification from a third-party service, updates the payment status of the order, and synchronously notifies the order service. If this synchronous notification fails, an asynchronous script will keep retrying the order service interface.</p>
<p><img alt="https://xiaomi-info.github.io/2020/01/02/distributed-transaction/try-best-notify.jpg" src="https://xiaomi-info.github.io/2020/01/02/distributed-transaction/try-best-notify.jpg"></p>
<h2 id="references">References</h2>
<p><a href="https://xiaomi-info.github.io/2020/01/02/distributed-transaction/">Distributed Transactions: All You Need to Know</a></p>
CPU False Sharing
https://noneback.github.io/blog/cpu%E4%BC%AA%E5%85%B1%E4%BA%AB/
Sun, 02 May 2021 13:47:30 +0800https://noneback.github.io/blog/cpu%E4%BC%AA%E5%85%B1%E4%BA%AB/<p>The motivation for this post comes from an interview question I was asked: What is CPU false sharing?</p>
<h2 id="cpu-cache">CPU Cache</h2>
<p>Let’s start by discussing CPU cache.</p>
<p>CPU cache is a type of storage medium introduced to bridge the speed gap between the CPU and main memory. In the pyramid-shaped storage hierarchy, it is located just below CPU registers. Its capacity is much smaller than that of main memory, but its speed can be close to the processor’s frequency.</p>
<p>The effectiveness of caching relies on the principle of temporal and spatial locality.</p>
<p>When the processor issues a memory access request, it first checks if the requested data is in the cache. If it is (a cache hit), it directly returns the data without accessing main memory. If it isn’t (a cache miss), it loads the data from main memory into the cache before returning it to the processor.</p>
<h2 id="cpu-cache-architecture">CPU Cache Architecture</h2>
<p>There are usually three levels of cache between the CPU and main memory. The closer the cache is to the CPU, the faster it is but the smaller its capacity. When accessing data, the CPU first checks <strong>L1</strong>, then <strong>L2</strong>, and finally <strong>L3</strong>. If the data isn’t in any of these caches, it must be fetched from main memory.</p>
<p><img alt="Cache Architecture" src="https://i.loli.net/2021/05/12/CSi7FqmcUZk2LTH.png"></p>
<ul>
<li><strong>L1</strong> is close to the CPU core that uses it. L1 and L2 caches can only be used by a single CPU core.</li>
<li><strong>L3</strong> can be shared by all CPU cores in a socket.</li>
</ul>
<h2 id="cpu-cache-line">CPU Cache Line</h2>
<p>Caches operate on the basis of <strong>cache lines</strong>, which are the smallest unit of data transfer between the cache and main memory, typically 64 bytes. A cache line effectively references a block of memory (64 bytes).</p>
<p>Loading a cache line has the advantage that if the required data is located close to each other, it can be accessed without reloading the cache.</p>
<p>However, it can also lead to a problem known as <strong>CPU false sharing</strong>.</p>
<h2 id="cpu-false-sharing">CPU False Sharing</h2>
<p>Consider this scenario:</p>
<ul>
<li>We have a <code>long</code> variable <code>a</code>, which is not part of an array but is a standalone variable, and there’s another <code>long</code> variable <code>b</code> right next to it. When <code>a</code> is loaded, <code>b</code> is also loaded into the cache line for free.</li>
<li>Now, a thread on one CPU core modifies <code>a</code>, while another thread on a different CPU core reads <code>b</code>.</li>
<li>When <code>a</code> is modified, the cache line holding both <code>a</code> and <code>b</code> is loaded into the modifying core&rsquo;s cache, and after the update the cache coherence protocol invalidates every other core&rsquo;s copy of that line, since those copies no longer hold the latest value of <code>a</code>.</li>
<li>When the other core reads <code>b</code>, it finds that the cache line is invalid and must reload it from main memory.</li>
</ul>
<p>Because the cache operates at the level of cache lines, invalidating <code>a</code>’s cache line also invalidates <code>b</code>, and vice versa.</p>
<p><img alt="False Sharing" src="https://pic3.zhimg.com/80/v2-32672c4b2b7fc48437fc951c27497bee_1440w.jpg"></p>
<p>This causes a problem:</p>
<p><code>b</code> and <code>a</code> are completely unrelated, but each time <code>a</code> is updated, <code>b</code> has to be reloaded from main memory due to a cache miss, slowing down the process.</p>
<p><strong>CPU false sharing</strong>: When multiple threads modify independent variables that share the same cache line, they unintentionally affect each other’s performance. This is known as false sharing.</p>
<h2 id="avoiding-cpu-false-sharing">Avoiding CPU False Sharing</h2>
<ul>
<li>Avoid placing independent, frequently written variables adjacently in memory (e.g., pad them onto separate cache lines).</li>
<li>Align variables during compilation to avoid false sharing. See <a href="https://zh.wikipedia.org/wiki/%E6%95%B0%E6%8D%AE%E7%BB%93%E6%9E%84%E5%AF%B9%E9%BD%90">data structure alignment</a>.</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li><a href="https://zhuanlan.zhihu.com/p/65394173">Discussion: What is CPU False Sharing</a></li>
<li><a href="https://en.wikipedia.org/wiki/CPU_cache">Wikipedia - CPU Cache</a></li>
</ul>
MySQL Index Overview
https://noneback.github.io/blog/mysql%E7%B4%A2%E5%BC%95%E6%B5%85%E6%9E%90/
Sun, 21 Mar 2021 20:41:33 +0800https://noneback.github.io/blog/mysql%E7%B4%A2%E5%BC%95%E6%B5%85%E6%9E%90/<p><strong>Database indexes</strong> are sorted data structures in DBMS that help in quickly querying and updating data in a database. Generally, data structures used for building indexes include B-trees, B+ trees, hash tables, etc.</p>
<p>MySQL uses B+ trees to build indexes. A B+ tree node can hold many keys, and only leaf nodes store data while non-leaf nodes store only keys, so the tree stays shallow and most of the index can be kept in memory. This minimizes disk I/O when traversing the index and greatly improves query efficiency.</p>
<h2 id="indexes-in-innodb">Indexes in InnoDB</h2>
<h3 id="clustered-index-and-non-clustered-index">Clustered Index and Non-Clustered Index</h3>
<p>Indexes can be divided into clustered and non-clustered indexes based on the data stored in the leaf nodes.</p>
<ul>
<li><strong>Clustered Index</strong>: The leaf nodes store the data rows directly, allowing direct access to user data.</li>
<li><strong>Non-Clustered Index</strong>: The leaf nodes store the primary key value, and data must be fetched by traversing back to the primary key index (a process known as <strong>index backtracking</strong>).</li>
</ul>
<p>In the InnoDB engine, the table’s data is organized using the primary key index. Each table must have a primary key, which constructs the B+ tree, resulting in a primary key index. <strong>The primary key index is a clustered index</strong>, and all other <strong>secondary indexes are non-clustered indexes</strong>.</p>
<h3 id="composite-index">Composite Index</h3>
<p>A composite index is an index composed of multiple fields.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#66d9ef">create</span> <span style="color:#66d9ef">index</span> index_name <span style="color:#66d9ef">on</span> <span style="color:#66d9ef">table_name</span> (col_1, col_2...)
</span></span></code></pre></div><p>Compared to a single-field index, the main difference is that it follows the <strong>leftmost prefix matching principle</strong>.</p>
<blockquote>
<p><strong>Leftmost Prefix Matching Principle</strong>: When using a composite index, the index values are sorted according to the fields in the index from left to right.</p>
</blockquote>
<h2 id="using-indexes-to-optimize-query-performance">Using Indexes to Optimize Query Performance</h2>
<p>Since indexes are ordered, they can significantly improve query efficiency. When using indexes for query optimization, some principles must be followed.</p>
<h3 id="leftmost-prefix-matching-principle">Leftmost Prefix Matching Principle</h3>
<p>When using a composite index, the index entries are sorted by the indexed fields from left to right. Queries must follow the leftmost prefix matching rule; otherwise the index cannot be used and MySQL falls back to a full table scan.</p>
<blockquote>
<p>Suppose you create an index on <code>col1, col2, col3</code>. Following the leftmost prefix matching principle, the query conditions should be designed in the order <code>col1 -> col2 -> col3</code>.</p>
<p>Example:</p>
<p><code>select * from table_name where col1 = 1 and col2 = 2;</code> This will use the index.</p>
<p><code>select * from table_name where col2 = 1 and col3 = 2;</code> This will not use the index.</p>
<p>Note: <strong>MySQL will continue matching the columns until it encounters a range query (>, <, between, like), after which it stops matching.</strong></p>
</blockquote>
<h3 id="index-coverage-principle">Index Coverage Principle</h3>
<p>Index coverage refers to querying values directly from the index without needing to traverse back to the table. Well-designed indexes can reduce the number of backtracking operations.</p>
<blockquote>
<p>For a composite index <code>(col1, col2, col3)</code>:</p>
<p>A query like <code>select col1, col2, col3 from test where col1=1 and col2=2</code> can directly retrieve values for <code>col1</code>, <code>col2</code>, and <code>col3</code> without needing to traverse back to the table, as their values are already stored in the secondary index.</p>
</blockquote>
HTTPS Introduction
https://noneback.github.io/blog/https%E6%B5%85%E6%9E%90/
Sun, 21 Feb 2021 16:48:55 +0800https://noneback.github.io/blog/https%E6%B5%85%E6%9E%90/<p>HTTPS (HTTP over SSL) was introduced to address the security vulnerabilities of HTTP, such as eavesdropping and identity spoofing. It uses <a href="https://developer.mozilla.org/en-US/docs/Glossary/SSL">SSL</a> or <a href="https://developer.mozilla.org/en-US/docs/Glossary/TLS">TLS</a> to encrypt communication between the client and the server.</p>
<h2 id="problems-with-http">Problems with HTTP</h2>
<ul>
<li>Communication uses plain text, making it susceptible to eavesdropping.</li>
<li>Unable to verify the identity of the communication party, making it vulnerable to impersonation (and leaving the server unable to filter malicious clients, e.g., in Denial of Service attacks).</li>
<li>Cannot guarantee message integrity, making it possible for messages to be altered (e.g., Man-in-the-Middle attacks).</li>
</ul>
<p>To address these issues, we need:</p>
<ul>
<li><strong>Encryption</strong> to prevent eavesdropping.
<ul>
<li>Encrypting either the <strong>content</strong> or the <strong>communication channel</strong> can help secure the communication.</li>
</ul>
</li>
<li><strong>Authentication</strong> to prevent impersonation attacks.
<ul>
<li>Certificates are commonly used for identity verification.</li>
</ul>
</li>
<li><strong>Integrity checks</strong> to prevent tampering.
<ul>
<li>Hash functions like MD5 and SHA-1 are often used to ensure data integrity.</li>
</ul>
</li>
</ul>
<h2 id="https">HTTPS</h2>
<p>To solve the above problems comprehensively, we add encryption, authentication, and integrity protection to HTTP, resulting in <strong>HTTPS</strong>.</p>
<p>HTTP + Encryption + Authentication + Integrity Protection = HTTPS</p>
<h3 id="https-over-ssl">HTTPS over SSL</h3>
<p>HTTPS is not a new protocol; it simply adds SSL or TLS between HTTP and TCP. By doing so, HTTPS provides encryption, certificates, and integrity protection.</p>
<p><img alt="HTTPS Layers" src="https://i.loli.net/2021/02/21/cdQk9AGJUCF4MLI.png"></p>
<blockquote>
<p>SSL is independent of HTTP, and it can be used with other protocols like SMTP and Telnet to provide encryption.</p>
</blockquote>
<h3 id="encryption-mechanism">Encryption Mechanism</h3>
<p>HTTPS uses both symmetric (shared key) and asymmetric (public key) encryption to achieve its goals effectively:</p>
<ul>
<li><strong>Public key encryption</strong> is used to encrypt the <strong>shared key</strong> (Pre-master secret), ensuring it cannot be intercepted.</li>
<li>Once the shared key is established, <strong>symmetric encryption</strong> is used for communication to ensure better performance.</li>
</ul>
<blockquote>
<p>Key differences:</p>
<ul>
<li><strong>Public key encryption</strong>: Asymmetric, secure but computationally expensive.</li>
<li><strong>Shared key encryption</strong>: Symmetric, less secure for key exchange but more efficient for encryption.</li>
</ul>
</blockquote>
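<p>The hybrid scheme can be demonstrated with Go&rsquo;s standard crypto packages. This is only the hybrid-encryption core under simplifying assumptions: real TLS adds certificates, key derivation, and a negotiated cipher suite, none of which appear here.</p>

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"fmt"
)

// hybridRoundTrip: RSA (public key) protects a random 32-byte shared
// key, then AES-GCM (symmetric) protects the actual payload.
func hybridRoundTrip(msg []byte) ([]byte, error) {
	// Server key pair; in real HTTPS the public half arrives in a certificate.
	serverKey, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, err
	}

	// Client: random shared key, wrapped with the server's public key
	// (analogous to the pre-master secret).
	sharedKey := make([]byte, 32)
	if _, err := rand.Read(sharedKey); err != nil {
		return nil, err
	}
	wrapped, err := rsa.EncryptOAEP(sha256.New(), rand.Reader, &serverKey.PublicKey, sharedKey, nil)
	if err != nil {
		return nil, err
	}

	// Server: unwrap the shared key with its private key.
	recovered, err := rsa.DecryptOAEP(sha256.New(), rand.Reader, serverKey, wrapped, nil)
	if err != nil {
		return nil, err
	}

	// Both sides now use cheap symmetric encryption for the payload.
	block, err := aes.NewCipher(recovered)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	rand.Read(nonce)
	ciphertext := gcm.Seal(nil, nonce, msg, nil)
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	out, _ := hybridRoundTrip([]byte("GET / HTTP/1.1"))
	fmt.Println(string(out)) // GET / HTTP/1.1
}
```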
<h3 id="authentication-mechanism">Authentication Mechanism</h3>
<p>Public key encryption requires proof that the public key itself is legitimate and not replaced. HTTPS uses <strong>certificates</strong> to achieve this authentication.</p>
<p>Certificates are issued by <strong>Certificate Authorities (CAs)</strong>, who verify the identity of the party requesting the certificate and sign the public key.</p>
<ul>
<li>The server sends its CA-signed public key certificate to the client during the handshake.</li>
<li>The client uses the CA’s public key to verify the signature. If verified, it proves:
<ul>
<li>The CA is trustworthy.</li>
<li>The server’s public key is legitimate.</li>
</ul>
</li>
<li>Both parties then use the server&rsquo;s public key to exchange a shared key and establish a secure channel.</li>
</ul>
<blockquote>
<p>The CA’s public key is usually pre-installed in browsers.</p>
</blockquote>
<h3 id="integrity-protection">Integrity Protection</h3>
<p>HTTPS ensures message integrity by using <strong>message digest algorithms</strong>.</p>
<h4 id="hash-algorithms">Hash Algorithms</h4>
<p>Hash (digest) functions such as MD5 and SHA-2 convert input data of any length into a fixed-length output string.</p>
<blockquote>
<p>A hash algorithm <strong>is not an encryption algorithm</strong>. It cannot be reversed to obtain the original data, so it can only be used for integrity checking.</p>
</blockquote>
<p>Applications attach a <strong>Message Authentication Code (MAC)</strong> to messages. The MAC helps detect tampering, thus ensuring the integrity of the communication.</p>
<h2 id="https-communication-flow">HTTPS Communication Flow</h2>
<p><img alt="HTTPS Communication Flow" src="https://i.loli.net/2021/02/21/BWwJxbpPETst5ug.png"></p>
<h2 id="other-considerations">Other Considerations</h2>
<ul>
<li>Due to the overhead of encryption, decryption, and SSL handshake, HTTPS is generally slower and requires more CPU resources than HTTP.
<ul>
<li>SSL accelerators (dedicated servers) are sometimes used to mitigate this issue.</li>
</ul>
</li>
<li>When a client repeatedly accesses the same HTTPS server, it may not need to perform a complete TLS handshake each time.
<ul>
<li>The server maintains a session ID for each client and uses it to resume secure sessions, avoiding a full handshake.</li>
</ul>
</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li><a href="https://developer.mozilla.org/zh-CN/docs/Glossary/https">MDN Web Docs</a></li>
<li><a href="https://zh.wikipedia.org/wiki/%E8%B6%85%E6%96%87%E6%9C%AC%E4%BC%A0%E8%BE%93%E5%AE%89%E5%85%A8%E5%8D%8F%E8%AE%AE#%E4%B8%8EHTTP%E7%9A%84%E5%B7%AE%E5%BC%82">Wikipedia</a></li>
<li><a href="https://www.cnblogs.com/cxuanBlog/p/12490862.html">Cxuan’s Blog</a></li>
<li><a href="https://www.oreilly.com/library/view/illustrated-httphttps/9781492031484/">“Illustrated HTTP/HTTPS”</a></li>
</ul>
MIT6.824-MapReduce
https://noneback.github.io/blog/mit6.824-mapreduce/
Fri, 22 Jan 2021 17:02:44 +0800https://noneback.github.io/blog/mit6.824-mapreduce/<p>The third year of university has been quite intense, leaving me with little time to continue my studies on 6.824, so my progress stalled at Lab 1. With a bit more free time during the winter break, I decided to continue. Each paper or experiment will be recorded in this article.</p>
<p>This is the first chapter of my Distributed System study notes.</p>
<hr>
<h2 id="about-the-paper">About the Paper</h2>
<p>The core content of the paper is the proposed MapReduce distributed computing model and the approach to implementing the <strong>Distributed</strong> MapReduce System, including the Master data structure, fault tolerance, and some refinements.</p>
<h3 id="mapreduce-computing-model">MapReduce Computing Model</h3>
<p>The model takes a series of key-value pairs as input and outputs a series of key-value pairs as a result. Users can use the MapReduce System by designing Map and Reduce functions.</p>
<ul>
<li>Map: Takes input data and generates a set of intermediate key-value pairs</li>
<li>Reduce: Takes intermediate key-value pairs as input, combines all data with the same key, and outputs the result.</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-cpp" data-lang="cpp"><span style="display:flex;"><span>map(String key, String value)<span style="color:#f92672">:</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// key: document name
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#75715e">// value: document contents
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">for</span> each word w in value:
</span></span><span style="display:flex;"><span> EmitIntermediate(w, <span style="color:#e6db74">"1"</span>);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>reduce(String key, Iterator values)<span style="color:#f92672">:</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// key: a word
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#75715e">// values: a list of counts
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#66d9ef">int</span> result <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> each v in values:
</span></span><span style="display:flex;"><span> result <span style="color:#f92672">+=</span> ParseInt(v);
</span></span><span style="display:flex;"><span> Emit(AsString(result));
</span></span></code></pre></div><h3 id="mapreduce-execution-process">MapReduce Execution Process</h3>
<p>The Distributed MapReduce System adopts a master-slave design. During the MapReduce computation, there is generally one Master and several Workers.</p>
<ul>
<li>Master: Responsible for creating, assigning, and scheduling Map and Reduce tasks</li>
<li>Worker: Responsible for executing Map and Reduce tasks</li>
</ul>
<p><img alt="Screenshot_20210112_125637" src="https://i.loli.net/2021/01/12/UK8yJRHc5DzMg3u.png"></p>
<p>A more detailed description is as follows:</p>
<ol>
<li>
<p>The entire MapReduce execution process includes M Map Tasks and R Reduce Tasks, divided into two phases: Map Phase and Reduce Phase.</p>
</li>
<li>
<p>The input file is split into M splits, and the computation enters the Map Phase. The Master assigns Map Tasks to idle Workers. The assigned Worker reads the corresponding split data and executes the Task. When all Map Tasks are completed, the Map Phase ends. The Partition function (generally <code>hash(key) mod R</code>) is used to generate R sets of intermediate key-value pairs, which are stored in files and reported to the Master for subsequent Reduce Task operations.</p>
</li>
<li>
<p>The computation enters the Reduce Phase. The Master assigns Reduce Tasks, and each Worker reads the corresponding intermediate key-value file and executes the Task. Once all Reduce tasks are completed, the computation is finished, and the results are stored in result files.</p>
</li>
</ol>
<h3 id="mapreduce-fault-tolerance-mechanism">MapReduce Fault Tolerance Mechanism</h3>
<p>Since Google MapReduce heavily relies on the distributed atomic file read/write operations provided by Google File System, the fault tolerance mechanism of the MapReduce cluster is much simpler and primarily focuses on recovering from unexpected task interruptions.</p>
<h4 id="worker-fault-tolerance">Worker Fault Tolerance</h4>
<p>In the cluster, the Master periodically sends Ping signals to each Worker. If a Worker does not respond for a period of time, the Master considers the Worker unavailable.</p>
<p>Any Map task assigned to that Worker, whether running or completed, must be reassigned by the Master to another Worker, as the Worker being unavailable also means the intermediate results stored on that Worker’s local disk are no longer available. The Master will also notify all Reducers about the retry, and Reducers that fail to obtain complete intermediate results from the original Mapper will start fetching data from the new Mapper.</p>
<p>If a Reduce task is assigned to that Worker, the Master will select any unfinished Reduce tasks and reassign them to other Workers. Since the results of completed Reduce tasks are stored in Google File System, the availability of these results is ensured by Google File System, and the MapReduce Master only needs to handle unfinished Reduce tasks.</p>
<p>If there is a Worker in the cluster that takes an unusually long time to complete the last few Map or Reduce tasks, the entire MapReduce computation time will be prolonged, and such a Worker becomes a straggler.</p>
<p>To mitigate this, once the MapReduce computation is close to completion, the Master schedules backup executions of the remaining in-progress tasks on idle Workers; a task is considered completed as soon as either the original or the backup execution finishes.</p>
<h4 id="master-fault-tolerance">Master Fault Tolerance</h4>
<p>There is only one Master node in the entire MapReduce cluster, so Master failures are relatively rare.</p>
<p>During operation, the Master node periodically saves the current state of the cluster as a checkpoint to disk. After the Master process terminates, a restarted Master process can use the data stored on disk to recover to the state of the last checkpoint.</p>
<h3 id="refinement">Refinement</h3>
<h4 id="partition-function">Partition Function</h4>
<p>Used during the Map Phase to assign intermediate key-value pairs to R files according to certain rules.</p>
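<p>The default <code>hash(key) mod R</code> partitioner from the paper is a few lines in Go (the FNV hash here is an arbitrary choice; any deterministic hash works):</p>

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partition maps an intermediate key to one of nReduce buckets.
// Because the hash is deterministic, every pair with the same key
// lands in the same bucket, so one Reduce task sees all of its values.
func partition(key string, nReduce int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(nReduce))
}

func main() {
	fmt.Println(partition("apple", 10) == partition("apple", 10)) // true
	fmt.Println(partition("apple", 10))
}
```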
<h4 id="combiner">Combiner</h4>
<p>In some situations, the user-defined Map task may generate a large number of duplicate intermediate keys. The Combiner function performs a partial merge of the intermediate results to reduce the amount of data that needs to be transmitted between Mapper and Reducer.</p>
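<p>For the word-count example above, a combiner can collapse the many <code>("word", "1")</code> pairs for a hot key into a single count before anything crosses the network. A minimal sketch:</p>

```go
package main

import "fmt"

type KeyValue struct {
	Key, Value string
}

// combine performs the partial merge on the Mapper side: since every
// pair carries the value "1", counting occurrences of each key is
// equivalent to summing, so the Reducer receives one pair per key.
func combine(kvs []KeyValue) map[string]int {
	counts := map[string]int{}
	for _, kv := range kvs {
		counts[kv.Key]++
	}
	return counts
}

func main() {
	kvs := []KeyValue{{"the", "1"}, {"the", "1"}, {"cat", "1"}}
	fmt.Println(combine(kvs)) // map[cat:1 the:2]
}
```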
<h2 id="experiment">Experiment</h2>
<p>The experiment involves designing and implementing the Master and Worker to complete the main functionality of a Simple MapReduce System.</p>
<p>In the experiment, the single Master and multiple Worker model was implemented through RPC calls, and different applications were formed by running Map and Reduce functions via Go Plugins.</p>
<h3 id="master--worker-functionality">Master & Worker Functionality</h3>
<h4 id="master">Master</h4>
<ul>
<li>Task creation and scheduling</li>
<li>Worker registration and task assignment</li>
<li>Receiving the current state of the Worker</li>
<li>Monitoring task status</li>
</ul>
<h4 id="worker">Worker</h4>
<ul>
<li>Registering with the Master</li>
<li>Getting tasks and processing them</li>
<li>Reporting status</li>
</ul>
<blockquote>
<p>Note: The Master provides corresponding functions to Workers via RPC calls</p>
</blockquote>
<h3 id="main-data-structures">Main Data Structures</h3>
<p>Designing the data structures is the main task; a good design makes the functionality straightforward to implement. The relevant code is shown here; for the full implementation, see <a href="https://github.com/noneback/Toys/tree/master/6.824-Lab1-MapReduce">GitHub</a>.</p>
<h4 id="master-1">Master</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">Master</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Your definitions here.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span> <span style="color:#a6e22e">nReduce</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">taskQueue</span> <span style="color:#66d9ef">chan</span> <span style="color:#a6e22e">Task</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">tasksContext</span> []<span style="color:#a6e22e">TaskContext</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">lock</span> <span style="color:#a6e22e">sync</span>.<span style="color:#a6e22e">Mutex</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">files</span> []<span style="color:#66d9ef">string</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">phase</span> <span style="color:#a6e22e">PhaseKind</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">done</span> <span style="color:#66d9ef">bool</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">workerID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="worker-1">Worker</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">worker</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">ID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">mapf</span> <span style="color:#66d9ef">func</span>(<span style="color:#66d9ef">string</span>, <span style="color:#66d9ef">string</span>) []<span style="color:#a6e22e">KeyValue</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">reducef</span> <span style="color:#66d9ef">func</span>(<span style="color:#66d9ef">string</span>, []<span style="color:#66d9ef">string</span>) <span style="color:#66d9ef">string</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">nReduce</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">nMap</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="task--taskcontext">Task & TaskContext</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">Task</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">ID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">Filename</span> <span style="color:#66d9ef">string</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">Phase</span> <span style="color:#a6e22e">PhaseKind</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">TaskContext</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">t</span> <span style="color:#f92672">*</span><span style="color:#a6e22e">Task</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">state</span> <span style="color:#a6e22e">ContextState</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">workerID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">startTime</span> <span style="color:#a6e22e">time</span>.<span style="color:#a6e22e">Time</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="rpc-args--reply">Rpc Args & Reply</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">RegTaskArgs</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">WorkerID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">RegTaskReply</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">T</span> <span style="color:#a6e22e">Task</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">HasT</span> <span style="color:#66d9ef">bool</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">ReportTaskArgs</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">WorkerID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">TaskID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">State</span> <span style="color:#a6e22e">ContextState</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">ReportTaskReply</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">RegWorkerArgs</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">RegWorkerReply</span> <span style="color:#66d9ef">struct</span> {
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">ID</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">NReduce</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">NMap</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="constant--type">Constant & Type</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span><span style="color:#66d9ef">const</span> (
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">RUNNING</span> <span style="color:#a6e22e">ContextState</span> = <span style="color:#66d9ef">iota</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">FAILED</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">READY</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">IDEL</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">COMPLETE</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> (
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">MAX_PROCESSING_TIME</span> = <span style="color:#a6e22e">time</span>.<span style="color:#a6e22e">Second</span> <span style="color:#f92672">*</span> <span style="color:#ae81ff">5</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">SCHEDULE_INTERVAL</span> = <span style="color:#a6e22e">time</span>.<span style="color:#a6e22e">Second</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> (
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">MAP</span> <span style="color:#a6e22e">PhaseKind</span> = <span style="color:#66d9ef">iota</span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">REDUCE</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">ContextState</span> <span style="color:#66d9ef">int</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">type</span> <span style="color:#a6e22e">PhaseKind</span> <span style="color:#66d9ef">int</span>
</span></span></code></pre></div><h3 id="running-and-testing">Running and Testing</h3>
<h4 id="running">Running</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># In main directory</span>
</span></span><span style="display:flex;"><span>cd ./src/main
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Master</span>
</span></span><span style="display:flex;"><span>go run ./mrmaster.go pg*.txt
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Worker</span>
</span></span><span style="display:flex;"><span>go build -buildmode<span style="color:#f92672">=</span>plugin ../mrapps/wc.go <span style="color:#f92672">&&</span> go run ./mrworker.go ./wc.so
</span></span></code></pre></div><h4 id="testing">Testing</h4>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>cd ./src/main
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>sh ./test-mr.sh
</span></span></code></pre></div><h2 id="optimization">Optimization</h2>
<p>The following are design improvements I identified while reviewing my code after completing the lab.</p>
<h3 id="hotspot-issue">Hotspot Issue</h3>
<p>The hotspot issue arises when a few data items dominate the dataset: one key then accounts for a disproportionate share of the intermediate key-value pairs produced during the Map phase, concentrating disk IO and network IO on a few machines during the shuffle step.</p>
<blockquote>
<p>The essence of this issue is that the Shuffle step in MapReduce is highly dependent on the data.</p>
<p>The <strong>design purpose</strong> of Shuffle is to aggregate intermediate results to facilitate processing during the Reduce phase. Consequently, if the data is extremely unbalanced, hotspot issues will naturally arise.</p>
</blockquote>
<p>In essence, the core problem is that a huge number of records sharing one key are hashed into a single partition file, which then becomes the input of a single Reduce task.</p>
<p>The hash value for the same key is always identical, so the question becomes: <em>How can we route the same key to different machines?</em></p>
<p>The solution I came up with is to add a random salt to the key in the Shuffle’s hash calculation so that the hash values differ, reducing the probability of all of a key’s records landing on the same machine and mitigating the hotspot.</p>
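<p>A minimal sketch of the salting idea (hypothetical helper names, not part of the lab code): append a random salt to the key before hashing, so a hot key’s records spread over several reduce partitions.</p>
<pre tabindex="0"><code class="language-go" data-lang="go">package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// saltedReduceIdx appends a random salt to a hot key before hashing,
// so the key's records are spread across several reduce partitions.
func saltedReduceIdx(key string, nReduce, salts int) int {
	salted := fmt.Sprintf("%s#%d", key, rand.Intn(salts))
	h := fnv.New32a()
	h.Write([]byte(salted))
	return int(h.Sum32()) % nReduce
}

func main() {
	// The same hot key now lands in several partitions.
	seen := map[int]bool{}
	for i := 0; i &lt; 100; i++ {
		seen[saltedReduceIdx("the", 10, 8)] = true
	}
	fmt.Println("distinct partitions used:", len(seen))
}
</code></pre>
<p>The trade-off is an extra merge: each original key now has up to <code>salts</code> partial results that a follow-up aggregation pass must combine.</p>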
<h3 id="fault-tolerance">Fault Tolerance</h3>
<p>The paper already proposes some solutions for fault tolerance. The scenario in question is: a Worker node crashes unexpectedly and reconnects after a reboot. The Master observes the crash and reassigns its tasks to other nodes, but the reconnected Worker continues executing its original tasks, resulting in duplicate result files.</p>
<p>The potential issue here is that these two files may cause incorrect results. Furthermore, the reconnected Worker continuing to execute its original tasks wastes CPU and IO resources.</p>
<p>Based on this, we need to mark newly generated result files so that only the latest files are used as results, resolving the file conflict. Additionally, the Worker should expose an RPC interface that the Master calls when a Worker reconnects, clearing out its stale original tasks.</p>
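<p>One way to realize the “only the latest file is used” rule (a sketch assuming result files live on a local file system; the helper name is made up): each attempt writes to a private temp file and atomically renames it into the final name, so readers never observe a partially written result and the last committed rename wins.</p>
<pre tabindex="0"><code class="language-go" data-lang="go">package main

import (
	"fmt"
	"os"
)

// commitResult writes the task output to a temp file private to this
// attempt, then atomically renames it to the final name. os.Rename
// replaces the target in one step, so no reader ever sees a partial file.
func commitResult(final string, workerID int, data []byte) error {
	tmp := fmt.Sprintf("%s.tmp-%d", final, workerID)
	if err := os.WriteFile(tmp, data, 0644); err != nil {
		return err
	}
	return os.Rename(tmp, final)
}

func main() {
	if err := commitResult("mr-out-0", 42, []byte("hello\n")); err != nil {
		fmt.Println("commit failed:", err)
	}
}
</code></pre>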
<h3 id="straggler-issue">Straggler Issue</h3>
<p>The straggler issue refers to a Task that takes unusually long to complete, delaying the overall MapReduce job. It is essentially a combination of the hotspot issue and Worker-crash handling, and can be addressed with the approaches described in the sections above.</p>
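<p>The paper’s own remedy is backup (speculative) tasks: once a task has run longer than a threshold, the Master hands the same task to a second Worker and accepts whichever finishes first. A sketch of the check a scheduler loop might run (hypothetical names, not the lab code):</p>
<pre tabindex="0"><code class="language-go" data-lang="go">package main

import (
	"fmt"
	"time"
)

// Hypothetical mirror of the TaskContext/MAX_PROCESSING_TIME above.
type taskCtx struct {
	running   bool
	startTime time.Time
}

const maxProcessingTime = 5 * time.Second

// needsBackup reports whether a running task has exceeded the time
// budget and should be speculatively handed to a second Worker.
func needsBackup(t taskCtx, now time.Time) bool {
	return t.running &amp;&amp; now.Sub(t.startTime) &gt; maxProcessingTime
}

func main() {
	stale := taskCtx{running: true, startTime: time.Now().Add(-10 * time.Second)}
	fmt.Println(needsBackup(stale, time.Now())) // true
}
</code></pre>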
<h2 id="references">References</h2>
<p><a href="https://pdos.csail.mit.edu/6.824/index.html">MIT6.824 Distributed System</a></p>
<p><a href="https://pdos.csail.mit.edu/6.824/labs/lab-mr.html">Lab Official Site</a></p>
<p><a href="http://static.googleusercontent.com/media/research.google.com/zh-CN//archive/mapreduce-osdi04.pdf">MapReduce: Simplified Data Processing on Large Clusters</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/34849261">Detailed Explanation of Google MapReduce Paper</a></p>
Chinese Spam Email Classification Based on Naive Bayes
https://noneback.github.io/blog/%E5%9F%BA%E4%BA%8E%E6%9C%B4%E7%B4%A0%E8%B4%9D%E5%8F%B6%E6%96%AF%E7%9A%84%E4%B8%AD%E6%96%87%E5%9E%83%E5%9C%BE%E7%94%B5%E5%AD%90%E9%82%AE%E4%BB%B6%E5%88%86%E7%B1%BB/
Wed, 06 May 2020 00:00:00 +0000https://noneback.github.io/blog/%E5%9F%BA%E4%BA%8E%E6%9C%B4%E7%B4%A0%E8%B4%9D%E5%8F%B6%E6%96%AF%E7%9A%84%E4%B8%AD%E6%96%87%E5%9E%83%E5%9C%BE%E7%94%B5%E5%AD%90%E9%82%AE%E4%BB%B6%E5%88%86%E7%B1%BB/<h1 id="chinese-spam-email-classification-based-on-naive-bayes">Chinese Spam Email Classification Based on Naive Bayes</h1>
<h2 id="training-and-testing-data">Training and Testing Data</h2>
<p>This project primarily uses <a href="https://github.com/shijing888/BayesSpam">open-source data on GitHub</a>.</p>
<h2 id="data-processing">Data Processing</h2>
<p>First, we use regular expressions to filter the content of Chinese emails in the training set, removing all non-Chinese characters. The remaining content is then tokenized using <a href="https://github.com/fxsjy/jieba">jieba</a> for word segmentation, and stopwords are filtered using a Chinese stopword list. The processed results for spam and normal emails are stored separately.</p>
<p>Two dictionaries, <code>spam_voca</code> and <code>normal_voca</code>, are used to store the word frequencies of different terms in different emails. The data processing is then complete.</p>
<h2 id="training-and-prediction">Training and Prediction</h2>
<p>The training and prediction process involves calculating the probability $P(Spam|word_1, word_2, \dots, word_n)$. When this probability exceeds a certain threshold, the email is classified as spam.</p>
<blockquote>
<p>Based on the conditional independence assumption of Naive Bayes, and assuming the prior probability $P(s) = P(s’) = 0.5$, we have:</p>
<p>$P(s|w_1, w_2, \dots, w_n) = \frac{P(s, w_1, w_2, \dots, w_n)}{P(w_1, w_2, \dots, w_n)}$</p>
<p>$= \frac{P(w_1, w_2, \dots, w_n | s) P(s)}{P(w_1, w_2, \dots, w_n)} = \frac{P(w_1, w_2, \dots, w_n | s) P(s)}{P(w_1, w_2, \dots, w_n | s) \cdot P(s) + P(w_1, w_2, \dots, w_n | s’) \cdot P(s’)} $</p>
<p>Since $P(spam) = P(not\ spam)$, we have</p>
<p>$\frac{\prod\limits_{j=1}^n P(w_j | s)}{\prod\limits_{j=1}^n P(w_j | s) + \prod\limits_{j=1}^n P(w_j | s’)}$</p>
<p>Further, using Bayes’ theorem $P(w_j | s) = \frac{P(s | w_j) \cdot P(w_j)}{P(s)}$, the expression becomes</p>
<p>$\frac{\prod\limits_{j=1}^n P(s | w_j)}{\prod\limits_{j=1}^n P(s | w_j) + \prod\limits_{j=1}^n P(s’ | w_j)}$</p>
</blockquote>
<p>Process details:</p>
<ul>
<li>For each email in the test set, perform the same processing, and calculate the top $n$ words with the highest $P(s|w)$. During calculation, if a word appears only in the spam dictionary, set $P(w | s’) = 0.01$; similarly, if a word appears only in the normal dictionary, set $P(w | s) = 0.01$. If the word appears in neither, set $P(s|w) = 0.4$. These assumptions are based on prior research.</li>
<li>Use the 15 most important words for each email and calculate the probability using the above formulas. If the probability is greater than the threshold $\alpha$ (typically set to 0.9), classify it as spam; otherwise, classify it as a normal email.</li>
</ul>
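<p>The combination step above can be checked numerically with a standalone sketch of the same formula (illustrative only; the full implementation appears in <code>utils.py</code> below):</p>
<pre tabindex="0"><code class="language-python" data-lang="python"># Combine per-word probabilities P(s|w_j) into a single spam score,
# following the product formula derived above.
def spam_score(word_probs):
    p_spam = 1.0
    p_not_spam = 1.0
    for p in word_probs:
        p_spam *= p
        p_not_spam *= 1.0 - p
    return p_spam / (p_spam + p_not_spam)

# Three strongly spam-indicative words push the score past the 0.9 threshold.
print(spam_score([0.99, 0.95, 0.9]))
</code></pre>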
<p>You can refer to the code for further details.</p>
<h2 id="results">Results</h2>
<p>By adjusting the number of words used for prediction, the best result for this dataset is:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>Selected <span style="color:#ae81ff">29</span> words: <span style="color:#ae81ff">0.9642857142857143</span>
</span></span></code></pre></div><h2 id="project-structure">Project Structure</h2>
<ul>
<li><strong>data</strong>
<ul>
<li><code>中文停用词表.txt</code> (Chinese stopword list)</li>
<li><code>normal</code> (folder for normal emails)</li>
<li><code>spam</code> (folder for spam emails)</li>
<li><code>test</code> (folder for test emails)</li>
</ul>
</li>
<li><strong>main.py</strong> (main script)</li>
<li><strong>normal_voca.json</strong> (JSON file for normal email vocabulary)</li>
<li><strong>__pycache__</strong> (cache folder)
<ul>
<li><code>utils.cpython-36.pyc</code></li>
</ul>
</li>
<li><strong>spam_voca.json</strong> (JSON file for spam email vocabulary)</li>
<li><strong>utils.py</strong> (utility functions)</li>
</ul>
<h2 id="code">Code</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># utils.py</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> jieba
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> re
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> os
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> collections <span style="color:#f92672">import</span> defaultdict
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>spam_file_num <span style="color:#f92672">=</span> <span style="color:#ae81ff">7775</span>
</span></span><span style="display:flex;"><span>normal_file_num <span style="color:#f92672">=</span> <span style="color:#ae81ff">7063</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Load stopword list</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_stopwords</span>():
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> [i<span style="color:#f92672">.</span>strip() <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> open(<span style="color:#e6db74">'./data/中文停用词表.txt'</span>, encoding<span style="color:#f92672">=</span><span style="color:#e6db74">'gbk'</span>)]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Read raw email content and process it</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_raw_str_list</span>(path):
</span></span><span style="display:flex;"><span> stop_list <span style="color:#f92672">=</span> get_stopwords()
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">with</span> open(path, encoding<span style="color:#f92672">=</span><span style="color:#e6db74">'gbk'</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span> raw_str <span style="color:#f92672">=</span> f<span style="color:#f92672">.</span>read()
</span></span><span style="display:flex;"><span> pattern <span style="color:#f92672">=</span> <span style="color:#e6db74">'[^</span><span style="color:#ae81ff">\u4E00</span><span style="color:#e6db74">-</span><span style="color:#ae81ff">\u9FA5</span><span style="color:#e6db74">]'</span> <span style="color:#75715e"># Chinese unicode range</span>
</span></span><span style="display:flex;"><span> regex <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>compile(pattern)
</span></span><span style="display:flex;"><span> handled_str <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>sub(pattern, <span style="color:#e6db74">''</span>, raw_str)
</span></span><span style="display:flex;"><span> str_list <span style="color:#f92672">=</span> [word <span style="color:#66d9ef">for</span> word <span style="color:#f92672">in</span> jieba<span style="color:#f92672">.</span>cut(handled_str) <span style="color:#66d9ef">if</span> word <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> stop_list]
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> str_list
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Build vocabulary</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_voca</span>(path, is_file_path<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> is_file_path:
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> read_voca_from_file(path)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> voca <span style="color:#f92672">=</span> defaultdict(int)
</span></span><span style="display:flex;"><span> file_list <span style="color:#f92672">=</span> [file <span style="color:#66d9ef">for</span> file <span style="color:#f92672">in</span> os<span style="color:#f92672">.</span>listdir(path)]
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> file <span style="color:#f92672">in</span> file_list:
</span></span><span style="display:flex;"><span> raw_str_list <span style="color:#f92672">=</span> get_raw_str_list(path <span style="color:#f92672">+</span> str(file))
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> raw_str <span style="color:#f92672">in</span> raw_str_list:
</span></span><span style="display:flex;"><span> voca[raw_str] <span style="color:#f92672">=</span> voca[raw_str] <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> voca
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Save vocabulary to JSON file</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">save_voca2json</span>(voca, path, sort_by_value<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>, indent_<span style="color:#f92672">=</span><span style="color:#ae81ff">4</span>):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> sort_by_value:
</span></span><span style="display:flex;"><span> sorted_by_value(voca)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">with</span> open(path, <span style="color:#e6db74">'w+'</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span> f<span style="color:#f92672">.</span>write(json<span style="color:#f92672">.</span>dumps(voca, ensure_ascii<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>, indent<span style="color:#f92672">=</span>indent_))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Read vocabulary from JSON file</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">read_voca_from_file</span>(path):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">with</span> open(path) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span> voca <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>load(f)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> voca
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Sort dictionary by value</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">sorted_by_value</span>(_dict):
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Sort in place: rebuild the dict ordered by descending frequency</span>
</span></span><span style="display:flex;"><span> items <span style="color:#f92672">=</span> sorted(_dict<span style="color:#f92672">.</span>items(), key<span style="color:#f92672">=</span><span style="color:#66d9ef">lambda</span> x: x[<span style="color:#ae81ff">1</span>], reverse<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span> _dict<span style="color:#f92672">.</span>clear()
</span></span><span style="display:flex;"><span> _dict<span style="color:#f92672">.</span>update(items)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Calculate P(Spam|word)</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_top_words_prob</span>(path, spam_voca, normal_voca, words_size<span style="color:#f92672">=</span><span style="color:#ae81ff">30</span>):
</span></span><span style="display:flex;"><span> critical_words <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> word <span style="color:#f92672">in</span> get_raw_str_list(path):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> word <span style="color:#f92672">in</span> spam_voca<span style="color:#f92672">.</span>keys() <span style="color:#f92672">and</span> word <span style="color:#f92672">in</span> normal_voca<span style="color:#f92672">.</span>keys():
</span></span><span style="display:flex;"><span> p_w_s <span style="color:#f92672">=</span> spam_voca[word] <span style="color:#f92672">/</span> spam_file_num
</span></span><span style="display:flex;"><span> p_w_n <span style="color:#f92672">=</span> normal_voca[word] <span style="color:#f92672">/</span> normal_file_num
</span></span><span style="display:flex;"><span> p_s_w <span style="color:#f92672">=</span> p_w_s <span style="color:#f92672">/</span> (p_w_n <span style="color:#f92672">+</span> p_w_s)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">elif</span> word <span style="color:#f92672">in</span> spam_voca<span style="color:#f92672">.</span>keys() <span style="color:#f92672">and</span> word <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> normal_voca<span style="color:#f92672">.</span>keys():
</span></span><span style="display:flex;"><span> p_w_s <span style="color:#f92672">=</span> spam_voca[word] <span style="color:#f92672">/</span> spam_file_num
</span></span><span style="display:flex;"><span> p_w_n <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.01</span>
</span></span><span style="display:flex;"><span> p_s_w <span style="color:#f92672">=</span> p_w_s <span style="color:#f92672">/</span> (p_w_n <span style="color:#f92672">+</span> p_w_s)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">elif</span> word <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> spam_voca<span style="color:#f92672">.</span>keys() <span style="color:#f92672">and</span> word <span style="color:#f92672">in</span> normal_voca<span style="color:#f92672">.</span>keys():
</span></span><span style="display:flex;"><span> p_w_s <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.01</span>
</span></span><span style="display:flex;"><span> p_w_n <span style="color:#f92672">=</span> normal_voca[word] <span style="color:#f92672">/</span> normal_file_num
</span></span><span style="display:flex;"><span> p_s_w <span style="color:#f92672">=</span> p_w_s <span style="color:#f92672">/</span> (p_w_n <span style="color:#f92672">+</span> p_w_s)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span> p_s_w <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.4</span>
</span></span><span style="display:flex;"><span> critical_words<span style="color:#f92672">.</span>append([word, p_s_w])
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Sort first, then keep the words_size highest-probability words</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> dict(sorted(critical_words, key<span style="color:#f92672">=</span><span style="color:#66d9ef">lambda</span> x: x[<span style="color:#ae81ff">1</span>], reverse<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)[:words_size])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Calculate Bayesian probability</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">caculate_bayes</span>(words_prob, spam_voca, normal_voca):
</span></span><span style="display:flex;"><span> p_s_w <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> p_s_nw <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> word, prob <span style="color:#f92672">in</span> words_prob<span style="color:#f92672">.</span>items():
</span></span><span style="display:flex;"><span> p_s_w <span style="color:#f92672">*=</span> prob
</span></span><span style="display:flex;"><span> p_s_nw <span style="color:#f92672">*=</span> (<span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> prob)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> p_s_w <span style="color:#f92672">/</span> (p_s_w <span style="color:#f92672">+</span> p_s_nw)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">predict</span>(bayes, threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.9</span>):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> bayes <span style="color:#f92672">>=</span> threshold
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Get files and labels</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_files_labels</span>(dir_path, is_spam<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>):
</span></span><span style="display:flex;"><span> raw_files_list <span style="color:#f92672">=</span> os<span style="color:#f92672">.</span>listdir(dir_path)
</span></span><span style="display:flex;"><span> files_list <span style="color:#f92672">=</span> [dir_path <span style="color:#f92672">+</span> file <span style="color:#66d9ef">for</span> file <span style="color:#f92672">in</span> raw_files_list]
</span></span><span style="display:flex;"><span> labels <span style="color:#f92672">=</span> [is_spam <span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(len(files_list))]
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> files_list, labels
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Predict and print results</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">predict_result</span>(file_list, y, spam_voca, normal_voca, word_size<span style="color:#f92672">=</span><span style="color:#ae81ff">30</span>):
</span></span><span style="display:flex;"><span> ret <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span> right <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> file <span style="color:#f92672">in</span> file_list:
</span></span><span style="display:flex;"><span> words_prob <span style="color:#f92672">=</span> get_top_words_prob(file, spam_voca, normal_voca, words_size<span style="color:#f92672">=</span>word_size)
</span></span><span style="display:flex;"><span> bayes <span style="color:#f92672">=</span> caculate_bayes(words_prob, spam_voca, normal_voca)
</span></span><span style="display:flex;"><span> ret<span style="color:#f92672">.</span>append(predict(bayes))
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(len(ret)):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> ret[i] <span style="color:#f92672">==</span> y[i]:
</span></span><span style="display:flex;"><span> right <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span> print(right <span style="color:#f92672">/</span> len(y))
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># main.py</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> utils <span style="color:#f92672">import</span> <span style="color:#f92672">*</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">'__main__'</span>:
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Get vocabulary and save for future use</span>
</span></span><span style="display:flex;"><span> spam_voca <span style="color:#f92672">=</span> get_voca(<span style="color:#e6db74">'./spam_voca.json'</span>, is_file_path<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span> normal_voca <span style="color:#f92672">=</span> get_voca(<span style="color:#e6db74">'./normal_voca.json'</span>, is_file_path<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span> save
</span></span></code></pre></div>Java Multithreading Programming
https://noneback.github.io/blog/java%E5%A4%9A%E7%BA%BF%E7%A8%8B%E5%92%8C%E5%B9%B6%E8%A1%8C/
Fri, 01 Nov 2019 00:00:00 +0000https://noneback.github.io/blog/java%E5%A4%9A%E7%BA%BF%E7%A8%8B%E5%92%8C%E5%B9%B6%E8%A1%8C/<p>Yesterday evening, while revisiting the book “Advanced Java: Multithreading and Parallel Programming” by Liang Yung, I decided to take the opportunity to document my understanding.</p>
<h2 id="java-multithreading-programming">Java Multithreading Programming</h2>
<p>Java provides built-in support for multithreading.</p>
<ul>
<li>A <strong>thread</strong> is a single sequential flow of control within a process, and multiple threads can run concurrently within a process, each performing different tasks.</li>
<li><strong>Multithreading</strong> is a specialized form of multitasking that consumes fewer resources.</li>
<li>A <strong>process</strong> contains the memory space allocated by the operating system and includes one or more threads. Threads cannot exist independently but must be part of a process. A process continues running until all non-daemon threads complete execution.</li>
<li>Multithreading allows developers to write efficient programs that fully utilize CPU resources.</li>
</ul>
<h3 id="thread-states">Thread States</h3>
<p>A thread is a dynamic execution entity that has different states throughout its lifecycle.</p>
<p><img alt="Thread States" src="https://www.runoob.com/wp-content/uploads/2014/01/java-thread.jpg"></p>
<ol>
<li>
<p><strong>New</strong>:</p>
<ul>
<li>A thread is in a new state when it is created using the <code>new</code> keyword with the <code>Thread</code> class or its subclass. It remains in this state until the program starts the thread using the <code>start()</code> method.</li>
</ul>
</li>
<li>
<p><strong>Runnable</strong>:</p>
<ul>
<li>After invoking the <code>start()</code> method, the thread enters the runnable state and waits in the ready queue to be allocated CPU resources by the JVM thread scheduler.</li>
</ul>
</li>
<li>
<p><strong>Running</strong>:</p>
<ul>
<li>Once the thread gets CPU resources, it enters the running state and executes the <code>run()</code> method. In the running state, a thread can transition to blocked, runnable, or terminated states.</li>
</ul>
</li>
<li>
<p><strong>Blocked</strong>:</p>
<ul>
<li>When a thread calls methods like <code>sleep()</code> or <code>suspend()</code> (the latter is deprecated) and gives up the CPU, it transitions to the blocked state. Once the sleep time elapses or the blocking condition clears, it can reenter the runnable state.</li>
</ul>
</li>
<li>
<p><strong>Waiting Blocked</strong>:</p>
<ul>
<li>A running thread calling the <code>wait()</code> method enters the waiting blocked state.</li>
</ul>
</li>
<li>
<p><strong>Synchronized Blocked</strong>:</p>
<ul>
<li>A thread trying to acquire a synchronized lock but failing due to another thread owning the lock transitions to the synchronized blocked state.</li>
</ul>
</li>
<li>
<p><strong>Other Blocked</strong>:</p>
<ul>
<li>Through methods like <code>sleep()</code>, <code>join()</code>, or I/O requests, a thread can enter the other blocked state. Once these operations are complete, it can reenter the runnable state.</li>
</ul>
</li>
<li>
<p><strong>Terminated</strong>:</p>
<ul>
<li>A thread enters the terminated state once it has completed its execution or met some terminating conditions.</li>
</ul>
</li>
</ol>
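<p>The states above can be observed at runtime through <code>Thread.getState()</code>. Note that the JVM’s <code>Thread.State</code> enum is coarser than this list: runnable and running are merged into <code>RUNNABLE</code>, and the blocked variants map to <code>BLOCKED</code>, <code>WAITING</code>, or <code>TIMED_WAITING</code>. A minimal sketch:</p>

```java
public class ThreadStateDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(200); // thread is TIMED_WAITING while sleeping
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println(t.getState()); // NEW: created but not started
        t.start();
        Thread.sleep(50);                 // let t reach its sleep() call
        System.out.println(t.getState()); // typically TIMED_WAITING
        t.join();                         // wait for t to finish
        System.out.println(t.getState()); // TERMINATED
    }
}
```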
<h3 id="creating-task-classes-and-threads">Creating Task Classes and Threads</h3>
<ul>
<li>A <strong>task</strong> in Java is an object that implements the <code>Runnable</code> interface (containing the <code>run()</code> method). You need to override the <code>run()</code> method to define the task’s behavior.</li>
<li><strong>Threads</strong> are created through the <code>Thread</code> class, which also contains methods for controlling the thread.</li>
</ul>
<blockquote>
<p>Creating a thread is always based on a task:</p>
</blockquote>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span>Thread thread <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> Thread(<span style="color:#66d9ef">new</span> TaskClass());
</span></span><span style="display:flex;"><span><span style="color:#75715e">// Calling thread.start() schedules the thread; the JVM then invokes run().</span>
</span></span></code></pre></div><h4 id="other-methods-in-the-thread-class">Other Methods in the Thread Class</h4>
<ul>
<li><code>yield()</code>: Temporarily releases the CPU to let other threads execute.</li>
<li><code>sleep()</code>: Makes the thread sleep for a specified period to allow other threads to run.
<blockquote>
<p>Note: <code>sleep()</code> may throw an <code>InterruptedException</code>, which is a checked exception: Java requires you to catch it in a <code>try</code> block or declare it with <code>throws</code>.</p>
</blockquote>
</li>
</ul>
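<p>Because <code>InterruptedException</code> is checked, a task’s <code>run()</code> method (which cannot declare <code>throws</code>) must handle it inline. A common idiom, sketched below, is to restore the interrupt flag so callers can still detect the interruption:</p>

```java
public class SleepDemo {
    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(10_000); // pretend to do slow work
            } catch (InterruptedException e) {
                // sleep() clears the interrupt flag when it throws;
                // re-set it so code further up the stack can observe it.
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        worker.interrupt(); // wakes the sleeping thread early
    }
}
```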
<h4 id="thread-priorities">Thread Priorities</h4>
<p>Threads have priorities. The Java Virtual Machine generally schedules higher-priority threads first, although priority is only a hint to the underlying scheduler. Threads of equal priority are typically scheduled round-robin.</p>
<blockquote>
<p>Use <code>Thread.setPriority()</code> to set a thread’s priority.</p>
</blockquote>
<h3 id="thread-pool">Thread Pool</h3>
<p>Starting a new thread for each of many short-lived tasks limits throughput and degrades performance. Using a <strong>thread pool</strong> is an ideal solution for managing the concurrent execution of tasks.</p>
<p>Java provides the <code>Executor</code> interface to execute tasks in a thread pool, and the <code>ExecutorService</code> interface is used to manage and control those tasks. Executors are created through static methods like <code>newFixedThreadPool(int)</code> (to create a pool with a fixed number of threads) or <code>newCachedThreadPool()</code> (to create a pool with a dynamically managed number of threads).</p>
<blockquote>
<p>Example:</p>
</blockquote>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#f92672">import</span> java.util.concurrent.*;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ExecutorDemo</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">static</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">main</span>(String<span style="color:#f92672">[]</span> args) {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Create a fixed thread pool with a maximum of three threads</span>
</span></span><span style="display:flex;"><span> ExecutorService executor <span style="color:#f92672">=</span> Executors.<span style="color:#a6e22e">newFixedThreadPool</span>(3);
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Submit runnable tasks to the executor</span>
</span></span><span style="display:flex;"><span> executor.<span style="color:#a6e22e">execute</span>(<span style="color:#66d9ef">new</span> PrintChar(<span style="color:#e6db74">'a'</span>, 100));
</span></span><span style="display:flex;"><span> executor.<span style="color:#a6e22e">execute</span>(<span style="color:#66d9ef">new</span> PrintChar(<span style="color:#e6db74">'b'</span>, 100));
</span></span><span style="display:flex;"><span> executor.<span style="color:#a6e22e">execute</span>(<span style="color:#66d9ef">new</span> PrintNum(100));
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Shut down the executor</span>
</span></span><span style="display:flex;"><span> executor.<span style="color:#a6e22e">shutdown</span>();
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Thread pools provide <strong>a better way to manage threads</strong>. They primarily address issues related to the overhead of thread lifecycle and resource limitations:</p>
<ul>
<li><strong>Thread pools reduce the time and system resources spent on creating and destroying threads</strong>. By reusing threads across multiple tasks, the cost of creating threads is amortized. Since threads already exist when new requests come in, they eliminate the latency caused by thread creation, allowing the application to respond faster.</li>
<li><strong>Thread pools allow easy thread management</strong>, e.g., using a <code>ScheduledThreadPool</code> to execute tasks after a delay or on a repeating schedule.</li>
<li><strong>They control concurrency levels</strong>, preventing resource contention when many threads compete for CPU resources.</li>
</ul>
<h3 id="thread-synchronization">Thread Synchronization</h3>
<p>If multiple threads simultaneously access the same resource, it may lead to data corruption. If two tasks interact with a shared resource in a conflicting manner, they are said to be in a <strong>race condition</strong>. Without race conditions, a program is considered <strong>thread-safe</strong>.</p>
<p>To prevent race conditions, threads must be synchronized to prevent multiple threads from accessing a particular section of the program simultaneously.</p>
<h4 id="methods-for-synchronizing-threads">Methods for Synchronizing Threads</h4>
<p>Before executing a synchronized method, a lock must be obtained. Locks provide exclusive access to a shared resource. For instance methods, the object is locked; for static methods, the class is locked.</p>
<ul>
<li><code>synchronized</code> keyword:</li>
</ul>
<blockquote>
<p>You can apply this keyword to methods or blocks of code.</p>
</blockquote>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">synchronized</span> (expr) {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// do something</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">synchronized</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">func</span>() {}
</span></span></code></pre></div><ul>
<li>Lock-based synchronization:
Locks and conditions can be used explicitly for thread synchronization.
<blockquote>
<p>A lock is an instance of the <code>Lock</code> interface, which provides methods to acquire and release locks.
<code>ReentrantLock</code> is an implementation of the lock mechanism for mutual exclusion.</p>
</blockquote>
</li>
</ul>
<p>Example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">deposit</span>(<span style="color:#66d9ef">int</span> amount) {
</span></span><span style="display:flex;"><span> lock.<span style="color:#a6e22e">lock</span>(); <span style="color:#75715e">// Acquire the lock</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">try</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> newBalance <span style="color:#f92672">=</span> balance <span style="color:#f92672">+</span> amount;
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// This delay is deliberately added to magnify the</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// data corruption problem and make it easy to see.</span>
</span></span><span style="display:flex;"><span> Thread.<span style="color:#a6e22e">sleep</span>(5);
</span></span><span style="display:flex;"><span> balance <span style="color:#f92672">=</span> newBalance;
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">catch</span> (InterruptedException ex) {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Handle the exception</span>
</span></span><span style="display:flex;"><span> } <span style="color:#66d9ef">finally</span> {
</span></span><span style="display:flex;"><span> lock.<span style="color:#a6e22e">unlock</span>(); <span style="color:#75715e">// Release the lock</span>
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h4 id="avoiding-deadlocks">Avoiding Deadlocks</h4>
<p>A deadlock may occur when multiple threads need to acquire locks on several shared objects simultaneously. Deadlocks can be avoided by <strong>ordering resource acquisition</strong>.</p>
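<p>A minimal sketch of the lock-ordering idea: if every thread that needs both locks always acquires them in the same global order (here, A before B), a circular wait — and hence a deadlock — cannot form. The lock and method names are illustrative only:</p>

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockOrdering {
    static final ReentrantLock LOCK_A = new ReentrantLock();
    static final ReentrantLock LOCK_B = new ReentrantLock();

    // All callers acquire LOCK_A first, then LOCK_B, and release in
    // reverse order; no thread can ever hold B while waiting for A.
    static void withBothLocks(Runnable critical) {
        LOCK_A.lock();
        try {
            LOCK_B.lock();
            try {
                critical.run();
            } finally {
                LOCK_B.unlock();
            }
        } finally {
            LOCK_A.unlock();
        }
    }
}
```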
<h3 id="thread-collaboration">Thread Collaboration</h3>
<p>Threads can communicate by using <strong>conditions</strong> to specify what actions they should take under certain circumstances.</p>
<blockquote>
<p>A <strong>condition</strong> is an object created through the <code>Lock</code> object’s <code>newCondition()</code> method. Threads can use <code>await()</code>, <code>signal()</code>, or <code>signalAll()</code> to communicate.</p>
</blockquote>
<ul>
<li><code>await()</code>: Causes the current thread to wait until the condition is signaled.</li>
<li><code>signal()</code>/<code>signalAll()</code>: Wakes one or all threads waiting on the condition.</li>
</ul>
<blockquote>
<p>Conditions must be used with locks; invoking their methods without a lock will result in an <code>IllegalMonitorStateException</code>.</p>
</blockquote>
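<p>The <code>await()</code>/<code>signal()</code> pattern can be sketched with a one-slot buffer guarded by a <code>Lock</code> and two conditions (the class and method names here are illustrative):</p>

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class OneSlotBuffer {
    private final Lock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();
    private final Condition notFull = lock.newCondition();
    private Integer slot; // null means the buffer is empty

    public void put(int value) throws InterruptedException {
        lock.lock();
        try {
            while (slot != null) notFull.await(); // wait until slot frees up
            slot = value;
            notEmpty.signal(); // wake a waiting consumer
        } finally {
            lock.unlock();
        }
    }

    public int take() throws InterruptedException {
        lock.lock();
        try {
            while (slot == null) notEmpty.await(); // wait for a value
            int value = slot;
            slot = null;
            notFull.signal(); // wake a waiting producer
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```

Note the <code>while</code> loops around <code>await()</code>: a condition must be rechecked after waking, since another thread may have consumed it first.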
<h3 id="blocking-queues">Blocking Queues</h3>
<p>Java provides <strong>blocking queues</strong> for multithreading, which allow synchronization without needing locks or conditions explicitly. They provide two additional operations:</p>
<ul>
<li>When the queue is empty, a retrieval operation will <strong>block</strong> the thread until elements become available.</li>
<li>When the queue is full, an insert operation will <strong>block</strong> the thread until space becomes available.</li>
</ul>
<p>Blocking queues are commonly used in <strong>producer-consumer</strong> scenarios. Producer threads place results in the queue, while consumer threads retrieve and process those results. Blocking queues <strong>automatically balance the workload</strong> between producers and consumers.</p>
<h4 id="core-methods-of-blockingqueue">Core Methods of BlockingQueue</h4>
<ol>
<li>
<p><strong>Adding Data</strong>:</p>
<ul>
<li><code>put(E e)</code>: Inserts an element at the end of the queue, waiting if the queue is full.</li>
<li><code>offer(E e, long timeout, TimeUnit unit)</code>: Attempts to add an element, waiting up to the specified time if the queue is full. If successful, returns <code>true</code>; otherwise, returns <code>false</code>.</li>
</ul>
</li>
<li>
<p><strong>Retrieving Data</strong>:</p>
<ul>
<li><code>take()</code>: Retrieves and removes the head of the queue, waiting if necessary until an element becomes available.</li>
<li><code>drainTo()</code>: Retrieves and removes all available elements from the queue, improving efficiency by reducing the number of lock/unlock operations.</li>
<li><code>poll(long timeout, TimeUnit unit)</code>: Retrieves and removes the head of the queue, waiting up to the specified time if the queue is empty. If no element is found within the time limit, returns <code>null</code>.</li>
</ul>
</li>
</ol>
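<p>A minimal producer-consumer sketch with a bounded <code>ArrayBlockingQueue</code>: the producer blocks in <code>put()</code> whenever the two-element queue is full, and the consumer blocks in <code>take()</code> whenever it is empty, so no explicit locks or conditions are needed:</p>

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(2);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 5; i++) queue.put(i); // blocks when full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        for (int i = 1; i <= 5; i++)
            System.out.println(queue.take()); // blocks when empty
        producer.join();
    }
}
```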
<h3 id="parallel-programming">Parallel Programming</h3>
<p>Java uses the <strong>Fork/Join framework</strong> to implement parallel programming. In this framework, a <strong>fork</strong> splits a task into subtasks that can be executed by separate threads, and a <strong>join</strong> combines their results.</p>
<blockquote>
<p>Decompose a problem into multiple non-overlapping subproblems that can be solved independently, then combine their solutions to get the overall answer.</p>
</blockquote>
<p>Tasks are defined using the <code>ForkJoinTask</code> class and executed in a <code>ForkJoinPool</code> instance.</p>
<blockquote>
<p><code>ForkJoinTask</code> is the base class for tasks. It’s a lightweight entity, meaning many tasks can be executed by a small number of threads in the <code>ForkJoinPool</code>.</p>
</blockquote>
<p>Example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-java" data-lang="java"><span style="display:flex;"><span><span style="color:#f92672">import</span> java.util.concurrent.RecursiveAction;
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> java.util.concurrent.ForkJoinPool;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ParallelMergeSort</span> {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">static</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">main</span>(String<span style="color:#f92672">[]</span> args) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">final</span> <span style="color:#66d9ef">int</span> SIZE <span style="color:#f92672">=</span> 7000000;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> list1 <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[</span>SIZE<span style="color:#f92672">]</span>;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> list2 <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[</span>SIZE<span style="color:#f92672">]</span>;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> (<span style="color:#66d9ef">int</span> i <span style="color:#f92672">=</span> 0; i <span style="color:#f92672"><</span> list1.<span style="color:#a6e22e">length</span>; i<span style="color:#f92672">++</span>)
</span></span><span style="display:flex;"><span> list1<span style="color:#f92672">[</span>i<span style="color:#f92672">]</span> <span style="color:#f92672">=</span> list2<span style="color:#f92672">[</span>i<span style="color:#f92672">]</span> <span style="color:#f92672">=</span> (<span style="color:#66d9ef">int</span>)(Math.<span style="color:#a6e22e">random</span>() <span style="color:#f92672">*</span> 10000000);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">long</span> startTime <span style="color:#f92672">=</span> System.<span style="color:#a6e22e">currentTimeMillis</span>();
</span></span><span style="display:flex;"><span> parallelMergeSort(list1); <span style="color:#75715e">// Invoke parallel merge sort</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">long</span> endTime <span style="color:#f92672">=</span> System.<span style="color:#a6e22e">currentTimeMillis</span>();
</span></span><span style="display:flex;"><span> System.<span style="color:#a6e22e">out</span>.<span style="color:#a6e22e">println</span>(<span style="color:#e6db74">"\nParallel time with "</span> <span style="color:#f92672">+</span>
</span></span><span style="display:flex;"><span> Runtime.<span style="color:#a6e22e">getRuntime</span>().<span style="color:#a6e22e">availableProcessors</span>() <span style="color:#f92672">+</span>
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">" processors is "</span> <span style="color:#f92672">+</span> (endTime <span style="color:#f92672">-</span> startTime) <span style="color:#f92672">+</span> <span style="color:#e6db74">" milliseconds"</span>);
</span></span><span style="display:flex;"><span> startTime <span style="color:#f92672">=</span> System.<span style="color:#a6e22e">currentTimeMillis</span>();
</span></span><span style="display:flex;"><span> MergeSort.<span style="color:#a6e22e">mergeSort</span>(list2); <span style="color:#75715e">// MergeSort is in Listing 23.5</span>
</span></span><span style="display:flex;"><span> endTime <span style="color:#f92672">=</span> System.<span style="color:#a6e22e">currentTimeMillis</span>();
</span></span><span style="display:flex;"><span> System.<span style="color:#a6e22e">out</span>.<span style="color:#a6e22e">println</span>(<span style="color:#e6db74">"\nSequential time is "</span> <span style="color:#f92672">+</span>
</span></span><span style="display:flex;"><span> (endTime <span style="color:#f92672">-</span> startTime) <span style="color:#f92672">+</span> <span style="color:#e6db74">" milliseconds"</span>);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">static</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">parallelMergeSort</span>(<span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> list) {
</span></span><span style="display:flex;"><span> RecursiveAction mainTask <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> SortTask(list);
</span></span><span style="display:flex;"><span> ForkJoinPool pool <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> ForkJoinPool();
</span></span><span style="display:flex;"><span> pool.<span style="color:#a6e22e">invoke</span>(mainTask);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">static</span> <span style="color:#66d9ef">class</span> <span style="color:#a6e22e">SortTask</span> <span style="color:#66d9ef">extends</span> RecursiveAction {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">final</span> <span style="color:#66d9ef">int</span> THRESHOLD <span style="color:#f92672">=</span> 500;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> list;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> SortTask(<span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> list) {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">this</span>.<span style="color:#a6e22e">list</span> <span style="color:#f92672">=</span> list;
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">@Override</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">protected</span> <span style="color:#66d9ef">void</span> <span style="color:#a6e22e">compute</span>() {
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> (list.<span style="color:#a6e22e">length</span> <span style="color:#f92672"><</span> THRESHOLD)
</span></span><span style="display:flex;"><span> java.<span style="color:#a6e22e">util</span>.<span style="color:#a6e22e">Arrays</span>.<span style="color:#a6e22e">sort</span>(list);
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Obtain the first half</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> firstHalf <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[</span>list.<span style="color:#a6e22e">length</span> <span style="color:#f92672">/</span> 2<span style="color:#f92672">]</span>;
</span></span><span style="display:flex;"><span> System.<span style="color:#a6e22e">arraycopy</span>(list, 0, firstHalf, 0, list.<span style="color:#a6e22e">length</span> <span style="color:#f92672">/</span> 2);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Obtain the second half</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span> secondHalfLength <span style="color:#f92672">=</span> list.<span style="color:#a6e22e">length</span> <span style="color:#f92672">-</span> list.<span style="color:#a6e22e">length</span> <span style="color:#f92672">/</span> 2;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[]</span> secondHalf <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> <span style="color:#66d9ef">int</span><span style="color:#f92672">[</span>secondHalfLength<span style="color:#f92672">]</span>;
</span></span><span style="display:flex;"><span> System.<span style="color:#a6e22e">arraycopy</span>(list, list.<span style="color:#a6e22e">length</span> <span style="color:#f92672">/</span> 2, secondHalf, 0, secondHalfLength);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Recursively sort the two halves</span>
</span></span><span style="display:flex;"><span> invokeAll(<span style="color:#66d9ef">new</span> SortTask(firstHalf), <span style="color:#66d9ef">new</span> SortTask(secondHalf));
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">// Merge firstHalf with secondHalf into list</span>
</span></span><span style="display:flex;"><span> MergeSort.<span style="color:#a6e22e">merge</span>(firstHalf, secondHalf, list);
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>
https://noneback.github.io/blog/gsoc/
Mon, 01 Jan 0001 00:00:00 +0000https://noneback.github.io/blog/gsoc/<h1 id="an-alternative-tuple-storage-engine-for-casbin-mesh--casbin--gsoc-2022-proposal">An alternative tuple-storage engine for Casbin Mesh / Casbin — GSOC 2022 Proposal</h1>
<h2 id="about-me">About me</h2>
<h3 id="basic-infomation">Basic Information</h3>
<ul>
<li>
<p>First / Last Name: Xie Kai</p>
</li>
<li>
<p>Email: <a href="mailto:[email protected]">[email protected]</a></p>
</li>
<li>
<p>QQ : 1633849228</p>
</li>
<li>
<p>School/University: <a href="https://en.wikipedia.org/wiki/Beijing_University_of_Posts_and_Telecommunications">Beijing University of Posts and Telecommunications</a></p>
</li>
<li>
<p>Graduation Date: July, 2022</p>
</li>
<li>
<p>Major/Focus: Software Engineering</p>
</li>
<li>
<p>Location: Beijing, China</p>
</li>
<li>
<p>Timezone: China Standard Time (CST), UTC +8</p>
</li>
<li>
<p>Github Profile: <a href="https://github.com/noneback">https://github.com/noneback</a></p>
</li>
<li>
<p>Personal Blog: <a href="http://noneback.github.io">http://noneback.github.io</a></p>
</li>
</ul>
<h3 id="open-source-experience">Open Source Experience</h3>
<p>I have contributed to the following open source projects:</p>
<ul>
<li>
<p><a href="https://github.com/matrixorigin/matrixone">MatrixOne</a> : Hyperconverged cloud-edge native database</p>
</li>
<li>
<p><a href="https://github.com/flamego/cache">flame-go: cache</a> : a middleware that provides the cache management for Flamego</p>
</li>
<li>
<p><a href="https://github.com/flamego/session">flame-go: session</a> : a middleware that provides the session management for Flamego</p>
</li>
<li>
<p><a href="https://github.com/casbin/casnode">casnode</a> : An open-source forum (BBS) software developed by Go and React</p>
</li>
<li>
<p><a href="https://github.com/noneback/Toys">Toys</a> : Toys written by myself.</p>
</li>
</ul>
<h3 id="other-information">Other Information</h3>
<ul>
<li>
<p>Currently, I am taking the MIT 6.824 and CMU 15-445 courses and have finished the MapReduce and Raft labs. I understand the basics of page layout, indexing (hash and B+ tree indexes), and multi-version concurrency control.</p>
</li>
<li>
<p>I interned in the Business Department and the Distributed Storage Department at ByteDance.</p>
</li>
</ul>
<h2 id="problem-description">Problem Description</h2>
<p>Currently, Casbin uses Go’s built-in map to maintain policies in main memory and persists them via the adapter abstraction.</p>
<p>As policy data grows, however, the rising cost of main-memory resources and the degraded performance make this memory-management strategy untenable. We need a better way to manage Casbin’s in-memory data at scale.</p>
<h2 id="implementation-plan">Implementation Plan</h2>
<h3 id="breif-design">Brief Design</h3>
<p>From my point of view, our main goals are to reduce memory cost while maintaining good performance for policy read and write requests.</p>
<p>To achieve these goals, we can introduce an experimental tuple storage engine to take charge of storing policies, turning the policy-management strategy from memory-oriented to disk-oriented. We can even build a better abstraction of the storage layer so that different engines (row, column) can serve different workloads.</p>
<p>In general, we can take the following parts into consideration:</p>
<ul>
<li>
<p><strong>API</strong> for upper layer</p>
<blockquote>
<p>Design API for the upper level.</p>
<p>The API is the key design if we want to make the storage engine a plugin.</p>
<p>Deriving it from the adapter interface should be sufficient.</p>
</blockquote>
</li>
<li>
<p>workload <strong>optimizer</strong></p>
<blockquote>
<p>Try to optimize those workloads to improve performance.</p>
<p>Key design:</p>
<ul>
<li>
<p>Estimation of requests cost</p>
</li>
<li>
<p>Strategy to reconstruct data access path.</p>
</li>
</ul>
</blockquote>
</li>
<li>
<p><strong>Buffer Pool</strong> management</p>
<blockquote>
<p>In-memory data structure management.</p>
<p>Key design: replace strategy</p>
</blockquote>
</li>
<li>
<p><strong>Indexing</strong></p>
<blockquote>
<p>Indexes accelerate read and write requests. The key design questions are which index types to use and how to build indexes over policies to improve performance.</p>
<p>Considering Casbin’s common workloads, we can provide B+tree and hash structures for indexing.</p>
</blockquote>
</li>
<li>
<p><strong>Data Storage Structures</strong></p>
<blockquote>
<p>Key design:</p>
<ul>
<li>
<p>File organization, e.g., B+tree sequential file organization.</p>
</li>
<li>
<p>Page and tuple layout. Our policy data is essentially varchar, so our tuples are variable-length records; we can use a slotted-page structure to organize them.</p>
</li>
<li>
<p>Encoding: row-based or column-based.</p>
</li>
</ul>
</blockquote>
</li>
<li>
<p><strong>Transaction</strong> if necessary</p>
<blockquote>
<p>We can use MVCC to improve concurrent performance.</p>
</blockquote>
</li>
</ul>
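<p>The slotted-page idea from the data storage bullet above can be sketched as follows. This is a minimal illustration under assumed names (<code>page</code>, <code>insert</code>, <code>get</code>, a 4&nbsp;KiB page), not the actual on-disk format: a slot directory of (offset, length) pairs grows forward from the front of the page while variable-length tuple bytes grow backward from the end.</p>

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const pageSize = 4096

// page is a minimal slotted-page sketch. The slot directory starts at
// byte offset 4 (the first 4 header bytes are reserved conceptually;
// here the header lives in the struct fields for simplicity). Each slot
// is 4 bytes: a 2-byte tuple offset and a 2-byte tuple length.
type page struct {
	data     [pageSize]byte
	numSlots uint16
	freeEnd  uint16 // tuples occupy data[freeEnd:]
}

func newPage() *page { return &page{freeEnd: pageSize} }

// insert appends one variable-length tuple and returns its slot id.
func (p *page) insert(tuple []byte) (int, error) {
	slotEnd := 4 + int(p.numSlots)*4
	need := len(tuple) + 4 // tuple bytes plus one new slot entry
	if int(p.freeEnd)-slotEnd < need {
		return 0, fmt.Errorf("page full")
	}
	// Tuple bytes grow backward from the end of the page.
	p.freeEnd -= uint16(len(tuple))
	copy(p.data[p.freeEnd:], tuple)
	// The slot directory grows forward from the front.
	off := 4 + int(p.numSlots)*4
	binary.LittleEndian.PutUint16(p.data[off:], p.freeEnd)
	binary.LittleEndian.PutUint16(p.data[off+2:], uint16(len(tuple)))
	p.numSlots++
	return int(p.numSlots) - 1, nil
}

// get reads the tuple stored in the given slot.
func (p *page) get(slot int) []byte {
	off := 4 + slot*4
	start := binary.LittleEndian.Uint16(p.data[off:])
	length := binary.LittleEndian.Uint16(p.data[off+2:])
	return p.data[start : start+length]
}

func main() {
	p := newPage()
	s, _ := p.insert([]byte("alice,data1,read"))
	fmt.Println(string(p.get(s))) // alice,data1,read
}
```

<p>Because each tuple is addressed through its slot entry, tuples can later be compacted within the page without changing their slot ids, which is the usual motivation for the slotted-page layout.</p>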
<h3 id="reference--resource">Reference & Resource</h3>
<h4 id="codebase">Codebase</h4>
<ul>
<li>
<p>Bustub codebase</p>
</li>
<li>
<p>MIT-6.830 Simpledb Codebase</p>
</li>
<li>
<p>Risinglight DB</p>
</li>
<li>
<p>Badger DB</p>
</li>
</ul>
<h4 id="paper">Paper</h4>
<ul>
<li>
<p>An Empirical Evaluation of In-Memory Multi-Version Concurrency Control</p>
</li>
<li>
<p>Column-Stores vs. Row-Stores: How Different Are They Really?</p>
</li>
</ul>
<h4 id="other">Other</h4>
<ul>
<li>
<p>Database System Concepts</p>
</li>
<li>
<p>CMU-15445 DB Course</p>
</li>
<li>
<p>CMU-15721 DB Course</p>
</li>
</ul>
<h2 id="timeline">Timeline</h2>
<h3 id="before-the-official-coding-time">Before the official coding time</h3>
<h4 id="may-1---may-23">May 1 - May 23</h4>
<ul>
<li>
<p>Learn more about Casbin source code and Casbin Community and try to solve some basic issues on the codebase.</p>
</li>
<li>
<p>Have a discussion with the mentor to determine what feature we need to add and make a basic design overview of the project.</p>
</li>
<li>
<p>Do research about how to implement our project best and write a detailed design document about it.</p>
</li>
</ul>
<h4 id="may-24---june-14">May 24 - June 14</h4>
<ul>
<li>
<p>Carefully design and write the basic framework of our whole project.</p>
</li>
<li>
<p>Write UT for framework code.</p>
</li>
</ul>
<h3 id="official-coding-period-starts">Official coding period starts</h3>
<h4 id="june-15---june-28">June 15 - June 28</h4>
<ul>
<li>
<p>Write code about data encoding and page tuple layout (disk manager)</p>
</li>
<li>
<p>Write UT for data storage layer</p>
</li>
</ul>
<h4 id="june-29---july-13">June 29 - July 13</h4>
<ul>
<li>
<p>Implement a module about buffer pool and Index management.</p>
</li>
<li>
<p>Write UT for the buffer pool and indexing.</p>
</li>
</ul>
<h4 id="july-14---july-28">July 14 - July 28</h4>
<ul>
<li>
<p>Implement the API and workload optimizer for the upper layer.</p>
</li>
<li>
<p>Optimize for workloads from the upper level.</p>
</li>
</ul>
<h4 id="july-29---august-5">July 29 - August 5</h4>
<ul>
<li>
<p>Implement Transaction Module</p>
</li>
<li>
<p>Write UT for transaction part</p>
</li>
</ul>
<h4 id="august-6---august-13">August 6 - August 13</h4>
<ul>
<li>
<p>Polish our project</p>
</li>
<li>
<p>CI integration and documentation work; finish polishing the documents.</p>
</li>
</ul>
<h4 id="extra-time">Extra Time</h4>
<ul>
<li>A buffer kept for any unpredictable delay</li>
</ul>
<h2 id="deliverables">Deliverables</h2>
<p>A Casbin built-in embedded disk-oriented tuple storage engine.</p>
<p>The engine should contain:</p>
<ul>
<li>
<p>A carefully designed API for the upper Casbin internal modules.</p>
</li>
<li>
<p>Storage management, including file organization and page layout.</p>
</li>
<li>
<p>Buffer pool management.</p>
</li>
<li>
<p>Index management.</p>
</li>
<li>
<p>A workload optimizer for the upper layer.</p>
</li>
<li>
<p>A transaction module, if necessary.</p>
</li>
</ul>
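<p>Among the deliverables, buffer pool management hinges on the replacement strategy noted as its key design. A minimal LRU replacer sketch follows; all names (<code>lruReplacer</code>, <code>touch</code>, <code>victim</code>) are hypothetical, and a real buffer pool would also track pin counts and dirty flags.</p>

```go
package main

import (
	"container/list"
	"fmt"
)

// lruReplacer tracks unpinned frame ids in least-recently-used order.
// victim() evicts the LRU frame when the buffer pool needs a free slot.
type lruReplacer struct {
	order *list.List            // front = most recently used
	where map[int]*list.Element // frame id -> node in order
}

func newLRUReplacer() *lruReplacer {
	return &lruReplacer{order: list.New(), where: map[int]*list.Element{}}
}

// touch marks a frame as just used (called on every page access).
func (r *lruReplacer) touch(frame int) {
	if el, ok := r.where[frame]; ok {
		r.order.MoveToFront(el)
		return
	}
	r.where[frame] = r.order.PushFront(frame)
}

// victim removes and returns the least recently used frame, if any.
func (r *lruReplacer) victim() (int, bool) {
	back := r.order.Back()
	if back == nil {
		return 0, false
	}
	r.order.Remove(back)
	frame := back.Value.(int)
	delete(r.where, frame)
	return frame, true
}

func main() {
	r := newLRUReplacer()
	r.touch(1)
	r.touch(2)
	r.touch(1) // frame 2 is now least recently used
	v, _ := r.victim()
	fmt.Println(v) // 2
}
```

<p>Other policies (e.g., clock or LRU-K) could be swapped in behind the same two-method contract, which is why the replacement strategy is isolated as its own key design point.</p>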