MapDB 4 announcement (2023-09-17)

I started work on MapDB 4; this blog post describes some details. The last major version, 3.0, was released a long time ago. I decided to wrap up my work on different prototypes and make a new major release.

My major goal now is to leave out anything that is too complex to implement and maintain. Instead, I aim to neatly arrange realistic, simple elements into a cohesive puzzle that functions smoothly. I am somewhat giving up on the “large database idea” and returning to the roots that made JDBM3 and MapDB 1.0 great.

Goals for the new release

  • MapDB is an alternative memory model. It can store data on disk and work as a database engine, but the main goal is to be a simple drop-in replacement for Java collections. It should make processing data that does not fit into memory easy.

  • Designed around Spliterator and Parallel Streams from Java 8. There are consumer CPUs with 128 cores; let’s make them usable.

  • Very fast data ingestion (imports). MapDB 1.0 had Data Pumps that could create BTreeMap in a very fast way. I would like to extend this idea. It will be possible to load a data file, perform transformations using parallel streams, and save it into new writable collections. All very, very fast, using append-only operations. The goal is to saturate SSD speeds.

  • Cheap snapshots. It will finally be possible to take a snapshot of a collection and use it to make backups or for data processing. It will work well with Parallel Streams and fast ingestion.

  • Fix the write amplification issue in BTreeMap and other collections. Multiple writes will be grouped together into a single batch.

  • Simplicity and maintainability. I am fixing all design mistakes that did not work in older versions.
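The collections-first, streams-friendly goal can be sketched with plain Java 8 (a `TreeMap` stands in for the disk-backed map; nothing here is actual MapDB 4 API):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class ParallelSketch {
    // Count even keys using a parallel stream; a disk-backed MapDB collection
    // would plug into the same Spliterator/stream machinery.
    static long evenKeyCount(NavigableMap<Long, String> map) {
        return map.keySet().parallelStream()
                .filter(k -> k % 2 == 0)
                .count();
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> map = new TreeMap<>();
        for (long i = 0; i < 1000; i++) map.put(i, "value" + i);
        System.out.println(evenKeyCount(map)); // prints 500
    }
}
```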

What is not there

MapDB V4 will be in a single Java package again. It will also have no dependencies (no Eclipse Collections, no Guava…). I am even dropping Kotlin from production code (it will still be used for unit tests). The new version will be a single JAR file that depends only on Java 8 or newer.

Packaging will change to make it possible to use older versions in parallel. The package name will be org.mapdb.v4, and the Maven coordinates will be org.mapdb:mapdb-v4. I am also going back to Maven. It seems simpler for tricky stuff like OSGi annotations.

I am also dropping all the tricky elements from storage. No dynamic recursive space allocation, no multi-sized records, and no alternative storage access methods (sorry, no RandomAccessFile; ByteBuffers all the way).

In practice, it means three things:

  • Memory-mapped files will work everywhere, even on tricky network file systems like CEPH.
  • No need for Unsafe and unmap tricks; MapDB 4 will only use the vanilla Java 8 API.
  • There will be a bigger space overhead for writable stores.

Serializers will also be removed. Collections will no longer directly “see” records, such as BTreeNode, but will have to perform “actions” at the Store on top of binary data. This will make it possible to store all modifications in an event log and use it to replay snapshots.

And finally, the concurrency design is greatly simplified. MapDB will be thread safe and use a single ReadWriteLock per collection. No more segmented and structural locking. It will still be possible to insert data using multiple threads, but that will be done at the caching layer.
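A single ReadWriteLock per collection could look roughly like this sketch (illustrative names, not the real MapDB 4 classes):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// One ReadWriteLock guards the whole collection: many concurrent readers,
// a single writer at a time. No segmented or structural locking.
public class LockedMap<K, V> {
    private final Map<K, V> delegate = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public V get(K key) {
        lock.readLock().lock();               // shared read lock
        try { return delegate.get(key); }
        finally { lock.readLock().unlock(); }
    }

    public V put(K key, V value) {
        lock.writeLock().lock();              // exclusive write lock
        try { return delegate.put(key, value); }
        finally { lock.writeLock().unlock(); }
    }
}
```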

That is all for now; more details in the next few days.

MapDB storage and design change (2022-11-09)

This article outlines recent changes in MapDB 4 development. I decided to refactor the Serializer interface and merge it into the Store.

Write amplification and allocation problem

MapDB has a big problem with write amplification. Random inserts into BTreeMap trigger a cascade of memory allocations and disk space fragmentation.

One solution is a data structure designed to absorb a large amount of writes gracefully. I experimented with Fractal Trees, where each tree node has a buffer for storing changes. However, the compaction algorithm for Fractal Trees is beyond my skills (I will write another blog post about this adventure).

Another way to solve write amplification is append-only log files. They have many great features: snapshots, versioning, and logs that are streamable over the network. But this approach requires a lot of disk space and is useless for memory-based systems.

After some failed experiments, I arrived at the key criteria for a new storage engine:

  • Fixed page size (such as 1024 bytes)
  • Very simple free space management, no recursion…
  • Absorb a large amount of writes into a single node, at the expense of temporarily slower reads
  • Combine append-only log files with update-in-place for space efficiency (RAM is limited)
  • Simple compaction; it must operate in tiny chunks, with no dependencies between pages
  • Support streaming, versioning
  • Cannot be tied to a single data structure (such as a tree). Must work on all data types such as JSON documents, arrays…
  • Multiple write operations must merge into a single write operation. For example, +1+1+1 becomes +3
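The last criterion, merging writes, can be sketched as a pending-delta map (illustrative, not a real MapDB class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WriteMerge {
    // Pending deltas per recid; successive increments merge arithmetically
    // before anything hits disk.
    private final Map<Long, Long> pending = new LinkedHashMap<>();

    void add(long recid, long delta) {
        pending.merge(recid, delta, Long::sum);  // +1 +1 +1 becomes +3
    }

    long pendingDelta(long recid) {
        return pending.getOrDefault(recid, 0L);
    }
}
```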

Store, Serializer, Collection

To make it work, I needed one big change in design.

The Store interface in MapDB is essentially a simple key-value map: Map<Pointer,Value>. The Store does not know what data type it stores. If BTreeMap wants to perform some operation (such as inserting a new key into a dir node), it has to take the value from the Store, modify it, and put the new value back into the Store.

That generates too much IO. I tried many hacks to make it faster for special cases (such as sorted arrays), but ultimately that is way too complex.
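The take-modify-put-back cycle can be illustrated with the Store modeled as a plain Map<Long, byte[]> (recid to binary record); this is a toy model, not MapDB code:

```java
import java.util.Arrays;
import java.util.Map;

public class TakeModifyPut {
    // Appending one byte to a record forces a full record read and a full
    // record write -- the IO overhead the post complains about.
    static byte[] insertByte(Map<Long, byte[]> store, long recid, byte newByte) {
        byte[] old = store.get(recid);               // take: full record read
        byte[] updated = Arrays.copyOf(old, old.length + 1);
        updated[old.length] = newByte;               // modify on heap
        store.put(recid, updated);                   // put back: full record write
        return updated;
    }
}
```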

Another problem with the take-modify-put-back approach is streaming. Write operations should produce small deltas. I tried to compare two binary versions, compress them with a common dictionary… Obviously that did not work. More hacks.

The solution is a simple design change! Take data interpretation away from the data structures (collections) and move it into the Store. So BTreeMap can insert a new key into a btree node, but should not know whether the node is represented as a sorted array. This hides data types (such as the BTree dir node) from the collection (the BTree).

This simple change also greatly simplifies the MapDB store design. It eliminates the unholy trinity of Serializer-Collection-Store. In BTreeMap I had to do many tricks with generics to hide from BTreeMap that dir nodes may be Long[] (or long[], ByteBuffer, or byte[]…).

Page Log Store

Now I can finally use a storage format that solves the write amplification problem.

Most databases use fixed-size pages. Unused space on a page is left empty to accommodate future writes. So if a new key is inserted into a BTree page, it does not trigger a new space allocation (like in MapDB 3), but simply uses the free space on the page.

Another cause of write amplification is array insertion. A BTree page uses a sorted array; an insert in the middle has to resize and overwrite the entire array (page). Constantly resizing and rewriting the array causes write amplification. My solution is to leave the original array untouched and write a delta log into the free space after the main array.

So that is the trick. The new store design uses free space on pages to write a delta log. This delta log is later merged with the main data on the page by a background worker.
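A minimal on-heap sketch of the delta-log page idea (all names illustrative; a real page would live in a fixed-size disk block):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class DeltaLogPage {
    long[] mainKeys = new long[0];                  // sorted, rewritten rarely
    final List<Long> deltaLog = new ArrayList<>();  // cheap appended writes

    void insert(long key) { deltaLog.add(key); }    // one small write

    // Reader view: main array merged with pending deltas (slower reads
    // until the background worker merges).
    SortedMap<Long, Boolean> view() {
        SortedMap<Long, Boolean> merged = new TreeMap<>();
        for (long k : mainKeys) merged.put(k, true);
        for (long k : deltaLog) merged.put(k, true);
        return merged;
    }

    // Background worker: merge deltas into the array in one rewrite.
    void compact() {
        SortedMap<Long, Boolean> merged = view();
        mainKeys = merged.keySet().stream().mapToLong(Long::longValue).toArray();
        deltaLog.clear();
    }
}
```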

This has some key benefits:

  • several updates of the same page within a short interval are merged into a single write (if several keys are inserted into a BTree node, the sorted array is only updated once with all the changes)

  • it reduces write amplification by several orders of magnitude at the expense of temporarily slower reads (until the background worker kicks in, the reader has to replay the write log to get the most recent version)

  • there is no complex free space allocator (fixed-size pages), yet unused space on pages is not wasted

  • “compaction” is very simple, localized and atomic; only single-page IO, with no traversal or inter-page dependencies

  • read and write IO is localized on a single page, which is great for cache locality etc…

  • it is universal and works on most shapes of data, not just sorted BTree nodes. For example, JSON documents have a similar write amplification problem and can use delta logs

  • it opens the road to new features (snapshots, isolation…)

  • the write log can be used for different scenarios (streaming over the network, write-ahead log…)

What is next?

I am working on a proof-of-concept implementation. It will provide a PageLogStore and a SortedMap<Long,Long> implementation. Once performance is verified, I will move on to MapDB 4 development again.

MapDB in February 2021 (2021-02-11)

I resumed work on MapDB a few weeks ago. A public repo and an alpha version should surface at the end of February. Here are some notes about MapDB development for this month.

Project organization

Many users like a single jar with no dependencies. MapDB could easily become a 20 MB jar with 20+ dependencies (like Eclipse Collections). So I decided to split MapDB into several smaller modules. Each module will be hosted in a separate GitHub repo under a new organization.

For now, I am working on the basic mapdb-core module. It contains the API and basic functionality equivalent to MapDB 1 (minus TX):

  • BTreeMap and HashMap
  • IndexTreeList (simulates a flat array, but uses a btree-like structure)
  • queues
  • a page-based store with snapshots and transactions
  • redesigned serializers

I am also slowly rebranding MapDB. I picked the cute little rocket back in 2012, but that is a bit obsolete now. It is time for a brand new, less dynamic logo, Map<D,B>, that captures good old boring Java values.

The licence will remain Apache 2 with no dual licensing, no proprietary parts and no other hooks. It is the most effective business model for a one-man consulting business. However, I will put less effort into documenting the internals of my work (storage format specs, integration tests…); a coauthor never materialized. In the next few months I will probably start an LLC.

Store and serializers

The current StoreDirect is designed to save space, and that brings several layers of complexity. I would like to support libraries like Protocol Buffers, Netty or gRPC. It should be possible to send serialized data directly into a network buffer with ByteBuffer.put().

MapDB uses serializers with DataInput and DataOutput, so using ByteBuffer is complicated. After some thinking I decided to eliminate the entire serialization layer and (de)serialize into ByteBuffers directly.
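Serializing straight into a ByteBuffer, without the DataInput/DataOutput layer, can look like this sketch (hypothetical helper, not MapDB API):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferSerializer {
    // Length-prefixed string written directly into the buffer; the same
    // buffer could then be handed to a network channel.
    static void writeString(ByteBuffer buf, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        buf.putInt(bytes.length);   // 4-byte length prefix
        buf.put(bytes);             // payload, no intermediate stream
    }

    static String readString(ByteBuffer buf) {
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```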

There is a new page-based store inspired by LMDB. It uses fixed-size pages and tightly integrates with serializers. The data structure is aware of the page size, so BTree nodes will split in a way that fits onto a page. Serializers will also be able to use free space on a page, for example for a write buffer or sparse node arrays (faster random btree inserts).

The PageStore is simpler to deal with. Compaction can use temporary bitmaps for spatial information and will not require a full store rewrite (like in MapDB 1 and 3). Snapshots, transactions, copy-on-write etc. will also be much easier to implement.

The new PageStore requires ByteBuffers. Older versions supported other backends such as RandomAccessFile or byte[], but those are gone. 32-bit systems will have to use a different store; from now on the focus is on memory-mapped files. Memory-mapped files in Java are tricky (JVM crashes with sparse files, file handles…), so this version will use very safe (and slower) defaults.

Data structures

I am working on the following data structures:

IndexTreeList uses a sparse BTree-like structure to represent a huge array (or list). Here it is used as the hash table for HashMap.

HTreeMap we all know and love. The first version will be fixed size (specified at creation) and optionally use 64-bit hashing. Concurrent segments are gone; they did not really scale all that well.

BTreeMap - the older version used a complicated non-recursive concurrent map. It was based on a paper I found impressive, but it had a complicated design. This release comes with a simplified version (recursion, a single ReadWriteLock, node deletion). The old concurrent version will move into a separate module. Both versions will share the same storage format (serializers) and will be interchangeable.

CircularQueue, LinkedFIFOQueue, LinkedLIFOQueue and LinkedDequeue: simple queue implementations with a single read-write lock. There is no support for map expiration yet; that will come in a separate module (mapdb-cache).

Concurrency

I tried to reach some sort of concurrent scalability with my own designs (segment locking, locking layers); that did not really work and is gone.

Instead, I merged locking into the store, and data structures will have an option to lock individual records by their recid. That is sufficient to ensure consistency in a concurrent environment.
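Per-record locking by recid might be sketched like this (the simplest unbounded form; a real store would likely stripe or bound the lock map):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class RecordLocks {
    // One lock per recid, created lazily on first use.
    private final ConcurrentHashMap<Long, ReentrantLock> locks = new ConcurrentHashMap<>();

    void lock(long recid) {
        locks.computeIfAbsent(recid, r -> new ReentrantLock()).lock();
    }

    void unlock(long recid) {
        locks.get(recid).unlock();
    }

    boolean isLocked(long recid) {
        ReentrantLock l = locks.get(recid);
        return l != null && l.isLocked();
    }
}
```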

By default, all data structures will come in two flavors: with no locking, or with a global read-write lock (single writer, multiple readers).

To support concurrent scalability on 16+ cores, all data structures will support Java 8 fork-join framework, parallel streams, producers and consumers.

Older versions had a Data Pump to import large BTrees; this will be extended to support other collections as well (HashMap, queues…). It will be possible to analyze a large map (10B entries) with 64 CPU cores and dump the result into another collection at 2 GB/s.

In the future, I will focus on Java 8 Streams and similar filtering APIs to analyze data that does not fit into memory (external merge sort etc.).

Roadmap

With breaks, I have worked on storage engines for 12 years (H2, JDBM 2 and 3, MapDB 1 through 4). I have many notes, unfinished branches, unit tests, emails, bug reports and ideas. So the big task for February is to compile all this into some sort of list. At the end of February I will make an alpha release, to have something to work with.

Once the code and ideas are out, I can get some feedback and establish a roadmap. If everything goes well, we could have a beta release at the end of March.

Funding

I support this project from my consulting work. It is enough to keep the lights on, but things could move faster. If you feel like it, support me on GitHub or throw me some crypto.

That is all for now.

MapDB 4 update (2019-03-14)

I resumed work on MapDB 4 recently. Here is an update about my progress and plans.

What is finished

Not much. I have many notes, some code and design.

Plans for March

  • Restart work on MapDB 4
    • get skeleton with most features
    • release milestone 1 (see below)
    • get back into increment development
  • Catalog all unit tests for MapDB (and java collections)
    • port tests into MapDB 4 code
  • Create a tool to export/import data from MapDB 3
  • Read all Issues on Github and create Roadmap
  • Maintenance release for MapDB 3 and maybe other versions
  • Fix website and documentation

MapDB 4 milestone 1

  • This milestone should flesh out my design ideas into code
  • Limited usability
    • fixed store size (2GB)
    • several heap memory leaks (support data structures will use onheap collections)
  • use fixed-size ByteBuffers (in memory, mmap files)
    • 2GB or less (the limit for a single ByteBuffer)
    • store size changes are responsible for many complications
  • will be single threaded
    • single writer, multiple readers (ReadWriteLock)
  • will include snapshots
    • snapshots play major role in MapDB4 design
    • COW, using heap collections
    • old snapshots discarded by GC
  • very basic durable TX
    • store will use file swap
  • should include most collection types (Map, List, Queues)
    • needed for benchmarking, even if initial performance is bad
  • primitive maps from Eclipse Collections backed by ByteBuffer
    • needed for temporary internal structures to track free space etc..

Compatibility with older versions

  • MapDB Backup Format
    • there will be new format for backups, imports and exports
    • JSON based
    • MapDB 1,2,3,4 will get exporters
      • modify old code bases and make new releases with the modifications
    • MapDB 4 will have importer from this format
    • Later, maybe support for other DBs (LevelDB, Redis, SQL, Cassandra…)
  • command line tools to convert files into new MapDB 4 format

  • MapDB 4 will use different package names
    • it will be possible to use MapDB 4 with older versions when full class names are used (org.mapdb.DB)
    • class names will not conflict with other versions
    • org.mapdb will stay empty
    • org.mapdb.volume (introduced in MapDB3) will not be used
    • org.mapdb.serializer was renamed to org.mapdb.ser

Major changes in MapDB 4

  • eliminate class inheritance
    • always use interfaces
    • specialized classes generated by code generator
    • inheritance causes JIT performance issues with multiple inherited classes
    • package size will skyrocket
      • 10MB is ok, 20MB maybe, 50MB bad
      • I know Scala and issues it causes on Android
  • Volumes are gone
    • it was an IO layer abstraction (ByteBuffer, RandomAccessFile)
    • it is not possible to wrap different types of IO into a single interface
    • use code generator to inline IO directly into Store
      • many store implementations (StoreDirect_RandomAccessFile, StoreDirect_ByteBuffer)…
  • Elsa and other POJO serializations are gone from default config
    • Default serializer will only support basic JVM data types (Long, String…)
    • POJO serialization is too complex to handle with default configuration
      • class rename, performance overhead….
      • POJO serialization must be configured separately

Comments:

Pranas • a year ago

Sounds like real-logic/agrona could love both 2G limitations (buffers) and provide small packages size (primitive collections to replace Eclipse collections) and provide some more concurrency wise clever primitives.

Money Manager • 2 years ago

Will JSON handling be added as well ? XPATH like query etc.

Pranas -> Money Manager • a year ago

https://commons.apache.org/… could be used as wrapper on top of the stores. It’s nice not to have bloated library …

Log Store format and compaction (2018-02-12)

Designing reliable append-only stores is hard. This blog post outlines the design of a new append-only store for MapDB 4. It has the following features:

  • Append-only log based files. Old records are never overwritten
  • Compaction runs in a background process and does not block IO
  • Lock-free snapshots and isolation
  • Parallel writers
  • Transactions with rollback

The Store is the basic low-level storage layer in MapDB. It maps a recid (record ID) to a binary record (byte[]). Here is the interface definition:
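The interface snippet did not survive in this feed; below is a hedged reconstruction from the description above (recid to byte[]), plus a toy heap-backed implementation for illustration. It is not the actual MapDB source.

```java
import java.util.HashMap;
import java.util.Map;

// Reconstructed sketch: Store maps recid -> binary record.
public interface Store {
    long put(byte[] record);                 // insert, returns new recid
    byte[] get(long recid);                  // read most recent version
    void update(long recid, byte[] record);  // replace record
    void delete(long recid);                 // remove record

    // Toy in-memory implementation, for illustration only.
    class Heap implements Store {
        private final Map<Long, byte[]> records = new HashMap<>();
        private long nextRecid = 1;

        public long put(byte[] record) {
            long recid = nextRecid++;
            records.put(recid, record);
            return recid;
        }
        public byte[] get(long recid) { return records.get(recid); }
        public void update(long recid, byte[] record) { records.put(recid, record); }
        public void delete(long recid) { records.remove(recid); }
    }
}
```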

Files

LogStore uses a directory with multiple subdirectories:

  • data - contains data files
  • blob - stores large records that are too big to be stored directly in the log file
  • generations - file generation markers, used for replay

Data files

  • Each data file has a maximal size
    • store-wide, configured at store creation (typically 128MB)
    • the maximal value is 2GB (ByteBuffer limit); the file offset is a signed int
  • New records are appended to the end of the file
  • When a file becomes too big
    • it is closed and
    • a new file is started
  • Each Data File has a 4-byte number
    • file numbers start from 0 and increment for each file
    • a newer file has a higher file number

Blob files

  • Some records are too big for the log
    • compaction moves data around, so large records would cause overhead
  • Large records are stored in separate files, in a dedicated ‘blob’ directory
  • The cut-off size is configurable at store creation, typically 64KB

File Generations

  • The file number is an N-bit unsigned long
    • N depends on the maximal Data File size; together they form an 8-byte long
  • Over time this number overflows and starts again from #0
  • To deal with overflow, ‘file generations’ are used
    • They ensure that data file #0 is deleted by the time the number overflows
    • They find the oldest log file for log replay (data file #0 might not be the oldest)
  • There are 64K file generations, the highest two bytes in the file number
  • Generations form a cyclic buffer
    • Used generations are continuously allocated
      • As new files are added and old ones deleted, this allocated window moves through the cyclic buffer
    • There is a ‘hole’ in the cyclic buffer; the log starts at the upper end of this hole
      • For example, if the used generations are 0-1, the hole is 2-64K and the upper end of the hole is 0
  • Each generation has an empty file in the ‘generations’ directory
    • if a file with the given name exists, the generation is used
  • A new Data File can only be created if its generation is used
  • A new Data File cannot be created if there is no hole in the generation cyclic buffer
    • it would be impossible to find the oldest Data File and the start of the log

Log Entries

  • Each data file is composed of Log Entries
  • The log starts at the beginning of the file; log replay traverses the file from start to end
  • Each Log Entry starts with a 1-byte header that identifies its type

Log Entry Headers (they will most likely change in the final version):

  • 0 - invalid header
    • reserved to detect data corruption
  • 1 - eof
    • end replay and skip to next file
  • 2 - skip single byte
    • used for padding
  • 3 - commit
    • written by tx commit
    • apply changes from last commit
  • 4 - rollback
    • written by tx rollback
    • discard changes since last commit
  • 5 - record
    • followed by 4 byte signed int (record size) and
    • 8 byte recid
    • byte[] that contains record
  • 6 - blob record
    • followed by 8-byte recid
    • followed by 8-byte long, blob file number
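Encoding a type-5 ‘record’ entry per the layout above can be sketched as follows (header values may change in the final version, as noted):

```java
import java.nio.ByteBuffer;

public class LogEntry {
    static final byte RECORD = 5;  // header type 5: record entry

    // Layout: 1-byte header, 4-byte signed size, 8-byte recid, record bytes.
    static ByteBuffer writeRecord(long recid, byte[] data) {
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + 8 + data.length);
        buf.put(RECORD).putInt(data.length).putLong(recid).put(data);
        buf.flip();
        return buf;
    }
}
```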

Compaction

  • Compaction removes older versions of records and reclaims space.
  • It takes an old Data File and moves active records into a newer Data File; the old file is then deleted

Algorithm:

  • Select file #N
    • a big part of the file should be unused; some file stats are needed
  • Replay the file, iterating over all records
  • If a record is active (its recid is in the Index Table)
    • append this record to the new file
      • use FileChannel#transferTo()
    • if not, ignore this record
  • at this point the file should have no active records
  • unmap and delete the file
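A toy version of the compaction algorithm above, with files modeled as recid lists and the Index Table as a Map (all illustrative; the real store would copy bytes with FileChannel#transferTo()):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Compactor {
    // indexTable: recid -> file number holding the live version
    static List<Long> compact(long oldFileNo, List<Long> oldFile,
                              Map<Long, Long> indexTable, long newFileNo) {
        List<Long> newFile = new ArrayList<>();
        for (long recid : oldFile) {
            // active = index table still points this recid at the old file
            if (Long.valueOf(oldFileNo).equals(indexTable.get(recid))) {
                newFile.add(recid);                 // append to new file
                indexTable.put(recid, newFileNo);   // repoint index entry
            } // else: stale version, ignore
        }
        return newFile;  // old file now has no active records, can be deleted
    }
}
```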

TODO: can transaction spread over multiple files?

Index Table

  • The most recent version of a record is determined by the Index Table
  • The Index Table maps a Recid (long) to a File Position (long)
  • The Index Table is updated when a record is inserted, updated or deleted
  • When the store is opened, all Data Files must be replayed to reconstruct the Index Table
    • the Index Table can be saved every Nth file; replay will then require only a few Data Files
  • Snapshots make an immutable copy of the Index Table

  • The Index Table can be stored in
    • an on-heap Map<Long,Long>
    • a memory-mapped flat temporary file, offset=recid*8
    • some sort of persistent (crash-resistant) file, to prevent a full log replay
JetBrains Xodus - code overview (2018-02-09)

I started exploring alternative DB engines. I want to find interesting algorithms, new ideas and perhaps some new code for MapDB.

Xodus

The first engine I review is JetBrains Xodus. It was developed for JetBrains YouTrack and later open-sourced under the Apache 2 license. It powers YouTrack in production; I could not find other users. It started in 2010 and is production ready.

Main characteristics

  • Log based storage
  • BTree and Patricia Prefix Tree indexes
  • Hybrid between memory/disk store
    • needs a lot of heap for caches and internal data structures
  • Strong lock-free snapshot isolation
  • Entity store (object with properties) with some query capabilities
  • Stable, but not much documentation
  • Written partly in Kotlin (as MapDB :-))

This blog post is just a code walkthrough with my initial impressions. Future blog posts will address architecture, log, indexes, performance and transactions.

There are many code references; you can jump to them by pressing Ctrl+N in IDEA.

Benchmarks

  • Simple benchmarks for key-value stores
  • Benchmarked engines: Xodus, Chronicle, H2 MVStore, LMDB, MapDB, Persistit, Tokyo Cabinet
  • Uses JMH, which is mostly usable for microbenchmarks and not really suitable for DB benchmarking
  • No long running stress tests

Utils

Safe ByteBuffer cleaner

The JVM releases memory-mapped ByteBuffers only after GC, which causes file handle leaks and could cause a JVM crash. TODO link

There is a ‘cleaner hack’, but in MapDB it only works on OpenJDK 7 and 8; it does not work on OpenJDK 9 or Android. This cleaner has branches for Android and OpenJDK 9 and probably works there.

jetbrains.exodus.util.SafeByteBufferCleaner

Packed long

Xodus uses packed longs in a similar way as MapDB does:

jetbrains.exodus.log.CompressedUnsignedLongByteIterable#getLong(jetbrains.exodus.ByteIterator)

Packed long iterator, get and skip:

jetbrains/exodus/log/CompressedUnsignedLongByteIterable.java:150
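For illustration, here is a generic 7-bits-per-byte packed-long codec in the same spirit; this is the common varint scheme, not necessarily the exact Xodus or MapDB byte layout:

```java
import java.io.ByteArrayOutputStream;

public class PackedLong {
    // Encode an unsigned long, 7 data bits per byte, high bit = continuation.
    static byte[] pack(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
            value >>>= 7;
        }
        out.write((int) value);                       // last byte, high bit clear
        return out.toByteArray();
    }

    // Decode bytes produced by pack().
    static long unpack(byte[] bytes) {
        long value = 0;
        for (int i = 0, shift = 0; i < bytes.length; i++, shift += 7) {
            value |= (long) (bytes[i] & 0x7F) << shift;
        }
        return value;
    }
}
```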

Detect JVM version

Code to detect JVM version

jetbrains.exodus.system.JVMConstants

Privileged code execution

Executes a piece of code with extra privileges. MapDB does not handle privileged code execution; it will be interesting to see how the db works on restricted JVMs.

jetbrains.exodus.util.UnsafeHolder#doPrivileged$production_sources_for_module_utils_main

Hex utils

Convert to/from a hex string

jetbrains.exodus.util.HexUtil

VCDiff

VCDIFF-like delta encoding algorithm; wiki.

jetbrains.exodus.compress.VcDiff

VLQ

Variable-length quantity universal code; wiki.

jetbrains.exodus.compress.VLQUtil

Spin Allocators

Xodus caches/pools many short-lived objects (such as byte[]). The keyword for a code search is Spin Allocator. I am not sure how effective object pooling is on modern JVMs; the general consensus is that caching/pooling has higher overhead than GC. Xodus started in 2010; the Git history on these classes goes back at least to 2013.

Log

The Log is a very interesting part (for MapDB): jetbrains.exodus.log.Log

Storage description:

  • Xodus stores data in rolling log files. Log entries (pages) are identified by one-byte headers.
    BTree headers are here: jetbrains.exodus.tree.btree.BTree#loadRootPage.

  • All files are fixed size. If an inserted record does not fit into the file size, the file is padded with empty entries and a new file is started.

  • Only small records (pages) are stored directly in the log. Large records (pages) are placed into a separate file in a dedicated directory (blob storage)

  • Each record (page) is identified by an 8-byte address that identifies the log file and its offset.

Some code:

  • Consistency check: jetbrains.exodus.log.Log#checkLogConsistency

  • Set log end, discard higher files. Used for rollback and tx flush?: jetbrains.exodus.log.Log#setHighAddress

  • Append record (page), return address: jetbrains.exodus.log.Log#write(byte, int, jetbrains.exodus.ByteIterable)

  • Example of log replay: jetbrains.exodus.log.Log#getFirstLoggableOfType

  • Flush and close: jetbrains.exodus.log.Log#flush(boolean)

  • File remove: jetbrains.exodus.log.Log#removeFile(long, jetbrains.exodus.io.RemoveBlockType)

  • Log garbage collection job: jetbrains.exodus.gc.CleanWholeLogJob#execute

Indexes

  • BTree and Patricia Prefix Tree are the two indexes in Xodus.
  • They store only byte[]; there is no way to use a custom comparator.
  • There is prefix and delta compression, probably to reduce space usage for Entities

Code:

  • Binary search in BTree: jetbrains.exodus.tree.btree.BasePageImmutable#binarySearch(jetbrains.exodus.ByteIterable, int)

  • Save BTree Page: jetbrains.exodus.tree.btree.BasePageMutable#save

  • BTree Page put: jetbrains.exodus.tree.btree.BottomPageMutable#put and jetbrains.exodus.tree.btree.InternalPageMutable#put

  • BTree Page delete: jetbrains.exodus.tree.btree.BottomPageMutable#delete and jetbrains.exodus.tree.btree.InternalPageMutable#delete

New record types (2017-12-11)

StoreDirect in MapDB 3 is a very simple key-value store. It takes an object (record), serializes it into binary form, and stores it in a file. (It is a bit more complex than that, but for the purposes of this article…)

The update mechanism for a record is:

  • serialize the on-heap object into a binary byte[]
  • check if the old record has the same size; if yes, update in place (no allocation needed)
  • if the binary size differs, allocate new space
  • write the record
  • release the old space

The get operation is also trivial:

  • find the record position in the file
  • load it into a byte[]
  • deserialize it into an on-heap object

This approach kind of works, but in many usages you run into performance issues. A good example is BTrees:

  • to add a single key, you need to rewrite the entire BTree node, including the other keys that are part of that node
  • in counted BTrees (each dir node contains a counter) you need to rewrite several nodes just to increment a single int

It has been suggested that BTrees are not good for disk storage applications for this reason.

And once you get into more exotic data structures (linked lists, hash trees, counted btrees, compressed bitmaps, queues…), similar problems become way too common. A data structure that works well on-heap (and on paper) becomes a performance nightmare when implemented in a database.

Solution?

There are several workarounds for these performance problems. The BTree alone has dozens of reimplementations that try to fix write and read amplification. I studied some, but I believe they just introduce more complexity without really fixing the performance problems.

Take a Fractal BTree for example. It solves write amplification by caching keys as part of the top BTree nodes. There is a quite complicated algorithm where changes slowly trickle down from the top nodes to the leaves.

In MapDB I spent several months trying to solve performance issues related to write and read amplification: partial deserialization, binary acceleration, hashes produced from serialized data… This introduced a lot of complexity into the storage. When coupled with caching, transactions, snapshots… it is just a nightmare.

But now I think the problem is somewhere else. The problem is not with data structure design, but with the limited features of most database engines (or the Store interface in MapDB).

In MapDB I control the full stack: serializers, caching, commits, binary formats, memory representation… I can afford some creativity when inserting data.

Rather than redesigning each data structure separately, I can identify common problems, and redesign my database engine to handle those cases efficiently.

And if I really have to implement a data structure such as a Fractal Tree, MapDB should have features to make it easy.

New record types

Currently MapDB has three record types:

  • tiny records (<5 bytes) packed directly into the index table instead of an offset
  • single small records with size up to 64KB
  • linked records (a linked list) with size up to 1GB (theoretically unlimited)

The Store exposes those to the outside world (to Serializers) as a monolithic byte[]. Small records have zero-copy optimizations (the serializer reads directly from the mmap file). But for huge linked records, I have to allocate and copy a huge byte[] during deserialization.

So for a start I decided to redesign how the Serializer interacts with the Store and binary data. I want to break down the monolithic byte[] and single-step (de)serialization into a dance of callbacks and incremental changes.

Some of those changes are already there. For example, MapDB 3 has ‘binary acceleration’, where BTreeMap performs binary search on its nodes without full deserialization. But this redesign should address common problems, rather than add ad-hoc optimizations.

I also want to make record types more abstract, with the option to add new record types in the future. This should eventually handle stuff like compression or hashing, which is now part of the serializers.

Huge records (aka linked records)

Current linked records have the following structure:

  R1 - first chunk of data, link to R2
  R2 - second chunk of data, link to R3
  R3 - third chunk of data, end of record

Deserialization is slow and requires two passes:

  • traverse the linked record to find its size
  • allocate a byte[]
  • traverse the linked record again and copy it into the byte[]
  • pass the byte[] to the deserializer
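The two-pass read above can be sketched in plain Java. This is a hypothetical simulation, not MapDB's actual Store code: the store is modeled as a Map from record id to a chunk holding data and a link to the next record (0 = end of record).

```java
import java.util.*;

// Sketch of the current two-pass linked-record read (hypothetical code).
class LinkedRecordRead {
    static class Chunk {
        final long next;      // record id of next chunk, 0 = end of record
        final byte[] data;
        Chunk(long next, byte[] data) { this.next = next; this.data = data; }
    }

    static byte[] read(Map<Long, Chunk> store, long recid) {
        // pass 1: traverse the linked record to find its total size
        int size = 0;
        for (long r = recid; r != 0; r = store.get(r).next)
            size += store.get(r).data.length;
        byte[] out = new byte[size];          // allocate byte[]
        // pass 2: traverse again and copy each chunk into the byte[]
        int pos = 0;
        for (long r = recid; r != 0; r = store.get(r).next) {
            byte[] d = store.get(r).data;
            System.arraycopy(d, 0, out, pos, d.length);
            pos += d.length;
        }
        return out;                           // this is what the deserializer sees
    }
}
```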

In the new form the links will be stored on the first node:

  R1 - link to R2, size of R2, link to R3, size of R3...
  R2 - first chunk of data
  R3 - second chunk of data
  R4 - third chunk of data

This allows some interesting features:

  • size can be determined by reading only the first node; no need to traverse the entire record (traversing 2GB in 64KB chunks can take a long time) to find the size
  • deserialization can be incremental; only smaller chunks need to be loaded into memory
  • we can do in-place binary modification (change the Nth byte) without reinserting the entire record
  • a binary record can expand in the middle (just resize R3), so adding an element into the middle of an array is possible
  • we can efficiently append to the end of a record, without reinserting old data
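A minimal sketch of the proposed first-node index (the names `IndexedRecord`, `totalSize` and `locate` are hypothetical, not MapDB API): the first record holds (link, size) pairs, so size is known without traversal and the chunk holding the Nth byte can be found directly.

```java
// Hypothetical sketch of the new layout: the first node is an index of
// (link, chunkSize) pairs; data chunks live in separate records.
class IndexedRecord {
    final long[] links;   // record ids of data chunks R2, R3, ...
    final int[] sizes;    // size of each chunk, stored next to its link
    IndexedRecord(long[] links, int[] sizes) { this.links = links; this.sizes = sizes; }

    // total size is known from the first node alone, no traversal needed
    long totalSize() {
        long s = 0;
        for (int sz : sizes) s += sz;
        return s;
    }

    // find which chunk (and offset inside it) holds the Nth byte,
    // enabling in-place modification of a single chunk
    long[] locate(long n) {
        for (int i = 0; i < sizes.length; i++) {
            if (n < sizes[i]) return new long[]{links[i], n};
            n -= sizes[i];
        }
        throw new IndexOutOfBoundsException();
    }
}
```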

Partial update

Take a Counted BTree (Interval Tree) or an Authenticated Skip List.

  • In both cases there is a larger node with a small counter.
  • The node is relatively static, but the counter is updated frequently.
  • The node and counter should be collocated for performance reasons:
    • memory locality
    • deserialization needs to be fast

The solution is to store both node and counter in the same record, but on update overwrite only a small fraction of the binary record:

  • let's say the node consumes 2000 bytes
  • the counter is the first 4 bytes of the node
  • a counter update always rewrites the first 4 bytes
  • this is done using a special action in the serializer, with a function passed to the store on update
  • a counter update does not require serialization of the entire node
  • deserialization (the get operation) always sees the most recent binary version
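The counter trick can be sketched over a plain byte[] (`PartialUpdate` is a hypothetical helper, not MapDB API): the update touches only the first 4 bytes, while the remaining ~2000 bytes of the node stay untouched.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: the counter lives in the first 4 bytes of the
// node's binary form, so an update rewrites only those 4 bytes in place.
class PartialUpdate {
    static void setCounter(byte[] record, int counter) {
        // overwrite bytes 0..3 of the record, rest of the node is untouched
        ByteBuffer.wrap(record, 0, 4).putInt(counter);
    }
    static int getCounter(byte[] record) {
        return ByteBuffer.wrap(record, 0, 4).getInt();
    }
}
```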

Write Buffer

Fractal Trees have huge nodes (MBs); each node has a write buffer which stores changes to keys.

Or in other terms, imagine a large compressed array with frequent updates.

  • Updating in place is not practical (decompress, update and compress the array again).
  • Decompression is much faster than compression.
  • We store changes in a write buffer.
  • Every N updates we flush the write buffer:
    • read the old version
    • apply changes from the write buffer
    • compress the new version and update the database
  • get operation:
    • read the old version
    • apply changes from the write buffer
    • return the current version
    • performance is ok, since decompression is much faster than compression

In binary terms it will be implemented in the following way:

  • the record has allocated space, let's say 64KB
    • this allocation rarely changes; one of the goals is to prevent frequent allocations
  • the first 50KB are reserved for the array itself (the static part)
  • the last 14KB are for the write buffer
  • the write buffer is append-only
    • there is a pointer to the end of the buffer
    • a new write operation (such as delete or add key)
      • appends information to the write buffer
      • and increases the pointer
  • once the buffer is full (or some condition is met), the record is rewritten and the buffer is flushed
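The mechanics can be sketched as follows; compression is elided to keep the example short, and the class name and sizes are hypothetical. The record holds a static sorted array plus an append-only buffer of pending keys; reads merge both, and a flush rewrites the static part once the buffer fills.

```java
import java.util.*;

// Sketch of the write-buffer scheme (compression elided for brevity):
// a static "compressed" part plus an append-only buffer of pending adds.
class BufferedRecord {
    long[] staticPart = new long[0];                  // static part of the record
    final List<Long> writeBuffer = new ArrayList<>(); // append-only, pointer = size()
    final int bufferLimit;

    BufferedRecord(int bufferLimit) { this.bufferLimit = bufferLimit; }

    void add(long key) {
        writeBuffer.add(key);                 // append + advance pointer
        if (writeBuffer.size() >= bufferLimit)
            flush();                          // buffer full: rewrite record
    }

    // flush: read old version, apply buffered changes, rewrite static part
    void flush() {
        staticPart = merged();
        writeBuffer.clear();
    }

    // get path: old version + buffered changes = current version
    long[] merged() {
        long[] out = Arrays.copyOf(staticPart, staticPart.length + writeBuffer.size());
        int i = staticPart.length;
        for (long k : writeBuffer) out[i++] = k;
        Arrays.sort(out);
        return out;
    }
}
```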

Virtual Records

Some records do not really store any information, but are just used as a cache. Their content can be reconstructed from other records. A good example is the counter attached to a collection: it caches the number of elements in the collection so that Collection#size() is fast, and its content is reconstructed by counting all elements in the collection.

  • Content is temporary and frequently updated.
  • Updates do not have to be durable, since content can be reconstructed.
  • The value of a Virtual Record can be kept in memory; it does not have to be stored in the file.
  • A Virtual Record can be excluded from the Write Ahead Log (it is reconstructed in case of a crash).
  • A Virtual Record cannot be stored on the heap, but must be part of the Store:
    • read-only snapshots would conflict with multiple instances
    • rollback needs access to the Virtual Record to revert to an older value
  • The Serializer can provide the initial value:
    • for a collection counter the Serializer would need access to the content of the collection (possible cyclic dependency)
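A virtual record can be sketched as a cached in-memory value plus a reconstruction function (hypothetical names, not MapDB API). After a crash or rollback the cached value is dropped and rebuilt on the next read.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a virtual record: the value lives in memory,
// is excluded from durability, and is reconstructed on demand.
class VirtualRecord<T> {
    private T value;                        // cached value, part of the Store
    private final Supplier<T> reconstruct;  // e.g. count all collection elements
    VirtualRecord(Supplier<T> reconstruct) { this.reconstruct = reconstruct; }

    T get() {
        if (value == null) value = reconstruct.get();  // rebuild after crash/rollback
        return value;
    }
    void set(T v) { value = v; }            // frequent, non-durable update
    void invalidate() { value = null; }     // e.g. on rollback
}
```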

External File

StoreDirect does not support huge binary records well, even with the new storage format. Frequent updates in a large file may trigger the allocator and cause fragmentation. Also, moving large records through the Write Ahead Log will slow down commits. So it makes sense to optionally move large records into separate files.

  • the binary Store contains only the file name
  • the user can configure the directory where files will be stored
    • relative or absolute
    • a Map (index) can have keys on a fast SSD and large values on a slow spinning disk
  • Transactions
    • an updated record creates a new file
    • file sync runs in the background
    • commit can sync multiple files in parallel
    • the Write Ahead Log only contains the new file name
      • it is smaller, with much faster log replay
      • large values completely bypass the WAL and do not have to be copied multiple times
  • Deserialization can just use a buffered FileInputStream
    • does not load all content into memory
    • much easier to partially modify, expand or shrink a record in an external file
  • Option to specify a file prefix and suffix
    • images (or other files) stored in MapDB can be opened directly by external programs
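A minimal sketch of the idea, with hypothetical names (`ExternalFileStore`, `put`, `get` are not MapDB API): the store keeps only the file name, an update creates a new file, and reads stream from disk instead of loading everything into memory.

```java
import java.io.*;
import java.nio.file.*;

// Hypothetical sketch of external-file records: the binary Store keeps
// only the file name; large values go to a configurable directory.
class ExternalFileStore {
    final Path dir;
    ExternalFileStore(Path dir) { this.dir = dir; }

    // an update creates a new file (the old one stays valid for rollback)
    String put(byte[] value) throws IOException {
        Path f = Files.createTempFile(dir, "rec", ".bin"); // prefix/suffix configurable
        Files.write(f, value);
        return f.getFileName().toString();  // only this name goes into the Store
    }

    // deserialization streams from disk; content is never fully in memory
    InputStream get(String name) throws IOException {
        return new BufferedInputStream(Files.newInputStream(dir.resolve(name)));
    }
}
```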

Compression

Currently compression is handled by the serializer. That complicates caching etc. It makes sense to introduce compression as a new type of record. This opens some interesting scenarios for background operations:

  • Store#update will store values in decompressed form
  • a background operation will compress those later
    • it might try different types of compression (different dictionaries) to see which works best
  • a background operation might even train a new type of dictionary
    • and recompress the entire store in the background (works very well for XML chunks)
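The compress/decompress step as a record-level concern can be sketched with java.util.zip (the class and its role are hypothetical; MapDB's actual codec may differ). A record type flag would say whether a record is raw or compressed, and the background pass converts raw records to compressed ones.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.*;

// Hypothetical sketch: compression handled at the record level, so a
// background pass can compress raw records without involving serializers.
class CompressedRecord {
    static byte[] compress(byte[] raw) {
        Deflater d = new Deflater();
        d.setInput(raw);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        d.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] comp, int rawSize) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(comp);
        byte[] out = new byte[rawSize];
        int n = 0;
        while (n < rawSize) {
            int k = inf.inflate(out, n, rawSize - n);
            if (k == 0) break;  // finished or needs more input
            n += k;
        }
        inf.end();
        return out;
    }
}
```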

Checksum

Right now the checksum is handled by the serializer. It should be handled by the store as a new type of record.

Incremental updates

A big priority for me is to support partial updates in the Write Ahead Log. If a large record is updated, only the changed part of it should pass through the Write Ahead Log.

Also, if a single part of a record is updated frequently, multiple updates should be merged together and only every Nth update should hit the file. The changes should be cached in memory somehow.

When a record is stored in an external file and rollback is possible, there should be some mechanism to update the file incrementally. Perhaps provide a FileInputStream with a write overlay stored in a separate file.

Changes in serializers

Serializers in MapDB 3.0 already do more than just convert data from/to binary form:

  • hash code
  • comparators
  • compact in-memory representation for arrays (Long[] versus long[])
  • delta packing
  • binary search in binary form

Serializers will be redesigned in MapDB 4.0 to handle even more:

  • provide diff between two objects for partial updates
  • allocator size hints
  • partial updates
  • do more operations over binary data

Also they will be more flexible, with less grouping. I will write a separate blog post about those changes.

What will be in MapDB 4.0

  • I want good support for large records from the beginning
  • DB and Makers will have extra options for how to store records
    • it will be possible to compress values in maps, place them into external files…
  • current large records (over 64KB) will get a new format
  • External File records will be there as well
  • Store and Serializer will get some extra hints on how to store a record (e.g. place it in an external file)
  • the StoreDirect format will change to support new record types in the future
    • an extra byte, so 255 possible record types
  • basic compression and checksums
  • more functions that operate directly over binary data

Comments

Frolov Aleksey • 2 years ago

Where is secondary key from mapdb v1/v2?

MapDB 4 and near future (2017-08-15) — https://mapdb.org/blog/mapdb4

I decided to start a new major MapDB version. The master branch was already refactored and tagged as MapDB 4.

MapDB 3

  • MapDB 3 was announced more than 18 months ago.
  • The current stable branch is 3.0.
  • The dev branch 3.1 is cancelled.
    • I started backporting changes to 3.0.x releases;
    • for example, 3.0.5 had a major performance improvement that reduced lock overhead.
  • The 3.0 branch will be maintained until 4.0 is released and becomes stable enough (most likely December 2017).

  • I decided to start a new version because
    • some features require a format change (external files for large records, extended records)
    • API changes; a lot of refactoring
    • changes in core classes (DBMaker, Serializers)
    • some parts were rewritten (Write Ahead Log, Volumes)

Major news in 4.0

  • Format change in the StoreDirect format
    • better support for huge records
    • transparently put large records into external files
      • large records will bypass the write-ahead log, while preserving durability
    • better support for checksums and encryption
    • lazy streaming of large records (right now a record is loaded into a byte[])
  • full support for zero copy
    • the deserialization input stream reads directly from the mmaped file
    • write-ahead log
  • redesign of Volumes (file IO)
  • format change
    • support for values in external files
    • unified header
    • format evolution
      • old features will be deprecated, but not removed
  • way more automated tests
    • backward compatibility; the format spec will be part of the tests
  • MapDB will integrate with several libraries
    • it will be able to export/import data to Hadoop file formats, Spark…
    • I do not like several tiny Maven projects, so everything will be in the MapDB artifact (or perhaps mapdb-extra)
      • in a separate package, which might later move into separate jar files
    • the MapDB artifact will depend on several libraries,
      • but those will be optional compile-time deps
      • the user will be responsible for providing those
  • integration with libs and extras

  • MapDB will unify various types of collections
    • Spark-like
    • Chronicle-like
    • primitive collections over flat arrays (or memory-mapped files)
  • support for Streams and Parallel Streams

Changes in development

  • I kept too tight a grip on MapDB and tried to make it perfect; that made development too slow
  • way more blog posts
    • comments on various projects, algorithms, papers
    • a staging place for documentation;
      • a new feature will first be documented in a blog post for comments, then moved into a separate chapter
  • YouTube channel
    • screencast videos to walk through code in an IDE (very fast to produce, good for a quick introduction)
  • change in the way documentation is made
    • bullet-point-oriented format
      • very fast to make, very readable
      • Antirez from Redis originally used this format
      • contributors are welcome to re-edit and polish the documentation
    • more code-oriented
      • code examples will be written first, before the code
  • change in the release cycle
    • MapDB 4 is the last major release
    • various formats will be introduced, and deprecated, but never removed
    • new formats (or collections) will start with a new file header, and use different implementations
    • a new minor (4.X) version will be out every month
    • integration tests take about a week to finish
      • a dedicated machine will run integration tests nonstop
      • so every week there will be a stable snapshot release or a minor (4.0.X) bugfix release
  • changes in unit tests
    • way more unit tests
    • test the full matrix of all configuration options; CPU is cheap
    • concurrency stress tests
    • performance regression testing (the MapDB 3 release was a disaster)
    • test storage format compatibility (can read and modify files generated by the older 4.0.0 release)

Roadmap for next 3 months

  • the first priority is to finish the Elsa serialization library, but the final version will be released together with MapDB 4

  • MapDB 4.0 should be out at the end of October
    • with the features of MapDB 3, but without open TODOs (missing compaction)
  • there will be many blog posts describing my progress on MapDB

  • a semi-stable release (one that passes acceptance tests) should be out every week

New features after 4.0 release

I have a very long list of ideas. So I will go through my bookmarks and notes, and put everything into a series of blog posts.

So far the most requested features are:

  • extra collections to support cryptography and blockchain applications
    • authenticated Merkle tree (immutable, fast creation with data pump)
    • authenticated skip list, already written for IODB
  • LSM Store based on IODB
    • supports snapshots
    • supports branching (the same way Git or another VCS does)
  • data pump for everything
    • including HashMap
    • fast creation is important for merge algorithms
  • Spark compatibility
    • Spark Data Frames are a functional data transformation language
      • they also define how data should be partitioned to fit into memory on a single node
    • Spark uses several nodes
    • but single-node Spark swaps data in and out of memory
    • MapDB can do this way more efficiently (10x?)
    • so I want to have some compatibility with Spark Data Frames
  • query planner support
    • support some sort of SQL-ish language with a query planner and executor
    • take inspiration from the Postgres extension API
    • the SQL engine from the SQLite VM?
    • use Spark Catalyst??
  • reactive support
    • planned for a very long time; I played with Kilim in 2008, and JDBM3 was originally steered in this direction
    • based on Kotlin continuations or perhaps a similar framework
    • based on AsynchronousFileChannel
    • non-blocking disk IO
    • should include MapDB and most of its collections
    • support for Akka, RxJava and similar frameworks
  • time series database

  • graph database…

Comments

Eduard Dudar • 3 years ago

No pressure of course but wondering what are the current plans for 4.0 release. Some features like non-blocking IO are very sweet but github shows only 1 issue in closed for 4.0 and about 60 opened.

Scala, boxing and concurrent collections (2017-07-27) — https://mapdb.org/blog/scala_concurrent_col

I am working on concurrent Scala code. Native Scala collections do not have that many features, so I have to use native Java collections together with some extra libraries (Eclipse Collections, Guava). It is a real pain to use those from Scala; here is one example of why you should just use Kotlin.

File reader count

I need to track the number of readers for each file. It is a simple semaphore system using Map<File, Long>, where the key is a file and the value is the number of readers.

Ideally this map should support atomic operations (CAS, swap, computeIfAbsent…). No locking should be required to maintain this map. So I need some sort of concurrent map.

Scala has the scala.collection.mutable.ConcurrentMap interface, but without any actual implementations. Plus it does not have any useful functions.

Java has java.util.concurrent.ConcurrentMap with nice methods such as compute, computeIfPresent etc… So the choice is simple.

The map declaration looks like this:

// scala code
val fileSemaphore = new ConcurrentHashMap[File, Long]()

Lock

Now the locking method. It should increment the value (number of readers) in an atomic way. If the key (File) is not in the map (has zero readers), it should insert the default value 1.

// scala code
def fileLock(file:File){
    fileSemaphore.compute(file, {(file,value)=> 
      if(value==null) 1 
      else value+1
    })
  }
  • The compute method will update the map in an atomic way.
  • If the key-value pair is not present, the value argument will be null.
  • The function returns the new value, which is inserted into the map.

  • This code is actually broken and just happens to work (see next chapter).

Unlock

Now the file unlock method.

  • It should decrement the number of readers.
  • If the number of readers reaches zero, it should remove the key from the Map (the function returns null).
// scala code
def fileUnlock(file:File): Unit ={
  fileSemaphore.compute(file, {(file,value:Long) => 
    if(value==1L) null 
    else value-1
  })
}

This code does not compile; scalac cannot infer the function return type from null. It needs a cast as an extra hint:

if(value==1L) null.asInstanceOf[Long] 
else value-1

It compiles and runs, but does not work. When the number of readers reaches zero, the file is not removed from the Map; its value is set to zero instead.

Nullable Long in Scala

Both lock and unlock methods are broken. If you convert this code to Java, it works. The problem is that Scala does not allow nullable Longs. Any null variable or expression with type Long is silently converted to 0L.

The first lock method just happens to work, because the non-existing key (null value) is silently converted to 0L and incremented. But the first line in this expression is never executed:

if(value==null) 1  // always false, never executed, value is never null, but 0L
else value+1

The unlock method never returns null (to remove the file from the map). The first line is executed, but the return value is converted to 0L and inserted into the map:

if(value==1L) null.asInstanceOf[Long]  //converted to 0L
else value-1

This conversion is pretty nasty. It swaps your values at runtime. I would expect something like that from JavaScript, but not from a strongly typed language.

To be fair, scalac emits a warning in the lock method, but it is not a fatal error. And the unlock method passes without a warning.

Solution

I found it impossible to write a correct solution in Scala. There are two workarounds:

  • Remove the Long definition and use Any to keep the compiler away: Map[File, Long] becomes Map[File, Any].

  • Write lock/unlock methods in Java.
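For comparison, here is a sketch of the second workaround: the same lock/unlock written in Java, where ConcurrentHashMap.compute handles null as documented (the value argument is null for a missing key, and a null return removes the key).

```java
import java.io.File;
import java.util.concurrent.ConcurrentHashMap;

// Java version of the reader-count semaphore; null works as expected here.
class FileReaders {
    final ConcurrentHashMap<File, Long> readers = new ConcurrentHashMap<>();

    void lock(File file) {
        // missing key => value is null => insert 1; otherwise increment
        readers.compute(file, (f, v) -> v == null ? 1L : v + 1);
    }

    void unlock(File file) {
        // last reader => return null => key is removed from the map
        readers.compute(file, (f, v) -> (v == null || v == 1L) ? null : v - 1);
    }
}
```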

Nullable types in Kotlin

I could not resist showing how elegant this code becomes in Kotlin with nullable types. So I rewrote the code above in Kotlin.

// kotlin code
val readers = ConcurrentHashMap<File,Long>()

fun lock(file:File):Unit{
    readers.compute(file,{file, value ->
        if(value==null) 1L 
        else value+1
    })
}

fun unlock(file:File):Unit{
    readers.compute(file,{file, value ->
        if(value==1L) null 
        else value-1
    })
}

Fancy null operators (elvis) would not make the code better. Instead Kotlin provided another unexpected benefit: the compiler found a concurrency issue :-)

The code above does not compile. If we unlock the wrong file (not yet locked), value is null and else value-1 would throw an NPE.

So we need to handle the case when the wrong file is unlocked. This version is correct and compiles:

// kotlin code
fun unlock(file:File):Unit{
    readers.compute(file,{file, value ->
        if(value==null) 
            throw IllegalMonitorStateException("file not locked")
        if(value==1L) null
        else value-1
    })
}

PS: Keep in mind that ConcurrentMap is an interface defined in Java code. Nullability information was added later with external annotations :-)

Comments

Tse-Wen Wang (Tom) • 3 years ago • edited

Scala supports nullable Long through its support of Java classes. Here's how I would use nullable Long.

import java.lang.{Long => JLong} // JLong will be alias for java.lang.Long

null.asInstanceOf[JLong] // returns null

--

Mateusz Maciaszek • 4 years ago

Wouldn't using AtomicRef with Optional type help to resolve all of these problems?


    Jan Kotek → Mateusz Maciaszek • 4 years ago

    I guess you mean in combination with immutable Scala Map. It would not, it generates too much GC garbage.

        Mateusz Maciaszek → Jan Kotek • 4 years ago

        Not really, mutable version should be ok as well once dealing with concurrency control mechanism (hence not sure about GC pressure).
Lessons from MapDB development (2017-07-10) — https://mapdb.org/blog/mapdb_lessons

MapDB is a great project, but for many reasons it is falling behind other projects which arose around the same time (Hazelcast, Redis…). In this post I will outline mistakes I made over the years while working on MapDB.

  • concurrency
    • JDBM3 back in 2012 used a single ReadWriteLock to handle concurrency.
      • That allows parallel readers, but a single writer.
      • SQLite has a similar approach.
      • Antirez from Redis is a big advocate of simplifying things by avoiding concurrency.
    • That was redesigned in MapDB 1 to allow parallel writers.
    • Parallel writers did not bring real performance benefits:
      • j.u.c.ConcurrentSkipListMap scales linearly with the number of cores
      • MapDB only scales up to 4 cores
    • Concurrency greatly complicated things and is responsible for many delays.
    • Most benefits of concurrency could be achieved under a single lock:
      • the fail-fast iterator in JDBM3 would throw ConcurrentModificationException; that can be fixed under a single lock
      • concurrent scalability is still possible under a single lock with sharding and other trivial tricks
      • a single lock would still allow a background writer thread, the main benefit for latency
  • writing a database is a hard task
    • working over raw binary files is tough
      • good luck debugging a wrong file offset in 1TB stores
    • even now in 2017, there are not many database engines
      • most of them use relatively simple ideas (BTree, LSM)
    • papers describing algorithms are often imprecise
      • there is the B-Link-Tree paper which describes a concurrent BTree
        • published in the 1980s, many citations
        • but even today it is not clear how to handle some concurrent cases (root update)
        • the initial implementation took one week
        • it took about 3 months of work to nail it and make it thread safe
  • too many features
    • MapDB had too many ways to open files and handle concurrency,
    • that created too many combinations to test
    • it was hard to document and explain all the features
  • code duplication and not-invented-here
    • I spent a long time writing code which was already written in other libraries
    • MapDB was self-contained, with no dependencies
  • MapDB does not integrate with default tools and de facto standards
    • TFile, HFile and other formats
      • data exported from MapDB could be used by other databases
      • MapDB could operate directly over data created by other tools
    • other file formats
    • reimplementation of an existing API (the LevelDB Java binding)
      • this way MapDB could be used as a drop-in replacement for other libraries
  • did not follow test-driven development
    • automated testing in MapDB is still fairly good; for example we test process crash recovery with kill -9 PID
    • we had the same problem with JDBM 1.5 back in 2008
    • long-running tests were broken for a long time (fixed now)
    • it took too long to run the default unit tests (fixed now)
  • not enough performance testing
    • no performance regression testing
    • single-entry locks destroyed performance in the 3.x branch (fixed in 3.0.5)
    • old code had concurrent scalability problems
      • it used segmented locks
      • too many embedded locks; sometimes semaphores without a memory barrier would do
  • file format and API changed way too many times
    • it was necessary to fix early design mistakes
    • I should have started with an in-memory store
    • code change is sometimes necessary
  • documentation
    • I spent way too much time deciding on a format (Markdown versus reStructuredText)
    • too much time went into generating PDFs; not many people are using them
    • mapdb.org was originally generated by the Maven Site plugin (haha)
    • a GitHub wiki would have been fine from the start
    • MapDB needs way more code examples
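For illustration, the sharding trick mentioned under concurrency can be sketched as N independent maps, each guarded by its own single lock (hypothetical code, not MapDB's implementation): writers on different shards proceed in parallel, while each shard stays simple and single-writer.

```java
import java.util.*;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of concurrent scalability under simple locks via sharding:
// each shard is a plain map with its own ReadWriteLock.
class ShardedMap<K, V> {
    static final int SHARDS = 16;
    final List<Map<K, V>> maps = new ArrayList<>();
    final List<ReentrantReadWriteLock> locks = new ArrayList<>();

    ShardedMap() {
        for (int i = 0; i < SHARDS; i++) {
            maps.add(new HashMap<>());
            locks.add(new ReentrantReadWriteLock());
        }
    }

    int shard(Object key) { return (key.hashCode() & 0x7fffffff) % SHARDS; }

    V put(K key, V value) {
        int s = shard(key);
        locks.get(s).writeLock().lock();          // single writer per shard
        try { return maps.get(s).put(key, value); }
        finally { locks.get(s).writeLock().unlock(); }
    }

    V get(K key) {
        int s = shard(key);
        locks.get(s).readLock().lock();           // parallel readers per shard
        try { return maps.get(s).get(key); }
        finally { locks.get(s).readLock().unlock(); }
    }
}
```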

Comments

xamde ⬣ • 4 years ago

Thanks for sharing. I watched the Java embedded database space thoroughly, there are few real candidates, MapDB always looking like the best of them. My main problem was stability. I needed far less features, cared less about ultra-performance, but need a stable, reliable, somewhat scalable (otherwise I could use in-memory) key-value store first. I am happy to see MapDB lives on and I hope there will be stable, maintained, releases with stable on-disk formats.

Nick Apperley • 3 years ago

Since Kotlin is being used with MapDB have Kotlin Coroutines ( https://www.youtube.com/wat… ) been considered for non blocking concurrency? Some NoSQL DB systems use Coroutines ( https://en.wikipedia.org/wi… ) to handle concurrency ( https://medium.com/software… ).
