My major goal now is to leave out anything that is too complex to implement and maintain. Instead, I aim to neatly arrange realistic simple elements into a cohesive puzzle that functions smoothly. I am somewhat giving up on the “large database idea” while returning to the roots that made JDBM3 and MapDB 1.0 great.
MapDB is an alternative memory model. It can store data on disk and work as a database engine, but the main goal is to be a simple drop-in replacement for Java collections. It should make processing data that does not fit into memory easy.
Designed around Spliterator and Parallel Streams from Java 8. There are consumer CPUs with 128 cores; let’s make them usable.
Very fast data ingestion (imports). MapDB 1.0 had Data Pumps that could create BTreeMap in a very fast way. I would like to extend this idea. It will be possible to load a data file, perform transformations using parallel streams, and save it into new writable collections. All very, very fast, using append-only operations. The goal is to saturate SSD speeds.
Cheap snapshots. It will finally be possible to take a snapshot of a collection and use it to make backups or for data processing. It will work well with Parallel Streams and fast ingestion.
Fix write amplification issue in BTreeMap and other collections. Group multiple writes together into single batch.
Simplicity and maintainability. I am fixing all design mistakes that did not work in older versions.
MapDB V4 will be in a single Java package again. It will also use no dependencies (no Eclipse Collections, no Guava…). I am even dropping Kotlin from production code (it will still be used for unit tests). The new version will be a single JAR file that only depends on Java 8 or newer.
Packaging will change to make it possible to use older versions in parallel. The package name will be org.mapdb.v4, and the Maven artifact will be org.mapdb:mapdb-v4. I am also going back to Maven. It seems simpler for tricky stuff like OSGi annotations.
I am also dropping all the tricky elements from storage. No dynamic recursive space allocation, no multi-sized records, and no alternative storage access methods (sorry, no RandomAccessFiles, ByteBuffers all the way).
In practice, it means three things:
Serializers will also be removed. Now, collections will not directly “see” records, such as BTreeNode, but will have to perform “actions” at the Store on top of binary data. This will make it possible to store all modifications in an event log, and use it to replay snapshots.
And finally, all concurrency design is greatly simplified. MapDB will be thread safe and use a single ReadWriteLock per collection. No more segmented and structural locking. It will still be possible to insert data using multiple threads, but that will be done at the caching layer.
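To make the "single ReadWriteLock per collection" idea concrete, here is a minimal sketch using only JDK classes. The class name SingleLockMap and its methods are hypothetical and are not part of MapDB; the sketch only shows the locking pattern (many readers, one writer).

```java
import java.util.TreeMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration, not MapDB code: one lock guards the whole collection.
final class SingleLockMap<K extends Comparable<K>, V> {
    private final TreeMap<K, V> data = new TreeMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Readers share the read lock, so many threads can read at once.
    V get(K key) {
        lock.readLock().lock();
        try {
            return data.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Writers take the exclusive write lock; only one writer runs at a time.
    V put(K key, V value) {
        lock.writeLock().lock();
        try {
            return data.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    int size() {
        lock.readLock().lock();
        try {
            return data.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The trade-off is the one described above: writes do not scale across cores, but the locking logic stays trivial to reason about.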
That is all for now, more details in next few days.
MapDB has a big problem with write amplification. Random inserts into BTreeMap trigger a cascade of memory allocations and disk space fragmentation.
One solution is a data structure designed to absorb a large number of writes gracefully. I experimented with Fractal Trees, where each tree node has a buffer for storing changes. However, the compaction algorithm for Fractal Trees is beyond my skills (I will write another blog post about this adventure).
Another way to solve write amplification is append-only log files. They have many great features: snapshots, versioning, and logs that can be streamed over the network. But this requires a lot of disk space, and is useless for memory-based systems.
After some failed experiments, I got key criteria for new storage engine:
To make it work, I needed one big change in design.
Store interface in MapDB is essentially simple key value map: Map<Pointer,Value>.
Store does not know what data type it stores.
If BTreeMap wants to perform some operation (such as insert new key into dir node),
it has to take value from Store, modify it, and put new value back to store.
That generates too much IO. I tried many hacks to make it faster for special cases (such as sorted arrays), but ultimately that is way too complex.
Another problem with the take-modify-put-back approach is streaming. Write operations should produce small deltas. I tried to compare two binary versions, compress them with a common dictionary… Obviously that did not work. More hacks.
The solution is a simple design change! Take data interpretation away from data structures (collections) and move it into the Store. So BTreeMap can insert a new key into a btree node, but should not know whether the node is represented as a sorted array. This hides data types (such as BTree dir node) from the collection (BTree).
This simple change also greatly simplifies MapDB store design. It eliminates unholy trinity of Serializer-Collection-Store. In BTreeMap I had to do many tricks with generics to hide from BTreeMap that dir nodes may be Long[] (or long[], ByteBuffers, or byte[]…).
Now I can finally use storage format, that solves write amplification problem.
Most databases use fixed size pages. Unused space on page is left empty, to accommodate future writes. So if new key is inserted into BTree page, it does not trigger new space allocation (like in MapDB 3), but simply uses free space on page.
Another cause of write amplification is array insertion.
A BTree page uses a sorted array; an insert in the middle has to resize and overwrite the entire array (page). Constantly resizing and rewriting the array causes write amplification.
My solution is to leave the original array untouched, and write a delta log into the free space after the main array.
So that is the trick. The new store design uses free space on pages to write a delta log. This delta log is later merged with the main data on the page by a background worker.
This has some key benefits:
several updates of same page in short interval, are merged together into single write (if several keys are inserted into BTree node, sorted array is only updated once with all changes)
it reduces write amplification by several orders of magnitude at the expense of temporarily slower reads (until the background worker kicks in, a reader has to replay the write log to get the most recent version)
there is no complex free space allocator (fixed size page), yet unused space on pages is not wasted
“compaction” is very simple, localized and atomic; only single page IO. No traversal or inter-page dependencies
read and write IO is localized on single page, great for cache locality etc…
it is universal, works on most shapes of data, not just sorted BTree nodes. For example JSON documents have similar write amplification problem, and can use delta logs
opens road to new features (snapshots, isolation…)
write log can be used for different scenarios (streaming over network, write ahead log…)
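The page-with-delta-log idea can be sketched in a few lines. This is only an in-memory illustration under my own assumptions: the class DeltaLogPage, its fields and its method names are hypothetical, keys are plain long values, and the "background worker" is reduced to an explicit merge() call. It is not the PageLogStore implementation.

```java
import java.util.Arrays;

// Hypothetical sketch: a page holds a sorted main array plus a small delta log.
final class DeltaLogPage {
    private long[] sorted = new long[0]; // main sorted array, rewritten only on merge
    private final long[] log;            // delta log living in the page's free space
    private int logSize = 0;

    DeltaLogPage(int logCapacity) {
        if (logCapacity < 1) throw new IllegalArgumentException("logCapacity must be >= 1");
        log = new long[logCapacity];
    }

    // Insert appends to the delta log; the sorted array is left untouched.
    void insert(long key) {
        if (logSize == log.length) merge(); // no free space left, merge first
        log[logSize++] = key;
    }

    // A reader checks the main array, then replays the not-yet-merged log.
    boolean contains(long key) {
        if (Arrays.binarySearch(sorted, key) >= 0) return true;
        for (int i = 0; i < logSize; i++) {
            if (log[i] == key) return true;
        }
        return false;
    }

    // What a background worker would do: fold all logged keys into the
    // main array with a single rewrite of the page.
    void merge() {
        long[] merged = Arrays.copyOf(sorted, sorted.length + logSize);
        System.arraycopy(log, 0, merged, sorted.length, logSize);
        sorted = Arrays.stream(merged).sorted().distinct().toArray();
        logSize = 0;
    }

    int pendingLogEntries() { return logSize; }

    long[] sortedKeys() { return sorted.clone(); }
}
```

Several inserts in a short interval cost one array rewrite instead of one per insert, and reads stay correct in between because they replay the log.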
I am working on proof of concept implementation. It will provide PageLogStore
and SortedMap<Long,Long> implementation.
After performance is verified, I will move to MapDB 4 development again.
Many users like a single jar with no dependencies. MapDB could easily have a 20 MB jar with 20+ dependencies (like Eclipse Collections). So I decided to split MapDB into several smaller modules. Each module will be hosted in a separate GitHub repo under a new organization.
For now, I am working on basic mapdb-core module, it contains API
and basic functionality equivalent to MapDB1 (minus TX):
I am also slowly rebranding MapDB. I picked cute little rocket back in 2012, but that is a bit obsolete now. Today it is time for a brand new less dynamic logo Map<D,B>, that captures good old boring Java values.
Licence will remain Apache 2 with no dual licensing, no proprietary parts and no other hooks. It is the most effective business model for a one-man consulting business. However, I will put less effort into documenting internals of my work (storage format specs, integration tests…); a coauthor never materialized. In the next few months I will probably start an LLC company.
Current StoreDirect is designed to save space, and it brings several layers and complexity.
I would like to support libraries like Protocol Buffers, Netty or GRPC.
It should be possible to send serialized data directly into network buffer with ByteBuffer.put().
MapDB uses serializers with DataInput and DataOutput so using ByteBuffer is complicated.
After some thinking I decided to eliminate entire serialization layer
and (de)serialize into ByteBuffers directly.
There is new page based store inspired by LMDB. It uses fixed size pages and tightly integrates with serializers. Data structure is aware of page size, so BTree Nodes will split in a way to fit onto page. Serializers will be also able to use free space on page, for example for write buffer, or sparse node arrays (faster random btree inserts).
PageStore is simpler to deal with. Compaction can use temporary bitmaps for spatial information, and will not require
full store rewrite (like in MapDB1 and 3). Snapshots, transactions, copy-on-write etc will also be
much easier to implement.
New PageStore requires ByteBuffers. Older version supported other
backends such as RandomAccessFile or byte[], but that is gone. 32bit systems will have to use different store, from now on focus is on memory mapped files.
Memory mapped files in Java are tricky (JVM crash with sparse files, file handles…).
So this version will use very safe (and slower) defaults.
I am working on following data structures:
IndexTreeList uses sparse BTree like structure to represent a huge array (or list). In here it is used as hash table for HashMap
HTreeMap we all know and love. First version will be fixed size (specified at creation) and optionally use 64bit hashing. Concurrent segments are gone; they did not really scale too well.
BTreeMap - Older version used complicated non-recursive-concurrent map. It was based on paper I found impressive,
but it had complicated design. This version comes with a simplified version (recursion, single ReadWriteLock, delete nodes).
Old concurrent version will move into separate module.
Both versions will share the same storage format (serializers) and will be interchangeable.
CircularQueue, LinkedFIFOQueue, LinkedLIFOQueue and LinkedDequeue. Simple queue implementations with single read-write locks.
No support for expiration in maps yet; that will come in a separate module (mapdb-cache).
I tried to reach some sort of concurrent scalability with my own design (segment locking, locking layers), that did not really work and is gone.
Instead, I merged locking into store, and data structure will have an option to lock individual records by their recid. That is sufficient to ensure consistency in concurrent environment.
By default, all data structures will come in two flavors: with no locking, or with a global read-write lock (single writer, multiple readers).
To support concurrent scalability on 16+ cores, all data structures will support Java 8 fork-join framework, parallel streams, producers and consumers.
Older versions had a data pump to import large BTrees; this will get extended to support other collections as well (HashMap, queues…). It will be possible to analyze a large map (10B entries) with 64 CPU cores and dump the result into another collection at 2 GB/s.
In future, I will focus on Java8 Streams and similar filtering APIs, to analyze data that do not fit into memory (external merge sort etc..)
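As a rough picture of "analyze a map on many cores and dump the result into another collection", here is a sketch that uses only standard JDK collections and parallel streams. Nothing here is MapDB API: the class ParallelCopy, the helper squares and the "keep even values" filter are made up for illustration, and the JDK maps stand in for MapDB collections.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

// Hypothetical illustration with JDK collections standing in for MapDB ones.
final class ParallelCopy {

    // Scans the source map with a parallel stream and collects the entries
    // with even values into a new concurrent map.
    static ConcurrentMap<Long, Long> evenValues(Map<Long, Long> source) {
        return source.entrySet().parallelStream()
                .filter(e -> e.getValue() % 2 == 0)
                .collect(Collectors.toConcurrentMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    // Builds a sorted sample map: key i maps to i * i.
    static Map<Long, Long> squares(long n) {
        ConcurrentSkipListMap<Long, Long> m = new ConcurrentSkipListMap<>();
        LongStream.range(0, n).forEach(i -> m.put(i, i * i));
        return m;
    }
}
```

The parallel stream splits the source through its Spliterator, which is why spliterator quality matters for the collections described above.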
With breaks, I have worked on storage engines for 12 years (H2, JDBM 2 and 3, MapDB 1 to 4).
I have many notes, unfinished branches, unit tests, emails, bug reports and ideas.
So the big task for February is to compile all this into some sort of list.
At the end of February I will make an alpha release, to have something to work with.
Once code and ideas are out, I can get some feedback and establish a roadmap. If everything goes well, we could have a beta release at the end of March.
I support this project from my consulting work. It is enough to keep the lights on, but things could move faster. If you feel like it, support me on GitHub, or throw me some crypto.
That is all for now.
Not much. I have many notes, some code and design.
ReadWriteLock)
command line tools to convert files into new MapDB 4 format
org.mapdb.DB)
org.mapdb will stay empty
org.mapdb.volume (introduced in MapDB3) will not be used
org.mapdb.serializer was renamed to org.mapdb.ser
Pranas • a year ago
Sounds like real-logic/agrona could love both 2G limitations (buffers) and provide small packages size (primitive collections to replace Eclipse collections) and provide some more concurrency wise clever primitives.
Money Manager • 2 years ago
Will JSON handling be added as well ? XPATH like query etc.
Pranas -> Money Manager • a year ago
https://commons.apache.org/… could be used as wrapper on top of the stores. It’s nice not to have bloated library …
Store is the basic low-level storage layer in MapDB. It maps recid (record ID) into binary record (byte[]).
Here is interface definition
LogStore uses directory with multiple subdirectories:
data - contains data files
blob - stores large records that are too big to be stored directly in the log file
generations - file generation markers, used for replay
Log Entry Headers (they will most likely change in final version):
0 - invalid header
1 - eof
2 - skip single byte
3 - commit
4 - rollback
5 - record
byte[] that contains record
6 - blob record
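The header codes listed above could be modelled as an enum. The numeric values come from the list; the enum name, the constant names and the decision to map unknown bytes to INVALID are my assumptions for this sketch, and the real codes may change in the final version.

```java
// Hypothetical names; only the numeric codes are taken from the list above.
enum LogEntryHeader {
    INVALID(0),
    EOF(1),
    SKIP_BYTE(2),
    COMMIT(3),
    ROLLBACK(4),
    RECORD(5),
    BLOB_RECORD(6);

    final int code;

    LogEntryHeader(int code) {
        this.code = code;
    }

    // Decodes a header byte read from the log.
    // Assumption: any unknown code is treated as an invalid header.
    static LogEntryHeader fromByte(byte b) {
        int unsigned = b & 0xFF;
        for (LogEntryHeader h : values()) {
            if (h.code == unsigned) return h;
        }
        return INVALID;
    }
}
```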
Algorithm:
FileChannel#transferTo()
TODO: can transaction spread over multiple files?
Snapshots make an immutable copy of the Index Table
Map<Long,Long>
offset=recid*8
The first engine I review is JetBrains Xodus. It was developed for JetBrains YouTrack and later open-sourced under the Apache2 license. It powers YouTrack in production; I could not find other users. It started in 2010, and is production ready.
Main characteristics
This blog post is just a code walkthrough with my initial impressions. Future blog posts will address architecture, log, indexes, performance and transactions.
There are many code references. You can use them by pressing ctrl+n in IntelliJ IDEA.
The JVM releases memory-mapped ByteBuffers only after GC; that causes disk handle leaks, and could cause a JVM crash TODO link
There is the ‘cleaner hack’, but in MapDB it only works on OpenJDK 7 and 8, and does not work on OpenJDK 9 or Android. The Xodus cleaner has branches for Android and OpenJDK 9 and probably works there.
jetbrains.exodus.util.SafeByteBufferCleaner
Xodus uses packed longs in a similar way to MapDB:
jetbrains.exodus.log.CompressedUnsignedLongByteIterable#getLong(jetbrains.exodus.ByteIterator)
Packed long iterator, get and skip:
jetbrains/exodus/log/CompressedUnsignedLongByteIterable.java:150
Code to detect JVM version
jetbrains.exodus.system.JVMConstants
Executes a piece of code with extra privileges. MapDB does not handle privileged code execution; it will be interesting to see how the db works on restricted JVMs.
jetbrains.exodus.util.UnsafeHolder#doPrivileged$production_sources_for_module_utils_main
Convert to/from hex string
jetbrains.exodus.util.HexUtil
VCDIFF-like delta encoding algorithm; wiki.
jetbrains.exodus.compress.VcDiff
Variable-length quantity universal code; wiki.
jetbrains.exodus.compress.VLQUtil
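For readers who have not met packed longs or variable-length quantities, here is a generic sketch: 7 payload bits per byte, with the high bit marking "more bytes follow". This is a common textbook layout (little-endian groups); I am not claiming it matches the exact byte order used by Xodus or MapDB, and the class PackedLong is hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of a generic variable-length encoding for non-negative longs.
final class PackedLong {

    // Encodes value using 7 bits per byte; small values take fewer bytes.
    static byte[] pack(long value) {
        if (value < 0) throw new IllegalArgumentException("negative values not supported");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.write((int) value); // last byte, continuation bit clear
        return out.toByteArray();
    }

    // Decodes a value produced by pack().
    static long unpack(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result; // continuation bit clear: done
            shift += 7;
        }
        throw new IllegalArgumentException("truncated packed long");
    }
}
```

Values below 128 fit into a single byte, which is why this kind of encoding pays off for record sizes and small offsets.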
Xodus caches/pools many short-lived objects (such as byte[]).
Keyword for code search is Spin Allocator.
I am not sure how effective object pooling is on modern JVMs.
The general consensus is that caching/pooling has higher overhead than GC.
Xodus started in 2010; Git history on those goes back at least to 2013.
Log is very interesting part (for MapDB):
jetbrains.exodus.log.Log
Storage description:
Xodus stores data in rolling log files. Log entries (pages) are identified by one-byte headers.
BTree headers are here: jetbrains.exodus.tree.btree.BTree#loadRootPage.
All files are fixed size. If inserted record does not fit into file size, file is padded with empty entries and new file is started.
Only small records (pages) are stored directly in the log. Large records (pages) are placed into a separate file in a dedicated directory (blob storage).
Each record (page) is identified by an 8-byte address that identifies the log file and the offset within it.
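One simple way an 8-byte address can identify both a file and an offset, given fixed-size files, is to treat the address as a position in one long virtual log. The sketch below shows that arithmetic; it is my assumption about the general technique, not a copy of the Xodus implementation, and the class LogAddress and the file size used in the example are hypothetical.

```java
// Hypothetical sketch: address = fileIndex * fileSize + offsetInFile.
final class LogAddress {
    private final long fileSize; // every log file has the same fixed size

    LogAddress(long fileSize) {
        if (fileSize <= 0) throw new IllegalArgumentException("fileSize must be positive");
        this.fileSize = fileSize;
    }

    // Combines file index and offset into a single 8-byte address.
    long toAddress(long fileIndex, long offsetInFile) {
        if (offsetInFile < 0 || offsetInFile >= fileSize) {
            throw new IllegalArgumentException("offset outside of file");
        }
        return fileIndex * fileSize + offsetInFile;
    }

    // Which log file the address points into.
    long fileIndex(long address) {
        return address / fileSize;
    }

    // Position of the record inside that file.
    long offsetInFile(long address) {
        return address % fileSize;
    }
}
```

For example, with 8 MB files, new LogAddress(8L << 20).toAddress(3, 100) points at byte 100 of the fourth file.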
Some code:
Consistency check: jetbrains.exodus.log.Log#checkLogConsistency
Set log end, discard higher files. Used for rollback and tx flush?: jetbrains.exodus.log.Log#setHighAddress
Append record (page), return address: jetbrains.exodus.log.Log#write(byte, int, jetbrains.exodus.ByteIterable)
Example of log replay: jetbrains.exodus.log.Log#getFirstLoggableOfType
Flush and close: jetbrains.exodus.log.Log#flush(boolean)
File remove: jetbrains.exodus.log.Log#removeFile(long, jetbrains.exodus.io.RemoveBlockType)
Log garbage collection job: jetbrains.exodus.gc.CleanWholeLogJob#execute
Code:
Binary search in BTree: jetbrains.exodus.tree.btree.BasePageImmutable#binarySearch(jetbrains.exodus.ByteIterable, int)
Save BTree Page: jetbrains.exodus.tree.btree.BasePageMutable#save
BTree Page put: jetbrains.exodus.tree.btree.BottomPageMutable#put and jetbrains.exodus.tree.btree.InternalPageMutable#put
BTree Page delete: jetbrains.exodus.tree.btree.BottomPageMutable#delete and jetbrains.exodus.tree.btree.InternalPageMutable#delete
StoreDirect in MapDB 3 is very simple key-value store.
It takes an object (record), serializes it into binary form, and stores it in file.
(it is bit more complex, but for purpose of this article…)
Update mechanism for record is:
byte[]
Get operation is also trivial:
byte[]
This approach kind of works. But in many usages you run into performance issues. A good example is BTrees:
int
It was suggested that BTrees are not good for disk storage applications for this reason.
And once you get into more exotic data structures (linked list, hash trees, counted btrees, compressed bitmaps, queues…) similar problems become way too common. Data structure that works well on-heap (and on paper), becomes performance nightmare when implemented in database.
There are several workarounds for performance problems. BTree alone has dozens of reimplementations to fix write and read amplification. I studied some, but I believe they just introduce more complexity without really fixing the performance problems.
Take a Fractal BTree for example.
It solves write amplification by caching the keys as part of top BTree nodes.
There is a quite complicated algorithm, where changes slowly trickle down from the top nodes to the leaves.
In MapDB I spent several months trying to solve performance issues related to write and read amplification. Partial deserialization, binary acceleration, hash produced from serialized data… This introduced a lot of complexity into storage. When coupled with caching, transactions, snapshots… it is just a nightmare.
But now I think problem is somewhere else.
Problem is not with data structure design,
but due to limited features of most database engines (or Store interface in MapDB).
In MapDB I control full stack; serializers, caching, commits, binary formats, memory representation… I can afford some creativity while inserting data.
Rather than redesigning each data structure separately, I can identify common problems, and redesign my database engine to handle those cases efficiently.
And if I really have to implement data structure such as Fractal Tree, MapDB should have features to make it easy.
Currently MapDB has two record types:
Store exposes those to outside world (to Serializers) as monolithic byte[].
Small records have zero copy optimizations (serializer reads directly from mmap file).
But for huge linked record, I have to allocate and copy huge byte[] during deserialization.
So for start I decided to redesign how Serializer interacts with the Store and binary data.
I want to break down monolithic byte[] and single step (de)serialization,
into dance of callbacks and incremental changes.
Some of those changes are already there. For example MapDB 3 has ‘binary acceleration’, where BTreeMap performs binary search on its nodes, without full deserialization. But this redesign should address common problems, rather than adhoc optimizations.
Also I want to make record types more abstract, with option to add new record types in future. This should eventually handle stuff like compression or hashing, which are now part of serializers.
Current linked records have the following structure:
R1 - first chunk of data, link to R2
R2 - second chunk of data, link to R3
R3 - third chunk of data, end of record
Deserialization is slow and requires two passes:
byte[]
byte[]
byte[] to deserializer
In new form the links will be stored on first node:
R1 - link to R2, size of R2, link to R3, size of R3...
R2 - first chunk of data
R3 - second chunk of data
R4 - third chunk of data
This allows some interesting features:
Take a Counted BTree (Interval Tree) or Authenticated Skip List.
Solution is to store both node and counter in the same record, but on update overwrite only small fraction of binary record;
Fractal Trees have huge nodes (MBs), each node has write buffer which stores changes in keys.
Or in other terms, imagine large compressed array with frequent updates.
In binary terms it will be implemented in following way
Some records do not really store any information, but are just used as cache.
Their content can be reconstructed from other records.
Good example is counter attached to collection,
it caches number of elements in collection so Collection#size() is fast,
its content is reconstructed by counting all elements in collection.
StoreDirect does not support huge binary records well, even with new storage format. Frequent updates in large file may trigger allocator and cause fragmentation. Also moving large records through Write Ahead Log will slow down commits. So it makes sense to optionally move Large Record into separate files.
FileInputStream
Currently compression is handled by the serializer. That complicates caching etc. It makes sense to introduce compression as a new type of record. This opens some interesting scenarios for background operations:
Store#update will store values in decompressed form
Right now the checksum is handled by the serializer. It should be handled by the store as a new type of record.
Big priority for me is to support partial updates for Write Ahead Log. If large record is updated, only small part of it should pass through Write Ahead Log.
Also if single part of record is updated frequently, multiple updates should be merged together and only every Nth update hits file. The changes should be cached in memory somehow.
When file is stored in external file, and rollback is possible, there should be some
mechanism to update file incrementally. Perhaps provide FileInputStream with
write overlay stored in separate file.
Serializers in MapDB 3.0 already do more than just convert data from/to binary form:
Long[] versus long[])
Serializers will be redesigned in MapDB 4.0 to handle even more:
Also it will be more flexible with less grouping. I will write a separate blog post about those changes.
Frolov Aleksey • 2 years ago
Where is secondary key from mapdb v1/v2?
The 3.0 branch will be maintained until 4.0 is released and becomes stable enough (most likely December 2017)
byte[])
AsynchronousFileChannel and non-blocking disk IO, with continuations and light thread
integration with libs and extras
first priority is to finish Elsa Serialization library, but final version will be released together with MapDB 4
there will be many blog post describing my progress on MapDB
I have very long list of ideas. So I will go through my bookmarks and notes; and put everything into series of blog posts.
So far most requested features are:
AsynchronousFileChannel
time series database
Eduard Dudar • 3 years ago
No pressure of course but wondering what are the current plans for 4.0 release. Some features like non-blocking IO are very sweet but github shows only 1 issue in closed for 4.0 and about 60 opened.
I need to track the number of readers for each file. It is a simple semaphore system using Map<File, Long>,
where key is file, value is number of readers.
Ideally this map should support atomic operations (cas, swap, computeIfAbsent…).
No locking should be required to maintain this map.
So I need some sort of concurrent map.
Scala has scala.collection.mutable.ConcurrentMap interface, but without any actual implementations.
Plus it does not have any useful functions.
Java has java.util.concurrent.ConcurrentMap with nice methods such as compute, computeIfPresent etc… So the choice is simple.
The map declaration looks like this:
// scala code
val fileSemaphore = new ConcurrentHashMap[File, Long]()
Now the locking method. It should increment value (number of readers) in atomic way.
If key (File) is not in map (has zero readers), it should insert default value 1
// scala code
def fileLock(file:File){
  fileSemaphore.compute(file, {(file,value)=>
    if(value==null) 1
    else value+1
  })
}
compute method will update map in atomic way.
value argument will be null
Function returns new value, which is inserted into map
Now the file unlock method.
null)
// scala code
def fileUnlock(file:File): Unit ={
  fileSemaphore.compute(file, {(file,value:Long) =>
    if(value==1L) null
    else value-1
  })
}
This code does not compile; scalac cannot infer the function return type from null. It needs a cast as an extra hint:
if(value==1L) null.asInstanceOf[Long]
else value-1
It compiles and runs, but does not work. When number of readers reaches zero, the file is not removed from Map but its value is set to zero.
Both lock and unlock methods are broken.
If you convert this code to Java, it works.
Problem is that Scala does not allow
nullable longs.
Any null variable or expression with type Long is silently converted to 0L.
First lock method just happens to work, because the non-existing key (null value) is silently converted to 0L and incremented.
But the first line in this expression is never executed:
if(value==null) 1 // always false, never executed, value is never null, but 0L
else value+1
Unlock method never returns null (to remove file from map). First line is executed, but the return value is converted to 0L, and inserted to map:
if(value==1L) null.asInstanceOf[Long] //converted to 0L
else value-1
This conversion is pretty nasty. It swaps your values at runtime. I would expect something like that from Javascript, but not from strongly typed language.
To be fair the scalac emits warning in lock method, but it is not fatal error. And unlock method passes without warning.
I found it impossible to write correct solution in Scala. There are two workarounds:
Remove Long definition and use Any to keep compiler away: Map[File, Long] becomes Map[File,Any].
Write lock/unlock methods in Java.
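For the second workaround, a Java version could look like the sketch below. It relies on the documented behaviour of ConcurrentHashMap.compute: the value argument is null for a missing key, and returning null removes the mapping. The class name FileReaders and the helper methods readerCount and isTracked are hypothetical; the check for unlocking a file that was never locked is my addition.

```java
import java.io.File;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the reader counter written in plain Java.
final class FileReaders {
    private final ConcurrentHashMap<File, Long> readers = new ConcurrentHashMap<>();

    // Atomically increments the reader count; a missing key starts at 1.
    void lock(File file) {
        readers.compute(file, (f, count) -> count == null ? 1L : count + 1);
    }

    // Atomically decrements the reader count; returning null from the
    // function removes the file from the map when the last reader leaves.
    void unlock(File file) {
        readers.compute(file, (f, count) -> {
            if (count == null) {
                throw new IllegalMonitorStateException("file not locked");
            }
            return count == 1L ? null : count - 1;
        });
    }

    long readerCount(File file) {
        return readers.getOrDefault(file, 0L);
    }

    boolean isTracked(File file) {
        return readers.containsKey(file);
    }
}
```

Because java.lang.Long is a nullable reference type, null really means "no mapping" here, which is exactly the distinction the Scala Long version loses.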
I could not resist to show how elegant this code becomes in Kotlin with nullable types. So I rewrote code above in Kotlin.
// kotlin code
val readers = ConcurrentHashMap<File,Long>()
fun lock(file:File):Unit{
    readers.compute(file,{file, value ->
        if(value==null) 1L
        else value+1
    })
}
fun unlock(file:File):Unit{
    readers.compute(file,{file, value ->
        if(value==1L) null
        else value-1
    })
}
Fancy null operators (elvis) would not make code better. Instead Kotlin provided another unexpected benefit; compiler found concurrency issue :-)
Code above does not compile. If we unlock wrong file (not yet locked), value is null and else value-1 would throw NPE.
So we need to handle case when wrong file is unlocked. This version is correct and compiles:
// kotlin code
fun unlock(file:File):Unit{
    readers.compute(file,{file, value ->
        if(value==null)
            throw IllegalMonitorStateException("file not locked")
        if(value==1L) null
        else value-1
    })
}
PS: Keep in mind that ConcurrentMap is an interface defined in Java code. Nullability information was added later with external annotations :-)
Tse-Wen Wang (Tom) • 3 years ago • edited
Scala supports nullable Long through its support of Java classes. Here's how I would use nullable Long.
import java.lang.{Long => JLong} // JLong will be alias for java.lang.Long
null.asInstanceOf[JLong] // returns null
--
Mateusz Maciaszek • 4 years ago
Wouldn't using AtomicRef with Optional type help to resolve all of these problems?
Jan Kotek Mateusz Maciaszek • 4 years ago
I guess you mean in combination with immutable Scala Map. It would not, it generates too much GC garbage.
Mateusz Maciaszek Jan Kotek • 4 years ago
Not really, mutable version should be ok as well once dealing with concurrency control mechanism (hence not sure about GC pressure).
ReadWriteLock to handle concurrency.
j.u.c.ConcurrentSkipListMap scales linearly with number of cores
ConcurrentModificationException, that can be fixed under single lock
kill -9 PID
xamde ⬣ • 4 years ago
Thanks for sharing. I watched the Java embedded database space thoroughly, there are few real candidates, MapDB always looking like the best of them. My main problem was stability. I needed far less features, cared less about ultra-performance, but need a stable, reliable, somewhat scalable (otherwise I could use in-memory) key-value store first. I am happy to see MapDB lives on and I hope there will be stable, maintained, releases with stable on-disk formats.
–
Nick Apperley • 3 years ago
Since Kotlin is being used with MapDB have Kotlin Coroutines ( https://www.youtube.com/wat… ) been considered for non blocking concurrency? Some NoSQL DB systems use Coroutines ( https://en.wikipedia.org/wi… ) to handle concurrency ( https://medium.com/software… ).