This proposal adds primary key support so that inserted data can be deduplicated.
As a first step, only a primary key on a single column is supported. When a row is inserted, its primary key is used to check whether the row already exists in the table. If a row with the same primary key already exists, the insertion is skipped.
The initial plan is to implement deduplication by maintaining one deduplication container per table. When the database restarts, the primary key column is read from disk and the in-memory container is rebuilt.
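To make the intended behaviour concrete, here is a minimal sketch of such a per-table container. The names DedupContainer, rebuild and try_insert are hypothetical and not part of the existing code base, and a plain std HashSet over u64 keys stands in for whatever set type is finally chosen.

```rust
use std::collections::HashSet;

/// Hypothetical per-table deduplication container.
/// Assumption: primary keys are u64; a std HashSet is used as a stand-in set.
pub struct DedupContainer {
    seen: HashSet<u64>,
}

impl DedupContainer {
    /// Rebuild the in-memory container from the primary key column
    /// that was read back from disk at startup.
    pub fn rebuild<I: IntoIterator<Item = u64>>(keys_on_disk: I) -> Self {
        Self {
            seen: keys_on_disk.into_iter().collect(),
        }
    }

    /// Returns true if the key is new and the row should be written,
    /// false if a row with this primary key already exists (insertion skipped).
    pub fn try_insert(&mut self, key: u64) -> bool {
        self.seen.insert(key)
    }

    /// Number of distinct primary keys currently tracked.
    pub fn len(&self) -> usize {
        self.seen.len()
    }
}
```

In this sketch the insert path would call try_insert before writing a row and drop the row when it returns false.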
After some investigation, roaring bitmap looks like a good fit: it is a compressed bitmap index with excellent performance and low memory usage.
We can use RoaringBitmap (32-bit values) and RoaringTreemap (64-bit values) from roaring-rs to store ordinary integer primary keys. For string keys, which roaring bitmap cannot store, we can use a HashSet.
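The split between integer and string keys could be sketched as an enum with one variant per key type. Key and KeySet are hypothetical names. To keep the example free of external dependencies, a std BTreeSet stands in for the roaring set; with roaring-rs the integer variant would hold a RoaringTreemap (or a RoaringBitmap for 32-bit keys), whose insert likewise reports whether the value was newly added.

```rust
use std::collections::{BTreeSet, HashSet};

/// Hypothetical primary key value: either an integer or a string.
pub enum Key {
    Int(u64),
    Str(String),
}

/// Hypothetical per-table key set, chosen by the primary key column type.
pub enum KeySet {
    /// Stand-in for roaring::RoaringTreemap (assumption: swapped in later).
    Int(BTreeSet<u64>),
    /// String keys cannot go into a roaring bitmap, so they use a HashSet.
    Str(HashSet<String>),
}

impl KeySet {
    /// Ok(true): key was new. Ok(false): duplicate, insertion should be skipped.
    /// Err: the key type does not match the table's primary key column.
    pub fn insert(&mut self, key: Key) -> Result<bool, &'static str> {
        match (self, key) {
            (KeySet::Int(set), Key::Int(k)) => Ok(set.insert(k)),
            (KeySet::Str(set), Key::Str(k)) => Ok(set.insert(k)),
            _ => Err("primary key type does not match the table's key column"),
        }
    }
}
```

Keeping both variants behind one insert method means the insert path does not need to know which set type backs a given table.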
One open question: where should each table's deduplication container live? Could it be placed in the MetaStore?