Skip to content

SSD cache-dictionary#8624

Closed
nikvas0 wants to merge 113 commits intoClickHouse:masterfrom
nikvas0:nikvas0/ssd_dict
Closed

SSD cache-dictionary#8624
nikvas0 wants to merge 113 commits intoClickHouse:masterfrom
nikvas0:nikvas0/ssd_dict

Conversation

@nikvas0
Copy link
Contributor

@nikvas0 nikvas0 commented Jan 12, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Cache-dictionaries layout for storing data on SSD.

TODO:

  • getBlockInputStream
  • Counters
  • Strings support
  • one file fifo compaction
  • lru for keys in memory
  • Configurable write buffer size
  • Checksums for blocks
  • store keys and metadata on ssd?
  • Any keys support
  • Get rid of duplicate objects in update from different threads
  • Perfomance test
  • Change HashMap (now unordered_map) (store hashes instead of keys?)
  • documentation
  • some refactoring

achulkov2 added a commit to achulkov2/ClickHouse that referenced this pull request May 24, 2020
@nikitamikhaylov
Copy link
Member

I think it is necessary to add performance tests to benchmark it in comparison with DirectDictionary and CacheDictionary. Or just CacheDictionary in memory with SSDCacheDictionary.
I don't mean performance tests in CI, I want something like the one given here #10622


namespace
{
inline size_t nearestPowTwo(size_t x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to use roundUpToPowerOfTwoOrZero from our codebase


size_t getPosition(const size_t bucket) const
{
const size_t idx = (bucket >> 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need it here? bucket it already a number from [0, buckets - 1]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To determine a position in the bucket we need to read 4 bits from the array with positions, but we can address only bytes, so we need to divide bucket number by 2.

};

if (!write_buffer)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Braces are redundant according to our codestyle


const size_t start_block = current_file_block_id % max_size;
const size_t finish_block = start_block + write_buffer_size;
for (const auto& key : keys)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: for (const auto & key : keys)

{
// add partitions to queue
while (partitions.size() > max_partitions_count)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style

AIOContext aio_context(1);

while (io_submit(aio_context.ctx, 1, &request_ptr) != 1)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style

}

while (io_getevents(aio_context.ctx, 1, 1, &event, nullptr) != 1)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style


TemporalComplexKeysPool tmp_keys_pool;
storage.update(
source_ptr,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true, that we can use many sources and not only one?

void SSDCachePartition::remove()
{
std::unique_lock lock(rw_lock);
std::filesystem::remove(std::filesystem::path(path + BIN_FILE_EXT));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error handling?

std::vector<Key> required_ids(not_found_ids.size());
std::transform(std::begin(not_found_ids), std::end(not_found_ids), std::begin(required_ids), [](const auto & pair) { return pair.first; });

storage.update(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to wrap it in a separate function, which adds source_ptr and lifetime and call this function?

@filimonov
Copy link
Contributor

BTW - it would be really interesting to see some perf number, and/or comparison with aerospike (#5629)

@alexey-milovidov
Copy link
Member

@filimonov They are published in https://presentations.clickhouse.tech/hse_2020/3rd/SSD_Dictionary_full.pdf

@filimonov
Copy link
Contributor

Just for the reference - aerospike gives up to ~1Mln req/sec

@alexey-milovidov
Copy link
Member

@filimonov This number does not mean anything.
It can be too slow or too fast depending on hardware.
Nikita has tested with laptop SSD.

nikitamikhaylov added a commit that referenced this pull request Jun 27, 2020
@nikitamikhaylov
Copy link
Member

Merged in #11947

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants