
Add storage RabbitMQ #11069

Merged
alesapin merged 58 commits into ClickHouse:master from kssenii:add-storage-rabbitmq-read-only
Jul 4, 2020

Conversation

@kssenii kssenii (Member) commented May 20, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add storage RabbitMQ.

Detailed description / Documentation draft:

CREATE TABLE query parameters:

  • rabbitmq_host_port
  • rabbitmq_routing_key_list
  • rabbitmq_exchange_name
  • rabbitmq_exchange_type
  • rabbitmq_format
  • rabbitmq_row_delimiter
  • rabbitmq_num_consumers
  • rabbitmq_num_queues
  • rabbitmq_transactional_channel
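
Putting the parameters together, a table definition might look like the following sketch (the table name, columns, and values are illustrative, and the SETTINGS-based syntax follows ClickHouse's usual table-engine convention):

```sql
CREATE TABLE rabbitmq_queue
(
    key UInt64,
    value String
)
ENGINE = RabbitMQ
SETTINGS
    rabbitmq_host_port = 'localhost:5672',         -- broker address
    rabbitmq_exchange_name = 'clickhouse-exchange',
    rabbitmq_format = 'JSONEachRow',               -- any supported input format
    rabbitmq_row_delimiter = '\n',
    rabbitmq_num_consumers = 2,                    -- concurrent consumers per table
    rabbitmq_transactional_channel = 1;            -- wrap INSERT publishing in transactions
```

Data would then be read with an ordinary SELECT, typically via a materialized view that pushes incoming rows into a MergeTree table.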

Some explanation:

The routing key of the message and the exchange name are parameters that are set while publishing messages from any RabbitMQ client; therefore, they are specified by the client.

All exchange types known in RabbitMQ are supported: direct, fanout, topic, headers, and consistent-hash exchanges. The preferred exchange type is specified by the client in the exchange_type parameter.

There can be no more than one exchange per table, but one exchange can be shared between multiple tables, which enables routing into multiple tables at the same time.

If the exchange type is set to 'direct', messages go to the table(s) whose routing_key parameter exactly matches the routing key of the message.

If the exchange type is set to 'fanout', received messages are routed to all tables that are bound to it, regardless of routing keys (i.e. to all tables with the same exchange_name).

If the exchange type is set to 'topic', messages are routed to one or many tables based on a match between the message routing key and a pattern set in the routing_key_list parameter. Per the RabbitMQ documentation, routing keys sent to a topic exchange must be lists of words delimited by dots. The routing patterns may contain '*'. For example, if the pattern (specified in routing_key_list) of some tables is '*.logs', then messages whose routing key ends in '.logs' are sent to all those tables. Patterns can be more complex (like 'agreements.*.*.data.') and may also contain '#', which matches zero or more words. For more details see the RabbitMQ documentation.
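
The topic routing described above can be sketched as follows (table and exchange names are hypothetical):

```sql
CREATE TABLE topic_logs (message String)
ENGINE = RabbitMQ
SETTINGS
    rabbitmq_host_port = 'localhost:5672',
    rabbitmq_exchange_name = 'logs-exchange',
    rabbitmq_exchange_type = 'topic',
    rabbitmq_routing_key_list = '*.logs',   -- matches e.g. 'app.logs' or 'db.logs'
    rabbitmq_format = 'JSONEachRow';
```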

If the exchange type is set to 'headers', routing is even more flexible. With a headers exchange, the routing key can be a map (dictionary). For example, if the (key = value) headers of published messages are 'format=logs', 'type=report', 'year=2020' with the setting 'x-match=all', then these messages are routed to all tables whose routing_key_list contains all those matches. If the setting is 'x-match=any', then messages are sent to all tables where at least one value matches its key. The x-match setting is passed in the same list as the headers:
rabbitmq_routing_key_list = 'x-match=all,format=logs,type=report,year=2020'.
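
As an illustrative sketch, a table using headers routing could be declared like this (names are hypothetical):

```sql
CREATE TABLE report_messages (message String)
ENGINE = RabbitMQ
SETTINGS
    rabbitmq_host_port = 'localhost:5672',
    rabbitmq_exchange_name = 'headers-exchange',
    rabbitmq_exchange_type = 'headers',
    -- x-match=all: a message must carry all three headers to be routed here
    rabbitmq_routing_key_list = 'x-match=all,format=logs,type=report,year=2020',
    rabbitmq_format = 'JSONEachRow';
```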

If the exchange type is set to 'consistent_hash' (a special sharding exchange), messages are distributed evenly between all tables with the same exchange_name. Note that, per the RabbitMQ documentation, a consistent-hash exchange by default distributes messages based on the hash of the routing key. Therefore, the key must be a string integer, randomized for every batch.

Multiple routing keys for each table are supported. They are specified in a list of keys, separated by commas: rabbitmq_routing_key_list = 'key1,key2,key3,key4,key5'. In case of topic-exchange it can be a list of patterns, separated by commas: rabbitmq_routing_key_list = '*.logs,data.*,*.work.data.*.logs'. In case of headers-exchange it can be a list of (key = value) headers: rabbitmq_routing_key_list = 'x-match=any,format=logs,type=report,year=2020'.

If the num_consumers parameter is set, then in the current implementation the exchange specified by the client is bound (by the specified routing keys) to a consistent-hash exchange, which evenly distributes messages between all concurrent consumers within one table. Without this binding it would be impossible to ensure that a message is received no more than once, since binding to a consistent-hash exchange is the only way to enable sharding between queues (~ consumers) with the same routing key.

If the num_queues parameter (number of queues per consumer) is set, messages are distributed between all queues of all consumers. It is worth setting num_queues, since one queue can handle up to 50K messages and it is in general highly recommended to keep queues as short as possible to increase throughput. By default, there is one unique queue for each consumer. There is no limit on the values of num_consumers and num_queues.

Also, since a local (unique for each table) consistent-hash exchange is used for sharding within one table, and since a hash exchange distributes messages based on the hash of an integer key, which in our case can be a string, a complex pattern, or a key=value header, the current implementation changes the hash property of the exchange to message_id. This makes it possible to enable sharding with non-integer routing keys. But remember that, per the RabbitMQ documentation, to use a different hash property the message_id property must be set in the publishing parameters when messages are published from a RabbitMQ client (otherwise all messages are routed to one arbitrarily chosen queue). To enable sharding, this message_id should be unique for each message, or at least unique for every batch of messages.

(!) If there is no need for a specific exchange type, there is a default implementation. It should be used if a table simply needs to receive messages quickly: in this case only the direct exchange type is used, which is the fastest of all, and no consistent-hash exchange is involved in sharding; everything is done by the direct exchange alone. This makes a noticeable difference in speed, because a direct exchange works faster than a consistent-hash exchange. Sharding with a direct exchange is possible thanks to special internal bindings that are made by default. Therefore, with the default implementation, much faster sharding between all concurrent consumers (~ their queues) is done within each table. This implementation is also used in the INSERT query.

To use this default behaviour, the exchange_type parameter must not be used, exchange_name must be different for each table, and there is no need to set the routing_key_list parameter (it will be ignored). When publishing messages, the routing-key message parameter must be a string integer randomized in the range [1, num_consumers] for every batch of messages (the smaller the batch size, the better), or in [1, num_consumers * num_queues] if num_queues is set. (In this case, the exchange to which the client first publishes is a fanout exchange; messages are later routed to the needed local table exchange.)
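
A minimal sketch of this default setup, with no exchange_type (names are illustrative):

```sql
CREATE TABLE fast_queue (message String)
ENGINE = RabbitMQ
SETTINGS
    rabbitmq_host_port = 'localhost:5672',
    rabbitmq_exchange_name = 'fast-queue-exchange',  -- must be unique per table
    rabbitmq_format = 'JSONEachRow',
    rabbitmq_num_consumers = 4;

-- When publishing, the routing key should be a string integer randomized
-- in the range [1, 4] (i.e. [1, num_consumers]) for each batch of messages.
```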

If the transactional_channel parameter is set to 1 (true), publishing inside the INSERT query implementation is wrapped in transactions.

Note:

  • The consistent-hash exchange must be enabled with a RabbitMQ plugin.
  • By RabbitMQ documentation, if there are multiple consumers, then the order of messages is not guaranteed.

@blinkov blinkov added doc-alert pr-feature Pull request with new product feature labels May 20, 2020
@alesapin alesapin self-assigned this May 21, 2020

@alesapin alesapin left a comment


Almost ready! Now we need to fix style check https://clickhouse-test-reports.s3.yandex.net/11069/c3569882bbcc33c047d9d9b9424bf06b9a50a3bf/style_check.html#fail1 and clang-tidy warnings https://clickhouse-builds.s3.yandex.net/11069/c3569882bbcc33c047d9d9b9424bf06b9a50a3bf/build_log_680227519_1590004367.txt. After that I'll try to help with compatibility.

Do you have any ideas why some Kafka tests failed in your PR?

@kssenii kssenii force-pushed the add-storage-rabbitmq-read-only branch from 1314807 to 14c67c6 on May 26, 2020 17:36
@kssenii kssenii changed the title Add storage rabbitmq read-only part Add storage RabbitMQ Jun 1, 2020

@alesapin alesapin left a comment


We discussed that we have to simplify the code of the (Read/Write)Buffer by separating the logic of reading from RabbitMQ (moving it into a separate background task) from the logic of reading into views/select queries.


@alesapin alesapin left a comment


We still have several points to improve and clarify. However, this code is well tested and quite isolated, so I think we can merge it and make some improvements in separate PRs.

@alesapin alesapin merged commit a2b6d58 into ClickHouse:master Jul 4, 2020

qza1800 commented Aug 12, 2020

@kssenii this feature is really fantastic!
Can you please give a sample of how to use the Protobuf format? I can't find a way to set format_schema:
https://clickhouse.tech/docs/en/interfaces/formats/#protobuf
Thank you

@kssenii kssenii (Member, Author) commented Aug 12, 2020

@qza1800 thank you. format_schema is not supported yet; I will add it in #12761.

