Skip to content

Multiple paths (disks/volumes) for storing temporary data support#8750

Merged
alexey-milovidov merged 3 commits intoClickHouse:masterfrom
azat:temporary_data_configuration
Jan 24, 2020
Merged

Multiple paths (disks/volumes) for storing temporary data support#8750
alexey-milovidov merged 3 commits intoClickHouse:masterfrom
azat:temporary_data_configuration

Conversation

@azat
Copy link
Member

@azat azat commented Jan 20, 2020

Changelog category (leave one):

  • New Feature

Changelog entry (up to few sentences, required except for Non-significant/Documentation categories):
Support storage policy (<tmp_policy>) for storing temporary data support.

Follow-up for: #4918

@alexey-milovidov
Copy link
Member

Ok.

BTW, do you consider the option to use existing storage_configuration and just to refer to some policy to store temporary data?

@azat
Copy link
Member Author

azat commented Jan 20, 2020

BTW, do you consider the option to use existing storage_configuration and just to refer to some policy to store temporary data?

At first I went this way (and just has tmp_policy), but now I don't have any ideas why, using existing storage_configuration looks indeed more convenient. Should be rebased...

@azat
Copy link
Member Author

azat commented Jan 20, 2020

but now I don't have any ideas why

Got it, I want to have separate default disk path (since default selector uses <path> and I want to use <tmp_path>)

@azat
Copy link
Member Author

azat commented Jan 20, 2020

@alexey-milovidov So, the question is, what is preferable?

I did the same thing for distributed engine (did not submitted yet), by adding another separate distributed_storage_configuration, to leave storage_configuration for MergeTree engine only, but I don't have any objective reasons,

@alexey-milovidov
Copy link
Member

Got it, I want to have separate default disk path (since default selector uses and I want to use <tmp_path>)

Not sure that I understand this completely, but looks like it's not an issue:

  • if tmp_policy is not specified, tmp_path is used;
  • if tmp_policy is default, then default policy is used that equals to path.

Probably one issue is that tmp subdirectory will not be created under path.
Maybe change the meaning of tmp_path so that it will be equal to path by default but tmp subdirectory is always used inside tmp_path?

@azat
Copy link
Member Author

azat commented Jan 20, 2020

Probably one issue is that tmp subdirectory will not be created under path.
Maybe change the meaning of tmp_path so that it will be equal to path by default but tmp subdirectory is always used inside tmp_path?

The issue is that there is tmp_path and path, so if storage_policy will be used instead and no tmp_policy will be specified then it will use path (but at least it should be changed to path+/tmp)

@alexey-milovidov
Copy link
Member

Always using tmp subdirectory is Ok.

@azat
Copy link
Member Author

azat commented Jan 20, 2020

Ok

@azat
Copy link
Member Author

azat commented Jan 21, 2020

Looks like the following is failing in upstream too:

  • test_ttl_move/test.py::test_double_move_while_select[test_double_move_while_select_positive-1]
  • test_multiple_disks/test.py::test_concurrent_alter_move[concurrently_altering_replicated_mt-ReplicatedMergeTree('/clickhouse/concurrently_altering_replicated_mt', '1')]
  • test_multiple_disks/test.py::test_concurrent_alter_modify[alter_modifying_mt-MergeTree()]
  • test_ttl_move/test.py::test_concurrent_alter_with_ttl_move[concurrently_altering_ttl_replicated_mt-ReplicatedMergeTree('/clickhouse/concurrently_altering_ttl_replicated_mt', '1')]

And test_storage_kafka cannot start docker.

@azat azat force-pushed the temporary_data_configuration branch from e1a5f07 to cf4e9fd Compare January 21, 2020 17:50
@azat
Copy link
Member Author

azat commented Jan 21, 2020

Always using tmp subdirectory is Ok.

Rebased (<tmp_path> behavior left unchanged)

@azat
Copy link
Member Author

azat commented Jan 22, 2020

Looks like test failures are not relevant to this changes, or am I missing something?

@azat
Copy link
Member Author

azat commented Jan 22, 2020

Hm, tmp subdirectory is missing, will fix it

Always using tmp subdirectory is Ok.

@alexey-milovidov so tmp_path directive can go away?

@azat
Copy link
Member Author

azat commented Jan 22, 2020

Hm, tmp subdirectory is missing, will fix it

So after looking into this again, I'm not sure which way this should goes:

  • adding tmp/ in all callers looks error-prone
  • adding some interface for <disk> is not that useful, create the disk with that <path> is simpler (the directory will not be created but it is okay, if this will useful then <create>true</create> can be parsed and call mkdir, and also note that currently tmp_path also will not be created and fail if the directory will not exist)

IOW maybe this should left as-is? @alexey-milovidov

@alexey-milovidov
Copy link
Member

adding tmp/ in all callers looks error-prone

So, the issue is that after we selected disk, we have to append tmp to its path but it's too error prone because we have to add it in multiple places: external aggregation, external sorting, joins, distributed tables... Do I understand correctly?

IOW maybe this should left as-is?

Ok, now it looks reasonable.

@azat
Copy link
Member Author

azat commented Jan 23, 2020

Do I understand correctly?

Yes

@alexey-milovidov
Copy link
Member

Ok. Let's proceed with the first variant of implementation.

azat added 2 commits January 23, 2020 20:31
This patch adds <tmp_policy> config directive, that will define the
policy to use for storing temporary files, if it is not set (default)
the <tmp_path> will be used.

Also tmp_policy has some limitations:
- move_factor              is ignored
- keep_free_space_bytes    is ignored
- max_data_part_size_bytes is ignored
- must have exactly one volume
@azat azat force-pushed the temporary_data_configuration branch from cf4e9fd to c9cc1ef Compare January 23, 2020 17:31
@azat
Copy link
Member Author

azat commented Jan 23, 2020

Rebased to include #8790 (to make the tests pass)

@azat azat force-pushed the temporary_data_configuration branch from 225a819 to 4f86861 Compare January 23, 2020 18:10
@alexey-milovidov alexey-milovidov merged commit 528d231 into ClickHouse:master Jan 24, 2020
@azat azat deleted the temporary_data_configuration branch January 25, 2020 14:41
@CurtizJ CurtizJ added the pr-feature Pull request with new product feature label Jan 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp-multidisk Storages & storage policies pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants