Implement natively uuid type in Clickhouse#945
Implement natively uuid type in Clickhouse#945alexey-milovidov merged 8 commits intoClickHouse:masterfrom YiuRULE:master
Conversation
|
Can one of the admins verify this patch? |
|
Great! |
|
autotests |
I wonder, does ClickHouse work on this machine? There are quite a lot of code, that rely on little endianess. |
dbms/src/Common/UInt128.h
Outdated
| @@ -25,19 +24,43 @@ struct UInt128 | |||
| UInt64 first; | |||
There was a problem hiding this comment.
Renaming this fields to high and low (or, better, low and high?) will be more convenient.
dbms/src/Common/UInt128.h
Outdated
| #endif | ||
|
|
||
| UInt128 & operator= (const UInt64 rhs) { first = rhs; second = 0; return *this; } | ||
| UInt128 & operator= (const UInt64 rhs) { first = 0; second = rhs; return *this; } |
There was a problem hiding this comment.
We need to check, where this method is used, just in case...
There was a problem hiding this comment.
So I just take a look that this method was only used in the file Common/HashTable/HashTable.h in the ZeroTraits::set method who initialize to zero so it should be safe. :)
(most of the code call directly the first and second attributes)
dbms/src/Common/UInt128.h
Outdated
| UInt128() = default; | ||
| explicit UInt128(const UInt64 rhs) : first(0), second(rhs) { } | ||
|
|
||
| bool operator== (const UInt128 & rhs) const { return second == rhs.second && first == rhs.first; } |
There was a problem hiding this comment.
This all could be written with tuple comparison: std::tie or std::forward_as_tuple.
dbms/src/Common/UInt128.h
Outdated
| UInt128() = default; | ||
| explicit UInt128(const UInt64 rhs) : first(0), second(rhs) { } | ||
|
|
||
| bool operator== (const UInt128 & rhs) const { return second == rhs.second && first == rhs.first; } |
There was a problem hiding this comment.
Almost no reason to pass by reference here (and also no harm as compiler will remove reference as one of optimization passes) - it's just about style preferences.
dbms/src/Common/UInt128.h
Outdated
| bool operator<= (const UInt128 & rhs) const { return (first == rhs.first) ? second <= rhs.second : first < rhs.first; } | ||
| bool operator< (const UInt128 & rhs) const { return (first == rhs.first) ? second < rhs.second : first < rhs.first; } | ||
|
|
||
| /** Type stored in the database will contains no more than 64 bits at the moment, don't need |
dbms/src/Common/UInt128.h
Outdated
| template<typename T> bool operator< (const T rhs) const { return first == 0 && static_cast<T>(second) > rhs; } | ||
|
|
||
| template<typename T> explicit operator T() const { return static_cast<T>(second); } | ||
| operator UInt128() const { return *this; } |
dbms/src/Common/UInt128.h
Outdated
| { | ||
| size_t operator()(const DB::UInt128 & u) const | ||
| { | ||
| return std::hash<DB::UInt64>()(u.first) ^ std::hash<DB::UInt64>()(u.second); |
There was a problem hiding this comment.
Because it is broken when first == second.
The right thing is at least something like
hash(hash(first) ^ second),
but it is also not suitable, because std::hash is trivial for integers,
and you should use something more viable like CityHash_v1_0_2::Hash128to64.
| #endif | ||
| } | ||
|
|
||
| /// Overload hash for type casting |
There was a problem hiding this comment.
What type casting do you mean?
There was a problem hiding this comment.
Inside the ToIntMonotonicity on Functions/FunctionsConversion.h, they instantiate the DataTypeEnum class who have a std::unordered_map who have the template type ToIntMonotonicity as a key. (so for example who interest us, if we instantiate ToIntMonotonicity with an UInt128, it will instantiate a std::unordered_map who will have a UInt128 as a key, and std::unordered_map need to have the operator()() available)
So when I talk about type casting, it's most the conversion functions. (toInt8, toUInt32, etc...)
dbms/src/Core/AccurateComparison.h
Outdated
| std::is_integral<TInt>::value && std::is_integral<TUInt>::value && | ||
| std::is_signed<TInt>::value && std::is_unsigned<TUInt>::value>; | ||
| std::is_integral<TInt>::value && (std::is_integral<TUInt>::value || std::is_same<TUInt, DB::UInt128>::value) && | ||
| std::is_signed<TInt>::value && (std::is_unsigned<TUInt>::value || std::is_same<TUInt, DB::UInt128>::value)>; |
There was a problem hiding this comment.
You have already specialized std::is_unsinged.
(What about specializing std::is_integral?)
There was a problem hiding this comment.
Specializing std::is_integral was my first idea. But the problem it's in some part of the code they are using std::is_integral.
For example, in DataTypes/DataTypeNumberBase.cpp, we have the readTextUnsafeIfIntegral function member who call the readIntTextUnsafe function member (IO/ReadHelpers.h).
The problem now it's this function call fews operators who are not overloaded like operator*= or operator+= which is a little bit complicated because the UInt128 only emulate fews functions that a natural type have (principally comparison operator) but the arithmetic operator aren't overloaded because it is not really an unsigned integral but only a simulation of it.
A solution I can suggest it's maybe to overload std::is_integral (with value as true) and std::is_arithmetic(as false) and check in fews functions who need the arithmetic operator not only std::is_integral but also std::is_arithmetic, but maybe it's a little bit overkill for a single type ?
dbms/src/Core/AccurateComparison.h
Outdated
| template <typename TAInt, typename TAFloat> | ||
| using bool_if_double_can_be_used = std::enable_if_t< | ||
| std::is_integral<TAInt>::value && (sizeof(TAInt) <= 4) && std::is_floating_point<TAFloat>::value, | ||
| std::is_integral<TAInt>::value && (sizeof(TAInt) <= 4 || sizeof(DB::Uuid) == sizeof(TAInt)) && std::is_floating_point<TAFloat>::value, |
There was a problem hiding this comment.
Looks like this have no sense.
dbms/src/Core/AccurateComparison.h
Outdated
| template<> | ||
| inline bool greaterOp<DB::Float64, DB::UInt128>(DB::Float64 f, DB::UInt128 u) | ||
| { | ||
| return f > u && greaterOp(f, static_cast<UInt64>(u)); |
There was a problem hiding this comment.
Seems incorrect.
BTW, do we support comparison of Float64 and UUID?
There was a problem hiding this comment.
It isn't support because it doesn't have logical sense to compare Float64 (or every numeric type) with an UUID (as same like a Date/DateTime with a numeric type) so we have an Illegal types of arguments error if we try to compare an UUID and a numerical type.
But even if the code can't be call using the client, we need to implement it because the compiler require it.
dbms/src/Core/Field.cpp
Outdated
| x.push_back(value); | ||
| break; | ||
| } | ||
| case Field::Types::Uuid: |
There was a problem hiding this comment.
Looks like Field is generic enough to have type be named UInt128 instead of UUID.
(The case: what if anyone would like to introduce another type for IPv6?)
|
So, little erratum, just waited to fix all the things for telling this, my machine is in little endian (and would be actually easier to this case if it was in little indian) and not big endian, sorry for the misunderstood. :) Ok, so everything was fix, Because arithmetic operator aren't implemented, and for making |
|
Ok. I need about two days for further review. |
|
Now we have two methods |
Implemented UUID type into clickhouse, the goal of this PR is to have a compromise between being fast and human readable without being obligate to do a little hack with a FixedString(16) and using UuidNumToString() function. This implementation (try) to not using that much of memory, fast enough and human readable.
The type used in memory is the
UInt128fromCommon/UInt128with fews modifications on the type, comparison and cast operator were overloaded to act like a numeric type. And for this reason the POCO::UUID and boost::uuid type weren't used. The _int128 type wasn't choose for ensure compatibility between compilers. The benefits of using two unsigned long long is to have only two maximum comparisons instead of a maximum of 16 for a FixedString.The function toUUID was also implemented.
For the following test, we will use this table:
Insertion works with uuid and string works:
INSERT INTO test(uuid) VALUES('c8b40466-600b-11e7-907b-a6006ad3dba0');INSERT INTO test(uuid) VALUES(toUUID('f9713452-600b-11e7-907b-a6006ad3dba0'));INSERT INTO test(uuid) VALUES(toUUID(1));Will display with a
SELECT * FROM test;Comparison operator were overloaded, so you can order the query
SELECT * FROM test ORDER BY uuid ASC;SELECT * FROM test ORDER BY uuid DESC;You can also use the comparison like this
SELECT * FROM test WHERE uuid = 'c8b40466-600b-11e7-907b-a6006ad3dba0';SELECT * FROM test WHERE 'c8b40466-600b-11e7-907b-a6006ad3dba0' = uuid;And it will return
c8b40466-600b-11e7-907b-a6006ad3dba0Also, multiples casting work, like
SELECT toInt8(toUUID(1))will return1In the serialization and deserialization,
snprintfandstrtoullare used for ensure the quality of comparison and casting. my machine is in big endian so when casting a char and using memcpy would result something like00000000-0000-0000-1000-00000000000and the operationff000000-0000-0000-0000-00000000000 < 00000000-0000-00ff-0000-00000000000would return true if we used directly memcpy for writing the UUID to theUInt128type. (but can be changed when gcc or clang would implementstd::from_charsandstd::to_charsfunction in c++17)Hoping I was clear enough in the PR, quite to be too much into the details in the implementation. :)
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en