Implement natively uuid type in Clickhouse by YiuRULE · Pull Request #945 · ClickHouse/ClickHouse

YiuRULE · 2017-07-03T16:48:24Z

Implemented UUID type into clickhouse, the goal of this PR is to have a compromise between being fast and human readable without being obligate to do a little hack with a FixedString(16) and using UuidNumToString() function. This implementation (try) to not using that much of memory, fast enough and human readable.

The type used in memory is the UInt128 from Common/UInt128 with fews modifications on the type, comparison and cast operator were overloaded to act like a numeric type. And for this reason the POCO::UUID and boost::uuid type weren't used. The _int128 type wasn't choose for ensure compatibility between compilers. The benefits of using two unsigned long long is to have only two maximum comparisons instead of a maximum of 16 for a FixedString.

The function toUUID was also implemented.

For the following test, we will use this table:

create table test (
   uuid UUID
) Engine = Memory;

Insertion works with uuid and string works:
INSERT INTO test(uuid) VALUES('c8b40466-600b-11e7-907b-a6006ad3dba0');
INSERT INTO test(uuid) VALUES(toUUID('f9713452-600b-11e7-907b-a6006ad3dba0'));
INSERT INTO test(uuid) VALUES(toUUID(1));

Will display with a SELECT * FROM test;

c8b40466-600b-11e7-907b-a6006ad3dba0
f9713452-600b-11e7-907b-a6006ad3dba0
00000000-0000-0000-0000-000000000001

Comparison operator were overloaded, so you can order the query

SELECT * FROM test ORDER BY uuid ASC;

00000000-0000-0000-0000-000000000001 
c8b40466-600b-11e7-907b-a6006ad3dba0
f9713452-600b-11e7-907b-a6006ad3dba0

SELECT * FROM test ORDER BY uuid DESC;

f9713452-600b-11e7-907b-a6006ad3dba0
c8b40466-600b-11e7-907b-a6006ad3dba0
00000000-0000-0000-0000-000000000001

You can also use the comparison like this
SELECT * FROM test WHERE uuid = 'c8b40466-600b-11e7-907b-a6006ad3dba0';
SELECT * FROM test WHERE 'c8b40466-600b-11e7-907b-a6006ad3dba0' = uuid;
And it will return c8b40466-600b-11e7-907b-a6006ad3dba0

Also, multiples casting work, like
SELECT toInt8(toUUID(1)) will return 1

In the serialization and deserialization, snprintf and strtoull are used for ensure the quality of comparison and casting. my machine is in big endian so when casting a char and using memcpy would result something like 00000000-0000-0000-1000-00000000000 and the operation ff000000-0000-0000-0000-00000000000 < 00000000-0000-00ff-0000-00000000000 would return true if we used directly memcpy for writing the UUID to the UInt128 type. (but can be changed when gcc or clang would implement std::from_chars and std::to_chars function in c++17)

Hoping I was clear enough in the PR, quite to be too much into the details in the implementation. :)

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

robot-metrika-test · 2017-07-03T16:49:03Z

Can one of the admins verify this patch?

alexey-milovidov · 2017-07-03T17:42:09Z

Great!
What about if data type will be called UUID (not Uuid) for style consistency (and in all other mentions)?

alexey-milovidov · 2017-07-03T17:42:24Z

autotests

alexey-milovidov · 2017-07-03T17:44:32Z

my machine is in big endian

I wonder, does ClickHouse work on this machine? There are quite a lot of code, that rely on little endianess.

alexey-milovidov · 2017-07-03T18:16:08Z

dbms/src/Common/UInt128.h

@@ -25,19 +24,43 @@ struct UInt128
    UInt64 first;


Renaming this fields to high and low (or, better, low and high?) will be more convenient.

alexey-milovidov · 2017-07-03T18:18:58Z

dbms/src/Common/UInt128.h

 #endif

-    UInt128 & operator= (const UInt64 rhs) { first = rhs; second = 0; return *this; }
+    UInt128 & operator= (const UInt64 rhs) { first = 0; second = rhs; return *this; }


We need to check, where this method is used, just in case...

So I just take a look that this method was only used in the file Common/HashTable/HashTable.h in the ZeroTraits::set method who initialize to zero so it should be safe. :)

(most of the code call directly the first and second attributes)

alexey-milovidov · 2017-07-03T21:15:28Z

dbms/src/Common/UInt128.h

+    UInt128() = default;
+    explicit UInt128(const UInt64 rhs) : first(0), second(rhs) { }
+
+    bool operator== (const UInt128 & rhs) const { return second == rhs.second && first == rhs.first; }


This all could be written with tuple comparison: std::tie or std::forward_as_tuple.

alexey-milovidov · 2017-07-03T21:27:18Z

dbms/src/Common/UInt128.h

+    UInt128() = default;
+    explicit UInt128(const UInt64 rhs) : first(0), second(rhs) { }
+
+    bool operator== (const UInt128 & rhs) const { return second == rhs.second && first == rhs.first; }


Almost no reason to pass by reference here (and also no harm as compiler will remove reference as one of optimization passes) - it's just about style preferences.

alexey-milovidov · 2017-07-03T21:28:01Z

dbms/src/Common/UInt128.h

+    bool operator<= (const UInt128 & rhs) const { return (first == rhs.first) ? second <= rhs.second : first < rhs.first; }
+    bool operator<  (const UInt128 & rhs)  const { return (first == rhs.first) ? second < rhs.second : first < rhs.first; }
+
+    /** Type stored in the database will contains no more than 64 bits at the moment, don't need


Little typo.

alexey-milovidov · 2017-07-03T21:28:45Z

dbms/src/Common/UInt128.h

+    template<typename T> bool operator<  (const T rhs)  const { return first == 0 && static_cast<T>(second) >  rhs; }
+
+    template<typename T> explicit operator T() const { return static_cast<T>(second); }
+    operator UInt128() const { return *this; }


Why do we need this?

alexey-milovidov · 2017-07-03T21:29:53Z

dbms/src/Common/UInt128.h

+{
+    size_t operator()(const DB::UInt128 & u) const
+    {
+        return std::hash<DB::UInt64>()(u.first) ^ std::hash<DB::UInt64>()(u.second);


This is extremely bad.

Because it is broken when first == second.

The right thing is at least something like
hash(hash(first) ^ second),

but it is also not suitable, because std::hash is trivial for integers,
and you should use something more viable like CityHash_v1_0_2::Hash128to64.

alexey-milovidov · 2017-07-03T21:33:33Z

dbms/src/Common/UInt128.h

 #endif
+}

+/// Overload hash for type casting


What type casting do you mean?

Inside the ToIntMonotonicity on Functions/FunctionsConversion.h, they instantiate the DataTypeEnum class who have a std::unordered_map who have the template type ToIntMonotonicity as a key. (so for example who interest us, if we instantiate ToIntMonotonicity with an UInt128, it will instantiate a std::unordered_map who will have a UInt128 as a key, and std::unordered_map need to have the operator()() available)

So when I talk about type casting, it's most the conversion functions. (toInt8, toUInt32, etc...)

alexey-milovidov · 2017-07-03T21:34:50Z

dbms/src/Core/AccurateComparison.h

-                            std::is_integral<TInt>::value && std::is_integral<TUInt>::value &&
-                               std::is_signed<TInt>::value && std::is_unsigned<TUInt>::value>;
+                            std::is_integral<TInt>::value && (std::is_integral<TUInt>::value || std::is_same<TUInt, DB::UInt128>::value) &&
+                               std::is_signed<TInt>::value && (std::is_unsigned<TUInt>::value || std::is_same<TUInt, DB::UInt128>::value)>;


You have already specialized std::is_unsinged.
(What about specializing std::is_integral?)

Specializing std::is_integral was my first idea. But the problem it's in some part of the code they are using std::is_integral.

For example, in DataTypes/DataTypeNumberBase.cpp, we have the readTextUnsafeIfIntegral function member who call the readIntTextUnsafe function member (IO/ReadHelpers.h).

The problem now it's this function call fews operators who are not overloaded like operator*= or operator+= which is a little bit complicated because the UInt128 only emulate fews functions that a natural type have (principally comparison operator) but the arithmetic operator aren't overloaded because it is not really an unsigned integral but only a simulation of it.

A solution I can suggest it's maybe to overload std::is_integral (with value as true) and std::is_arithmetic(as false) and check in fews functions who need the arithmetic operator not only std::is_integral but also std::is_arithmetic, but maybe it's a little bit overkill for a single type ?

alexey-milovidov · 2017-07-03T21:35:41Z

dbms/src/Core/AccurateComparison.h

 template <typename TAInt, typename TAFloat>
 using bool_if_double_can_be_used = std::enable_if_t<
-                                        std::is_integral<TAInt>::value && (sizeof(TAInt) <= 4) && std::is_floating_point<TAFloat>::value,
+                                        std::is_integral<TAInt>::value && (sizeof(TAInt) <= 4 || sizeof(DB::Uuid) == sizeof(TAInt)) && std::is_floating_point<TAFloat>::value,


Looks like this have no sense.

alexey-milovidov · 2017-07-03T21:36:53Z

dbms/src/Core/AccurateComparison.h

+template<>
+inline bool greaterOp<DB::Float64, DB::UInt128>(DB::Float64 f, DB::UInt128 u)
+{
+    return f > u && greaterOp(f, static_cast<UInt64>(u));


Seems incorrect.
BTW, do we support comparison of Float64 and UUID?

It isn't support because it doesn't have logical sense to compare Float64 (or every numeric type) with an UUID (as same like a Date/DateTime with a numeric type) so we have an Illegal types of arguments error if we try to compare an UUID and a numerical type.

But even if the code can't be call using the client, we need to implement it because the compiler require it.

alexey-milovidov · 2017-07-03T21:38:07Z

dbms/src/Core/Field.cpp

                    x.push_back(value);
                    break;
                }
+                case Field::Types::Uuid:


Looks like Field is generic enough to have type be named UInt128 instead of UUID.
(The case: what if anyone would like to introduce another type for IPv6?)

YiuRULE · 2017-07-06T15:21:59Z

So, little erratum, just waited to fix all the things for telling this, my machine is in little endian (and would be actually easier to this case if it was in little indian) and not big endian, sorry for the misunderstood. :)

Ok, so everything was fix, UUID is now a type different of UInt128 using the STRONG_TYPEDEF macro so if we want to create other type than UUID we can use the UInt128 type.

Because arithmetic operator aren't implemented, and for making UInt128 generic, I needed to implement a Read/WriteCSV and Read/WriteText. This code will never be used except if we create a function who return an DataTypeUInt128 but just in case, the code throw an exception. Before that I didn't needed to implement it because UInt128 and UUID was the same type for the compiler.

alexey-milovidov · 2017-07-07T06:25:50Z

Ok. I need about two days for further review.

alexey-milovidov · 2017-07-10T04:08:37Z

Now we have two methods formatUUID and two methods parseUUID in ReadHelpers, WriteHelpers.
One of them is using strtoull and snprintf, which are very inefficient, and also an excessive copying.

alexey-milovidov · 2021-04-04T22:40:35Z

6e93d93

YiuRULE added 3 commits June 30, 2017 22:01

Implement UUID type and toUuid function

1231952

sync master

525851e

Remove mention of UUID type for dictonnary

d6ab064

alexey-milovidov requested changes Jul 3, 2017

View reviewed changes

alexey-milovidov reviewed Jul 3, 2017

View reviewed changes

YiuRULE added 5 commits July 4, 2017 12:42

Change Uuid to UUID datatype and make UInt128 a generic type

cba9815

Rename field in UInt128

aeeb81f

Adapt UInt128 operator

b907349

Fix comparison of UInt128 to Float64

c6a5083

Make UUID as a strong type

2ec96fe

alexey-milovidov merged commit b64b243 into ClickHouse:master Jul 10, 2017

YiuRULE mentioned this pull request Jul 14, 2017

Improve performance for deserialization/serialization of UUID type #985

Merged

alexey-milovidov mentioned this pull request Mar 16, 2024

Split CastOverloadResolver translation unit #61114

Merged

Conversation

YiuRULE commented Jul 3, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

robot-metrika-test commented Jul 3, 2017

Uh oh!

alexey-milovidov commented Jul 3, 2017

Uh oh!

alexey-milovidov commented Jul 3, 2017

Uh oh!

alexey-milovidov commented Jul 3, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

YiuRULE commented Jul 6, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexey-milovidov commented Jul 7, 2017

Uh oh!

alexey-milovidov commented Jul 10, 2017

Uh oh!

alexey-milovidov commented Apr 4, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

YiuRULE commented Jul 3, 2017 •

edited

Loading

YiuRULE commented Jul 6, 2017 •

edited

Loading