Skip to content

Implement natively uuid type in Clickhouse#945

Merged
alexey-milovidov merged 8 commits intoClickHouse:masterfrom
YiuRULE:master
Jul 10, 2017
Merged

Implement natively uuid type in Clickhouse#945
alexey-milovidov merged 8 commits intoClickHouse:masterfrom
YiuRULE:master

Conversation

@YiuRULE
Copy link
Contributor

@YiuRULE YiuRULE commented Jul 3, 2017

Implemented UUID type into clickhouse, the goal of this PR is to have a compromise between being fast and human readable without being obligate to do a little hack with a FixedString(16) and using UuidNumToString() function. This implementation (try) to not using that much of memory, fast enough and human readable.

The type used in memory is the UInt128 from Common/UInt128 with fews modifications on the type, comparison and cast operator were overloaded to act like a numeric type. And for this reason the POCO::UUID and boost::uuid type weren't used. The _int128 type wasn't choose for ensure compatibility between compilers. The benefits of using two unsigned long long is to have only two maximum comparisons instead of a maximum of 16 for a FixedString.

The function toUUID was also implemented.

For the following test, we will use this table:

create table test (
   uuid UUID
) Engine = Memory;

Insertion works with uuid and string works:
INSERT INTO test(uuid) VALUES('c8b40466-600b-11e7-907b-a6006ad3dba0');
INSERT INTO test(uuid) VALUES(toUUID('f9713452-600b-11e7-907b-a6006ad3dba0'));
INSERT INTO test(uuid) VALUES(toUUID(1));

Will display with a SELECT * FROM test;

c8b40466-600b-11e7-907b-a6006ad3dba0
f9713452-600b-11e7-907b-a6006ad3dba0
00000000-0000-0000-0000-000000000001 

Comparison operator were overloaded, so you can order the query

SELECT * FROM test ORDER BY uuid ASC;

00000000-0000-0000-0000-000000000001 
c8b40466-600b-11e7-907b-a6006ad3dba0
f9713452-600b-11e7-907b-a6006ad3dba0

SELECT * FROM test ORDER BY uuid DESC;

f9713452-600b-11e7-907b-a6006ad3dba0
c8b40466-600b-11e7-907b-a6006ad3dba0
00000000-0000-0000-0000-000000000001

You can also use the comparison like this
SELECT * FROM test WHERE uuid = 'c8b40466-600b-11e7-907b-a6006ad3dba0';
SELECT * FROM test WHERE 'c8b40466-600b-11e7-907b-a6006ad3dba0' = uuid;
And it will return c8b40466-600b-11e7-907b-a6006ad3dba0

Also, multiples casting work, like
SELECT toInt8(toUUID(1)) will return 1

In the serialization and deserialization, snprintf and strtoull are used for ensure the quality of comparison and casting. my machine is in big endian so when casting a char and using memcpy would result something like 00000000-0000-0000-1000-00000000000 and the operation ff000000-0000-0000-0000-00000000000 < 00000000-0000-00ff-0000-00000000000 would return true if we used directly memcpy for writing the UUID to the UInt128 type. (but can be changed when gcc or clang would implement std::from_chars and std::to_chars function in c++17)

Hoping I was clear enough in the PR, quite to be too much into the details in the implementation. :)

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

@robot-metrika-test
Copy link

Can one of the admins verify this patch?

@alexey-milovidov
Copy link
Member

Great!
What about if data type will be called UUID (not Uuid) for style consistency (and in all other mentions)?

@alexey-milovidov
Copy link
Member

autotests

@alexey-milovidov
Copy link
Member

my machine is in big endian

I wonder, does ClickHouse work on this machine? There are quite a lot of code, that rely on little endianess.

@@ -25,19 +24,43 @@ struct UInt128
UInt64 first;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming this fields to high and low (or, better, low and high?) will be more convenient.

#endif

UInt128 & operator= (const UInt64 rhs) { first = rhs; second = 0; return *this; }
UInt128 & operator= (const UInt64 rhs) { first = 0; second = rhs; return *this; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to check, where this method is used, just in case...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I just take a look that this method was only used in the file Common/HashTable/HashTable.h in the ZeroTraits::set method who initialize to zero so it should be safe. :)

(most of the code call directly the first and second attributes)

UInt128() = default;
explicit UInt128(const UInt64 rhs) : first(0), second(rhs) { }

bool operator== (const UInt128 & rhs) const { return second == rhs.second && first == rhs.first; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all could be written with tuple comparison: std::tie or std::forward_as_tuple.

UInt128() = default;
explicit UInt128(const UInt64 rhs) : first(0), second(rhs) { }

bool operator== (const UInt128 & rhs) const { return second == rhs.second && first == rhs.first; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost no reason to pass by reference here (and also no harm as compiler will remove reference as one of optimization passes) - it's just about style preferences.

bool operator<= (const UInt128 & rhs) const { return (first == rhs.first) ? second <= rhs.second : first < rhs.first; }
bool operator< (const UInt128 & rhs) const { return (first == rhs.first) ? second < rhs.second : first < rhs.first; }

/** Type stored in the database will contains no more than 64 bits at the moment, don't need
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little typo.

template<typename T> bool operator< (const T rhs) const { return first == 0 && static_cast<T>(second) > rhs; }

template<typename T> explicit operator T() const { return static_cast<T>(second); }
operator UInt128() const { return *this; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

{
size_t operator()(const DB::UInt128 & u) const
{
return std::hash<DB::UInt64>()(u.first) ^ std::hash<DB::UInt64>()(u.second);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely bad.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is broken when first == second.

The right thing is at least something like
hash(hash(first) ^ second),

but it is also not suitable, because std::hash is trivial for integers,
and you should use something more viable like CityHash_v1_0_2::Hash128to64.

#endif
}

/// Overload hash for type casting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What type casting do you mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside the ToIntMonotonicity on Functions/FunctionsConversion.h, they instantiate the DataTypeEnum class who have a std::unordered_map who have the template type ToIntMonotonicity as a key. (so for example who interest us, if we instantiate ToIntMonotonicity with an UInt128, it will instantiate a std::unordered_map who will have a UInt128 as a key, and std::unordered_map need to have the operator()() available)

So when I talk about type casting, it's most the conversion functions. (toInt8, toUInt32, etc...)

std::is_integral<TInt>::value && std::is_integral<TUInt>::value &&
std::is_signed<TInt>::value && std::is_unsigned<TUInt>::value>;
std::is_integral<TInt>::value && (std::is_integral<TUInt>::value || std::is_same<TUInt, DB::UInt128>::value) &&
std::is_signed<TInt>::value && (std::is_unsigned<TUInt>::value || std::is_same<TUInt, DB::UInt128>::value)>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have already specialized std::is_unsinged.
(What about specializing std::is_integral?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specializing std::is_integral was my first idea. But the problem it's in some part of the code they are using std::is_integral.

For example, in DataTypes/DataTypeNumberBase.cpp, we have the readTextUnsafeIfIntegral function member who call the readIntTextUnsafe function member (IO/ReadHelpers.h).

The problem now it's this function call fews operators who are not overloaded like operator*= or operator+= which is a little bit complicated because the UInt128 only emulate fews functions that a natural type have (principally comparison operator) but the arithmetic operator aren't overloaded because it is not really an unsigned integral but only a simulation of it.

A solution I can suggest it's maybe to overload std::is_integral (with value as true) and std::is_arithmetic(as false) and check in fews functions who need the arithmetic operator not only std::is_integral but also std::is_arithmetic, but maybe it's a little bit overkill for a single type ?

template <typename TAInt, typename TAFloat>
using bool_if_double_can_be_used = std::enable_if_t<
std::is_integral<TAInt>::value && (sizeof(TAInt) <= 4) && std::is_floating_point<TAFloat>::value,
std::is_integral<TAInt>::value && (sizeof(TAInt) <= 4 || sizeof(DB::Uuid) == sizeof(TAInt)) && std::is_floating_point<TAFloat>::value,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this have no sense.

template<>
inline bool greaterOp<DB::Float64, DB::UInt128>(DB::Float64 f, DB::UInt128 u)
{
return f > u && greaterOp(f, static_cast<UInt64>(u));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems incorrect.
BTW, do we support comparison of Float64 and UUID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't support because it doesn't have logical sense to compare Float64 (or every numeric type) with an UUID (as same like a Date/DateTime with a numeric type) so we have an Illegal types of arguments error if we try to compare an UUID and a numerical type.

But even if the code can't be call using the client, we need to implement it because the compiler require it.

x.push_back(value);
break;
}
case Field::Types::Uuid:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like Field is generic enough to have type be named UInt128 instead of UUID.
(The case: what if anyone would like to introduce another type for IPv6?)

@YiuRULE
Copy link
Contributor Author

YiuRULE commented Jul 6, 2017

So, little erratum, just waited to fix all the things for telling this, my machine is in little endian (and would be actually easier to this case if it was in little indian) and not big endian, sorry for the misunderstood. :)

Ok, so everything was fix, UUID is now a type different of UInt128 using the STRONG_TYPEDEF macro so if we want to create other type than UUID we can use the UInt128 type.

Because arithmetic operator aren't implemented, and for making UInt128 generic, I needed to implement a Read/WriteCSV and Read/WriteText. This code will never be used except if we create a function who return an DataTypeUInt128 but just in case, the code throw an exception. Before that I didn't needed to implement it because UInt128 and UUID was the same type for the compiler.

@alexey-milovidov
Copy link
Member

Ok. I need about two days for further review.

@alexey-milovidov alexey-milovidov merged commit b64b243 into ClickHouse:master Jul 10, 2017
@alexey-milovidov
Copy link
Member

Now we have two methods formatUUID and two methods parseUUID in ReadHelpers, WriteHelpers.
One of them is using strtoull and snprintf, which are very inefficient, and also an excessive copying.

@alexey-milovidov
Copy link
Member

6e93d93

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants