Create a netloc function for extracting network location #11356
vitlibar merged 6 commits into ClickHouse:master
Related to #10357
src/Functions/URL/netloc.h
Outdated
    if (end == pos)
        return;

    /// Strings are zero-terminated.
FixedString is not zero-terminated, so pos[1] may actually be the first character of the next URL.
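To illustrate the concern, here is a minimal sketch of a bounds-checked lookahead. The helper name `startsWithDoubleSlash` is hypothetical and is not the code from this PR; it only shows that both bytes should be checked against `end` before reading `pos[1]`.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the PR): returns true when the range
// [pos, end) starts with "//". The length is checked first, so a value
// that ends right at the first slash is never read past its boundary,
// even when the next value in the column starts with '/'.
inline bool startsWithDoubleSlash(const char * pos, const char * end)
{
    return end - pos >= 2 && pos[0] == '/' && pos[1] == '/';
}
```

With two FixedString(6) values stored back to back as `http:/` and `/next`, an unchecked `pos[1]` at the last byte of the first value would see the `/` of the second value; the length check above rejects that case.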
src/Functions/URL/netloc.h
Outdated
    pos = find_first_symbols<'/', '?'>(pos + 2, end);
    if (end == pos)
        return;
    res_size = pos - res_data;
AFAICS there are no checks against restricted symbols, like the ones in domain.h:
ClickHouse/src/Functions/URL/domain.h
Line 51 in 58786f9
Also, maybe it would be better to add a function that extracts userinfo instead, so that netloc() becomes concat(userinfo(url), domain(url)) (yes, that does some extra work)?
It's probably unlikely that someone stores URLs in a FixedString and that one of those values also ends right at the first slash. However, checking the boundaries is more correct.
@azat Alright, every point that was mentioned should be fixed now. Special characters are now supported depending on the context (username, password, etc.).
    static size_t getReserveLengthForElement() { return 15; }

    static inline StringRef getNetworkLocation(const char * data, size_t size)
    {
Looks like a copy-paste of getURLHost (if I'm not missing anything).
Maybe getURLHost could accept a userinfo parameter and return the hostname with or without userinfo based on it (and then every getNetworkLocation call could be replaced with getURLHost(userinfo=true))?
UPD: something should also be done about the port. It can be parsed separately after getURLHost; see the port() function, which uses getURLHost to advance the pointer up to the port and then parses the port.
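The "parse the port separately after the host" idea could look roughly like the sketch below. This is not the actual port() implementation; `parsePortSketch` is a hypothetical name, and the input is assumed to be whatever remains of the URL right after the host.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical sketch: `after_host` is assumed to be the rest of the URL
// immediately following the host, e.g. ":8080/path".
inline std::optional<std::uint16_t> parsePortSketch(std::string_view after_host)
{
    // A port is only present when the host is directly followed by ':'.
    if (after_host.empty() || after_host.front() != ':')
        return std::nullopt;
    after_host.remove_prefix(1);

    std::uint16_t port = 0;
    const char * first = after_host.data();
    const char * last = first + after_host.size();
    // from_chars stops at the first non-digit and reports out-of-range values.
    const auto [ptr, ec] = std::from_chars(first, last, port);
    if (ec != std::errc() || ptr == first)
        return std::nullopt;
    return port;
}
```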
There are some differences from getURLHost, even though that function was indeed used as a base.
In getURLHost, we stop as soon as we hit a ?, /, @ or :, but these characters can be perfectly valid, for example inside a password.
We could possibly reuse it, but we should keep in mind that the domain function would become significantly slower: when the URL carries no credentials, we would need to parse the URL entirely, because we cannot tell whether we are currently parsing a user/password or the domain.
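The difference can be sketched as follows. These helpers are illustrative only and are not the PR's code; as a simplifying assumption they end the network location at the first `/`, `?` or `#` after `//` (RFC 3986 style), and they do not handle IPv6 literals.

```cpp
#include <string_view>

// Illustrative only: the network location is taken as everything between
// "//" and the next '/', '?' or '#', so userinfo and port are kept.
inline std::string_view netlocSketch(std::string_view url)
{
    const auto slashes = url.find("//");
    if (slashes == std::string_view::npos)
        return {};
    const auto start = slashes + 2;
    const auto stop = url.find_first_of("/?#", start);
    if (stop == std::string_view::npos)
        return url.substr(start);
    return url.substr(start, stop - start);
}

// Illustrative only: a host-only extractor has to look at the whole network
// location to find '@', because ':' may belong to "user:password" and not
// to the "host:port" separator.
inline std::string_view hostSketch(std::string_view url)
{
    std::string_view loc = netlocSketch(url);
    const auto at = loc.rfind('@');
    if (at != std::string_view::npos)
        loc.remove_prefix(at + 1);
    const auto colon = loc.find(':');
    if (colon != std::string_view::npos)
        loc = loc.substr(0, colon);
    return loc;
}
```

For `https://user:pass@example.com:8080/path?q=1`, the first helper keeps `user:pass@example.com:8080`, while the second has to scan past the credentials before it can return `example.com`.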
On getURLHost, we stop when we have a ?, /, @ or :, but this kind of information can be totally considered as valid, when we have a password for example.
Forgot about this, ok
I guess that right now you need exactly that function and using
On my side, it was mostly to help the ClickHouse team a bit with this issue: #10357. Right now, at my current company, we don't really need this kind of function, or at least it would only be a simple enhancement for us.
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add netloc function for extracting network location, similar to urlparse(url).netloc in Python