Skip to content

Create a netloc function for extracting network location#11356

Merged
vitlibar merged 6 commits intoClickHouse:masterfrom
PerformanceVision:netloc_function
Jun 5, 2020
Merged

Create a netloc function for extracting network location#11356
vitlibar merged 6 commits intoClickHouse:masterfrom
PerformanceVision:netloc_function

Conversation

@YiuRULE
Copy link
Contributor

@YiuRULE YiuRULE commented Jun 2, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add netloc function for extracting network location, similar to urlparse(url), netloc in python

@YiuRULE
Copy link
Contributor Author

YiuRULE commented Jun 2, 2020

Related to #10357

@blinkov blinkov added doc-alert pr-feature Pull request with new product feature labels Jun 2, 2020
@vitlibar vitlibar self-assigned this Jun 2, 2020
if (end == pos)
return;

/// Strings are zero-terminated.
Copy link
Member

@vitlibar vitlibar Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FixedString is not zero-terminated, so pos[1] can possibly be the first character of next url.

pos = find_first_symbols<'/', '?'>(pos + 2, end);
if (end == pos)
return;
res_size = pos - res_data;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS there is no checks against restricted symbols, like in domain.h:

case ' ': /// restricted symbols

Also maybe it will be better to add function to extract userinfo instead and netloc() will be concat(userinfo(url), domain(url)) (yep, will do extra job) ?

Copy link
Member

@vitlibar vitlibar Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's maybe unlikely that someone will use FixedString for keeping URLs and one of those FixedString will end with the first slash at the same time. However it's more correct to check boundaries.

@YiuRULE
Copy link
Contributor Author

YiuRULE commented Jun 4, 2020

@azat Alright, fixed normally every points who has been mentioned, we have now a special character supports, depending of the context (if it's an username, password, etc...)

static size_t getReserveLengthForElement() { return 15; }

static inline StringRef getNetworkLocation(const char * data, size_t size)
{
Copy link
Member

@azat azat Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like copy-pasted getURLHost (if I'm not missing anything).

Maybe getURLHost can accept userinfo parameter and return hostname with userinfo or not instead based on this (and then all getNetworkLocation can be replaced with getURLHost(userinfo=true)?

UPD: and something should be done with port, this can be parsed separatelly after getURLHost, see port() function, it uses getURLHost for seeking pointer up to the port and then parse the port

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some difference between the getURLHost, even if indeed the function was used as a base.

On getURLHost, we stop when we have a ?, /, @ or :, but this kind of information can be totally considered as valid, when we have a password for example.

We could possibly use it, but we should have on mind than the domain function will be significantly slower, if we doesn't have any identification on the url, we would need to parse entirely the URL as we cannot detect if we are currently parsing an user/password or the domain

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On getURLHost, we stop when we have a ?, /, @ or :, but this kind of information can be totally considered as valid, when we have a password for example.

Forgot about this, ok

@azat
Copy link
Member

azat commented Jun 4, 2020

Also maybe it will be better to add function to extract userinfo instead and netloc() will be concat(userinfo(url), domain(url)) (yep, will do extra job) ?

I guess that right now you need exactly that function and using concat will be slower?

@YiuRULE
Copy link
Contributor Author

YiuRULE commented Jun 4, 2020

Also maybe it will be better to add function to extract userinfo instead and netloc() will be concat(userinfo(url), domain(url)) (yep, will do extra job) ?

I guess that right now you need exactly that function and using concat will be slower?

On my side, it was was mostly for helping a bit the ClickHouse team for this issue: #10357

Right now, on my current company, we don't really need this kind of function, or at least would only be a simple enhancement for us.

@vitlibar vitlibar merged commit 92ae3e1 into ClickHouse:master Jun 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants