Bug fix + changes in extractKeyValuePairs parsing logic#80657
Bug fix + changes in extractKeyValuePairs parsing logic#80657GrigoryPervakov merged 12 commits intoClickHouse:masterfrom
extractKeyValuePairs parsing logic#80657Conversation
extractKeyValuePairs parsing logicextractKeyValuePairs parsing logic
|
@GrigoryPervakov Hi. Do you happen to have time to review this PR? |
| { | ||
| const auto unexpected_quoting_character_strategy_lower = Poco::toLower(unexpected_quoting_character_strategy); | ||
|
|
||
| if (unexpected_quoting_character_strategy_lower == "accept") |
There was a problem hiding this comment.
Consider using magic_enum - its fun.
In CH it is 'covered' by EnumReflection.h
magic_enum::enum_castConfiguration::UnexpectedQuotingCharacterStrategy(unexpected_quoting_character_strategy, magic_enum::case_insensitive);
There was a problem hiding this comment.
Oh, I didn't know it was included in ClickHouse. I remember using magic_enum heavily in ~2018/2019. That's cool.
|
|
||
| if (parsed_arguments.unexpected_quoting_character_strategy) | ||
| { | ||
| builder.withUnexpectedQuotingCharacterStrategy(parsed_arguments.unexpected_quoting_character_strategy->getDataAt(0).toString()); |
There was a problem hiding this comment.
I need to assert this is not empty
There was a problem hiding this comment.
Actually, this is not a problem. This code is only executed in case the argument has been specified, and empty string will work just fine.
| └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ | ||
| ``` | ||
|
|
||
| unexpected_quoting_character_strategy examples: |
There was a problem hiding this comment.
I think good to add examples with string like
"abc:'5',def":'7'
Can we get single key-value pair with key abc:'5',def?
There was a problem hiding this comment.
It depends on what's your quoting character. Is it " or ' or something else? The default is ".
If you are relying on the defaults, it does not really matter what's the value of unexpected_quoting_character_strategy because anything inside a quoted key is valid.
You'll end up with:
Input: "abc:'5',def":'7'
Output: {'abc:'5',def', '7'}
| - `key_value_delimiter` - Single character delimiting keys and values. Defaults to `:`. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md). | ||
| - `pair_delimiters` - Set of character delimiting pairs. Defaults to ` `, `,` and `;`. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md). | ||
| - `quoting_character` - Single character used as quoting character. Defaults to `"`. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md). | ||
| - `unexpected_quoting_character_strategy` - Strategy to handle quoting characters in unexpected places during `read_key` and `read_value` phase. Possible values: "invalid", "accept" and "promote". Invalid will discard key/value and transition back to `WAITING_KEY` state. Accept will treat it as a normal character. Promote will transition to `READ_QUOTED_{KEY/VALUE}` state and start from next character. |
There was a problem hiding this comment.
Shouldn't we more explicitly say what it is by default?
There was a problem hiding this comment.
Not sure, but I guess that values should be in backticks, like `Invalid` will discard ...
| data_column, | ||
| key_value_delimiter, | ||
| pair_delimiters, | ||
| quoting_character, |
There was a problem hiding this comment.
Not sure about trailing comma
GrigoryPervakov
left a comment
There was a problem hiding this comment.
LGTM, except I'm not sure we need to introduce backward incompatible change if it could be avoided keeping feature without changes
| else if (isQuotingCharacter(*p)) | ||
| { | ||
| return {next_pos, State::READING_QUOTED_KEY}; | ||
| if (configuration.unexpected_quoting_character_strategy == Configuration::UnexpectedQuotingCharacterStrategy::INVALID) |
There was a problem hiding this comment.
nit: Maybe switch(configuration.unexpected_quoting_character_strategy) would look better
| char quoting_character = '"'; | ||
| std::vector<char> item_delimiters = {' ', ',', ';'}; | ||
| uint64_t max_number_of_pairs = std::numeric_limits<uint64_t>::max(); | ||
| extractKV::Configuration::UnexpectedQuotingCharacterStrategy unexpected_quoting_character_strategy = extractKV::Configuration::UnexpectedQuotingCharacterStrategy::INVALID; |
There was a problem hiding this comment.
Am I right that this INVALID default is the only reason for backward incompatibility of the PR?
And if we keep it PROMOTE it could be improvement?
There was a problem hiding this comment.
Yes, you are correct. It's a choice we gotta make. IMO, INVALID is what should be the default, but unfortunately it is not the behavior that was implemented previously
There was a problem hiding this comment.
I see. But it worked well enough, and after this patch, some queries could be broken. Maybe we introduce a new setting like extract_key_value_pairs_default_unexpected_quoting_character_strategy and keep the old behaviour in the changes history.
I don't think we need to break it now. Let's keep the old behaviour by default or make it safe to upgrade.
There was a problem hiding this comment.
Ok, I made PROMOTE the default behavior
875c486 to
6b8b505
Compare
|
Workflow [PR], commit [afafde8] Summary: ❌
|
|
03403_read_in_order_streams_memory_usage - Not related and also failed on #80840 02922_server_exit_code - Not related and also failed on #74193 @GrigoryPervakov Can we merge it? |
0aa82b3
…plus_changes_in_extract_kvp_logic 24.8 Partial Backport of ClickHouse#80657 - Wait for pair delimiter after flusing quoted value on extractKeyValuePairs
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Changes in this PR:
unexpected_quoting_character_strategyto theextractKeyValuePairsfunction that controls what happens when aquoting_characteris unexpectedly found when reading a non quoted key or value. The value can be one of:invalid,acceptorpromote. Invalid will discard the key and go back to waiting key state. Accept will treat it as part of the key. Promote will discard previous character and start parsing as a quoted key. I believe the default behavior (and maybe the only behavior) should beINVALID, but in 2023 on Add transition from reading key to reading quoted key when double quotes are found #56423 it was implemented thePROMOTElogic. So, to make it kind of backwards compatible, I created this setting.Documentation entry for user-facing changes