Skip to content

Bug fix + changes in extractKeyValuePairs parsing logic#80657

Merged
GrigoryPervakov merged 12 commits intoClickHouse:masterfrom
arthurpassos:fixes_extract_kvp
Jun 17, 2025
Merged

Bug fix + changes in extractKeyValuePairs parsing logic#80657
GrigoryPervakov merged 12 commits intoClickHouse:masterfrom
arthurpassos:fixes_extract_kvp

Conversation

@arthurpassos
Copy link
Contributor

@arthurpassos arthurpassos commented May 21, 2025

Changelog category (leave one):

  • Backward Incompatible Change

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Changes in this PR:

  1. Introduce a new argument unexpected_quoting_character_strategy to the extractKeyValuePairs function that controls what happens when a quoting_character is unexpectedly found when reading a non quoted key or value. The value can be one of: invalid, accept or promote. Invalid will discard the key and go back to waiting key state. Accept will treat it as part of the key. Promote will discard previous character and start parsing as a quoted key. I believe the default behavior (and maybe the only behavior) should be INVALID, but in 2023 on Add transition from reading key to reading quoted key when double quotes are found #56423 it was implemented the PROMOTE logic. So, to make it kind of backwards compatible, I created this setting.
  2. After parsing a quoted value, only parse the next key if a pair delimiter is found

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@arthurpassos arthurpassos marked this pull request as ready for review May 28, 2025 11:30
@arthurpassos arthurpassos changed the title Changes in extractKeyValuePairs parsing logic Bug fix + changes in extractKeyValuePairs parsing logic May 28, 2025
@clickhouse-gh
Copy link
Contributor

clickhouse-gh bot commented May 29, 2025

Workflow [PR], commit [6b8b505]

@clickhouse-gh clickhouse-gh bot added the pr-backward-incompatible Pull request with backwards incompatible changes label May 29, 2025
@GrigoryPervakov GrigoryPervakov added the can be tested Allows running workflows for external contributors label May 30, 2025
@arthurpassos
Copy link
Contributor Author

@GrigoryPervakov Hi. Do you happen to have time to review this PR?

{
const auto unexpected_quoting_character_strategy_lower = Poco::toLower(unexpected_quoting_character_strategy);

if (unexpected_quoting_character_strategy_lower == "accept")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using magic_enum - its fun.
In CH it is 'covered' by EnumReflection.h
magic_enum::enum_castConfiguration::UnexpectedQuotingCharacterStrategy(unexpected_quoting_character_strategy, magic_enum::case_insensitive);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't know it was included in ClickHouse. I remember using magic_enum heavily in ~2018/2019. That's cool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


if (parsed_arguments.unexpected_quoting_character_strategy)
{
builder.withUnexpectedQuotingCharacterStrategy(parsed_arguments.unexpected_quoting_character_strategy->getDataAt(0).toString());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to assert this is not empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is not a problem. This code is only executed in case the argument has been specified, and empty string will work just fine.

└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

unexpected_quoting_character_strategy examples:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think good to add examples with string like
"abc:'5',def":'7'
Can we get single key-value pair with key abc:'5',def?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on what's your quoting character. Is it " or ' or something else? The default is ".

If you are relying on the defaults, it does not really matter what's the value of unexpected_quoting_character_strategy because anything inside a quoted key is valid.

You'll end up with:

Input: "abc:'5',def":'7'
Output: {'abc:'5',def', '7'}

- `key_value_delimiter` - Single character delimiting keys and values. Defaults to `:`. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
- `pair_delimiters` - Set of character delimiting pairs. Defaults to ` `, `,` and `;`. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
- `quoting_character` - Single character used as quoting character. Defaults to `"`. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
- `unexpected_quoting_character_strategy` - Strategy to handle quoting characters in unexpected places during `read_key` and `read_value` phase. Possible values: "invalid", "accept" and "promote". Invalid will discard key/value and transition back to `WAITING_KEY` state. Accept will treat it as a normal character. Promote will transition to `READ_QUOTED_{KEY/VALUE}` state and start from next character.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we more explicitly say what it is by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but I guess that values should be in backticks, like `Invalid` will discard ...

data_column,
key_value_delimiter,
pair_delimiters,
quoting_character,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about trailing comma

Copy link
Member

@GrigoryPervakov GrigoryPervakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, except I'm not sure we need to introduce backward incompatible change if it could be avoided keeping feature without changes

else if (isQuotingCharacter(*p))
{
return {next_pos, State::READING_QUOTED_KEY};
if (configuration.unexpected_quoting_character_strategy == Configuration::UnexpectedQuotingCharacterStrategy::INVALID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe switch(configuration.unexpected_quoting_character_strategy) would look better

char quoting_character = '"';
std::vector<char> item_delimiters = {' ', ',', ';'};
uint64_t max_number_of_pairs = std::numeric_limits<uint64_t>::max();
extractKV::Configuration::UnexpectedQuotingCharacterStrategy unexpected_quoting_character_strategy = extractKV::Configuration::UnexpectedQuotingCharacterStrategy::INVALID;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right that this INVALID default is the only reason for backward incompatibility of the PR?
And if we keep it PROMOTE it could be improvement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct. It's a choice we gotta make. IMO, INVALID is what should be the default, but unfortunately it is not the behavior that was implemented previously

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. But it worked well enough, and after this patch, some queries could be broken. Maybe we introduce a new setting like extract_key_value_pairs_default_unexpected_quoting_character_strategy and keep the old behaviour in the changes history.

I don't think we need to break it now. Let's keep the old behaviour by default or make it safe to upgrade.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I made PROMOTE the default behavior

@GrigoryPervakov GrigoryPervakov self-assigned this Jun 13, 2025
@clickhouse-gh
Copy link
Contributor

clickhouse-gh bot commented Jun 16, 2025

Workflow [PR], commit [afafde8]

Summary:

job_name test_name status info comment
Stateless tests (amd_asan, distributed plan, 1/2) failure
02922_server_exit_code FAIL
Stateless tests (amd_binary, ParallelReplicas, s3 storage) failure
03403_read_in_order_streams_memory_usage FAIL
Stateless tests (amd_coverage,1/6) dropped
Stateless tests (amd_coverage,2/6) dropped
Stateless tests (amd_coverage,3/6) dropped
Stateless tests (amd_coverage,4/6) dropped
Stateless tests (amd_coverage,5/6) dropped
Stateless tests (amd_coverage,6/6) dropped

@arthurpassos
Copy link
Contributor Author

03403_read_in_order_streams_memory_usage - Not related and also failed on #80840

02922_server_exit_code - Not related and also failed on #74193

@GrigoryPervakov Can we merge it?

@GrigoryPervakov GrigoryPervakov added this pull request to the merge queue Jun 17, 2025
Merged via the queue into ClickHouse:master with commit 0aa82b3 Jun 17, 2025
118 of 123 checks passed
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 17, 2025
zvonand added a commit to Altinity/ClickHouse that referenced this pull request Jun 18, 2025
…plus_changes_in_extract_kvp_logic

24.8 Partial Backport of ClickHouse#80657 - Wait for pair delimiter after flusing quoted value on extractKeyValuePairs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-backward-incompatible Pull request with backwards incompatible changes pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants