Fix getting file format from globs in file() function#88947
Fix getting file format from globs in file() function#88947vitlibar merged 4 commits intoClickHouse:masterfrom
Conversation
|
Workflow [PR], commit [8e30ae1] Summary: ❌
|
| # import pyarrow as pa | ||
| # id = pa.array(["1"], type=pa.string()) | ||
| # ts = pa.array(["2020-01-01 14:00:00"], type=pa.string()) | ||
| # ts_plus_tz = pa.array(["2020-01-01 14:00:00+00:00"], type=pa.string()) |
There was a problem hiding this comment.
2020-01-01 14:00:00+00:00 is a wrong format for ClickHouse columns of type DateTime64. So the command
INSERT INTO test (ts) SELECT ts_plus_tz FROM file('conversion_to_datetime64_test.parquet')
fails with error message Cannot parse string '2020-01-01 14:00:00+00:00' as DateTime64(9, 'UTC') as expected.
However before this fix the following command
INSERT INTO test (ts) SELECT ts_plus_tz FROM file('conversion_to_datetime64_test.par?uet')
was successful and inserted 1970-01-01 00:00:00.000000000 to the table.
This PR makes such commands using globs fail with error message Cannot parse as expected.
|
Does it fix #88920 ? If so, please mention it to close it. |
Yes. Thanks, I didn't notice they created an issue. |
ca9347f to
09000da
Compare
87d27c3 to
6cba59f
Compare
6cba59f to
5a67ff5
Compare
| if (!path_to_archive.empty()) | ||
| res.archive_info = getArchiveInfo(path_to_archive, filename, user_files_path, context, res.total_bytes_to_read); | ||
| else | ||
| res.paths = getPathsList(filename, user_files_path, context, res.total_bytes_to_read); | ||
|
|
||
| res.with_globs = res.paths.size() > 1; | ||
|
|
||
| if (res.archive_info) | ||
| { | ||
| res.format_from_filenames = FormatFactory::instance().tryGetFormatFromFileName(res.archive_info->path_in_archive); | ||
| } | ||
| else | ||
| { | ||
| for (const String & path : res.paths) | ||
| { | ||
| auto single_file_format = FormatFactory::instance().tryGetFormatFromFileName(path); | ||
| if (!res.format_from_filenames) | ||
| { | ||
| res.format_from_filenames = single_file_format; | ||
| } | ||
| else if (res.format_from_filenames != single_file_format) | ||
| { | ||
| res.format_from_filenames = {}; | ||
| break; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if (!res.paths.empty()) | ||
| res.path_for_partitioned_write = res.paths.front(); | ||
| else | ||
| res.path_for_partitioned_write = filename; | ||
|
|
||
| return res; |
There was a problem hiding this comment.
As a suggestion, instead of setting fields manually, pass them to a private FileSource c-tor:
FileSource(String filename_, Strings paths_, std::optional<ArchiveInfo> archive_info_, size_t total_bytes_to_read_)
: filename(std::move(filename_)), paths(std::move(paths_)), archive_info(std::move(archive_info_)), total_bytes_to_read(total_bytes_to_read_)
{
}
There was a problem hiding this comment.
Honestly I don't really see any benefits of adding such a constructor because since these fields are public and we need to calculate these fields anyway, the constructor looks like just an extra step.
| else | ||
| res.paths = getPathsList(filename, user_files_path, context, res.total_bytes_to_read); | ||
|
|
||
| res.with_globs = res.paths.size() > 1; |
There was a problem hiding this comment.
with_globs is derived from another field and it's cheap so that it can be in the form of a getter instead:
bool withGlobs() const { return paths.size() > 1; }
There was a problem hiding this comment.
Yes, this calculation is simple, however it looks clear why with_globs = (paths.size() > 1) only when we parse FileSource from string. If we made this calculation later it would look strange. For this reason I'd prefer to keep this code as is.
| if (!res.paths.empty()) | ||
| res.path_for_partitioned_write = res.paths.front(); | ||
| else | ||
| res.path_for_partitioned_write = filename; |
There was a problem hiding this comment.
path_for_partitioned_write can also be in the form of a method
There was a problem hiding this comment.
Yes, this calculation is simple, however it looks clear why path_for_partitioned_write = res.paths.front() (or filename) only when we parse FileSource from string. If we made this calculation later it would look strange. For this reason I'd prefer to keep this code as is.
|
CI failures are unrelated: |
99a5825
|
This change actually breaks |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix getting file format from globs in file() function. Resolves #88920