FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#478
FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#478subrata-ms wants to merge 9 commits intomainfrom
Conversation
📊 Code Coverage Report
Diff CoverageDiff: main...HEAD, staged and unstaged changes
Summary
mssql_python/pybind/ddbc_bindings.cppLines 3127-3138 3127 // the streaming LOB path to retrieve the full
3128 // value.
3129 LOG("SQLGetData: Cannot determine data length "
3130 "(SQL_NO_TOTAL) for column %d (SQL_CHAR), "
! 3131 "falling back to LOB streaming",
3132 i);
! 3133 row.append(FetchLobColumnData(hStmt, i, SQL_C_CHAR, false, false,
! 3134 charEncoding));
3135 // LCOV_EXCL_STOP
3136 } else if (dataLen < 0) {
3137 LOG("SQLGetData: Unexpected negative data length "
3138 "for column %d - dataType=%d, dataLen=%ld",Lines 3274-3285 3274 // the streaming LOB path to retrieve the full
3275 // value.
3276 LOG("SQLGetData: Cannot determine NVARCHAR data "
3277 "length (SQL_NO_TOTAL) for column %d, "
! 3278 "falling back to LOB streaming",
3279 i);
! 3280 row.append(
! 3281 FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false, "utf-16le"));
3282 // LCOV_EXCL_STOP
3283 } else if (dataLen < 0) {
3284 LOG("SQLGetData: Unexpected negative data length "
3285 "for column %d (NVARCHAR) - dataLen=%ld",Lines 4008-4029 4008 //
4009 // Coverage note: This branch is defensive — the ODBC driver
4010 // controls indicator values in bound-column mode, so
4011 // SQL_NO_TOTAL cannot be triggered from Python-level tests.
! 4012 const ColumnInfoExt& noTotalColInfo = columnInfosExt[col - 1];
! 4013 LOG("SQLGetData: SQL_NO_TOTAL for column %d (dataType=%d), "
! 4014 "falling back to LOB streaming",
! 4015 col, noTotalColInfo.dataType);
! 4016 switch (noTotalColInfo.dataType) {
! 4017 case SQL_CHAR:
! 4018 case SQL_VARCHAR:
! 4019 case SQL_LONGVARCHAR:
! 4020 #if defined(__APPLE__) || defined(__linux__)
! 4021 PyList_SET_ITEM(row, col - 1,
! 4022 FetchLobColumnData(hStmt, col, SQL_C_CHAR, false, false,
! 4023 noTotalColInfo.charEncoding)
! 4024 .release()
! 4025 .ptr());
4026 #else
4027 PyList_SET_ITEM(
4028 row, col - 1,
4029 FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")Lines 4029-4066 4029 FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
4030 .release()
4031 .ptr());
4032 #endif
! 4033 break;
! 4034 case SQL_WCHAR:
! 4035 case SQL_WVARCHAR:
! 4036 case SQL_WLONGVARCHAR:
! 4037 case SQL_SS_XML:
! 4038 PyList_SET_ITEM(
! 4039 row, col - 1,
! 4040 FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
! 4041 .release()
! 4042 .ptr());
! 4043 break;
! 4044 case SQL_BINARY:
! 4045 case SQL_VARBINARY:
! 4046 case SQL_LONGVARBINARY:
! 4047 case SQL_SS_UDT:
! 4048 PyList_SET_ITEM(row, col - 1,
! 4049 FetchLobColumnData(hStmt, col, SQL_C_BINARY, false, true)
! 4050 .release()
! 4051 .ptr());
! 4052 break;
! 4053 default:
4054 // For fixed-length types SQL_NO_TOTAL should not
4055 // occur; treat as NULL to avoid undefined behaviour.
! 4056 LOG("SQL_NO_TOTAL for unexpected fixed-type column %d "
! 4057 "(dataType=%d), returning NULL",
! 4058 col, noTotalColInfo.dataType);
! 4059 Py_INCREF(Py_None);
! 4060 PyList_SET_ITEM(row, col - 1, Py_None);
! 4061 break;
! 4062 }
4063 // LCOV_EXCL_STOP
4064 continue;
4065 }📋 Files Needing Attention📉 Files with overall lowest coverage (click to expand)mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.8%
mssql_python.pybind.ddbc_bindings.cpp: 69.7%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.__init__.py: 77.1%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%🔗 Quick Links
|
There was a problem hiding this comment.
Pull request overview
This PR addresses cross-platform inconsistencies when fetching CP1252-collated VARCHAR data by changing the ODBC C-type used for VARCHAR retrieval on Windows, aiming to ensure consistent Unicode (str) results across Windows and Linux/macOS.
Changes:
- On Windows, fetch/bind
VARCHAR(and relatedSQL_CHARtypes) asSQL_C_WCHARand process via the wide-char path to avoid code page guessing and bytes fallback. - On Linux/macOS, clarify/standardize
SQL_C_CHARbuffer sizing assumptions for UTF-8 expansion. - Add a new CP1252 regression test section covering problematic bytes and multiple fetch paths.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
mssql_python/pybind/ddbc_bindings.cpp |
Updates SQLGetData_wrap and batch fetch binding/processing to use SQL_C_WCHAR for VARCHAR on Windows; adjusts buffer sizing logic and LOB detection formatting. |
tests/test_013_encoding_decoding.py |
Adds a new CP1252 VARCHAR consistency test suite intended to ensure VARCHAR returns str across platforms and fetch methods. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
ffelixg
left a comment
There was a problem hiding this comment.
Hey, love to see this!
The encoding used by the odbc driver is actually not directly dependent on the operating system, instead it checks which locale is being used and encodes according to that. The difference between the operating systems is that windows uses cp1252 by default and linux uses utf-8. However, you can change the locale and make the driver return data in some other encoding. For example the current test passes on Linux with mssql_python 1.4.0, which shows that you can provoke the same issue on Linux:
def test_locale_varchar_decode_iso885915():
assert locale.getlocale() == ('en_US', 'UTF-8')
# important: change locale BEFORE connecting
locale.setlocale(locale.LC_ALL, 'en_US.iso885915')
connection = connect()
cursor = connection.cursor()
(val,) = cursor.execute("SELECT '€'").fetchone()
# without changing locale, we would get val == '€'
assert val == b'\xa4', val
cursor.close()
connection.close()If we really want to continue to use SQL_C_CHAR for fetching varchar data, even if just on Linux, we would need to note down the current locale when connecting and only fetch SQL_C_CHAR if it is UTF-8 and SQL_C_WCHAR otherwise. This is made more complicated by connection pooling though. Apparently some pooling is already active by default, as followup connections still use the same locale, even if I change it beforehand.
On the flip side, if you do this, you can get rid of the platform specific code, because windows also allows changing the locale to UTF-8, which would enable SQL_C_CHAR there as well.
@subrata-ms Let' see if we can detect the locale at connection time and use SQL_C_WCHAR on any platform when the locale is non-UTF-8. This prevents the same class of bug from re-appearing. If the full fix is deferred, create a tracking issue for it in ADO backlog. |
| (void)wcharEncoding; // Suppress unused parameter warning | ||
| #if defined(_WIN32) | ||
| // On Windows, VARCHAR is fetched as SQL_C_WCHAR, so charEncoding is unused. | ||
| (void)charEncoding; |
There was a problem hiding this comment.
I consider this as a big behavioral (or potentially breaking) change, why? Because:
Consider a user with a server using a rare or custom collation (say a legacy Thai or Japanese single-byte encoding). Before this PR, they could call setdecoding(SQL_CHAR, encoding="tis-620") to control how bytes were interpreted. After this PR on Windows, that call is silently ignored, the driver always asks ODBC for UTF-16, which may or may not produce the correct result depending on the driver's conversion capabilities. The user gets no indication that their explicit configuration was bypassed.
What I get from this change is that the charEncoding parameter is explicitly (void) suppressed on Windows and all three fetch paths (SQLGetData_wrap, SQLBindColums, FetchBatchData) now unconditionally use SQL_C_WCHAR for VARCHAR columns. This means the user-facing setdecoding(SQL_CHAR, encoding="cp1252", ctype=SQL_CHAR) API call is accepted without error but has no effect on Windows.
The user's explicit decoding configuration is silently discarded. The test test_cp1252_varchar_explicit_decoding passes only because SQL_C_WCHAR handles CP1252 natively, not because the setdecoding call influenced anything.
There was a problem hiding this comment.
I am still thinking about what could be the possible solution for this to avoid this "potentially" big change.
There was a problem hiding this comment.
There's actually a cleaner fix that avoids the breaking change entirely. _decoding_settings already stores both encoding and ctype, but only encoding is forwarded to C++ today and ctype is never passed down. So instead, as a fix:
-
We can also pass
ctypeintoFetchOne/Many/All_wrap -
In the
SQL_CHAR/SQL_VARCHARC++ branch: useSQL_C_WCHARwhenctype == SQL_WCHARand useSQL_C_CHAR+ decode withcharEncodingwhenctype == SQL_CHAR -
Change the default
ctypeforSQL_CHARfromSQL_CHAR→SQL_WCHAR
This means default users get SQL_C_WCHAR automatically (fixed). Users who explicitly passed ctype=SQL_CHAR keep their existing behavior unchanged. No platform #if guards needed, and the Linux locale issue @ffelixg raised is also covered since the default becomes SQL_C_WCHAR.
This is consider as limitation in python test. The SQL_NO_TOTAL indicator value is written by the ODBC driver into bound column buffers during SQLFetchScroll. Python code has no control over what the driver writes there — it's a runtime decision made deep inside the native ODBC layer.
|
| f"{fetch_method}: expected str but got {type(value).__name__}: " | ||
| f"{value!r} (platform={sys.platform})." | ||
| ) | ||
| assert "\u00ad" in value, ( |
There was a problem hiding this comment.
The fetchone parametrized tests all use strict equality (==). This batch test uses "in", so containment check would pass even if the driver incorrectly prepended/appended characters. Please change to assert value == "\u00ad" to match the stronger assertion pattern used in the rest of the test suite.
| (void)wcharEncoding; // Suppress unused parameter warning | ||
| #if defined(_WIN32) | ||
| // On Windows, VARCHAR is fetched as SQL_C_WCHAR, so charEncoding is unused. | ||
| (void)charEncoding; |
There was a problem hiding this comment.
There's actually a cleaner fix that avoids the breaking change entirely. _decoding_settings already stores both encoding and ctype, but only encoding is forwarded to C++ today and ctype is never passed down. So instead, as a fix:
-
We can also pass
ctypeintoFetchOne/Many/All_wrap -
In the
SQL_CHAR/SQL_VARCHARC++ branch: useSQL_C_WCHARwhenctype == SQL_WCHARand useSQL_C_CHAR+ decode withcharEncodingwhenctype == SQL_CHAR -
Change the default
ctypeforSQL_CHARfromSQL_CHAR→SQL_WCHAR
This means default users get SQL_C_WCHAR automatically (fixed). Users who explicitly passed ctype=SQL_CHAR keep their existing behavior unchanged. No platform #if guards needed, and the Linux locale issue @ffelixg raised is also covered since the default becomes SQL_C_WCHAR.
Work Item / Issue Reference
Summary
This pull request makes significant improvements to how VARCHAR (and related CHAR) columns are fetched and decoded, especially for Windows platforms, to ensure consistent Unicode string handling across all operating systems. The changes unify the decoding behavior for problematic CP1252 byte values, preventing platform-specific inconsistencies and ensuring that VARCHAR columns always return Python
strobjects, notbytes, regardless of the server collation or platform. The update also adds comprehensive tests to verify this behavior.Cross-platform VARCHAR decoding and buffer management:
SQL_C_WCHAR, ensuring the ODBC driver converts from the server's native encoding (such as CP1252) to UTF-16, matching the behavior of NVARCHAR columns and eliminating the need to guess the server code page. This change guarantees that Python always receives Unicode strings (str) for VARCHAR columns, even when they contain bytes that are invalid in UTF-8. [1] [2] [3] [4]SQL_C_CHARand decode as UTF-8, but the code is refactored for clarity and consistency with the Windows path. [1] [2] [3] [4] [5]Testing improvements:
str(notbytes) on all platforms and via all fetch methods (fetchone,fetchall,fetchmany).These changes ensure robust, predictable Unicode handling for VARCHAR columns across all supported platforms, and the expanded test suite helps prevent regressions in encoding/decoding behavior.