
FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468 (#478)

Open
subrata-ms wants to merge 9 commits into main from subrata-ms/cp1252_encoding

Conversation

@subrata-ms
Contributor

@subrata-ms subrata-ms commented Mar 19, 2026

Work Item / Issue Reference

AB#43177

GitHub Issue: #468


Summary

This pull request makes significant improvements to how VARCHAR (and related CHAR) columns are fetched and decoded, especially for Windows platforms, to ensure consistent Unicode string handling across all operating systems. The changes unify the decoding behavior for problematic CP1252 byte values, preventing platform-specific inconsistencies and ensuring that VARCHAR columns always return Python str objects, not bytes, regardless of the server collation or platform. The update also adds comprehensive tests to verify this behavior.

Cross-platform VARCHAR decoding and buffer management:

  • On Windows, VARCHAR columns are now always fetched as SQL_C_WCHAR, ensuring the ODBC driver converts from the server's native encoding (such as CP1252) to UTF-16, matching the behavior of NVARCHAR columns and eliminating the need to guess the server code page. This change guarantees that Python always receives Unicode strings (str) for VARCHAR columns, even when they contain bytes that are invalid in UTF-8.
  • On Linux/macOS, the logic for buffer allocation and decoding remains unchanged, continuing to fetch as SQL_C_CHAR and decode as UTF-8, but the code is refactored for clarity and consistency with the Windows path.
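The effect of fetching as SQL_C_WCHAR can be modeled without a driver. The sketch below (plain Python, illustrative only; the real conversion happens inside the ODBC driver and the C++ bindings) shows why the wide path always yields a well-formed str while the old narrow path fails on CP1252 bytes:

```python
# Model of the SQL_C_WCHAR fetch path: the driver converts the server's
# CP1252 bytes to UTF-16LE before Python ever sees them.
server_bytes = "café €10".encode("cp1252")  # what a CP1252-collated column stores

# Driver-side conversion, modeled here with Python codecs:
wchar_buffer = server_bytes.decode("cp1252").encode("utf-16-le")

# Python-side decode of the SQL_C_WCHAR buffer is always a clean str:
value = wchar_buffer.decode("utf-16-le")
assert value == "café €10"

# The same bytes through the old narrow path fail a strict UTF-8 decode,
# because 0x80 (the euro sign in CP1252) is not valid UTF-8:
try:
    server_bytes.decode("utf-8")
    raise AssertionError("unexpectedly valid UTF-8")
except UnicodeDecodeError:
    pass
```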

Testing improvements:

  • Adds 23 new parameterized tests to verify that all problematic CP1252 byte values in VARCHAR columns are always decoded to str (not bytes) on all platforms and via all fetch methods (fetchone, fetchall, fetchmany).
  • Adds tests to ensure that embedded problematic bytes and explicit CP1252 decoding are handled correctly, further validating the new cross-platform consistency.
  • Updates the test count in the encoding/decoding test module to reflect the increased coverage.
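The "problematic CP1252 byte values" the new tests parameterize over can be enumerated directly. This sketch (plain Python, no database needed) shows which bytes those are and why they need dedicated coverage:

```python
# Enumerate the CP1252 code points in 0x80-0x9F. Five slots (0x81, 0x8D,
# 0x8F, 0x90, 0x9D) are unassigned in CP1252 and raise on strict decode;
# the remaining 27 are defined in CP1252 yet invalid as UTF-8 bytes.
problematic = []
for b in range(0x80, 0xA0):
    try:
        ch = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # unassigned in CP1252
    problematic.append((b, ch))

# Every one of them fails a strict UTF-8 decode, which is what used to
# force the bytes fallback on Windows.
for b, _ in problematic:
    try:
        bytes([b]).decode("utf-8")
        raise AssertionError("unexpectedly valid UTF-8")
    except UnicodeDecodeError:
        pass

assert len(problematic) == 27
assert (0x80, "\u20ac") in problematic  # euro sign
```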

These changes ensure robust, predictable Unicode handling for VARCHAR columns across all supported platforms, and the expanded test suite helps prevent regressions in encoding/decoding behavior.

@github-actions github-actions bot added the pr-size: medium Moderate update size label Mar 19, 2026
@github-actions

github-actions bot commented Mar 19, 2026

📊 Code Coverage Report

🔥 Diff Coverage

11%


🎯 Overall Coverage

77%


📈 Total Lines Covered: 5759 out of 7465
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/ddbc_bindings.cpp (11.1%): Missing lines 3131,3133-3134,3278,3280-3281,4012-4025,4033-4053,4056-4062

Summary

  • Total: 54 lines
  • Missing: 48 lines
  • Coverage: 11%

mssql_python/pybind/ddbc_bindings.cpp

Lines 3127-3138

  3127                             // the streaming LOB path to retrieve the full
  3128                             // value.
  3129                             LOG("SQLGetData: Cannot determine data length "
  3130                                 "(SQL_NO_TOTAL) for column %d (SQL_CHAR), "
! 3131                                 "falling back to LOB streaming",
  3132                                 i);
! 3133                             row.append(FetchLobColumnData(hStmt, i, SQL_C_CHAR, false, false,
! 3134                                                           charEncoding));
  3135                             // LCOV_EXCL_STOP
  3136                         } else if (dataLen < 0) {
  3137                             LOG("SQLGetData: Unexpected negative data length "
  3138                                 "for column %d - dataType=%d, dataLen=%ld",

Lines 3274-3285

  3274                             // the streaming LOB path to retrieve the full
  3275                             // value.
  3276                             LOG("SQLGetData: Cannot determine NVARCHAR data "
  3277                                 "length (SQL_NO_TOTAL) for column %d, "
! 3278                                 "falling back to LOB streaming",
  3279                                 i);
! 3280                             row.append(
! 3281                                 FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false, "utf-16le"));
  3282                             // LCOV_EXCL_STOP
  3283                         } else if (dataLen < 0) {
  3284                             LOG("SQLGetData: Unexpected negative data length "
  3285                                 "for column %d (NVARCHAR) - dataLen=%ld",

Lines 4008-4029

  4008                 //
  4009                 // Coverage note: This branch is defensive — the ODBC driver
  4010                 // controls indicator values in bound-column mode, so
  4011                 // SQL_NO_TOTAL cannot be triggered from Python-level tests.
! 4012                 const ColumnInfoExt& noTotalColInfo = columnInfosExt[col - 1];
! 4013                 LOG("SQLGetData: SQL_NO_TOTAL for column %d (dataType=%d), "
! 4014                     "falling back to LOB streaming",
! 4015                     col, noTotalColInfo.dataType);
! 4016                 switch (noTotalColInfo.dataType) {
! 4017                     case SQL_CHAR:
! 4018                     case SQL_VARCHAR:
! 4019                     case SQL_LONGVARCHAR:
! 4020 #if defined(__APPLE__) || defined(__linux__)
! 4021                         PyList_SET_ITEM(row, col - 1,
! 4022                                         FetchLobColumnData(hStmt, col, SQL_C_CHAR, false, false,
! 4023                                                            noTotalColInfo.charEncoding)
! 4024                                             .release()
! 4025                                             .ptr());
  4026 #else
  4027                         PyList_SET_ITEM(
  4028                             row, col - 1,
  4029                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")

Lines 4029-4066

  4029                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
  4030                                 .release()
  4031                                 .ptr());
  4032 #endif
! 4033                         break;
! 4034                     case SQL_WCHAR:
! 4035                     case SQL_WVARCHAR:
! 4036                     case SQL_WLONGVARCHAR:
! 4037                     case SQL_SS_XML:
! 4038                         PyList_SET_ITEM(
! 4039                             row, col - 1,
! 4040                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
! 4041                                 .release()
! 4042                                 .ptr());
! 4043                         break;
! 4044                     case SQL_BINARY:
! 4045                     case SQL_VARBINARY:
! 4046                     case SQL_LONGVARBINARY:
! 4047                     case SQL_SS_UDT:
! 4048                         PyList_SET_ITEM(row, col - 1,
! 4049                                         FetchLobColumnData(hStmt, col, SQL_C_BINARY, false, true)
! 4050                                             .release()
! 4051                                             .ptr());
! 4052                         break;
! 4053                     default:
  4054                         // For fixed-length types SQL_NO_TOTAL should not
  4055                         // occur; treat as NULL to avoid undefined behaviour.
! 4056                         LOG("SQL_NO_TOTAL for unexpected fixed-type column %d "
! 4057                             "(dataType=%d), returning NULL",
! 4058                             col, noTotalColInfo.dataType);
! 4059                         Py_INCREF(Py_None);
! 4060                         PyList_SET_ITEM(row, col - 1, Py_None);
! 4061                         break;
! 4062                 }
  4063                 // LCOV_EXCL_STOP
  4064                 continue;
  4065             }


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.8%
mssql_python.pybind.ddbc_bindings.cpp: 69.7%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.__init__.py: 77.1%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

@subrata-ms subrata-ms marked this pull request as ready for review March 19, 2026 09:07
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses cross-platform inconsistencies when fetching CP1252-collated VARCHAR data by changing the ODBC C-type used for VARCHAR retrieval on Windows, aiming to ensure consistent Unicode (str) results across Windows and Linux/macOS.

Changes:

  • On Windows, fetch/bind VARCHAR (and related SQL_CHAR types) as SQL_C_WCHAR and process via the wide-char path to avoid code page guessing and bytes fallback.
  • On Linux/macOS, clarify/standardize SQL_C_CHAR buffer sizing assumptions for UTF-8 expansion.
  • Add a new CP1252 regression test section covering problematic bytes and multiple fetch paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
mssql_python/pybind/ddbc_bindings.cpp Updates SQLGetData_wrap and batch fetch binding/processing to use SQL_C_WCHAR for VARCHAR on Windows; adjusts buffer sizing logic and LOB detection formatting.
tests/test_013_encoding_decoding.py Adds a new CP1252 VARCHAR consistency test suite intended to ensure VARCHAR returns str across platforms and fetch methods.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Contributor

@ffelixg ffelixg left a comment


Hey, love to see this!

The encoding used by the ODBC driver is actually not directly dependent on the operating system; instead, the driver checks the active locale and encodes accordingly. The difference between the operating systems is that Windows uses cp1252 by default and Linux uses UTF-8. However, you can change the locale and make the driver return data in some other encoding. For example, the following test passes on Linux with mssql_python 1.4.0, which shows that you can provoke the same issue on Linux:

import locale

# `connect` is the test suite's helper that opens a mssql_python connection.
def test_locale_varchar_decode_iso885915():
    assert locale.getlocale() == ('en_US', 'UTF-8')

    # important: change locale BEFORE connecting
    locale.setlocale(locale.LC_ALL, 'en_US.iso885915')
    connection = connect()
    cursor = connection.cursor()

    (val,) = cursor.execute("SELECT '€'").fetchone()
    # without changing locale, we would get val == '€'
    assert val == b'\xa4', val

    cursor.close()
    connection.close()

If we really want to continue to use SQL_C_CHAR for fetching varchar data, even if just on Linux, we would need to note down the current locale when connecting and only fetch SQL_C_CHAR if it is UTF-8 and SQL_C_WCHAR otherwise. This is made more complicated by connection pooling though. Apparently some pooling is already active by default, as followup connections still use the same locale, even if I change it beforehand.

On the flip side, if you do this, you can get rid of the platform specific code, because windows also allows changing the locale to UTF-8, which would enable SQL_C_CHAR there as well.
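The connection-time locale check suggested above could look like the following sketch (the helper name and string return values are illustrative; no such API exists in mssql_python today):

```python
import codecs

def ctype_for_codeset(codeset: str) -> str:
    """Pick the ODBC C type for VARCHAR fetches from the codeset that was
    active when the connection was created: the narrow path is only safe
    when the driver will hand back UTF-8."""
    try:
        normalized = codecs.lookup(codeset).name
    except LookupError:
        return "SQL_C_WCHAR"  # unknown codeset: take the safe wide path
    return "SQL_C_CHAR" if normalized == "utf-8" else "SQL_C_WCHAR"

# A UTF-8 locale keeps the narrow path; anything else forces wide chars.
assert ctype_for_codeset("UTF-8") == "SQL_C_CHAR"
assert ctype_for_codeset("ISO8859-15") == "SQL_C_WCHAR"
assert ctype_for_codeset("cp1252") == "SQL_C_WCHAR"
```

On POSIX the codeset would come from locale.nl_langinfo(locale.CODESET) at connect time; as noted above, connection pooling means the value would have to be captured per pooled connection, not per call.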

@sumitmsft
Contributor

> (quotes @ffelixg's comment above in full)

@subrata-ms
Agreed with @ffelixg's comments here. The encoding used by the ODBC driver depends on the locale, not strictly on the operating system. On Linux with a non-UTF-8 locale (e.g., en_US.iso885915), the driver will return non-UTF-8 bytes through SQL_C_CHAR, causing the exact same bytes-vs-str inconsistency this PR is trying to fix on Windows, while the Linux/macOS path unconditionally assumes UTF-8 (columnSize * 4 buffer sizing, UTF-8 decode).

Let's see if we can detect the locale at connection time and use SQL_C_WCHAR on any platform when the locale is non-UTF-8. This prevents the same class of bug from re-appearing. If the full fix is deferred, create a tracking issue for it in the ADO backlog.

(void)wcharEncoding; // Suppress unused parameter warning
#if defined(_WIN32)
// On Windows, VARCHAR is fetched as SQL_C_WCHAR, so charEncoding is unused.
(void)charEncoding;
Contributor


I consider this a big behavioral (and potentially breaking) change. Why? Because:

Consider a user whose server uses a rare or custom collation (say a legacy Thai or Japanese single-byte encoding). Before this PR, they could call setdecoding(SQL_CHAR, encoding="tis-620") to control how bytes were interpreted. After this PR on Windows, that call is silently ignored: the driver is always asked for UTF-16, which may or may not produce the correct result depending on the driver's conversion capabilities. The user gets no indication that their explicit configuration was bypassed.

Concretely, the charEncoding parameter is explicitly (void)-suppressed on Windows, and all three fetch paths (SQLGetData_wrap, SQLBindColums, FetchBatchData) now unconditionally use SQL_C_WCHAR for VARCHAR columns. This means the user-facing setdecoding(SQL_CHAR, encoding="cp1252", ctype=SQL_CHAR) API call is accepted without error but has no effect on Windows.

The user's explicit decoding configuration is silently discarded. The test test_cp1252_varchar_explicit_decoding passes only because SQL_C_WCHAR handles CP1252 natively, not because the setdecoding call influenced anything.

Contributor

@sumitmsft sumitmsft Apr 2, 2026


I am still thinking about possible solutions that would avoid this potentially big change.

Contributor


There's actually a cleaner fix that avoids the breaking change entirely. _decoding_settings already stores both encoding and ctype, but only encoding is forwarded to C++ today and ctype is never passed down. So instead, as a fix:

  • We can also pass ctype into FetchOne/Many/All_wrap

  • In the SQL_CHAR/SQL_VARCHAR C++ branch: use SQL_C_WCHAR when ctype == SQL_WCHAR and use SQL_C_CHAR + decode with charEncoding when ctype == SQL_CHAR

  • Change the default ctype for SQL_CHAR from SQL_CHAR to SQL_WCHAR

This means default users get SQL_C_WCHAR automatically (fixed). Users who explicitly passed ctype=SQL_CHAR keep their existing behavior unchanged. No platform #if guards needed, and the Linux locale issue @ffelixg raised is also covered since the default becomes SQL_C_WCHAR.
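A minimal model of that dispatch (a Python stand-in for the C++ fetch paths; the settings dict mirrors what _decoding_settings stores, and the constant values follow sqlext.h):

```python
SQL_CHAR, SQL_WCHAR = 1, -8  # ODBC type codes, per sqlext.h

# Hypothetical per-connection decoding settings; under this proposal the
# default ctype for SQL_CHAR data becomes SQL_WCHAR.
DEFAULT_DECODING = {"encoding": "utf-8", "ctype": SQL_WCHAR}

def fetch_plan(settings: dict) -> tuple:
    """Return (c_type, decode_encoding) for a VARCHAR column."""
    if settings["ctype"] == SQL_WCHAR:
        # Wide path: the driver converts to UTF-16; the encoding setting
        # is not consulted.
        return ("SQL_C_WCHAR", "utf-16-le")
    # Narrow path: honor the user's explicit setdecoding() encoding.
    return ("SQL_C_CHAR", settings["encoding"])

# Default users get the wide path automatically; an explicit
# ctype=SQL_CHAR keeps the legacy narrow path with the chosen encoding.
assert fetch_plan(DEFAULT_DECODING) == ("SQL_C_WCHAR", "utf-16-le")
assert fetch_plan({"encoding": "tis-620", "ctype": SQL_CHAR}) == ("SQL_C_CHAR", "tis-620")
```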

@subrata-ms
Contributor Author

subrata-ms commented Apr 2, 2026

> (coverage report identical to the one posted above)

I consider this a limitation of Python-level testing.

The SQL_NO_TOTAL indicator value is written by the ODBC driver into bound column buffers during SQLFetchScroll. Python code has no control over what the driver writes there; it's a runtime decision made deep inside the native ODBC layer. The existing LCOV_EXCL_START / LCOV_EXCL_STOP markers are correct and intentional for this reason. The options are:

  1. Accept the exclusion — this is defensive code that gracefully handles a rare driver behavior. The LCOV_EXCL markers are the right approach for genuinely untestable driver-controlled paths.
  2. Write a C++ unit test (separate from the pybind module) — using a mock ODBC driver or by directly constructing a ColumnBuffers struct with SQL_NO_TOTAL indicators and calling the relevant logic. This keeps test code out of production code but requires a C++ test framework (e.g., Google Test).
  3. Attempt to trigger naturally — the existing test_sql_no_total_large_data_scenario already tries with 5MB data, but there's no guarantee the driver will return SQL_NO_TOTAL instead of the actual length.

f"{fetch_method}: expected str but got {type(value).__name__}: "
f"{value!r} (platform={sys.platform})."
)
assert "\u00ad" in value, (
Contributor


The fetchone parametrized tests all use strict equality (==). This batch test uses "in", so a containment check would pass even if the driver incorrectly prepended or appended characters. Please change this to assert value == "\u00ad" to match the stronger assertion pattern used in the rest of the test suite.


