scan_string fails on Windows with non-ASCII characters due to encoding mismatch in temp file #1582

@Nokai90

Prerequisites

  • Are you running the latest version of this application?
  • Have you checked the Frequently Asked Questions document?
  • Have you simplified the bug report to the essential details?
    • Do you have a distinct command line to report?
    • Can you clearly state the configuration for this bug report?
    • Do you have a minimal document that highlights this bug?
    • Are any required files (configuration or Markdown document) attached to the issue?
  • Did you perform a cursory search of other issues to look for related issues?

Bug Report

Bug Type

  • Assertion Failure
  • Documentation
  • Scan/Rule not working as expected
  • Fix/Rule not working as expected
  • Other
    scan_string API fails on Windows when the input contains non-ASCII characters (e.g. accented letters)

Description

PyMarkdownApi().scan_string() raises a Configuration Error on Windows when the input string contains non-ASCII characters such as à, è, ò, etc.

The root cause is an encoding mismatch in file_scan_helper.py. The method __scan_from_stdin (line 154) writes the string to a temporary file using tempfile.NamedTemporaryFile("wt", delete=False) without specifying an encoding. On Windows, this defaults to the system locale encoding (typically cp1252). The file is then read back by FileSourceProvider.__init__ in general/source_providers.py (line 73) with explicit encoding="utf-8", causing a UnicodeDecodeError.

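For example, the character à is encoded as the single byte 0xE0 in cp1252. In UTF-8, 0xE0 is the lead byte of a 3-byte sequence and must be followed by two continuation bytes. The next byte in the file is not a continuation byte, so the read fails with "invalid continuation byte".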
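The mismatch can be reproduced on any platform without pymarkdown by encoding the sample string as cp1252 and decoding it as UTF-8. This is a minimal sketch of the mechanism only; cp1252 is an assumed stand-in for whatever locale encoding the Windows machine uses, and the variable names are illustrative.

```python
text = "Testo con caratteri accentati: à è ò ù"

# Simulate what a text-mode write produces when the default
# encoding is cp1252 (assumed here; the real value is locale-dependent).
raw = text.encode("cp1252")

# Simulate the read side, which decodes the same bytes as UTF-8.
error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The decode fails at the first accented character, because the byte 0xE0 is followed by a space (0x20) rather than a UTF-8 continuation byte.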

Specifics

What operating system and version are you running into this behavior on?

Windows 11 Pro 10.0.26200

What version are you seeing this behavior in? (Run pip list or pipenv run pip list and look for the entry beside pymarkdownlnt.)

0.9.36

Are there any extra steps that need to be taken before executing the application?

None. The issue is triggered simply by running on a Windows system with a non-UTF-8 default locale (which is the default for most Windows installations).
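To check whether a given machine is affected, the encoding that Python uses for text files opened without an explicit `encoding` argument can be inspected with the standard library. This is a diagnostic sketch, not part of pymarkdown; the variable names are illustrative.

```python
import locale
import sys

# Encoding used by open() and tempfile.NamedTemporaryFile() in text mode
# when no encoding argument is passed.
default_encoding = locale.getpreferredencoding(False)

# True when Python's UTF-8 mode is enabled (python -X utf8 or PYTHONUTF8=1),
# in which case the default text encoding is UTF-8 regardless of locale.
utf8_mode = bool(sys.flags.utf8_mode)

print(f"default text encoding: {default_encoding}")
print(f"UTF-8 mode enabled: {utf8_mode}")
```

A value such as cp1252 with UTF-8 mode disabled matches the conditions described in this report.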

What is the command line you invoke to get this behavior?

from pymarkdown.api import PyMarkdownApi

result = PyMarkdownApi().scan_string("Testo con caratteri accentati: à è ò ù")

Are you using a configuration file? Either on the command line or one of the implicit configuration files? If so, attach that file to this issue.

None. Default configuration, no custom config file.

What Markdown document causes this behavior to manifest? Attach that file to this issue.

Not applicable (using scan_string API). The equivalent content is:

Testo con caratteri accentati: à è ò ù

Actual Behavior

WARNING:pymarkdown.main:Configuration Error: 'utf-8' codec can't decode byte 0xe0 in position 45: invalid continuation byte

Expected Behavior

scan_string should correctly handle any valid Python string containing non-ASCII/Unicode characters. Since FileSourceProvider reads with encoding="utf-8", the temporary file should also be written with encoding="utf-8" to ensure consistency across all platforms.
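A possible shape for the fix is sketched below: pass `encoding="utf-8"` to `tempfile.NamedTemporaryFile` so the write side matches the read side. The helper names `write_temp_utf8` and `read_utf8` are hypothetical and are not taken from file_scan_helper.py or source_providers.py; they only illustrate the round trip.

```python
import os
import tempfile


def write_temp_utf8(text: str) -> str:
    """Hypothetical helper: write text to a temp file as UTF-8, return its path."""
    with tempfile.NamedTemporaryFile(
        "wt", delete=False, encoding="utf-8"
    ) as temp_file:
        temp_file.write(text)
        return temp_file.name


def read_utf8(path: str) -> str:
    """Hypothetical stand-in for the read side, which decodes as UTF-8."""
    with open(path, "rt", encoding="utf-8") as source_file:
        return source_file.read()


text = "Testo con caratteri accentati: à è ò ù"
path = write_temp_utf8(text)
try:
    round_tripped = read_utf8(path)
finally:
    # delete=False means the caller is responsible for cleanup.
    os.remove(path)

print(round_tripped == text)
```

With both sides pinned to UTF-8, the result no longer depends on the system locale.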
