Fix escaped hash in regex extended modefixed escaped hash in regex extended mode#4814
Fix escaped hash in regex extended modefixed escaped hash in regex extended mode#4814ondrejmirtes merged 1 commit intophpstan:2.1.xfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes handling of escaped hash (\#) when stripping comments in PCRE extended (x) mode so that \# is treated as a literal # rather than the start of a comment.
Changes:
- Update the freespacing comment-stripping logic in
RegexGroupParserto account for escaped hashes. - Add an nsrt test case covering escaped hash behavior in extended mode.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/Type/Regex/RegexGroupParser.php |
Adjusts the regex used to strip #... comments in /x mode. |
tests/PHPStan/Analyser/nsrt/preg_match_shapes.php |
Adds a regression test to validate escaped-hash handling in extended mode parsing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/Type/Regex/RegexGroupParser.php
Outdated
| // in freespacing mode the # character starts a comment and runs until the end of the line | ||
| $regex = preg_replace('/(?<!\?)#.*/', '', $regex) ?? ''; | ||
| // but \# is an escaped literal hash, not a comment | ||
| $regex = preg_replace('/(?<!\\\)#.*/', '', $regex) ?? ''; |
There was a problem hiding this comment.
The new freespacing comment-stripping pattern will also treat the # in PCRE inline comment groups (?# ... ) as a line comment (because it’s not preceded by \). This is a regression from the previous (?<!\?) guard and can break valid patterns (e.g. tests/PHPStan/Analyser/nsrt/bug-11311.php contains (?# ...) inside an /x regex). Consider updating the stripping logic to exclude # that is part of (?#...) (and ideally avoid using a single lookbehind heuristic altogether).
| $regex = preg_replace('/(?<!\\\)#.*/', '', $regex) ?? ''; | |
| // also, (?# ... ) is an inline comment group and its # must not be treated as a line comment | |
| $regex = preg_replace('/(?<!\\\)(?<!\(\?)#.*/', '', $regex) ?? ''; |
src/Type/Regex/RegexGroupParser.php
Outdated
| // in freespacing mode the # character starts a comment and runs until the end of the line | ||
| $regex = preg_replace('/(?<!\?)#.*/', '', $regex) ?? ''; | ||
| // but \# is an escaped literal hash, not a comment | ||
| $regex = preg_replace('/(?<!\\\)#.*/', '', $regex) ?? ''; |
There was a problem hiding this comment.
/(?<!\\\)#.*/ only checks the single character immediately before #. In extended mode, \\# (two backslashes before #) should still start a comment because the last backslash is escaped, but this regex will incorrectly preserve it. To match PCRE semantics you need to treat # as escaped only when it’s preceded by an odd-length run of backslashes (and also ensure you’re not inside a character class).
| $regex = preg_replace('/(?<!\\\)#.*/', '', $regex) ?? ''; | |
| // and # inside a character class does not start a comment | |
| $length = strlen($regex); | |
| $processedRegex = ''; | |
| $inCharClass = false; | |
| $backslashRun = 0; | |
| for ($i = 0; $i < $length; $i++) { | |
| $char = $regex[$i]; | |
| $escaped = ($backslashRun % 2) === 1; | |
| // update character class state (unescaped [ and ]) | |
| if ($char === '[' && !$escaped) { | |
| $inCharClass = true; | |
| } elseif ($char === ']' && !$escaped) { | |
| $inCharClass = false; | |
| } | |
| // start of a comment: unescaped # outside of character class | |
| if ($char === '#' && !$escaped && !$inCharClass) { | |
| // skip until end of line or end of pattern | |
| while ($i + 1 < $length) { | |
| $nextChar = $regex[$i + 1]; | |
| if ($nextChar === "\n" || $nextChar === "\r") { | |
| break; | |
| } | |
| $i++; | |
| } | |
| $backslashRun = 0; | |
| continue; | |
| } | |
| $processedRegex .= $char; | |
| if ($char === '\\') { | |
| $backslashRun++; | |
| } else { | |
| $backslashRun = 0; | |
| } | |
| } | |
| $regex = $processedRegex; |
src/Type/Regex/RegexGroupParser.php
Outdated
| // in freespacing mode the # character starts a comment and runs until the end of the line | ||
| $regex = preg_replace('/(?<!\?)#.*/', '', $regex) ?? ''; | ||
| // but \# is an escaped literal hash, not a comment | ||
| $regex = preg_replace('/(?<!\\\)#.*/', '', $regex) ?? ''; |
There was a problem hiding this comment.
The literal /(?<!\\\)#.*/ relies on PHP single-quote escaping rules (the trailing \ is not an escape sequence), which is hard to read/maintain. Consider rewriting it in a clearer form (e.g. using \\\\ in the PHP string for a \\ regex escape) or adding a short comment explaining the escaping, to reduce the chance of accidental changes breaking the pattern.
| $regex = preg_replace('/(?<!\\\)#.*/', '', $regex) ?? ''; | |
| // (?<!\\\\) is a negative lookbehind for a backslash; '\\\\' in PHP becomes '\\' in the regex | |
| $regex = preg_replace('/(?<!\\\\)#.*/', '', $regex) ?? ''; |
\# was incorrectly treated as comment start in /x mode
|
/cc @staabm Please review 😊 |
staabm
left a comment
There was a problem hiding this comment.
I was not aware of such a regex feature, but implementation wise it looks good to me.
thank you
|
Thank you! |
When using the
x(extended) modifier,\#was incorrectly treated as the start of a comment. According to PCRE specification,\#is an escaped literal hash character and should not start a comment.Example
Before this fix, PHPStan would strip
\#.]) $/xas a comment, breaking the pattern parsing.Changes
src/Type/Regex/RegexGroupParser.php: Changed the comment-stripping regex from/(?<!\?)#.*/to/(?<!\\)#.*/to respect escaped hashestestExtendedSyntaxEscapedHashintests/PHPStan/Analyser/nsrt/preg_match_shapes.phpNote
This fix does not address
#inside character classes like[#.](which is also not a comment in PCRE). However, users can use the workaround[\#.]which now works correctly with this fix.