fix: properly strip HTML tags and resolve entities in feed article summaries #149
…mmaries

Fixes bewcloud#146

The parseTextFromHtml function was using document.textContent directly on the parsed HTML document, which could leave raw HTML tags and unresolved entities in feed article summaries.

Changes:
- Extract text from body element to avoid document wrapper artifacts
- Collapse multiple whitespace/newlines into single spaces for cleaner output
- Add early return for empty/whitespace-only input
- Use optional chaining for safer null handling
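The changes listed in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `ParsedDocument` type, the `HtmlParser` callback, and the name `parseTextFromHtmlSketch` are hypothetical stand-ins for whatever HTML parser `lib/feed.ts` actually uses, chosen so the snippet runs without a DOM.

```typescript
// Hypothetical minimal shape of a parsed HTML document; the real parser's types may differ.
interface ParsedDocument {
  body?: { textContent: string | null } | null;
}

// Hypothetical parser signature, injected so the sketch stays parser-agnostic.
type HtmlParser = (html: string) => ParsedDocument | null;

function parseTextFromHtmlSketch(html: string, parse: HtmlParser): string {
  // Early return for empty or whitespace-only input.
  if (!html || html.trim() === '') {
    return '';
  }

  // Read text from the body element only; optional chaining covers a
  // failed parse or a document without a body.
  const text = parse(html)?.body?.textContent ?? '';

  // Whitespace collapse as described in this first commit
  // (the review discussion in this thread questions this step).
  return text.replace(/\s+/g, ' ').trim();
}
```

Passing the parser in as an argument is only a convenience for the sketch; it makes the null-handling paths easy to exercise with a stub.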
BrunoBernardino
left a comment
Thank you for the suggested fix, @stakeswky! While your summary appears to be AI-generated, the code fix is too small to make that assessment, and it's simple enough for me not to worry too much about it.
I do have a request to either improve or remove a couple of lines of code in here.
Thanks, I hope that makes sense!
// Collapse multiple whitespace/newlines into single spaces
.replace(/\s+/g, ' ')
I don't think this is necessary and it will break text-only summaries that are properly formatted with line breaks. That being said, it would make sense to remove/trim more than 2 newline or whitespace characters in a row.
Good point — updated in 7a3bfc5. Now it only collapses runs of 2+ non-newline whitespace into a single space, and 3+ consecutive newlines into a double newline. Single line breaks are preserved.
.replace(/[^\S\n]{2,}/g, ' ')
.replace(/\n{3,}/g, '\n\n')

…pace

Address review feedback: the previous \s+ regex was too aggressive and broke text-only summaries with legitimate line breaks. Now:
- Collapse runs of 2+ non-newline whitespace into a single space
- Collapse 3+ consecutive newlines into double newline (paragraph break)
- Single line breaks are preserved
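To make the difference concrete, here is a small sketch comparing the original `\s+` collapse with the two revised patterns quoted in this thread. The function names `collapseAll` and `normalizeWhitespace` and the sample string are made up for illustration; the PR itself only shows the `.replace()` calls.

```typescript
// First version: every whitespace run, newlines included, becomes one space.
function collapseAll(text: string): string {
  return text.replace(/\s+/g, ' ');
}

// Revised version from the review thread.
function normalizeWhitespace(text: string): string {
  return text
    // [^\S\n] matches any whitespace character except a newline.
    .replace(/[^\S\n]{2,}/g, ' ')
    // Three or more newlines shrink to a single blank line.
    .replace(/\n{3,}/g, '\n\n');
}

const summary = 'First line\nSecond   line\n\n\n\nNew paragraph';

console.log(JSON.stringify(collapseAll(summary)));
// "First line Second line New paragraph"
console.log(JSON.stringify(normalizeWhitespace(summary)));
// "First line\nSecond line\n\nNew paragraph"
```

One thing this sketch suggests checking: a single space or tab between newlines is left alone, so a run such as `"\n \n \n"` is not reduced by either revised pattern.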
Hi @BrunoBernardino, thanks for the feedback! I've updated the regex in the second commit:
Single line breaks are now preserved. Let me know if this looks good!
BrunoBernardino
left a comment
Thanks for the changes!
Fixes #146
Problem
The feed reader displays raw HTML tags and unresolved entities in article summaries instead of clean plain text.
Root Cause
parseTextFromHtml() in lib/feed.ts used document.textContent directly on the parsed HTML document object, which could include artifacts from the document wrapper and didn't handle all edge cases.
Fix
- Extract text from the <body> element specifically (where the actual content lives after parsing)