fix: properly strip HTML tags and resolve entities in feed article summaries#149

Merged
BrunoBernardino merged 2 commits into bewcloud:main from stakeswky:fix/146-feed-html-processing
Feb 23, 2026

@stakeswky
Contributor

Fixes #146

Problem

The feed reader displays raw HTML tags and unresolved entities in article summaries instead of clean plain text.

Root Cause

parseTextFromHtml() in lib/feed.ts used document.textContent directly on the parsed HTML document object, which could include artifacts from the document wrapper and didn't handle all edge cases.

Fix

  • Extract text from the <body> element specifically (where the actual content lives after parsing)
  • Collapse multiple whitespace/newlines into single spaces for cleaner display
  • Add early return for empty/whitespace-only input
  • Use optional chaining for safer null handling
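
The four points above could be sketched roughly as follows. This is not the code from lib/feed.ts, which is not shown in this thread. `extractBodyText`, `ParsedDocumentLike`, and the injected `parse` callback are hypothetical names used only so the sketch runs without a DOM implementation. The whitespace handling is left out here because it is discussed in the review below.

```typescript
// Hypothetical minimal shape of a parsed HTML document.
// A real DOM Document object satisfies this structurally.
interface ParsedDocumentLike {
  body?: { textContent: string | null } | null;
}

// Hypothetical parser signature. In a DOM environment this could be
// (html) => new DOMParser().parseFromString(html, 'text/html').
type HtmlParser = (html: string) => ParsedDocumentLike | null;

function extractBodyText(html: string, parse: HtmlParser): string {
  // Early return for empty or whitespace-only input.
  if (!html || !html.trim()) {
    return '';
  }

  const doc = parse(html);

  // Read text from <body> only; optional chaining turns a missing
  // document, body, or textContent into an empty string.
  return (doc?.body?.textContent ?? '').trim();
}
```

Passing the parser in as an argument is a choice made for this illustration only; it keeps the null-handling path testable with a stub.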

…mmaries

Fixes bewcloud#146

The parseTextFromHtml function was using document.textContent directly on
the parsed HTML document, which could leave raw HTML tags and unresolved
entities in feed article summaries.

Changes:
- Extract text from body element to avoid document wrapper artifacts
- Collapse multiple whitespace/newlines into single spaces for cleaner output
- Add early return for empty/whitespace-only input
- Use optional chaining for safer null handling
Member

@BrunoBernardino BrunoBernardino left a comment

Thank you for the suggested fix, @stakeswky! While your summary appears to be AI-generated, the code fix is too small to make that assessment, and it's simple enough for me not to worry too much about it.

I do have a request to either improve or remove a couple of lines of code in here.

Thanks, I hope that makes sense!

Comment thread lib/feed.ts Outdated
Comment on lines +235 to +236
// Collapse multiple whitespace/newlines into single spaces
.replace(/\s+/g, ' ')
Member

I don't think this is necessary and it will break text-only summaries that are properly formatted with line breaks. That being said, it would make sense to remove/trim more than 2 newline or whitespace characters in a row.
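
To make the concern concrete, here is a small illustration of what the `\s+` replacement does to a summary that already uses line breaks. The sample string is invented for this example.

```typescript
// Invented sample: a text-only summary formatted with line breaks.
const summary = 'First line\nSecond line\n\nNew paragraph';

// \s matches newlines too, so every line break becomes a single space.
const flattened = summary.replace(/\s+/g, ' ');

console.log(flattened); // "First line Second line New paragraph"
```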

Contributor Author

Good point — updated in 7a3bfc5. Now it only collapses runs of 2+ non-newline whitespace into a single space, and 3+ consecutive newlines into a double newline. Single line breaks are preserved.

.replace(/[^\S\n]{2,}/g, ' ')
.replace(/\n{3,}/g, '\n\n')
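
As a quick check of how these two replacements behave, the sketch below wraps them in a helper and runs them on a few inputs. The name `normalizeWhitespace` and the sample strings are assumptions made for this illustration, not taken from the PR.

```typescript
// Hypothetical helper wrapping the two replacements quoted above.
function normalizeWhitespace(text: string): string {
  return text
    // Runs of 2+ whitespace characters other than \n become one space.
    .replace(/[^\S\n]{2,}/g, ' ')
    // Runs of 3+ newlines become a paragraph break (two newlines).
    .replace(/\n{3,}/g, '\n\n');
}

// Single line breaks survive; extra spaces and blank lines are trimmed down.
const result = normalizeWhitespace('Title:    spaced\nline two\n\n\n\nnext paragraph');
console.log(JSON.stringify(result)); // "Title: spaced\nline two\n\nnext paragraph"
```

One thing to keep in mind with this approach: the newline rule only matches consecutive `\n` characters, so blank lines separated by `\r\n` or by a single space (for example `'a\n \n \nb'`) pass through unchanged.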

…pace

Address review feedback: the previous \s+ regex was too aggressive and
broke text-only summaries with legitimate line breaks.

Now:
- Collapse runs of 2+ non-newline whitespace into a single space
- Collapse 3+ consecutive newlines into double newline (paragraph break)
- Single line breaks are preserved
@stakeswky
Contributor Author

Hi @BrunoBernardino, thanks for the feedback! I've updated the regex in the second commit:

  • [^\S\n]{2,} → collapses runs of 2+ non-newline whitespace into a single space
  • \n{3,} → collapses 3+ consecutive newlines into a double newline (paragraph break)

Single line breaks are now preserved. Let me know if this looks good!

Member

@BrunoBernardino BrunoBernardino left a comment

Thanks for the changes!

@BrunoBernardino BrunoBernardino merged commit 1aca444 into bewcloud:main Feb 23, 2026

Development

Successfully merging this pull request may close these issues.

[Feature Request] Allow Feed reader to render HTML

2 participants