Skip to content

Stringify sanitized HTML as HTML (not XML)#3575

Merged
tillprochaska merged 1 commit intodevelopfrom
fix/safe-html-self-closing-tags
Jan 25, 2024
Merged

Stringify sanitized HTML as HTML (not XML)#3575
tillprochaska merged 1 commit intodevelopfrom
fix/safe-html-self-closing-tags

Conversation

@tillprochaska
Copy link
Contributor

This fixes a bug where Aleph doesn’t correctly preview HTML pages (see #3334).

The etree.ElementTree.tostring method stringifies as valid XML by default. While XML allows self-closing tags by default, only some tags can be self-closing in HTML.

This causes problems when rendering the sanitized HTML. Consider this example:

<noscript><script src="..."></script></noscript>
<h1>Page title</h1>
<p>Some text</p>

When converting the sanitized element tree to XML, this would result in something like this:

<noscript />
<h1>Page title</h1>
<p>Some text</p>

noscript is not a valid void element which results in browsers treating all following elements as children of the noscript tag, parsing the HTML into an effective DOM structure like this:

<noscript>
  <h1>Page title</h1>
  <p>Some text</p>
</noscript>

As a result the actual page contents are hidden when rendered in a browser (if JavaScript is enabled).

The `etree.ElementTree.tostring` method stringifies as valid XML by default. While XML allows self-closing tags by default, only some tags can be self-closing in HTML.

This causes problems when rendering the sanitized HTML. Consider this example:

```html
<noscript><script src="proxy.php?url=..."></script></noscript>
<h1>Page title</h1>
<p>Some text</p>
```

When converting the sanitized element tree to XML, this would result in something like this:

```html
<noscript />
<h1>Page title</h1>
<p>Some text</p>
```

`noscript` is not a valid void element which results in browsers treating all following elements as children of the `noscript` tag, parsing the HTML into an effective DOM structure like this:

```html
<noscript>
  <h1>Page title</h1>
  <p>Some text</p>
</noscript>
```

As a result the actual page contents are hidden when rendered in a browser (if JavaScript is enabled).
@tillprochaska tillprochaska marked this pull request as ready for review January 25, 2024 09:46
Copy link
Contributor

@stchris stchris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant investigation, thanks for documenting this in detail! And for fixing it, of course!

@tillprochaska tillprochaska merged commit 4c64254 into develop Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants