fix(webid-tls): improve HTML profile parsing#72
fix(webid-tls): improve HTML profile parsing#72melvincarvalho wants to merge 2 commits intogh-pagesfrom
Conversation
- Change Accept header to request text/html to avoid conneg issues - Fix JSON-LD regex to handle additional attributes on script tag Fixes #70
There was a problem hiding this comment.
Pull request overview
This PR fixes WebID-TLS authentication failures when HTML profiles contain embedded JSON-LD with additional attributes on the script tag. The fix addresses content negotiation issues and improves HTML parsing robustness.
Key Changes:
- Modified Accept header to request
text/htmlinstead of multiple content types to work around conneg priority issues - Updated JSON-LD extraction regex to handle script tags with additional attributes (e.g.,
id,class)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const response = await fetchWithTimeout(webId, { | ||
| headers: { | ||
| 'Accept': 'application/ld+json, text/turtle, application/json' | ||
| 'Accept': 'text/html' |
There was a problem hiding this comment.
Restricting the Accept header to only 'text/html' could break WebID-TLS authentication for profiles served exclusively as Turtle or JSON-LD. While this may work around conneg issues with HTML profiles, the code at lines 155-169 still handles multiple content types in the response. Consider using 'text/html, application/ld+json;q=0.9, text/turtle;q=0.8' to indicate HTML preference while maintaining backward compatibility with non-HTML profiles. Alternatively, if HTML-only is intentional, remove the dead code paths for handling Turtle and JSON-LD content types.
| 'Accept': 'text/html' | |
| 'Accept': 'text/html, application/ld+json;q=0.9, text/turtle;q=0.8' |
| } else if (contentType.includes('text/html')) { | ||
| // Try to extract JSON-LD from HTML data island | ||
| const jsonLdMatch = text.match(/<script\s+type=["']application\/ld\+json["']\s*>([\s\S]*?)<\/script>/i); | ||
| const jsonLdMatch = text.match(/<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/i); |
There was a problem hiding this comment.
The updated regex pattern for extracting JSON-LD from HTML profiles lacks test coverage. Since this change fixes authentication failures, consider adding tests that verify the regex correctly matches script tags with additional attributes (e.g., <script type="application/ld+json" id="me">) as well as the original format. This would prevent future regressions and validate the fix.
Replies to Copilot ReviewRe: Accept header (line 142)Good point about backward compatibility. The quality value approach ( For now,
Re: Test coverage (line 160)Agreed. I'll add a test case for the regex with additional attributes. The existing
|
Adds test cases for the HTML JSON-LD extraction regex to ensure it handles: - Basic script tags - Script tags with additional attributes (e.g., id="me") - Script tags with attributes before type - Single quotes - Multiline content - Non-matching script types
Summary
Fixes WebID-TLS authentication failures with HTML profiles containing embedded JSON-LD.
Changes
Accept header - Request
text/htmlinstead of mixed types to work around conneg priority issues (Conneg ignores Accept header priority #71)JSON-LD regex - Use more flexible pattern to handle additional attributes on script tag (e.g.,
<script type="application/ld+json" id="me">)Testing
Tested with WebID-TLS client certificate authentication against HTML profile with embedded JSON-LD.
Fixes #70