Breaking Up with Long Tasks or: how I learned to group loops and wield the yield
https://rviscomi.dev/2025/01/breaking-up-with-long-tasks/
Thu, 02 Jan 2025

The post Breaking Up with Long Tasks or: how I learned to group loops and wield the yield appeared first on rviscomi.dev.

This post originally appeared in the Web Performance Calendar on December 31, 2024. Try the demo.

Everything, On the Main Thread, All at Once

Arrays are in every web developer’s toolbox, and there are a dozen ways to iterate over them. Choose wrong, though, and all of that processing time will happen synchronously in one long, blocking task. The thing is, the most natural ways are the wrong ways. A simple for..of loop that processes each array item is synchronous by default, while Array methods like forEach and map can ONLY run synchronously. You almost certainly have a loop like this waiting to be optimized right now.

What’s the problem with long tasks, anyway? Every long task is a liability for an unresponsive user experience. If the user interacts with the page at just the right (or wrong) time, the browser won’t be able to handle that interaction until the task completes, which contributes to its input delay and slow Interaction to Next Paint (INP) performance. You can think of them like potholes on a road, forcing drivers to dodge them or risk damaging their cars—an unpleasant experience either way. Likewise, long tasks create unresponsive UIs, which can frustrate users and impact business metrics. They’re especially problematic when they’re not just coinciding with a user interaction, but in response to one. It’s no longer a matter of poor timing, because every click necessarily becomes a slow click.

Synchronously processing large arrays is one of the easiest ways to introduce long tasks. Even if the unit of work performed on each item in the array is reasonably fast, that time scales up linearly with the number of items. For example, if a CPU can complete one unit of work in 0.25 ms, and there are 1,000 units, the total processing time will be 250 ms, creating a long task and exceeding the threshold for a fast and responsive interaction. The key to breaking up the long task is to use the repetition to your advantage: each iteration of the loop is an opportunity to interrupt the processing and update the UI as needed.

Optimizing interaction responsiveness

Interrupting a task to allow the event loop to continue turning is known as yielding. There are a few ways to yield, with the classic approach being setTimeout with a delay of 0 ms, or the more modern alternative: scheduler.yield. It’s not currently supported in all browsers, so production-ready use cases will need a polyfill or fall back to setTimeout. In both cases, the trick to making the loop asynchronous is to use async/await. But there’s a catch.
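As a sketch of that fallback (the helper name yieldToMain is my own, not from the post), a cross-browser yield can feature-detect scheduler.yield and fall back to setTimeout:

```javascript
// Yield control back to the main thread. Prefers scheduler.yield where
// supported; otherwise falls back to a zero-delay timeout. The helper
// name is illustrative.
function yieldToMain() {
  if (typeof scheduler !== 'undefined' && typeof scheduler.yield === 'function') {
    return scheduler.yield();
  }
  return new Promise((resolve) => setTimeout(resolve, 0));
}
```

An async loop can then `await yieldToMain()` between iterations without caring which primitive the browser supports.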

If you’re using an Array method like forEach or map, you’ll quickly realize that this doesn’t work:

function handleClick() {
  items.forEach(async (item) => {
    await scheduler.yield();
    process(item);
  });
}
A 917ms long task blocking a pointer interaction

forEach doesn’t care if your callback function is asynchronous; it will plow through every item in the array without awaiting the yield. And it doesn’t matter whether you use scheduler.yield or setTimeout. Apparently, this trips up a lot of developers, with this StackOverflow question having been viewed 2.4 million times since it was asked in 2016. The solution is in the top answer: switch to using a for..of loop instead.

async function handleClick() {
  for (const item of items) {
    await scheduler.yield();
    process(item);
  }
}
Short, broken-up tasks all together taking 1.2 seconds to complete, without blocking the interaction

Instead of a monolithic long task blocking the click handler, now we’ve spread the work out into smaller tasks, responding to the interaction instantly. Problem solved, right?

Before we get into the major problem with this approach, you might have noticed the third most upvoted answer on that StackOverflow question, which recommends using the reduce method. In case you were tempted to cling to your functional programming tendencies and use reduce to break up the long task, think again.

function handleClick() {
  items.reduce(async (promise, item) => {
    await promise;
    await scheduler.yield();
    process(item);
  }, Promise.resolve());
}
A 267ms long task blocking the pointer interaction followed by short tasks

This approach passes a promise along from one iteration to the next, which we can await before processing the next item. However, the issue with this is that reduce still plows through the entire array, synchronously queuing up each microtask. It’s not until the promises are fulfilled that it starts processing the items. In other words, even though the actual processing happens asynchronously, the amount of overhead is still enough to make the click handler slow.

Yielding within a for..of loop seems like the best way to achieve responsive interactions, but the problem is that we’re yielding on EVERY iteration of the loop. Let’s see what happens in browsers that don’t support scheduler.yield:

async function handleClick() {
  for (const item of items) {
    await new Promise(resolve => setTimeout(resolve, 0));
    process(item);
  }
}
Short tasks that don't block the interaction that cumulatively take 2.3 minutes to complete

With setTimeout, the job takes over 2 minutes to complete! Compare that with scheduler.yield, which completes in about 1 second. The huge disparity comes down to the fact that these are nested timeouts. Unlike tasks deferred with scheduler.yield, timeouts nested more than a few levels deep are clamped by the browser to a minimum 4 ms delay. But that’s not to say that using scheduler.yield on every iteration comes without a cost. Both approaches introduce some overhead, which can be mitigated with batching.

Optimizing total processing time

Batching is processing multiple iterations of the loop before yielding. The interesting problem is knowing when to yield. Let’s say you yield after processing every 100 items in the array. Did you solve the long task problem? Well, that depends on the CPU speed and how much time the average item takes to process, and both of those factors will vary depending on the client’s machine.

Rather than batching by number of items, a much better approach would be to batch items by the time it takes to process them. That way you can set a reasonable batch duration, say 50 ms, and yield only when it’s been at least that long since the last yield.

const BATCH_DURATION = 50;
let timeOfLastYield = performance.now();

function shouldYield() {
  const now = performance.now();
  if (now - timeOfLastYield > BATCH_DURATION) {
    timeOfLastYield = now;
    return true;
  }
  return false;
}

async function handleClick() {
  for (const item of items) {
    if (shouldYield()) {
      await scheduler.yield();
    }
    process(item);
  }
}
Short tasks that don't block the interaction that cumulatively take 872ms to complete, using scheduler.yield

And here are the results with setTimeout:

Short tasks that don't block the interaction that cumulatively take 1.3s to complete, using setTimeout

The choice of batch duration is a tradeoff between minimizing the amount of time a user would spend waiting if they interacted with the page during the batch processing and the total time to process everything in the array. If you chunk up the work into 100 ms batches, that’s fewer interruptions and faster throughput, but at worst that’s also 100 ms of possible input delay, which is already half the budget for a fast interaction. On the other hand, with 10 ms batches, the worst case input delay is almost negligible, but more interruptions and slower throughput.

Your primary goal should be to unblock the interaction so that it feels responsive. That could just mean yielding so that you can update the UI with the first few items, or kicking off a loading animation. How often you yield during the rest of the processing time will depend on what your second priority is. Maybe nothing can be shown to the user until the entire array is processed, so your secondary goal should be to finish as quickly as possible. In that case you’ll want to go with a higher batch duration. Or maybe it’s ok to do the work in the background, but the UI should remain as smooth and responsive as possible. That lends itself to a smaller batch duration. When in doubt, 50 ms can be a good compromise, but it’s always a good idea to profile different approaches and pick what works best for your app.

We could stop there, but there’s one more thing that you might want to consider: frame rate. If you look closely at the screenshots above, you’ll notice thin green markers roughly corresponding to the paint cycle. These are custom timings using performance.mark to show when a requestAnimationFrame callback runs. There’s a curious difference in the frame rates of scheduler.yield and setTimeout.

Optimizing smoothness

To reiterate, if the work needs to be completed as quickly as possible, you should minimize the number of yields. But there are plenty of instances where it’s more important to provide visual feedback to the user that something is happening, like a progress indicator. Even if you’re not showing any progress to the user, you might still want to keep the frame rate reasonably fast to avoid janky animations or scrolling behavior. That’s where the preferential priority of scheduler.yield starts getting in the way.

A line chart showing batch duration on the x-axis and frames per second on the y-axis, with two series: scheduler.yield and setTimeout. The line for scheduler.yield appears relatively flat around 10 FPS as the batch duration increases from 0 to 10, 50, and 100. However for setTimeout, the line is flat at 60 FPS for batch durations of 0 and 10, then falls to 20 FPS at 50ms, and 15 FPS at 100ms.

Surprisingly, with scheduler.yield the frame rate is relatively flat around 10 FPS for batch durations under 100 ms. setTimeout, however, follows the expected curve, where more frames are painted as the batch duration decreases, approaching 60 FPS. Tasks scheduled with scheduler.yield are given preferential treatment, so even if you don’t do any batching at all, the browser will prioritize them over the next paint, but only up to a point.

Highlighting the 120ms time span between requestAnimationFrame calls

With no batching, the average time between frames is 120 ms, far from the 16 ms you get with tasks scheduled with setTimeout. This means your frame rate will be a lame 8 FPS. If you’re cool with that, you can skip the rest of this section. But I know there are some people who can’t stand the thought of a laggy UI, so here are some tips.

const BATCH_DURATION = 1000 / 30; // 30 FPS
let timeOfLastYield = performance.now();

function shouldYield() {
  const now = performance.now();
  if (now - timeOfLastYield > BATCH_DURATION) {
    timeOfLastYield = now;
    return true;
  }
  return false;
}

async function handleClick() {
  for (const item of items) {
    if (shouldYield()) {
      await new Promise(requestAnimationFrame);
      await scheduler.yield();
    }
    process(item);
  }
}
Highlighting the 37ms time span between requestAnimationFrame calls

First, change the batch duration to align with your desired frame rate. When it’s time to yield, before calling scheduler.yield, await a promise that resolves in a requestAnimationFrame callback. This effectively prevents any more work from happening until a frame is painted, ensuring a much smoother UI.

One gotcha is that the rAF callback won’t be fired as long as the tab is in the background. We can make a few adjustments to handle this edge case.

const BATCH_DURATION = 1000 / 30; // 30 FPS
let timeOfLastYield = performance.now();

function shouldYield() {
  const now = performance.now();
  if (now - timeOfLastYield > (document.hidden ? 500 : BATCH_DURATION)) {
    timeOfLastYield = now;
    return true;
  }
  return false;
}

async function handleClick() {
  for (const item of items) {
    if (shouldYield()) {
      if (document.hidden) {
        await new Promise(resolve => setTimeout(resolve, 1));
        timeOfLastYield = performance.now();
      } else {
        await Promise.race([
          new Promise(resolve => setTimeout(resolve, 100)),
          new Promise(requestAnimationFrame)
        ]);
        timeOfLastYield = performance.now();
        await scheduler.yield();
      }
    }
    process(item);
  }
}
Short, non-blocking tasks with a large chunk of time in the middle while the tab was in the background during which tasks alternate between 500ms of processing and 500ms of rest.

The first change is to the shouldYield function, which now checks the page visibility. If the document is hidden, we can afford to yield in larger batches of 500 ms. Even though there is no user to experience a slow interaction, this still introduces a long task that could block the page from becoming visible if the user returns before the work is completed. document.hidden will continue to be true until the visibilitychange event can be handled, so we still need to yield periodically.

The second change is to the way we yield when the document is visible. We need to make sure that we’re not dependent on the rAF callback, so we can race it against a 100 ms timeout, borrowing from Vercel’s await-interaction-response approach. The 100 ms timeout will be throttled to 1000 ms while the tab is backgrounded, but after that, the timeout will fire and work can resume. Resetting the timeOfLastYield is good so that the first backgrounded batch can run for the full 500 ms.

The final change is to the way we yield when the document is hidden. We want the visibilitychange event to fire, but scheduler.yield will always preempt it, delaying the page from becoming visible until the work is completed. That might be worth more investigation because it feels like a bug, but we can work around it by switching to a timeout-based approach. As long as the document is hidden, work will be done in 500 ms batches with an additional 500 ms delay between each batch, adding up to the 1000 ms delay for throttled timeouts. That way, if the user returns before the work is completed, the visibility state will be updated and the regular batching logic will kick back in.

If all of this feels overly complicated, that’s probably because it is. If your application can withstand pausing array iteration while the tab is in the background, then you should skip this last part for the sake of simplicity. In any case, this was a fun exercise in pushing the limits of yielding.

Try it out

If you’d like to try out the different yielding strategies, you can use this demo. That’s also what I used to make the screenshots in this post.

Hopefully this was a useful overview of the “yield in a loop” problem and how I’d go about solving it. Feel free to let me know if I got something wrong, or if you know of a better way I’d love to hear about it. Good luck out there!

A faster web in 2024
https://rviscomi.dev/2023/11/a-faster-web-in-2024/
Fri, 10 Nov 2023

The post A faster web in 2024 appeared first on rviscomi.dev.

Note: This blog post is a companion to a presentation I gave at DevFest NYC on November 11, 2023.

The web is getting faster. In fact, according to HTTP Archive, more websites than ever before are passing the Core Web Vitals assessment, which looks at three metrics that represent different aspects of page performance: loading speed, interaction responsiveness, and layout stability.

Earlier this week, the Chrome team published a retrospective on the Web Vitals program that details some of the browser-level and ecosystem improvements that got us to this point. In the post, the Chrome team reported a savings of 10,000 years worth of waiting thanks to these improvements to Core Web Vitals.

So with 2024 around the corner, I wanted to take a closer look at what it’s going to take to carry this momentum forward and continue making the web even faster.

But there’s a catch. The metric we use to measure interaction responsiveness is changing in 2024. And this new metric is finding a lot of responsiveness issues that have been flying under the radar.

Will we be able to meet this new challenge? Will we be able to do so while keeping pace with the performance improvements of 2023? I think so, but we’re going to need to learn some new tricks.

Why care about web performance

This is a question I often take for granted. I’ve spent the last 11 years working on and advocating for web performance, and sometimes I naively assume that everyone—in my bubble, at least—gets it too.

If we’re going to continue making the web faster, we’re going to need more developers and business leaders to buy in to the idea that performance is a virtue worth doing something about.

So let’s talk about the “why” of web performance.

Last week, I had the chance to go to the performance.now() conference in Amsterdam. It’s become an annual pilgrimage for many of us in the web performance industry to convene and talk about pushing the web faster. One of the co-chairs and presenters at the conference was Tammy Everts, who perfectly summed up the answer to this question in the slide pictured above.

In 2016, Tammy published a book called Time is Money in which she lists a few reasons why a site owner might want to care about optimizing web performance:

  • Bounce rate
  • Cart size
  • Conversions
  • Revenue
  • Time on site
  • Page views
  • User satisfaction
  • User retention
  • Organic search traffic
  • Brand perception
  • Productivity
  • Bandwidth/CDN savings
  • Competitive advantage

Drawing from decades of experience and volumes of case studies and neuroscience research, Tammy makes the case that all of these things can be positively influenced by improving a site’s performance.

Tammy also worked with Tim Kadlec to create WPO stats, a site that catalogs years of web performance case studies directly linking web performance improvements to better business outcomes.

For example, in one case study, a Shopify site improved loading performance and layout stability by 25% and 61%, and saw a 4% decrease in bounce rate and 6% increase in conversions. In another case study, the Obama for America site improved performance by 60% and saw a corresponding increase in conversions of 14%. There are dozens of examples just like these.

Happy users make more money. If you think about the typical conversion funnel, fewer and fewer users make it deeper into the funnel. Optimizing performance effectively “greases the funnel” to drive conversions by giving users a more frictionless experience.

That’s the business impact, but even more fundamentally, performance is about the user experience.

How we’re doing

The modern web is the fastest it’s ever been, using Google’s measure of performance: Core Web Vitals. To put that in perspective, it’s helpful to look at how we got here.

At the start of 2023, 40.1% of websites passed the Core Web Vitals assessment for mobile user experiences. Since then, we’ve seen steady growth. As of September 2023, we’re at 42.5% of websites passing the Core Web Vitals assessment, an improvement of 2.4 percentage points, or 6.0%. This is a new high, representing an incredible amount of work by the entire web ecosystem.

This might seem like a glass half full / half empty situation. You could celebrate the positive story of nearly half of all websites having measurably good performance. Another equally valid way to look at it is that more than half of websites are not meeting the bar for performance.

We can have it both ways! It’s amazing that the web has improved so much, and at the same time we can push ourselves to continue this momentum into 2024.

Keeping pace

So, can we keep up the current pace of improvement and convert another 6% of websites to pass the assessment? I’d like to think that we can, but everything is about to change with the metric we use to assess page responsiveness.

Earlier this year, I wrote a blog post announcing that Interaction to Next Paint (INP) will become Google’s new responsiveness metric in the Core Web Vitals assessment, replacing First Input Delay (FID) in March 2024.

This is a very good change, as INP is much more effective at catching instances of poor responsiveness. As a result though, many fewer websites have good INP scores compared to FID, especially among mobile experiences.

In the Performance chapter of the 2022 Web Almanac, I wrote about what the Core Web Vitals pass rates would look like in a world with INP instead of FID.

For mobile experiences, only 31.2% of sites would pass the assessment, a drop of 8.4 percentage points (21.2%) from the FID standard. That was based on data from June 2022. How are we looking now?

Things are actually looking much better! The gap is all but closed on desktop, and mobile experiences are only trailing by 6 points (14.2%).

But the fact remains: pass rates will drop substantially once INP takes effect.

While it might seem like a step backwards at first, keep in mind that INP is giving us a much more accurate look at how real users are experiencing interaction responsiveness. Nothing about how the web is actually experienced is changing with INP—only our ability to measure it. In this case, a drop in the pass rates does not actually mean that the web is getting slower.

So I’m still optimistic that we’re going to see continued improvements in performance throughout 2024. We’re just going to have to recalibrate our expectations against the new baseline when INP hits the scene.

Regaining lost ground

FID is the oldest metric in the Core Web Vitals. It first appeared in the Chrome UX Report dataset in June 2018. As of today, only 5.8% of websites have any FID issues whatsoever on either desktop or mobile. So I think it’s fair to say that for the most part we haven’t really had to worry about interaction responsiveness.

INP challenges us to overcome five years of inertial complacency. To do it, we’re going to have to flex some web performance muscles we may not have used in a while, if ever. We’re going to have to break out some new tools.

A long task shown in the Chrome DevTools Performance panel
Source: Optimize Long Tasks on web.dev

We’re going to have to get very comfortable with this.

This is what a long task looks like in the Performance panel of Chrome DevTools. The red striping shows the amount of the task that exceeds the 50ms budget, making it a “long” task. If a user were to attempt to interact with the page at this time, the long task would block the page from responding, creating what the user (and the INP metric) would perceive to be a slow interaction.

The long task is broken up, as shown in the Chrome DevTools Performance panel
Source: Optimize Long Tasks on web.dev

The solution to this problem might require a web performance technique you’ve never tried before: breaking up long tasks. The same amount of work will get done eventually, but by adding yield points between major chunks of work, the page will be able to more quickly respond to any user interactions that happen during the task.
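As a minimal sketch of that idea (function names are illustrative, and a zero-delay timeout stands in for whichever yielding primitive you use):

```javascript
// Process an array of work items, yielding between chunks so any pending
// user input can be handled. A deliberately simplified sketch.
async function processWithYieldPoints(workItems) {
  for (const doWork of workItems) {
    doWork();
    // Yield point: let the event loop turn before the next chunk.
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
}
```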

Chrome is experimenting with a couple of APIs in origin trials to help address problematic long tasks. The first is the scheduler.yield() API, which is designed to give developers more control over breaking up long tasks. It ensures that the work happens continuously, without other tasks cutting in.

Knowing which long tasks to break up is its own science. To help with this, Chrome is also experimenting with the Long Animation Frames API. Similar to the Long Tasks API, which reports when long tasks happen and for how long, the Long Animation Frames API reports on long rendering updates, which can comprise multiple tasks. Crucially, it also exposes much more actionable attribution info about the tasks, including the script source down to the character position in code.
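For illustration, attribution data shaped like Long Animation Frame entries could be aggregated per script source. The fields used here (scripts, sourceURL, duration) follow the API proposal, so treat them as assumptions and verify against the spec you target:

```javascript
// Rank script sources by total attributed duration, given entries shaped
// like Long Animation Frame entries: { scripts: [{ sourceURL, duration }] }.
function slowestScripts(entries) {
  const totals = new Map();
  for (const entry of entries) {
    for (const script of entry.scripts ?? []) {
      const key = script.sourceURL || '(inline)';
      totals.set(key, (totals.get(key) ?? 0) + (script.duration ?? 0));
    }
  }
  // Sort descending so the biggest offender comes first.
  return [...totals.entries()].sort((a, b) => b[1] - a[1]);
}
```

In aggregate, output like this is what lets you skip trial-and-error optimization and go straight to the scripts responsible.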

Similar to tracking INP performance in analytics tools, developers could use the Long Animation Frames API to track why the INP was slow. In aggregate, this data can narrow down the root causes of common performance issues, saving developers from optimizing by trial and error.

These APIs aren’t stable yet, but they offer powerful new functionality to complement the existing suite of tools to optimize responsiveness. Even though it might feel like we’re playing catch-up just getting the pass rates back to where they were in a FID-centric assessment, the web is actually getting faster in the process!

It might seem like responsiveness will be the new bottleneck when INP takes over, but that’s actually not the case. Loading performance, as measured by the Largest Contentful Paint (LCP) metric, is and will still be the weakest link in the Core Web Vitals assessment.

Passing the Core Web Vitals assessment requires a site to be fast in all three metrics. So in order to continue the pace of improvement, we need to be looking at the metrics that need the most help.

54.2%

This is the percentage of websites with good LCP on mobile, compared to 64.1% and 76.0% for INP and CLS, according to HTTP Archive as of the September 2023 dataset.

As long as web performance has been a thing, developers have been talking about loading performance. Since the days of simple HTML applications, we’ve built up a lot of institutional knowledge around traditional techniques like backend performance and image optimization. But web pages have evolved a lot since then. They’ve become ever more complex with an increasing number of third party dependencies, richer media, and sophisticated techniques to render content on the client. Modern problems require modern solutions.

In 2022, Philip Walton introduced a new way of breaking down the time spent in LCP: the time to start receiving content on the client (TTFB), the time to start loading the LCP image (resource load delay), the time to finish loading the LCP image (resource load time), and the time until the LCP element is rendered (element render delay). By measuring which of these diagnostic metrics are slowest, we could focus our efforts on the optimizations that would most effectively improve LCP performance.
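As a small illustrative helper (the names are mine, not from the post), ranking the four sub-parts makes the biggest optimization target obvious:

```javascript
// Given the four LCP sub-parts in milliseconds, return them sorted
// largest-first so the main optimization opportunity is at index 0.
function lcpBreakdown({ ttfb, loadDelay, loadDuration, renderDelay }) {
  const parts = [
    ['TTFB', ttfb],
    ['resource load delay', loadDelay],
    ['resource load duration', loadDuration],
    ['element render delay', renderDelay],
  ];
  return parts.sort((a, b) => b[1] - a[1]);
}
```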

Conventional wisdom says that if you want your LCP image to appear sooner, you should optimize the image itself. This includes things like using a more efficient image format, caching it longer, resizing it smaller, and so on. In terms of the LCP diagnostic metrics, these things would only improve the resource load time. What about the rest?

Earlier I mentioned that last week I was at the performance.now() conference. Another one of the presenters was Estela Franco, who I collaborated with to share some brand new data sourced from real Chrome users on where this LCP time is typically spent.

Estela Franco presenting Chrome data on where LCP time is spent (November 2023)
Photo credit: Rick Viscomi

The photo above shows Estela’s slide with the LCP diagnostics as a percentage of the mean LCP time. Here’s the same data presented in milliseconds:

LCP score         | Mean TTFB | Mean load delay | Mean load duration | Mean render delay
Good              | 410       | 400             | 80                 | 230
Needs improvement | 1,020     | 1,350           | 260                | 490
Poor              | 2,330     | 3,670           | 580                | 990
Breakdown of mean LCP diagnostic performance in milliseconds, grouped by LCP score (October 2023)
Source: Chrome 119 beta internal data

What’s perhaps most surprising about this data is that the resource load time (load duration) is already the fastest LCP diagnostic. The slowest part is actually the resource load delay. Therefore, the biggest opportunity to speed up slow LCP images is to load them sooner. To reiterate, the problem is less about how long the image takes to load, it’s that we’re not loading it soon enough.

Browsers are usually pretty good about discovering images in the markup and loading them reasonably quickly. So why is this an issue? Developers aren’t making LCP images discoverable.

I also wrote about the LCP discoverability problem in the 2022 Web Almanac. In it, I reported that 38.7% of mobile pages that have an image LCP are not making it statically discoverable. Even if we look at the latest data from HTTP Archive, this figure is still at 36.0%.

A big part of the problem continues to be lazy loading. I first wrote about the negative performance effects of LCP lazy loading in 2021. Lazy loading is more than just the native loading=lazy attribute; developers can also use JavaScript to dynamically set the image source. Last year I reported that 17.8% of pages with LCP images lazy load them in some way. According to the latest data from HTTP Archive, we’ve improved slightly to 16.8% of pages. It’s not impossible to have a fast LCP if you lazy load the image, but it definitely doesn’t help. LCP images should never be lazy loaded.

To be clear: lazy loading is good for performance, but only for non-critical content. Everything else, including LCP images, must be loaded as eagerly as possible.

A totally different problem is client-side rendering. If the only markup you’re sending to the client is a <div id="root"></div> container that gets rendered by JavaScript, the browser can’t load the LCP image until it’s eventually discovered in the DOM. A better (if controversial) solution is to switch to a server-side rendering model.

We also need to contend with LCP images declared in CSS background styles. For example, background-image: url("cat.gif"). These images will not be picked up by the browser’s preload scanner, and so they won’t get the benefit of loading as early as possible. Using a plain old <img src="cat.gif"> element will get the job done.

In each of these cases, it’s also possible to use declarative preloading to make the images explicitly discoverable. In its simplest form, the code looks like this:

<link rel="preload" as="image" href="cat.gif">

Browsers will start loading the image sooner, but as long as its rendering is dependent on JavaScript or CSS, you may just be shifting from a load delay problem to a render delay problem. Eliminating these dependencies by putting the <img> directly in HTML is the most straightforward way to avoid this delay.

New tricks

So far all of these LCP recommendations are basically to dismantle some of the complexities we’ve introduced into our applications: LCP lazy loading, client-side rendering, and LCP background images. There are also some relatively new, additive techniques we could use to improve performance or even to avoid these delays altogether.

In last year’s Web Almanac, I reported that 0.03% of pages use fetchpriority=high on their LCP images. This attribute hints to the browser that the image should be loaded at a higher priority than its default. Images in Chrome are typically low priority by default, so this can give them a meaningful boost.

A lot has changed since last year! In the most recent HTTP Archive dataset, 9.25% of pages are now using fetchpriority=high on their LCP images. This is a massive leap, primarily due to WordPress adopting fetchpriority in version 6.3.
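For reference, adopting it is a one-attribute change (the file name is hypothetical):

```html
<img src="hero.jpg" fetchpriority="high" alt="Hero image">
```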

There are also a couple of techniques you can use to effectively get instant navigations: leveraging the back/forward cache and speculative loading.

When a user hits the back or forward buttons, a previously visited page is resumed. If the page was stored in the browser’s in-memory back/forward cache (also referred to as the bfcache) then it would appear to be loaded instantly. That LCP image will already be loaded and any of the JavaScript needed to render it will have already run. But not all pages are eligible for the cache. Things like unload listeners or Cache-Control: no-store directives currently* make pages ineligible for Chrome’s cache, even if those event listeners are set by third parties.

In the year since I last reported on bfcache eligibility for the Web Almanac, unload usage dropped from 17% to 12% of pages, and no-store usage dropped less significantly from 22% to 21%. So more pages are becoming eligible for this instant loading cache, which benefits all Core Web Vitals metrics.
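One common fix is to replace unload listeners with the bfcache-safe pagehide and pageshow events. A minimal sketch (the analytics function is a hypothetical stand-in):

```html
<script>
  // An `unload` listener would make the page ineligible for the bfcache.
  // `pagehide` fires in the same cases, including when entering the bfcache.
  window.addEventListener('pagehide', () => {
    sendAnalyticsBeacon(); // hypothetical cleanup/reporting function
  });

  // Detect when the page is restored from the bfcache.
  window.addEventListener('pageshow', (event) => {
    if (event.persisted) {
      // The page was resumed from the bfcache; refresh any stale state here.
    }
  });
</script>
```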

The other instant navigation technique is called speculative loading. Using the experimental Speculation Rules API, developers can hint to the browser that an entire page should be prerendered if there’s a high likelihood that the user will navigate there next. The API also supports prefetching, which is a less aggressive way to improve loading performance. The drawback is that it only loads the document itself and none of its subresources, so it’s less likely to deliver on the “instant navigation” promise than prerender mode.

Here’s an example of speculative loading in action, from the MDN docs:

<script type="speculationrules">
  {
    "prerender": [
      {
        "source": "list",
        "urls": ["next3.html", "next4.html"]
      }
    ]
  }
</script>

Both of these optimizations leverage different kinds of prerendering. With the bfcache, previously visited pages are preserved in memory so that revisiting them from the history stack can happen instantly. With speculative loading, the user doesn’t need to have ever visited the page for it to be prerendered. The net effect is the same: instant navigations.

The way forward

As more and more developers become aware of the challenges and opportunities to improve performance, I’m hopeful that the growth in sites passing the Core Web Vitals assessment that we’ve seen in 2023 will continue.

The first hurdle to clear is to even know that your site has a performance problem. PageSpeed Insights is the easiest way to run an assessment of your site, using public Core Web Vitals data from the Chrome UX Report. Even if you’re currently passing the assessment, pay close attention to your Interaction to Next Paint (INP) performance, as that will become the new standard for responsiveness in March 2024. You can also monitor your site’s performance using the Core Web Vitals report in Google Search Console. An even better way to understand your site’s performance is to measure it yourself, which enables you to get more granular diagnostic information about why it may be slow.

The next hurdle is to be able to invest time, effort, and maybe some money in improving performance. To do this, your organization first needs to care about web performance.

If your site has poor INP performance, there’s probably going to be a learning curve to start making use of all of the unfamiliar documentation, techniques, and tools to optimize long tasks. First Input Delay (FID) has given us something of a false sense of security when it comes to interaction responsiveness, but now we have an opportunity to find and fix the issues that would have otherwise been frustrating our users.

And even though INP is new and shiny, we can’t forget that Largest Contentful Paint (LCP) is the weakest link in the Core Web Vitals assessment. More sites struggle with LCP than any other metric. The way we’ve been building web apps over the years has changed, and so we need to adapt our optimization techniques accordingly, looking beyond just making images faster.

In lieu of the 2023 edition of the Web Almanac, I hope this post helps to demonstrate some of the progress we’ve seen this year and the room for improvement. The web is 6% faster, and that’s certainly worth celebrating. But most sites are still not fast—yet.

If we maintain the current rate of change of 6% per year, in 2026 more than half of sites will have good Core Web Vitals on mobile. So here’s my challenge. Let’s continue pushing our sites, our CMSs, our JavaScript frameworks, and our third party dependencies faster. Let’s continue to be advocates for better (if not instant) performance best practices in the web community. Here’s to the next 6% in 2024!

This post draws from the work of many people in the web performance community to whom I owe my thanks, including: Tammy Everts, Tim Kadlec, Estela Franco, Philip Walton, Mateusz Krzeszowiak, Annie Sullivan, Addy Osmani, Patrick Meenan, Jeremy Wagner, Barry Pollard, Brendan Kenny, and Felix Arntz.

The post A faster web in 2024 appeared first on rviscomi.dev.

]]>
https://rviscomi.dev/2023/11/a-faster-web-in-2024/feed/ 1
You probably don’t need http-equiv meta tags https://rviscomi.dev/2023/07/you-probably-dont-need-http-equiv-meta-tags/ https://rviscomi.dev/2023/07/you-probably-dont-need-http-equiv-meta-tags/#comments Thu, 27 Jul 2023 04:03:41 +0000 https://rviscomi.dev/?p=239 Until recently, I just assumed you could put anything equivalent to an HTTP header in an http-equiv meta tag, and browsers would treat it like the header itself. Maybe you thought the same thing—why wouldn’t you, with a name like that. But as it turns out, there are actually very few standard values that you […]

The post You probably don’t need http-equiv meta tags appeared first on rviscomi.dev.

]]>
Until recently, I just assumed you could put anything equivalent to an HTTP header in an http-equiv meta tag, and browsers would treat it like the header itself. Maybe you thought the same thing—why wouldn’t you, with a name like that.

But as it turns out, there are actually very few standard values that you can set here. And some values don’t even behave the same way as their header equivalents! What’s going on here and how are we supposed to use this thing?

Let’s take this as an example:

<meta
    http-equiv="X-UA-Compatible"
    content="IE=edge">

SERIOUSLY, WHAT DOES THIS DO? Why is it that if you load up any three random websites, one of them is bound to have this? And what does Internet Explorer have to do with anything anymore?

I could go on:

<meta
    http-equiv="content-type"
    content="text/html; charset=UTF-8">

Is this even necessary? It sure looks important—I wouldn’t want my web page to not be parsed as text/html.

Look, I know http-equiv meta tags of all things are not what most people get too worried about. It’s easy to copy-paste boilerplate markup from one project to the next because of some unquestioned folklore about what meta tags all HTML documents need. And if it works, it works, right?

Sure, but I’d argue that having a deeper understanding of what our code does and how to use it properly and effectively makes us all better developers. We can save ourselves the trouble of reaching for the wrong tool at first, only to find out later after burning time on debugging that maybe the http-equiv meta tag doesn’t do what we thought it does after all.

After a lot of researching and testing, I think I’m finally starting to get it. In this post I’ll share what the HTML spec says about http-equiv, how sites are actually using it in the wild, and argue why you probably* don’t need http-equiv meta tags.

👉 If you’d like to skip right to the takeaways, I’ve put together a cheatsheet with all of my http-equiv keyword recommendations.

*Unless…

I’ll start by giving my best arguments for needing http-equiv. I can break it down into two use cases: the response headers are hard or impossible to configure, and there might be tags added at runtime.

The first argument is about simplicity. If you’re deploying a static site somewhere like GitHub Pages, you don’t have control over the server or its response headers. If you need to set a header, your only choice is to use http-equiv or to migrate your site somewhere else.

The other argument is more about flexibility. You might not know what you need until the page is already running on the client. Maybe a third party needs to add the http-equiv meta tag for some feature to work.

These reasons don’t apply equally to all http-equiv use cases, though. For example, some use cases unlock features that require server-side logic to work anyway, while others are only applicable when parsed directly from the static HTML.

You really need to understand what each value does in order to be sure that you’re using http-equiv correctly. So let’s go back and see where it all started and how it’s supposed to be used today.

A brief history of http-equiv

In 1994, Roy Fielding proposed a new HTML element:

HTTP-EQUIV

This attribute binds the element to an HTTP response header. It means that if you know the semantics of the HTTP response header named by this attribute, then you can process the contents based on a well-defined syntactic mapping, whether or not your DTD tells you anything about it. HTTP header names are not case sensitive. If not present, the attribute NAME should be used to identify this metainformation and it should not be used within an HTTP response header.

HTTP servers can read the content of the document HEAD to generate response headers corresponding to any elements defining a value for the attribute HTTP-EQUIV. This provides document authors a mechanism (not necessarily the preferred one) for identifying information which should be included in the response headers for an HTTP request.

One example of an inappropriate usage for the META element is to use it to define information that should be associated with an already existing HTML element, e.g.

<meta
  name="Title"
  content="The Etymology of Dunsel">

A second example of inappropriate usage is to name an HTTP-EQUIV equal to a response header that should normally only be generated by the HTTP server. Example names that are inappropriate include Server, Date, and Last-Modified—the exact list of inappropriate names is dependent on the particular server implementation. It is recommended that servers ignore any META elements which specify http-equivalents which are equal (case-insensitively) to their own reserved response headers.

https://www.w3.org/MarkUp/html-spec/Elements/META.html

This is useful context to understand what the original intent of http-equiv was and was not. It wasn’t meant to replace more semantic HTML elements like title. It also wasn’t for HTTP headers that would have otherwise been more appropriately set by the server.

Unlike their actual usage today, http-equiv meta tags were initially intended to be read by the server so that it can set the corresponding response headers. Nowadays though, they’re read by the user agent to parse and handle the document accordingly. The HTML spec calls these pragma directives.

Today, rather than permissively supporting any and all HTTP headers, the only standard keywords (specced values of the http-equiv attribute) are, in their entirety:

Keyword                  Standard  Conforming
content-language         ✅        ❌
content-type             ✅        ✅
default-style            ✅        ✅
refresh                  ✅        ✅
set-cookie               ✅        ❌
x-ua-compatible          ✅        ✅
content-security-policy  ✅        ✅
Standard http-equiv keywords, according to the HTML spec. (Source)

That’s a pretty short list! Not only that, but two of them are actually non-conforming, meaning that using them is actively discouraged or even completely ignored by the browser. That leaves us with only five conforming http-equiv keywords.

So, given that we’re all disciplined web developers, you wouldn’t expect to find anything improper in what people actually use this for, right?

Let’s look at the data.

For the rest of this post, I’ll be sharing stats from the June 2023 crawl of the public HTTP Archive dataset. Jump to the Methodology section for the queries and more info on the results.

Fun fact: the title used in Fielding’s example is “The Etymology of Dunsel”. Dunsel is a fictional word from the Star Trek universe meaning useless, superfluous, or unnecessary. It’s an ominously fitting description for a lot of today’s http-equiv usage, as you’ll see in the results below.

http-equiv adoption

Of the 17,389,897 websites in HTTP Archive’s June 2023 crawl, 11,722,086 (67%) of them contain an http-equiv meta tag. That’s a huge proportion of the web, on par with a behemoth third party resource like Google Analytics.

Visualization of all observable websites (gray) and all websites that contain an http-equiv meta tag (red). Each pixel represents 100 websites.

Let’s dig deeper and see what the most popular http-equiv keywords are:

The emoji in the last two columns indicate whether each value is standard and whether it’s conforming. It’s easy to see at a glance that there are a lot of non-standard values in use on thousands of websites.

The two most popular values do happen to be standard and conforming. But are they being used correctly? And are they actually necessary?

Let’s explore a few of the most interesting results.

Obsolete keywords

Rank  Keyword              Sites      Standard
1     x-ua-compatible      6,469,282  ✅
13    content-style-type   237,535    ❌
15    content-script-type  197,005    ❌
16    imagetoolbar         137,625    ❌
18    cleartype            99,083     ❌
22    page-enter           13,937     ❌
25    msthemecompatible    12,797     ❌
30    x-frame-options      7,173      ❌
A few of the obsolete keywords used in http-equiv.
(HTTP Archive, June 2023)

Many of the top http-equiv keywords are bygone features of the Internet Explorer era. Official support for Internet Explorer, whose latest major version (IE 11) was released in October 2013, ended in June 2022.

Internet Explorer adoption (Statcounter)

As of July 2023, IE adoption is at an all-time low. According to Statcounter, 0.2% of web traffic comes from users on IE.

So the question is, if Microsoft won’t even support IE users, why should you?

x-ua-compatible

The most popular keyword is x-ua-compatible. It’s standard, it conforms, but the spec is quite clear that it should have no effect in modern browsers:

In practice, this pragma encourages Internet Explorer to more closely follow the specifications.

For meta elements with an http-equiv attribute in the X-UA-Compatible state, the content attribute must have a value that is an ASCII case-insensitive match for the string "IE=edge".

User agents are required to ignore this pragma.

WHATWG HTML spec

The spec also requires that the value be exactly IE=edge, so let’s see if the sites abide:

Value                   Sites      Standard
ie=edge                 5,047,117  ✅
ie=edge,chrome=1        1,258,286  ❌
ie=emulateie7           27,752     ❌
ie=7;ie=9;ie=10;ie=11   23,192     ❌
ie=10                   22,676     ❌
ie=9                    21,842     ❌
chrome=1                18,033     ❌
ie=9,chrome=1           12,401     ❌
ie=9;ie=8;ie=7;ie=edge  12,045     ❌
ie=11                   10,671     ❌
ie=8                    9,953      ❌
ie=edge;chrome=1        7,666      ❌
Top content values for x-ua-compatible.
(HTTP Archive, June 2023. View query.)

The 12 content values listed above make up 99% of the x-ua-compatible usage. The most popular one is ie=edge, which is the only standard value.

What’s the point, though? Are you testing your website in IE? Are you gracefully degrading down to decades-old HTML, CSS, and JavaScript? Is your website even remotely presentable to the 1 out of every 500 users on IE? No, your site is almost certainly not IE-compatible, and this meta tag isn’t a magic wand to make it so.

Modern web pages should not need x-ua-compatible.

For a more detailed history of x-ua-compatible, check out Almost (Standards) Doesn’t Count by Jay Hoffmann.

content-style-type, content-script-type

Everyone knows <script> means JavaScript and <style> means CSS. But that wasn’t always the case.

Back in the day when the W3C’s HTML spec ruled the land, they wrote of the necessity to “specify the style sheet language of style information associated with an HTML document.” And similarly, everyone should “specify the default scripting language for all scripts in a document.”

Today though, none of this is necessary. You can write <script> and all modern browsers know you mean JavaScript.

You don’t need content-style-type or content-script-type in 2023.

x-frame-options

The thirtieth most used http-equiv keyword is x-frame-options, used by 7k sites. A site might set this keyword to prevent its page from being surreptitiously embedded in a malicious page, for example, for clickjacking purposes.

This one is obsoleted by the frame-ancestors directive of Content Security Policy (CSP), which provides much more flexibility for security controls. The CSP directive is supported by all modern browsers, so there’s really no reason to continue using x-frame-options.

Use CSP instead of x-frame-options.
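For reference, the equivalent CSP policy is sent as a response header. A sketch that mirrors x-frame-options: sameorigin would be:

```http
Content-Security-Policy: frame-ancestors 'self'
```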

content-type

The next most popular keyword is content-type, which, not to be confused with the script- and style-specific keywords above, is an alias for the charset meta tag.

content-type is used to declare the document’s character encoding and it’s used by about 4.6 million websites.

Note that the spec requires that pages must not contain both a http-equiv=content-type meta tag and a charset meta tag.

<meta
    http-equiv="content-type"
    content="text/html; charset=utf-8">

<meta
    charset="utf-8">

In other words, these two elements do exactly the same thing, but it’s invalid to have both of them.

Content-Type: text/html; charset=utf-8

Don’t forget about the Content-Type HTTP header, which is yet another way to declare the character encoding of the document (among other things). The spec is unclear if it’s also invalid to have both the header and the http-equiv versions, but I’m guessing it’s discouraged.

And there’s one other idiosyncrasy of the charset and content-type meta tags, which is that they must be included in the first 1024 bytes of the document. For example, this is one of the reasons why capo.js assigns the highest possible weight to charset meta tags, otherwise a late-discovered character encoding could screw up the parsing, causing the browser to have to start over.

Content-Type  charset  http-equiv  Sites       Valid
✔             ✔        ✔           801,404     ❌
✔             ✔                    11,382,946  ❓
✔                      ✔           3,115,266   ❓
✔                                  727,032     ✅
              ✔        ✔           98,656      ❌
              ✔                    1,228,396   ✅
                       ✔           731,908     ✅
                                   223,932     ❌

Totals: Content-Type 16,026,648; charset 13,511,402; http-equiv 4,747,234; all sites 18,309,540.
All of the combinations of ways to declare a character encoding.
(HTTP Archive, June 2023. View query.)

A few things surprised me about these results:

  • Despite so many pages using the http-equiv=content-type keyword, it’s a fraction of the usage that its alternatives get: Content-Type is 3.4x more popular, and charset is 2.8x more popular.
  • Most pages (79%) are in this ambiguous validity zone denoted by the ❓, having both a Content-Type header and either of the meta character encoding declarations.
  • 1 in 20 pages are clearly invalid and redundantly declare both the charset and http-equiv, or nothing at all.

Given the popularity of the alternatives, and the validation risks associated with the http-equiv approach—again, I have to ask—what’s the point? For the relatively few sites (732k or 4%) that wouldn’t have otherwise declared a content encoding, they’d be better off going with the alternatives.

The Content-Type header avoids the requirement for meta tags to be in the first 1024 bytes, and the charset declaration is much more concise than http-equiv. There’s no real advantage to using the content-type keyword.

All HTML pages should set a character encoding. Prefer the Content-Type HTTP header first, otherwise use the charset meta tag in the first 1024 bytes.
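If you can’t set the Content-Type header, a minimal sketch of the fallback looks like this, with the charset meta tag at the very top of the head, safely within the first 1024 bytes:

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Example page</title>
  </head>
  <body></body>
</html>
```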

origin-trial

The third most popular value is used by 4.1 million pages and it’s technically non-standard according to the spec: origin-trial. I recently wrote a post called Origin trials and tribulations, which explores how they’re—often incorrectly—used to enable experimental web platform features. I recommend checking that out for a closer look at the stats behind individual origin trials.

The way they usually work is for a developer to sign up to use a particular experimental feature directly with a browser like Chrome, Edge, or Firefox. The browser gives the developer a token, which they serve from their website as a way to instruct the browser to enable that feature. The token can be served in one of two ways:

  1. As an HTTP header:
    Origin-Trial: [token]
  2. As a meta tag:
    <meta
      http-equiv="origin-trial"
      content="[token]">

Firefox began supporting origin trials in early 2022. Until then, only Chromium browsers had supported it. And for that reason, the WHATWG hadn’t considered adding it to the HTML spec. However, while researching this post and learning about the status of the origin-trial value, I left a comment on the spec issue recommending that they reconsider it. Now, having the support of at least two implementers, it meets the criteria and it sounds like they’re open to adding it.

Standardizing origin-trial is a good idea. Unlike the previous http-equiv keywords that we’ve looked at so far, in some cases it’s actually necessary to declare the origin trial token in the markup as opposed to the HTTP header equivalent. Third party origin trials can only be declared by dynamically injecting the meta tag into the main document. And, as discussed in my last blog post, third parties are responsible for 99% of origin trial usage.

Use either the Origin-Trial HTTP header or the origin-trial meta tag, whichever is more convenient. You must use the meta tag if you’re injecting a third party token into a page.
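For the third party case, a sketch of how a script might inject its token at runtime (the token value is a placeholder):

```html
<script>
  // Inject an origin trial token for this script's origin.
  const otMeta = document.createElement('meta');
  otMeta.httpEquiv = 'origin-trial';
  otMeta.content = 'TOKEN'; // placeholder for the token issued by the browser vendor
  document.head.appendChild(otMeta);
</script>
```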

Cache headers

The http-equiv keywords ranked 5, 6, 10, 11, and 21 are in a category of cache control headers.

Rank  Keyword        Sites    Standard
5     cache-control  441,570  ❌
6     etag           438,906  ❌
10    pragma         390,770  ❌
11    expires        387,003  ❌
21    last-modified  19,515   ❌
Cache control headers used in http-equiv.
(HTTP Archive, June 2023)

I’m going to skip over describing what each header does. See Prevent unnecessary network requests with the HTTP Cache for a broader overview of caching headers and strategies.

Recall that Fielding specifically called out last-modified as an example of what not to do. Servers are better able to tell when a file was last modified or what the current time is, and developers should definitely not be hard-coding those things in their HTML.

We’ve already established that the original intent of http-equiv was for servers to read the HTML and respond with the corresponding headers, so what’s wrong with declaring something like a cache-control policy in the HTML? Other than being non-standard, I’m not even sure if servers actually behave like that or if they ever did. It wouldn’t be a great idea to have to parse the HTML on the server, due to the complexity and performance. Also, if there are proxies between the origin server and the client, they too would need to support that behavior.

We’ve just seen an example of a keyword that is non-standard, yet browsers support it anyway: origin-trial. So maybe browsers support caching headers too? Spoiler: they don’t.

Amusingly, even the all-knowing AI gets this wrong.

None of these are standard http-equiv keywords and as far as I know all modern browsers ignore them.

Use HTTP headers for cache directives, not http-equiv.
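For reference, here’s a sketch of what those directives look like as proper response headers (the values are illustrative, not recommendations):

```http
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Cache-Control: max-age=3600, must-revalidate
ETag: "33a64df5"
```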

Wix metadata

The keywords ranked 7–9 are all prefixed with x-wix. And given that all three keywords are found on about 439k sites within a margin of ±1 site, I think it’s a safe bet that these are all set by the Wix CMS.

Rank  Keyword                        Sites
7     x-wix-published-version        438,666
8     x-wix-application-instance-id  438,665
9     x-wix-meta-site-id             438,665
Wix headers used in http-equiv.
(HTTP Archive, June 2023)

Just to be sure, the number of Wix websites in the dataset is very close, at 445,164 sites (view query), and the HTML on the Wix blog confirms this theory:

<meta
  http-equiv="X-Wix-Meta-Site-Id"
  content="058b1f4f-09cf-426f-bb00-cec48b9da4b0">
<meta
  http-equiv="X-Wix-Application-Instance-Id"
  content="7dbdad6e-27ef-44ac-8270-48f414db3dc8">
<meta
  http-equiv="X-Wix-Published-Version"
  content="3834"/>
<meta
  http-equiv="etag"
  content="bug"/>
<meta
  http-equiv="X-UA-Compatible"
  content="IE=edge">

As an added bonus, it even looks like Wix is responsible for 99.9% of the high etag usage and some (6.8%) of the x-ua-compatible usage.

So, is this “valid” HTML?

No, definitely not.

Is there any harm to it? Well, the default behavior for a browser that doesn’t recognize an http-equiv value is to ignore it, so, no, these are harmless.

But isn’t there a more semantic way to set metadata like these? Yes! It’s the <meta name=generator> tag. Wix pages already have one of these, and it looks like this:

<meta
  name="generator"
  content="Wix.com Website Builder">

There’s nothing wrong with having multiple generator tags so it’d be more appropriate for Wix to use those instead of http-equiv. That said, it’s needless work and there’s no real benefit to making this switch other than technical correctness, but hey that counts for something!

Use generator meta tags for page metadata, not http-equiv.

content-language

The eighth most popular http-equiv keyword is content-language, found on 487k sites. The HTML spec considers this keyword to be non-conforming and it recommends using the lang attribute instead.

The important thing to know about the lang attribute is that it can influence the UI of the page: fonts, pronunciation, dictionaries, date pickers, etc. For accessibility, the WCAG explicitly requires that all pages declare the document language, primarily using the lang attribute. Setting the lang attribute is one of those core web dev best practices, and you’ll find it baked into projects like HTML5 Boilerplate and create-react-app.

Curiously, the spec also includes this note:

This pragma is almost, but not quite, entirely unlike the HTTP Content-Language header of the same name.

As best I can tell, the semantic distinction between these two values is that the Content-Language header indicates what language the reader is expected to speak, while the content-language pragma indicates what language the page is written in.

Clients can also use the Accept-Language header to politely ask for the content in their preferred language(s). The server will ideally take the client’s preferences into consideration when serving the response, as indicated by the Content-Language header.

So to boil it down:

  • If your web page is meant to be read by everyone, you should omit the Content-Language header.
  • If you serve multiple translations of the same resource, you should serve it with the appropriate Content-Language header based on the client’s Accept-Language preference.
  • You should always set <html lang> to whatever language the document is written in. For example, this page is set to <html lang="en-US">.

The specification for the lang attribute describes the order of precedence for all three of these directives on a given HTML node:

  1. The lang attribute on the nearest ancestor
  2. The value set by the http-equiv meta tag
  3. The value set by HTTP headers

The spec doesn’t explicitly say that the language falls back to the Content-Language header, just a “higher-level protocol” like HTTP. I’m going to assume that means the Content-Language header as that’s really the entire point of the http-equiv attribute.

We know how frequently the http-equiv keyword is used, so let’s compare that with all of the other ways to set the content language:

Content-Language  lang  http-equiv  Sites       Valid
✔                 ✔     ✔           20,730      ❓
✔                 ✔                 1,573,439   ✅
✔                       ✔           7,009       ❌
✔                                   107,572     ❌
                  ✔     ✔           300,890     ❓
                  ✔                 12,913,992  ✅
                        ✔           172,239     ❌
                                    2,910,087   ❌

Totals: Content-Language 1,708,750; lang 14,809,051; http-equiv 500,868; all sites 18,005,958.
All of the combinations of ways to declare the content language.
(HTTP Archive, June 2023. View query.)

Most sites (14.8 million) opt to declare the content language by setting the lang attribute. And the most popular combination of the three is the lang attribute alone (12.9 million).

The second most popular combination of directives is to set no directives at all, as seen on 2.9 million sites. While it’s fine to omit the Content-Language header, all documents should set a lang attribute at a minimum. So this case is clearly invalid, but unlike the content-type keyword, the spec isn’t as clear if it’s invalid for both lang and content-language to be set on the same document.

Don’t use http-equiv=content-language. The spec recommends using the lang attribute instead. If you need to support a resource in multiple languages, negotiate using the Accept-Language request header and the Content-Language response header to serve it in the client’s preferred language.
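A sketch of that negotiation, with illustrative values:

```http
GET /article HTTP/1.1
Accept-Language: fr-CH, fr;q=0.9, en;q=0.8

HTTP/1.1 200 OK
Content-Language: fr
```

The served document would then also declare its own language, e.g. <html lang="fr">.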

Client hints

The keywords ranked 12 and 72 are accept-ch and delegate-ch. They’re part of the Client Hints API available in Chromium-based browsers. You can learn more about the performance and privacy benefits of client hints.

This API allows browsers to provide servers with specific, opt-in information about the client, such as device capabilities or network conditions, which servers can use to adapt their content delivery. Headers like accept-ch and delegate-ch control the negotiation and delegation of these hints, respectively.

Rank  Keyword      Sites
12    accept-ch    242,017
72    delegate-ch  492
Client hints headers used in http-equiv.
(HTTP Archive, June 2023)

While these http-equiv keywords are not standard as far as the HTML spec is concerned, the Chromium engine alone does support them.

As I was browsing through the list of sites that use accept-ch, I noticed that a lot of them are hosted on the Squarespace domain. Given that this keyword depends entirely on server support, it makes sense that CMS hosts would be among its biggest power users. It turns out that sites hosted by Squarespace account for 97% of accept-ch usage, specifically its http-equiv adoption.

So how are these (mostly Squarespace) sites actually using it? What policy directives are they promoting to clients?

Directive                            Type        Sites    Valid
sec-ch-ua-platform-version           user agent  235,454  ✅
sec-ch-ua-model                      user agent  235,442  ✅
dpr                                  device      6,073    ✅
width                                device      5,959    ✅
viewport-width                       device      5,724    ✅
device-memory                        device      511      ✅
downlink                             network     417      ✅
save-data                            network     295      ✅
ect                                  network     289      ✅
rtt                                  network     228      ✅
sec-ch-ua-platform                   user agent  139      ✅
sec-ch-ua                            user agent  120      ✅
sec-ch-ua-mobile                     user agent  111      ✅
sec-ch-ua-full-version-list          user agent  107      ✅
sec-ch-ua-arch                       user agent  15       ✅
sec-ch-ua-bitness                    user agent  10       ✅
sec-ch-ua-full-version               user agent  9        ✅
sec-ch-prefers-color-scheme          media       4        ✅
sec-ch-prefers-contrast              media       1        ✅
sec-ch-prefers-reduced-motion        media       1        ✅
sec-ch-prefers-reduced-transparency  media       1        ✅
sec-ch-forced-colors                 media       1        ✅
sec-ch-prefers-reduced-data          media       0        ✅
Popularity of all valid accept-ch directives.
(HTTP Archive, June 2023. View query.)

Note that “valid” is used loosely here to mean that they’re accepted by browsers that support client hints. They’re not necessarily supported from the HTML spec’s point of view. Also, keep in mind that these stats don’t take HTTP header adoption into account—these are only the sites that set it in http-equiv specifically. Feel free to remix the query if you’re interested in HTTP header adoption.

It seems like the 97% of Squarespace accept-ch usage is for two things: sec-ch-ua-platform-version and sec-ch-ua-model. These are part of the User-Agent Client Hints expansion pack that provide more secure access to the UA info. The rest of the UA directives are used much less frequently.

The second most popular pack of directives is the Device Client Hints, including: dpr, width, viewport-width, and device-memory. The first three are used on about 6k sites while device-memory is used almost as much as the third most popular pack, Network Client Hints. These include downlink, save-data, ect, and rtt.

The usage patterns of device and network hints kind of make sense if you think about the way these would be used. The three most popular device hints are all the visual ones that would be useful for responsive design. The network hints plus device-memory are useful for performance, either as a diagnostic for analytics or as a predicate for serving lighter-weight content.

The least popular category is the user preference media features. The most popular directive in this category is sec-ch-prefers-color-scheme, which gives websites the ability to serve pages in dark mode if that’s what the user prefers. I think it’s cool that this doesn’t have to be “figured out” by CSS on the client side, but I’m interested to see some real-world examples of it to understand how much more performance or simplicity it provides.

So now that we’ve seen how the http-equiv keywords are used, are they even needed at all? The purpose of client hints is for the client to give the server information that it can use to deliver a better experience. For that to work, the server would need to process the client’s request header detailing its hints and serve the alternate version of the page. So it seems to me that any server capable of handling client hints ought to be capable of setting these as response headers as opposed to http-equiv meta tags.

Maybe there are some valid reasons to dynamically inject these meta tags into the page? Well, as a security restriction, the WICG draft specifically calls out delegate-ch as being ineffective when injected by JavaScript. It’s possible there are use cases for injecting accept-ch but none come to mind.

If your server can handle client hints, it can declare its support using HTTP headers rather than relying on http-equiv.
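For example, a server that supports client hints might advertise them in its response headers like this. (The specific directives listed here are just an illustration; advertise only the hints your server actually handles.)

```http
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Accept-CH: Sec-CH-UA-Platform-Version, Sec-CH-UA-Model, DPR
```

Since the server has to process the hint request headers anyway, the response header is the more natural home for this declaration than a meta tag.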

x-dns-prefetch-control

The fourteenth most popular http-equiv keyword is x-dns-prefetch-control, which is not yet standardized. It’s used by about 200k sites and it serves one and only one useful purpose: turning off the browser’s default behavior of speculatively prefetching the DNS records of URLs that the client is likely to need soon.

This default behavior is very good for performance, especially for mobile users. It’s similar to the developer-controlled way of doing DNS prefetching, using a link tag in the document head:

<link
  rel="dns-prefetch"
  href="https://www.example.com">

The key difference is that the browser will perform additional prefetches automatically to further improve the user’s loading performance.

There might be privacy concerns with prefetching domains that the user hasn’t actually indicated an intention to visit yet. And there might also be some performance concerns with resolving domain names that never get visited, for example maybe for a site that includes lots of links that users rarely follow. If a site owner chooses, they can disable this behavior by setting x-dns-prefetch-control to off.

It doesn’t make sense to set it to anything else, because the default behavior is for it to be on. Can you guess what the data says?

Value | Sites | Valid
on | 199,095 | ✅
off | 1,688 | ✅
[null] | 736 | ❌
TRUE | 2 | ❌
[empty string] | 1 | ❌
yes | 1 | ❌
text/html;charset=utf-8 | 1 | ❌
no | 1 | ❌
ie=edge,chrome=1 | 1 | ❌
Usage of x-dns-prefetch-control.
(HTTP Archive, June 2023. View query.)

Note that “valid” is used loosely here to mean that the values are accepted by browsers that support x-dns-prefetch-control.

99% of sites that set this keyword are doing so unnecessarily by setting the value to on. Again, that’s the default!

1% of sites (1,688) are actually disabling DNS prefetching.

The rest of the sites are passing garbage values.

Do sites need this keyword? The vast majority of them certainly don’t. Even the few that do intentionally disable the feature could presumably set the corresponding HTTP response header instead.

Only use x-dns-prefetch-control if you have security or performance concerns with built-in DNS prefetching, in which case be sure to set the value to off.
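For reference, opting out looks like this; the same switch can also be set via the (equally non-standard) X-DNS-Prefetch-Control HTTP response header:

```html
<meta
  http-equiv="x-dns-prefetch-control"
  content="off">
```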

content-security-policy

Skipping to number 17, the content-security-policy (CSP) meta tag is found on 101k sites. CSP helps to lock down all of the ways external content can be added to a page, closing off vectors that could otherwise leave it vulnerable to attacks like cross-site scripting.

CSP is great and everyone should use a policy that works for them. The CSP meta tag is standard and conforming, but I would strongly discourage anyone from using it.

Meme showing two boxes on a tall shelf next to each other: "needles" and "poison tipped needles".

On a good day, placing a script tag before a meta CSP would disable Chrome’s preload scanner. Alas, today is not a good day. While researching how all of this works, I discovered a bug in Chrome that disables the preload scanner for all meta CSPs everywhere.

Even after the bug gets fixed, why risk it? The convenience of a meta tag is not worth the liability of losing one of the best performance optimizations that you can get for free. The Content-Security-Policy HTTP header is a much safer option that behaves exactly the same way, without the performance risks.

Don’t use content-security-policy. Use the HTTP header instead.
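For example, a minimal policy set as a response header might look like this. (The directives and CDN origin here are purely illustrative; craft a policy that fits your site.)

```http
Content-Security-Policy: default-src 'self'; script-src 'self' https://cdn.example.com
```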

content-security-policy-report-only

I know this is a top 100 list, but I had to tack number 137 on the end just to show how often this related value is used. Not only is the content-security-policy-report-only value non-standard, but Chromium browsers will actually warn you if you try to use it. I guess that’s been helpful to drive down adoption; it’s set on only 114 websites.

This is a nice reminder that http-equiv is not for arbitrary HTTP headers, despite what you might think. Even though the CSP header is a perfectly standard value for http-equiv, its friend the CSP Report-Only header is unsupported. Go figure!

The CSP report-to directive should be used with this header, otherwise this header will be an expensive no-op machine.

MDN

After taking a closer look at how sites are trying (and failing) to use this header, it looks like in every single case they mistook it for the CSP header. There’s one thing that distinguishes this from the CSP header, and that’s the report-to directive. When it’s set properly as an HTTP header, the Reporting API will check resources against the enclosed policy and report violations to a given URL. “Otherwise,” as MDN brilliantly puts it, “this header will be an expensive no-op machine.” All 114 of the sites that use the report-only meta tag omit the necessary report-to directive.

Don’t use content-security-policy-report-only. Use the HTTP header instead.
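If you do want violation reports, a sketch of the header-based setup looks like this (the endpoint name and URL are hypothetical):

```http
Reporting-Endpoints: csp-endpoint="https://example.com/csp-reports"
Content-Security-Policy-Report-Only: default-src 'self'; report-to csp-endpoint
```

The report-to directive points at an endpoint named in the Reporting-Endpoints header; the older report-uri directive is the deprecated predecessor to this setup.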

refresh

The next value on the list is refresh, found on only 28k sites. Get this—it refreshes the page 🤯

An example of this that you might be familiar with is the WebPageTest loading screen as a page is being tested:

<noscript>
  <meta
    http-equiv="refresh"
    content="30">
</noscript>

The noscript wrapper means that users with JavaScript enabled get a less disruptive progress update UX. For users with JavaScript disabled, this ancient directive is the only other way I can think of to get the page to automatically refresh on an interval.

And that’s the big downside to refresh: it’s disruptive. So much so that its redirection powers (refreshing to a different URL, technically) are discouraged by accessibility groups.
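For reference, the redirection flavor looks like this (the timeout and URL are illustrative), and it’s exactly the pattern accessibility guidelines advise against:

```html
<meta
  http-equiv="refresh"
  content="5; url=https://example.com/new-page/">
```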

Here are some fun facts about how refresh is used in the wild:

  • 31% of sites that use refresh are using it for redirection.
  • The most popular redirect timeout is 5 seconds, used by 18% of sites that redirect.
  • The most popular refresh timeout is 1800 seconds (30 minutes), used by 13% of sites that refresh.
  • 4% of sites that use refresh don’t set a valid timeout value at all.
  • The largest timeout is 30,000,000,000,000,000,000,000,000,000,000 seconds.

    To put that in perspective: you could visit this website and buy a Powerball ticket for every second you wait. And every time you hit the jackpot, you fill out a March Madness bracket. By the time the page refreshes, you will have correctly predicted 11,131 brackets.

There are much better alternatives for redirection than using the refresh directive, like the HTTP 3xx status codes, as recommended by the Google Search docs and WCAG. And unless you really need to fall back to the primitive behavior like in WebPageTest’s case, an asynchronous JavaScript solution would be much less disruptive.
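As a sketch of the less disruptive approach, a page could poll for updates and patch the DOM in place rather than reloading. The endpoint and element here are hypothetical:

```html
<script>
  // Poll a status endpoint every 5 seconds and update the page
  // in place, instead of triggering a full meta refresh.
  setInterval(async () => {
    const response = await fetch('/status.json');
    const { message } = await response.json();
    document.querySelector('#status').textContent = message;
  }, 5000);
</script>
```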

Only use refresh when you really need to reload the current page and there are no other, less disruptive options available.

default-style

Skipping ahead again, we have default-style at number 50, found on just 1k sites.

According to the CSSOM spec:

preferred CSS style sheet set name is a concept to determine which CSS style sheets need to have their disabled flag unset. Initially its value is the empty string.

How does a stylesheet get disabled? That’s a side effect of alternate stylesheets. If you set the rel attribute of your link tag to "alternate stylesheet", it’s disabled by default.

So how does a stylesheet get re-enabled? default-style, for one! You can kind of think of it like the way label and input elements relate to each other by way of the for/id attributes:

<label for="name">
  What's your name?
</label>
<input id="name">

The way to indicate a preferred stylesheet is to set the meta tag’s content attribute to the value of the stylesheet’s title attribute:

<meta
  http-equiv="default-style"
  content="green">
<link
  rel="alternate stylesheet"
  title="green"
  href="green.css">
<link
  rel="alternate stylesheet"
  title="red"
  href="red.css">

If you’d like to try it out for yourself, I built a little demo.

As you can see, this keyword only works so long as the content attribute refers to a valid title attribute value. How often do you reckon that happens?

Value | Pages | Valid
text/css | 176 | ❌
au normal contrast | 109 | ✅
styles_portal | 54 | ✅
ie=edge | 53 | ❌
default | 36 | ✅
text/javascript | 26 | ❌
main_style | 23 | ✅
text/html;charset=utf-8 | 14 | ❌
style.css | 13 | ❌
toppage | 8 | ✅
Top 10 preferred stylesheets.
(HTTP Archive, June 2023. View query.)

By my count, 31% of the values set in the default-style tag are invalid. For example, take the most popular value: text/css. That’s a perfectly valid CSS content type, but I highly doubt someone set the title attribute of their stylesheet to it.

The next one that jumps out at me is ie=edge. Look familiar? That’s the top value of the x-ua-compatible pragma. Not valid here.

<meta
    http-equiv="default-style"
    content="the document's preferred stylesheet">

Two sites took the spec a bit too literally 🤣

Update: this code seems to have been lifted directly from W3Schools!

There are a couple of other content types on the list: text/javascript and text/html;charset=utf-8. Why would anyone set the title attribute of a stylesheet to the MIME types for JavaScript and HTML? Nope, not valid either.

The last one that caught my eye is style.css. It seems they mistakenly set the value to the href of the stylesheet, not the title. So close. The intent was there, but not valid.

This might be unique to Chrome’s implementation, but when I test this out, I see a flash of unstyled content. The page renders with the default styles (black text), then the preferred styles (red text) kick in shortly after. It’s not a great user experience.

This is a tiny demo, so I can only imagine the flash being even more jarring on sites that use this for real.

So, is it worth it? I don’t think so. I just don’t see enough value added by this feature that you couldn’t get with modern CSS anyway. For example, MDN recommends using @media features instead.

Use modern CSS instead of default-style for alternative stylesheets.

set-cookie

The last value of interest in this list is set-cookie. It’s standard—it’s in the spec—but it’s non-conforming. A mere 317 sites use it.

The spec doesn’t have too much to say about it:

This pragma is non-conforming and has no effect.

User agents are required to ignore this pragma.

That’s it.

If you need to set cookies, only use the HTTP response header.
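For example (the cookie name, value, and attributes are illustrative):

```http
HTTP/1.1 200 OK
Set-Cookie: session_id=abc123; Path=/; Secure; HttpOnly; SameSite=Lax
```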

Chromium behavior

Šime Vidas responded to a related tweet of mine about standard http-equiv keywords, asking what non-standard keywords browsers do support.

To answer Šime’s question, I had to trawl through the Chromium source. Here are the implemented keywords that I found:

Keyword | Supported | Standard
default-style | ✅ | ✅
refresh | ✅ | ✅
set-cookie | ❌ | ✅
content-language | ✅ | ✅
x-dns-prefetch-control | ✅ | ❌
x-frame-options | ❌ | ❌
accept-ch | ✅ | ❌
delegate-ch | ✅ | ❌
content-security-policy | ✅ | ✅
content-security-policy-report-only | ❌ | ❌
origin-trial | ✅ | ❌
content-type (source) | ✅ | ✅
http-equiv keywords implemented by Chromium browsers and their levels of support and standardization. (Source)

So the supported, non-standard keywords are: x-dns-prefetch-control, accept-ch, delegate-ch, and origin-trial.

It’s interesting to see that some keywords are implemented, but only to warn developers when found:

  • set-cookie triggers an error
  • x-frame-options triggers an error
  • content-security-policy-report-only logs a friendlier message

Chromium is not the only engine, and other browsers may handle http-equiv keywords differently. If you’d like to contribute info about keyword support in other browsers, please reach out in the comments, and I’d be happy to include it here.

Cheatsheet

If you take away one thing from this post, have it be this cheatsheet with my condensed recommendations for each keyword. You can refer to this list if you’re ever unsure whether you need a given http-equiv meta tag.

Keyword | Recommendation
accept-ch | ❌ Use the Accept-CH HTTP header instead
cache-control | ❌ Use the Cache-Control HTTP header instead
cleartype | ❌ You don’t need it
content-language | ❌ Use the lang attribute instead
content-security-policy | ❌ Use the Content-Security-Policy HTTP header instead
content-security-policy-report-only | ❌ Use the Content-Security-Policy-Report-Only HTTP header instead
content-script-type | ❌ You don’t need it
content-style-type | ❌ You don’t need it
content-type | ❌ Use the Content-Type HTTP header instead, or the charset meta tag in the first 1024 bytes
default-style | ❌ Use modern CSS instead
delegate-ch | ❌ Use the Delegate-CH HTTP header instead
etag | ❌ Use the ETag HTTP header instead
expires | ❌ Use the Expires HTTP header instead
imagetoolbar | ❌ You don’t need it
last-modified | ❌ Use the Last-Modified HTTP header instead
msthemecompatible | ❌ You don’t need it
origin-trial | ✅ Prefer the HTTP header if you can, otherwise the meta tag is fine
page-enter | ❌ You don’t need it
pragma | ❌ Use the Cache-Control HTTP header instead
refresh | ❌ Use HTTP 3xx for redirects; ✅ Use it for reloads as a noscript fallback
set-cookie | ❌ Use the Set-Cookie HTTP header instead
x-dns-prefetch-control | ✅ Use it if you have legitimate security or performance concerns
x-frame-options | ❌ Use the Content-Security-Policy HTTP header instead
x-ua-compatible | ❌ You don’t need it
x-wix-application-instance-id | ❌ Use generator meta tags instead
x-wix-meta-site-id | ❌ Use generator meta tags instead
x-wix-published-version | ❌ Use generator meta tags instead
Cheatsheet of all the http-equiv keywords explored in this post and my recommended actions.

If you don’t see the keyword you’re looking for in this list, chances are you’re not gonna need it. You’re almost always better off setting the HTTP header directly where possible. But just to be sure, test it out in a modern browser. You can also check the HTML spec—it’s a rapidly evolving living standard—or your favorite web developer documentation site for more info.

Based on all of my reading of the spec, analysis of the data, and interpretation of the Chromium source code, it’s clear to me that there’s a lot of unnecessary usage of the http-equiv meta tag. I hope you’re convinced that you probably don’t need most of these tags anymore, and you can use this new knowledge to write cleaner, more modern HTML.

Please reach out to me in the comments if there’s anything in this post that I can improve. I’m eager to continue building my understanding of how this all works and I’d be happy to update this post accordingly.

Appendix: Methodology

For all queries, I used the June 2023 crawl of the public HTTP Archive dataset. The queries do not distinguish between client type or root/secondary pages. For example, if http-equiv is used only on a site’s mobile secondary page, I count that site as using http-equiv. If a site uses it on all four combinations of desktop/mobile and root/secondary pages, the site is counted once towards the overall stats.

Popular website builders like the WordPress CMS make up about a third of the dataset and have a disproportionate effect on the stats. This is ok, as I’m trying to measure adoption across the whole web, regardless of whether the site owner added the tags themselves or their CMS did it.

Warning: these queries process between 6 and 14 TB each. Run at your own expense.

Querying http-equiv adoption

Show query
WITH meta AS (
  SELECT
    root_page,
    LOWER(JSON_VALUE(meta, '$.http-equiv')) AS http_equiv
  FROM
    `httparchive.all.pages`
  LEFT JOIN
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.almanac.meta-nodes.nodes')) AS meta
  WHERE
    date = '2023-06-01'
)
SELECT
  COUNT(DISTINCT IF(http_equiv IS NOT NULL, root_page, NULL)) AS http_equiv,
  COUNT(DISTINCT root_page) AS total,
  COUNT(DISTINCT IF(http_equiv IS NOT NULL, root_page, NULL)) / COUNT(DISTINCT root_page) AS pct
FROM
  meta

Querying the top 100 http-equiv values

Show query
CREATE TEMP FUNCTION IS_VALID(value STRING) RETURNS BOOL AS (
  value IN (
    'content-language',
    'content-type',
    'default-style',
    'refresh',
    'set-cookie',
    'x-ua-compatible',
    'content-security-policy'
  )
);
CREATE TEMP FUNCTION IS_CONFORMING(value STRING) RETURNS BOOL AS (
  value IN (
    'content-type',
    'default-style',
    'refresh',
    'x-ua-compatible',
    'content-security-policy'
  )
);
WITH meta AS (
  SELECT
    root_page,
    LOWER(JSON_VALUE(meta, '$.http-equiv')) AS http_equiv
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.almanac.meta-nodes.nodes')) AS meta
  WHERE
    date = '2023-06-01'
)
SELECT
  ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT root_page) DESC) AS rank,
  http_equiv AS value,
  COUNT(DISTINCT root_page) AS sites,
  IF(IS_VALID(http_equiv), '✅', '❌') AS valid,
  IF(IS_CONFORMING(http_equiv), '✅', '❌') AS conforming
FROM
  meta
WHERE
  http_equiv IS NOT NULL
GROUP BY
  http_equiv
ORDER BY
  sites DESC
LIMIT
  100

Querying x-ua-compatible usage

Show query
WITH meta AS (
  SELECT
    root_page,
    LOWER(JSON_VALUE(meta, '$.http-equiv')) AS http_equiv,
    REGEXP_REPLACE(LOWER(JSON_VALUE(meta, '$.content')), r'\s', '') AS content
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.almanac.meta-nodes.nodes')) AS meta
  WHERE
    date = '2023-06-01'
)
SELECT
  content,
  COUNT(DISTINCT root_page) AS sites
FROM
  meta
WHERE
  http_equiv = 'x-ua-compatible'
GROUP BY
  content
ORDER BY
  sites DESC

Querying content-type usage

Show query
CREATE TEMP FUNCTION IS_VALID(header STRING, charset STRING, http_equiv STRING) RETURNS STRING AS (
  CASE
    WHEN charset IS NOT NULL AND http_equiv IS NOT NULL THEN '❌'
    WHEN header IS NULL AND charset IS NULL AND http_equiv IS NULL THEN '❌'
    WHEN header IS NOT NULL AND charset IS NULL and http_equiv IS NULL THEN '✅'
    WHEN header IS NULL AND charset IS NOT NULL and http_equiv IS NULL THEN '✅'
    WHEN header IS NULL AND charset IS NULL and http_equiv IS NOT NULL THEN '✅'
    ELSE '❓'
  END
);
WITH all_sites AS (
  SELECT
    rank,
    root_page,
    page
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-06-01'
),
meta_charset AS (
  SELECT
    page,
    LOWER(JSON_VALUE(meta, '$.charset')) AS charset
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.almanac.meta-nodes.nodes')) AS meta
  WHERE
    date = '2023-06-01' AND
    LOWER(JSON_VALUE(meta, '$.charset')) IS NOT NULL
),
meta_content_type AS (
  SELECT
    page,
    LOWER(JSON_VALUE(meta, '$.content')) AS content_type
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.almanac.meta-nodes.nodes')) AS meta
  WHERE
    date = '2023-06-01' AND
    LOWER(JSON_VALUE(meta, '$.http-equiv')) = 'content-type' AND
    LOWER(JSON_VALUE(meta, '$.content')) IS NOT NULL
),
header AS (
  SELECT
    page,
    LOWER(REGEXP_EXTRACT(header.value, r'(?i)charset=([^;\s]*)')) AS http_content_type
  FROM
    `httparchive.all.requests`,
    UNNEST(response_headers) AS header
  WHERE
    date = '2023-06-01' AND
    is_main_document AND
    LOWER(header.name) = 'content-type' AND
    REGEXP_CONTAINS(header.value, r'(?i)charset=([^;\s]*)')
)
SELECT
  IF(http_content_type IS NOT NULL, '✔', '') AS has_http_header,
  IF(charset IS NOT NULL, '✔', '') AS has_meta_charset,
  IF(content_type IS NOT NULL, '✔', '') AS has_http_equiv,
  COUNT(DISTINCT root_page) AS sites,
  IS_VALID(http_content_type, charset, content_type) AS valid
FROM
  all_sites
LEFT JOIN
  meta_charset
USING
  (page)
FULL OUTER JOIN
  meta_content_type
USING
  (page)
FULL OUTER JOIN
  header
USING
  (page)
GROUP BY
  has_http_header,
  has_meta_charset,
  has_http_equiv,
  valid
ORDER BY
  has_http_header DESC,
  has_meta_charset DESC,
  has_http_equiv DESC

Querying Wix adoption

Show query
SELECT
  COUNT(DISTINCT root_page) AS wix_sites
FROM
  `httparchive.all.pages`,
  UNNEST(technologies) AS t
WHERE
  date = '2023-06-01' AND
  t.technology = 'Wix'

Querying content-language usage

Show query
CREATE TEMP FUNCTION IS_VALID(header STRING, lang STRING, http_equiv STRING) RETURNS STRING AS (
  CASE
    WHEN lang IS NOT NULL AND http_equiv IS NULL THEN '✅'
    WHEN lang IS NULL THEN '❌'
    ELSE '❓'
  END
);
WITH all_sites AS (
  SELECT
    root_page,
    page
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-06-01'
),
html_lang AS (
  SELECT
    page,
    LOWER(REGEXP_EXTRACT(response_body, r'(?i)<html[^>]*lang=[\'"]?([^\s\'"])')) AS lang
  FROM
    `httparchive.all.requests`
  WHERE
    date = '2023-06-01' AND
    is_main_document AND
    REGEXP_CONTAINS(response_body, r'(?i)<html[^>]*lang=[\'"]?([^\s\'"])')
),
meta_content_language AS (
  SELECT
    page,
    LOWER(JSON_VALUE(meta, '$.content')) AS content_language
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.almanac.meta-nodes.nodes')) AS meta
  WHERE
    date = '2023-06-01' AND
    LOWER(JSON_VALUE(meta, '$.http-equiv')) = 'content-language' AND
    LOWER(JSON_VALUE(meta, '$.content')) IS NOT NULL
),
header AS (
  SELECT
    page,
    LOWER(header.value) AS http_content_language
  FROM
    `httparchive.all.requests`,
    UNNEST(response_headers) AS header
  WHERE
    date = '2023-06-01' AND
    is_main_document AND
    LOWER(header.name) = 'content-language'
)
SELECT
  IF(http_content_language IS NOT NULL, '✔', '') AS has_http_header,
  IF(lang IS NOT NULL, '✔', '') AS has_html_lang,
  IF(content_language IS NOT NULL, '✔', '') AS has_http_equiv,
  COUNT(DISTINCT root_page) AS sites,
  IS_VALID(http_content_language, lang, content_language) AS valid
FROM
  all_sites
LEFT JOIN
  html_lang
USING
  (page)
FULL OUTER JOIN
  meta_content_language
USING
  (page)
FULL OUTER JOIN
  header
USING
  (page)
GROUP BY
  has_http_header,
  has_html_lang,
  has_http_equiv,
  valid
ORDER BY
  has_http_header DESC,
  has_html_lang DESC,
  has_http_equiv DESC

Querying content-security-policy-report-only usage

Show query
WITH meta AS (
  SELECT
    rank,
    page,
    LOWER(JSON_VALUE(meta, '$.http-equiv')) AS http_equiv,
    REGEXP_REPLACE(LOWER(JSON_VALUE(meta, '$.content')), r'\s', '') AS content
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.almanac.meta-nodes.nodes')) AS meta
  WHERE
    date = '2023-06-01'
)
SELECT
  rank,
  page,
  content
FROM
  meta
WHERE
  http_equiv = 'content-security-policy-report-only'
ORDER BY
  rank

Querying refresh usage

Show query
WITH meta AS (
  SELECT
    root_page,
    LOWER(JSON_VALUE(meta, '$.http-equiv')) AS http_equiv,
    REGEXP_REPLACE(LOWER(JSON_VALUE(meta, '$.content')), r'\s', '') AS content
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.almanac.meta-nodes.nodes')) AS meta
  WHERE
    date = '2023-06-01'
)
SELECT
  content,
  COUNT(DISTINCT root_page) AS sites
FROM
  meta
WHERE
  http_equiv = 'refresh'
GROUP BY
  content
ORDER BY
  sites DESC

Querying default-style usage

Show query
WITH meta AS (
  SELECT
    rank,
    root_page,
    LOWER(JSON_VALUE(meta, '$.http-equiv')) AS http_equiv,
    REGEXP_REPLACE(LOWER(JSON_VALUE(meta, '$.content')), r';\s+', ';') AS content
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.almanac.meta-nodes.nodes')) AS meta
  WHERE
    date = '2023-06-01'
)
SELECT DISTINCT
  rank,
  root_page,
  content
FROM
  meta
WHERE
  http_equiv = 'default-style'
ORDER BY
  rank

Querying Squarespace accept-ch adoption

Show query
WITH accept_ch AS (
  SELECT
    root_page
  FROM
    `httparchive.scratchspace.http_equiv`
  WHERE
    http_equiv = 'accept-ch'
),
ss AS (
  SELECT
    root_page
  FROM
    `httparchive.all.pages`,
    UNNEST(technologies) AS t
  WHERE
    date = '2023-06-01' AND
    t.technology = 'Squarespace'
)
SELECT
  COUNT(DISTINCT IF(ss.root_page IS NOT NULL, root_page, NULL)) / COUNT(DISTINCT root_page) AS pct_ss
FROM
  accept_ch
LEFT JOIN
  ss
USING
  (root_page)

Querying accept-ch usage

Show query
SELECT
  TRIM(directive) AS directive,
  COUNT(DISTINCT root_page) AS sites
FROM
  `httparchive.scratchspace.http_equiv`,
  UNNEST(SPLIT(content)) AS directive
WHERE
  http_equiv = 'accept-ch'
GROUP BY
  directive
ORDER BY
  sites DESC

Querying x-dns-prefetch-control usage

Show query
SELECT
  content,
  COUNT(DISTINCT root_page) AS sites
FROM
  `httparchive.scratchspace.http_equiv`
WHERE
  http_equiv = 'x-dns-prefetch-control'
GROUP BY
  content
ORDER BY
  sites DESC

The post You probably don’t need http-equiv meta tags appeared first on rviscomi.dev.

Origin trials and tribulations https://rviscomi.dev/2023/07/origin-trials-and-tribulations/ https://rviscomi.dev/2023/07/origin-trials-and-tribulations/#respond Wed, 05 Jul 2023 04:18:13 +0000 https://rviscomi.dev/?p=188 Origin trials are a way for developers to get early access to experimental web platform features. They’re carefully controlled “beta tests” run by browsers to ensure that the feature works and is worth more time on implementation and standardization. Check out Getting started with origin trials to learn more. What’s interesting to me is seeing […]

The post Origin trials and tribulations appeared first on rviscomi.dev.

Origin trials are a way for developers to get early access to experimental web platform features. They’re carefully controlled “beta tests” run by browsers to ensure that the feature works and is worth more time on implementation and standardization. Check out Getting started with origin trials to learn more.

What’s interesting to me is seeing how sites are using origin trials across the web, with the help of the public HTTP Archive dataset. By detecting and extracting these origin trial tokens, we can decode them to understand more about: which experimental features the sites are enrolling in, when the trial expires, whether the functionality can be added by a third party, who that third party is, and whether the origin trial also extends to all subdomains of a site. There’s a lot of info packed into a token, and lots we can learn about how these origin trials are (mis)used on the web.
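To illustrate, here’s a minimal sketch of extracting a token’s payload, assuming the layout Chrome documents for its origin trial tokens: a 1-byte version, a 64-byte signature, a 4-byte big-endian payload length, and then a UTF-8 JSON payload. It skips signature verification entirely, so it’s only useful for inspection, not validation.

```javascript
// Sketch: decode the metadata payload of a Chrome origin trial token.
// Assumed layout: 1-byte version, 64-byte Ed25519 signature,
// 4-byte big-endian payload length, then a JSON payload.
// Signature verification is intentionally omitted.
function decodeOriginTrialToken(token) {
  const buf = Buffer.from(token, 'base64');
  const version = buf[0];
  const payloadLength = buf.readUInt32BE(1 + 64);
  const payloadStart = 1 + 64 + 4;
  const payload = buf
    .subarray(payloadStart, payloadStart + payloadLength)
    .toString('utf8');
  // Payload fields include origin, feature, expiry, and optionally
  // isSubdomain / isThirdParty flags.
  return { version, ...JSON.parse(payload) };
}
```

Running this over the tokens extracted from HTTP Archive response bodies is what makes the analysis below possible.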

Which origin trials are used the most?

The first question worth answering is to find out what the most popular origin trials are. The results will be extremely volatile over time because origin trials are ephemeral by nature and heavily influenced by third party adoption. With that in mind, it’s still useful to get a baseline understanding of what origin trials are out in the wild.

At any given time you can browse the complete list of active origin trials for Chrome, Edge, and Firefox. Safari doesn’t have a program for origin trials.

So let’s see what’s being used the most as of today.

Feature | Websites
PrivacySandboxAdsAPIs | 3,697,720
WebViewXRequestedWithDeprecation | 1,254,718
AttributionReportingCrossAppWeb | 1,191,638
InterestCohortAPI | 36,990
CoepCredentialless | 8,591
PendingBeaconAPI | 1,006
SendFullUserAgentAfterReduction | 577
FedCmAutoReauthn | 410
InstalledApp | 371
PermissionsPolicyUnload | 329
BackForwardCacheNotRestoredReasons | 324
Top 11 features used by mobile pages as of June 2023. (Source: HTTP Archive)

So, what do these features do? Let’s look at each of the top 11 one by one. And let’s count it down in reverse order—why not.

The eleventh most used origin trial is BackForwardCacheNotRestoredReasons, found on 324 sites. This feature lists the reasons why a user didn’t get a page served from the back/forward cache (bfcache). I’m particularly excited about this one, because the bfcache is very effective at giving users the feeling of instant navigations. But eligibility can vary by user, and it’s otherwise impossible for a site owner to understand why they’re not seeing the bfcache restores that they’d expect. Unfortunately, tokens on 302 sites are configured incorrectly (wrong origin). That plus the 6 tokens that have expired means that only 16 sites are actually able to successfully collect data from the API—yikes! It’s expiring next month, so time is running out.

The tenth most used origin trial is PermissionsPolicyUnload, found on 329 sites. This one lets site owners disallow all scripts from running unload event handlers. It’s related to the previous origin trial, because a page with an unload handler is ineligible for the bfcache. It recently expired in June, so it’s not working anymore anyway, but it had a similar configuration issue in which 304 sites had an invalid origin. So any performance A/B tests they were hoping to run should not have worked and, given that the trial is expired, it’s too late to rerun them.

Aside: Looking more closely at the previous two origin trials, it seems what happened was that many TLD variations of google.com (.ca, .cl, .co.in, .co.jp…) incorrectly reused the origin trial token that was explicitly activated for the google.com origin. Let this be a lesson: always validate your origin trial tokens!

The ninth most used origin trial feature is the InstalledApp feature, which is found on 371 sites. It allows sites to determine whether the user has installed their corresponding app, using getInstalledRelatedApps(). The trial ended in January 2020, so all of these sites can save a few bytes by removing the expired tokens from the markup.

The eighth most used origin trial is FedCmAutoReauthn, on 410 sites. This is part of the Federated Credentials Management API (think “Sign in with…”) and the experimental feature is a streamlined re-authentication UX. In all but 14 cases, sites are inheriting this origin trial via a third party accounts.google.com script.

Number seven is SendFullUserAgentAfterReduction coming in at 577 sites. This feature helps sites migrate any of their dependencies off of the full User Agent (UA) string by delaying the newer, reduced UA string format. The UA string is being discouraged for browser/feature detection in favor of the User-Agent Client Hints API.

At number six we have PendingBeaconAPI, found on 1,006 sites. Per the origin trial page, it “allows website authors to specify one or more beacons (HTTP requests) that should be sent reliably when the page is being unloaded.” Oddly, even though the trial is active until September 19 (Chrome 115), all but one site are inheriting an expired origin trial token from a third party ad.doubleclick.net script. The only other site? Also expired. Maybe that’s intentional though, as the PendingBeacon API explainer doc warns that the API is being replaced with fetchLater() after feedback.

The number five most used origin trial is CoepCredentialless, used on 8,591 sites. COEP, which stands for Cross-Origin-Embedder-Policy, enables cross-origin isolation. All but two instances are inherited from a third party itch.io script, and for the first time, 100% of the tokens pass validation checks!

Despite expiring nearly two years ago, the fourth most used origin trial is InterestCohortAPI, used on 36,990 sites. The major driver for its continued popularity seems to be a third party script from adroll.com, used by 36,607 sites, and a script by criteo.net on 321 sites. A wild airhorner.com also appears as one of the holdouts.

Ok now we’re getting into some serious levels of adoption. At number three we have AttributionReportingCrossAppWeb, on 1,191,638 sites. This is an extension of the Attribution Reporting API, which enables measurements like ad conversions in a privacy-preserving way without third-party cookies. This experiment allows attribution events on mobile web to be joinable with events in Android’s Privacy Sandbox. Only two third-party origins are responsible for its popularity: googletagmanager.com (1,184,189 sites), and googleadservices.com (9,380 sites), with some overlapping sites having both.

The second most used origin trial is WebViewXRequestedWithDeprecation, on 1,254,718 sites, which permits WebView requests to continue using the legacy X-Requested-With header while it’s being deprecated. Again, with lots of overlap, the top third-party origins to be using it are doubleclick.net (1,254,708 sites) and googlesyndication.com (1,253,735 sites).

And finally, the most used origin trial is PrivacySandboxAdsAPIs, used on 3,697,720 sites. It’s a collection of APIs to facilitate advertising: FLEDGE, Topics, Fenced Frames, and Attribution Reporting. A whopping 399,172 sites contain an expired token for this feature with the biggest offenders being criteo.com (225,711 sites), criteo.net (225,707 sites—hmm), and s.pinimg.com (171,817).

For reference, here’s the query used to generate all of these stats:

SELECT
  feature,
  COUNT(DISTINCT page) AS pages,
  COUNT(DISTINCT IF(is_expired_token, page, NULL)) AS expired,
  COUNT(DISTINCT IF(is_invalid_subdomain, page, NULL)) AS invalid_subdomain,
  COUNT(DISTINCT IF(is_invalid_third_party, page, NULL)) AS invalid_3p,
  COUNT(DISTINCT IF(source = 'meta', page, NULL)) AS meta_source,
  CAST(MAX(expiry) AS DATETIME) AS most_recent_expiry,
  APPROX_TOP_COUNT(origin, 1) AS top_origin
FROM
  `httparchive.scratchspace.origin_trials`
GROUP BY
  feature
ORDER BY
  pages DESC

Jump to the setup section below to see how the origin_trials scratch table was created.

How many pages directly or indirectly include an origin trial?

SELECT
  COUNT(DISTINCT page) AS pages
FROM
  `httparchive.scratchspace.origin_trials`

This is a quick and easy one to answer. 3,720,272 mobile sites include an origin trial as of June 2023. That’s about 22% of the 16,563,413 sites in the dataset.

SELECT
  COUNT(DISTINCT page) AS pages
FROM
  `httparchive.scratchspace.origin_trials`
WHERE
  NET.REG_DOMAIN(page) = NET.REG_DOMAIN(origin)

If we only look at sites that explicitly self-register for an origin trial, we find there are 10,427 of them. Since we’re only looking at sites under the same domain as the registrant on the origin trial, this will include lots of subdomain variants and “tenant” subdomains. For example, itch.io shows up 8,589 times in this list, facebook.com 155 times, and pinterest.com 68 times.

So what features are individual sites enabling for themselves?

SELECT
  feature,
  COUNT(DISTINCT page) AS pages,
  COUNT(DISTINCT IF(is_expired_token, page, NULL)) AS expired
FROM
  `httparchive.scratchspace.origin_trials`
WHERE
  NET.REG_DOMAIN(page) = NET.REG_DOMAIN(origin)
GROUP BY
  feature
ORDER BY
  pages DESC
| Feature | Websites |
| --- | --- |
| CoepCredentialless | 8,591 |
| PrivacySandboxAdsAPIs | 368 |
| SendFullUserAgentAfterReduction | 280 |
| PrivateNetworkAccessNonSecureContextsAllowed | 210 |
| UnrestrictedSharedArrayBuffer | 146 |
| InstalledApp | 161 |
| DisableDifferentOriginSubframeDialogSuppression | 116 |
| SmsReceiver | 103 |
| AllowSyncXHRInPageDismissal | 43 |
| Launch Handler | 42 |
| PriorityHintsAPI | 39 |

Top 11 features directly used by mobile pages as of June 2023. (Source: HTTP Archive)

As mentioned earlier, itch.io is responsible for much of the CoepCredentialless usage, so that’s really an outlier. For the rest, no more than a few hundred sites are enrolling in any given first-party origin trial.

I won’t go through each feature again, but I do want to call out that a few of them look quite old. That raises a related question: what percent of first-party sites include an expired token?

SELECT
  COUNT(DISTINCT IF(is_expired_token, page, NULL)) / COUNT(DISTINCT page) AS pct_expired
FROM
  `httparchive.scratchspace.origin_trials`
WHERE
  NET.REG_DOMAIN(page) = NET.REG_DOMAIN(origin)

17%, or about one in six sites, sign up for an origin trial token and keep it around past its expiration.

Which third parties are injecting the most invalid tokens?

For a third party to make use of an origin trial, it needs to dynamically inject the token in a meta[http-equiv="Origin-Trial"] tag. Two main things can go wrong with this:

  • The token is expired
  • The token doesn’t have the thirdParty flag set

Tokens are intentionally short-lived. When they expire, they should be removed along with any experimental functionality.

SELECT
  origin,
  COUNT(DISTINCT page) AS pages,
  CAST(APPROX_QUANTILES(expiry, 1000)[OFFSET(500)] AS DATETIME) AS median_expiry
FROM
  `httparchive.scratchspace.origin_trials`
WHERE
  is_expired_token
GROUP BY
  origin
ORDER BY
  pages DESC
| Origin | Websites |
| --- | --- |
| https://criteo.net:443 | 226,027 |
| https://criteo.com:443 | 225,711 |
| https://s.pinimg.com:443 | 171,825 |
| https://adroll.com:443 | 36,607 |
| https://teads.tv:443 | 7,935 |
| https://ladsp.com:443 | 4,079 |
| https://ad.doubleclick.net:443 | 1,005 |
| https://www.googletagmanager.com:443 | 848 |
| https://doubleclick.net:443 | 681 |
| https://googletagservices.com:443 | 671 |
| https://googlesyndication.com:443 | 664 |

Top 11 origins injecting expired tokens as of June 2023. (Source: HTTP Archive)

We saw earlier that Criteo, an adtech company, was responsible for the expired PrivacySandboxAdsAPIs tokens. So it’s no surprise to see it topping the list here. But it is interesting to note that half of their tokens have been expired since November 2022.

s.pinimg.com is an image sharing hostname from Pinterest. Again, almost all of its expired tokens are for the PrivacySandboxAdsAPIs feature, with a median expiration date of April 2023.

We also saw adroll.com earlier as the main driver of the expired InterestCohortAPI feature.

Injecting an expired token isn’t the worst thing. Presumably, it lived long enough in production to be useful. Invalid third party tokens are something else, though.

SELECT
  origin,
  COUNT(DISTINCT page) AS pages
FROM
  `httparchive.scratchspace.origin_trials`
WHERE
  is_invalid_third_party
GROUP BY
  origin
ORDER BY
  pages DESC
| Origin | Websites |
| --- | --- |
| https://doubleclick.net:443 | 1,254,708 |
| https://googlesyndication.com:443 | 1,253,735 |
| https://themoneytizer.com:443 | 5,501 |
| https://www.google.com:443 | 303 |
| https://facebook.com:443 | 203 |
| https://airbnb.com:443 | 87 |
| https://m.youtube.com:443 | 52 |
| https://pinterest.com:443 | 20 |
| https://brands-id.shortlyst.com:443 | 16 |
| https://m.redbus.in:443 | 15 |

Top 10 origins injecting invalid third party tokens as of June 2023. (Source: HTTP Archive)

These are the most prevalent third parties injecting origin trial tokens that will never work on a given site. Actually, let’s clarify one major assumption: the origin encoded in the token is assumed to be the same one injecting the token into its host page. It’s also possible that someone else (a fourth party?) is mishandling the origin’s token. For example, if I clone example.com onto my own site, all of their meta tag tokens will be invalidly served from rviscomi.dev.

Setting that aside, doubleclick.net (Google) and googlesyndication.com (also Google) are the two biggest origins that omit the thirdParty flag. In both cases, they’re missing the flag on their third party WebViewXRequestedWithDeprecation token.

Is that a big deal? I hope not. It means the X-Requested-With header would be unexpectedly stripped from WebView requests. Maybe in some cases that's a load-bearing header, but serious breakage seems unlikely.

I do worry about the origin trials more in my neck of the woods, like PermissionsPolicyUnload and BackForwardCacheNotRestoredReasons, which I highlighted earlier as being served by the wrong Google TLDs. At worst, someone might give up on them because they don’t seem to work or have any positive effect, all due to a misconfiguration.

Setting up the data table

Since I know I’ll be repeatedly querying the HTTP Archive dataset for the same origin trial info, the first thing I’ll do is preprocess the data and save it to a temporary table.

CREATE TEMP FUNCTION DECODE_ORIGIN_TRIAL(token STRING) RETURNS STRING DETERMINISTIC AS (
  REGEXP_EXTRACT(SAFE_CONVERT_BYTES_TO_STRING(SAFE.FROM_BASE64(token)), r'({".*)')
);

CREATE TEMP FUNCTION PARSE_ORIGIN_TRIAL(token STRING)
RETURNS STRUCT<
  token STRING,
  feature STRING,
  origin STRING,
  expiry TIMESTAMP,
  is_subdomain BOOL,
  is_third_party BOOL
> AS (
  STRUCT(
    token,
    JSON_VALUE(DECODE_ORIGIN_TRIAL(token), '$.feature') AS feature,
    JSON_VALUE(DECODE_ORIGIN_TRIAL(token), '$.origin') AS origin,
    TIMESTAMP_SECONDS(CAST(JSON_VALUE(DECODE_ORIGIN_TRIAL(token), '$.expiry') AS INT64)) AS expiry,
    JSON_VALUE(DECODE_ORIGIN_TRIAL(token), '$.isSubdomain') = 'true' AS is_subdomain,
    JSON_VALUE(DECODE_ORIGIN_TRIAL(token), '$.isThirdParty') = 'true' AS is_third_party
  )
);


CREATE OR REPLACE TABLE `httparchive.scratchspace.origin_trials` AS

WITH valid_pages AS (
  SELECT
    page
  FROM
    `httparchive.all.requests`
  WHERE
    date = '2023-06-01' AND
    client = 'mobile' AND
    is_main_document AND
    NET.REG_DOMAIN(page) = NET.REG_DOMAIN(url)
),

ranks AS (
  SELECT
    rank,
    page
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-06-01' AND
    client = 'mobile'
),

origin_trials AS (
  SELECT
    rank,
    page,
    'meta' AS source,
    PARSE_ORIGIN_TRIAL(TRIM(token)).*
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.almanac.meta-nodes.nodes')) AS meta,
    UNNEST(SPLIT(JSON_VALUE(meta, '$.content'), ',')) AS token
  JOIN
    valid_pages
  USING
    (page)
  WHERE
    date = '2023-06-01' AND
    client = 'mobile' AND
    is_root_page AND
    LOWER(JSON_VALUE(meta, '$.http-equiv')) = 'origin-trial'
UNION ALL
  SELECT
    rank,
    page,
    'header' AS source,
    PARSE_ORIGIN_TRIAL(header.value).*
  FROM
    `httparchive.all.requests`,
    UNNEST(response_headers) AS header
  JOIN
    valid_pages
  USING
    (page)
  JOIN
    ranks
  USING
    (page)
  WHERE
    date = '2023-06-01' AND
    client = 'mobile' AND
    is_root_page AND
    is_main_document AND
    LOWER(header.name) = 'origin-trial'
)

SELECT
  *,
  feature IS NULL AS is_invalid_token,
  expiry < CURRENT_TIMESTAMP() AS is_expired_token,
  NET.REG_DOMAIN(page) != NET.REG_DOMAIN(origin) AND is_third_party IS NOT TRUE AS is_invalid_third_party,
  NET.REG_DOMAIN(page) = NET.REG_DOMAIN(origin) AND NET.HOST(page) != NET.HOST(origin) AND is_subdomain IS NOT TRUE AS is_invalid_subdomain
FROM
  origin_trials

Warning: this query processes 1.97 TB, which costs about $10.

I’d love for more people to get comfortable writing their own queries over the HTTP Archive dataset, so let’s walk through what this query does.

Custom functions

CREATE TEMP FUNCTION DECODE_ORIGIN_TRIAL(token STRING) RETURNS STRING DETERMINISTIC AS (
  REGEXP_EXTRACT(SAFE_CONVERT_BYTES_TO_STRING(SAFE.FROM_BASE64(token)), r'({".*)')
);

CREATE TEMP FUNCTION PARSE_ORIGIN_TRIAL(token STRING)
RETURNS STRUCT<
  token STRING,
  feature STRING,
  origin STRING,
  expiry TIMESTAMP,
  is_subdomain BOOL,
  is_third_party BOOL
> AS (
  STRUCT(
    token,
    JSON_VALUE(DECODE_ORIGIN_TRIAL(token), '$.feature') AS feature,
    JSON_VALUE(DECODE_ORIGIN_TRIAL(token), '$.origin') AS origin,
    TIMESTAMP_SECONDS(CAST(JSON_VALUE(DECODE_ORIGIN_TRIAL(token), '$.expiry') AS INT64)) AS expiry,
    JSON_VALUE(DECODE_ORIGIN_TRIAL(token), '$.isSubdomain') = 'true' AS is_subdomain,
    JSON_VALUE(DECODE_ORIGIN_TRIAL(token), '$.isThirdParty') = 'true' AS is_third_party
  )
);

I’m creating two BigQuery functions: DECODE_ORIGIN_TRIAL and PARSE_ORIGIN_TRIAL. These two functions were adapted from the ot-decode project by fellow Chromie Sam Dutton, which itself takes inspiration from Jack’s origin-trials-viewer project. Why didn’t I just use a JavaScript function in BigQuery? Some of the APIs weren’t available, and since it was relatively straightforward to port over to SQL, I opted to take advantage of the performance benefits.
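If you want to poke at tokens locally, here's a rough Node.js equivalent of those two SQL functions. Like the SQL above, it doesn't parse the token's binary layout properly; it just scans the decoded bytes for the first `{"` to find the JSON payload, so treat it as an approximation rather than a faithful decoder.

```javascript
// Rough Node.js port of DECODE_ORIGIN_TRIAL / PARSE_ORIGIN_TRIAL.
// Skips the token's binary signature prefix by scanning for the
// first '{"' in the base64-decoded bytes, same as the SQL regex.
function decodeOriginTrial(token) {
  const bytes = Buffer.from(token, 'base64').toString('utf8');
  const match = bytes.match(/({".*)/s);
  return match ? match[1] : null;
}

function parseOriginTrial(token) {
  const json = decodeOriginTrial(token);
  if (json === null) return { token, feature: null };
  let payload;
  try {
    payload = JSON.parse(json);
  } catch (e) {
    return { token, feature: null };
  }
  return {
    token,
    feature: payload.feature ?? null,
    origin: payload.origin ?? null,
    expiry: payload.expiry ? new Date(payload.expiry * 1000) : null,
    isSubdomain: payload.isSubdomain === true,
    isThirdParty: payload.isThirdParty === true,
  };
}
```

A malformed or truncated token simply comes back with `feature: null`, which is exactly the signal the query later uses to flag invalid tokens.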

Destination table

CREATE OR REPLACE TABLE httparchive.scratchspace.origin_trials AS

This takes the output of the query and saves it to a table in the httparchive project. This will only work if you have write access to the project, so feel free to comment it out or replace it with a table in your own BigQuery project if you’d like. The httparchive table is still publicly queryable, so feel free to explore it.

Subqueries

WITH valid_pages AS (
  SELECT
    page
  FROM
    `httparchive.all.requests`
  WHERE
    date = '2023-06-01' AND
    client = 'mobile' AND
    is_main_document AND
    NET.REG_DOMAIN(page) = NET.REG_DOMAIN(url)
),

The WITH clause aliases the output of the following subqueries so that I can reference them later. It’s not a temp table necessarily, but it makes the query a lot more readable.

The valid_pages subquery creates a verified subset of pages that don’t have a cross-domain redirect. If foo.com redirects to bar.com, we don’t want bar’s origin trials mistakenly attributed to foo. We’ll join the following queries with this one to ensure that we’re only looking at valid pages.

ranks AS (
  SELECT
    rank,
    page
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-06-01' AND
    client = 'mobile'
),

This next ranks subquery simply gets the rank for each page in the mobile dataset, which will be used later.

origin_trials AS (
  SELECT
    rank,
    page,
    'meta' AS source,
    PARSE_ORIGIN_TRIAL(TRIM(token)).*
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.almanac.meta-nodes.nodes')) AS meta,
    UNNEST(SPLIT(JSON_VALUE(meta, '$.content'), ',')) AS token
  JOIN
    valid_pages
  USING
    (page)
  WHERE
    date = '2023-06-01' AND
    client = 'mobile' AND
    is_root_page AND
    LOWER(JSON_VALUE(meta, '$.http-equiv')) = 'origin-trial'

The next subquery origin_trials extracts the origin trial metadata from all of the <meta> tags on the page.

If you’re wondering why I didn’t parse the HTML response body directly in BigQuery, that approach would have only yielded the static <meta> tags. Crucially, we need to also include dynamically injected tags from JavaScript. It’s possible (necessary, even) for third party scripts to inject an origin trial token into the <head> of the main page after it’s already loaded.

This query takes advantage of the almanac.meta-nodes custom metric, which runs document.querySelectorAll('head meta') on the page and returns the attributes of each tag. So we’re able to filter the results down to only the ones with http-equiv=origin-trial (case insensitive) and extract their content attributes, which contain the origin trial token.
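In plain JavaScript, the filter-and-split step the SQL performs on those meta nodes looks roughly like this. The node shape here (one object per `<meta>` tag, keyed by attribute name) mirrors what the custom metric returns, but take the exact property names as assumptions.

```javascript
// Given meta-node objects like { 'http-equiv': 'Origin-Trial', content: 'a,b' },
// pull out every origin trial token: case-insensitive http-equiv filter,
// comma-split of the content attribute, and trimming, as in the SQL.
function extractOriginTrialTokens(metaNodes) {
  return metaNodes
    .filter((meta) => (meta['http-equiv'] || '').toLowerCase() === 'origin-trial')
    .flatMap((meta) => (meta.content || '').split(','))
    .map((token) => token.trim())
    .filter((token) => token.length > 0);
}
```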

Also note that I’m only querying the June 2023 dataset (the latest at the time of writing) with is_root_page set, which limits the dataset to home pages only. We could remove this and include secondary pages, for example the article page on a news website, but I don’t suspect the overall results would be much different. Maybe that’s something you can check, if you’re up for it 🤠

The output of this query is the rank of the page, the URL of the page itself, the source of the token (which is meta in this case), and the parsed origin trial data from the custom function above.

UNION ALL
  SELECT
    rank,
    page,
    'header' AS source,
    PARSE_ORIGIN_TRIAL(header.value).*
  FROM
    `httparchive.all.requests`,
    UNNEST(response_headers) AS header
  JOIN
    valid_pages
  USING
    (page)
  JOIN
    ranks
  USING
    (page)
  WHERE
    date = '2023-06-01' AND
    client = 'mobile' AND
    is_root_page AND
    is_main_document AND
    LOWER(header.name) = 'origin-trial'
)

The origin_trials subquery continues with a UNION ALL clause to effectively mash together the results of the previous SELECT statement with this one.

The key difference here is that I’m looking at the requests table so that I can extract any Origin-Trial HTTP headers from the document response.

Amazingly, even though this table is 2.4 PB (yes, petabytes) with over 60 billion rows, this part of the query only processes about 12 GB of data. That’s thanks to the built-in partitioning and clustering, and the fact that the headers are directly accessible under the response_headers field rather than having to be parsed out of a massive JSON blob with other request metadata.

Also note that this is where the ranks data comes into play, because the requests table doesn’t annotate each page with its rank. Maybe it should!

There’s one thing missing from this query worth mentioning, and that is origin trial tokens set in HTTP headers outside of the main page. For example, ServiceWorkerBypassFetchHandlerForMainResource requires the token to be set on the response headers of the service worker itself. A quick check of the dataset found no instances of this particular origin trial, but it’s definitely possible that I’m overlooking some other valid tokens. For the sake of simplicity, this post only looks at origin trials set on the main page via first party headers or meta tags.

Validation

SELECT
  *,
  feature IS NULL AS is_invalid_token,
  expiry < CURRENT_TIMESTAMP() AS is_expired_token,
  NET.REG_DOMAIN(page) != NET.REG_DOMAIN(origin) AND is_third_party IS NOT TRUE AS is_invalid_third_party,
  NET.REG_DOMAIN(page) = NET.REG_DOMAIN(origin) AND NET.HOST(page) != NET.HOST(origin) AND is_subdomain IS NOT TRUE AS is_invalid_subdomain
FROM
  origin_trials

The rest of the query is where it all comes together. This is what determines what actually gets written to the output table.

I’m piping everything out of the origin_trials pseudo-table and adding a few validation flags:

  • If the feature field is null, there must have been a decoding/parsing error, so mark the token as invalid.
  • If the token’s expiry field is in the past, mark it as expired.
  • If the origin of the token doesn’t match up with the origin of the page, and there is no thirdParty flag set, mark it as invalid.
  • If the domain of the token does match up with the domain of the page, but the hosts don’t match (eg foo.example.com and bar.example.com), and there is no subdomain flag set, mark it as invalid.
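Those four checks translate to JavaScript along these lines. Note that `registrableDomain` below is a deliberately naive stand-in for BigQuery's NET.REG_DOMAIN: the real function consults the Public Suffix List (so it handles suffixes like co.uk), while this sketch just takes the last two host labels.

```javascript
// Naive stand-in for NET.REG_DOMAIN: last two host labels only.
// A faithful port would need the Public Suffix List.
function registrableDomain(url) {
  return new URL(url).hostname.split('.').slice(-2).join('.');
}

function hostOf(url) {
  return new URL(url).hostname; // drops the port, like NET.HOST
}

// Mirror the query's four validation flags for one parsed token.
function validateToken(page, t) {
  const sameRegDomain = t.origin != null &&
    registrableDomain(page) === registrableDomain(t.origin);
  const sameHost = t.origin != null && hostOf(page) === hostOf(t.origin);
  return {
    isInvalidToken: t.feature == null,
    isExpiredToken: t.expiry != null && t.expiry.getTime() < Date.now(),
    isInvalidThirdParty: t.origin != null && !sameRegDomain && t.isThirdParty !== true,
    isInvalidSubdomain: sameRegDomain && !sameHost && t.isSubdomain !== true,
  };
}
```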

The output table contains 20,000,950 rows and is 7.22 GB. So it’s definitely affordable for anyone to query and stay well under the 1 TB/month free tier on GCP. (Even still, always set up cost controls!)

The table is publicly accessible at httparchive.scratchspace.origin_trials. Feel free to run your own queries to explore the data, and share what you find with me on Twitter or down in the comments.

Be aware that this table—and every other table in the scratchspace dataset—is provided with no guarantees of uptime/correctness/maintenance. So it won’t be updated regularly with the latest month’s data, it may contain inaccurate or missing data, and it might even be deleted without notice.

The post Origin trials and tribulations appeared first on rviscomi.dev.

]]>
https://rviscomi.dev/2023/07/origin-trials-and-tribulations/feed/ 0
Querying parsed HTML in BigQuery https://rviscomi.dev/2023/05/querying-parsed-html-in-bigquery/ https://rviscomi.dev/2023/05/querying-parsed-html-in-bigquery/#respond Wed, 24 May 2023 20:26:30 +0000 https://rviscomi.dev/?p=86 A longstanding problem in the HTTP Archive dataset has been extracting insights from blobs of HTML in BigQuery. For example, take the source code of example.com: If you wanted to extract the link text in the last paragraph, you could do something relatively straightforward like this: But in BigQuery, we don’t have the luxury of […]

The post Querying parsed HTML in BigQuery appeared first on rviscomi.dev.

]]>
A longstanding problem in the HTTP Archive dataset has been extracting insights from blobs of HTML in BigQuery. For example, take the source code of example.com:

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">...</style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

If you wanted to extract the link text in the last paragraph, you could do something relatively straightforward like this:

// 'More information...'
document.querySelector('p:last-child a').textContent;

But in BigQuery, we don’t have the luxury of the document object, querySelector, or textContent.

Instead, we’ve had to resort to unwieldy regular expressions like this:

# 'More information...'
SELECT
  REGEXP_EXTRACT(html, r'<p><a[^>]*>([^<]*)</a></p>') AS link_text
FROM
  body

It looks like it works, but it’s brittle.

  • What if there’s text or whitespace between the elements?
  • What if there are attributes on the paragraph?
  • What if there’s another p>a element pair earlier in the page?
  • What if the page uses uppercase tag names?

It goes on and on.

Using regular expressions for parsing HTML seems like a good idea at first, but it becomes a nightmare as you need to ramp it up to increasingly unpredictable inputs.

To avoid this headache in HTTP Archive analyses, we’ve resorted to custom metrics. These are executed on each page at runtime, and it’s been really effective. It enables us to analyze both the fully rendered page as well as the static HTML. But one big limitation with custom metrics is that they only work at runtime. So if we want to change the code or analyze an older dataset, we’re out of luck.

Cheerio

While looking for a way to implement capo.js in BigQuery to understand how pages in HTTP Archive are ordered, I came across the Cheerio library, which is a jQuery-like interface over an HTML parser.

It works beautifully.

Screenshot of a BigQuery query and result showing example.com being analyzed with the CAPO custom function.

To be able to use Cheerio in BigQuery, I first needed to build a JavaScript binary that I could load into a UDF. The post How To Use NPM Library in Google BigQuery UDF was a big help. I installed the Cheerio library locally and built it into a script with an exposed cheerio global variable using Webpack.
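For the curious, the Webpack side of this mostly comes down to exposing the bundle's export as a global variable. A minimal config might look like the following sketch; the entry point and file names are illustrative, not the exact build I used.

```javascript
// webpack.config.js — bundle Cheerio into one script that exposes a
// global `cheerio` variable, which the BigQuery UDF runtime can see.
// Entry point assumed to be: module.exports = require('cheerio');
module.exports = {
  mode: 'production',
  entry: './index.js',
  output: {
    filename: 'cheerio.js',
    library: 'cheerio',   // name of the exposed global
    libraryTarget: 'var', // emit it as `var cheerio = ...`
  },
};
```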

I uploaded the script to HTTP Archive’s Google Cloud Storage bucket. Then in BigQuery, I was able to side-load the script into the UDF with OPTIONS:

OPTIONS (library = 'gs://httparchive/lib/cheerio.js')

From there, the UDF was able to reference the cheerio object to parse the HTML input and generate the results. You can see it in action at capo.sql.

Querying HTML in BigQuery

Here’s a full demo of the example.com link text solution in action:

DECLARE example_html STRING;
SET example_html = '''
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">...</style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
''';

CREATE TEMP FUNCTION getLinkText(html STRING)
RETURNS STRING LANGUAGE js
OPTIONS (library = 'gs://httparchive/lib/cheerio.js') AS '''
try {
  const $ = cheerio.load(html);
  return $('p:last-child a').text();
} catch (e) {
  return null;
}
''';

SELECT getLinkText(example_html) AS link_text

🔗 Try it on BigQuery

The results show it working as expected.

Limitations

Cheerio is marketed as fast and efficient, but if you try to parse every HTML response body in HTTP Archive, the query will fail.

Fully built, the library is 331 KB. And because the entire HTML blob has to be held in memory to be parsed, large documents consume a lot of memory.

To minimize the chances of OOM errors and speed up the query, one thing you can do is pare down the HTML to the area of interest using only the most basic regular expressions. Since the capo script is only concerned with the <head> element, I grabbed everything up to the closing </head> tag:

httparchive.fn.CAPO(
  REGEXP_EXTRACT(
    response_body,
    r'(?i)(.*</head>)'
  )
)
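The same pruning step is easy to prototype locally in JavaScript before spending query budget. This sketch grabs everything up to the first closing `</head>` tag (the SQL uses a greedy match, but for capo's purposes the first `</head>` is the one that matters):

```javascript
// Trim an HTML blob down to everything through the first </head>,
// case-insensitively, as a cheap pre-parse step before Cheerio.
// Returns null when no </head> is present, like REGEXP_EXTRACT.
function extractThroughHead(html) {
  const match = html.match(/[\s\S]*?<\/head>/i);
  return match ? match[0] : null;
}
```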

If there are no natural “breakpoints” in the document for your use case, you could also consider restricting the input to a certain character length like WHERE LENGTH(response_body) < 1000. The query will work and it’ll run more quickly, but the results will be biased towards smaller pages.

Also, some documents may not be able to be parsed at all, resulting in exceptions. I added try/catch blocks to the UDF to intercept any exceptions and return null instead.

That also means that your query needs to be able to handle null values instead. For example, to get the first <head> element from the results, I needed to use SAFE_OFFSET instead of plain old OFFSET to avoid breaking the query on null values: elements[SAFE_OFFSET(0)].

Wrapping up

Cheerio is a really powerful new tool in the HTTP Archive toolbox. It unlocks new types of analysis that used to be prohibitively complex. In the capo.sql use case, I was able to extract insights about pages’ <head> elements that would have only been possible with custom metrics on future datasets.

I’m really interested to see what new insights are possible with this approach. Let me know your thoughts in the comments and how you plan to use it.

The post Querying parsed HTML in BigQuery appeared first on rviscomi.dev.

]]>
https://rviscomi.dev/2023/05/querying-parsed-html-in-bigquery/feed/ 0