The post Your Accept-Language Redirects Could Be Blocking Search Engines and AI Crawlers appeared first on Merj.
Search engines generally don’t use Accept-Language in crawl requests, but some AI crawlers do, often with default English (US) values. If your site redirects HTML requests based on Accept-Language, you can accidentally funnel bots into the wrong locale, reduce coverage of non-English pages, and make debugging harder (especially when rendering is involved).
This post explains what we observed in testing.
In short: search engine bots historically didn’t send Accept-Language, so these redirects often didn’t fire. Many AI crawlers now send a default header (typically en-US,en;q=0.9), which is usually not user intent. Redirecting on Accept-Language can skew discovery and indexing toward the wrong locale, and gets messier during rendering.

For years, technical SEO has relied on a stable rule: don’t do locale-adaptive redirects for search engines.
In practice, that meant avoiding redirects based on IP geolocation, cookie state, or “smart” locale detection. Instead, create clean, crawlable URLs for each locale and add hreflang annotations when needed.
Accept-Language wasn’t part of this conversation for a long time – because major search engine bots simply don’t send it in crawl requests. Google states this explicitly in their locale-adaptive pages documentation.
But search engines aren’t the only crawlers anymore.
AI crawlers are increasingly common, and many behave differently. They’re often less mature than major search engines, more experimental, and more likely to trigger edge cases we never had to consider.
The edge case we’ll consider in this post: AI bots may send Accept-Language headers.
If your platform’s redirect rules are based on Accept-Language, you risk creating redirect loops or blocking certain bots from accessing specific language content.
Accept-Language is an HTTP request header used for content negotiation: the client tells the server which natural languages it prefers. Browsers typically set it based on UI language and user preferences.
The HTTP semantics are standardised in RFC 9110.
It’s a preference signal, not a command. Treat it as “this is what the client might prefer,” not “always redirect me.”
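To make the "preference signal" framing concrete, the header's quality values can be parsed and ranked like this (a minimal sketch, not a full RFC 9110 parser — it ignores wildcards and malformed parameters):

```python
def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Parse an Accept-Language header into (language, quality) pairs,
    sorted by preference. A minimal sketch, not a full RFC 9110 parser."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        if ";q=" in part:
            lang, q = part.split(";q=", 1)
            try:
                quality = float(q)
            except ValueError:
                quality = 0.0
        else:
            lang, quality = part, 1.0  # no q parameter means q=1
        prefs.append((lang.strip(), quality))
    return sorted(prefs, key=lambda p: p[1], reverse=True)

print(parse_accept_language("en-US,en;q=0.9"))
# [('en-US', 1.0), ('en', 0.9)]
```

Even a correctly parsed header only tells the server what the client claims to prefer; what the server does with that ranking is a separate design decision.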
Google’s documentation on locale-adaptive pages is clear and consistent:
Other major search engines follow similar rules.
In the old SEO world, Accept-Language redirects were rarely a problem since there was no header to trigger them. IP-based detection was the more obvious real risk.
We tested major search engines, AI platforms, and web-access LLMs to see how they fetch HTML documents in the wild and specifically which request headers they send and how they handle redirect chains.
| What we tested | Details |
|---|---|
| Content type | HTML requests only |
| Signals observed | Accept-Language and redirect chains |
HTML is where localisation decisions are usually enforced (often through redirects). It’s also the primary content search engines index and the main input LLM retrieval systems use for grounding.
Across tested platforms, requests were typically sent either without an Accept-Language header, or with the default Accept-Language: en-US,en;q=0.9. Crucially, AI crawlers and retrieval systems didn’t actively adapt Accept-Language to the request context.
In other words: when the header exists, it’s rarely user intent — it’s an implementation default. And in some cases, even Googlebot can end up presenting the same default Accept-Language when it follows redirects during rendering.
Practical takeaway: redirecting HTML based on Accept-Language can reduce indexing quality and create “wrong language” retrieval for both search engines and LLM-driven systems.
Notable exceptions: Applebot and PetalBot behaved differently in our tests, based on internal custom logic.
Below is the raw header behaviour we observed across platforms.
| Platform / Agent | Accept-Language header |
|---|---|
| Googlebot (also used by Gemini) | Absent. en-US is present when a JS redirect is followed during rendering. |
| Adsbot-Google | Absent. en-US is present when a JS redirect is followed during rendering. |
| Mediapartners-Google (AdSense) | Absent. en-US is present when a JS redirect is followed during rendering. |
| Bingbot (also used by Microsoft / Copilot) | Absent. |
| Adidxbot (Microsoft Ads) | Absent. |
| YandexBot | Absent or en, *;q=0.01. |
| Yeti (Naver) | Absent. |
| Baiduspider | Can be absent or zh-cn,zh-tw. zh-CN,en;q=0.9,en-GB;q=0.8,en-US;q=0.7,fr;q=0.6 is present when a JS redirect is followed during rendering. |
| Sogou | Can be absent, en-US,en;q=0.9, or zh-CN. |
| DuckDuckBot | Absent or en-US,en;q=0.8,zh;q=0.6,es;q=0.4 |
| DuckDuckBot-Https | Absent or en,* |
| Applebot (also used by Apple Intelligence / Spotlight) | Applebot appears to use Accept-Language values that match the domain’s country code top-level domain (ccTLD). For .com domains, the Accept-Language header is absent. For localised domains like .de, it sends de-DE, and for .co.jp, it sends ja-JP. |
| OpenAI GPTBot | Absent. |
| OpenAI OAI-SearchBot | Absent. |
| OpenAI ChatGPT-User | Can be en-US,en;q=0.9 or absent |
| Anthropic ClaudeBot | Absent. |
| Anthropic Claude-User | Absent. |
| Anthropic Claude-SearchBot | Absent. |
| Perplexity PerplexityBot | Absent. |
| Perplexity Perplexity‑User | Absent. |
| MistralAI | Absent or en-US,en;q=0.9. |
| Bytespider | Can be absent, en-US,en;q=0.5 or zh,zh-CN;q=0.9. |
| DuckAssistBot | Absent. |
| meta-externalagent | Absent. |
| PetalBot | Absent most of the time. Uses Accept-Language values that match the domain’s country code top-level domain (ccTLD), but inconsistently. For .com domains, the Accept-Language header is absent. Localised domains vary: .fr sends fr,en;q=0.8, but .at sends en. |
| Amazonbot | Absent. |
| TikTokSpider | Absent or en-US,en;q=0.5. |
| Pinterestbot | Absent. |
| CCbot | Absent or en-US,en;q=0.5 |
Note: This table reflects data from February 2026 and may change over time. After we identified redirects linked to the Accept-Language header a few months ago, a subset of AI crawlers stopped including that header in their requests.
Accept-Language redirects are typically implemented to help users reach the right content immediately, bypassing language selector screens and additional interactions. For users, it’s possible to measure the success of having such redirects through conversion rates and interaction metrics. For bots, the effects are more complex and subtle.
Here’s the typical pattern:

1. A bot requests a URL (e.g. /product)
2. The server inspects Accept-Language
3. The bot is redirected to /en/product (or worse: a generic homepage)

This creates downstream problems.
In our tests, Googlebot’s initial crawl fetch did not include Accept-Language (as expected). However, when a redirect was triggered during rendering, the follow-up requests inherited some request headers from the browser instance, including the default language preference (Accept-Language: en-US).
This creates a tricky edge case: when a platform redirects based on Accept-Language, the redirect doesn’t trigger during HTML fetches but may trigger during rendering fetches.
In the official Google documentation, one of the suggested solutions to avoid soft 404s for SPAs is to use a JavaScript redirect to a URL for which the server responds with a 404 HTTP status code (for example /not-found).
Depending on how the Accept-Language redirect is implemented, soft 404 handling becomes inconsistent and indexing signals get muddy:

1. Googlebot renders a missing page (e.g. /product)
2. The SPA issues a JavaScript redirect to /not-found
3. The server sees Accept-Language: en-US and redirects again based on internal rules
4. Googlebot ends up on /en/not_found, on a page that doesn’t return a 404, or worse, on a generic homepage

Accept-Language isn’t a “bad” header; it’s a standard that platforms can use in multiple ways. What breaks is the assumption that crawlers behave like users.
The web now includes a wide range of crawlers, many of which send browser-like headers and aggressively explore edge cases.
Our internal testing across major bots and LLM platforms supports these statements:

- Major search engines don’t use Accept-Language as a localisation signal
- When AI crawlers send Accept-Language, it is typically the default en-US,en;q=0.9
- Avoid Accept-Language based redirects for bots
LLMs naturally tend to prefer English over other languages – an emergent behaviour driven by English-heavy training data, tokeniser efficiency differences, and alignment signals. Redirecting based on Accept-Language reinforces this bias by forcing “English-only” content. This can unnecessarily exclude relevant non-English sources and create a partial, skewed view of available information.
We recommend avoiding redirects for bots based on the Accept-Language header. Build following internet standards, make URLs explicit, and keep redirects predictable.
The post When Search Stops Playing By Old Rules appeared first on Merj.
SEO died again.
It’s fine though. My profession just does that sometimes. This time, its supposed killer is AI and the emergence of the chat-based LLM. This has led in turn to the emergence of the field of AEO / GEO / many other acronyms.
However, while much of the discussion around that field gets into the nitty gritty of the technical detail – and as a professional dork, I’m as eager to charge down those rabbit holes as anyone – I wanted to zoom out and look at the wider picture of AEO. In this article, I’m going to outline how value is traditionally created and captured on the web, and why you can’t just transplant a business model built around a web of links and pages to a model of answers.
The WorldWideWeb (W3) is a wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents.
Tim Berners-Lee, The WWW Project, 1993
The web, from its inception, was designed to be a network of documents, interlinked via Hypertext References (href). In the early days, this was more obviously apparent – getting content online meant writing a series of webpages, interlinking them as you saw fit, and uploading them to a server.

Honestly, we lost something as a society when we stopped using the marquee tag.
If you really want to throw a UX designer for a loop (and make them dislike you for being unnecessarily pedantic), point out that there’s really no such thing as a website in a technical sense. Just individual hypertext documents, all interlinked via unique URLs – a web.
Today, this framing is somewhat abstracted away. There is, of course, such a thing as a website – pages are rarely written individually as documents in raw HTML, instead being added via a content management system, stored in a database and rendered out via a templating engine to create a unified, coherent experience. And end users generally think about those experiences in terms of domains and/or a website name, rather than browsing around a web of interlinked individual URLs.
But the underlying core components of the web are still the URL, the page view and the hyperlink. Indeed, some of the largest challenges facing technical SEOs over the last decade-plus have been downstream of development teams creating platforms that don’t assume that hypertext–link–URL document paradigm, often in the name of unifying code bases between the web and other surfaces such as mobile app stores. Move away from that, and you move away from all the benefits the web as a paradigm provided, including that critical arbiter of audiences for your content – the aggregator.
Welcome to the internet
Have a look around
Anything that brain of yours can think of can be found
We’ve got mountains of content
Some better, some worse
If none of it’s of interest to you, you’d be the first
Bo Burnham, ‘Welcome to the Internet’, 2021
For my money, the best articulation of the economics of the web is Ben Thompson’s 2015 piece, ‘Aggregation Theory’.
For example, printed newspapers were the primary means of delivering content to consumers in a given geographic region, so newspapers integrated backwards into content creation (i.e. supplier) and earned outsized profits through the delivery of advertising. A similar dynamic existed in all kinds of industries, such as book publishers (distribution capabilities integrated with control of authors), video (broadcast availability integrated with purchasing content), taxis (dispatch capabilities integrated with medallions and car ownership), hotels (brand trust integrated with vacant rooms), and more. Note how the distributors in all of these industries integrated backwards into supply: there have always been far more users/consumers than suppliers, which means that in a world where transactions are costly owning the supplier relationship provides significantly more leverage.
The fundamental disruption of the Internet has been to turn this dynamic on its head. First, the Internet has made distribution (of digital goods) free, neutralizing the advantage that pre-Internet distributors leveraged to integrate with suppliers. Secondly, the Internet has made transaction costs zero, making it viable for a distributor to integrate forward with end users/consumers at scale.
Ben Thompson / Stratechery, Aggregation Theory 2015.
Because aggregators deal with digital goods, there is an abundance of supply; that means users reap value through discovery and curation, and most aggregators get started by delivering superior discovery.
Then, once an aggregator has gained some number of end users, suppliers will come onto the aggregator’s platform on the aggregator’s terms, effectively commoditizing and modularizing themselves. Those additional suppliers then make the aggregator more attractive to more users, which in turn draws more suppliers, in a virtuous cycle.
Ben Thompson / Stratechery, Defining Aggregators, 2017
It’s no coincidence that the largest companies of the internet age are all aggregators (Google, Meta, Amazon, with Apple and Microsoft acting as portals to those aggregators).
The core argument of many of the antitrust cases that came to a head in 2025 was that the centralisation of this power constitutes a monopoly – a position that US courts largely upheld, albeit with meagre penalties that did nothing to meaningfully dissuade that consolidation. (One of the important but less discussed aspects of AI, and one I think will become ever more meaningful in 2026, is how the emergence of ChatGPT gives Google cloud cover for more aggressive vertical integration – but that’s a whole other article.)
Google, or at least Google Search, is unique in that it is the aggregator of the open web. The other major aggregators all came later, and chose to run closed ecosystems. Amazon will happily optimise their experience to earn clicks from search, but they’ll never send that traffic back out of their ecosystem to a Shopify-run cart in order to complete a purchase.
If discoverability is, as Thompson argues, the valuable commodity of a digital economy, and on the web the mechanism of being discovered is the link leading to the page view, then the owner of those links is in a position to extract significant commercial value.
It’s no accident that selling links via Adwords remains by far the bulk of Google’s revenue, or that every other major aggregator runs an effectively closed platform. I’m sure Google, with the benefit of hindsight, would have much rather rolled out Google Shopping and Payments earlier and more aggressively, creating a vertically integrated shopping experience rather than sending the traffic out freely that in turn allowed Amazon to reach their critical mass. Closed networks allow the aggregator to extract a greater share of the value created within those networks. But not every truth was apparent in 1998, and it’s not as if selling links has not been a massively successful business for Google!
It is, however, a lesson the LLMs have learned.
On the surface, the mechanics of LLMs feel familiar to any SEO. There’s a crawler, following links around the web, visiting URLs, indexing and processing the content and then serving it in response to a user’s query.
But the user’s experience is fundamentally different. There is no link (or, at least, no link that anyone clicks – links from ChatGPT are interacted with a fraction of the time versus the traditional search result), no pageview, and no subsequent exchange of value. It is a closed network, only unlike a TikTok or an Instagram, which provide alternative incentives to drive creation (whether monetary, such as ads, or something more intangible, such as social status and a means of expression), there’s no recompense for the creator. The link-based value chain is redundant in an AI-first world.
Now, from the end user’s perspective, this centralisation is often a positive. The web may have been built around documents, but what users often really wanted was answers. And often the answer to a question might be spread across several documents, requiring the user to go hunting across several searches and links to get to the answer they want. ChatGPT has had a meteoric rise for a reason!
But one of the contradictions at the heart of AI is that it undermines the very structures that gave rise to the web – the link, the click, the pageview – while also relying on the web as its primary source of training data.
For AEO to truly develop as a channel and for AI to become a meaningful path to reaching customers, different models have to start to emerge.
One model is to replicate the mobile app store model, encouraging organisations to integrate directly with LLMs just as they integrated with mobile operating system APIs. And I’m sure this is the model OpenAI and Google would love – they get to effectively enforce app-store-style rents on the entire economy. I’m of the opinion that AI is very much in a bubble as I write this that’s due to pop, but if there is a case for the level of investment in infrastructure for AI, it’s the amount of commercial value that represents if it can be pulled off.
But there’s a contradiction at the heart of that model, at least for the supply side of that relationship. It runs the risk of being stuck in a commodified relationship and fighting in a race to the bottom.
‘The DoorDash problem’, coined by Nilay Patel of The Verge, is an articulation of this issue. Patel argues that to embrace AI is to give up ownership of the customer experience:
So what, exactly, is the DoorDash problem? Briefly, it’s what happens when an AI interface gets between a service provider, like DoorDash, and you, who might send an AI to go order a sandwich from the internet instead of using apps and websites yourself.
That would mean things like user reviews, ads, loyalty programs, upsells, and partnerships would all go away — AI agents don’t care about those things, after all, and DoorDash would just become a commodity provider of sandwiches and lose out on all additional kinds of money you can make when real people open your app or visit your website.
Nilay Patel, ‘The DoorDash Problem’, The Verge Decoder Podcast
Patel’s argument is not that AI adoption is optional, but that AI intermediaries collapse differentiation by absorbing the customer experience.
In this model, DoorDash ceases to be a destination and becomes an API. And once a business is reduced to an API, margins become the only remaining moat. Sooner or later, a competitor offers the same interface at a slightly lower cost, passing the savings on to the customer. The result is a race to the bottom.
This is the decision organisations face whether they articulate it or not. The mistake isn’t adopting AI interfaces. The mistake is adopting them without owning the economic surface they create.
The counterargument many companies made to him is that it’s their brand that acts as a moat. People won’t order ‘a taxi’, they’ll order ‘an Uber’ because they have a level of pre-existing brand trust with Uber that’s not there if AI is simply ordering from an unknown provider. That brand will protect them.
Frankly, I think that is at best naive. The consumer brand is, again, an artefact of 20th century economics, and the true currency of the digital age is discoverability. Uber won not because of their incredible emotional mass storytelling on TV (do they even do TV adverts, and would they even get the reach from them to accomplish that in 2026?), but because they consolidated taxi firms into one place, creating a new and superior user experience in the process versus the traditional ringing around various random providers in the hope that, maybe, a taxi might turn up, possibly, who can tell. (Not to mention a review system that disincentivised some drivers from the awkward thinly veiled racist rant about ‘that there London’ that made the average trip home such a joy!)
It’s the UX, not the brand, that differentiates every successful large scale business in the 21st century.

To just hand control of the product experience over to OpenAI would be insane! They would just be giving their moat away. But it’s also not an option to simply opt out of AI – ChatGPT is the fastest growing tech product in history, and any organisation ultimately has to meet the customer where the customer is.
What’s needed is a method to establish an alternative to the pageview as the point of value exchange. And into that void, enter MCP.
While still a fairly early-stage technology, Model Context Protocol or MCP is fast emerging as a standard for LLMs to communicate directly with other systems. It’s being positioned firstly as a solution for agentic systems to interact – the protocol that an LLM can use to tell a robot vacuum cleaner to start, or to request information from a SaaS data provider.
It is to AI what HTTP was to the web browser. A way of standardising communication between various systems in a way that promotes interoperation.
That in turn creates a new ‘chokepoint’ for value extraction, recreating a point of leverage for brands.
The MCP prompt and response creates a moment where organisations can both create and exchange value, requiring something in return from their customers that aligns with their strategic product and marketing goals. It provides an opportunity to interact in a differentiated way, offering something beyond the proverbial sandwich-delivery API.
MCP is just one part of the emerging stack, of course. Vector databases also change how data is stored, moving from the rigid exact match of SQL to a database of concepts that can be mixed and remixed in response to the customer’s prompt.
This change requires organisations to think holistically, from a product level on down, about how their services can fit into this paradigm. Simply sticking an MCP endpoint onto a web server is not going to be enough, just as simply sticking a print brochure on the web wasn’t enough in 1998. The companies that will win will be those that have products built around these new economic models, not those built around web-based economics.
AEO is, ultimately, not simply a technology problem. It’s a product and business model one too. The web rewarded those who understood links and views. AI will reward those who understand interfaces, protocols, and the new points of economic leverage.
The post Rendering, Style, and Layout: When Things Go Wrong appeared first on Merj.
At Tech SEO Connect 2025, our R&D Director, Giacomo Zecchini, showed how crawler behaviour differs from user rendering.
His findings show that the gap between what users see and what crawlers interpret is widening, especially as AI crawlers grow in influence but lag behind search engines in rendering sophistication.
A major focus of Giacomo’s talk was how rendering quirks, viewport expansion, and layout techniques like 100vh hero sections can cause crawlers to completely misunderstand page structure, hide key content, or misjudge its priority.

Watch the full talk here: https://www.youtube.com/watch?v=kZw6BsIytJU
Giacomo emphasised that many AI crawlers still do not execute JavaScript or complete user interactions, meaning:
Even when JavaScript is executed, crawlers process the page differently from users, resulting in mismatches in what gets indexed.
One of the standout insights from Giacomo’s talk is Google’s viewport expansion behaviour:
This can change how the content is shown, exposing or hiding information in ways teams never expect. Such rendering behaviour is unique to crawlers, not humans, so issues often go unnoticed.
Giacomo highlighted a common layout pitfall: using 100vh for hero banners or sections.
When viewport expansion occurs, elements sized with 100vh are recalculated against the expanded viewport, so a “fullscreen” hero can stretch far beyond a real screen’s height. If Google or AI crawlers incorporate layout position or ‘above-the-fold’ relevance into ranking signals, this can unintentionally downgrade the importance of key headings and content.
Even worse:
Giacomo’s recommendation: Use a max-height cap on fullscreen elements to preserve visual design without creating crawler distortions.
Lazy loading improves performance, but Giacomo reinforced a critical point:
Most crawlers don’t scroll.
Most crawlers don’t trigger user events.
Therefore:
Safer patterns include:
These ensure both humans and crawlers get the same content.
A consistent message in Giacomo’s talk: Content quality matters only if crawlers can see it.
SEO teams should consistently evaluate:
The “technical” in technical SEO is becoming more literal as we now optimise for rendering engines as much as for search algorithms.
Giacomo’s insights at Tech SEO Connect revealed that many SEO issues today are related to rendering.
From viewport expansion to the 100vh trap, from JS limitations to lazy-loading pitfalls, the modern crawler landscape requires far more precision in how pages are built and served.
As AI-driven discovery accelerates, ensuring that your layouts, behaviours, and rendering patterns align with crawler capabilities is a competitive advantage.
The post URL Encoding Done Right: Best Practices for Web Developers and SEOs appeared first on Merj.
Modern web developers and SEO teams share the complex challenge of creating URLs that function seamlessly across languages, platforms, and systems.
Whether you’re adapting links for international users or handling dynamic parameters in your APIs, getting URL encoding right matters. Small mistakes can lead to broken links, duplicate pages, and messy analytics.
This guide breaks down URL encoding best practices, focusing on UTF-8, common pitfalls, and solutions. By the end, you’ll know how to:
- Handle non-ASCII URLs such as café or カフェ
- Keep your <meta charset> declaration and your server settings aligned
- Declare <meta charset> early: the element containing the character encoding declaration must be serialized completely within the first 1024 bytes of the document
- Use consistent hexadecimal casing (%C3%A9 instead of %c3%a9) to avoid inconsistencies
- Check for already-encoded % characters before re-encoding so you don’t turn % into %25
- Use built-in functions such as encodeURIComponent (JS), quote (Python), URI.escape (Ruby), and url.QueryEscape (Go) to handle edge cases correctly

URL standards limit characters to an alphanumeric set (A-Z, a-z, 0-9) and a few special characters (-, ., _, ~). Characters outside this set, such as spaces, symbols, or non-ASCII characters, must be encoded to avoid misinterpretation by web systems.
URL encoding (percent-encoding) converts problematic characters into %-prefixed hexadecimal values. For example:
A space becomes %20, and é becomes %C3%A9. This encoding ensures that URLs remain consistent across browsers, servers, and applications, enabling seamless navigation.
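Python’s standard library illustrates this round trip with quote and unquote (a minimal example; safe="/" keeps the path separator unencoded):

```python
from urllib.parse import quote, unquote

# Encoding: unsafe characters become %-prefixed hex bytes
print(quote("a path/with spaces", safe="/"))  # a%20path/with%20spaces
print(quote("é"))                             # %C3%A9

# Decoding reverses the transformation
print(unquote("%C3%A9"))                      # é
```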
There are two primary reasons to use URL encoding:
UTF-8 (Unicode Transformation Format – 8-bit) is the most widely used encoding for URLs. It represents any Unicode character while remaining backwards-compatible with ASCII.
When non-ASCII characters appear in URLs, they are first encoded using UTF-8 and then percent-encoded.
Example: Encoding Non-ASCII Characters for the Word “Cat”
| Language | Word | UTF-8 Bytes | Encoded URL Path |
|---|---|---|---|
| Greek | Γάτα | CE 93 CE AC CF 84 CE B1 | https://example.com/%CE%93%CE%AC%CF%84%CE%B1 |
| Japanese | 猫 | E7 8C AB | https://example.com/%E7%8C%AB |
Avoid legacy encodings such as Shift-JIS (e.g., %94%4C for 猫), as they can lead to interoperability issues. RFC 3986 recommends using UTF-8 to maintain consistency and compatibility across systems.
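The difference is easy to demonstrate in Python: the same character yields different byte sequences, and therefore different encoded URLs, under UTF-8 versus a legacy encoding (Shift-JIS shown here purely for comparison):

```python
from urllib.parse import quote

cat = "猫"
print(cat.encode("utf-8").hex(" ").upper())     # E7 8C AB
print(quote(cat))                               # %E7%8C%AB (quote uses UTF-8 by default)
print(cat.encode("shift_jis").hex(" ").upper()) # 94 4C — a different byte sequence entirely
```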
While RFC 3986 states %C3%9C and %c3%9c are equivalent, many systems treat them as distinct values.
Real-World Impact:
A URL like https://example.com/caf%C3%A9 shared on social media might appear as https://example.com/caf%c3%a9 due to platform re-encoding.

Re-encoding URLs repeatedly creates infinite variations.
Real-World Scenario 1: E-commerce Faceted Navigation
A user visits an e-commerce store with a faceted navigation menu. They select “Black” as a filter, represented in the query as colour:Black.
An initial URL is created with : and " encoded:
https://www.example.com/products?facet=color%3A%22Black%22
After adding a price filter, the existing % becomes %25:
https://example.com/products?facet=color%253A%2522Black%2522&price%3A100-200
Subsequent clicks add a length filter, which further compounds the encoding and converts the existing %25 into %2525:
https://www.example.com/products?facet=color%25253A%252522Black%252522&price%25%3A100-200&length%3A30
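A common defensive pattern against this compounding is to decode repeatedly until the value stops changing, which also reveals how many times a parameter was encoded. A sketch, with a loop cap to guard against pathological input:

```python
from urllib.parse import unquote

def fully_decode(value: str, max_rounds: int = 5) -> str:
    """Unquote repeatedly until the value stabilises (or the cap is hit)."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value

print(fully_decode("color%25253A%252522Black%252522"))
# color:"Black"
```

Better still is to avoid double-encoding in the first place: encode raw parameter values exactly once when the URL is built, rather than re-encoding an already-assembled query string.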
Real-World Scenario 2: Login Redirect Return
1. A user visits https://example.com/products?facet=color:Black
2. They are sent to https://example.com/login?return_to=https://example.com/products?facet=color:Black
3. A second pass through the login logic nests the parameter again: https://example.com/login?return_to=/login?return_to=https://example.com/products?facet=color:Black

These loops create cluttered, error-prone URLs and can break navigation workflows.
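The underlying fix is to encode the return URL as an opaque parameter value when building the login link, so the nested query string can’t be re-parsed as part of the outer URL. A sketch in Python (safe="" forces every reserved character, including / and ?, to be encoded):

```python
from urllib.parse import quote, unquote

target = "https://example.com/products?facet=color:Black"
login_url = "https://example.com/login?return_to=" + quote(target, safe="")
print(login_url)
# https://example.com/login?return_to=https%3A%2F%2Fexample.com%2Fproducts%3Ffacet%3Dcolor%3ABlack

# On the server, decode exactly once to recover the original target:
print(unquote(login_url.split("return_to=", 1)[1]))
# https://example.com/products?facet=color:Black
```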
The robots.txt file guides crawler behaviour but has nuances when dealing with encoded URLs, such as:
In both scenarios, this can result in URLs that are presumed blocked remaining accessible, which can be hard to detect without regular log file analysis.
Example: Disallowing URLs With a Umlaut
Disallowing a decoded page with the character ü or Ü:
User-agent: *
# Upcase (titlecase)
Disallow: /Über-uns
# Downcase
Disallow: /über-uns
Disallowing an encoded page with the character ü:
# Example URL
# https://example.com/über-uns
User-agent: *
# Upcase UTF-8 characters
Disallow: /%C3%BCber-uns
# Downcase UTF-8 characters
Disallow: /%c3%bcber-uns
Disallowing an encoded page with the character Ü:
# Example URL
# https://example.com/Über-uns
User-agent: *
# Upcase UTF-8 characters
Disallow: /%C3%9Cber-uns
# Downcase UTF-8 characters
Disallow: /%c3%9cber-uns
The following rules cover all encoding variants:
User-agent: *
# uppercase Ü URLs
Disallow: /Über-uns
Disallow: /%c3%9cber-uns
Disallow: /%C3%9Cber-uns
# lowercase ü URLs
Disallow: /über-uns
Disallow: /%c3%bcber-uns
Disallow: /%C3%BCber-uns
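Generating those variants programmatically reduces the chance of missing one. A sketch that emits the rule set for a given path (assuming UTF-8 and that both uppercase- and lowercase-hex forms matter for your URLs; letter-case variants of the path itself would be passed in separately):

```python
import re
from urllib.parse import quote

def hex_lower(url: str) -> str:
    """Lowercase only the hex digits of percent-encoded bytes."""
    return re.sub(r"%([0-9A-Fa-f]{2})", lambda m: "%" + m.group(1).lower(), url)

def disallow_rules(path: str) -> list[str]:
    """Disallow rules covering the raw path plus its uppercase-hex
    and lowercase-hex percent-encoded forms."""
    encoded = quote(path)  # Python's quote emits uppercase hex
    variants = {path, encoded, hex_lower(encoded)}
    return ["Disallow: " + v for v in sorted(variants)]

for rule in disallow_rules("/Über-uns") + disallow_rules("/über-uns"):
    print(rule)
```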
If the HTTP header Content-Type or the HTML code uses a different encoding (e.g., EUC-JP), the byte sequence in the URL may be interpreted differently. Challenges include:
- A mismatch between the charset declared in the markup (<meta charset="...">) and the headers (such as Content-Type: text/html; charset=utf-8), complicating URL matching and display

URL encoding inconsistencies can fragment analytics data, causing the metrics of what should be a single page to appear as multiple entries.
By the time data reaches an analytics platform or server log, variations in percent-encoding can distort traffic, session, and performance reports—making it harder to draw accurate insights.
Variations in URL encoding can lead to:
To address these issues, configure web analytics tools to recognise and merge URLs that differ only in their encoding. Many tools include features for URL normalisation:
Google Analytics
Adobe Analytics
Server access logs can also be affected by URL encoding variations:
To address these issues:
Programming languages handle URL parsing differently, and most do not automatically normalise hex case in percent-encoded sequences.
Python’s urllib.parse module provides tools for URL encoding and decoding. However, it does not automatically normalise the case of hexadecimal values in percent-encoded sequences.
from urllib.parse import urlparse, quote
# Encoding
original_url = "https://example.com/café"
encoded_url = quote(original_url, safe='/:')
print(encoded_url) # Output: https://example.com/caf%C3%A9
# Parsing
url1 = urlparse("https://example.com/%C3%A9")
url2 = urlparse("https://example.com/%c3%a9")
print(url1.path == url2.path) # Output: False (case sensitivity issue)
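Since the library will not normalise hex case for you, you can do it yourself before comparing; a minimal sketch (the helper name is ours), following RFC 3986's rule that percent-encoding hex digits should be normalised to uppercase:

```python
import re
from urllib.parse import urlparse

def normalise_percent_encoding(url: str) -> str:
    # RFC 3986 section 6.2.2.1: hex digits in percent-encoded sequences
    # are case-insensitive; normalise them to uppercase
    return re.sub(r"%([0-9a-fA-F]{2})", lambda m: "%" + m.group(1).upper(), url)

url1 = urlparse(normalise_percent_encoding("https://example.com/%C3%A9"))
url2 = urlparse(normalise_percent_encoding("https://example.com/%c3%a9"))
print(url1.path == url2.path)  # Output: True
```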
Ruby’s URI module requires explicit encoding for non-ASCII characters and does not normalise hexadecimal casing.
require 'uri'
# Encoding
original_url = "https://example.com/café"
encoded_url = URI::DEFAULT_PARSER.escape(original_url)
puts encoded_url # Output: https://example.com/caf%C3%A9
# Parsing
url1 = URI.parse("https://example.com/%C3%A9")
url2 = URI.parse("https://example.com/%c3%a9")
puts url1 == url2 # Output: false (case sensitivity issue)
Go’s net/url package automatically encodes non-ASCII characters but does not normalise hexadecimal casing.
package main
import (
"fmt"
"net/url"
)
func main() {
// Encoding
originalUrl := "https://example.com/café"
encodedUrl := url.QueryEscape(originalUrl)
fmt.Println(encodedUrl) // Output: https%3A%2F%2Fexample.com%2Fcaf%C3%A9
// Parsing
url1, _ := url.Parse("https://example.com/%C3%A9")
url2, _ := url.Parse("https://example.com/%c3%a9")
fmt.Println(url1.String() == url2.String()) // Output: false (case sensitivity issue)
}
JavaScript’s encodeURIComponent function encodes URLs, but it does not normalise hexadecimal casing. The URL constructor can parse URLs but treats different casings as distinct.
// Encoding
const originalUrl = "https://example.com/café";
const encodedUrl = encodeURIComponent(originalUrl);
console.log(encodedUrl); // Output: https%3A%2F%2Fexample.com%2Fcaf%C3%A9
// Parsing
const url1 = new URL("https://example.com/%C3%A9");
const url2 = new URL("https://example.com/%c3%a9");
console.log(url1.pathname === url2.pathname); // Output: false (case sensitivity issue)
PHP’s urlencode function encodes URLs, but like other languages, it does not normalise hexadecimal casing.
// Encoding
$originalUrl = "https://example.com/café";
$encodedUrl = urlencode($originalUrl);
echo $encodedUrl; // Output: https%3A%2F%2Fexample.com%2Fcaf%C3%A9
// Parsing
$url1 = parse_url("https://example.com/%C3%A9");
$url2 = parse_url("https://example.com/%c3%a9");
echo $url1['path'] === $url2['path'] ? 'true' : 'false'; // Output: false (case sensitivity issue)
While there are many types of web servers, Apache and Nginx are two of the most common, and both offer some built-in handling for URL encoding.
Apache:
- AllowEncodedSlashes, NoDecode, and other directives for precise control
- mod_rewrite for advanced URL rewriting and redirection

Nginx:
- rewrite directives offering functionality similar to Apache's mod_rewrite

CDNs play a crucial role in managing and delivering content:
Cloudflare’s URL normalisation follows RFC 3986 and includes additional steps such as:
Caveat: Sometimes a URL that should return an error is normalised by the CDN and treated as valid. When a CDN does this, search engines may unexpectedly index the URL or return a 200 status response for URLs such as:
https://example.com///////%C3%9Cber-uns
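On the origin you can detect such URLs before the CDN normalises them into apparent validity; a minimal sketch (the helper name is ours):

```python
from urllib.parse import urlparse

def has_repeated_slashes(url: str) -> bool:
    # Paths like ///////%C3%9Cber-uns should usually 404 or redirect,
    # not resolve to the canonical page
    return "//" in urlparse(url).path

print(has_repeated_slashes("https://example.com///////%C3%9Cber-uns"))  # True
print(has_repeated_slashes("https://example.com/%C3%9Cber-uns"))        # False
```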
Web browsers ensure that user-entered URLs comply with Internet standards[1][2] before sending them to servers:
- Characters outside A-Z, a-z, 0-9, and the reserved symbols are percent-encoded (e.g., spaces become %20)
- For example, C++ & Python #1 in a form input becomes C%2B%2B%20%26%20Python%20%231

Forms using method="GET" append data to the URL. Encoding preserves spaces and symbols:
Example:
<form method="GET" action="/search">
<input type="text" name="query" value="C++ & Python #1">
<input type="submit" value="Search">
</form>
Encoded Result:
/search?query=C%2B%2B%20%26%20Python%20%231
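The same encoding can be reproduced in tooling; for example, Python's urlencode yields the result above when told to use quote (spaces as %20) instead of the default form-style quote_plus:

```python
from urllib.parse import urlencode, quote

params = {"query": "C++ & Python #1"}
# quote_via=quote encodes spaces as %20 rather than "+"
print(urlencode(params, quote_via=quote))  # query=C%2B%2B%20%26%20Python%20%231
```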
APIs often pass dynamic data in URLs.
Example:
GET /users?filter=role="admin"&country="US/Canada"
Encoded:
GET /users?filter=role%3D%22admin%22%26country%3D%22US%2FCanada%22
For marketing and analytics, UTM parameters commonly include spaces or special characters:
Example:
https://example.com/page?utm_source=newsletter&utm_campaign=Spring Sale 2025
Encoded:
https://example.com/page?utm_source=newsletter&utm_campaign=Spring%20Sale%202025

Mobile apps also use deep links with URL-like structures to route users directly to in-app content. Those parameters must also be encoded:
Example:
myapp://product/12345?referrer=John Doe
Encoded:
myapp://product/12345?referrer=John%20Doe

Proper URL encoding is crucial for web development, internationalisation, analytics and SEO. By understanding the nuances of URL encoding and implementing best practices, development, analytics, and SEO teams can ensure that their websites are accessible, efficiently crawled by search engines, and accurately tracked in analytics tools. If you’d like to learn more or need assistance implementing these strategies, get in touch.
The post URL Encoding Done Right: Best Practices for Web Developers and SEOs appeared first on Merj.
]]>The post Don’t Block What You Want: DuckDuckGo and Common Crawl to Provide IP Address API Endpoints appeared first on Merj.
]]>While your team focuses on stopping malicious bots, good crawlers get caught in the crossfire. DuckDuckGo processes 3 billion searches monthly. Common Crawl powers the training data behind major AI models. Block them, and your content becomes invisible to privacy-conscious users and AI-powered search.
DuckDuckGo (13 June 2025) and Common Crawl (22 June 2025) shipped a quiet but important upgrade: their crawler IP ranges are now available as structured data. No more brittle HTML parsing. No more manual updates. Just clean, fast, automatable bot management.
1. DuckDuckGo now expose their crawler IP ranges as structured JSON
2. Common Crawl now expose their IP ranges as structured JSON
3. The new approach replaces fragile HTML pages and makes change‑detection trivial (curl, jq, checksum)
4. Safelist the ranges in your WAF or let a bot‑management service (Akamai, Vercel, etc) handle it
5. Blocking these “good bots” is blocking 3 billion DuckDuckGo searches / month + the dataset that fuels many LLMs
6. We built a free search engine IP tracker that monitors every change hourly with years of history → search-engine-ip-tracker.merj.com/status
Historically, most crawler IPs were published in HTML pages or buried in documentation. That worked…until it didn’t.
Google provided the initial IP address structured endpoint schema and others are now adopting the same pattern.
The schema itself is defined as follows (you can also skip to the example below):
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "IP Prefix List",
"type": "object",
"properties": {
"creationTime": {
"type": "string",
"format": "date-time"
},
"prefixes": {
"type": "array",
"items": {
"type": "object",
"properties": {
"ipv4Prefix": {
"type": "string",
"format": "ipv4-cidr"
},
"ipv6Prefix": {
"type": "string",
"format": "ipv6-cidr"
}
},
"additionalProperties": false,
"oneOf": [
{
"required": [
"ipv4Prefix"
]
},
{
"required": [
"ipv6Prefix"
]
}
]
}
}
},
"required": [
"creationTime",
"prefixes"
],
"additionalProperties": false
}
A real world example using IPv4 and IPv6 looks like this:
{
"creationTime": "2025-06-19T12:00:00.000000",
"prefixes": [
{
"ipv6Prefix": "2600:1f28:365:80b0::/60"
},
{
"ipv4Prefix": "44.220.181.167/32"
},
{
"ipv4Prefix": "18.97.14.80/29"
},
{
"ipv4Prefix": "18.97.14.88/30"
},
{
"ipv4Prefix": "98.85.178.216/32"
}
]
}
You can track, verify, and act on them with a single curl command.
| Bot | Documentation | JSON Endpoint |
|---|---|---|
| DuckDuckBot | Help Page | https://duckduckgo.com/duckduckbot.json |
| DuckAssistBot | Help Page | https://duckduckgo.com/duckassistbot.json |
| CCBot | Common Crawl FAQ | https://index.commoncrawl.org/ccbot.json |
Fetch the JSON, diff or checksum it (e.g., with sha256sum), or parse it into your bot allowlist pipeline.

I’m passionate about helping companies standardise formats that can benefit millions of businesses. When I reached out to the CTO of Common Crawl and the CEO of DuckDuckGo about adopting machine-readable standards, both moved quickly to implement these changes. I want to thank them for their responsiveness and leadership. This is exactly the kind of industry collaboration we need more of.
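As a sketch of that pipeline step, the snippet below parses a payload following the shared schema (the sample values come from the example earlier) and computes a checksum for change detection; in production you would fetch one of the JSON endpoints listed above instead of using an inline sample:

```python
import hashlib
import json

# Sample payload following the shared schema (values from the example above);
# in practice this would be the response body from a crawler's JSON endpoint
body = (b'{"creationTime": "2025-06-19T12:00:00.000000", "prefixes": ['
        b'{"ipv6Prefix": "2600:1f28:365:80b0::/60"}, '
        b'{"ipv4Prefix": "44.220.181.167/32"}]}')

data = json.loads(body)
# Each prefix object has exactly one of ipv4Prefix or ipv6Prefix
prefixes = [p.get("ipv4Prefix") or p.get("ipv6Prefix") for p in data["prefixes"]]
# Hash the raw bytes so any change to the published list is detectable
checksum = hashlib.sha256(body).hexdigest()

print(prefixes)  # ['2600:1f28:365:80b0::/60', '44.220.181.167/32']
```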
This is part of a broader shift. As AI-native search grows, so does the importance of structured access to crawlers:
If a bot can’t reach your site, it can’t index it, it can’t train on it, and it can’t surface it in the increasingly fragmented AI powered search results of the future.
DuckDuckGo and Common Crawl are making it easier to tackle this problem head on and take advantage of the opportunities.
Make sure the bots you want are getting through.
The post Don’t Block What You Want: DuckDuckGo and Common Crawl to Provide IP Address API Endpoints appeared first on Merj.
]]>The post Introducing our Bing Webmaster Tools API Python Client appeared first on Merj.
]]>
Ready to start? Head straight to GitHub…
With the rise of new ways to search, our focus has been on expanding data pipelines outside the Google Search ecosystem. Bing’s significance has grown substantially, particularly with its integration with Copilot.
Google provides an official Python SDK for their APIs, which includes Search Console (GitHub link). Bing has a Python SDK, but only for Bing Ads (GitHub link).
What began as an internal tool for our data pipeline needs has evolved into something we believe can benefit the broader technical SEO community. By open-sourcing our client, we aim to provide a robust solution that matches the quality and reliability of official SDKs.
Key Features
The BWT API serves three main purposes: collecting marketing data for warehousing, monitoring search engine behaviour, and automating website maintenance tasks:
To begin using the wrapper:
pip install bing-webmaster-tools
export BING_WEBMASTER_API_KEY=your_api_key_here
3. Look under the examples subdirectory to see boilerplate command line interface scripts which can help you get started. These assume you are using the environment variable BING_WEBMASTER_API_KEY as described in the Basic Setup.
# Listing all your Bing website profiles
python examples/get_all_sites.py
# Do you have any blocked webpages that you don't know of?
python examples/get_blocked_webpages.py -s https://merj.dev
# Set the highest crawl rate available on a scale of 1-10 (default 5).
# Check if "crawl boost" is available, which is valuable for very large websites.
python examples/submit_crawl_settings.py -s https://merj.dev -r "10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10" --enable-boost
# Get all your XML sitemaps, RSS Feeds, etc.
python examples/get_feeds.py -s https://merj.dev -v
# Submit all your latest content
python examples/submit_batch_urls.py --site-url https://merj.dev --input file.txt
# Get all related words that contain a word.
python examples/get_keyword.py -q "seo"
# Get all the stats for a keyword
python examples/get_keyword_stats.py -q "seo"
# Get all URLs that have been fetched through the web UI
python examples/get_fetched_urls.py -s https://merj.com
# Get a list of URLs with issues and a summary table
python examples/get_crawl_issues.py -s https://merj.com
# Get the stats for pages matching a particular query
python examples/get_query_page_stats.py -s https://merj.dev -q "seo"
# Get the stats for a page with a particular query
python examples/get_query_page_detail_stats.py -s https://merj.dev -q "seo" -p "https://merj.dev/seo"
# Get a summary of all external links to a webpage
python examples/get_url_links.py -s https://merj.dev -l https://merj.dev/seo
# Get 3 summary tables: Crawl Stats, Status Codes and Issues.
python examples/get_crawl_stats.py -s https://merj.dev
# manage_parameters.py shows how you can use CLI args to perform different actions.
# List all parameters
python examples/manage_parameters.py -s https://merj.dev --list -v
# Add a parameter
python examples/manage_parameters.py -s https://merj.dev --add sort
# Remove a parameter
python examples/manage_parameters.py -s https://merj.dev --remove sort
# Enable a parameter
python examples/manage_parameters.py -s https://merj.dev --enable sort
# Disable a parameter
python examples/manage_parameters.py -s https://merj.dev --disable sort
The client is built with modern Python practices using asynchronous functions. The following example creates the authentication from an environment variable, gets all the known sites, then loops through them to output the traffic stats.
from bing_webmaster_tools import Settings, BingWebmasterClient

async def main():
    # Initialize client with settings from environment
    async with BingWebmasterClient(Settings.from_env()) as client:
        # Get all sites
        sites = await client.sites.get_sites()
        if not sites:
            print("No sites available")
            return

        test_site = sites[0].url
        print(f"Using site: {test_site}")

        # Get traffic stats
        traffic_stats = await client.traffic.get_rank_and_traffic_stats(test_site)
        print("Traffic Statistics:")
        for stat in traffic_stats:
            print(f"Date: {stat.date}")
            print(f"Clicks: {stat.clicks}")
            print(f"Impressions: {stat.impressions}")

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
The client handles all the complexity of API authentication, rate limiting, and data pagination, allowing you to focus on utilising the data rather than managing API interactions.
As this is our public v1.X.X release, it’s important to note current limitations:
We’re excited to see how marketing technology teams and the SEO community build upon this library. Please feel free to submit:
The post Introducing our Bing Webmaster Tools API Python Client appeared first on Merj.
]]>The post How Extra HTML Attributes in Canonical Tags Impact Search Engines appeared first on Merj.
Our study, sparked by a content management system (CMS) migration anomaly, discovered that certain attributes caused Google to ignore canonical tags. Read on to learn our methodology, data, and practical recommendations to ensure your canonical link tags work as intended.
A correct implementation of canonical link tags is essential for effective indexing of a website, as they help manage content duplication and direct search engines to the preferred version of a webpage.
This study focuses on how search engines, particularly Google, interpret canonical link tags with additional attributes other than rel="*" and href="*".
Our research was prompted by an anomaly observed during a CMS migration, where Google Search Console failed to recognise canonical link tags.
This study aims to address the following questions:
During a CMS migration project, we encountered an anomaly where the Google Search Console inspector tool failed to detect canonical link tags that were visibly present in both the raw HTML source and the rendered Document Object Model (DOM). The inspector reported that no canonical was present, even though it clearly was:
<link rel="canonical" crossorigin="anonymous" media="all" href="https://example.com/category/123-slug-here" />
This observation led to an investigation of Google’s documentation, which revealed a recent update (February 15, 2024) clarifying the extraction of rel="canonical" annotations:
The rel="canonical" annotations help Google determine which URL of a set of duplicates is canonical. Adding certain attributes to the link element changes the meaning of the annotation to denote a different device or language version.
This is a documentation change only; Google has always ignored these rel="canonical" annotations for canonicalisation purposes. The documentation explicitly mentioned four attributes that, when present, cause Google to disregard the canonical link tag: hreflang, lang, media, and type.
Google supports explicit rel="canonical" link annotations as described in [RFC 6596]. rel="canonical" annotations that suggest alternate versions of a page are ignored; specifically, rel="canonical" annotations with hreflang="lang_code", lang, media, and type attributes are not used for canonicalization. Instead, use the appropriate link annotations to specify alternate versions of a page; for example <link rel="alternate" hreflang="lang_code" href="url_of_page" /> for language and country annotations.
Great! We could confirm this was indeed an expected behaviour. But it also raised several questions, like: Are there other/undocumented attributes that may cause Google to entirely ignore the canonical link tag? How widespread is the usage of these problematic attributes? And, are there tools that already flag these attributes as being potentially problematic?
Our research methodology comprised the following steps:
Our analysis identified the top 10 most common attributes found in canonical link tags that Google does not flag as problematic:

Within our gathered dataset, the four attributes in canonical link tags known to cause issues ranked as follows:

Our testing yielded the following results:
URLs whose canonical link tags contained hreflang, lang, media, or type attributes were not identified as self-canonical in the GSC inspector, and the entire canonical tag is ignored.

A significant finding from our study is that many popular SEO crawling tools do not flag issues related to problematic attributes in canonical link tags. This oversight can lead to undetected canonicalisation problems, impacting the effectiveness of canonical implementations. It’s crucial to be aware of this limitation and take proactive measures: manually verify your canonical link tags using tools like Google Search Console’s URL inspection tool, or build custom systems that monitor canonical link tags and look for these specific attributes.
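A custom monitor of this kind need not be complex. Below is a minimal sketch using Python's standard-library HTML parser to flag canonical link tags carrying the four documented problematic attributes (the class and variable names are ours):

```python
from html.parser import HTMLParser

# Attributes Google documents as causing the canonical annotation to be ignored
PROBLEMATIC = {"hreflang", "lang", "media", "type"}

class CanonicalChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.issues = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            bad = PROBLEMATIC & set(attrs)
            if bad:
                self.issues.append((attrs.get("href"), sorted(bad)))

checker = CanonicalChecker()
checker.feed('<link rel="canonical" media="all" href="https://example.com/page">')
print(checker.issues)  # [('https://example.com/page', ['media'])]
```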
We captured the URL inspection testing using Google and Bing Webmaster tools. Example of Google Search Console Live Testing for a URL with a canonical containing problematic attributes (note how “None” is displayed for the User-declared canonical):

Google Search Console Live Testing for a URL with a canonical not containing problematic attributes:

While Google reports “None” for the User-declared canonical, hinting that it didn’t recognise a canonical link tag, Bing Webmaster Tools’ live URL inspection neither reported the presence of nor flagged issues with the canonical for any URLs in our testing set.
Bing Webmaster Tools live URL inspection test for a URL with a canonical containing problematic attributes (no data is shown regarding canonical link tags):

Bing Webmaster Tools live URL inspection test for a URL with a canonical not containing problematic attributes:

While our study focused primarily on Google’s and Bing’s interpretation of canonical link tags, further research around actual crawling and indexing may be required to determine whether other search engines, such as DuckDuckGo and Naver, handle canonical link tags with attributes in a similar manner.
Based on our findings, we propose the following recommendations:
- Use only the attributes rel and href.
- If extra data must be attached, prefer data-* attributes.
- Avoid attributes such as id, name, or content in canonical link tags, as they may become reserved words in the future.

Our research highlights critical findings regarding the interpretation of canonical link tags by Google, especially when additional attributes are present. Key takeaways include:
Canonical link tags with certain additional attributes, specifically hreflang, lang, media, and type, cause Google to disregard them.

Understanding these nuances is important for staying aligned with search engine guidelines and keeping a website healthy in search results, regardless of the insights provided by SEO tools.
The post How Extra HTML Attributes in Canonical Tags Impact Search Engines appeared first on Merj.
]]>The post Google’s JavaScript Rendering Capabilities appeared first on Merj.
]]>
This research puts to bed a question that still hangs over modern web development: can search engines fully and reliably render a JavaScript website?
Part 1 focuses exclusively on Google. We analysed data from over 100,000 Googlebot fetches, shedding light on critical questions regarding whether and how Google renders JavaScript, the effect of JS on the rendering queue and page discovery, and the impact of all this on SEO. Through rigorous analysis, we dispel common myths and provide practical guidance for anyone striving to optimise their web presence.
For a detailed exploration of the findings from this research read the full article on Vercel’s blog. There, you will discover the nuances of our methodology, the myths we addressed, and the practical implications of our findings for web developers and businesses alike.
Read the full article on Vercel’s blog here.
Vercel’s Frontend Cloud provides the developer experience and infrastructure to build, scale, and secure a faster, more personalized web. Customers like Under Armour, Chico’s, The Washington Post, Johnson & Johnson, and Zapier use Vercel to build dynamic user experiences that rank highly in search.
The journey does not end here. Part 2 of this research, which as I write this is already nearing completion, broadens our scope to examine the JavaScript rendering capabilities of other search engines, including non-Western search engines and large language model companies such as OpenAI (ChatGPT). Part 2 will complete your understanding of the global rendering landscape and offer further strategic insights for optimising web content across diverse search platforms.
For further information, or to talk to someone about how your technical infrastructure might be impacting search engine rendering and discovery, please contact us.
The post Google’s JavaScript Rendering Capabilities appeared first on Merj.
]]>The post Investigating Reddit’s robots.txt Cloaking Strategy appeared first on Merj.
]]>We recently noticed a post on X by @pandraus regarding Reddit’s robots.txt file. On 25 June 2024, u/traceroo announced that any automated agents accessing Reddit must comply with their terms and policies.
“In the next few weeks, we’ll be updating our robots.txt instructions to be as clear as possible: if you are using an automated agent to access Reddit, you need to abide by our terms and policies, and you need to talk to us. We believe in the open internet, but we do not believe in the misuse of public content.”
u/traceroo
The robots.txt file is crucial for managing web crawler interactions. An instruction to disallow all access, as seen in Reddit’s latest update, can lead to deindexing the entire site, posing significant risks to a site’s search engine presence and overall accessibility. Let’s explore the details and implications of this change.
The new robots.txt file is quite…blunt:
# Welcome to Reddit's robots.txt
# Reddit believes in an open internet, but not the misuse of public content.
# See https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy Reddit's Public Content Policy for access and use restrictions to Reddit content.
# See https://www.reddit.com/r/reddit4researchers/ for details on how Reddit continues to support research and non-commercial use.
# policy: https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy
User-agent: *
Disallow: /
This directive essentially blocks all crawlers from accessing Reddit. Normally, this would be a critical issue as it could lead to deindexing the entire domain. This situation has precedents, such as when OCaml inadvertently blocked their entire website.
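The effect of these two lines is easy to confirm with Python's standard-library robots.txt parser; a quick sketch (the crawler name is hypothetical):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# The public rules shown above
rp.parse(["User-agent: *", "Disallow: /"])

# Any generic crawler is blocked from every URL on the site
print(rp.can_fetch("MyCrawler/1.0", "https://www.reddit.com/r/programming/"))  # False
```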
Reddit’s move raises several questions. For example, Reddit still wants search engines and archivers to index their content, especially given recent agreements with entities like Google.
“There are folks like the Internet Archive, who we’ve talked to already, who will continue to be allowed to crawl Reddit.”
u/traceroo
This leads us to question whether search engines like Google might make exceptions for Reddit. However, this seems unlikely. It’s more plausible that Reddit might be serving different robots.txt files to different user-agents.
To investigate, we conducted a test using Google’s tools. Normally, a user-agent switcher could be used, but Reddit blocks agents pretending to be Google, thanks to search engines providing their IP address ranges.
We used Google’s rich snippet testing tool to retrieve the raw HTML. Our findings confirmed that Reddit is serving an entirely different robots.txt file to Google, which is commonly known as cloaking.

Reddit’s updated robots.txt file appears to block all crawlers, but our tests show this is more of a public stance than a practical restriction. Robots.txt files are generally not meant for consumption by the average user (i.e., anyone who isn’t a developer, SEO practitioner, or similar), therefore it is not perceived as deceptive for the purpose of “good bot” search engine content discovery.
For those interested, the full robots.txt rules can be found here.
The post Investigating Reddit’s robots.txt Cloaking Strategy appeared first on Merj.
]]>The post Discovering and Diagnosing a Google AdSense Rendering Bug appeared first on Merj.
]]>AdSense is Google’s advertising content platform where publishers can be paid to place advertisements on their webpages. While performing a search engine optimisation test that involved crawling and rendering tracking for a React application, we discovered an anomaly in the client’s server log data, which required further investigation.
We found that a part of the AdSense technology stack is not working as expected and the programmatic matching between ads and website content is impacted, thus creating an incomplete understanding of webpage content.
We uphold a strong ethical framework when it comes to disclosing discovered anomalies or bugs, especially those associated with pivotal platforms like Google AdSense. We strictly follow responsible disclosure guidelines, informing relevant parties well in advance of any public announcements.
This issue does not fall under the Google Bug Hunter Program, as it pertains primarily to a product operational anomaly rather than a security vulnerability. Despite not reporting this through the Google Bug Bounty Program, we reported this bug to Google representatives who liaise with the internal Google teams and observed a standard 90-day grace period before considering public disclosure.
We have extended these standard disclosure timelines to facilitate a resolution, although the issue remains unresolved.
| Date | Subject | Action |
| June 1st, 2023 | Merj | Discovered the bug |
| June 8th, 2023 | Merj | We sent an email to Gary Illyes, a member of the Google Search Team, describing the bug and its impact. |
| June 27th, 2023 | Google | Gary Illyes responded, stating that he had consulted with the rendering team and would notify the administrators responsible for the Mediapartners-Google crawlers. As the owners of the Mediapartners-Google crawlers are not part of the Search team, Search team members have no influence over them. |
| September 15th, 2023 | Merj | We sent a follow-up email asking for any updates regarding the bug. |
| October 24th, 2023 | Merj & Google | We had an in-person conversation with Gary Illyes about the bug at the Google Search Central Live Zurich event. |
| April 23rd, 2024 | Merj | Public disclosure of the issue |
Google AdSense is an advertising program run by Google. It allows website owners (publishers) to monetise their content by displaying targeted advertisements. These ads are generated by Google and can be customised to match the website’s content. Publishers earn revenue when visitors click on or view these advertisements.

Source: https://adsense.google.com/start/resources/best-format-your-site-for-adsense/
Google AdSense offers publishers a variety of ad units to display on their websites. Here are some of the common types of ad units:
Google Ads is a platform that enables businesses (advertisers) to create and manage online advertisements, targeting specific audiences based on keywords, demographics, and interests. These advertisements can appear on various Google services, such as search results, YouTube, and partner websites.
Advertisers that use Google Ads can place their ads on websites that participate in the AdSense program. This symbiotic relationship enables businesses to reach a wider audience through targeted advertising, while website owners can generate revenue by hosting relevant advertisements on their platforms.
Google AdSense works by matching ads to your site based on your content and visitors; having webpages with incorrect, partially rendered, or blank content impacts the matching of ads and webpages. To analyse webpage content, Google AdSense employs a specific User-Agent known as ‘Mediapartners-Google’.
Google AdSense employs various methods for delivering targeted ads. Contextual Targeting uses factors such as keyword analysis, word frequency, font size, and the overall link structure of the web to ascertain a webpage’s content accurately and match ads accordingly.
However, without access to a page’s full content, any targeting based on page content cannot be accurate.
Websites that block AdSense infrastructure from accessing their content can considerably affect the precision of ad targeting, potentially resulting in diminished clicks and, consequently, reduced revenue. This has an impact on both publishers and advertisers.
Misunderstanding the content on the page could result in more severe consequences. If the publisher sends irrelevant traffic to advertisers, the Adsense Platform may limit or disable ad serving.
The misattribution of User-Agents in server access logs can lead to incorrect assumptions about search engines’ crawling and rendering of webpages.
Additionally, it can result in inaccurate conclusions about the sources of crawling traffic and the effectiveness of strategies or updates made on the website, potentially leading to misguided decision-making.
Every time a web browser requests a website, it sends an HTTP Header called the “User-Agent”. The User-Agent value contains information about the web browser name, operating system, and device type. The User-Agent is present in both webpages and page resource requests.
Search Engine crawlers use their own custom User-Agent, when fetching webpages and page resources. Before starting to download a specific URL, Search Engines check if they are allowed to fetch a specific URL by parsing the robots.txt.
Without debating on “if” and “how” the use of the robots.txt to block crawlers is suitable, here below is a simplified step-by-step pipeline of robots.txt effect on a search engine’s crawling and rendering process:

If Step 1 fails:
If Step 1 is completed but one of the other steps fails:
With ongoing efforts to bring our Search Engine Web Rendering Monitoring solution into production, we have been closely monitoring the number of webpages being crawled and the time delta within which those webpages are rendered. Working with server logs that contain terabytes of data, we utilise a custom in-house enrichment and query engine (similar to Splunk) that enables us to drill into the data with complex logic.
The server access logs have started showing anomalies over a 6-week period, with fetches of page resources where the referrer points to webpages that are normally blocked for Googlebot. First, we needed to check the data pipelines and data integrity. This involved reviewing any code changes and container failures that may have created some unexpected edge cases both at our source and further upstream. We are often second consumers of server logs because of Personal Identifiable Information (PII) and Payment Card Industry Data Security Standard (PCI-DSS) requirements. Examples of transformations include:
The Traffic Engineering and Edge teams managing the upstream ingress point (for instance, a CDN like Cloudflare, Akamai, or Fastly) confirmed that no changes had been made. We reprocessed our data, which yielded the same anomalies.
Once the data source has been validated, the next step is to reproduce and isolate the anomaly to confirm its existence and understand its behaviour. Here’s how to replicate the issue:
# Googlebot
user-agent: Googlebot
disallow: /reviews
# Mediapartners-Google
user-agent: Mediapartners-Google
allow: /
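The robots.txt above can be checked programmatically. Below is a quick sketch using Python's `urllib.robotparser`, which serves here as a simplified approximation of Google's matcher (for example, it does not model the AdSense crawler's behaviour of ignoring the global `*` group):

```python
# Check how the example robots.txt treats each crawler, using
# urllib.robotparser as a simplified stand-in for Google's matcher.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Googlebot
user-agent: Googlebot
disallow: /reviews

# Mediapartners-Google
user-agent: Mediapartners-Google
allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://domain.com/reviews/139593"
print(parser.can_fetch("Googlebot", url))             # False: blocked
print(parser.can_fetch("Mediapartners-Google", url))  # True: allowed
```

With these rules, the reviews path is crawlable for the AdSense bot but not for Googlebot, which is exactly the asymmetry that makes misattributed requests detectable in the logs.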
APACHE
66.249.64.4 - - [28/Jul/2023:04:17:10 +0000] 808840 "POST /graphql-endpoint HTTP/1.1" 200 56 "https://domain.com/reviews/139593" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"
NGINX
66.249.64.4 - - [28/Jul/2023:04:17:10 +0000] "POST /graphql-endpoint HTTP/1.1" 200 56 "https://domain.com/reviews/139593" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"
Note: For webpages accessible by both “Mediapartners-Google” and “Googlebot,” the above Server Access Logs approach to detect incorrect User-Agent attribution may not be effective. In such specific cases, more advanced solutions, such as our Search Engine Rendering Monitoring tool, are required.
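The referrer-based check in the log examples above can be automated. The sketch below assumes a combined-log-style format and uses the `/reviews` prefix from the example robots.txt; the regex and the blocked-prefix list are illustrative, not a universal log parser:

```python
# Flag requests that carry a Googlebot User-Agent but whose referrer is a
# path that robots.txt blocks for Googlebot — the misattribution signal
# described above. Format and prefixes are assumptions from this post.
import re
from urllib.parse import urlparse

GOOGLEBOT_BLOCKED_PREFIXES = ("/reviews",)  # from the example robots.txt

# Combined-log-style line: ip - - [ts] "request" status size "referrer" "ua"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<ua>[^"]*)"$'
)

def is_misattributed(line: str) -> bool:
    match = LOG_RE.match(line)
    if not match:
        return False
    if "Googlebot" not in match.group("ua"):
        return False
    referrer_path = urlparse(match.group("referrer")).path
    return referrer_path.startswith(GOOGLEBOT_BLOCKED_PREFIXES)

line = ('66.249.64.4 - - [28/Jul/2023:04:17:10 +0000] '
        '"POST /graphql-endpoint HTTP/1.1" 200 56 '
        '"https://domain.com/reviews/139593" '
        '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; '
        'Googlebot/2.1; +http://www.google.com/bot.html) '
        'Chrome/114.0.1.2 Safari/537.36"')
print(is_misattributed(line))  # True
```

Running this over a log file and counting the flagged lines gives a rough lower bound on misattributed fetches, subject to the limitation in the note above.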
To assess the potential implications of the issue on actual websites, we acquired the list of US and UK websites utilising Google AdSense from BuiltWith.com and developed a tool to identify the possible impact of the issue on these websites.
The robots.txt files of most websites we analysed are small and contain rules only for the global User-Agent (*), which the AdSense crawler ignores. Because the AdSense crawler only respects rules set specifically for Mediapartners-Google, this significantly increases the number of websites that may be affected.
We did this by using the following simplified logic that approximates the potential magnitude of the websites that may be impacted:

Upon executing the tool on the BuiltWith list, which covers around 7 million US websites and 2 million UK websites, we determined that around 5.5 million websites may potentially be impacted by this issue.
UK websites
| Status | Number |
| --- | --- |
| Websites from the BuiltWith list | 1,946,633 |
| Testable websites | 974,536 |
| Potentially impacted websites | 938,413 |
| Non-impacted websites | 36,123 |
US websites
| Status | Number |
| --- | --- |
| Websites from the BuiltWith list | 6,827,954 |
| Testable websites | 4,540,894 |
| Potentially impacted websites | 4,363,028 |
| Non-impacted websites | 177,866 |
On analysis of the robots.txt files, we can see that most of the response bodies are relatively small.
UK websites robots.txt bytes (compressed)
| Percentile | Bytes |
| --- | --- |
| 25th | 137 |
| 50th (Median) | 137 |
| 75th | 137 |
| 95th | 213 |
| 99th | 748 |
US websites robots.txt bytes (compressed)
| Percentile | Bytes |
| --- | --- |
| 25th | 137 |
| 50th (Median) | 137 |
| 75th | 137 |
| 95th | 575 |
| 99th | 1,145 |
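Percentile tables like the ones above can be produced with the standard library alone. The byte sizes below are illustrative sample data, not the real dataset:

```python
# Compute robots.txt size percentiles from a list of response body
# sizes. The sizes here are illustrative, not the measured dataset.
from statistics import quantiles

sizes = [137] * 90 + [213] * 5 + [575] * 3 + [748, 1145]  # bytes

# 99 cut points; percentiles[p - 1] approximates the p-th percentile.
percentiles = quantiles(sizes, n=100, method="inclusive")
for p in (25, 50, 75, 95, 99):
    print(f"{p}th percentile: {percentiles[p - 1]:.0f} bytes")
```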
It is difficult, within the scope of this article, to provide an exact prediction of the number of websites currently impacted. While 5.5 million websites may be affected by the issue, they would only experience a negative impact if they exhibit certain specific characteristics, such as serving primary content via JavaScript and blocking a portion of requests using specific robots.txt rules.
Our analysis provides a broad overview of potential impacts without hands-on verification. To identify if a site is affected, a more complex assessment would be necessary, involving the comparison of a site’s initial and rendered HTML. This requires a level of testing that goes beyond our current scope, emulating search engine behaviours to extract and analyse a page’s primary content.
The web is inherently broken, and simple methods, like checking the `<main>` HTML tag, fall short due to the web’s inconsistency and the varying adherence to best practices among servers and websites. Other approaches, such as comparing initial and rendered HTML sizes or word count differences, are imprecise and unreliable, potentially leading to the publication of incorrect data.
Given the complexity of automating the test, we have opted to describe a straightforward method for self-diagnosing the issue in the FAQ section. This approach allows users to assess their websites independently.
The ideal test to assess the impact on Google AdSense in this scenario would be to quantify the number of websites affected by the issues that display inappropriate ads, yet this is unfeasible.
Google AdSense utilises a variety of ad-matching techniques that go well beyond contextual targeting. This comprehensive approach offers a broad spectrum of ad targeting possibilities, ranging from matches based on content to ads chosen by advertisers for specific placements and those tailored to user interests.
While publishers can customise the types of ad categories permitted on their site, they have limited influence over the exact ads that are shown. Moreover, the presence of ads that seem to not align with the site content could be attributed to advertisers who have set overly broad or generic targeting criteria rather than an issue with the ad targeting system itself.
Due to this complexity, it’s not possible to determine whether a website is displaying incorrect ads based solely on the issue we discovered.
As an alternative method to estimate whether websites affected by the issue might see an impact on revenue, publishers can use the revenue calculator to get an idea of how much they should earn with AdSense.

In the calculator, you can select region, category, and monthly page views to get an estimate. The calculator itself emphasises that the estimate should only be used as a reference and that numbers may vary, but it could be useful to have an idea of the missing revenue if the numbers differ significantly from what publishers can see in the AdSense dashboard.
Google Ads is not directly affected by the issue. We have examined the Google Ads crawler’s requests, and for the tested websites, it is sending the correct User-Agent for all fetches. Nonetheless, advertisers may observe an impact of this issue on the quality of traffic, click-through rate (CTR), and indirectly on revenue.
Access logs are not commonly used by publishers or advertisers, yet they can be valuable for analysis or for establishing a business case for technical modifications.
Using the methodology described in the ‘Reproduction and Isolation of the Issue’ section, we examined the access logs of multiple websites for different clients. Our findings revealed that, depending on the scale of the website, the percentage of misattributed ‘Mediapartners-Google’ fetches using the ‘Googlebot’ User-Agent can range from 20% to 70% of the total ‘Googlebot’ requests.
This substantial discrepancy in the access logs analysis can significantly distort any analysis.
While Google has confirmed it is a bug, they have not yet fixed it. Businesses can work around the issue by ensuring essential assets that are used to render a webpage, such as API endpoints, scripts, stylesheets and images, are not blocked by robots.txt for both “Mediapartners-Google” and “Googlebot” User-Agents.
To effectively understand the impact of issues within server access logs, it is crucial to employ a systematic approach to log analysis. The method outlined in the “Reproduction and Isolation of the Issue” section provides a simple way to filter the access logs by removing the pages that Googlebot can’t crawl. It’s worth remembering that this approach offers only a partial view of the problem, as it catches only those pages that are blocked for Googlebot but not for Mediapartners-Google.
It is recommended that you use more advanced filtering techniques to fully understand the issue’s impact. For a detailed and comprehensive analysis of your server access logs, we encourage you to get in touch with us.
The Google AdSense rendering bug is a technical issue in which ads served by Google AdSense might not display correctly on publishers’ websites.
This problem presents itself due to discrepancies in how pages are rendered when different rules are applied to Googlebot and the AdSense bot (“Mediapartners-Google”). If these bots are treated differently by your site’s robots.txt, it can lead to improper ad display.
To diagnose the issue, review your robots.txt, checking for any directives that might block “Googlebot” from accessing certain URL paths on your site that are not similarly restricted for the AdSense bot (“Mediapartners-Google”).
If your website is using Client Side Rendering and/or the main content of the webpages is generated dynamically at rendering time using additional JavaScript requests, it’s crucial to ensure that both “Googlebot” and “Mediapartners-Google” have equal access to these JavaScript resources and the resultant content paths.
Discrepancies in access permissions between these bots can lead to issues and prevent proper rendering.
A quick fix to address the rendering bug involves aligning the access rules for both “Googlebot” and the Google AdSense bot (“Mediapartners-Google”) in your robots.txt file.
Ensuring both bots have the same level of access to your site’s content can mitigate rendering issues. This approach helps ensure that even if requests are misattributed in server access logs, page rendering works as expected.
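For example, a robots.txt that mirrors the same rules in both groups might look like the following (the paths are illustrative):

```
# Both crawlers get identical rules, so rendering-critical resources
# are either blocked or allowed for both.
User-agent: Googlebot
Allow: /assets/
Disallow: /private/

User-agent: Mediapartners-Google
Allow: /assets/
Disallow: /private/
```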
Server Access Logs play a crucial role in diagnosing and understanding how web crawlers and bots interact with your website. These logs contain detailed records of every request made to your server, including those by Googlebot and the AdSense bot (“Mediapartners-Google”).
Even if your website is not affected by the rendering bug, the logs may contain misattributed requests. The consequence of this misattribution would be an inaccurate count of Googlebot requests: you would see more requests than there actually are. In your analysis, the number of Googlebot requests would be the sum of actual Googlebot requests plus the misattributed Google AdSense requests that use Googlebot as the User-Agent.
Google’s documentation details the IP ranges for verifying Googlebot and other Google crawlers, organising these ranges into multiple files.
This categorisation seemingly simplifies filtering processes for our use case: Googlebot IPs are classified as “Common Crawlers”, while Google AdSense IPs are deemed “Special Case Crawlers”. Initially, one might expect to filter Googlebot requests using the googlebot.json IP ranges and exclude those listed in special-crawler.json.
However, the situation is more complex. The misattributed requests actually originate from genuine Googlebot IP addresses. It appears that the Google AdSense bot uses Googlebot’s infrastructure to crawl resources rather than just misattributing the User-Agent string.
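Even though IP filtering cannot separate the misattributed requests here, verifying that a request's IP falls within the published ranges remains a useful first step. The sketch below hardcodes a small sample in the shape of `googlebot.json` (a `"prefixes"` list of `"ipv4Prefix"`/`"ipv6Prefix"` entries); in practice you would fetch and periodically refresh the published files:

```python
# Verify an IP against Googlebot's published ranges using the standard
# library. The prefixes below are a hardcoded sample in the shape of
# googlebot.json; fetch the real file in production.
import ipaddress

GOOGLEBOT_RANGES = {
    "prefixes": [
        {"ipv4Prefix": "66.249.64.0/27"},
        {"ipv6Prefix": "2001:4860:4801:10::/64"},
    ]
}

def in_googlebot_ranges(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    for prefix in GOOGLEBOT_RANGES["prefixes"]:
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if addr in ipaddress.ip_network(cidr):
            return True
    return False

print(in_googlebot_ranges("66.249.64.4"))   # True: inside 66.249.64.0/27
print(in_googlebot_ranges("203.0.113.7"))   # False: documentation range
```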
The most straightforward approach to verifying and filtering Server Access Logs is examining the request referrer URLs. Specifically, for requests identified with a Googlebot User-Agent, the presence of a referrer page that is blocked to Googlebot but accessible to the Google AdSense bot (‘Mediapartners-Google’) could indicate incorrect attribution.
This technique, however, is limited in its applicability. It does not yield reliable insights for paths that are accessible to both Googlebot and the Google AdSense crawlers, as these scenarios do not facilitate clear differentiation based on robots.txt rules. To have a comprehensive filtering method, more advanced solutions, such as our Search Engine Rendering Monitoring tool, are required.
We would like to thank Aleyda Solis (LinkedIn, X/Twitter), Barry Adams (LinkedIn, X/Twitter), and Jes Scholz (LinkedIn, X/Twitter) for their thorough peer review of this article. Their experience and insightful suggestions have enhanced the depth and clarity of our analysis, allowing us to highlight key aspects and decisions made during the writing process for a more coherent and impactful delivery.
The post Discovering and Diagnosing a Google AdSense Rendering Bug appeared first on Merj.