Merj https://merj.com/ Tue, 03 Mar 2026 10:55:14 +0000 en-US hourly 1 https://wordpress.org/?v=6.6.1 https://merj.com/app/uploads/[email protected] Merj https://merj.com/ 32 32 Your Accept-Language Redirects Could Be Blocking Search Engines and AI Crawlers https://merj.com/blog/your-accept-language-redirects-could-be-blocking-search-engines-and-ai-crawlers https://merj.com/blog/your-accept-language-redirects-could-be-blocking-search-engines-and-ai-crawlers#respond Tue, 03 Mar 2026 10:55:13 +0000 https://merj.com/?p=2180 Locale-adaptive redirects are one of those things that “worked fine” until the crawler ecosystem changed. Search engines generally don’t use...

The post Your Accept-Language Redirects Could Be Blocking Search Engines and AI Crawlers appeared first on Merj.

Locale-adaptive redirects are one of those things that “worked fine” until the crawler ecosystem changed.

Search engines generally don’t use Accept-Language in crawl requests, but some AI crawlers do, often with default US-English values. If your site redirects HTML requests based on Accept-Language, you can accidentally funnel bots into the wrong locale, reduce coverage of non-English pages, and make debugging harder (especially when rendering is involved).

This post explains what we observed in testing.

TL;DR

  • Most search engine bots don’t send Accept-Language, so these redirects often didn’t fire historically.
  • Many AI crawlers send browser-like defaults (often en-US,en;q=0.9), which is usually not user intent.
  • Redirecting HTML based on Accept-Language can skew discovery and indexing toward the wrong locale, and gets messier during rendering.

Accept-Language redirects used to be “fine”

For years, technical SEO has relied on a stable rule: don’t do locale-adaptive redirects for search engines.

In practice, that meant avoiding redirects based on IP geolocation, cookie state, or “smart” locale detection. Instead, create clean, crawlable URLs for each locale and add hreflang annotations when needed.

Accept-Language wasn’t part of this conversation for a long time – because major search engine bots simply don’t send it in crawl requests. Google states this explicitly in their locale-adaptive pages documentation.

But search engines aren’t the only crawlers anymore.

AI crawlers are increasingly common, and many behave differently. They’re often less mature than major search engines, more experimental, and more likely to trigger edge cases we never had to consider.

The edge case we’ll consider in this post: AI bots may send Accept-Language headers.

If your platform’s redirect rules are based on Accept-Language, you risk creating redirect loops or blocking certain bots from accessing specific language content.

Quick refresher on the Accept-Language header

Accept-Language is an HTTP request header used for content negotiation: the client tells the server which natural languages it prefers. Browsers typically set it based on UI language and user preferences.

The HTTP semantics are standardised in RFC 9110.

It’s a preference signal, not a command. Treat it as “this is what the client might prefer,” not “always redirect me.”
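As a rough illustration of that preference structure (a minimal sketch, not a full RFC 9110 parser – it ignores whitespace edge cases and extension parameters), the header can be parsed into weighted language ranges:

```python
def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Parse an Accept-Language header into (language-range, quality)
    pairs, sorted by descending quality."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        if ";q=" in part:
            lang, q = part.split(";q=", 1)
            prefs.append((lang.strip(), float(q)))
        else:
            prefs.append((part, 1.0))  # no q-value means q=1
    return sorted(prefs, key=lambda p: p[1], reverse=True)

# The common "automated Chrome" default seen in crawler requests:
assert parse_accept_language("en-US,en;q=0.9") == [("en-US", 1.0), ("en", 0.9)]
```

Note how the default crawler value expresses a strong en-US preference, which is exactly why treating it as user intent is misleading.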

The “old” best practices

Google’s documentation on locale-adaptive pages is clear and consistent:

  • If you return different content based on perceived country or language preference, Google might not crawl, index, or rank all variants.
  • Googlebot’s default crawling IPs appear US-based.
  • Googlebot sends HTTP requests without Accept-Language.
  • Google recommends separate locale URLs and hreflang annotations.

Other major search engines follow similar rules.

In the old SEO world, Accept-Language redirects were rarely a problem since there was no header to trigger them. IP-based detection was the more obvious real risk.

What changed (and what we tested)

We tested major search engines, AI platforms, and web-access LLMs to see how they fetch HTML documents in the wild – specifically, which request headers they send and how they handle redirect chains.

Scope

What we tested   | Details
Content type     | HTML requests only
Signals observed | Accept-Language and redirect chains

Why focus on HTML?

HTML is where localisation decisions are usually enforced (often through redirects). It’s also the primary content search engines index and the main input LLM retrieval systems use for grounding.

What we observed

Across tested platforms, requests were typically sent either:

  1. without an Accept-Language header, or
  2. with a default “automated Chrome” header, usually: Accept-Language: en-US,en;q=0.9

Crucially, AI crawlers and retrieval systems didn’t actively adapt Accept-Language based on:

  • prompt language
  • user locale
  • browser language settings
  • conversation context

In other words: when the header exists, it’s rarely user intent — it’s an implementation default. And in some cases, even Googlebot can end up presenting the same default Accept-Language when it follows redirects during rendering.

Practical takeaway: redirecting HTML based on Accept-Language can reduce indexing quality and create “wrong language” retrieval for both search engines and LLM-driven systems.

Notable exceptions: Applebot and PetalBot behaved differently in our tests, based on internal custom logic.

Data Export

Below is the raw header behaviour we observed across platforms.

Platform / Agent | Accept-Language header
Googlebot (also used by Gemini) | Absent. en-US is present when a JS redirect is followed during rendering.
Adsbot-Google | Absent. en-US is present when a JS redirect is followed during rendering.
Mediapartners-Google (AdSense) | Absent. en-US is present when a JS redirect is followed during rendering.
Bingbot (also used by Microsoft / Copilot) | Absent.
Adidxbot (Microsoft Ads) | Absent.
YandexBot | Absent or en, *;q=0.01.
Yeti (Naver) | Absent.
Baiduspider | Can be absent or zh-cn,zh-tw. zh-CN,en;q=0.9,en-GB;q=0.8,en-US;q=0.7,fr;q=0.6 is present when a JS redirect is followed during rendering.
Sogou | Can be absent, en-US,en;q=0.9, or zh-CN.
DuckDuckBot | Absent or en-US,en;q=0.8,zh;q=0.6,es;q=0.4.
DuckDuckBot-Https | Absent or en,*.
Applebot (also used by Apple Intelligence / Spotlight) | Uses Accept-Language values that match the domain’s country code top-level domain (ccTLD). For .com domains, the header is absent. For localised domains like .de it sends de-DE, and for .co.jp it sends ja-JP.
OpenAI GPTBot | Absent.
OpenAI OAI-SearchBot | Absent.
OpenAI ChatGPT-User | Can be en-US,en;q=0.9 or absent.
Anthropic ClaudeBot | Absent.
Anthropic Claude-User | Absent.
Anthropic Claude-SearchBot | Absent.
Perplexity PerplexityBot | Absent.
Perplexity Perplexity-User | Absent.
MistralAI | Absent or en-US,en;q=0.9.
Bytespider | Can be absent, en-US,en;q=0.5, or zh,zh-CN;q=0.9.
DuckAssistBot | Absent.
meta-externalagent | Absent.
PetalBot | Absent most of the time. Inconsistently uses Accept-Language values that match the domain’s ccTLD: for .com domains the header is absent, .fr sends fr,en;q=0.8, but .at sends en.
Amazonbot | Absent.
TikTokSpider | Absent or en-US,en;q=0.5.
Pinterestbot | Absent.
CCBot | Absent or en-US,en;q=0.5.

Note: This table reflects data from February 2026 and may change over time. After we identified redirects linked to the Accept-Language header a few months ago, a subset of AI crawlers stopped including that header in their requests.

“Helpful” redirects that harm discovery

Accept-Language redirects are typically implemented to help users reach the right content immediately, bypassing language selector screens and additional interactions. For users, it’s possible to measure the success of having such redirects through conversion rates and interaction metrics. For bots, the effects are more complex and subtle.

Here’s the typical pattern:

  1. Bot requests canonical URL (e.g., /product)
  2. Server redirects based on Accept-Language
  3. Bot lands on /en/product (or worse: a generic homepage)
  4. Indexing and retrieval now skew toward English, even if better alternatives exist

This creates downstream problems:

  • Partial indexing: If English is the default redirect target, you’re training both search engines and LLM retrieval systems to prefer English – regardless of user intent. This can also influence answer content and citations.
  • Crawl inefficiency: Every redirect adds an extra hop, consuming time and resources.
  • Complex debugging: Not all teams have access to request headers in logs. This adds an extra layer of complexity and uncertainty.

Accept-Language might matter for Googlebot during rendering

In our tests, Googlebot’s initial crawl fetch did not include Accept-Language (as expected). However, when a redirect was triggered during rendering, the follow-up requests inherited some request headers from the browser instance, including the default language preference (Accept-Language: en-US).

This creates a tricky edge case: when a platform redirects based on Accept-Language, the redirect doesn’t trigger during HTML fetches but may trigger during rendering fetches.

In the official Google documentation, one of the suggested solutions to avoid soft 404s for SPAs is to use a JavaScript redirect to a URL for which the server responds with a 404 HTTP status code (for example, /not-found).

Depending on how the Accept-Language redirect is implemented, soft 404 handling can become inconsistent and indexing signals get muddy:

  1. Googlebot requests a soft-404 URL (e.g., /product)
  2. The page has a JavaScript redirect to /not-found
  3. The server intercepts the fetch, sees Accept-Language: en-US, and redirects again based on internal rules
  4. Googlebot lands on /en/not-found, on a page that does not return a 404, or, worse, on a generic homepage

Conclusion

Accept-Language isn’t a “bad” header; it’s a standard that platforms can use in multiple ways. What breaks is the assumption that crawlers behave like users.

The web now includes a wide range of crawlers, many of which send browser-like headers and aggressively explore edge cases.

Our internal testing across major bots and LLM platforms supports these statements:

  • Bots and LLM crawlers do not use Accept-Language as a localisation signal
  • When present, Accept-Language is typically default en-US,en;q=0.9
  • Therefore, Accept-Language based redirects for bots:
    • do not reliably improve user experience,
    • introduce content accessibility risks,
    • and can reduce indexing quality for both search engines and LLM retrieval systems.

LLMs naturally tend to prefer English over other languages – an emergent behaviour driven by English-heavy training data, tokeniser efficiency differences, and alignment signals. Redirecting based on Accept-Language reinforces this bias by forcing “English-only” content. This can unnecessarily exclude relevant non-English sources and create a partial, skewed view of available information.

We recommend avoiding redirects for bots based on the Accept-Language header. Build following internet standards, make URLs explicit, and keep redirects predictable.

When Search Stops Playing By Old Rules https://merj.com/blog/when-search-stops-playing-by-old-rules https://merj.com/blog/when-search-stops-playing-by-old-rules#respond Mon, 19 Jan 2026 10:20:51 +0000 https://merj.com/?p=2155 introduction SEO died again. 😟 It’s fine though. My profession just does that sometimes. This time, it’s supposed killer is...

The post When Search Stops Playing By Old Rules appeared first on Merj.

Introduction

SEO died again. 😟

It’s fine though. My profession just does that sometimes. This time, its supposed killer is AI and the emergence of the chat-based LLM. This has led in turn to the emergence of the field of AEO / GEO / many other acronyms.

However, while much of the discussion around that field gets into the nitty gritty of the technical detail – and as a professional dork, I’m as eager to charge down those rabbit holes as anyone – I wanted to zoom out and look at the wider picture of AEO. In this article, I’m going to outline how value is traditionally created and captured on the web, and why you can’t just transplant a business model built around a web of links and pages to a model of answers.

The economics of the web

The WorldWideWeb (W3) is a wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents.

Tim Berners-Lee, The WWW Project, 1993

The web, from its inception, was designed to be a network of documents, interlinked via Hypertext Reference (href). In the early days, this was more obviously apparent – getting content online meant writing a series of webpages, interlinking them as you saw fit, and uploading them to a server.

Honestly, we lost something as a society when we stopped using the marquee tag.

If you really want to throw a UX designer for a loop (and make them dislike you for being unnecessarily pedantic), point out that there’s really no such thing as a website in a technical sense. Just individual hypertext documents, all interlinked via unique URLs – a web.

Today, this framing is somewhat abstracted away. There is, of course, such a thing as a website – pages are rarely written individually as documents in raw HTML, instead being added via a content management system, stored in a database and rendered out via a templating engine to create a unified, coherent experience. And the end user generally thinks about those experiences in terms of domains and/or a website name, rather than browsing around a web of interlinked individual URLs.

But the underlying core components of the web are still the URL, the page view and the hyperlink. Indeed, some of the largest challenges facing technical SEOs over the last decade-plus have been downstream of development teams creating platforms that don’t assume that hypertext-link-URL document paradigm, often in the name of unifying code bases between the web and other surfaces such as mobile app stores. Move away from that, and you move away from all the benefits the web as a paradigm provided, including that critical arbiter of audiences for your content – the aggregator.

Abundance and Aggregation Theory

Welcome to the internet

Have a look around

Anything that brain of yours can think of can be found

We’ve got mountains of content

Some better, some worse

If none of it’s of interest to you, you’d be the first

Bo Burnham, ‘Welcome to the Internet’, 2021

For my money, the best articulation of the economics of the web is Ben Thompson’s 2015 piece, ‘Aggregation Theory’.

For example, printed newspapers were the primary means of delivering content to consumers in a given geographic region, so newspapers integrated backwards into content creation (i.e. supplier) and earned outsized profits through the delivery of advertising. A similar dynamic existed in all kinds of industries, such as book publishers (distribution capabilities integrated with control of authors), video (broadcast availability integrated with purchasing content), taxis (dispatch capabilities integrated with medallions and car ownership), hotels (brand trust integrated with vacant rooms), and more. Note how the distributors in all of these industries integrated backwards into supply: there have always been far more users/consumers than suppliers, which means that in a world where transactions are costly owning the supplier relationship provides significantly more leverage.

The fundamental disruption of the Internet has been to turn this dynamic on its head. First, the Internet has made distribution (of digital goods) free, neutralizing the advantage that pre-Internet distributors leveraged to integrate with suppliers. Secondly, the Internet has made transaction costs zero, making it viable for a distributor to integrate forward with end users/consumers at scale.

Ben Thompson / Stratechery, Aggregation Theory 2015.

Because aggregators deal with digital goods, there is an abundance of supply; that means users reap value through discovery and curation, and most aggregators get started by delivering superior discovery.

Then, once an aggregator has gained some number of end users, suppliers will come onto the aggregator’s platform on the aggregator’s terms, effectively commoditizing and modularizing themselves. Those additional suppliers then make the aggregator more attractive to more users, which in turn draws more suppliers, in a virtuous cycle.

Ben Thompson / Stratechery, Defining Aggregators, 2017

It’s no coincidence that the largest companies in the world for the internet age are all aggregators (Google, Meta, Amazon, with Apple and Microsoft acting as portals to those aggregators).

The core argument of many of the antitrust cases that came to a head in 2025 was that the centralisation of this power constitutes a monopoly – a position that US courts largely upheld, albeit with meagre penalties that did nothing to meaningfully dissuade that consolidation. (One of the important but less discussed aspects of AI, and one I think will become ever more meaningful in 2026, is how the emergence of ChatGPT gives Google cloud cover for more aggressive vertical integration – but that’s a whole other article.)

Google Search, and URL Economics

Google, or at least Google Search, is unique in that it is the aggregator of the open web. The other major aggregators all came later, and chose to run closed ecosystems. Amazon will happily optimise their experience to earn clicks from search, but they’ll never send that traffic back out of their ecosystem to a Shopify-run cart in order to complete a purchase.

If discoverability is, as Thompson argues, the valuable commodity of a digital economy, and on the web the mechanism of being discovered is the link leading to the page view, then the owner of those links is in a position to extract significant commercial value.

It’s no accident that selling links via Adwords remains by far the bulk of Google’s revenue, or that every other major aggregator runs an effectively closed platform. I’m sure Google, with the benefit of hindsight, would have much rather rolled out Google Shopping and Payments earlier and more aggressively, creating a vertically integrated shopping experience rather than sending the traffic out freely that in turn allowed Amazon to reach their critical mass. Closed networks allow the aggregator to extract a greater share of the value created within those networks. But not every truth was apparent in 1998, and it’s not as if selling links has not been a massively successful business for Google!

It is, however, a lesson the LLMs have learned.

The difference between web and AEO economics

On the surface, the mechanics of an LLM feel familiar to any SEO. There’s a crawler, following links around the web, visiting URLs, indexing and processing the content, and then serving it in response to a user’s query.

But the user’s experience is fundamentally different. There is no link (or, at least, no link that anyone clicks – links from ChatGPT are interacted with a fraction of the time versus the traditional search result), no pageview, and no subsequent exchange of value. It is a closed network, only unlike a TikTok or an Instagram, which provide alternative incentives to drive creation (whether monetary, such as ads, or something more intangible, such as social status and a means of expression), there’s no recompense for the creator. The link-based value chain is redundant in an AI-first world.

Now, from the end user’s perspective, this centralisation is often a positive. The web may have been built around documents, but what users often really want are answers. And often the answer to a question might be spread across several documents, requiring the user to go hunting across several searches and links to get to the answer they want. ChatGPT has had a meteoric rise for a reason!

But one of the contradictions at the heart of AI is that it undermines the very structures that gave rise to the web – the link, the click, the pageview – while also relying on the web as its primary source of training data.

For AEO to truly develop as a channel and for AI to become a meaningful path to reaching customers, different models have to start to emerge.

Commodification, brands, and ‘The DoorDash problem’

One model is to replicate the mobile app store model, encouraging organisations to integrate directly with LLMs just as they integrated with mobile operating system APIs. And I’m sure this is the model OpenAI and Google would love – they get to effectively enforce app-store-style rents on the entire economy. I’m of the opinion, as I write this, that AI is very much in a bubble that’s due to pop, but if there is a case for the level of investment in AI infrastructure, it’s the amount of commercial value that represents if it can be pulled off.

But there’s a contradiction at the heart of that model, at least for the supply side of that relationship. It runs the risk of being stuck in a commodified relationship and fighting in a race to the bottom.

‘The DoorDash problem’, coined by Nilay Patel of The Verge, is an articulation of this issue. Patel argues that to embrace AI is to give up ownership of the customer experience.

So what, exactly, is the DoorDash problem? Briefly, it’s what happens when an AI interface gets between a service provider, like DoorDash, and you, who might send an AI to go order a sandwich from the internet instead of using apps and websites yourself.

That would mean things like user reviews, ads, loyalty programs, upsells, and partnerships would all go away — AI agents don’t care about those things, after all, and DoorDash would just become a commodity provider of sandwiches and lose out on all additional kinds of money you can make when real people open your app or visit your website.

Nilay Patel, ‘The DoorDash Problem’, The Verge Decoder Podcast

Patel’s argument is not that AI adoption is optional, but that AI intermediaries collapse differentiation by absorbing the customer experience.

In this model, DoorDash ceases to be a destination and becomes an API. And once a business is reduced to an API, margins become the only remaining moat. Sooner or later, a competitor offers the same interface at a slightly lower cost, passing the savings on to the customer. The result is a race to the bottom.

This is the decision organisations face whether they articulate it or not. The mistake isn’t adopting AI interfaces. The mistake is adopting them without owning the economic surface they create.

The counterargument many companies made to him is that it’s their brand that acts as a moat. People won’t order ‘a taxi’, they’ll order ‘an Uber’, because they have a level of pre-existing brand trust with Uber that’s not there if AI is simply ordering from an unknown provider. That brand will protect them.

Frankly, I think that is at best naive. The consumer brand is, again, an artefact of 20th-century economics, and the true currency of the digital age is discoverability. Uber won not because of their incredible emotional mass storytelling on TV (do they even do TV adverts, and would they even get the reach from them to accomplish that in 2026?), but because they consolidated taxi firms into one place, creating a new and superior user experience in the process versus the traditional ringing around various random providers in the hope that, maybe, a taxi might turn up, possibly, who can tell. (Not to mention a review system that disincentivised some drivers from the awkward thinly veiled racist rant about ‘that there London’ that made the average trip home such a joy!)

It’s the UX, not the brand, that differentiates every successful large scale business in the 21st century.

To just hand control of the product experience over to OpenAI would be insane! They would just be giving their moat away. But it’s also not an option to simply opt out of AI – ChatGPT is the fastest-growing tech product in history, and any organisation ultimately has to meet the customer where the customer is.

What’s needed is a method of establishing an alternative to the pageview as the point of value exchange. And into that void, enter MCP.

MCP and the ownership of the experience

While still a fairly early-stage technology, Model Context Protocol or MCP is fast emerging as a standard for LLMs to communicate directly with other systems. It’s being positioned firstly as a solution for agentic systems to interact – the protocol that an LLM can use to tell a robot vacuum cleaner to start, or to request information from a SaaS data provider.

It is to AI what HTTP was to the web browser. A way of standardising communication between various systems in a way that promotes interoperation.

That in turn creates a new ‘chokepoint’ for value extraction, recreating a point of leverage for brands.

The MCP prompt and response creates a moment where organisations can both create and exchange value, requiring something in return from their customers that aligns with their strategic product and marketing goals. It provides an opportunity to interact in a differentiated way, offering something beyond the proverbial sandwich-delivery API.

MCP is just one part of the emerging stack, of course. Vector databases also change how data is stored, moving from the rigid exact match of SQL to a database of concepts that can be mixed and remixed in response to the customer’s prompt.

This change requires organisations to think holistically, from a product level on down, about how their services can fit into this paradigm. Simply sticking an MCP endpoint onto a web server is not going to be enough, just as simply sticking a print brochure on the web wasn’t enough in 1998. The companies that will win will be those that have products built around these new economic models, not those built around web-based economics.

AEO is, ultimately, not simply a technology problem. It’s a product and business model one too. The web rewarded those who understood links and views. AI will reward those who understand interfaces, protocols, and the new points of economic leverage.

Rendering, Style, and Layout: When Things Go Wrong  https://merj.com/blog/rendering-style-and-layout-when-things-go-wrong https://merj.com/blog/rendering-style-and-layout-when-things-go-wrong#respond Mon, 15 Dec 2025 13:00:00 +0000 https://merj.com/?p=2145 Crawler Rendering ≠ User Rendering At Tech SEO Connect 2025, our R&D Director, Giacomo Zecchini, showed how crawler behaviour differs...

The post Rendering, Style, and Layout: When Things Go Wrong  appeared first on Merj.

Crawler Rendering ≠ User Rendering

At Tech SEO Connect 2025, our R&D Director, Giacomo Zecchini, showed how crawler behaviour differs from user rendering.

His findings show that the gap between what users see and what crawlers interpret is widening, especially as AI crawlers grow in influence but lag behind search engines in rendering sophistication.

A major focus of Giacomo’s talk was how rendering quirks, viewport expansion, and layout techniques like 100vh hero sections can cause crawlers to completely misunderstand page structure, hide key content, or misjudge its priority.

Watch the full talk here: https://www.youtube.com/watch?v=kZw6BsIytJU

Main Takeaways

1. Crawlers don’t behave like users

Giacomo emphasised that many AI crawlers still do not execute JavaScript or complete user interactions, meaning:

  • Lazy-loaded content may never be added to the page.
  • Interactive content never triggers.
  • Pseudo-elements and display properties can hide essential information or reduce its perceived value.

Even when JavaScript is executed, crawlers process the page differently from users, resulting in mismatches in what gets indexed.

2. Viewport expansion changes layout

One of the standout insights from Giacomo’s talk is Google’s viewport expansion behaviour:

  • Google initially renders a page using a fixed viewport (e.g. 1024×1024 or 412×732).
  • Then Google expands the viewport height to match the full page.
  • This triggers lazy-loaded elements that rely on viewport boundaries.

This can change how the content is shown, exposing or hiding information in ways teams never expect. Such rendering behaviour is unique to crawlers, not humans, so issues often go unnoticed.

3. The 100vh trap

Giacomo highlighted a common layout pitfall: using 100vh for hero banners or sections.

When viewport expansion occurs:

  • The crawler recalculates 100vh based on the expanded viewport.
  • A hero section intended to fill one screen suddenly becomes much taller, sometimes thousands of pixels.
  • Primary content gets pushed dramatically down in the page.

If Google or AI crawlers incorporate layout position or ‘above-the-fold’ relevance into ranking signals, this can unintentionally downgrade the importance of key headings and content.

Even worse:

  • Lazy-loaded content beneath a 100vh hero element may never trigger, because the crawler never “reaches” the threshold height for activation.

Giacomo’s recommendation: Use a max-height cap on fullscreen elements to preserve visual design without creating crawler distortions.

4. Lazy loading that works for crawlers

Lazy loading improves performance, but Giacomo reinforced a critical point:

Most crawlers don’t scroll.

Most crawlers don’t trigger user events.

Therefore:

  • Scroll-based lazy loading fails silently.
  • Touchstart/wheel-based loading doesn’t activate.
  • Content that should be indexed simply never appears.

Safer patterns include:

  • Intersection Observer API
  • Static HTML fallbacks
  • Avoiding scroll-triggered content loading entirely

These ensure both humans and crawlers get the same content.

5. Optimise for what crawlers can see

A consistent message in Giacomo’s talk: Content quality matters only if crawlers can see it.

SEO teams should consistently evaluate:

  • Rendered DOM, not just raw HTML
  • Layout tree and computed styles
  • How viewport manipulation affects element position
  • Whether key content is pushed too far down
  • How AI crawlers differ from search crawlers in visibility

The “technical” in technical SEO is becoming more literal as we now optimise for rendering engines as much as for search algorithms.

Closing

Giacomo’s insights at Tech SEO Connect revealed that many SEO issues today are related to rendering.

From viewport expansion to the 100vh trap, from JS limitations to lazy-loading pitfalls, the modern crawler landscape requires far more precision in how pages are built and served.

As AI-driven discovery accelerates, ensuring that your layouts, behaviours, and rendering patterns align with crawler capabilities is a competitive advantage.

URL Encoding Done Right: Best Practices for Web Developers and SEOs https://merj.com/blog/url-encoding-done-right-best-practices-for-web-developers-and-seos https://merj.com/blog/url-encoding-done-right-best-practices-for-web-developers-and-seos#respond Tue, 24 Jun 2025 13:00:17 +0000 https://merj.com/?p=2014 Introduction Modern web developers and SEO teams share the complex challenge of creating URLs that function seamlessly across languages, platforms,...

The post URL Encoding Done Right: Best Practices for Web Developers and SEOs appeared first on Merj.

Introduction

Modern web developers and SEO teams share the complex challenge of creating URLs that function seamlessly across languages, platforms, and systems.

Whether you’re adapting links for international users or handling dynamic parameters in your APIs, getting URL encoding right matters. Small mistakes can lead to broken links, duplicate pages, and messy analytics.

This guide breaks down URL encoding best practices, focusing on UTF-8, common pitfalls, and solutions. By the end, you’ll know how to:

  • Safely encode non-ASCII characters such as café or カフェ
  • Avoid infinite URL loops in faceted navigation
  • Configure servers and CDNs to handle edge cases

Key Takeaways

  • Enforce UTF-8 throughout: Always use UTF-8 for URLs, your <meta charset> declaration and your server settings
  • Set the <meta charset> early: The element containing the character encoding declaration must be serialized completely within the first 1024 bytes of the document
  • Normalise casing early: Use uppercase hexadecimal values (e.g., %C3%A9 instead of %c3%a9) to avoid inconsistencies
  • Prevent double encoding: Check for existing % characters before re-encoding so you don’t turn % into %25
  • Implement redirect rules: Use 301 redirects to send all lowercase percent-encoded URLs to their uppercase equivalents
  • Use native encoding functions: Rely on built-in methods such as encodeURIComponent (JS), quote (Python), URI::DEFAULT_PARSER.escape (Ruby; the older URI.escape was removed in Ruby 3.0), and url.QueryEscape (Go) to handle edge cases correctly
  • Configure analytics tools and logs: Configure your analytics filters and server/CDN logging to treat differently encoded URLs as the same resource
  • Test and verify: Use developer software or online encoding and decoding tools to confirm every URL behaves as expected

Core Concepts

What is URL Encoding?

URL standards limit characters to an alphanumeric set (A-Z, a-z, 0-9) and a few special characters (-, ., _, ~). Characters outside this set, such as spaces, symbols, or non-ASCII characters, must be encoded to avoid misinterpretation by web systems.

URL encoding (percent-encoding) converts problematic characters into %-prefixed hexadecimal values. For example:

  • A space becomes %20
  • The letter é becomes %C3%A9

This encoding keeps URLs consistent across browsers, servers, and applications, ensuring seamless navigation.
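In Python, for example, this conversion is handled by the standard library's urllib.parse; the behaviour shown below is its default UTF-8 handling:

```python
from urllib.parse import quote, unquote

# Reserved and non-ASCII characters become %-prefixed UTF-8 byte values
print(quote(" "))     # %20
print(quote("café"))  # caf%C3%A9 (é is the UTF-8 bytes C3 A9)

# Decoding reverses the transformation
print(unquote("caf%C3%A9"))  # café
```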

Why Use URL Encoding?

There are two primary reasons to use URL encoding:

  1. Functionality: Nested data, such as query parameters, often includes characters such as spaces, commas, quotes, or brackets. Encoding ensures these characters don’t break the URL structure
  2. Localisation: URLs that include non-ASCII characters (e.g., Greek, Japanese, or Cyrillic scripts) must be encoded to work globally

UTF-8 Encoding: The Gold Standard

UTF-8 (Unicode Transformation Format – 8-bit) is the most widely used encoding for URLs. It represents any Unicode character while remaining backwards-compatible with ASCII.

When non-ASCII characters appear in URLs, they are first encoded using UTF-8 and then percent-encoded.

Example: Encoding Non-ASCII Characters for the Word “Cat”

Language | Word | UTF-8 Bytes | Encoded URL Path
Greek | Γάτα | CE 93 CE AC CF 84 CE B1 | https://example.com/%CE%93%CE%AC%CF%84%CE%B1
Japanese | 猫 | E7 8C AB | https://example.com/%E7%8C%AB
Avoid legacy encodings such as Shift-JIS (e.g., %94%4C for 猫), as they can lead to interoperability issues. RFC 3986 recommends using UTF-8 to maintain consistency and compatibility across systems.
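Both rows of the table can be reproduced with urllib.parse.quote, which applies UTF-8 and then percent-encoding in one step:

```python
from urllib.parse import quote

# Non-ASCII path segments: UTF-8 bytes first, then percent-encoding
print(quote("Γάτα"))  # %CE%93%CE%AC%CF%84%CE%B1
print(quote("猫"))    # %E7%8C%AB
```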

Common Pitfalls & Real-World Failures

1. Duplicate Content

While RFC 3986 states %C3%9C and %c3%9c are equivalent, many systems treat them as distinct values.

Real-World Impact:

  • A link to https://example.com/caf%C3%A9 shared on social media might appear as https://example.com/caf%c3%a9 due to platform re-encoding
  • Search engines may crawl and index both URLs as separate pages, which wastes crawl budget, creates duplication, and dilutes SEO value. They then have to decide which version carries the strongest signals, which may not be the preferred variant
  • Analytics may treat these as separate pages, leading to skewed traffic metrics and inaccurate reporting
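One defensive option is to normalise the hex case of every escape sequence before URLs are stored, compared, or redirected. A minimal Python sketch (the function name is illustrative):

```python
import re

def normalise_percent_case(url):
    """Uppercase the hex digits of every percent-escape (per RFC 3986 s6.2.2.1)."""
    return re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), url)

print(normalise_percent_case("https://example.com/caf%c3%a9"))
# https://example.com/caf%C3%A9
```

Running this as a normalisation step in redirects, analytics pipelines, and log processing collapses the variants into one canonical URL.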

2. Multi-pass Encoding Loops

Re-encoding URLs repeatedly creates infinite variations.

Real-World Scenario 1: E-commerce Faceted Navigation

A user visits an e-commerce store with a faceted navigation menu. They select “Black” as a filter, represented in the query as colour:Black.

An initial URL is created with : and " encoded:

https://www.example.com/products?facet=color%3A%22Black%22

After adding a price filter, the existing % characters become %25:

https://example.com/products?facet=color%253A%2522Black%2522&price%3A100-200

Subsequent clicks add a length filter, which further compounds the encoding and converts the existing %25 into %2525:

https://www.example.com/products?facet=color%25253A%252522Black%252522&price%25%3A100-200&length%3A30

Real-World Scenario 2: Login Redirect Return

  1. A customer starts at: https://example.com/products?facet=color:Black
  2. They then visit the login: https://example.com/login?return_to=https://example.com/products?facet=color:Black
  3. Multiple redirects end up as: https://example.com/login?return_to=/login?return_to=https://example.com/products?facet=color:Black

These loops create cluttered, error-prone URLs and can break navigation workflows.
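The "prevent double encoding" takeaway can be implemented by decoding to a stable canonical form before applying exactly one encoding pass. A minimal Python sketch (helper names are illustrative):

```python
from urllib.parse import quote, unquote

def fully_decode(value, max_passes=5):
    """Repeatedly decode until the value stops changing (guards against loops)."""
    for _ in range(max_passes):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value

def encode_once(value):
    """Decode to a canonical form first, then apply exactly one encoding pass."""
    return quote(fully_decode(value), safe="")

print(fully_decode("color%253A%2522Black%2522"))  # color:"Black"
print(encode_once("color%253A%2522Black%2522"))   # color%3A%22Black%22
```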

3. More Robots.txt Directives

The robots.txt file guides crawler behaviour but has nuances when dealing with encoded URLs, such as:

  • Case sensitivity: Path components can be case-sensitive, leading to unexpected results when uppercase and lowercase encodings differ
  • Disallow rules: Encoded characters in disallow rules may not match decoded URL requests

In both scenarios, this can result in URLs that are presumed blocked remaining accessible, which can be hard to detect without regular log file analysis.

Example: Disallowing URLs With a Umlaut

Disallowing a decoded page with the character ü or Ü:

User-agent: *
# Upcase (titlecase)
Disallow: /Über-uns
# Downcase
Disallow: /über-uns

Disallowing an encoded page with the character ü:

# Example URL
# https://example.com/über-uns

User-agent: *
# Upcase UTF-8 characters
Disallow: /%C3%BCber-uns
# Downcase UTF-8 characters
Disallow: /%c3%bcber-uns

Disallowing an encoded page with the character Ü:

# Example URL
# https://example.com/Über-uns

User-agent: *
# Upcase UTF-8 characters
Disallow: /%C3%9Cber-uns
# Downcase UTF-8 characters
Disallow: /%c3%9cber-uns

The following rules cover all encoding variants:

User-agent: *
# Uppercase Ü URLs
Disallow: /Über-uns
Disallow: /%c3%9cber-uns
Disallow: /%C3%9Cber-uns

# Lowercase ü URLs
Disallow: /über-uns
Disallow: /%c3%bcber-uns
Disallow: /%C3%BCber-uns
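Writing these variants by hand is error-prone. A small helper (illustrative, standard library only) can emit the decoded, lowercase-hex, and uppercase-hex forms for any path:

```python
import re
from urllib.parse import quote

def robots_variants(path):
    """Decoded, lowercase-hex, and uppercase-hex forms of a non-ASCII path."""
    encoded = quote(path)  # uppercase hex, UTF-8
    lowered = re.sub(r"%[0-9A-F]{2}", lambda m: m.group(0).lower(), encoded)
    return [path, lowered, encoded]

for rule in robots_variants("/über-uns"):
    print(f"Disallow: {rule}")
# Disallow: /über-uns
# Disallow: /%c3%bcber-uns
# Disallow: /%C3%BCber-uns
```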

4. HTML Metadata Conflicts

If the HTTP header Content-Type or the HTML code uses a different encoding (e.g., EUC-JP), the byte sequence in the URL may be interpreted differently. Challenges include:

  1. Lack of Metadata: URLs do not include metadata about their character encoding, making it difficult to interpret the percent-encoded bytes correctly
  2. Multiple Mappings: Servers would need to support numerous mappings to convert incoming strings into the appropriate encoding
  3. Webpage Discrepancies: The file system might use a character encoding that differs from the one declared in the HTML (via <meta charset="..."> or the headers, such as Content-Type: text/html; charset=utf-8), complicating URL matching and display

Impact on Analytics Reporting

URL encoding inconsistencies can fragment analytics data, causing the metrics of what should be a single page to appear as multiple entries.

By the time data reaches an analytics platform or server log, variations in percent-encoding can distort traffic, session, and performance reports—making it harder to draw accurate insights.

Web Analytics

Variations in URL encoding can lead to:

  1. Inconsistent URL tracking: Analytics tools may treat differently encoded versions of the same URL as separate resources, resulting in split data for what is essentially the same page
  2. Inaccurate metrics: Pageviews, bounce rates, and session data can all become skewed, leading to unreliable insights

To address these issues, configure web analytics tools to recognise and merge URLs that differ only in their encoding. Many tools include features for URL normalisation:

Google Analytics

  • Lowercase filter: Converts all incoming URLs to lowercase automatically
  • Search and replace filter: Standardises URL structures by replacing specific characters or patterns

Adobe Analytics

  • Processing rules: Allows manipulation and standardisation of URLs before final reporting
  • VISTA rules: Performs advanced server-side manipulations, including URL normalisation

Server Access Logs

Server access logs can also be affected by URL encoding variations:

  1. Inconsistent logging: Requests with different encodings might be logged as separate entries, even if they refer to the same resource
  2. Data aggregation difficulties: Variations in encoding make it harder to analyse logs and correlate user or search engine behaviour accurately

To address these issues:

  • Implement URL normalisation: Configure servers to normalise URLs before logging
  • Set up URL rewriting rules: Standardise URL encoding before logging
  • Use CDN logs: CDNs may provide normalised URLs and exclude cached requests, offering cleaner data than origin server logs
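As a sketch of what normalisation before aggregation looks like, encoded variants can be collapsed to one key when counting log entries (Python, illustrative):

```python
from collections import Counter
from urllib.parse import unquote

def log_key(path):
    """Collapse percent-encoding variants to one canonical key for aggregation."""
    return unquote(path)

hits = Counter(log_key(p) for p in [
    "/caf%C3%A9", "/caf%c3%a9", "/café",
])
print(hits)  # Counter({'/café': 3})
```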

How Major Languages Handle URL Encoding

Programming languages handle URL parsing differently, and most do not automatically normalise hex case in percent-encoded sequences.

Python

Python’s urllib.parse module provides tools for URL encoding and decoding. However, it does not automatically normalise the case of hexadecimal values in percent-encoded sequences.

from urllib.parse import urlparse, quote

# Encoding
original_url = "https://example.com/café"
encoded_url = quote(original_url, safe='/:')
print(encoded_url)  # Output: https://example.com/caf%C3%A9

# Parsing
url1 = urlparse("https://example.com/%C3%A9")
url2 = urlparse("https://example.com/%c3%a9")
print(url1.path == url2.path)  # Output: False (case sensitivity issue)

Ruby

Ruby’s URI module requires explicit encoding for non-ASCII characters and does not normalise hexadecimal casing.

require 'uri'

# Encoding
original_url = "https://example.com/café"
encoded_url = URI::DEFAULT_PARSER.escape(original_url)
puts encoded_url  # Output: https://example.com/caf%C3%A9

# Parsing
url1 = URI.parse("https://example.com/%C3%A9")
url2 = URI.parse("https://example.com/%c3%a9")
puts url1 == url2  # Output: false (case sensitivity issue)

Go

Go’s net/url package automatically encodes non-ASCII characters but does not normalise hexadecimal casing.

package main

import (
	"fmt"
	"net/url"
)

func main() {
	// Encoding
	originalUrl := "https://example.com/café"
	encodedUrl := url.QueryEscape(originalUrl)
	fmt.Println(encodedUrl)  // Output: https%3A%2F%2Fexample.com%2Fcaf%C3%A9

	// Parsing
	url1, _ := url.Parse("https://example.com/%C3%A9")
	url2, _ := url.Parse("https://example.com/%c3%a9")
	fmt.Println(url1.String() == url2.String())  // Output: false (case sensitivity issue)
}

JavaScript

JavaScript’s encodeURIComponent function encodes URLs, but it does not normalise hexadecimal casing. The URL constructor can parse URLs but treats different casings as distinct.

// Encoding
const originalUrl = "https://example.com/café";
const encodedUrl = encodeURIComponent(originalUrl);
console.log(encodedUrl);  // Output: https%3A%2F%2Fexample.com%2Fcaf%C3%A9

// Parsing
const url1 = new URL("https://example.com/%C3%A9");
const url2 = new URL("https://example.com/%c3%a9");
console.log(url1.pathname === url2.pathname);  // Output: false (case sensitivity issue)

PHP

PHP’s urlencode function encodes URLs, but like other languages, it does not normalise hexadecimal casing.

// Encoding
$originalUrl = "https://example.com/café";
$encodedUrl = urlencode($originalUrl);
echo $encodedUrl;  // Output: https%3A%2F%2Fexample.com%2Fcaf%C3%A9

// Parsing
$url1 = parse_url("https://example.com/%C3%A9");
$url2 = parse_url("https://example.com/%c3%a9");
echo $url1['path'] === $url2['path'] ? 'true' : 'false';  // Output: false (case sensitivity issue)

Handling URL Encoding and Log Management in Apache & Nginx

While there are many types of web servers, Apache and Nginx are two of the most common, and both offer some built-in handling for URL encoding.

Apache:

  • URL Normalisation: Typically normalises URLs before processing
  • Configuration Options: Offers the AllowEncodedSlashes directive (including its NoDecode option) and others for precise control
  • URL Manipulation: Provides mod_rewrite for advanced URL rewriting and redirection

Nginx:

  • URL Decoding: Normalises URLs by decoding percent-encoded characters
  • Security Measures: Avoids decoding percent-encoded slashes by default for security
  • Rewrite Module: Allows URL manipulation but is generally less configurable than Apache’s mod_rewrite

How CDNs Handle URL Encoding and Cache Management

CDNs play a crucial role in managing and delivering content:

  • Caching: Often cache content based on normalised URL versions
  • Security: Filter malicious URL requests using firewall rules
  • URL Normalisation: Offer functions to ensure consistent formatting

Cloudflare’s URL normalisation follows RFC 3986 and includes additional steps such as:

  • Converting backslashes to forward slashes
  • Merging consecutive forward slashes
  • Performing RFC 3986 normalisation on the resulting URL

Caveat: Sometimes a URL that should return an error is normalised by the CDN and treated as valid. When a CDN does this, search engines may unexpectedly index the URL or return a 200 status response for URLs such as:

https://example.com///////%C3%9Cber-uns
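A rough Python approximation of those normalisation steps (illustrative only — a CDN's actual implementation is more involved):

```python
import re
from urllib.parse import urlsplit, urlunsplit

def cdn_normalise(url):
    """Backslashes to slashes, collapse duplicate slashes, uppercase hex escapes."""
    parts = urlsplit(url.replace("\\", "/"))
    path = re.sub(r"/{2,}", "/", parts.path)
    path = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

print(cdn_normalise("https://example.com///////%c3%9cber-uns"))
# https://example.com/%C3%9Cber-uns
```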

URL Encoding in Real-world Contexts

Browsers and URL Encoding

Web browsers ensure that user-entered URLs comply with Internet standards[1][2] before sending them to servers:

  • User Input Handling: Characters outside A-Z, a-z, 0-9, and reserved symbols are percent-encoded (e.g., spaces become %20)
  • Encoding Special Characters in Query Strings: Before sending a GET form or following a link, browsers encode special symbols—for example, C++ & Python #1 becomes C%2B%2B%20%26%20Python%20%231 in a link, while GET form submissions follow the application/x-www-form-urlencoded rules and encode spaces as + instead
  • Automatic Decoding: Once the page loads, browsers decode percent-encoded characters for display, so users typically see a more “friendly” version

Form Submissions (GET Method)

Forms using method="GET" append data to the URL. Encoding preserves spaces and symbols:

Example:

<form method="GET" action="/search">
  <input type="text" name="query" value="C++ & Python #1">
  <input type="submit" value="Search">
</form>

Encoded Result (form data follows the application/x-www-form-urlencoded rules, so spaces become + rather than %20):

/search?query=C%2B%2B+%26+Python+%231
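Note that form submissions follow the application/x-www-form-urlencoded rules, where spaces become + rather than %20. Python's urllib.parse.urlencode reproduces this browser behaviour:

```python
from urllib.parse import urlencode

# urlencode follows the application/x-www-form-urlencoded rules,
# encoding spaces as "+" the way a browser GET form does
query = urlencode({"query": "C++ & Python #1"})
print(query)  # query=C%2B%2B+%26+Python+%231
```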

APIs, Deeplinking & Analytics Tracking

APIs often pass dynamic data in URLs.

Example:

GET /users?filter=role="admin"&country="US/Canada"

Encoded:

GET /users?filter=role%3D%22admin%22%26country%3D%22US%2FCanada%22

For marketing and analytics, UTM parameters commonly include spaces or special characters:

Example:

  • Original URL: https://example.com/page?utm_source=newsletter&utm_campaign=Spring Sale 2025
  • Encoded URL: https://example.com/page?utm_source=newsletter&utm_campaign=Spring%20Sale%202025

Mobile apps also use deep links with URL-like structures to route users directly to in-app content. Those parameters must also be encoded:

Example:

  • Deep link: myapp://product/12345?referrer=John Doe
  • Encoded: myapp://product/12345?referrer=John%20Doe
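Both patterns can be produced with the standard library — quote for a single value, or urlencode with quote_via=quote when %20 (rather than +) is wanted for spaces. A sketch (the myapp:// scheme is from the example above):

```python
from urllib.parse import quote, urlencode

# Single value for a deep link parameter
deep_link = "myapp://product/12345?referrer=" + quote("John Doe")
print(deep_link)  # myapp://product/12345?referrer=John%20Doe

# Whole query strings: quote_via=quote keeps %20 for spaces instead of "+"
utm = urlencode(
    {"utm_source": "newsletter", "utm_campaign": "Spring Sale 2025"},
    quote_via=quote,
)
print(utm)  # utm_source=newsletter&utm_campaign=Spring%20Sale%202025
```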

Final Thoughts

Proper URL encoding is crucial for web development, internationalisation, analytics and SEO. By understanding the nuances of URL encoding and implementing best practices, development, analytics, and SEO teams can ensure that their websites are accessible, efficiently crawled by search engines, and accurately tracked in analytics tools. If you’d like to learn more or need assistance implementing these strategies, get in touch.

Additional Resources

The post URL Encoding Done Right: Best Practices for Web Developers and SEOs appeared first on Merj.

]]>
https://merj.com/blog/url-encoding-done-right-best-practices-for-web-developers-and-seos/feed 0
Don’t Block What You Want: DuckDuckGo and Common Crawl to Provide IP Address API Endpoints https://merj.com/blog/dont-block-what-you-want-duckduckgo-and-common-crawl-to-provide-ip-address-api-endpoints https://merj.com/blog/dont-block-what-you-want-duckduckgo-and-common-crawl-to-provide-ip-address-api-endpoints#respond Thu, 19 Jun 2025 09:34:36 +0000 https://merj.com/?p=2005 Your security rules are blocking the traffic you actually want. If search engines and LLM crawlers can’t reach your content,...

The post Don’t Block What You Want: DuckDuckGo and Common Crawl to Provide IP Address API Endpoints appeared first on Merj.

]]>
Your security rules are blocking the traffic you actually want. If search engines and LLM crawlers can’t reach your content, they can’t index it, train on it, or show it in traditional or AI search interfaces.

While your team focuses on stopping malicious bots, good crawlers get caught in the crossfire. DuckDuckGo processes 3 billion searches monthly. Common Crawl powers the training data behind major AI models. Block them, and your content becomes invisible to privacy-conscious users and AI-powered search.

DuckDuckGo (13 June 2025) and Common Crawl (22 June 2025) shipped a quiet but important upgrade: their crawler IP ranges are now available as structured data. No more brittle HTML parsing. No more manual updates. Just clean, fast, automatable bot management.

Key Takeaways

1. DuckDuckGo now exposes its crawler IP ranges as structured JSON
2. Common Crawl now exposes its IP ranges as structured JSON
3. The new approach replaces fragile HTML pages and makes change detection trivial (curl, jq, checksum)
4. Safelist the ranges in your WAF, or let a bot-management service (Akamai, Vercel, etc.) handle it
5. Blocking these “good bots” means blocking 3 billion DuckDuckGo searches per month, plus the dataset that fuels many LLMs
6. We built a free search engine IP tracker that monitors every change hourly, with years of history → search-engine-ip-tracker.merj.com/status

What Changed

Historically, most crawler IPs were published in HTML pages or buried in documentation. That worked…until it didn’t.

  • Fragile parsing: DOM structures change without warning
  • Slow validation: Reverse DNS lookups are too slow to be usable at scale (some still use this method)
  • Security gaps: Delayed updates mean legitimate traffic gets blocked, or teams rely on stale lists

Google provided the initial structured IP address endpoint schema, and others are now adopting the same pattern.

The schema itself is defined as follows (you can also skip to the example below):

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "IP Prefix List",
    "type": "object",
    "properties": {
        "creationTime": {
            "type": "string",
            "format": "date-time"
        },
        "prefixes": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "ipv4Prefix": {
                        "type": "string",
                        "format": "ipv4-cidr"
                    },
                    "ipv6Prefix": {
                        "type": "string",
                        "format": "ipv6-cidr"
                    }
                },
                "additionalProperties": false,
                "oneOf": [
                    {
                        "required": [
                            "ipv4Prefix"
                        ]
                    },
                    {
                        "required": [
                            "ipv6Prefix"
                        ]
                    }
                ]
            }
        }
    },
    "required": [
        "creationTime",
        "prefixes"
    ],
    "additionalProperties": false
}

A real world example using IPv4 and IPv6 looks like this:

{
    "creationTime": "2025-06-19T12:00:00.000000",
    "prefixes": [
        {
            "ipv6Prefix": "2600:1f28:365:80b0::/60"
        },
        {
            "ipv4Prefix": "44.220.181.167/32"
        },
        {
            "ipv4Prefix": "18.97.14.80/29"
        },
        {
            "ipv4Prefix": "18.97.14.88/30"
        },
        {
            "ipv4Prefix": "98.85.178.216/32"
        }
    ]
}

You can track, verify, and act on them with a single curl command.

Bot | Documentation | JSON Endpoint
DuckDuckBot | Help Page | https://duckduckgo.com/duckduckbot.json
DuckAssistBot | Help Page | https://duckduckgo.com/duckassistbot.json
CCBot | Common Crawl FAQ | https://index.commoncrawl.org/ccbot.json
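A minimal sketch of acting on one of these feeds — checking whether a requesting IP falls inside the published ranges — using Python's standard ipaddress module and a hard-coded copy of part of the sample payload above (in production you would fetch the JSON endpoint and refresh it on a schedule):

```python
import ipaddress

# Hard-coded subset of the sample payload above; fetch the real endpoint in production
feed = {
    "creationTime": "2025-06-19T12:00:00.000000",
    "prefixes": [
        {"ipv6Prefix": "2600:1f28:365:80b0::/60"},
        {"ipv4Prefix": "44.220.181.167/32"},
        {"ipv4Prefix": "18.97.14.80/29"},
    ],
}

def is_known_bot(ip):
    """True if the address falls inside any published prefix."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(next(iter(prefix.values())))
        for prefix in feed["prefixes"]
    )

print(is_known_bot("18.97.14.82"))  # True: inside 18.97.14.80/29
print(is_known_bot("203.0.113.7"))  # False: not a published range
```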

What You Should Do

  1. Safelist these bots. Audit your WAF or firewall blocks to make sure DuckDuckGo and Common Crawl are explicitly allowed
  2. Automate the check. Monitor for changes using sha256sum, or parse it into your bot allowlist pipeline
  3. Check your traffic. Are you getting referrals from DuckDuckGo? If not, it may be blocked upstream
  4. Use bot management where possible. Platforms like Vercel and Akamai can manage this for you automatically
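Step 2 above can be as simple as hashing a canonical serialisation of the feed and comparing it with the last run. A Python sketch (the feed contents here are illustrative):

```python
import hashlib
import json

def feed_checksum(feed):
    """Stable checksum of a prefix feed, insensitive to key ordering."""
    canonical = json.dumps(feed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

old = {"prefixes": [{"ipv4Prefix": "44.220.181.167/32"}]}
new = {"prefixes": [{"ipv4Prefix": "44.220.181.167/32"},
                    {"ipv4Prefix": "18.97.14.80/29"}]}
print(feed_checksum(old) != feed_checksum(new))  # True: ranges changed, refresh the allowlist
```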

Why I Care

I’m passionate about helping companies standardise formats that can benefit millions of businesses. When I reached out to the CTO of Common Crawl and the CEO of DuckDuckGo about adopting machine-readable standards, both moved quickly to implement these changes. I want to thank them for their responsiveness and leadership. This is exactly the kind of industry collaboration we need more of.

The Bigger Picture

This is part of a broader shift. As AI-native search grows, so does the importance of structured access to crawlers:

  • Expect more LLM search engines to follow (xAI, Anthropic, etc)
  • A shared schema could allow edge platforms to ingest new IPs automatically
  • Until then, standard JSON endpoints are the baseline we should expect

Conclusion

If a bot can’t reach your site, it can’t index it, it can’t train on it, and it can’t surface it in the increasingly fragmented, AI-powered search results of the future.

DuckDuckGo and Common Crawl are making it easier to tackle this problem head on and take advantage of the opportunities.

Make sure the bots you want are getting through.

The post Don’t Block What You Want: DuckDuckGo and Common Crawl to Provide IP Address API Endpoints appeared first on Merj.

]]>
https://merj.com/blog/dont-block-what-you-want-duckduckgo-and-common-crawl-to-provide-ip-address-api-endpoints/feed 0
Introducing our Bing Webmaster Tools API Python Client https://merj.com/blog/introducing-our-bing-webmaster-tools-api-python-client https://merj.com/blog/introducing-our-bing-webmaster-tools-api-python-client#respond Mon, 11 Nov 2024 13:54:17 +0000 https://merj.com/?p=1981 For organisations managing large-scale SEO operations, accessing and analysing Bing Webmaster Tools (“BWT”) data programmatically has become increasingly crucial. Today,...

The post Introducing our Bing Webmaster Tools API Python Client appeared first on Merj.

]]>
For organisations managing large-scale SEO operations, accessing and analysing Bing Webmaster Tools (“BWT”) data programmatically has become increasingly crucial. Today, we’re excited to introduce our open source Bing Webmaster Tools Client written in Python for the Bing Webmaster Tools API – a modern, type-safe, and production-ready solution for accessing Bing’s webmaster data at scale.

➡ Ready to start? Head straight to Github…

Why Another Bing Webmaster Tools Wrapper?

With the rise of new ways to search, our focus has been on expanding data pipelines outside the Google Search ecosystem. Bing’s significance has grown substantially, particularly with its integration with Copilot.

Google provides an official Python SDK for their APIs which includes Search Console (Github link). Bing has a Python SDK, but only for Bing Ads (Github link).

What began as an internal tool for our data pipeline needs has evolved into something we believe can benefit the broader technical SEO community. By open-sourcing our client, we aim to provide a robust solution that matches the quality and reliability of official SDKs.

Key Features

  • Async/await support – Built on aiohttp for efficient async operations
  • Type-safe – Full typing support with runtime validation using Pydantic
  • Domain-driven design – Operations organised into logical service groups
  • Comprehensive – Complete coverage of Bing Webmaster Tools API
  • Developer-friendly – Intuitive interface with detailed documentation
  • Reliable – Built-in retry logic, rate limiting, and error handling
  • Flexible – Support for both Pydantic models and dictionaries as input
  • Production-ready – Extensive logging, testing, and error handling

What Are The Use Cases For the BWT API?

The BWT API serves three main purposes: collecting marketing data for warehousing, monitoring search engine behaviour, and automating website maintenance tasks:

  1. Bingbot Crawl Stats: This data is not accessible via the web interface. Obtaining server access logs for bot traffic can be time-consuming, so these stats provide some insight into potential issues.
  2. Extracting Search Query Data: Up to 16 months of historical data can be backfilled, with daily data collection possible moving forward.
  3. XML Sitemap Management: Some organisations have thousands of XML sitemaps. By running scripts against both platforms, Google Search Console and Bing Webmaster Tools can stay synchronised. A single source-of-truth JSON file is typically built for this purpose.
  4. User Management: Teams often handle hundreds or thousands of website domains. Compliance is maintained with a risk register (aligned with ISO 27001 and SOC 2 Type II) and a review process for adding and removing users.
  5. Content Submission: While IndexNow is the preferred method for Bing and other search engines, it requires a static file on the site. Content Submission allows for pushing up to 10,000 new URLs per day to Bing, useful when development teams cannot prioritise the static file during a Sprint.
  6. Blocking Content: Handles legal and compliance requests effectively by blocking web pages or previews.
  7. Parameter Removal: Unlike Google, which no longer supports parameter management, Bing still allows it.
  8. Regional Settings: Subdirectory-level regional settings can be configured when hreflang implementation poses difficulties. This option is not available via the web interface.

Technical Overview

Getting Started

To begin using the wrapper:

  1. Install via pip:

pip install bing-webmaster-tools

  2. Add your BWT API key to an environment variable, as shown in our authentication example:

export BING_WEBMASTER_API_KEY=your_api_key_here

  3. Look under the examples subdirectory for boilerplate command-line interface scripts to help you get started. These assume the BING_WEBMASTER_API_KEY environment variable is set as described in the Basic Setup.

# Listing all your Bing website profiles
python examples/get_all_sites.py

# Do you have any blocked webpages that you don't know of?
python examples/get_blocked_webpages.py -s https://merj.dev

# Set the highest crawl rate available on a scale of 1-10 (default 5).
# Check if "crawl boost" is available, which is valuable for very large websites.
python examples/submit_crawl_settings.py -s https://merj.dev -r "10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10" --enable-boost

# Get all your XML sitemaps, RSS Feeds, etc. 
python examples/get_feeds.py -s https://merj.dev -v

# Submit all your latest content
python examples/submit_batch_urls.py --site-url https://merj.dev --input file.txt

# Get all related words that contain a word.
python examples/get_keyword.py -q "seo"

# Get the stats for a keyword
python examples/get_keyword_stats.py -q "seo"

# Get all URLs that have been fetched through the web UI
python examples/get_fetched_urls.py -s https://merj.com

# Get a list of URLs with issues and a summary table
python examples/get_crawl_issues.py -s https://merj.com

# Get the stats for pages matching a query
python examples/get_query_page_stats.py -s https://merj.dev -q "seo"

# Get the stats for a page with a particular query
python examples/get_query_page_detail_stats.py -s https://merj.dev -q "seo" -p "https://merj.dev/seo"

# Get a summary of all external links to a webpage
python examples/get_url_links.py -s https://merj.dev -l https://merj.dev/seo

# Get 3 summary tables: Crawl Stats, Status Codes and Issues.
python examples/get_crawl_stats.py -s https://merj.dev

# Manage_parameters shows how you can use CLI args to perform different actions.
# List all parameters
python examples/manage_parameters.py -s https://merj.dev --list -v
# Add a parameter
python examples/manage_parameters.py -s https://merj.dev --add sort
# Remove a parameter
python examples/manage_parameters.py -s https://merj.dev --remove sort
# Enable a parameter
python examples/manage_parameters.py -s https://merj.dev --enable sort
# Disable a parameter
python examples/manage_parameters.py -s https://merj.dev --disable sort

Architecture and Approach

The client is built with modern Python practices using asynchronous functions. The following example creates the authentication from an environment variable, gets all the known sites, then loops through them to output the traffic stats.

from bing_webmaster_tools import Settings, BingWebmasterClient

async def main():
    # Initialize client with settings from environment
    async with BingWebmasterClient(Settings.from_env()) as client:
        # Get all sites
        sites = await client.sites.get_sites()
        if not sites:
            print("No sites available")
            return
            
        test_site = sites[0].url
        print(f"Using site: {test_site}")
        
        # Get traffic stats
        traffic_stats = await client.traffic.get_rank_and_traffic_stats(test_site)
        print("Traffic Statistics:")
        for stat in traffic_stats:
            print(f"Date: {stat.date}")
            print(f"Clicks: {stat.clicks}")
            print(f"Impressions: {stat.impressions}")

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

The client handles all the complexity of API authentication, rate limiting, and data pagination, allowing you to focus on utilising the data rather than managing API interactions.

Current Limitations and Roadmap

As this is our public v1.X.X release, it’s important to note current limitations:

  • We are currently only supporting the BWT API key.
  • Our pursuit of creating the Python client also uncovered several BWT API bugs which are currently being addressed by the BWT team. Some BWT endpoints may not work as expected. We shall temporarily patch and work around these endpoints where possible.

Development roadmap for 2024/2025:

  • Expand mock testing for all endpoints
  • Add OAuth 2.0 integration
  • Add command line interface for endpoints
  • Improve documentation and examples
  • Remove deprecated endpoints in line with BWT deprecation
  • Work with the Bing Webmaster Tools team to improve API endpoints and documentation

Community and Contribution

We’re excited to see how marketing technology teams and the SEO community build upon this library. Please feel free to submit issues, feature requests, and pull requests via GitHub.

Next Steps

  1. Visit our GitHub repository to download the code and get started.
  2. Join in the conversation on:
    1. X using the hashtag #bingwebmastertools
    2. LinkedIn using #bingwebmastertools

The post Introducing our Bing Webmaster Tools API Python Client appeared first on Merj.

]]>
https://merj.com/blog/introducing-our-bing-webmaster-tools-api-python-client/feed 0
How Extra HTML Attributes in Canonical Tags Impact Search Engines https://merj.com/blog/how-extra-html-attributes-in-canonical-tags-impact-search-engines https://merj.com/blog/how-extra-html-attributes-in-canonical-tags-impact-search-engines#respond Tue, 13 Aug 2024 13:15:41 +0000 https://merj.com/?p=1959 TL;DR Our study, sparked by a content management system (CMS) migration anomaly discovered that certain attributes caused Google to ignore...

The post How Extra HTML Attributes in Canonical Tags Impact Search Engines appeared first on Merj.

]]>
TL;DR

Our study, sparked by a content management system (CMS) migration anomaly, discovered that certain attributes caused Google to ignore canonical tags. Read on to learn about our methodology, data, and practical recommendations to ensure your canonical link tags work as intended.

Introduction

A correct implementation of canonical link tags is essential for effective indexing of a website, as they help manage content duplication and direct search engines to the preferred version of a webpage.

This study focuses on how search engines, particularly Google, interpret canonical link tags with additional attributes other than rel="*" and href="*".

Our research was prompted by an anomaly observed during a CMS migration, where Google Search Console failed to recognise canonical link tags.

Research Objectives

This study aims to address the following questions:

  • What is Google’s policy on processing canonical link tags with additional attributes?
  • Which specific attributes cause Google to disregard canonical link tags?
  • Are there any other/undocumented attributes that affect canonical link tag interpretation?
  • How do these findings impact current SEO best practices?
  • Do other search engines treat canonical link tags with extra attributes similarly to Google?
  • What measures can website owners and developers implement to ensure their canonical link tags are respected by search engines?

Background

During a CMS migration project, we encountered an anomaly where the Google Search Console inspector tool failed to detect canonical link tags that were visibly present in both the raw HTML source and the rendered Document Object Model (DOM). The inspector reported that no canonical was declared, even though the tag was clearly there:

<link rel="canonical" crossorigin="anonymous" media="all" href="https://example.com/category/123-slug-here" />

This observation led to an investigation of Google’s documentation, which revealed a recent update (February 15, 2024) clarifying the extraction of rel="canonical" annotations:

The rel="canonical" annotations help Google determine which URL of a set of duplicates is canonical. Adding certain attributes to the link element changes the meaning of the annotation to denote a different device or language version.

This is a documentation change only; Google has always ignored these rel="canonical" annotations for canonicalisation purposes. The documentation explicitly mentioned four attributes that, when present, cause Google to disregard the canonical link tag: hreflang, lang, media, and type.

Google supports explicit rel="canonical" link annotations as described in [RFC 6596]. rel="canonical" annotations that suggest alternate versions of a page are ignored; specifically, rel="canonical" annotations with hreflang="lang_code", lang, media, and type attributes are not used for canonicalization. Instead, use the appropriate link annotations to specify alternate versions of a page; for example <link rel="alternate" hreflang="lang_code" href="url_of_page" /> for language and country annotations.

Great! We could confirm this was indeed an expected behaviour. But it also raised several questions, like: Are there other/undocumented attributes that may cause Google to entirely ignore the canonical link tag? How widespread is the usage of these problematic attributes? And, are there tools that already flag these attributes as being potentially problematic?

Methodology

Our research methodology comprised the following steps:

  1. Web Scraping: We extracted canonical link tags from the top 1 million websites.
  2. Data Cleaning: After filtering for valid responses, we analysed 595,517 domains.
  3. Attribute Analysis: We identified and cataloged 210 unique attribute names used in canonical link tags.
  4. Experimental Setup: We developed a Vercel application to generate 568 URLs with various attribute permutations (available at https://canonical-research.vercel.app)
  5. Testing: We utilised Google Search Console’s live URL inspection tool to verify whether Google recognises these as self-declared canonicals. We wanted simply to assess whether Google recognises the presence of a canonical link tag, not whether it chooses to respect it after evaluating the contents of a page.
  6. Data Collection: Due to GSC’s daily inspection limit, we conducted tests over a one-week period.
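As an illustration of steps 1 and 3, attribute names can be collected from canonical link tags using Python’s standard-library HTML parser. This is a minimal sketch of the idea, not the actual pipeline we ran across the 595,517 domains:

```python
from html.parser import HTMLParser

class CanonicalAttributeCollector(HTMLParser):
    """Collect attribute names used on rel="canonical" link tags."""

    def __init__(self):
        super().__init__()
        self.attribute_names = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags also arrive here via handle_startendtag.
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.attribute_names.update(attrs.keys())

html = '<link rel="canonical" data-rh="true" itemprop="url" href="https://example.com/" />'
collector = CanonicalAttributeCollector()
collector.feed(html)
print(sorted(collector.attribute_names))
# ['data-rh', 'href', 'itemprop', 'rel']
```

Cataloguing the unique names over every crawled page gives the frequency table in the Results section below.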

Results

Attribute Frequency

Our analysis identified the top 10 most common attributes found in canonical link tags that Google does not flag as problematic:

  1. data-react-helmet (3,348 occurrences)
  2. itemprop (2,902 occurrences)
  3. data-n-head (1,423 occurrences)
  4. data-rh (967 occurrences)
  5. id (837 occurrences)
  6. data-senna-track (610 occurrences)
  7. data-hid (576 occurrences)
  8. data-baseprotocol (393 occurrences)
  9. data-basehost (392 occurrences)
  10. class (335 occurrences)

Within our gathered dataset, the four attributes in canonical link tags known to cause issues ranked as follows:

  1. hreflang (3,196 occurrences)
  2. type (889 occurrences)
  3. media (421 occurrences)
  4. lang (8 occurrences)

Google’s Interpretation

Our testing yielded the following results:

  • Google ignores attributes not explicitly mentioned in their documentation and these don’t affect the search engine’s ability to correctly identify canonical link tags.
  • Canonical link tags containing hreflang, lang, media, or type attributes were not identified as self-canonical in the GSC inspector and the entire canonical tag is ignored.
  • The presence of multiple attributes or various combinations did not alter this behaviour, as long as none of the attributes was one of those listed above.

SEO Tool Limitations and Implications

A significant finding from our study is that many popular SEO crawling tools do not flag issues related to problematic attributes in canonical link tags. This oversight can lead to undetected canonicalisation problems, impacting the effectiveness of canonical implementations. It’s crucial to be aware of this limitation and take proactive measures: manually verify your canonical link tags using tools like Google Search Console’s URL inspection tool, or build custom systems that specifically monitor canonical link tags and look for these attributes.

Verifying

Cross-Search Engine Comparison

We captured the URL inspection testing using Google and Bing Webmaster tools. Example of Google Search Console Live Testing for a URL with a canonical containing problematic attributes (note how “None” is displayed for the User-declared canonical):

Google Search Console Live Testing for a URL with a canonical not containing problematic attributes:

While Google reports “None” for the User-declared canonical, indicating it didn’t recognise a canonical link tag, Bing Webmaster Tools’ live URL inspection neither reported the presence of nor flagged issues with the canonical for any URLs in our testing set.

Bing Webmaster Tools live URL inspection test for a URL with a canonical containing problematic attributes (no data is shown regarding canonical link tags):

Bing Webmaster Tools live URL inspection test for a URL with a canonical not containing problematic attributes:

While our study focused primarily on Google’s and Bing’s interpretation of canonical link tags, further research into actual crawling and indexing may be required to determine whether other search engines, such as DuckDuckGo and Naver, handle canonical link tags with extra attributes in a similar manner.

Recommendations

Based on our findings, we propose the following recommendations:

  1. Simplify canonical link tag HTML as much as possible, avoiding unnecessary attributes. The only attributes used should be rel and href.
  2. If additional data must be included on the canonical tag, use data-* attributes.
  3. We highly recommend avoiding common attribute names such as id, name, or content in canonical link tags, as they may become reserved words in the future.
  4. Set up CI/CD checks to identify whether additional attributes have been added.
  5. Verify the effectiveness of your canonical link tags using Google Search Console’s URL inspection tool, or custom-built systems.
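Recommendation 4 can be as simple as a regex-based check in the build pipeline. The sketch below is a hypothetical illustration (the patterns are naive and can false-positive on attribute-like strings inside URLs), not a production linter:

```python
import re

# The four attributes Google documents as changing the annotation's meaning.
PROBLEMATIC = {"hreflang", "lang", "media", "type"}

# Naive patterns: good enough for a CI smoke test, not a full HTML parser.
CANONICAL_RE = re.compile(r"<link\b[^>]*\brel=[\"']canonical[\"'][^>]*>", re.IGNORECASE)
ATTR_RE = re.compile(r"\b([a-zA-Z][\w-]*)\s*=")

def check_canonicals(html: str) -> list:
    """Return problematic attribute names found on canonical link tags."""
    found = []
    for tag in CANONICAL_RE.findall(html):
        for attr in ATTR_RE.findall(tag):
            if attr.lower() in PROBLEMATIC:
                found.append(attr)
    return found

print(check_canonicals('<link rel="canonical" media="all" href="/page" />'))
# ['media'] -> this canonical would be ignored by Google
```

A non-empty result can fail the build, catching the issue before Google Search Console ever sees the page.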

Conclusion

Our research highlights critical findings regarding the interpretation of canonical link tags by Google, especially when additional attributes are present. Key takeaways include:

  1. Certain attributes in canonical link tags, such as hreflang, lang, media, and type, cause Google to disregard them.
  2. Many common SEO tools do not flag these problematic attributes, leading to potential oversight in canonicalisation.
  3. To ensure an effective indexing process, it is crucial to simplify canonical link tags by avoiding unnecessary attributes and regularly verifying them using tools like Google Search Console.

These nuances are important to understand in order to stay aligned with search engine guidelines and maintain a healthy website in search results, regardless of the insights provided by SEO tools.

The post How Extra HTML Attributes in Canonical Tags Impact Search Engines appeared first on Merj.

]]>
https://merj.com/blog/how-extra-html-attributes-in-canonical-tags-impact-search-engines/feed 0
Google’s JavaScript Rendering Capabilities https://merj.com/blog/googles-javascript-rendering-capabilities https://merj.com/blog/googles-javascript-rendering-capabilities#respond Wed, 31 Jul 2024 15:13:09 +0000 https://merj.com/?p=1913 In a significant step towards debunking some persistent SEO myths and increasing industry understanding of search engine crawling and rendering,...

The post Google’s JavaScript Rendering Capabilities appeared first on Merj.

]]>
In a significant step towards debunking some persistent SEO myths and increasing industry understanding of search engine crawling and rendering, Merj has partnered with Vercel to explore how effectively search engines render JavaScript pages, providing indispensable insight for SEOs, developers, Chief Marketing Officers, and Chief Technology Officers responsible for building, maintaining, and optimising websites.

📢 This research puts to bed a question that still hangs over modern web development: can search engines fully and reliably render a JavaScript website?

Part 1 focuses exclusively on Google and analysed data from over 100,000 Googlebot fetches, shedding light on critical questions regarding whether and how Google renders JavaScript, the effect of JS on the rendering queue and page discovery, and the impact of all this on SEO. Through rigorous analysis, we dispel common myths and provide practical guidance for anyone striving to optimise their web presence.

Full research findings

For a detailed exploration of the findings from this research read the full article on Vercel’s blog. There, you will discover the nuances of our methodology, the myths we addressed, and the practical implications of our findings for web developers and businesses alike.

🔗 Read the full article on Vercel’s blog here.

About Vercel and our partnership

Vercel’s Frontend Cloud provides the developer experience and infrastructure to build, scale, and secure a faster, more personalized web. Customers like Under Armour, Chico’s, The Washington Post, Johnson & Johnson, and Zapier use Vercel to build dynamic user experiences that rank highly in search.

Looking ahead: Part 2

The journey does not end here. Part 2 of this research, which, as I write this, is already nearing completion, broadens our scope to examine the JavaScript rendering capabilities of other search engines, including non-Western search engines and large language model companies such as OpenAI (ChatGPT). Part 2 will complete your understanding of the global rendering landscape and offer further strategic insights for optimising web content across diverse search platforms.

For further information, or to talk to someone about how your technical infrastructure might be impacting search engine rendering and discovery, please contact us.

The post Google’s JavaScript Rendering Capabilities appeared first on Merj.

]]>
https://merj.com/blog/googles-javascript-rendering-capabilities/feed 0
Investigating Reddit’s robots.txt Cloaking Strategy https://merj.com/blog/investigating-reddits-robots-txt-cloaking-strategy https://merj.com/blog/investigating-reddits-robots-txt-cloaking-strategy#respond Thu, 04 Jul 2024 10:42:45 +0000 https://merj.com/?p=1846 Introduction We recently noticed a post on X by @pandraus regarding Reddit’s robots.txt file. On 25 June 2024, u/traceroo announced...

The post Investigating Reddit’s robots.txt Cloaking Strategy appeared first on Merj.

]]>
Introduction

We recently noticed a post on X by @pandraus regarding Reddit’s robots.txt file. On 25 June 2024, u/traceroo announced that any automated agents accessing Reddit must comply with their terms and policies.

“In the next few weeks, we’ll be updating our robots.txt instructions to be as clear as possible: if you are using an automated agent to access Reddit, you need to abide by our terms and policies, and you need to talk to us. We believe in the open internet, but we do not believe in the misuse of public content.”

u/traceroo

The robots.txt file is crucial for managing web crawler interactions. An instruction to disallow all access, as seen in Reddit’s latest update, can lead to deindexing the entire site, posing significant risks to a site’s search engine presence and overall accessibility. Let’s explore the details and implications of this change.

Analysis of the New Robots.txt

The new robots.txt file is quite…blunt:

# Welcome to Reddit's robots.txt
# Reddit believes in an open internet, but not the misuse of public content.
# See https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy Reddit's Public Content Policy for access and use restrictions to Reddit content.
# See https://www.reddit.com/r/reddit4researchers/ for details on how Reddit continues to support research and non-commercial use.
# policy: https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy

User-agent: *
Disallow: /

This directive essentially blocks all crawlers from accessing Reddit. Normally, this would be a critical issue as it could lead to deindexing the entire domain. This situation has precedents, such as when OCaml inadvertently blocked their entire website.
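The blanket effect of a `User-agent: *` / `Disallow: /` pair can be reproduced with Python’s built-in robots.txt parser; this is a quick sketch, unrelated to Reddit’s own infrastructure:

```python
from urllib.robotparser import RobotFileParser

REDDIT_RULES = """
User-agent: *
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(REDDIT_RULES)

# With no crawler-specific groups, every User-Agent falls under * and is blocked.
for agent in ("Googlebot", "Bingbot", "GPTBot"):
    print(agent, parser.can_fetch(agent, "https://www.reddit.com/r/programming/"))
# prints False for every agent
```

Served to a compliant search engine crawler, these two lines would eventually deindex the entire domain.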

Real-World Implications

Reddit’s move raises several questions. For example, Reddit still wants search engines and archivers to index their content, especially given recent agreements with entities like Google.

“There are folks like the Internet Archive, who we’ve talked to already, who will continue to be allowed to crawl Reddit.”

u/traceroo

This leads us to question whether search engines like Google might make exceptions for Reddit. However, this seems unlikely. It’s more plausible that Reddit might be serving different robots.txt files to different user-agents.

Testing and Findings

To investigate, we conducted a test using Google’s tools. Normally, a user-agent switcher could be used, but Reddit blocks agents pretending to be Google, thanks to search engines providing their IP address ranges.

We used Google’s rich snippet testing tool to retrieve the raw HTML. Our findings confirmed that Reddit is serving an entirely different robots.txt file to Google, which is commonly known as cloaking.
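For completeness, a plain User-Agent switch looks like the hypothetical sketch below. As noted above, it cannot reproduce what Google sees, because Reddit also verifies the requesting IP address against the published Googlebot ranges, which is why we relied on Google’s own testing tools:

```python
import urllib.request

def robots_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a robots.txt request presenting a chosen User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url: str, user_agent: str) -> str:
    # Network call: sites that verify crawler IP ranges (as Reddit does for
    # Googlebot) will still serve the public file to a spoofed User-Agent.
    with urllib.request.urlopen(robots_request(url, user_agent), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Comparing the two responses would reveal UA-based cloaking only when the
# server does not also verify the requesting IP address:
# public = fetch("https://www.reddit.com/robots.txt", "Mozilla/5.0")
# as_bot = fetch("https://www.reddit.com/robots.txt", "Googlebot/2.1")
```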

Conclusion

Reddit’s updated robots.txt file appears to block all crawlers, but our tests show this is more of a public stance than a practical restriction. Robots.txt files are generally not meant for consumption by the average user (i.e. not developers, SEO practitioners, and the like), so it is not perceived as deceptive for the purpose of “good bot” search engine content discovery.

For those interested, the full robots.txt rules can be found here.

The post Investigating Reddit’s robots.txt Cloaking Strategy appeared first on Merj.

]]>
https://merj.com/blog/investigating-reddits-robots-txt-cloaking-strategy/feed 0
Discovering and Diagnosing a Google AdSense Rendering Bug https://merj.com/blog/discovering-and-diagnosing-a-google-adsense-rendering-bug https://merj.com/blog/discovering-and-diagnosing-a-google-adsense-rendering-bug#respond Tue, 23 Apr 2024 10:15:28 +0000 https://merj.com/?p=1811 Introduction AdSense is Google’s advertising content platform where publishers can be paid to place advertisements on their webpages. While performing...

The post Discovering and Diagnosing a Google AdSense Rendering Bug appeared first on Merj.

]]>
Introduction

AdSense is Google’s advertising content platform where publishers can be paid to place advertisements on their webpages. While performing a search engine optimisation test that involved crawling and rendering tracking for a React application, we discovered an anomaly in the client’s server log data, which required further investigation.

We found that a part of the AdSense technology stack is not working as expected and the programmatic matching between ads and website content is impacted, thus creating an incomplete understanding of webpage content.

Disclosure Ethics and Communication

We uphold a strong ethical framework when it comes to disclosing discovered anomalies or bugs, especially those associated with pivotal platforms like Google AdSense. We strictly follow responsible disclosure guidelines, informing relevant parties well in advance of any public announcements.

This issue does not fall under the Google Bug Hunter Program, as it pertains primarily to a product operational anomaly rather than a security vulnerability. Despite not reporting this through the Google Bug Bounty Program, we reported this bug to Google representatives who liaise with the internal Google teams and observed a standard 90-day grace period before considering public disclosure.

We have extended these standard disclosure timelines to facilitate a resolution, although the issue remains unresolved.

Bug Disclosure Timeline

Date | Subject | Action
June 1st, 2023 | Merj | Discovered the bug
June 8th, 2023 | Merj | We sent an email to Gary Illyes, a member of the Google Search Team, describing the bug and its impact.
June 27th, 2023 | Google | Gary Illyes responded, stating that he had consulted with the rendering team and would notify the administrators responsible for the Mediapartners-Google crawlers. As the owners of the Mediapartners-Google crawlers are not part of the Search team, Search team members have no influence over them.
September 15th, 2023 | Merj | We sent a follow-up email asking for any updates regarding the bug.
October 24th, 2023 | Merj & Google | We had an in-person conversation with Gary Illyes about the bug at the Google Search Central Live Zurich event.
April 23rd, 2024 | Merj | Public disclosure of the issue

Google AdSense and Google Ads

Google AdSense is an advertising program run by Google. It allows website owners (publishers) to monetise their content by displaying targeted advertisements. These ads are generated by Google and can be customised to match the website’s content. Publishers earn revenue when visitors click on or view these advertisements.

Google AdSense offers multiple formats of ads

Source: https://adsense.google.com/start/resources/best-format-your-site-for-adsense/

Google AdSense offers publishers a variety of ad units to display on their websites.

Google Ads is a platform that enables businesses (advertisers) to create and manage online advertisements, targeting specific audiences based on keywords, demographics, and interests. These advertisements can appear on various Google services, such as search results, YouTube, and partner websites.

Advertisers that use Google Ads can place their ads on websites that participate in the AdSense program. This symbiotic relationship enables businesses to reach a wider audience through targeted advertising, while website owners can generate revenue by hosting relevant advertisements on their platforms.

Google AdSense targeting

Google AdSense works by matching ads to your site based on your content and visitors; webpages with incorrect, partially rendered, or blank content impact the matching of ads to webpages. To analyse webpage content, Google AdSense employs a specific User-Agent known as ‘Mediapartners-Google’.

Google AdSense employs various methods for delivering targeted ads. Contextual Targeting uses factors such as keyword analysis, word frequency, font size, and the overall link structure of the web to ascertain a webpage’s content accurately and match ads accordingly.

However, without access to a page’s full content, any targeting based on page content cannot be accurate.

Impact of the Google AdSense Rendering Bug

Impact on Google Adsense

Websites that block AdSense infrastructure from accessing their content can considerably affect the precision of ad targeting, potentially resulting in diminished clicks and, consequently, reduced revenue. This has an impact on both publishers and advertisers.

Misunderstanding the content on the page could result in more severe consequences. If the publisher sends irrelevant traffic to advertisers, the AdSense platform may limit or disable ad serving.

Impact on Server Access Logs Analysis

The misattribution of User-Agents in server access logs can lead to incorrect assumptions about search engines’ crawling and rendering of webpages.

Additionally, it can result in inaccurate conclusions about the sources of crawling traffic and the effectiveness of strategies or updates made on the website, potentially leading to misguided decision-making.

Technical Analysis of the Bug

The TL;DR

  • The use of Mediapartners-Google and Googlebot for different parts of the AdSense algorithm process creates a conflict of rules that is not immediately obvious.
  • The initial request to download the webpage’s HTML source code uses the “Mediapartners-Google” User-Agent.
  • The Google Web Rendering Service (WRS) then processes and renders the page to generate the final HTML. During this phase, supplementary rendering resources are requested using the “Googlebot” User-Agent. If a necessary resource cannot be downloaded because a robots.txt rule is blocking the “Googlebot” User-Agent, the webpage may be partially rendered or completely blank.
  • Not being able to get the correct content of the webpage can affect AdSense content understanding and ad targeting, consequently affecting publisher revenues.
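The conflict can be demonstrated with Python’s standard-library robots.txt parser, using an illustrative rule set in which Mediapartners-Google is allowed everywhere but Googlebot is disallowed from /reviews:

```python
from urllib.robotparser import RobotFileParser

RULES = """
User-agent: Googlebot
Disallow: /reviews

User-agent: Mediapartners-Google
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(RULES)

# Mediapartners-Google may fetch the page HTML...
print(parser.can_fetch("Mediapartners-Google", "/reviews/139593"))  # True
# ...but rendering-time resource fetches go out as Googlebot, which is blocked,
# so the rendered page can end up partial or completely blank.
print(parser.can_fetch("Googlebot", "/reviews/139593"))  # False
```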

The Details

Robots.txt effect on crawling & Rendering

Every time a web browser requests a website, it sends an HTTP header called the “User-Agent”. The User-Agent value contains information about the web browser name, operating system, and device type. It is present in both webpage and page resource requests.

Search engine crawlers use their own custom User-Agents when fetching webpages and page resources. Before downloading a specific URL, search engines check whether they are allowed to fetch it by parsing the robots.txt file.

Without debating whether and how using robots.txt to block crawlers is suitable, below is a simplified step-by-step pipeline of the robots.txt file’s effect on a search engine’s crawling and rendering process:

  • Step 1: Checking robots.txt before fetching the webpage
  • Step 2: Fetching the webpage
  • Step 3: Parsing the HTML to get the webpage resources
  • Step 4: Checking robots.txt for each webpage resource
  • Step 5: Downloading webpage resources
  • Step 6: Starting to render the webpage
  • Step 7: Checking robots.txt for additional page resources needed to complete the rendering
  • Step 8: Downloading additional webpage resources
  • Step 9: Completing the webpage rendering

Before each fetch, the crawler has to check whether the resource can be downloaded, respecting robots.txt rules.

If Step 1 fails:

  • the crawler is not allowed to download the webpage HTML source code.
  • subsequent steps are ignored.

If Step 1 is completed but one of the other steps fails:

  • the rendering of the webpage may not be correct due to missing resources.
  • the final webpage’s rendered HTML may be missing some information or be completely blank.
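The pipeline above can be condensed into a toy sketch; the function names and stubs are hypothetical, and real crawlers are far more involved:

```python
def render_page(page_url, get_resources, fetch, is_allowed, agent):
    """Toy model of the check-before-every-fetch pipeline described above."""
    if not is_allowed(agent, page_url):              # Step 1
        return None                                  # crawl stops entirely
    html = fetch(page_url)                           # Step 2
    fully_rendered = True
    for url in get_resources(html):                  # Steps 3-8
        if is_allowed(agent, url):                   # robots.txt check per resource
            fetch(url)
        else:
            fully_rendered = False                   # missing resource: partial render
    return html, fully_rendered                      # Step 9

# Tiny demonstration with stubbed-out fetching and a single blocked resource.
blocked = {"/private/app.js"}
result = render_page(
    "/reviews/1",
    get_resources=lambda html: ["/static/app.css", "/private/app.js"],
    fetch=lambda url: f"<html>{url}</html>",
    is_allowed=lambda agent, url: url not in blocked,
    agent="Googlebot",
)
print(result)  # ('<html>/reviews/1</html>', False)
```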

Our Investigative Process

With ongoing efforts to bring our Search Engine Web Rendering Monitoring solution into production, we have been closely monitoring the number of webpages being crawled and the time delta within which those webpages are rendered. Working with server logs that contain terabytes of data, we utilise a custom in-house enrichment and query engine (similar to Splunk) that enables us to drill into the data with complex logic.

Validating the Data Source

The server access logs started showing anomalies over a six-week period, with fetches of page resources whose referrers pointed to webpages that are normally blocked for Googlebot. First, we needed to check the data pipelines and data integrity. This involved reviewing any code changes and container failures that may have created unexpected edge cases, both at our source and further upstream. We are often second consumers of server logs because of Personally Identifiable Information (PII) and Payment Card Industry Data Security Standard (PCI-DSS) requirements. Examples of transformations include:

  • Redacting sensitive URLs such as logged-in areas.
  • IP address restriction. Often the IP address is redacted, so Google crawler verification needs to be done further upstream by IP range checks.
  • Scrubbing emails, names, and addresses.

The Traffic Engineering and Edge teams managing the upstream ingress point (for instance, a CDN like Cloudflare, Akamai, or Fastly) confirmed that no changes had been made. We reprocessed our data, which yielded the same anomalies.

Reproduction and Isolation of the Issue

Once the data source has been validated, the next step is to reproduce and isolate the anomaly to confirm its existence and understand its behaviour. Here’s how to replicate the issue:

  1. Identify Target Webpages: Start by identifying webpages that are accessible to the “Mediapartners-Google” User-Agent, but blocked for the “Googlebot” User-Agent. This can be determined by looking for “Disallow” directives in the website’s robots.txt file.
# Googlebot
user-agent: Googlebot
disallow: /reviews

# Mediapartners-Google
user-agent: Mediapartners-Google
allow: /
  2. Utilise the Referer HTTP Request Header: Tracing the webpage resources through the Referer HTTP header reveals the webpage from which a particular resource has been requested.

APACHE
66.249.64.4 - - [28/Jul/2023:04:17:10 +0000] 808840 "POST /graphql-endpoint HTTP/1.1" 200 56 "https://domain.com/reviews/139593" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"

NGINX
66.249.64.4 - - [28/Jul/2023:04:17:10 +0000] "POST /graphql-endpoint HTTP/1.1" 200 56 "https://domain.com/reviews/139593" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"

  3. Use a Robots.txt Parser: Use a reliable robots.txt parser to verify the accessibility of the webpage origin address for different User-Agents. We recommend using the official Google open-source C++ version available on GitHub. If using another parser, refer to the official Google documentation and the specification to check for accurate parsing.
  4. Verify User-Agent Attribution: By combining the Referer HTTP request header and the robots.txt parser, check whether the resource requests during rendering are correctly attributed to the “Googlebot” User-Agent or if they originate from a different User-Agent, specifically “Mediapartners-Google”.

Note: For webpages accessible by both “Mediapartners-Google” and “Googlebot,” the above Server Access Logs approach to detect incorrect User-Agent attribution may not be effective. In such specific cases, more advanced solutions, such as our Search Engine Rendering Monitoring tool, are required.
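Steps 2 to 4 can be combined into a small script. This sketch uses Python’s standard-library robots.txt parser rather than Google’s C++ implementation, and the log line and rules follow the examples above:

```python
import re
from urllib.robotparser import RobotFileParser

# Extract the referer and user-agent fields from a combined-format log line.
LOG_RE = re.compile(r'"(?:GET|POST) [^"]+" \d+ \d+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"')

line = ('66.249.64.4 - - [28/Jul/2023:04:17:10 +0000] "POST /graphql-endpoint HTTP/1.1" '
        '200 56 "https://domain.com/reviews/139593" '
        '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; '
        '+http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"')

parser = RobotFileParser()
parser.parse("""
User-agent: Googlebot
Disallow: /reviews

User-agent: Mediapartners-Google
Allow: /
""".splitlines())

m = LOG_RE.search(line)
referer, ua = m.group("referer"), m.group("ua")

# A Googlebot UA requesting a resource whose referring page is disallowed
# for Googlebot is the anomaly signature described above.
is_anomaly = "Googlebot" in ua and not parser.can_fetch("Googlebot", referer)
print(is_anomaly)  # True
```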

Impact Analysis

Number of Websites using Google AdSense potentially impacted 

To assess the potential implications of the issue on actual websites, we acquired the list of US and UK websites utilising Google AdSense from BuiltWith.com and developed a tool to identify the possible impact of the issue on these websites.

The robots.txt files of most websites we analysed are small and contain rules only for the global User-Agent (*), which the AdSense crawler ignores. As the AdSense crawler only respects rules set specifically for Mediapartners-Google, this significantly impacts the number of websites that may be affected.

We did this by using the following simplified logic that approximates the potential magnitude of the websites that may be impacted:

Flowchart to define if a domain is impacted or not.

Upon executing the tool on the BuiltWith list, which covers around 7 million US websites and 2 million UK websites, we determined that around 5.5 million websites may potentially be impacted by this issue.

UK websites

Status | Number
Websites from the BuiltWith list | 1,946,633
Testable websites | 974,536
Potentially impacted websites | 938,413
Non-impacted websites | 36,123

US websites

Status | Number
Websites from the BuiltWith list | 6,827,954
Testable websites | 4,540,894
Potentially impacted websites | 4,363,028
Non-impacted websites | 177,866

On analysis of the robots.txt files, we can see that most of the body bytes are relatively small.

UK websites robots.txt bytes (compressed)

Percentile | Bytes
25th | 137
50th (Median) | 137
75th | 137
95th | 213
99th | 748

US websites robots.txt bytes (compressed)

Percentile | Bytes
25th | 137
50th (Median) | 137
75th | 137
95th | 575
99th | 1,145

It is difficult, within the scope of this article, to provide an exact prediction of the number of websites currently impacted. While 5.5 million websites may be affected by the issue, they would only experience a negative impact if they exhibit certain specific characteristics, such as serving primary content via JavaScript and blocking a portion of requests using specific robots.txt rules.

Our analysis provides a broad overview of potential impacts without hands-on verification. To identify if a site is affected, a more complex assessment would be necessary, involving the comparison of a site’s initial and rendered HTML. This requires a level of testing that goes beyond our current scope, emulating search engine behaviours to extract and analyse a page’s primary content.

The web is inherently broken, and simple methods, like checking the <main> HTML tag, fall short due to the web’s inconsistency and the varying adherence to best practices among servers and websites. Other approaches, such as comparing initial and rendered HTML sizes or word count differences, are imprecise and unreliable, potentially leading to the publication of incorrect data.

Given the complexity of automating the test, we have opted to describe a straightforward method for self-diagnosing the issue in the FAQ section. This approach allows users to assess their websites independently.

Google AdSense impact 

The ideal test to assess the impact on Google AdSense in this scenario would be to quantify the number of websites affected by the issues that display inappropriate ads, yet this is unfeasible.

Google AdSense utilises a variety of ad-matching techniques that go well beyond contextual targeting. This comprehensive approach offers a broad spectrum of ad targeting possibilities, ranging from matches based on content to ads chosen by advertisers for specific placements and those tailored to user interests.

While publishers can customise the types of ad categories permitted on their site, they have limited influence over the exact ads that are shown. Moreover, the presence of ads that seem to not align with the site content could be attributed to advertisers who have set overly broad or generic targeting criteria rather than an issue with the ad targeting system itself.

Due to this complexity, it’s not possible to determine whether a website is displaying incorrect ads based solely on the issue we discovered.

As an alternative method to estimate whether websites affected by the issue might see an impact on revenue, publishers can use the revenue calculator to get an idea of how much they should earn with AdSense.

The AdSense revenue calculator estimates your potential annual revenue

In the calculator, you can select region, category, and monthly page views to get an estimate. The calculator itself emphasises that the estimate should only be used as a reference and that numbers may vary, but it could be useful to have an idea of the missing revenue if the numbers differ significantly from what publishers can see in the AdSense dashboard.

Google Ads impact

Google Ads is not directly affected by the issue. We have examined the Google Ads crawler’s requests, and for the tested websites, it sends the correct User-Agent for all fetches. Nonetheless, advertisers may observe an impact on traffic quality, click-through rate (CTR), and, indirectly, revenue.

Server Access Logs impact

Access logs are not commonly used by publishers or advertisers, yet they might be used by others for analysis or to establish a business case for technical modifications.

Using the methodology described in the ‘Reproduction and Isolation of the Issue’ section, we examined the access logs of multiple websites for different clients. Our findings revealed that, depending on the scale of the website, the percentage of misattributed ‘Mediapartners-Google’ fetches using the ‘Googlebot’ User-Agent can range from 20% to 70% of the total ‘Googlebot’ requests.

A discrepancy of this size can significantly distort any analysis based on these logs.
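The core of that check can be sketched in a few lines of Python: flag any request carrying a Googlebot User-Agent whose requested path is disallowed for Googlebot, since a genuine Googlebot fetch should never hit such a path. The log lines, domain, and robots.txt rules below are hypothetical, and real access logs would need a more robust parser:

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /ajax/ is blocked for Googlebot.
robots_txt = """\
User-agent: Googlebot
Disallow: /ajax/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Minimal combined-log-format lines: request path mid-line, User-Agent last.
log_lines = [
    '66.249.66.1 - - [01/Feb/2024:10:00:00 +0000] "GET /page-1 HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [01/Feb/2024:10:00:01 +0000] "GET /ajax/content HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

pattern = re.compile(r'"(?:GET|POST) (\S+) [^"]*".*"([^"]*)"$')

total = misattributed = 0
for line in log_lines:
    m = pattern.search(line)
    if not m:
        continue
    path, ua = m.groups()
    if "Googlebot" not in ua:
        continue
    total += 1
    # A real Googlebot fetch should never hit a path it is disallowed from,
    # so these entries are likely misattributed Mediapartners-Google fetches.
    if not rp.can_fetch("Googlebot", "https://example.com" + path):
        misattributed += 1

print(f"{misattributed}/{total} Googlebot-UA requests look misattributed")
```

Run over a full log, the ratio of flagged entries to total Googlebot-UA entries gives a lower-bound estimate of the misattribution rate.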

Solutions and Recommendations

Best Practices for AdSense

While Google has confirmed it is a bug, they have not yet fixed it. Businesses can work around the issue by ensuring essential assets that are used to render a webpage, such as API endpoints, scripts, stylesheets and images, are not blocked by robots.txt for both “Mediapartners-Google” and “Googlebot” User-Agents.
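As an illustration, one way to keep the rules aligned is to address both User-Agents in a single robots.txt group, so every rule applies identically to each bot (the paths below are placeholders):

```
User-agent: Googlebot
User-agent: Mediapartners-Google
Disallow: /private/
Allow: /api/
Allow: /assets/
```

Grouping the agents this way makes it harder for the rule sets to drift apart in future edits.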

Best Practices for Server Access Logs Analysis

To effectively understand the impact of issues within server access logs, it is crucial to employ a systematic approach to log analysis. The method outlined in the “Reproduction and Isolation of the Issue” section provides a simple way to filter the access logs by removing the pages that Googlebot can’t crawl. It’s worth remembering that this approach offers only a partial view of the problem, as it filters out only those pages blocked for Googlebot but not for Mediapartners-Google.

It is recommended that you use more advanced filtering techniques to fully understand the issue’s impact. For a detailed and comprehensive analysis of your server access logs, we encourage you to get in touch with us.

FAQ

What is the Google AdSense rendering bug?

The Google AdSense rendering bug is a technical issue in which ads served by Google AdSense might not display correctly on publishers’ websites.

This problem presents itself due to discrepancies in how pages are rendered when different rules are applied to Googlebot and the AdSense bot (“Mediapartners-Google”). If these bots are treated differently by your site’s robots.txt, it can lead to improper ad display.

What steps can I take to diagnose the AdSense rendering issue on my site easily?

To diagnose the issue, review your robots.txt, checking for any directives that block “Googlebot” from accessing URL paths on your site that are not similarly restricted for the AdSense bot (“Mediapartners-Google”).

If your website is using Client Side Rendering and/or the main content of the webpages is generated dynamically at rendering time using additional JavaScript requests, it’s crucial to ensure that both “Googlebot” and “Mediapartners-Google” have equal access to these JavaScript resources and the resultant content paths.

Discrepancies in access permissions between these bots can lead to issues and prevent proper rendering.
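Python’s standard library includes a robots.txt parser that makes this comparison easy to script. The sketch below uses a hypothetical robots.txt in which a JavaScript API path is blocked for Googlebot but not for Mediapartners-Google, exactly the kind of discrepancy this check is meant to surface:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /api/ is blocked for Googlebot only.
robots_txt = """\
User-agent: Googlebot
Disallow: /api/

User-agent: Mediapartners-Google
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A resource fetched via JavaScript that both bots need to render the page.
url = "https://example.com/api/content.json"
for agent in ("Googlebot", "Mediapartners-Google"):
    print(f"{agent}: {rp.can_fetch(agent, url)}")
# Unequal results indicate a discrepancy to fix.
```

Running this over every script, stylesheet, and API endpoint the page loads at render time will quickly show whether the two bots are treated equally.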

Are there any quick fixes or workarounds for the rendering bug?

A quick fix to address the rendering bug involves aligning the access rules for both “Googlebot” and the Google AdSense bot (“Mediapartners-Google”) in your robots.txt file.

Ensuring both bots have the same level of access to your site’s content can mitigate rendering issues. This approach helps ensure that even if requests are misattributed in server access logs, page rendering works as expected.

Are my Server Access Logs affected?

Server Access Logs play a crucial role in diagnosing and understanding how web crawlers and bots interact with your website. These logs contain detailed records of every request made to your server, including those by Googlebot and the AdSense bot (“Mediapartners-Google”).

Even if your website is not affected by the rendering bug, the logs may contain misattributed requests. The consequence of this misattribution would be an inaccurate count of Googlebot requests: you would see more requests than there actually are. In your analysis, the number of Googlebot requests would be the sum of actual Googlebot requests plus the misattributed Google AdSense requests that use Googlebot as the User-Agent.

Can I use IP ranges to filter the Server Access Logs?

Google’s documentation details the IP ranges for verifying Googlebot and other Google crawlers, organising these ranges into multiple files.

This categorisation seemingly simplifies filtering processes for our use case: Googlebot IPs are classified as “Common Crawlers”, while Google AdSense IPs are deemed “Special Case Crawlers”. Initially, one might expect to filter Googlebot requests using the googlebot.json IP ranges and exclude those listed in special-crawlers.json.

However, the situation is more complex. The misattributed requests actually originate from genuine Googlebot IP addresses. It appears that the Google AdSense bot uses Googlebot’s infrastructure to crawl resources rather than just misattributing the User-Agent string.
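For completeness, here is what IP-range filtering would look like; as noted above, it does not separate out the misattributed requests, because they originate from genuine Googlebot addresses. The JSON below mirrors the structure of Google’s published files, but the prefixes and test IP are illustrative:

```python
import ipaddress

# Illustrative stand-in for Google's published googlebot.json.
googlebot_json = {
    "prefixes": [
        {"ipv4Prefix": "66.249.64.0/27"},
        {"ipv6Prefix": "2001:4860:4801:10::/64"},
    ]
}

def in_ranges(ip: str, data: dict) -> bool:
    """Return True if the IP falls inside any published prefix."""
    addr = ipaddress.ip_address(ip)
    for prefix in data["prefixes"]:
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if addr in ipaddress.ip_network(cidr):
            return True
    return False

print(in_ranges("66.249.64.5", googlebot_json))  # inside the sample range
print(in_ranges("8.8.8.8", googlebot_json))      # outside every range
```

Because the misattributed AdSense fetches pass this check just like real Googlebot requests, IP filtering alone cannot resolve the ambiguity.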

How can I fix the Server Access Logs for my analysis?

The most straightforward approach to verifying and filtering Server Access Logs is examining the request referrer URLs. Specifically, for requests identified with a Googlebot User-Agent, the presence of a referrer page that is blocked to Googlebot but accessible to the Google AdSense bot (‘Mediapartners-Google’) could indicate incorrect attribution.

This technique, however, is limited in its applicability. It does not yield reliable insights for paths that are accessible to both Googlebot and the Google AdSense crawlers, as these scenarios do not facilitate clear differentiation based on robots.txt rules. To have a comprehensive filtering method, more advanced solutions, such as our Search Engine Rendering Monitoring tool, are required.
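The referrer heuristic can be sketched as follows, again using a hypothetical robots.txt in which /members/ is blocked for Googlebot but open to Mediapartners-Google:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /members/ is blocked for Googlebot only.
robots_txt = """\
User-agent: Googlebot
Disallow: /members/

User-agent: Mediapartners-Google
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

def looks_misattributed(referrer: str) -> bool:
    """A Googlebot-UA request whose referrer page is blocked for Googlebot
    but open to Mediapartners-Google is likely an AdSense fetch."""
    return (not rp.can_fetch("Googlebot", referrer)
            and rp.can_fetch("Mediapartners-Google", referrer))

print(looks_misattributed("https://example.com/members/article-1"))  # True
print(looks_misattributed("https://example.com/blog/article-2"))     # False
```

As the section notes, this only catches requests whose referrer happens to sit behind a Googlebot-only block; anything on a path open to both bots stays ambiguous.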

We would like to thank Aleyda Solis (LinkedIn, X/Twitter), Barry Adams (LinkedIn, X/Twitter), and Jes Scholz (LinkedIn, X/Twitter) for their thorough peer review of this article. Their experience and insightful suggestions have enhanced the depth and clarity of our analysis, allowing us to highlight key aspects and decisions made during the writing process for a more coherent and impactful delivery.

The post Discovering and Diagnosing a Google AdSense Rendering Bug appeared first on Merj.
