The cost of crawling a billion pages in 2025

Last week I read an article where the author crawled 1B pages on AWS in just 25.5 hours; after performance tuning the cost came to US$462: "Crawling a billion web pages in just over 24 hours, in 2025 (via)".

The author references a 2012 post, "How to crawl a quarter billion webpages in 40 hours", which took about 39.5 hours and US$580 to crawl 250M pages:

For some reason, nobody’s written about what it takes to crawl a big chunk of the web in a while: the last point of reference I saw was Michael Nielsen’s post from 2012.
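Putting the two runs side by side, the throughput and cost-efficiency gains are easy to quantify. A back-of-the-envelope sketch using only the figures quoted above:

```python
# Back-of-the-envelope comparison of the 2012 and 2025 crawls,
# using only the numbers quoted in the two articles.
runs = {
    "2012": {"pages": 250e6, "hours": 39.5, "cost_usd": 580},
    "2025": {"pages": 1e9,   "hours": 25.5, "cost_usd": 462},
}

for year, r in runs.items():
    pages_per_sec = r["pages"] / (r["hours"] * 3600)
    pages_per_usd = r["pages"] / r["cost_usd"]
    print(f"{year}: {pages_per_sec:,.0f} pages/s, {pages_per_usd:,.0f} pages/US$")

# 2012: ~1,758 pages/s, ~431,034 pages/US$
# 2025: ~10,893 pages/s, ~2,164,502 pages/US$
```

Roughly a 6x throughput improvement and 5x more pages per dollar.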

This run is HTML only (no JavaScript); the 2012 article doesn't say explicitly, but presumably it ignored JavaScript as well:

HTML only. The elephant in the room. Even by 2017 much of the web had come to require JavaScript. But I wanted an apples-to-apples comparison with older web crawls, and in any case, I was doing this as a side project and didn’t have time to add and optimize a bunch of playwright workers.

The article raises a few interesting problems. One is that parsing is surprisingly CPU-intensive and becomes one of the bottlenecks, mainly because pages are much bigger now than in 2012: both the median and the mean have grown a lot:

Profiles showed that parsing was clearly the bottleneck, but I was using the same lxml parsing library that was popular in 2012 (as suggested by Gemini). I eventually figured out that it was because the average web page has gotten a lot bigger: metrics from a test run indicated the P50 uncompressed page size is now 138KB, while the mean is even larger at 242KB - many times larger than Nielsen’s estimated average of 51KB in 2012!

His fix was to swap the parsing library from lxml to selectolax, which brought a huge performance gain:

I switched from lxml to selectolax, a much newer library wrapping Lexbor, a modern parser in C++ designed specifically for HTML5. The page claimed it can be 30 times faster than lxml. It wasn’t 30x overall, but it was a huge boost.

The other issue is the HTTPS handshake overhead:

That said, one part of fetching got harder: a LOT more websites use SSL now than a decade ago. This was crystal clear in profiles, with SSL handshake computation showing up as the most expensive function call, taking up a whopping 25% of all CPU time on average, which - given that we weren’t near saturating the network pipes, meant fetching became bottlenecked by the CPU before the network!
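The handshake cost itself can't be optimized away client-side, but per-connection setup can at least be amortized: a single shared SSLContext loads CA certificates once, and keeping connections alive per host (with TLS session resumption where the server supports it) avoids repeating full handshakes. A stdlib-only sketch of the shared-context part (the pooling logic is left out):

```python
import ssl

# One shared context for all worker connections: certificate stores are
# loaded once instead of per connection, and resumable TLS sessions can
# skip the expensive full handshake on reconnects to the same host.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # skip obsolete protocol negotiation

# Each fetch would then wrap its socket with ctx.wrap_socket(...),
# reusing the context (and ideally the connection) per host.
print(ctx.verify_mode)
```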

Chrome's telemetry shows how fast HTTPS adoption grew; the key moment here is October 2015, when Let's Encrypt had its intermediate cross-signed by IdenTrust so that existing browsers would trust Let's Encrypt certificates.

So with today's hardware and tooling, acquiring the raw data is no longer much of a problem (in fact it was already feasible back in 2012...).

MariaDB 12.3's big improvement to InnoDB

Saw Mark Callaghan's posts "MariaDB innovation: binlog_storage_engine" and "MariaDB innovation: binlog_storage_engine, small server, Insert Benchmark", which describe a very low-level architectural change to InnoDB in MariaDB 12.3 (not yet GA): the previous binlog + InnoDB arrangement is consolidated into InnoDB alone, which in turn reduces the number of fsync calls:

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).

This is a long-standing issue in MySQL's InnoDB architecture: writing to both the binlog and InnoDB requires two fsync calls per commit, and fsync itself has high latency (it has to wait for the I/O to complete).
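If the posts' description holds, turning this on should just be a server option. A hypothetical my.cnf fragment (the option name comes from the quote above; the `innodb` value syntax is my assumption, so check the MariaDB 12.3 documentation before relying on it):

```ini
[mariadbd]
# Store the binlog inside InnoDB instead of raw files (MariaDB 12.3+),
# cutting fsync calls per commit from 2 to 1 by collapsing the two
# resource managers (binlog, InnoDB) into one.
binlog_storage_engine = innodb
```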

Testing on his own machines, Mark Callaghan found the improvement even bigger than expected, whether the workload is CPU-bound or IO-bound:

tl;dr for a CPU-bound workload

* Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2.
* When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 1.75X larger for l.i0 (load) and 4X or more larger on the random write steps (l.i1, l.i2)

tl;dr for an IO-bound workload

* Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2. It also has a large impact on qp1000, which is the most write-heavy of the query+write steps.
* When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 4.74X larger for l.i0 (load), 1.50X larger for l.i1 (random writes) and 2.99X larger for l.i2 (random writes)

If big servers see a similar effect, this could be InnoDB's biggest write-performance improvement in the last decade? The previous one was probably compression...

Fabrice Bellard has built yet another JavaScript engine, focused on extremely low memory usage

Saw on Lobsters that Fabrice Bellard has produced yet another JavaScript engine, "MicroQuickJS", whose selling point is extremely low memory usage:

MicroQuickJS (aka. MQuickJS) is a Javascript engine targetted at embedded systems. It compiles and runs Javascript programs with as low as 10 kB of RAM.

The supported syntax is small, but it implements most of the ES5 spec, presumably so that code can be transpiled down to it with Babel:

MQuickJS only supports a subset of Javascript close to ES5. It implements a stricter mode where some error prone or inefficient Javascript constructs are forbidden.

On performance, the stated comparison is against his own QuickJS (note that QuickJS supports up to ES2023); I'd read "comparable" as "good enough": on a memory-starved embedded system, speed won't be great anyway, so expect more complex packages to run slowly:

The speed is comparable to QuickJS.

Kagi has sorted out Taiwan's taxes, so billing can resume...

Received an email from Kagi, "Taiwan Subscription Update – Action Recommended", announcing that the Taiwan tax situation has been sorted out, so charging can resume:

Earlier this year, our payment processor temporarily paused payments from Taiwan while working to comply with local tax requirements. We're happy to share that payments have now resumed, and your subscription will be billed normally going forward. We genuinely appreciate you sticking with us.

Due to local tax requirements, you may see an approximate 5% increase in your subscription amount. This reflects the applicable taxes, not a change to our pricing.

I mentioned this problem back in March, and it took more than half a year to resolve: "Kagi's Taiwan payment problem still doesn't look resolved...".

So that works out to roughly nine months of free usage?

Servo releases 0.0.1

News from Servo: version 0.0.1 is out, the project's first release: "Servo 0.0.1 Release (via)".

Servo is a browser engine built with Rust:

Servo is an experimental browser engine designed to take advantage of the memory safety properties and concurrency features of the Rust programming language.

The description on GitHub emphasizes embedding use cases:

Servo aims to empower developers with a lightweight, high-performance alternative for embedding web technologies in applications.

It gives the currently all-Blink + V8 ecosystem another option, though the project's development capacity is still an open question... but at least 0.0.1 is out the door.

PHP proposes relicensing to BSD-3-Clause

While in Tokyo I read "PHP RFC: PHP License Update"; it took quite a while to get through the whole thing, and the other two topics it raised are written up separately.

This proposal wants to change PHP's software license to BSD-3-Clause, one of the OSI-approved BSD licenses, which has very good compatibility with other licenses.

The more important part is the "Change Authority" section, which discusses whose consent is needed.

The two licenses currently in use (the PHP License and the Zend Engine License) both contain an update clause, so the relicensing can take this path (loophole?): it only needs the consent of the bodies the two licenses designate:

The PHP Group may publish revised and/or new versions of the license from time to time. Each version will be given a distinguishing version number. Once covered code has been published under a particular version of the license, you may always continue to use it under the terms of that version. You may also choose to use such covered code under the terms of any subsequent version of the license published by the PHP Group. No one other than the PHP Group has the right to modify the terms applicable to covered code created under this License.

Zend Technologies Ltd. may publish revised and/or new versions of the license from time to time. Each version will be given a distinguishing version number. Once covered code has been published under a particular version of the license, you may always continue to use it under the terms of that version. You may also choose to use such covered code under the terms of any subsequent version of the license published by Zend Technologies Ltd. No one other than Zend Technologies Ltd. has the right to modify the terms applicable to covered code created under this License.

Within PHP itself, code not under these two licenses is the minority; for those parts, the copyright owners can be contacted individually to agree to BSD-3-Clause, or the original license can be kept if it's compatible (like the MIT License, which is common in recent years); if neither works, a rewrite should be arranged.

Back to the two bodies. The PHP Group's bylaws don't say how many members must agree in a case like this, so the strictest standard is being applied, and it looks like unanimous consent has already been obtained:

Depending on the bylaws adopted by the PHP Association (as discussed earlier in Zend and the PHP Association), we may require approval from one or more representatives of the PHP Group to accept this proposal. There is no public record of the association's bylaws, so unless the bylaws specify a quorum, we will need approval from each of:

As for Zend, it's now a unit under Perforce, so that side has to be discussed with Perforce, but consensus appears to have been reached there too; what remains is the formal authorization paperwork:

Note: Legal representatives of Perforce Software have informally approved this proposal. The next step is a formal approval, in writing.

The current target is to make the switch in PHP 9; the groundwork looks mostly done, so it seems quite likely to happen...

Zend's many changes of ownership

Still related to the "PHP RFC: PHP License Update" post: it was the Zend and Perforce material in it that made me notice Zend has already changed hands several times.

In October 2015 it was acquired by Rogue Wave Software:

In October 2015, Louisville, Colorado-based software developer Rogue Wave Software acquired Zend.

In January 2019 Rogue Wave Software was itself acquired by Perforce:

In January 2019, Rogue Wave Software was acquired by Minneapolis, Minnesota-based application software developer Perforce.

Then in April 2019 Zend Framework was spun off to the Linux Foundation and renamed Laminas:

In April, Perforce's Zend Framework was spun off as a separate project to the Linux Foundation, and was renamed Laminas.

So the owner of Zend Engine is still Perforce, and relicensing PHP itself from the PHP License and Zend Engine License to BSD-3-Clause requires Perforce's consent.

Kagi's celebration after reaching 50k subscribers

A few days ago in "Kagi passes 50k subscribers" I mentioned Kagi would probably run some kind of celebration; it went out yesterday: "Celebrating 50K users with Kagi free search portal, Kagi for libraries, and more...".

The most important item looks to be the Kagi Search portal, which lets people use the search for free, a bit like a free tier; I'm not sure how the quota will be counted though. Per-IP quota plus cookie-based?

This means anyone can experience Kagi Search without signing up, with 50 free searches directly from our homepage and 100 more after creating an account.

The feature isn't live yet; something to look at once it rolls out:

Note: the Kagi Search portal will be rolled out in phases, per region. We expect the full rollout to be completed within two weeks from now.

The next target is 100k users:

And we can’t wait to prepare another surprise for you, at the 100,000 members mark!

Going from 20k to 50k took over a year, so maybe another year for this one? By then we can check whether SERP quality has dropped... (e.g. whether SEO spam has started targeting it.)

Kagi passes 50k subscribers

Saw the news "Kagi Reaches 50k Users (kagi.com)" on Hacker News; this was one of the goals Kagi had set for itself early on.

A while ago I tested it against DuckDuckGo and Brave Search, and the quality gap was substantial (Kagi was much better), though many people suspect that's simply because SEO spammers haven't targeted it yet; it's still a very niche service after all.

Wikipedia happens to mention that at the end of March it was around 43k:

Kagi had around 43,403 subscribed members as of March 28, 2025 and 845,200 searches were made that day.

A bit over two months later it's at 50k. Naively pricing everyone at the unlimited plan gives roughly $500k/mo in revenue (the plans are $5/mo, $10/mo and $25/mo); I remember at the 20k mark they said revenue only just covered the infrastructure costs: "Kagi passes 20k subscriptions".

Last week's "Kagi status update: First three years" already hinted that something would be announced after hitting 50k:

As of writing this, we are at almost 50,000 customers! You know what that means - there will soon be a Kagi surprise!

So we wait for the news in the coming days?

Firefox can now add custom search engines...?

Only yesterday I was complaining that Firefox has the UI for adding search engines implemented but inaccessible by default ("How to add your own search engines to Firefox"), and then Outvi V left a comment below saying support is coming:

Coincidentally it was enabled by default just yesterday: https://bugzilla.mozilla.org/show_bug.cgi?id=1967739

The Bugzilla entry is "Enable browser.urlbar.update2.engineAliasRefresh by default"; it landed about a week ago, so next comes Developer Edition and Beta, and finally the release channel? With luck we'll see it in a few months...

Fi... nally...?