Kafka 101

This is a guest article by Stanislav Kozlovski, an Apache Kafka Committer. If you would like to connect with Stanislav, you can do so on Twitter and LinkedIn.

Originally developed in LinkedIn during 2011, Apache Kafka is one of the most popular open-source Apache projects out there. So far it has had a total of 24 notable releases and most intriguingly, its code base has grown at an average rate of 24% throughout each of those releases.

Kafka is a distributed streaming platform serving as the internet’s de-facto standard for real-time data streaming.

Its development is path-dependent on the problems LinkedIn hit at the time. As they were one of the first companies to hit large scale distributed systems problems, they noticed the common problem of uncontrolled microservice proliferation:

To fix the growing complexity of service-to-service and persistence store permutations, they opted to develop a single platform which can serve as the source of truth.

Apache Kafka is a distributed system that was meant to solve the service coordination problem. Its vision is to be used as the central nervous system inside a company - somewhere where data goes, is processed/transformed and is consumed by other downstream systems (data warehouses, indices, microservices, etc.).

Therefore, it is optimized for accommodating large throughput (millions of messages a second) while storing a lot of data (terabytes).

The Log

The data in the system is stored in topics. The fundamental basis of a topic is the log - a simple ordered data structure which stores records sequentially.
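
A minimal sketch of such a log (illustrative Python, not Kafka's actual storage code) makes these properties easy to see:

```python
class Log:
    """A minimal append-only log: records are stored sequentially and
    addressed by a monotonically increasing offset."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        """O(1) write to the tail; returns the new record's offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """O(1) positioning; reads a linear chunk starting at `offset`."""
        return self._records[offset:offset + max_records]

log = Log()
for payload in ("a", "b", "c"):
    log.append(payload)
assert log.read(1) == ["b", "c"]
```

Appends never rewrite existing data, which is why concurrent readers never conflict with the writer.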

The log underpins a lot of Kafka’s fundamental properties, so it is prudent for us to focus more on it.

It’s immutable and has O(1) writes and reads (as long as they’re from the tail or head). Therefore the speed of accessing its data doesn’t degrade the larger the log gets and, due to its immutability, it’s efficient for concurrent reads.

Beyond these benefits, perhaps the chief reason the log was chosen for Kafka is that it is optimized for HDDs.

HDDs are very efficient at linear reads and writes, and due to the log’s structure, linear reads/writes are the main operations you perform on it!

As we covered in our S3 article, HDDs have become 6,000,000,000 times cheaper (inflation-adjusted) per byte since their inception. Kafka’s architecture is optimized for a cost-efficient on-premise deployment of a system that stores a lot of data while also being very performant!

Performance

A well-optimized on-premise Kafka deployment usually ends up being bottlenecked on the network, which is to say that it scales to many gigabytes per second of read and write throughput.

How does it achieve such performance? There are multiple optimizations - some macro and others micro.

Persistence to Disk

Kafka actually stores all of its records to disk and doesn’t keep anything explicitly in memory.

Kafka’s protocol groups messages together into batches, which reduces per-message network overhead.

The server, in turn, persists a chunk of messages in one go - a linear HDD write. Consumers then fetch large linear chunks at once.

Linear reads/writes on a disk can be fast. HDDs are commonly described as slow because they are slow when you perform numerous disk seeks - you’re bottlenecked on the physical movement of the drive’s head to each new location. With a linear read/write, this isn’t a problem: you continuously read/write data as the head moves.

Going a step further - said linear operations are heavily optimized by the OS.

Read-ahead optimizations prefetch large block multiples before they’re requested and store them in memory, resulting in the next read not touching the disk.

Write-behind optimizations group small logical writes into big physical writes - Kafka does not use fsync; its writes get written to disk asynchronously.

Pagecache

Modern OSes cache the disk in free RAM. This is called pagecache.

Since Kafka stores messages in a standardized binary format that is unmodified throughout the whole flow (producer ➡ broker ➡ consumer), it can make use of the zero-copy optimization.

Zero-copy, somewhat misleadingly named, is when the OS copies data from the pagecache directly to a socket, effectively bypassing Kafka’s JVM entirely. There are still copies of the data being made - but they’re reduced. This saves you a few extra copies and user <-> kernel mode switches.
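
The kernel primitive behind this is `sendfile`, which Kafka's JVM reaches via `FileChannel.transferTo`. A rough, Linux-specific sketch of what that call does (a second file stands in for the client's socket here - this is an illustration, not Kafka's actual code path):

```python
import os
import tempfile

# Write a fake "log segment" to disk.
segment = tempfile.TemporaryFile()
segment.write(b"record-batch-bytes")
segment.flush()

# os.sendfile asks the kernel to move the bytes directly from the
# pagecache to the destination, without the data ever being copied
# into this process's user-space buffers.
destination = tempfile.TemporaryFile()
sent = os.sendfile(destination.fileno(), segment.fileno(), 0, 18)

destination.seek(0)
assert sent == 18
assert destination.read() == b"record-batch-bytes"
```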

While it sounds cool, it’s unlikely the zero-copy plays a large role in optimizing Kafka due to two main reasons - first, CPU is rarely the bottleneck in well-optimized Kafka deployments, so the lack of in-memory copies doesn’t buy you a lot of resources.

Secondly, SSL/TLS encryption (a must for all production deployments) already prohibits Kafka from using zero-copy, because the messages are modified along the path. Despite this, Kafka still performs.

Back to Basics

The nodes in the distributed system are called brokers.

Every topic is split into partitions, and the partitions themselves are replicated N times (according to the replication factor) into N replicas for durability and availability purposes.

A simple analogy: just as the basic storage unit in an operating system is a file, the basic storage unit in Kafka is a replica (of a partition).

Each replica is nothing more than a few files itself, each of which embodies the log data structure; together they sequentially form a larger log. Each record in the log is denoted by a specific offset, which is simply a monotonically-increasing number.

The replication is leader-based, which is to say that only a single broker leads a certain partition at a time.

Every partition has a set of replicas (called the “replica set”). A replica can be in two states - in-sync or out-of-sync. As the name suggests, out-of-sync replicas are ones that don’t have the latest data for the partition.

Writes

Writes can only go to that leader, which then asynchronously replicates the data to the N-1 followers.

Clients that write data are called producers. Producers can configure the durability guarantees they want to have during writes via the “acks” property which denotes how many brokers have to acknowledge a write before the response is returned to the client.

-   acks=0 - the producer won’t even wait for a response from the broker, it immediately considers the write successful

-   acks=1 - a response is sent to the producer when the leader acknowledges the record (persists it to disk).

-   acks=all (the default) - a response is sent to the producer only when all of the in-sync replicas persist the record.

To further control the acks=all property and ensure it doesn’t regress to an acks=1 property when there is only one in-sync replica, the `min.insync.replicas` setting exists to denote the minimum number of in-sync replicas required to acknowledge a write that’s configured with `acks=all`.
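
To make the trade-off concrete, here is an illustrative set of settings (plain Python dicts rather than a real client call; the parameter names follow Kafka's producer and topic configs):

```python
# Producer-side durability settings (illustrative values).
producer_config = {
    "bootstrap.servers": "broker1:9092",  # hypothetical broker address
    "acks": "all",   # wait for every in-sync replica to persist the write
    "retries": 5,
}

# Topic-level guard: with a replication factor of 3 and
# min.insync.replicas=2, an acks=all write needs at least 2 in-sync
# replicas, otherwise the broker rejects it rather than silently
# degrading to the durability of acks=1.
topic_config = {
    "replication.factor": 3,
    "min.insync.replicas": 2,
}

assert producer_config["acks"] == "all"
assert topic_config["min.insync.replicas"] < topic_config["replication.factor"]
```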

Reads

Clients that read data are called consumers. Similarly, they’re client applications that use the Kafka library to read data from the cluster and do some processing on it.

Kafka Consumers have the ability to read from any replica, and are typically configured to read from the closest one in the network topology.

Consumers form so-called consumer groups, which are simply a bunch of consumers who are logically grouped and synchronized together. They synchronize through the broker - they are not connected to one another. They persist their progress (up to what offset they’ve consumed from any given partition) in a particular partition of a special Kafka topic called `__consumer_offsets`. The broker that is the leader of that partition acts as the so-called Group Coordinator for the consumer group, and it is this Coordinator that is responsible for maintaining the consumer group membership and liveness.

The records in any single partition are ordered within it, and consumers are guaranteed to read them in that order. To ensure the order is preserved, the consumer group protocol guarantees that no two consumers in the same consumer group can read from the same partition.
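
That partition-exclusivity rule can be sketched with a toy assignment function (illustrative only - the real group protocol supports several pluggable assignor strategies):

```python
def assign_round_robin(partitions, consumers):
    """Sketch of a round-robin partition assignment: each partition goes
    to exactly one consumer in the group, so per-partition ordering is
    preserved, while the load spreads across the group."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

# Four partitions split across a group of two consumers.
assert assign_round_robin([0, 1, 2, 3], ["c1", "c2"]) == {
    "c1": [0, 2],
    "c2": [1, 3],
}
```

Adding a third consumer to this group would shrink each member's share; adding a fifth would leave one consumer idle, since partitions cannot be shared.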

There can be many different consumer groups reading from the same topic.

One core reason that made Kafka win over the traditional message bus technologies was precisely this decoupling of producer and consumer clients. In some systems, the messages would be deleted the moment they’re consumed, which creates coupling. If the consumers were slow, the system may risk running out of memory and impacting the producers. Kafka doesn’t suffer from this issue because it persists the data to disk, and due to the aforementioned optimizations, it can still remain performant.

Protocol

Clients connect to the brokers via TCP using the Kafka protocol.

The producer/consumer clients are simple Java libraries that implement the Kafka protocol. Implementations exist in virtually every other language out there.

Fault Tolerance

An Apache Kafka cluster always has one broker who is the active Controller of the cluster. The Controller supports all kinds of administrative actions that require a single source of truth, like creating and deleting topics, adding partitions to topics, and reassigning partition replicas.

Its most impactful responsibility is handling leader election for each partition. Because all the centralized metadata about the cluster is processed by the Controller, it decides when and to what broker a partition’s leadership changes. This is most notable in failover cases where a leader broker dies, or even when it is shutting down. In both of these cases, the Controller reacts by gracefully switching the partition leadership to another broker in the replica set.

Consensus

Any distributed system requires consensus - the act of picking exactly one broker to be the controller at any given time is fundamentally a distributed consensus problem.

Kafka historically outsourced consensus to ZooKeeper. When starting the cluster up, every broker would race to register the `/controller` zNode and the first one to do so would be crowned the controller. Similarly, when the current Controller died - the first broker to register the zNode subsequently would be the new controller.

Kafka used to persist all sorts of metadata in ZooKeeper, including the alive set of brokers, the topic names and their partition count, as well as the partition assignments.

Kafka also used to heavily leverage ZooKeeper’s watch mechanism, which would notify a subscriber whenever a certain zNode changed.

For the last few years, Kafka has actively been moving away from ZooKeeper towards its own consensus mechanism called KRaft (“Kafka Raft”).

It is a dialect of Raft with a few differences, heavily influenced by Kafka’s existing replication protocol. Put simply, it extends the Kafka replication protocol with a few Raft-related features.

A key realization is that the cluster’s metadata can be easily expressed in a regular log through the ordered record of events that happened in the cluster. Brokers could then replay these events to build up to the latest state of the system.
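
A toy sketch of this event-replay idea (the event shapes here are invented for illustration, not Kafka's actual metadata record format):

```python
# Hypothetical metadata events, replayed in order to rebuild cluster state.
metadata_log = [
    {"type": "TopicCreated", "topic": "orders", "partitions": 2},
    {"type": "LeaderChanged", "topic": "orders", "partition": 0, "leader": 1},
    {"type": "LeaderChanged", "topic": "orders", "partition": 1, "leader": 2},
    {"type": "LeaderChanged", "topic": "orders", "partition": 0, "leader": 3},
]

def replay(log):
    """Fold the ordered event log into the latest cluster state."""
    state = {}
    for event in log:
        if event["type"] == "TopicCreated":
            state[event["topic"]] = {p: None for p in range(event["partitions"])}
        elif event["type"] == "LeaderChanged":
            state[event["topic"]][event["partition"]] = event["leader"]
    return state

# Later events win: partition 0's leader ends up as broker 3.
assert replay(metadata_log) == {"orders": {0: 3, 1: 2}}
```

A broker that has replayed the log up to a given offset knows exactly the same state as any other broker at that offset, which is what makes the metadata topic a viable source of truth.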

In this new model, Kafka has a quorum of N controllers (usually 3). These brokers host a special topic called the metadata topic (“__cluster_metadata”).

This topic has a single partition whose leader election is managed by Raft (as opposed to the Controller for every other topic). The leader of the partition becomes the currently active Controller. The other controllers act as hot standbys, storing the latest metadata in memory.

All regular brokers replicate this topic too. Instead of having to directly communicate to the controller, they asynchronously update their metadata simply by keeping up with the topic’s latest records.

KRaft supports two modes of deployment - combined and isolated mode. Combined is similar to the model under ZooKeeper, where a broker can serve the role of both a regular broker and a controller at once. Isolated is when the controllers are deployed solely as controllers and serve no other function besides that.

The first Kafka release that featured a production-ready KRaft version was Kafka 3.3, released in October 2022. ZooKeeper is set to be completely removed in the next major Kafka release - 4.0 (expected around Q3 2024).

Tiered Storage

As mentioned earlier, Kafka’s architecture was optimized for a cost-efficient on-premise deployment. Since then, the proliferation of cloud has certainly changed the way we architect software.

One of the architectural choices that became glaringly suboptimal once Kafka took off was its decision to colocate the storage with the broker. Brokers host all of their data on local disk, which brings a few challenges with it, especially at scale.

First, as Kafka is meant to be the central nervous system of a company’s architecture, it’s not uncommon to want to have brokers that store 3TB of historical data on them - this results in 9TB total assuming the default replication factor of 3.

When a broker has close to 10TB of data locally, you start to see issues when things go wrong.

One evident problem is handling ungraceful shutdowns - when a broker recovers from an ungraceful shutdown, it has to rebuild all the local log index files associated with its partitions in a process called log recovery. With a 10TB disk, this can take hours if not days in certain cases.

Another problem is historical reads. Kafka’s performance relies heavily on the assumption that consumers are reading from the tail of the log, which in practice means they’re reading from memory due to the pagecache containing the latest produced data. If a consumer fetches historical data, this usually forces the Kafka broker to read it from the HDD. HDDs have historically been stuck at 120 IOPS for a long time, which is to say that it’s really easy to exhaust that resource. This means the consumers compete with the producers for IOPS and once that gets depleted, performance tanks.

The IOPS problem becomes amplified during hard failure scenarios. If a broker has a hard failure of its disk, it starts up with an empty disk and has to replicate all that 10TB of data from scratch. This process can take up to a day itself depending on the free bandwidth, and during that time said broker is issuing historical reads on many other brokers. One such failure amplifies into a lot of historical reads - the impact can be much more severe if a whole availability zone experiences a hard failure.

The final case where the amount of data becomes problematic is in rebalancing scenarios. Kafka allows you to reassign the replicas of any partition - and that involves moving all of the data along with it.

As a simple example, take a partition that has a replica set of brokers [0,1,2]. Usually, the first replica is the leader - hence broker 0 is leading that partition. If you want to introduce new replicas, they start off as out-of-sync replicas and have to read all of the partition’s data from the leader (broker 0) before becoming in-sync replicas.

If you add more nodes to your Kafka cluster, for example, you have to reassign some subset of partition replicas to those new brokers, otherwise they would remain empty. The reassignment process has the receiving broker copy all of the data for the given replicas, which not only can use up precious IOPS itself due to its historical nature, but also take a very long time.

Apache Kafka is solving all of these issues by introducing a feature called Tiered Storage - the simple idea of storing most of the data in a remote object store (e.g. S3). While still in Early Access, Kafka now has two tiers of storage - the hot local storage and the cold remote storage - both abstracted away seamlessly.

In this new mode, leader brokers are responsible for tiering the data into the object store. Once tiered, both leader and follower brokers can read from the object store to serve historical data.

This feature nicely solves all the aforementioned problems as brokers no longer need to copy massive amounts of data and historical reads no longer exhaust IOPS. Tests from development showed a 43% producer performance improvement when historical consumers were present. Depending on the object store, this can result in saving cost too as you’re outsourcing the replication and durability guarantees.

Auxiliary Systems

Rebalancing

Reassigning partitions is a key necessity in any Kafka cluster that’s seeing non-trivial usage.

Since it’s a distributed system with varying client workloads, the system can easily develop hot spots or inefficient resource distribution throughout its lifecycle.

To alleviate this, Kafka exposes a low-level API that allows you to reassign partitions. Exposing the functionality is the easy part - the hard part is deciding what to move where.

At heart this is the NP-hard Bin Packing problem, and the community has developed a few tools and even a fully-fledged component to handle it.

Cruise Control, originally also developed at LinkedIn, is an open-source component which reads all brokers’ metrics from a Kafka topic, builds an in-memory model of the cluster and runs that model through a greedy heuristic bin-packing algorithm to optimize the model via reassigning partitions. Once it has computed a more efficient model, it begins incrementally applying it to the cluster by leveraging the low-level Kafka reassignment API.

Without going into too much detail, Cruise Control exposes a configurable set of rebalancing logic consisting of multiple `Goal`s, each of which is run with its associated priority and balances its associated resource.
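
The flavor of such a greedy heuristic can be sketched in a few lines (this is in the spirit of Cruise Control's optimizer, not its actual algorithm - broker ids, partition names, and load units are invented for illustration):

```python
def plan_moves(broker_partitions):
    """Greedy rebalance sketch: repeatedly move the smallest partition
    from the most-loaded broker to the least-loaded one, while doing so
    still narrows the load gap. `broker_partitions` maps broker id to a
    dict of partition name -> load (e.g. bytes or throughput)."""
    moves = []

    def load(broker):
        return sum(broker_partitions[broker].values())

    while True:
        hot = max(broker_partitions, key=load)
        cold = min(broker_partitions, key=load)
        if hot == cold or not broker_partitions[hot]:
            break
        partition, size = min(broker_partitions[hot].items(), key=lambda kv: kv[1])
        if load(hot) - size < load(cold) + size:
            break  # moving would overshoot; the plan is as balanced as it gets
        del broker_partitions[hot][partition]
        broker_partitions[cold][partition] = size
        moves.append((partition, hot, cold))
    return moves

# One broker holds everything; the plan moves half of the load over.
moves = plan_moves({0: {"t-0": 10, "t-1": 10}, 1: {}})
assert moves == [("t-0", 0, 1)]
```

The real component additionally weighs multiple resources (disk, network, CPU, leader count) against per-`Goal` priorities, which is what makes the problem genuinely hard.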

Cruise Control continuously monitors the cluster’s metrics and automatically triggers a rebalance once it notices the metrics going outside of its defined acceptable thresholds.

Notably, Cruise Control also exposes an API to allow you to easily add brokers to or remove brokers from a cluster. Because Kafka brokers are stateful (even with Tiered Storage), both of these operations require an operator to move replicas around.

Kafka Connect

If Kafka is to be the center of your event-driven architecture, you’re likely to:

  • have a lot of systems whose data you’d like to get into Kafka (sources)
  • have a lot of systems where you’d like to move data into from Kafka (sinks)

It’s likely that a lot of these systems are popular, widely-adopted ones - things like ElasticSearch, Snowflake, PostgreSQL, BigQuery, MySQL, etc.

Part of the Apache open-source project, Kafka Connect is a generic framework that allows you to integrate Kafka with other systems in a plug-and-play way that is reusable by the community.

The Kafka Connect runtime can be deployed in two modes:

  • Standalone Mode — a single node, used mainly for development, testing, or small-scale data loading.
  • Distributed Mode — a cluster of nodes that work in tandem to share the load of ingesting data.

Each node in Connect is called a Connect Worker. A worker is essentially a container that executes plugin code.

Community members develop battle-tested plugins that ensure fault tolerance, exactly-once processing, ordering and other invariants that would be cumbersome and time-consuming to develop from scratch. The name for such a plugin is a Connector — a ready-to-use library, deployed on a Connect Worker, for importing data into Kafka topics or exporting it to other external systems.

Workers heavily leverage internal Kafka topics to store their configuration, status, and checkpoint their progress (offsets).

They also leverage Kafka’s existing Consumer Group protocol to handle worker failures as well as propagate task assignments.

Users install plugins on the workers and use a REST API to configure/manage them. Once configured, a Connector connects Kafka (a Kafka topic) with an external system.
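
As a sketch of what that configuration looks like, here is a hypothetical sink-connector payload as it would be POSTed to the Connect REST API (the connector name, topic, and URL are invented; the property names follow the common Connect conventions):

```python
# Hypothetical payload for POST /connectors on a Connect worker.
connector_config = {
    "name": "orders-to-elasticsearch",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "orders",                          # source Kafka topic
        "connection.url": "http://elasticsearch:9200",
        "tasks.max": "4",  # Connect splits the work into up to 4 parallel tasks
    },
}

assert connector_config["config"]["topics"] == "orders"
```

The worker stores this configuration in an internal Kafka topic, so any worker in the cluster can pick up the connector's tasks after a failure.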

The Connector code handles all the complex details around the data exchange so that users can focus on the simple configuration and integration. The code creates tasks for each worker to move data in parallel. These Connectors come in two flavors:

  • Source Connector — used when sourcing data from another system (the source) and writing it to Kafka.
  • Sink Connector — used when sourcing data from Kafka and writing it to another system (the sink).
[Diagram: two Connect clusters running Source connectors (MongoDB, PostgreSQL) feeding Kafka, and a third cluster running Sink connectors into Snowflake]

In the diagram above we see two Source connectors running in two separate Connect clusters, each with its own workers, ingesting MongoDB/PostgreSQL data into Kafka.

A separate Connect cluster running Sink connectors then takes said data from Kafka and ingests it into Snowflake.

Kafka Streams

A stream processor is usually a client that reads the continuous streams of data from its input topics, performs some processing on this input, and produces a stream of data to output topics (or external services, databases, etc). In Kafka, it’s possible to do simple processing directly with the regular producer/consumer APIs; however, for more complex transformations like joining streams together, Kafka provides an integrated Streams API library.

Another part of the Apache open-source project, Kafka Streams is a client library that exposes a high-level API for processing, transforming and enriching data in real time. 

This API is intended to be used within your own codebase — it’s not running on a broker. Unlike other stream processing frameworks, this one is native to Kafka and therefore doesn’t require a separate complex deployment strategy - it is deployed as a regular Kafka client, usually within your application.

It works similarly to the consumer API and helps you scale out the stream processing work over multiple applications (similar to consumer groups).

Most notably, it supports exactly-once processing semantics when its input and output is a Kafka topic.
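
Kafka Streams itself is a Java library, but the shape of a stateful streaming transformation it performs can be sketched in a few lines of Python:

```python
def word_counts(stream):
    """Sketch of a stateful transformation: consume records, update a
    running state store, and emit a changelog of updated counts - the
    same shape as a Kafka Streams count() over a grouped stream."""
    counts = {}
    for record in stream:
        for word in record.split():
            counts[word] = counts.get(word, 0) + 1
            yield (word, counts[word])

out = list(word_counts(["hello kafka", "hello streams"]))
assert out == [("hello", 1), ("kafka", 1), ("hello", 2), ("streams", 1)]
```

In the real library, the input would be a Kafka topic, the `counts` state would live in a fault-tolerant state store backed by a changelog topic, and the output would be produced to another topic.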

Optimizations / Future Work

While Kafka can seem like old software (it was first released in 2011!), the community is actively innovating on top of the protocol.

At this point, evidence suggests that the future will see everybody standardizing on the Kafka API and competing on the underlying implementation.

The current leader in the space is Confluent, founded by the original creators of Kafka, who have developed a cloud-native Kafka engine called Kora.

Notable competitors include RedPanda, which re-wrote Kafka in C++ and WarpStream, which innovated with a new architecture that leverages S3 heavily, completely avoiding replication and broker statefulness.

Vendors are largely competing in the cloud today - many offer a Kafka SaaS with varying levels of support. Some vendors offer a proper serverless SaaS experience by abstracting a lot of the details away, while others still require users to understand the details of the system and in some cases manage a large part of it themselves. 

In summary, Kafka is mature, widely-adopted software that provides a rich set of features.

It is open source and boasts a very healthy community that continues to innovate stronger than ever after 13 years of development.

Capturing A Billion Emo(j)i-ons

This blog post was written by Dedeepya Bonthu. This is a repost from her Medium article, approved by the author.

In stadiums, sports fans love to express themselves by cheering for their favorite teams, holding up placards and team logos. Emojis allow fans at home to rapidly express themselves, and when millions of fans do it simultaneously, that’s a technical problem that we solved!

At Hotstar, we strive to build features that make the viewing experience more engaging to users with an interactive Social Feed. We discussed the “Hotstar Sports Bar” from a design and product point of view. Here we talk about how we built this feature from a technical perspective.

Social Feed GIF

Emojis show the real-time changes in the opinions of the audience. When Dhoni bats, people want a boundary shot on every ball but when he is keeping, people want to see the wickets fall. Collecting these user-generated signals in real-time, condensing these opinions to an emoji swarm that shows the mood of the audience and displaying the changing moods in real-time is challenging when you plan to receive billions of such emoji submissions during a tournament.

For a while, we used a third-party service to power this feature. However, we could not achieve the performance and stability we hoped for, and it was not cost-effective either. The time had come to bring this core service in-house.

In this article, we’ll discuss the architecture of Emojis, key design principles involved in building it, the impact created and how it paved the way for building other features like Voting.

High-Level Design

Architecture Diagram

Key Design Principles

Scalability

The system should be horizontally scalable to be able to support the increasing traffic. We achieved horizontal scalability with the help of load balancers and configured auto-scaling to scale the resources up or down.

Decomposition

The system needs to be decomposed into smaller components each being able to carry out the assigned task independent of each other. This also provides us with the ability to scale each component as needed.

Asynchronous

Asynchronous processing enables execution without blocking resources and thus supports higher concurrency. We will talk more about this later.

Implementation

How are client requests handled?

Clients send users’ submitted emojis via an HTTP API. To prevent hogging the client connection, heavy processing on the API needs to be done offline. We need to write the data somewhere so processing applications can consume it. A message queue is a commonly used mechanism for asynchronous communication between applications.

There are a lot of message queues available out there. A comparison of some of the available Message Queues is reviewed in this blog. For Emojis, we needed a technology that offered high throughput, availability, low latency and supports consumer groups. Kafka seemed like the best option but managing Kafka on our own takes significant effort. Thankfully, Hotstar has an amazing data platform called Knol built on top of Kafka which is flexible for all our use cases.

How do we write messages to the queue?

Synchronous: Wait for the acknowledgment that the message is written before sending a success response to clients. In case of a failure, we could have retries configured at both server and client. If your data is transactional or cannot suffer any loss, this approach is preferable.

Asynchronous: Write the message to a local buffer and respond with success to clients. Messages from the buffer could be written asynchronously to the queue. The downside is that if not handled properly, this could result in data loss.

For Emojis, we need very low latency and data loss in rare scenarios is not a big concern (although we haven’t seen any so far). So we chose the Asynchronous approach.

Golang has great support for concurrency. It ships with an implementation called Goroutines. Goroutines are lightweight threads that can execute functions asynchronously. We just need to say go do_something(). We use Goroutines and Channels in Golang to write messages to Kafka. Messages to be produced are written to a Channel. A Producer runs in the background as a Goroutine and flushes the data periodically to Kafka. Using client libraries like Confluent or Sarama, we can provide the flush configuration to achieve optimal performance. We configured our flush interval to 500ms and maximum messages sent to Kafka broker in a single request to 20000.
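
The same buffered-producer pattern can be sketched in Python (a sketch only: a thread and a `queue.Queue` stand in for the Goroutine and Channel, and a plain list stands in for Kafka):

```python
import queue
import threading
import time

# Request handlers enqueue emojis and return immediately; a background
# thread drains the buffer in batches on a short interval.
buffer = queue.Queue()
flushed_batches = []  # stand-in for the Kafka topic

def flusher(flush_interval=0.5, max_batch=20000):
    while True:
        time.sleep(flush_interval)
        batch = []
        while not buffer.empty() and len(batch) < max_batch:
            batch.append(buffer.get())
        if batch:
            flushed_batches.append(batch)

threading.Thread(target=flusher, args=(0.05,), daemon=True).start()
for emoji in ["🔥", "🎉", "😍"]:
    buffer.put(emoji)  # the request handler's work ends here
time.sleep(0.3)        # give the background flusher time to run
assert flushed_batches[0] == ["🔥", "🎉", "😍"]
```

The client request never waits on the broker round-trip, which is what keeps the API latency low at the cost of losing anything still in the buffer if the process dies.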

How does the processing happen?

Goal: Consume a stream of data from Kafka and compute aggregates of the data over an interval. Time interval should be small enough to provide a real-time experience to users.

After considering different streaming frameworks like Flink, Spark, Storm, and Kafka Streams, we decided to go with Spark. Spark has support for micro-batching and aggregations, which are essential for our use case, and better community support compared to competitors like Flink. Check out this blog for more details about different streaming frameworks.

We wrote a Spark streaming job that computes aggregates over a batch of 2 seconds and writes computed data to another Kafka queue.
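
The per-batch aggregation itself is simple; a pure-Python sketch of what each micro-batch computes (the event shape is illustrative, not our actual schema):

```python
from collections import Counter

def aggregate_batch(events):
    """Condense a micro-batch of raw emoji submissions into counts -
    the same reduction the Spark job applies per 2-second batch."""
    return Counter(event["emoji"] for event in events)

batch = [{"emoji": "🔥"}, {"emoji": "🎉"}, {"emoji": "🔥"}]
assert aggregate_batch(batch) == {"🔥": 2, "🎉": 1}
```

The value of Spark here is not the arithmetic but running this reduction reliably over a partitioned, high-volume stream.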

What about data delivery?

We use PubSub as our delivery mechanism. PubSub is a real-time messaging infrastructure built at Hotstar to deliver messages to users in our Social Feed. You can dig deeper into the journey of PubSub in this article.

We wrote a simple Kafka consumer in Python. The consumption rate depends on the batch duration configured in the Spark job - if it’s set to 1 second, this consumer receives a message per second. We perform normalization over the data and send the top (relatively more popular) emojis to PubSub. Clients receive messages from PubSub and show the emoji animation to users.
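
That normalization step can be sketched as follows (an illustrative sketch of the idea - scale raw counts to shares and keep the most popular emojis - not our production code):

```python
def top_emojis(counts, k=3):
    """Normalize raw counts to shares of the batch and keep the k most
    popular emojis for delivery over PubSub."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(emoji, count / total) for emoji, count in ranked]

# 🔥 dominated this batch with 60% of submissions.
assert top_emojis({"🔥": 6, "🎉": 3, "😍": 1}, k=2) == [("🔥", 0.6), ("🎉", 0.3)]
```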

Impact

Emojis are a huge hit at Hotstar. We received around 5 Billion Emojis from 55.83 Million users during the ICC Cricket World Cup 2019. This system has captured more than 6.5 Billion Emojis to date. Here’s what we achieved from building this in-house.

Voting and more

If we think about it, Emojis and Voting have a common problem statement

Process quantifiable user responses in near real-time

So we extended this system to build the Voting feature. Hotstar is currently the sole voting platform for a few big Indian reality shows like Dance Plus and Bigg Boss (Telugu, Tamil, Malayalam). This infrastructure, equipped to power Emojis and Voting for Hotstar, can now also power Polls and Trivia contests out of the box.

We have received around 3 Billion votes to date!

Collection of Emojis on Social Feed

Next time you visit our Social Feed, tap away on these.

Want to work with us? Check out https://tech.hotstar.com/ for some exciting opportunities.

Brief History of Scaling Uber

This blog post was written by Josh Clemm, Senior Director of Engineering at Uber Eats. This is a repost from his LinkedIn article, approved by the author.

On a cold evening in Paris in 2008, Travis Kalanick and Garrett Camp couldn't get a cab. That's when the idea for Uber was born. How great would it be if you could "push a button and get a ride?"

Fast forward to today where Uber is the largest mobility platform in the world. It operates in over 70 countries and 10,500 cities. Uber Eats is the largest food delivery platform ex-China in the world. It operates across 45 countries. We connect millions of driver-partners and merchants with over 130 million customers. We offer dozens of services to go anywhere or get anything. We handle billions of database transactions and millions of concurrent users across dozens of apps and thousands of backend services.

So, how did we get there?

Back in 2009, Uber hired contractors to build the first version of Uber and launched it to friends in San Francisco soon after. The team used the classic LAMP stack to build the first version of Uber and the code was written in Spanish.

Brief History of Scaling Uber
Original LAMP stack proved out the use case, but couldn’t scale.

The scaling challenges started as more people wanted to use Uber. There were often major concurrency issues, where we’d dispatch two cars to one person or one driver would be matched to two different riders. (Learn more about Uber’s earliest architecture from founding engineer Curtis Chambers).

But the product caught on. It was time to build the tech foundations from scratch.

Setting up for Global Scale

Circa 2011

To architect a better and more scalable solution, we needed to address those concurrency problems. Additionally, we needed a system that could process tons of real-time data. Not only are there requests from riders, but Uber needs to track drivers' real-time locations in order to match riders as efficiently as possible. Finally, the product was still early and would require a lot of testing and iteration. We needed a solution that addressed all these scenarios.

Uber adopted Node.js for its real-time needs and ended up being one of the first major adopters of Node.js in production. Node.js was ideal for a few reasons. First, Node.js handles requests asynchronously (using a non-blocking, single-threaded event loop) and so can process significant amounts of data quickly. Second, Node.js runs on the V8 JavaScript engine, so not only is it performant, but it’s excellent for quick iteration.

Uber then created a second service built in Python to handle business logic functions like authentication, promotions, and fare calculation.

The resulting architecture was two services. One built in Node.js ("dispatch") connected to MongoDB (later Redis) and the other built in Python ("API") connected to PostgreSQL.

Brief History of Scaling Uber
Uber’s two monolith architecture allowed the engineering org to begin to scale

And to improve the resiliency of Uber’s core dispatching flow, a layer between dispatch and API known as "ON" or Object Node was built to withstand any disruptions within the API service. (Learn more about Uber’s earliest efforts to maintain service reliability in this video Designing for Failure: Scaling Uber by Breaking Everything).

This architecture started to resemble a service oriented architecture. Service oriented architectures can be very powerful. As you carve out services to handle more dedicated functionality, it has a side benefit of allowing easier separation of engineers into dedicated teams. Which then allows for more rapid team scaling.

But as the team and number of features grew, the API service was getting bigger and bigger. More and more features were conflicting with one another. Engineering productivity was slowing down. There were huge risks continuously deploying the codebase.

It was time to split out API into proper services.

From Monolith to Microservices

Circa 2013

To prepare for our next phase of growth, Uber decided to adopt a microservice architecture. This design pattern enforces the development of small services dedicated to specific, well-encapsulated domain areas (e.g. rider billing, driver payouts, fraud detection, analytics, city management). Each service can be written in its own language or framework, and can have its own database or lack thereof. However, many backend services utilized Python and many started to adopt Tornado to provide asynchronous response functionality. By 2014, we had roughly 100 services.

Brief History of Scaling Uber
Uber’s API monolith to microservice migration

While microservices can solve many problems, it also introduces significant operational complexity. You must only adopt microservices after understanding the tradeoffs and potentially build or leverage tools to counteract those tradeoffs. And if you don’t consider the operational issues, you will simply create a distributed monolith.

Here are some examples of the issues microservices create and what Uber did to address them.

  • To ensure all services use standardized service frameworks, we developed Clay. This was a Python wrapper on Flask to build restful backend services. It gave us consistent monitoring, logging, HTTP requests, consistent deployments, etc.
  • To discover and talk to other services and provide service resilience (fault tolerance, rate limiting, circuit breaking), Uber built Hyperbahn over TChannel. TChannel, our bi-directional RPC protocol, was built in-house mainly to gain better performance and forwarding for our Node and Python services, among other benefits.
  • To ensure well-defined RPC interfaces and stronger contracts across services, Uber used Apache Thrift.
  • To prevent cross-service capability issues, we use Flipr to feature flag code changes, control rollouts, and many other config-based use cases.
  • To improve the observability of all service metrics, we built M3. M3 gives any engineer an easy way to observe the state of their service, either offline or through Grafana dashboards. We also leverage Nagios for alerting at scale.
  • For distributed tracing across services, Uber developed Merckx. This pulled data from a stream of instrumentation via Kafka. But as each service started to introduce asynchronous patterns, we needed to evolve our solution. We were inspired by Zipkin and ultimately developed Jaeger, which we still use today.
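Circuit breaking, one of the resilience patterns listed above, is easy to show in miniature. The sketch below is a generic Python circuit breaker, not Uber's implementation (TChannel bakes this into the RPC layer itself); all thresholds are invented:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch (illustrative only). After
    `max_failures` consecutive failures the circuit opens and calls
    fail fast until `reset_after` seconds pass, then one trial call
    is let through (the "half-open" state)."""
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Failing fast like this keeps a struggling downstream service from dragging every caller down with it.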

Over time, we’ve migrated to newer solutions like gRPC and Protobuf for interfaces. And many of our services utilize Golang and Java.

Scaling the Trip Database

Circa 2014

While Uber was creating many new backend services, we continued to use one single PostgreSQL database.

Brief History of Scaling Uber
The single PostgreSQL DB became a bottleneck

We were hitting some significant issues. First, the performance, scalability, and availability of this DB were struggling. There were only so many CPUs and so much memory you could throw at it. Second, it was getting very hard for engineers to be productive. Adding new rows, tables, or indices for new features became problematic.

And the problem was getting existential. By early 2014, Uber was 6 months away from Halloween night - one of the biggest nights of the year. We needed a more scalable solution and needed it fast.

When we looked into the data mix of this DB, the majority of storage was related to our trips, which was also growing the fastest.

Brief History of Scaling Uber
The mix of data stored in our single PostgreSQL DB in early 2014

We use trip data in order to improve services like Uber Pool, provide rider and driver support, prevent fraud, and develop and test features like suggested pick-ups. So we embarked on developing Schemaless, our new trip data store. Schemaless is an append-only sparse three dimensional persistent hash map, similar to Google’s Bigtable, and built on top of MySQL. This model lent itself naturally to horizontal scaling by partitioning the rows across multiple shards and supported our rapid development culture.
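The "sparse three dimensional persistent hash map" shape is easier to see in code. Below is a toy Python model keyed by (row key, column name, ref key) and sharded by row key; the names and the list-index ref_key are simplifications, and the real Schemaless persists its cells in MySQL:

```python
import json

class SchemalessSketch:
    """Toy model of an append-only, sparse, three-dimensional map,
    sharded by row key. Purely illustrative, not Uber's code."""
    def __init__(self, num_shards=4):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard(self, row_key):
        # Partition rows across shards; the real system hashes a UUID.
        return self.shards[hash(row_key) % len(self.shards)]

    def put(self, row_key, column, body):
        cells = self._shard(row_key).setdefault((row_key, column), [])
        cells.append(json.dumps(body))  # append-only: never overwrite

    def get(self, row_key, column, ref_key=-1):
        cells = self._shard(row_key).get((row_key, column), [])
        return json.loads(cells[ref_key]) if cells else None

trips = SchemalessSketch()
trips.put("trip:42", "BASE", {"status": "requested"})
trips.put("trip:42", "BASE", {"status": "completed"})
print(trips.get("trip:42", "BASE"))  # the latest appended cell wins
```

Because cells are only ever appended, every write is a cheap sequential insert and old versions remain readable, which suits both horizontal sharding and rapid iteration.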

And we successfully migrated all our services that access trip information in time to avert the Halloween peak traffic disaster. (Learn more with this video from lead engineer Rene Schmidt about our creation of and migration to Schemaless).

Brief History of Scaling Uber
The Schemaless migration operational room - a common look into migrations at scale

While we used Schemaless for our trip data store, we started to use Cassandra as a replacement for our other data needs, including the database that we use for marketplace matching and dispatching.

Splitting up Dispatch

Circa 2014

Among Uber’s original two monoliths, we discussed the evolution of API into hundreds of microservices. But dispatch similarly was doing too much. Not only did it handle matching logic, it was the proxy that routed all other traffic to other microservices within Uber. So we embarked on an exercise to split up dispatch into two areas of cleaner separation.

Brief History of Scaling Uber
Splitting the monolithic dispatch service into a real-time API gateway and an actual dispatch service

Extracting Uber’s Mobile Gateway from Dispatch

To better handle all the real-time requests from our mobile apps, we created a new API gateway layer named RTAPI ("Real-Time API"). And we continued to use Node.js for it. The service was a single repository that was broken up into multiple specialized deployment groups to support our growing businesses.

Brief History of Scaling Uber
RTAPI provided a powerful new real-time layer that maintained high development velocity

The gateway provided a very flexible development space for writing new code and had access to the hundreds of services within the company. For instance, the first generation of Uber Eats was completely developed within the gateway. As the team's product matured, pieces were moved out of the gateway and into proper services.

Rewriting Dispatch for Uber’s Growing Size

The original dispatch service was designed for more simplistic transportation (one driver-partner and one rider). There were deep assumptions that Uber only needed to move people and not food or packages. Its state of available driver-partners was sharded by city. And some cities were seeing massive growth of the product.

So, dispatch was rewritten into a series of services. The new dispatch system needed to understand much more about the types of vehicles and rider needs.

It took on advanced matching optimizations, essentially to solve a version of the traveling salesman problem. Not only did it look at the ETAs of currently available driver-partners, but needed to understand which drivers could be available in the near future. So we had to build a geospatial index to capture this information. We used Google’s S2 library to segment cities into areas of consideration and used the S2 cell ID as the sharding key. (We’ve since updated to and open-sourced H3)
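The cell-based sharding idea can be sketched with a naive fixed grid standing in for S2/H3 (the real libraries handle the sphere's geometry properly; the cell size and shard count here are invented):

```python
def cell_id(lat, lng, cell_deg=0.05):
    """Quantize a coordinate into a fixed grid cell. A toy stand-in
    for the S2/H3 cell IDs Uber actually uses as sharding keys."""
    return (int(lat // cell_deg), int(lng // cell_deg))

def shard_for(lat, lng, num_shards=16):
    # Route all drivers in the same cell to the same shard so nearby
    # supply can be matched without cross-shard queries.
    return hash(cell_id(lat, lng)) % num_shards

# Two drivers a few hundred meters apart land in the same cell,
# and therefore on the same shard.
sf_a = shard_for(37.7749, -122.4194)
sf_b = shard_for(37.7750, -122.4195)
print(sf_a == sf_b)
```

Keying the shard on a geospatial cell rather than a city is what lets the index grow with demand instead of with the city list.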

Brief History of Scaling Uber
Overview of dispatch stack

Since these services were still running on Node.js and were stateful, we needed a way to scale as the business grew. So we developed Ringpop, a gossip-protocol based approach to share geospatial and supply positioning for efficient matching.
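Ringpop's sharding rests on a consistent hash ring whose membership is kept in sync by a SWIM-style gossip protocol. A minimal ring (without the gossip part) might look like this sketch:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent hash ring sketch, in the spirit of Ringpop's
    membership-based sharding. Illustrative only; Ringpop layers gossip
    on top to keep every node's view of the ring in sync."""
    def __init__(self, nodes, vnodes=64):
        # Place several virtual nodes per real node for an even spread.
        self._ring = sorted(
            (self._h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def owner(self, key):
        # Walk clockwise to the first vnode at or after the key's hash.
        i = bisect.bisect(self._keys, self._h(key)) % len(self._keys)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("cell:871fb4662"))  # deterministic owner for a shard key
```

The consistent-hashing property is the point: when a node joins or leaves, only the keys adjacent to its virtual nodes move, so most geospatial state stays put.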

Learn more about the history of our dispatch stack here or watch this video on Scaling Uber’s Real-time Market Platform.

Mobile App Development at Scale

Circa 2016 to present

The flagship Uber product could only have existed due to the new mobile paradigm created by the iPhone and Android OS launches in 2007. These modern smartphones contained key capabilities like location-tracking, seamless mapping, payment experiences, on-device sensors, feature-rich user experiences, and so much more.

So Uber’s mobile apps were always a critical part of our scaling story.

Brief History of Scaling Uber
The Uber rider app was critical in defining a scalable mobile architecture

As Uber scaled across the globe, there was a need for an ever-growing list of features. Many of these were specific to certain countries like localized payment types, different car product types, detailed airport information, and even some newer bets in the app like Uber Eats and Uber for Business.

The mobile app’s repositories slowly hit similar bottlenecks to a backend monolith. Many features and many engineers, all trying to work across a single releasable code base.

This led Uber to develop the RIB architecture for mobile, starting with the rewrite of the main Uber app.

Brief History of Scaling Uber
RIB stands for Router, Interactor, Builder

Like the benefits of microservices, RIBs have clear separation of responsibilities. And since each RIB serves a single responsibility, it was easy to separate them and their dependencies into core and optional code. By demanding more stringent review for core code, we were more confident in the availability of our core flows. And this allows simple feature flagging to ensure the app continues to run reliably.

And like microservices, RIBs can be owned by different teams and engineers. This allowed our mobile codebases to easily scale to hundreds of engineers.

Today, all our apps have adopted RIBs or are migrating towards it. This includes our main Driver app, the Uber Eats apps, and Uber Freight.

The Rise of Uber Eats

Circa 2017

Uber had been experimenting with a number of “Uber for X on-demand” concepts since 2014. And all early signs pointed towards food. So in late 2015 Uber Eats launched in Toronto. And followed a similarly fast growth trajectory just like UberX.

Brief History of Scaling Uber
Uber Eats business growth compared with Uber rides (Uber Q3 2020 Earnings)

To enable this rapid growth, Uber Eats leveraged as much of the existing Uber tech stack as possible, while creating new services and APIs that were unique to food delivery (e.g. e-commerce capabilities like carts, menus, search, browsing).

Brief History of Scaling Uber
A simplified view into early Uber Eats architecture and how it leveraged systems built for original Uber

The operations team that needed to tune their cities’ marketplace often got creative and did things that didn’t scale (until the appropriate tech was built).

Brief History of Scaling Uber
Uber Eats Canada running scripts to help manage which stores were active and tuning the delivery radius of each based on available driver partners

Early Uber Eats was "simple" in that it supported a three-way marketplace of one consumer, one restaurant, and one driver-partner. Uber Eats today (130+ million users, dozens of countries) supports a variety of ordering modes and capabilities and can support 0-N consumers (e.g. guest checkout, group ordering), N merchants (e.g. multi-restaurant ordering), and 0-N driver partners (e.g. large orders, restaurants which supply their own delivery fleet).

The history of how Uber Eats evolved probably deserves its own scaling story and I may one day get to it.

But for now, to learn more from the earliest days, I highly recommend listening to Uber Eats founder Jason Droege’s recount of "Building Uber Eats".

Bring on the Standards - Project Ark

Circa 2018

No scaling story is complete without understanding the context and culture of the company. As Uber continued to expand city by city, local operations folks were hired to ensure their city launch would go successfully. They had tools to ensure their local marketplace would remain healthy and the flywheel would grow. As a result, Uber had a very distributed and decentralized culture. And that helped contribute to Uber’s success in getting to 600 cities by 2018.

That culture of decentralization continued within engineering, where one of our earliest cultural values was "Let Builders Build". This resulted in rapid engineering development that complemented Uber’s success growing across the globe.

But after many years, it resulted in the proliferation of microservices (thousands by 2018), thousands of code repositories, multiple product solutions solving very similar problems, and multiple solutions to common engineering problems. For example, there were different messaging queue solutions, varying database options, communication protocols, and even many choices for programming languages.

"You've got five or six systems that do incentives that are 75 percent similar to one another" - Former Uber CTO Thuan Pham

Developer productivity was hurting.

Engineering leadership recognized it was time for more standardization and consistency and formed Project Ark. Project Ark sought to address many aspects of engineering culture that contribute to scaling:

  • Engineer productivity,
  • Engineer alignment across teams,
  • Duplication,
  • Unmaintained critical systems, and
  • Knowledge access and documentation.

As a result, we elevated Java and Go as official backend languages to gain type-safety and better performance. And deprecated the use of Python and JavaScript for backend services. We embarked on reducing our code repos from 12,000 down to a handful, one per main language or platform (Java, Go, iOS, Android, and web). We defined more standardized architectural layers where client, presentation, product, and business logic would have clear homes. We introduced abstractions where we could group a number of services together (service "domains"). And continued to standardize on a series of service libraries to handle tracing, logging, networking protocols, resiliency patterns, and more.

A Modern Gateway for the Modern Uber

Circa 2020

By 2019, Uber had many business lines with numerous new applications (Uber Eats, Freight, ATG, Elevate, and more). Within each line of business, the teams managed their backend systems and their app. We needed the systems to be vertically independent for fast product development.

And our current mobile gateway was showing its age. RTAPI had been built years ago and continued to use Node.js and JavaScript, a language we had deprecated for backend services. We were also eager to make use of Uber’s newly defined architectural layers, as the ad hoc code added to RTAPI over the years was getting messy with view generation and business logic.

So we built a new Edge Gateway to start standardizing on the following layers:

  • Edge Layer: API lifecycle management layer. No extraneous logic can be added, keeping it clean.
  • Presentation Layer: microservices that build view generation and aggregation of data from many downstream services.
  • Product Layer: microservices that provide functional and reusable APIs that describe their product. Can be reused by other teams to compose and build new product experiences.
  • Domain Layer: leaf-node microservices that each provide a single refined functionality for a product team.
Brief History of Scaling Uber

This evolution set us up well to continue building new products with velocity, yet with the necessary structure to align our 1000s of engineers.

Next Generation Fulfillment

Circa 2021

Throughout the years, Uber has created a world-class platform for matching riders and driver-partners. So our dispatch and fulfillment tech stack is a critical part of Uber’s scaling story.

By 2021, Uber sought to power more and more delivery and mobility use cases. The fulfillment stack was showing its age and couldn’t easily support all these new scenarios. For example, we needed to support reservation flows where a driver is confirmed upfront, batching flows with multiple trips offered simultaneously, virtual queue mechanisms at Airports, the three-sided marketplace for Uber Eats, and delivering packages through Uber Direct.

Brief History of Scaling Uber
Some example services the Fulfillment Platform needed to support

So we made a bold bet and embarked on a journey to rewrite the Fulfillment Platform from the ground up.

To satisfy the requirements of transactional consistency, horizontal scalability, and low operational overhead, we decided to leverage a NewSQL architecture. And opted to use Google Cloud Spanner as the primary storage engine.

As lead Uber engineer Ankit Srivastava puts it "as we scale and expand our global footprint, Spanner's scalability & low operational cost is invaluable. Prior to integrating Spanner, our data management framework demanded a lot of oversight and operational effort, escalating both complexity and expenditure."

Brief History of Scaling Uber
The Uber-GCP network infrastructure

What else have we done?

Of course, our scaling story is never this simple. There's a countless number of things we've done over the years across all engineering and operations teams, including some of these larger initiatives.

Many of our most critical systems have their own rich history and evolution to address scale over the years. This includes our API gateway, fulfillment stack, money stack, real-time data intelligence platform, geospatial data platform (where we open-sourced H3), and building machine learning at scale through Michelangelo.

We’ve introduced various layers of Redis caches. We’ve enabled powerful new frameworks to aid in scalable and reliable systems like Cadence (for writing fault-tolerant, long-running workflows).

We’ve built and leveraged data infrastructure that enables long term growth, like how we leveraged Presto or scaled Spark. Notably, we built Apache Hudi to power business critical data pipelines at low latency and high efficiency.

And finally, we continue to improve the performance of our servers with optimized hardware, advanced memory and system tuning, and utilizing newer runtimes.

What’s Next? Heading to the Cloud

Being a global company, Uber operated out of multiple on-prem data centers from the earliest days. But that introduced a number of challenges.

First, our server fleet had grown rapidly (over 250,000 servers) and the tooling and teams were always trying to keep up. Next, we have a large geographical footprint and need to regularly expand into more data centers and availability zones. Finally, with only on-prem machines, we were constantly needing to tune the size of our fleet.

We spent the last few years working towards making over 4000 of our stateless microservices portable. And to ensure our stack would work equally well across cloud and on-prem environments, we embarked on Project Crane to solve this. This effort set Uber up well for the future. To learn more, watch lead Uber engineer Kurtis Nusbaum’s talk on how Crane solves our scaling problems.

We now have plans to migrate a larger portion of our online and offline server fleet to the Cloud over the next few years!

Thanks to my many Uber friends and colleagues for reviewing this!

And if you enjoyed this, check out my blog post about A Brief History of Scaling LinkedIn.

]]>
<![CDATA[Behind AWS S3’s Massive Scale]]>This is a guest article by Stanislav Kozlovski, an Apache Kafka Committer. If you would like to connect with Stanislav, you can do so on Twitter and LinkedIn.

AWS S3 is a service every engineer is familiar with.

It’s the service that popularized the notion of cold-storage to

]]>
https://highscalability.com/behind-aws-s3s-massive-scale/65e6c8e5d997cc0001bb47daWed, 06 Mar 2024 16:15:44 GMTThis is a guest article by Stanislav Kozlovski, an Apache Kafka Committer. If you would like to connect with Stanislav, you can do so on Twitter and LinkedIn.Behind AWS S3’s Massive Scale

AWS S3 is a service every engineer is familiar with.

It’s the service that popularized the notion of cold-storage to the world of cloud. In essence - a scalable multi-tenant storage service which provides interfaces to store and retrieve objects with extremely high availability and durability at a relatively low cost. Customers share all of the underlying hardware.

It is one of the most popular services in AWS.

In the 18 years since its release in 2006, it has grown into a behemoth of an offering, spanning 31 regions and 99 availability zones as well as boasting numbers that show ground-breaking scale:

  • 100 million requests a second
  • 400 terabits a second
  • 280 trillion objects stored

Here is a short timeline of the major features ever since its release:

  • 2006: S3 officially launched
  • 2012: Glacier - a low-cost service for data archiving and long-term backup - introduced (retrieval times take minutes to hours)
  • 2013: S3 Intelligent-Tiering - a feature that automatically moves your data to the most cost-effective access tier
  • 2015: Standard-Infrequent Access - a new tier for data that’s less frequently accessed, but requires rapid access when needed
  • 2018: Glacier Deep Archive - cheaper Glacier with much slower retrieval times (12 hours)
  • 2020: Strong read-after-write consistency introduced
  • 2021: Object Lambda - allows customers to add their own code to S3 GET requests, modifying data as it's returned to an application
  • 2023: S3 Express released, offering 10x better latencies (single-digit millisecond) and 50% cheaper request costs

It started as a service optimized for backups, video and image storage for e-commerce websites - but eventually grew in scale and demand to support being the main storage system used for analytics and machine learning on massive data lakes. Looking at current trends, it seems to be becoming the staple storage backbone for any cloud-native data infrastructure.

Architecture

S3 is said to be composed of more than 300 microservices.

It tries to follow the core design principle of simplicity.

Its architecture can be broken down into four high-level services:

  • a front-end fleet with a REST API
  • a namespace service
  • a storage fleet full of hard disks
  • a storage management fleet that does background operations, like replication and tiering.
Behind AWS S3’s Massive Scale

Said simply - it’s a multi-tenant object storage service with an HTTP REST API.

It is said that AWS ships its organizational chart.

That is, the architecture strongly exhibits Conway’s Law: the theory that organizations will design systems that copy their communication structure.

Behind AWS S3’s Massive Scale

Each of the four components is part of the S3 organization, each with its own leaders and a number of teams that work on it. It is said that this continues recursively - the boxes in the chart contain their own individual nested components with their own teams and fleets. In many ways, they all operate like independent businesses.

Each team interacts with the others through literal API-level contracts, which engineers are on the hook to uphold.

Storage Fleet

S3 is just a really big system built out of A LOT (millions) of hard disks.

At the core of all of this are the storage node servers. They are simple key-value stores that persist object data to hard disks. The node servers hold only shards of the overall object data, with the control plane replicating those shards across many different nodes.

AWS has written a lot about their newly-written storage backend - ShardStore. It initially started with just 40k lines of Rust.

Under the hood, it’s a simple log-structured merge tree (LSM Tree) with shard data stored outside the tree to reduce write amplification.

It uses a soft-updates-based crash consistency protocol and is designed from the ground up to support extensive concurrency. It leverages several optimizations to support efficient HDD IO usage like scheduling of operations and coalescing.
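The log-structured shape is easy to sketch: writes land in an in-memory memtable that gets flushed as an immutable sorted run, and reads check the memtable first, then the runs newest-first. This toy Python version shows only that shape; ShardStore's real design (Rust, soft-updates crash consistency, HDD-aware IO scheduling) is far richer:

```python
class TinyLSM:
    """Sketch of an LSM-tree-style key-value store. Illustrative only;
    compaction, on-disk persistence, and crash consistency are omitted."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []  # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: one sequential write of a sorted run, which is
            # what makes LSM writes friendly to spinning disks.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newer runs shadow older values
            for k, v in run:
                if k == key:
                    return v
        return None
```

Keeping shard data outside the tree, as ShardStore does, means the runs stay small and rewriting them during compaction doesn't re-copy object bytes, which is the write-amplification win the text mentions.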

This is perhaps the most key piece of code in S3 and as such, its development process is highly invested in. AWS has written at length about the lightweight formal verification methods they’ve integrated into their developers’ workflows.

Hard Drives

Hard drives are an old technology that’s not ideal for all use cases - they’re constrained for IOPS, they have high seek latency and are physically fragile.

It’s worth stepping back and checking their evolution:

  • 1956: a 3.75MB drive cost $9k
  • 2024: 26TB drives exist, where 1TB costs $15

In their existence, they've seen exponential improvement:

  • price: 6,000,000,000 times cheaper per byte (in inflation-adjusted $)
  • capacity increased 7,200,000 times
  • size decreased 5,000 times
  • weight decreased 1,235 times

But one problem persists - they’re constrained for IOPS. They have been stuck at around 120 IOPS for a long time. Similarly, latency has not kept up at the same pace as the rest.

Behind AWS S3’s Massive Scale

This means that per byte, HDDs are becoming slower.

In an increasingly latency-sensitive world, it’s hard for an HDD to keep up with the demanding requirements modern systems have.

Yet, S3 found a way to deliver tolerable latency while working around this - it heavily leverages parallel IO.

Replication

In storage systems, redundancy schemes are a common practice.

They are most often associated with extra durability - helping protect data against hardware failures. If one disk fails, a copy of the data remains present in another disk, hence it isn’t lost.

An under-appreciated aspect of these redundancy schemes is managing heat. Such schemes spread the load out and give your system the flexibility to direct read traffic in a balanced way.

S3 uses Erasure Coding (EC). It’s a complex data protection method which breaks data into K shards plus M redundant “parity” shards. EC allows you to recover the data from any K of the K+M total shards.

e.g. break data into 10 data shards and add 6 parity shards. You can then lose any 6 of the 16 shards and still recover the data.

Replication is expensive from a capacity perspective (a lot of extra space is taken up), yet is efficient from an I/O perspective. Erasure Coding helps S3 find a middle balance by not taking too much extra capacity for its durability needs (the data isn’t replicated 3x) while still providing flexible I/O opportunities and surviving the same number of disk failures.
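The recovery idea can be shown with the simplest possible erasure code: K data shards plus a single XOR parity shard (M=1). S3's actual scheme uses more parity shards and Reed-Solomon-style math, but the principle of rebuilding a lost shard from the survivors is the same:

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k=4):
    """Split `data` into k equal shards plus one XOR parity shard,
    the simplest (k, m=1) erasure code. Illustrative sketch only."""
    assert len(data) % k == 0, "pad data to a multiple of k first"
    size = len(data) // k
    shards = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return shards + [parity]

def recover(shards, lost_index):
    """Rebuild one missing shard by XOR-ing all the surviving ones.
    Works because XOR-ing all k+1 shards together yields zero."""
    survivors = [s for i, s in enumerate(shards) if i != lost_index]
    out = survivors[0]
    for s in survivors[1:]:
        out = xor_bytes(out, s)
    return out
```

With more parity shards the arithmetic gets heavier, but the capacity overhead stays well below full 3x replication while any M shard losses remain survivable.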

Sharding

S3 tries to spread these shards as broadly as it can. The team has shared that they have tens of thousands of customers with data stored over a million physical drives.

Spreading the data around gives them the following benefits:

1. Hot Spot Aversion
When the data is well spread out, no single customer can cause a hot spot that affects the system. Hot spots are particularly troublesome and worth avoiding, as they introduce cascading delays throughout the system and in the worst case can result in unpleasant domino effects.

2. Burst Demand
Similarly, the extra parallelism afforded by each drive the shards are on allows for greater burst IO compared to what a naive solution (e.g. the data replicated on just 3 drives) may bring.

3. Greater Durability
The more the shards are spread, the greater the durability. You can sustain more individual machine failures.

4. No Read Amplification
Read amplification denotes how many disk seeks a single read causes. Assuming homogeneous hardware, a single read will still cause roughly the same number of disk seeks. But since the data is spread around, those seeks will not be on the same drive.

The fewer disk seeks one HDD needs to do, the better. This reduces the likelihood of high latency.

Heat Management at Scale

Given the scale, one of S3’s biggest technical problems is managing and balancing this I/O demand across the large set of hard drives.

The overall goal is to minimize the number of requests that hit the same disk at any point in time. Otherwise, as we mentioned - you get hot spots and accumulate tail latency that eventually grows to impact the overall request latency.

In the worst case, a disproportionate number of requests on the same drive can cause stalling as the limited I/O the disk provides becomes exhausted. This results in overall poor performance on requests that depend on the drive(s). As you stall the requests, the delay gets amplified up through layers of the software stack - like Erasure Coding write requests for new data or metadata lookups.

The initial placement of data is key, yet also tricky to get right because S3 doesn’t know when or how the data is going to be accessed at the time of writing.

The S3 team shares that due to the sheer scale of the system and its multi-tenancy, this otherwise hard problem becomes fundamentally easier.

S3 experiences so-called workload decorrelation - the phenomenon where load smooths out once it’s aggregated at a large enough scale.

The team shares that most storage workloads remain completely idle for most of the time. They only experience a sudden load peak when the data is accessed, and that peak demand is much higher than the mean.

Behind AWS S3’s Massive Scale

But as the system aggregates millions of workloads, the underlying traffic to the storage flattens out remarkably. The aggregate demand results in a smoother, more predictable throughput.


When you aggregate on a large enough scale, a single workload cannot influence the aggregate peak.

The problem then becomes much easier to solve - you simply need to balance out a smooth demand rate across many disks.
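The decorrelation effect can be seen in a toy simulation. The burst probabilities and IOPS figures below are invented for illustration; the point is only that the aggregate's peak-to-mean ratio collapses as independent bursty workloads are summed:

```python
import random

random.seed(42)

def bursty_workload(hours, burst_prob=0.02, burst_iops=10_000, idle_iops=10):
    """A workload that is idle almost all the time, with rare sharp peaks."""
    return [burst_iops if random.random() < burst_prob else idle_iops
            for _ in range(hours)]

def peak_to_mean(series):
    return max(series) / (sum(series) / len(series))

single = bursty_workload(1_000)
# Aggregate 1,000 independent workloads hour by hour.
fleet = [bursty_workload(1_000) for _ in range(1_000)]
aggregate = [sum(w[h] for w in fleet) for h in range(1_000)]

print(peak_to_mean(single), peak_to_mean(aggregate))
# The aggregate's peak-to-mean ratio is far smaller than any single workload's.
```

Balancing the smooth aggregate across disks is a far easier problem than chasing each individual workload's spikes.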

Parallelism

Parallelism has aligned incentives for both AWS and its customers - it unlocks better performance for the customers and allows S3 to optimize for workload decorrelation.

Let us take some numbers as examples:

Two Opposite Examples

Imagine a 3.7 Petabyte S3 bucket. Say it takes in 2.3 million IOs a second.

With 26 TB disks, you’d only need 143 drives to support the 3.7PB capacity.
But with the low 120 IOPS per disk, you’d need 19,167 drives for the 2.3M IO demand! That’s over 13,300% more drives.

Now imagine the opposite - a 28PB S3 bucket with just 8,500 IOs a second.

With 120 IOPS per disk, you’d need 71 drives for the IOs.
With 26TB disks, you’d need 1,077 drives for the capacity. That’s roughly 1,400% more drives.

A 134x imbalance in one case, 15x in another.

S3 likely has both types of workloads, but the large drive counts demonstrate the need for parallelism to achieve the necessary read throughput.
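The arithmetic behind the two examples can be reproduced in a few lines, using the per-drive figures quoted above (capacity and IOPS requirements are each rounded up to whole drives):

```python
import math

DISK_TB = 26
DISK_IOPS = 120

def drives_needed(capacity_pb: float, iops: float):
    """Drives required for capacity vs. drives required for IO demand."""
    for_capacity = math.ceil(capacity_pb * 1000 / DISK_TB)
    for_iops = math.ceil(iops / DISK_IOPS)
    return for_capacity, for_iops

# Hot bucket: small capacity, huge IO demand -> IO-bound (~134x imbalance).
print(drives_needed(3.7, 2_300_000))
# Cold bucket: huge capacity, tiny IO demand -> capacity-bound (~15x imbalance).
print(drives_needed(28, 8_500))
```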

Parallelism in Practice

S3 leverages parallelism in two ways:

Across Servers

Parallelism begins at the clients!

Instead of requesting all the files through one client with one connection to one S3 endpoint, users are encouraged to create as many clients as possible with as many parallel connections as possible. This utilizes many different endpoints of the distributed system, ensuring no single point in the infrastructure (e.g. caches) becomes too hot.

Intra-Operation

Within a single operation inside a connection, S3 also leverages parallelism.

  • PUT requests support multipart upload, which AWS recommends in order to maximize throughput by leveraging multiple threads.
  • GET requests similarly support an HTTP Range header denoting that you read only a particular byte range of the object. AWS again recommends this for achieving higher aggregate throughput than a single whole-object read request.
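The range-read idea can be sketched by splitting an object into byte ranges, one per parallel GET. The `Range` header syntax is standard HTTP; the actual S3 client call is omitted here:

```python
def ranged_get_headers(object_size: int, chunk_size: int):
    """Return HTTP Range headers covering the object in chunk_size pieces.
    Each range can then be fetched on its own connection, in parallel."""
    headers = []
    for start in range(0, object_size, chunk_size):
        end = min(start + chunk_size, object_size) - 1  # Range ends are inclusive
        headers.append(f"bytes={start}-{end}")
    return headers

# A 100 MB object read as 4 parallel 25 MB ranged GETs:
print(ranged_get_headers(100 * 1024 * 1024, 25 * 1024 * 1024))
```

Issuing these ranges from multiple threads or clients is what turns one large sequential read into parallel load spread across the fleet.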

Metadata and the Move to Strong Consistency

In 2020, S3 introduced strong read-after-write consistency: once you write a new object or overwrite an existing one, any subsequent read will receive the latest version of the object.

This was a massive change given the scale S3 was operating on, especially considering it came with zero impact to performance, availability or cost.

S3 has a discrete subsystem for storing per-object metadata. Because the metadata write/read is part of the critical data path for most requests, the system’s persistence tier was designed to use a highly resilient caching technology ensuring that even if it was impaired, S3 requests would succeed.

This meant that writes might flow through one part of the cache infrastructure while reads might query another part, resulting in a stale read. This was reportedly the main source of S3’s eventual consistency.

As part of other features, S3 introduced new replication logic into its persistence tier that allowed them to reason about the per-object order of operations. As a core piece of their cache coherency protocol, this allowed them to understand if a cache’s view of an object is stale.

A new component was introduced that would track this. Said component acts as a witness to S3 writes and a read barrier during S3 reads. When the component recognizes that a value may be stale, it is invalidated from the cache and read from the persistence layer.

Durability

S3 offers 11 nines of durability - 99.999999999%. The expected average annual loss is therefore 0.000000001% of objects stored.

Said simply, if you store 10,000 objects in S3, you can expect to lose a single object once every 10,000,000 years.
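The "one object per 10,000,000 years" figure follows directly from the durability number:

```python
annual_loss_rate = 1e-11          # 1 - 99.999999999% (11 nines of durability)
objects_stored = 10_000

expected_losses_per_year = objects_stored * annual_loss_rate
years_per_lost_object = 1 / expected_losses_per_year

print(f"{years_per_lost_object:,.0f} years")  # 10,000,000 years
```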

Hardware Failure

Achieving 11 nines of durability is a difficult feat. Hard drives frequently fail and at S3’s scale, it’s likely an hourly occurrence.

In theory - durability is simple. As hard drives fail, you repair them. As long as the repair rate matches the failure rate, your data is safe.

In practice - to ensure this process is reliable in the face of unpredictable failures, you need to build a system around it.

AWS shared a simple real-world example where failures accumulate - imagine the data center is in a hot climate and a power outage occurs. Its cooling system stops working. Disks begin to overheat and the failure rate goes up significantly.

AWS built a continuous self-healing system with detectors that track the failures and scale their repair fleet accordingly.

In the background, a durability model runs analysis to track whether the desired durability is actually met.

The lesson here is that durability cannot be a single snapshot of the system - it is a continuous evaluation.


Other Failures

Durability is a large umbrella. Your hardware may work fine but data can still be lost through other means, like deploying a bug that corrupts data, a human operator error deleting data, problematic deployments, or having the network corrupt the bits uploaded by the customer before it reaches your data center.


As one example - S3 has implemented what they call a durable chain of custody. To solve the edge case where data can become corrupted before it reaches S3, AWS implemented a checksum in the SDK that is added to the request as an HTTP trailer (avoiding the need to scan the data twice).
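The trailer trick can be sketched with a running checksum: the checksum is computed while the body streams out, so it is only known (and sent) after the last byte, with no second pass over the data. CRC32 and the trailer name/encoding below are illustrative stand-ins for whatever the SDK actually negotiates:

```python
import zlib

def stream_with_trailer(chunks):
    """Stream body chunks while computing a CRC32 on the fly, then emit
    the checksum as a trailer after the last chunk - one pass over the data."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)   # fold each chunk into the running checksum
        yield chunk                    # body bytes flow out immediately
    yield f"x-amz-checksum-crc32: {crc:#010x}".encode()  # known only at the end

frames = list(stream_with_trailer([b"hello ", b"world"]))
print(frames[-1])  # the trailer frame carrying the full-body checksum
```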


The sheer size of S3 means that any theoretical failure that could be experienced has probably been seen, or will soon be.

Culture

Achieving a feat of engineering like S3 is as much about social organization as it is about technological innovation.

With new engineers joining and experienced ones leaving, maintaining the same velocity at a growing scale & stakes can be difficult.

AWS aims to automate as much as possible. They have integrated extensive property-based tests iterating through large sets of request patterns, as well as lightweight formal verification, into their CI/CD pipeline. Both are automated, giving engineers confidence in the correctness of the system as it evolves.

AWS have shared that they internally follow a “Durability Threat Model” design approach, borrowed from the popular security threat model. When designing features, they take into account any potential threats to durability and ensure they think critically about mitigations for all of them.

They stress the need to ingrain durability into the organization’s culture and ensure continuous processes that maintain it.


Conclusion

S3 is essentially a massively multi-tenant storage service. It’s a gigantic distributed system consisting of many individually slow nodes that on aggregate allow you to get heaps of data fast.

Through economies of scale, its design makes affordable certain use cases that were previously cost-prohibitive. Bursty workloads that store a lot of data, stay idle for months and then read everything in a short burst - like genomics, machine learning and self-driving cars - can use S3 without paying the full price of all the drives they would need if they deployed the hardware themselves. It’s an interesting case study of how scale unlocks cost-effectiveness and makes management easier.

Resources

This blog post was formed by a collection of public resources AWS have shared regarding S3. Here is a short list if you are interested in reading more:

]]>
<![CDATA[The Swedbank Outage shows that Change Controls don't work]]>This week I’ve been reading through the recent judgment from the Swedish FSA on the Swedbank outage. If you’re unfamiliar with this story, Swedbank had a major outage in April 2022 that was caused by an unapproved change to their IT systems. It temporarily left nearly

]]>
https://highscalability.com/the-swedbank-outage-shows-that-change-controls-dont-work/65bceaca3953980001659b56Wed, 16 Aug 2023 16:05:12 GMTThis week I’ve been reading through the recent judgment from the Swedish FSA on the Swedbank outage. If you’re unfamiliar with this story, Swedbank had a major outage in April 2022 that was caused by an unapproved change to their IT systems. It temporarily left nearly a million customers with incorrect balances, many of whom were unable to meet payments.

After investigation, the regulator found that Swedbank had not followed their change management process and issued a SEK850M (~85M USD) fine. That’s a lot of money to you and me, but probably didn’t impact their bottom line very much. Either way I’m sure the whole episode will have been a big wake up call for the people at the bank whose job it is to ensure adequate risk and change controls. So, what went wrong and how could it have been avoided?

How did the Swedbank incident happen?

The judgment doesn’t describe the technical details behind the incident, but it does provide glimpses into how they assessed what went wrong:

  • “The deficiencies that were present in Swedbank’s internal control made it possible to make changes to one of the bank’s most central IT systems without following the process in place at the bank to ensure continuity and reliable operations. This violation is therefore neither minor nor excusable.”
  • “none of the bank’s control mechanisms were able to capture the deviation and ensure that the process was followed”
  • “one of the main causes underlying the IT incident was non-compliance with the change management process and that it is probable that this also resulted in a slower analysis of the incident and a greater impact on the operations.”
  • “good internal control is a prerequisite for a bank to be able to fulfill the requirements on risk management”

Even if you think $85M isn’t much of a fine - simply the cost of doing business - the full range of options open to the regulator included removing Swedbank’s banking license:  “It is therefore not relevant to withdraw Swedbank’s authorisation or issue the bank a warning. The sanction should instead be limited to a remark and an administrative fine.” Gulp.

Change management doesn’t mitigate risk

What really interests me about cases like this is that, even when followed to the letter, the old ways of managing change with manual approvals and change meetings do not mitigate risk in today’s technology organizations. These processes don’t work because complying with them is no guarantee that changes are being made safely and securely.

Tell me if you’ve heard this one before?

  • Bank has a major IT outage/incident
  • A change was applied without following change management process
  • Bank claims the risk controls work if they are followed
  • Regulator fines bank for not following process + having insufficient controls
  • Bank adds more change controls

The position of the regulator constitutes self-referential logic. You said you’d do something to manage risk, it wasn’t done, therefore you are in violation. But, is change management the best way to manage IT risk?

What the UK FCA says about change

I’ve written previously on some fantastic research published by the Financial Conduct Authority in the UK. They took a data-driven approach to understand the workings of change management processes, which uncovered some provocative findings:

“One of the key assurance controls firms used when implementing major changes was the Change Advisory Board (CAB). However, we found that CABs approved over 90% of the major changes they reviewed, and in some firms the CAB had not rejected a single change during 2019. This raises questions over the effectiveness of CABs as an assurance mechanism.”

Change as a control gate doesn’t work, but everyone does it. Why? To avoid $85M USD fines. In the UK and USA these can be issued to individuals as well as organizations. So, if you have followed the process, at the very least you are compliant and not liable for heavy financial penalties. It’s also about covering your back - “It’s not my fault, I ticked all the boxes.” But is the bank safe though? Are the systems themselves secure?

Change management gathers documentation of process conformance, but it doesn’t reduce risk in the way that you’d think. It reduces the risk of undocumented changes, but risks in changes that are fully documented can sail through the approval process unnoticed. This is an important and quite shocking finding: adherence to traditional change management doesn’t work to manage the risk of changes.

Research shows external approvals don’t work

The science of DevOps backs this up. Here’s the unvarnished truth on external approvals and CABs based on research by Dr. Nicole Forsgren, Jez Humble, and Gene Kim in their 2018 book, Accelerate: Building and Scaling High Performing Technology Organizations.

“We found that external approvals were negatively correlated with lead time, deployment frequency, and restore time, and had no correlation with change fail rate. In short, approval by an external body (such as a change manager or CAB) simply doesn’t work to increase the stability of production systems, measured by the time to restore service and change fail rate. However, it certainly slows things down. It is, in fact, worse than having no change approval process at all.”

Worse than no change approval process at all. So, if you want to avoid fines, cover your back AND reduce the likelihood of production incidents, what would you do?

Change is not the problem. It’s unaddressed risk

If change is not the problem, then what is?

What would work? Well the FCA has some insights on this:

“Frequent releases and agile delivery can help firms to reduce the likelihood and impact of change related incidents:

Overall, we found that firms that deployed smaller, more frequent releases had higher change success rates than those with longer release cycles. Firms that made effective use of agile delivery methodologies were also less likely to experience a change incident.”

In short - paperwork doesn’t reduce risk.  Less risky changes reduce risk.  I’m going out on a limb here, but if Swedbank had in fact followed processes and still had the outage, I believe Finansinspektionen (the Swedish FCA) would still have given a fine, but for insufficient risk management.

Story time: streams feeding the lake

We can think of software changes as streams, feeding into our environments which are lakes.  Change management puts a gate in the stream to control what flows into the lake, but doesn’t monitor the lake.

If it is possible to make a change to production without detection, then change management only protects one source of risk. The only way to be sure you don’t have undocumented production changes is with runtime monitoring.

For me what is really interesting about this story is the echoes and parallels it has with the Knight Capital incident so well documented by the SEC. In both cases, an incomplete understanding of how changes have been applied to production systems due to insufficient observability and traceability prolonged and amplified the scale of the outages.

And leaves an open question: how many similar changes have been made that didn’t cause an outage?  Without monitoring it is really hard to know.

If change management doesn’t work, why do we do it?

It all goes back to software history. Traditionally changes were rare, big and risky.  It was the annual upgrade, or the monthly patch. Because these big batches of change were risky, companies introduced long testing and qualification processes, change management, service windows, and a large number of checklists to help mitigate the risks and test-in quality.

Before we had modern practices of test automation, continuous delivery, DevSecOps, rolling deployments with fast rollback, this was the only way. The trouble is, the financial services industry is packed full of legacy systems and outsourcing where implementing these practices is technically challenging and uneconomic.

Maybe it is time we acknowledge legacy software, risk management, and outsourcing are a major systemic risk in the financial sector?

The flipside is also true. Many next generation systems in financial services are so dynamic and distributed that it is really hard to get a handle on the volume of changes occurring.

Risk management that works

The only way to not get burned is to avoid playing with fire. Checklists can help, but if you have a lot of IT risk the only way to really reduce it is to do the technical work to make changes less risky, and move to smaller, more frequent changes. And you can reduce this toil by automating change controls and documentation, and introducing monitoring and alerting systems to detect unauthorized changes. It’s all part of a DevSecOps approach to change management that harmonizes the speed of software delivery with the demands made on organizations by cybersecurity and audit and compliance.

]]>
<![CDATA[Lessons Learned Running Presto at Meta Scale]]>Presto is a free, open source SQL query engine. We’ve been using it at Meta for the past ten years, and learned a lot while doing so. Running anything at scale - tools, processes, services - takes problem solving to overcome unexpected challenges. Here are four things we

]]>
https://highscalability.com/lessons-learned-running-presto-at-meta-scale/65bceaca3953980001659b53Sun, 16 Jul 2023 15:42:21 GMTPresto is a free, open source SQL query engine. We’ve been using it at Meta for the past ten years, and learned a lot while doing so. Running anything at scale - tools, processes, services - takes problem solving to overcome unexpected challenges. Here are four things we learned while scaling up Presto to Meta scale, and some advice if you’re interested in running your own queries at scale.

Scaling Presto rapidly to meet growing demands: What challenges did we face?

Deploying new Presto releases

Figure 1: Process workflow for pushing new versions of Presto (Diagram by Philip S. Bell)

Meta runs a large number of Presto clusters spanning data centers across locations worldwide. A new version of Presto is built and ready for deployment roughly once, and sometimes twice, each month. One of the earliest challenges we faced as Presto’s footprint at Meta grew rapidly was deploying the query engine to a high volume of clusters while ensuring consistent availability and reliability. This continues to be especially true for interactive use cases of Presto, i.e., when a user launches a query and is actively waiting for a result. Query failure is less of a concern for automated “batch” use cases where automatic retries ensure that the query eventually succeeds.

The solution for this was simple. All Presto clusters sit behind a load balancer called the Gateway which is responsible (in conjunction with other systems at Meta) for routing Presto queries to the appropriate cluster. When a Presto cluster needs to be updated, it is first marked as drained from the Gateway, i.e., the Gateway stops routing any new queries to it. Automation then waits for a predetermined time in order for queries currently running on the cluster to finish. The cluster is then updated, and once online, it is made visible to the Gateway, which can start routing new queries to it.

The other aspect to deploying new Presto releases is availability. We need to ensure that users can still use Presto while clusters are getting updated. Again, automation ensures that every data center in every physical region always has the necessary number of Presto clusters available. Of course, a balance has to be struck between taking down too many clusters at once (availability issue) and taking down too few at once (deployment takes too long).
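The drain-then-update workflow above can be sketched as a toy simulation. The `Gateway` API here is invented for illustration; Meta's actual automation is not public:

```python
class Gateway:
    """Minimal stand-in for the load balancer that routes Presto queries."""
    def __init__(self, clusters):
        self.active = set(clusters)

    def drain(self, cluster):
        self.active.discard(cluster)    # stop routing new queries to it

    def enable(self, cluster):
        self.active.add(cluster)

def rolling_update(gateway, clusters, update_fn):
    """Update clusters one at a time so the rest keep serving queries."""
    for cluster in clusters:
        gateway.drain(cluster)
        # ...in production: wait a predetermined time for running queries...
        update_fn(cluster)
        gateway.enable(cluster)         # back in rotation before the next drain

gw = Gateway(["c1", "c2", "c3"])
updated = []
rolling_update(gw, ["c1", "c2", "c3"], updated.append)
print(sorted(gw.active), updated)  # all clusters back online, all updated
```

The real system additionally caps how many clusters per region are drained at once, which is the availability balance discussed next.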

Automating standup and decommission of Presto clusters

Figure 2: Automated workflow for adding hardware to clusters (Diagram by Philip S. Bell)

The distribution of the data warehouse at Meta across different regions is constantly evolving. This means new Presto clusters must be stood up while existing ones are decommissioned regularly. Previously, when there was only a small number of Presto clusters, this was a manual process. As Meta started scaling up, it quickly became challenging to track all changes manually. To solve this problem, we implemented automations to handle the standing up and decommissioning of clusters.

First we had to standardize our cluster configurations, i.e., we had to build base configurations for the different Presto use cases at Meta. Each cluster would then have a minimal number of additional or overridden specifications over the base configuration. Once that was complete, any new cluster could be turned up by automatically generating configs from the base template. Cluster turnup also required integration with automation hooks in order to integrate with the various company-wide infrastructure services like Tupperware and data warehouse-specific services. Once a cluster comes online, a few test queries are sent to the cluster and automation verifies that the queries were successfully executed by the cluster. The cluster is then registered with the Gateway and starts serving queries.

Decommissioning a cluster follows pretty much the reverse process. The cluster is de-registered from the Gateway and any running queries are allowed to finish. The Presto processes are shut down and the cluster configs are deleted.

This automation is integrated into the hardware stand up and decommission workflow for the data warehouse. The end result is that the entire process, from new hardware showing up at a data center, to Presto clusters being online and serving queries, then being shut off when hardware is decommissioned, is fully automated. Implementing this has saved valuable people-hours, reduced hardware idle time, and minimizes human error.

Automated debugging and remediations

Figure 3: Bad host detection (Diagram by Philip S. Bell)

Given the large deployment of Presto at Meta, it’s imperative that we have tooling and automation in place that makes the life of the oncall (the point of contact for Presto) easy.

Over the years, we’ve built several “analyzers” which help the oncall efficiently debug and assess the root cause for issues that come up. Monitoring systems fire alerts when there are breaches of customer-facing SLAs. The analyzers are then triggered. They source information from a wide range of monitoring systems (Operational Data Store or ODS), events published to Scuba, and even host-level logs. Custom logic in the analyzer then ties all this information together to infer probable root cause. This is extremely useful for the oncall by presenting them with root cause analysis and allowing them to jump directly into potential mitigation options. In some cases, we have completely automated both the debugging and the remediation so that the oncall doesn’t even need to get involved. A couple of examples are described below:

Bad host detection

When running Presto at scale on a large number of machines, we noticed that certain “bad” hosts could cause excessive query failures. Following our investigations, we identified a few root causes which resulted in the hosts going “bad”, including:

Hardware-level issues which hadn’t yet been caught by fleet-wide monitoring systems due to lack of coverage

Obscure JVM bugs which would sometimes lead to a steady drip of query failures

To combat this issue, we now monitor query failures in Presto clusters. Specifically, we attribute each query failure to the host that caused it, where possible. We also set up alerts that fire when an abnormally high number of query failures are attributed to specific hosts. Automation then kicks in to drain the host from the Presto fleet and thus stem the failures.
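At its core, the bad-host mitigation is a per-host failure counter with a threshold. The threshold value and event shape below are hypothetical; the real system attributes failures through Presto's own error metadata:

```python
from collections import Counter

FAILURE_THRESHOLD = 5   # hypothetical: alert after this many failures per window

def hosts_to_drain(failure_events):
    """Given (host, query_id) failure attributions from one monitoring
    window, return hosts with an abnormally high failure count."""
    per_host = Counter(host for host, _ in failure_events)
    return [host for host, count in per_host.items() if count >= FAILURE_THRESHOLD]

events = [("host-a", q) for q in range(7)] + [("host-b", 1), ("host-c", 2)]
print(hosts_to_drain(events))  # ['host-a'] - drained from the fleet by automation
```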

Debugging queueing issues

Each Presto cluster supports queuing queries on it once it reaches its maximum concurrency for running queries based on use case, hardware configuration, and query size. At Meta, there is a sophisticated routing mechanism in place so that a Presto query is dispatched to the “right” cluster which can execute the query while making the best use of resources. Several systems beyond Presto are involved in making the routing decision and they take multiple factors into account:

Current state of queuing on Presto clusters

Distribution of hardware across different datacenters

The data locality of the tables that the query uses

Given this complexity, it can be very hard for an oncall to figure out the root cause of any queueing problems encountered in production. This is another instance where analyzers come to the fore by pulling information from multiple sources and presenting conclusions.

Load balancer robustness

Figure 4: Load balancer robustness (Diagram by Philip S. Bell)

As mentioned above, our Presto clusters sit behind load balancers which route every single Presto query at Meta. In the beginning, when Presto had not yet scaled up to the level of internal usage it has today, the Gateway was very simple. However, as usage of Presto increased across Meta, we ran into scalability issues on a couple of occasions. One of them was the Gateway failing under heavy load, which could lead to Presto being unavailable for all users. The root cause for some stability issues was one service unintentionally bombarding the Gateway with millions of queries in a short span, resulting in the Gateway processes crashing and becoming unable to route any queries.

To prevent such a scenario, we set about making the Gateway more robust and tolerant to such unintended DDoS-style traffic. We implemented a throttling feature, which rejects queries when under heavy load. The throttling can be activated based on query count per second across various dimensions like per user, per source, per IP, and also at a global level for all queries. Another enhancement we implemented was autoscaling. Leaning on a Meta-wide service that supports scaling up and down of jobs, the number of Gateway instances are now dynamic. This means that under heavy load, the Gateway can now scale up to handle the additional traffic and not be maxed out on CPU/memory, thus preventing the crashing scenario described above. This, in conjunction with the throttling feature, ensures that the Gateway is robust and can withstand adverse unpredictable traffic patterns.
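The per-dimension throttling can be sketched as a sliding one-second window of timestamps per key. The limits and dimension names are invented for illustration:

```python
import time
from collections import defaultdict, deque

class Throttler:
    """Reject queries once a (dimension, key) pair exceeds its per-second limit."""
    def __init__(self, limits):
        self.limits = limits               # e.g. {"user": 100, "global": 10_000}
        self.windows = defaultdict(deque)  # (dimension, key) -> recent timestamps

    def allow(self, dimension, key, now=None):
        now = time.monotonic() if now is None else now
        window = self.windows[(dimension, key)]
        while window and now - window[0] >= 1.0:   # evict entries older than 1s
            window.popleft()
        if len(window) >= self.limits[dimension]:
            return False                           # throttled
        window.append(now)
        return True

t = Throttler({"user": 3})
results = [t.allow("user", "svc-42", now=0.0) for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Autoscaling complements this: throttling sheds abusive load, while scaling absorbs legitimate spikes.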

What advice would we give a team scaling up their own Data Lakehouse using Presto?

Figure 5: Presto architecture scaling (Diagram by Philip S. Bell)

Some of the important aspects to be kept in mind when scaling up Presto are:

  1. Establishing easy-to-understand and well-defined customer-facing SLAs. Defining SLAs around important metrics like queueing time and query failure rate in a manner that tracks customer pain points becomes crucial as Presto is scaled up. When there is a large number of users, the lack of proper SLAs can greatly hinder efforts to mitigate production issues because of confusion in determining the impact of an incident. 

  2. Monitoring and automated debugging. As Presto is scaled up and the number of clusters increases, monitoring and automated debugging becomes critical. 

    1. Having thorough monitoring can help identify production issues before their blast radius becomes too big. Catching issues early will ensure we’re minimizing user impact where we can.

    2. Manual investigations in the face of customer-impacting production issues are not scalable. It’s imperative to have automated debugging in place so that the root cause can be quickly determined.

  3. Good load balancing. As the Presto fleet grows, it’s important to have a good load balancing solution sitting in front of the Presto clusters. At scale, small inefficiencies in load balancing can have an outsized negative impact due to the sheer volume of the workload.

  4. Configuration management. Configuration management of a large fleet of Presto clusters can become a pain if it’s not well planned for. Where possible, configurations should be made hot reloadable so that Presto instances do not have to be restarted or updated in a disruptive manner which will result in query failures and customer dissatisfaction.

This article was written in collaboration with Neerad Somanchi, a Production Engineer at Meta, and Philip Bell, a Developer Advocate at Meta.

To learn more about Presto, visit prestodb.io, watch Philip Bell’s quick explanation of Presto on YouTube, or follow Presto on Twitter, Facebook and LinkedIn.


To learn more about Meta Open Source, visit our open source site, subscribe to our YouTube channel, or follow us on Twitter, Facebook and LinkedIn.

]]>
<![CDATA[Gossip Protocol Explained]]>You can subscribe to the system design newsletter to excel in system design interviews and software architecture. The original article was published on systemdesign.one website.



What Is Gossip Protocol?

The typical problems in a distributed system are the following [1], [11]:

  • maintaining the system state (liveness of nodes)
  • communication
]]>
https://highscalability.com/gossip-protocol-explained/65bceaca3953980001659b55Sun, 16 Jul 2023 15:38:02 GMTYou can subscribe to the system design newsletter to excel in system design interviews and software architecture. The original article was published on systemdesign.one website.



What Is Gossip Protocol?

The typical problems in a distributed system are the following [1], [11]:

  • maintaining the system state (liveness of nodes)
  • communication between nodes

The potential solutions to these problems are as follows [1]:

  • centralized state management service
  • peer-to-peer state management service

Centralized State Management Service

A centralized state management service such as Apache Zookeeper can be configured as the service discovery to keep track of the state of every node in the system. Although this approach provides a strong consistency guarantee, the primary drawbacks are the state management service becomes a single point of failure and runs into scalability problems for a large distributed system [1], [11].

Peer-To-Peer State Management Service

The peer-to-peer state management approach is inclined towards high availability and eventual consistency. The gossip protocol algorithms can be used to implement peer-to-peer state management services with high scalability and improved resilience [1].

The gossip protocol is also known as the epidemic protocol because the transmission of the messages is similar to the way how epidemics spread. The concept of communication in gossip protocol is analogous to the spread of rumors among the office staff or the dissemination of information on a social media website [4], [8].

Broadcast Protocols

The popular message broadcasting techniques in a distributed system are the following:

  • point-to-point broadcast
  • eager reliable broadcast
  • gossip protocol

Point-To-Point Broadcast

The producer sends a message directly to the consumers in a point-to-point broadcast. The retry mechanism on the producer and deduplication mechanism on the consumers makes the point-to-point broadcast reliable. The messages will be lost when the producer and the consumer fail simultaneously [3].

Eager Reliable Broadcast

Every node re-broadcasts the messages to every other node via reliable network links. This approach provides improved fault tolerance because messages are not lost when both the producer and the consumer fail simultaneously. The message will be re-broadcast by the remaining nodes. The caveats of eager reliable broadcast are the following [3], [8]:

  • significant network bandwidth usage due to O(n²) messages being broadcast for n number of nodes
  • sending node can become a bottleneck due to O(n) linear broadcast
  • every node stores the list of all the nodes in the system causing increased storage costs

Gossip Protocol

The gossip protocol is a decentralized peer-to-peer communication technique to transmit messages in an enormous distributed system [1], [8]. The key concept of gossip protocol is that every node periodically sends out a message to a subset of other random nodes [8], [2]. The entire system will receive the particular message eventually with a high probability [11], [3]. In layman’s terms, the gossip protocol is a technique for nodes to build a global map through limited local interactions [1].

The gossip protocol is built on a robust, scalable, and eventually consistent algorithm. It is typically used to maintain the node membership list, achieve consensus, and detect failures in a distributed system [2]. Additional information, such as application-level data, can be piggybacked on gossip messages [1].

The gossip protocol is reliable because a node failure can be overcome by the retransmission of a message by another node. First-in-first-out (FIFO) broadcast, causality broadcast, and total order broadcast can be implemented with gossip protocol [3].

Gossip protocol parameters such as the cycle and the fanout can be tuned to improve its probabilistic guarantees. Simulation and visualization tools are available for studying the behavior of the gossip protocol [8], [5].

The following characteristics of the gossip protocol make it an optimal choice as the communication protocol in a large-scale distributed system [12]:

  • limits the number of messages transmitted by each node
  • limits the bandwidth consumption to prevent the degradation of application performance
  • tolerates network and node failures

The gossip protocol can be used to keep nodes consistent only when the executed operations are commutative and serializability is not required. A tombstone is a special entry that invalidates data entries with a matching key without actually deleting the data; the gossip protocol deletes data from a node by propagating a tombstone.
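
A minimal sketch of tombstone-based deletion (the merge rule and entry layout below are illustrative, not taken from any particular system): instead of removing a key, a node records a tombstone at a newer version, and the ordinary merge rule then spreads the deletion like any other update.

```python
def merge(local, incoming):
    """Merge gossip entries, keeping the higher-versioned entry per key.
    A value of None marks a tombstone: the key is logically deleted,
    but the entry is kept so the deletion itself can keep spreading."""
    for key, (version, value) in incoming.items():
        if key not in local or local[key][0] < version:
            local[key] = (version, value)
    return local

# Node A deleted "cart:42" by writing a tombstone at a newer version.
node_a = {"cart:42": (2, None)}          # tombstone entry
node_b = {"cart:42": (1, {"items": 3})}  # stale live entry

merge(node_b, node_a)
assert node_b["cart:42"] == (2, None)  # deletion wins; key is not resurrected
```

Because the tombstone carries a higher version, the stale live entry cannot resurrect the key on a later gossip exchange.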

Types of Gossip Protocol

The time required by the gossip protocol to propagate a message across the system and the network traffic generated in propagating a message must be taken into consideration while choosing the type of gossip protocol for a particular use case [10]. The gossip protocol can be broadly categorized into the following types [8], [10]:

  • anti-entropy model
  • rumor-mongering model
  • aggregation model

Anti-Entropy Gossip Protocol

The anti-entropy algorithm was introduced to reduce the entropy (divergence) between replicas of a stateful service such as a database. The replicated data is compared, and the differences between replicas are patched [10]. In every gossip round, the node with the newest message shares it with other nodes [8].

The anti-entropy model usually transfers the whole dataset, resulting in unnecessary bandwidth usage. Techniques such as checksums, recent-update lists, and Merkle trees can be used to identify the differences between nodes, avoiding transmission of the entire dataset and reducing network bandwidth usage. The anti-entropy gossip protocol sends an unbounded number of messages without termination [8].
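
A minimal sketch of the checksum idea (function names and data layout are illustrative): compare cheap per-entry digests first and ship only the entries that differ, rather than the whole dataset.

```python
import hashlib

def digest(value):
    # Cheap per-entry checksum; a real system might instead use a
    # Merkle tree over key ranges to compare whole subsets at once.
    return hashlib.md5(repr(value).encode()).hexdigest()

def anti_entropy_sync(source, replica):
    """Copy only the entries whose checksums differ, returning the
    number of entries actually transferred."""
    transferred = 0
    for key, value in source.items():
        if key not in replica or digest(replica[key]) != digest(value):
            replica[key] = value
            transferred += 1
    return transferred

source  = {"a": 1, "b": 2, "c": 3}
replica = {"a": 1, "b": 9}        # "b" is stale, "c" is missing
sent = anti_entropy_sync(source, replica)
assert sent == 2 and replica == source  # only two entries crossed the wire
```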

Rumor-Mongering Gossip Protocol

The rumor-mongering protocol is also known as the dissemination protocol. Rumor-mongering cycles occur more frequently than anti-entropy cycles and, in the worst case, flood the network [10]. However, the rumor-mongering model uses fewer resources such as network bandwidth because only the latest updates are transferred across nodes [8].

A message will be marked as removed after a few rounds of communication to limit the number of messages. There is usually a high probability that a message will reach all the nodes [8].

Aggregation Gossip Protocol

The aggregation gossip protocol computes a system-wide aggregate by sampling information across every node and combining the values to generate a system-wide value [10].
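
One well-known way to realize aggregation gossip is the push-sum technique; the sketch below (function names and parameters assumed) estimates the system-wide average: each node halves its (sum, weight) pair every round and sends one half to a random peer, and the ratio sum/weight at every node converges to the global average.

```python
import random

def push_sum_average(values, rounds=50, seed=7):
    """Push-sum aggregation sketch: every node keeps half of its
    (sum, weight) pair each round and sends the other half to a
    random peer; sum/weight converges to the global average."""
    rng = random.Random(seed)
    n = len(values)
    s = list(map(float, values))
    w = [1.0] * n
    for _ in range(rounds):
        inbox = [(0.0, 0.0)] * n
        for i in range(n):
            half_s, half_w = s[i] / 2, w[i] / 2
            j = rng.randrange(n)  # random peer (possibly self)
            inbox[i] = (inbox[i][0] + half_s, inbox[i][1] + half_w)  # kept half
            inbox[j] = (inbox[j][0] + half_s, inbox[j][1] + half_w)  # sent half
        s = [pair[0] for pair in inbox]
        w = [pair[1] for pair in inbox]
    return [s[i] / w[i] for i in range(n)]

estimates = push_sum_average([10, 20, 30, 40])
assert all(abs(e - 25.0) < 0.01 for e in estimates)  # every node learns the average
```

Note that no node ever sees the full value list; the system-wide average emerges purely from pairwise exchanges.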

Further system design learning resources

Subscribe to the system design newsletter and never miss a new blog post again. You will also receive the ultimate guide to approaching system design interviews on newsletter sign-up.

Strategies to Spread a Message through Gossip Protocol

The gossip protocol is an optimal framework to build a highly available service. The strategy to spread a message through gossip protocol should be chosen based on the service requirements and available network conditions. There are tradeoffs in terms of bandwidth, latency, and reliability with each strategy to spread a message. The strategies to spread a message apply to both anti-entropy and rumor-mongering models. The different strategies to spread a message with the gossip protocol are as follows [8], [5], [2]:

  • push model
  • pull model
  • push-pull model

Push Model

In the push model, the node with the latest message sends it to a random subset of other nodes. Because of the traffic overhead, the push model is efficient only when there are few update messages [8].

Pull Model

In the pull model, every node actively polls a random subset of nodes for update messages. This approach is efficient when there are many update messages because a polled node is then highly likely to hold the latest update [8].

Push-Pull Model

The push-pull model is optimal to disseminate update messages quickly and reliably [2]. The node can push a new update message and the node can also poll for new update messages. The push approach is efficient during the initial phase when there are only a very few nodes with update messages. The pull approach is efficient during the final phase when there are numerous nodes with many update messages [8].
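
The three strategies can be compared with a toy round-based simulation (a sketch under simplified assumptions: synchronous rounds and uniform random peer selection; function names are illustrative):

```python
import random

def rounds_to_spread(n, model="push-pull", fanout=1, seed=1):
    """Count synchronous gossip rounds until every node holds the message.
    push: informed nodes send to random peers each round;
    pull: uninformed nodes poll random peers each round;
    push-pull: both happen in the same round."""
    rng = random.Random(seed)
    informed = {0}  # node 0 starts with the update
    rounds = 0
    while len(informed) < n:
        new = set(informed)
        if model in ("push", "push-pull"):
            for _ in range(len(informed) * fanout):
                new.add(rng.randrange(n))  # push to a random peer
        if model in ("pull", "push-pull"):
            for node in range(n):
                if node not in informed and rng.randrange(n) in informed:
                    new.add(node)  # polled peer happened to be informed
        informed = new
        rounds += 1
    return rounds

# Push-pull spreads a message across 1,000 nodes in roughly log(n) rounds.
assert rounds_to_spread(1_000, "push-pull") < 25
```

Running this with the different models illustrates the tradeoff described above: push dominates while few nodes are informed, pull dominates near the end, and push-pull combines both phases.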

Gossip Protocol Performance

The number of nodes that will receive the message from a particular node is known as the fanout. The count of gossip rounds required to spread a message across the entire cluster is known as the cycle [8], [5].

cycles necessary to spread a message across the cluster ≈ log(n) to the base of the fanout, where n = total number of nodes

For instance, it takes approximately 15 gossip rounds to propagate a message across 25,000 nodes. The gossip interval can be set to a value as low as 10 ms to propagate a message across a large data center in roughly 3 seconds. Message propagation should automatically age out to reduce unnecessary load [4]. The performance of a gossip protocol implementation can be measured with the following metrics [8]:

  • residue: number of remaining nodes that haven’t received the messages should be minimum
  • traffic: average number of messages sent between nodes should be minimum
  • convergence: every node should receive the message as quickly as possible
  • time average: average time taken to send the message to every node should be low
  • time last: the time taken for the last node to receive the message should be low
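
The cycle estimate above can be checked with quick arithmetic (a sketch assuming the informed population multiplies by roughly the fanout each round):

```python
import math

def estimated_cycles(n, fanout=2):
    # cycles ≈ log_fanout(n): each round roughly multiplies the
    # number of informed nodes by the fanout.
    return math.ceil(math.log(n, fanout))

assert estimated_cycles(25_000, fanout=2) == 15     # matches the ~15 rounds cited
assert estimated_cycles(1_000_000, fanout=2) == 20  # a million nodes: ~20 rounds
```

With a 10 ms gossip interval, 15 rounds works out to roughly 150 ms of protocol time; the ~3 seconds quoted above leaves room for real-world jitter and stragglers.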

A case study showed that a system with 128 nodes consumed less than 2 percent of CPU and less than 60 KBps of bandwidth to run gossip protocol [11].

Gossip Protocol Properties

There is no formal way to define gossip protocol. In general, the gossip protocol is expected to satisfy the following properties [8]:

  • node selection must be random to perform a fanout
  • only local information is available to every node and the nodes are oblivious to the state of the cluster
  • communication between nodes involves periodic, pairwise, interprocess interactions
  • bounded size transmission capacity per gossip round
  • every node deploys the same gossip protocol
  • unreliable network paths between nodes are assumed
  • node interaction frequency is low
  • node interactions result in a state exchange

Gossip Algorithm

The high-level overview of the gossip algorithm is the following [6], [1]:

  1. every node maintains a list of the subset of nodes and their metadata
  2. gossip to a random live peer node’s endpoint periodically
  3. every node inspects the received gossip message to merge the highest version number to the local dataset

The heartbeat counter of a node is incremented whenever a particular node participates in the gossip exchange. The node is labeled healthy when the heartbeat counter keeps incrementing. On the other hand, the node is considered to be unhealthy when the heartbeat counter has not changed for an extended period due to a network partition or node failure [1]. The following are the different criteria for peer node selection in the gossip protocol [12]:

  • utilize a random-selection library offered by the programming language, such as java.util.Random
  • interact with the least contacted node
  • enforce network-topology-aware interaction
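
The algorithm steps and the heartbeat-based health check can be sketched as follows (the `Member` structure, field names, and threshold are illustrative assumptions):

```python
class Member:
    """Per-node membership entry: a heartbeat counter plus the local
    time (in gossip rounds) at which the counter last advanced."""
    def __init__(self, heartbeat=0, last_updated=0):
        self.heartbeat = heartbeat
        self.last_updated = last_updated

def merge_membership(table, incoming, now, fail_after=10):
    """Adopt any higher heartbeat from a received gossip message, then
    return the nodes whose heartbeat has not advanced for more than
    `fail_after` rounds (suspected failed or partitioned)."""
    for node, heartbeat in incoming.items():
        entry = table.setdefault(node, Member())
        if heartbeat > entry.heartbeat:
            entry.heartbeat, entry.last_updated = heartbeat, now
    return {n for n, e in table.items() if now - e.last_updated > fail_after}

table = {}
merge_membership(table, {"10.0.0.1": 5, "10.0.0.2": 3}, now=0)
# 20 rounds later, only 10.0.0.1 has gossiped a newer heartbeat.
suspects = merge_membership(table, {"10.0.0.1": 25}, now=20)
assert suspects == {"10.0.0.2"}  # stalled heartbeat marks the node unhealthy
```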

Gossip Protocol Implementation

The gossip protocol transports messages over User Datagram Protocol (UDP) or Transmission Control Protocol (TCP) with a configurable but fixed fanout and interval [12]. The peer sampling service is used by the gossip protocol to identify the peer nodes for gossip message exchange. The peer sampling service uses a randomized algorithm for the selection of a peer node. The application programming interface (API) of the peer sampling service should provide the following endpoints [8]:

  • /gossip/init: returns the list of nodes known to a particular node at startup
  • /gossip/get-peer: returns the address (IP address and port number) of an independent peer node

The workflow of the peer sampling service execution is as follows [8]:

  1. initialize every node with a partial view of the system (a list with the subset of nodes)
  2. merge the node’s view with the peer node’s view on the gossip exchange

Put another way, every node maintains a small local membership table with a partial view of the system and periodically refreshes the table through gossip messages. The gossip protocol can leverage probabilistic distribution for selecting a peer node to reduce duplicate message transmission to the same node [4].

The application state can be transferred as key-value pairs via the gossip protocol. The most recent value must be transferred when multiple changes are performed to the same key by a node. The API provided by the gossip protocol to orchestrate application state exchange is the following [6]:

  • /gossip/on-join
  • /gossip/on-alive
  • /gossip/on-dead
  • /gossip/on-change

The seed nodes are fully functional nodes listed in a static configuration. Every node in the system must be aware of the seed nodes. The gossip system interacts with the seed nodes to prevent logical divisions of the cluster [4], [12]. The following is the high-level workflow when a node receives a gossip message with the node metadata of a peer node [12]:

  1. compare the incoming gossip message against the local dataset to identify the values missing on the local node
  2. compare the incoming gossip message against the local dataset to identify the values missing on the peer node
  3. choose the value with the higher version when the local node already contains a value present in the incoming gossip message
  4. append the missing values to the local node's dataset
  5. return the values missing on the peer node's dataset in the response
  6. update the peer node's dataset with the received response

It is typical to transfer the entire node metadata through the gossip protocol on a node startup. An in-memory version number can be maintained by each node to send only incremental updates of the node metadata through the gossip protocol.
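
The exchange above can be sketched as a single merge over versioned key-value pairs (the layout is assumed; real implementations first exchange digests so that values are only shipped when actually needed):

```python
def handle_gossip(local, incoming):
    """Sketch of the workflow above. `local` and `incoming` map keys to
    (version, value) pairs. Updates the local dataset with entries the
    peer had newer versions of, and returns the entries the peer is
    missing so they can be sent back in the response."""
    for_peer = {}
    for key, (version, value) in local.items():
        if key not in incoming or incoming[key][0] < version:
            for_peer[key] = (version, value)       # peer is behind on this key
    for key, (version, value) in incoming.items():
        if key not in local or local[key][0] < version:
            local[key] = (version, value)          # we are behind on this key
    return for_peer

node = {"x": (3, "a"), "y": (1, "b")}
peer = {"y": (2, "c"), "z": (1, "d")}
response = handle_gossip(node, peer)
assert response == {"x": (3, "a")}                         # sent back to the peer
assert node == {"x": (3, "a"), "y": (2, "c"), "z": (1, "d")}  # local view repaired
```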

The generation clock is a monotonically increasing number indicating the generation of the server. The generation clock is incremented whenever the node restarts. The version number guarantees the ordering and versioning of the application state. The version number can only be incremented [6]. The generation clock can be used along with the version number to detect node metadata changes correctly on node restarts [12].

The gossiper timer is a component of the gossip protocol that will ensure every node eventually contains the crucial metadata about peer nodes, including the nodes that are behind a network partition. Every node includes a heartbeat associated with it. The heartbeat state consists of the generation and version number. The application state consists of key-value pairs representing the node state and a version number [6].

A node initiating a gossip exchange sends a gossip digest synchronization message that consists of a list of gossip digests. The gossip digest consists of an endpoint address, a generation number, and a version number. The gossip digest acknowledgment message consists of a gossip digest list and an endpoint state list. The sample schema for the gossip digest is the following [6]:

EndPointState: 10.0.1.42
HeartBeatState: generation: 1259904231, version: 761
ApplicationState: "average-load": 2.4, generation: 1659909691, version: 42
ApplicationState: "bootstrapping": pxLpassF9XD8Kymj, generation: 1259909615, version: 90

Gossip Protocol Use Cases

The gossip protocol is used in a multitude of applications where eventual consistency is favored. The popular applications of the gossip protocol are as follows [8], [5], [4], [7], [12]:

  • database replication
  • information dissemination
  • maintaining cluster membership
  • failure detection
  • generate aggregations (calculate average, maximum, sum)
  • generate overlay networks
  • leader election

The gossip protocol can be used to detect the failure of a node in a distributed system with high probability. The failure detection of nodes can save resources such as CPU, bandwidth, and queue space. In a distributed system, it is not sufficient to assert a node failure when a single client cannot interact with the particular node because there might be a network partition or a client failure [1]. The failure of a particular node can be concluded with high confidence when several nodes (clients) are unable to confirm the liveness of that node through the gossip protocol [4], [11].

The gossip protocol is significantly more reliable for data exchange and command-and-control than direct TCP connections. The gossip protocol enables abstracting communication about node and subsystem properties out of the application logic [11]. Node statistics such as the average load and free memory can be transmitted in gossip messages to improve local decision-making on fanout.

The subsystem information such as queue depth, key metadata such as configuration changes, and even request-response can be transmitted through the gossip protocol. The aggregation of node update messages via gossip protocol allows sending data in a single chunk instead of multiple small messages to reduce the communication overhead [11].

The messages can be routed across the cluster optimally by identifying the liveness of nodes [9]. The decision-making at the local node level without involving a centralized service is the key to scaling the gossip protocol [4], [11]. The messages can be versioned with a vector clock to ignore the older message versions by the node [9], [2]. The real-world use cases of the gossip protocol are the following [12], [8], [4], [9], [11]:

  • Apache Cassandra employs the gossip protocol to maintain cluster membership, transfer node metadata (token assignment), repair unread data using Merkle trees, and node failure detection
  • Consul utilizes the swim-gossip protocol variant for group membership, leader election, and failure detection of consul agents
  • CockroachDB operates the gossip protocol to propagate the node metadata
  • Hyperledger Fabric blockchain uses the gossip protocol for group membership and ledger metadata transfer
  • Riak utilizes the gossip protocol to transmit consistent hash ring state and node metadata around the cluster
  • Amazon S3 uses the gossip protocol to spread server state across the system
  • Amazon Dynamo employs the gossip protocol for failure detection, and keeping track of node membership
  • Redis cluster uses the gossip protocol to propagate the node metadata
  • Bitcoin uses the gossip protocol to spread the nonce value across the mining nodes

Gossip Protocol Advantages

The advantages of gossip protocol are the following [8], [2], [7], [4], [5]:

  • scalable
  • fault tolerant
  • robust
  • convergent consistency
  • decentralized
  • simplicity
  • integration and interoperability
  • bounded load

Scalability

Scalability is the ability of the system to handle increasing load without degradation of performance [2]. The gossip protocol requires a logarithmic number of cycles to achieve convergence. In addition, every node interacts with only a fixed number of nodes and sends only a fixed number of messages, independent of the number of nodes in the system. A node does not wait for acknowledgments, which keeps latency low [8], [4], [5].

Fault Tolerance

Fault tolerance is the ability of the system to remain functional in the occurrence of failures such as node crashes, network partitions, or message loss. The distributed system employing the gossip protocol is fault tolerant due to tolerance towards unreliable networks. The redundancy, parallelism, and randomness offered by the gossip protocol improve the fault tolerance of the system [2].

Furthermore, the symmetric and decentralized nature of the nodes strengthens the fault tolerance of the gossip protocol [5]. The same message is usually transmitted several times across multiple nodes. Put another way, there are many routes for the message flow between the source and destination. So, a node failure is overcome via message transmission through another node [8], [4].

Robustness

The symmetric nature of the nodes participating in the gossip protocol improves the robustness of the system [5], [4]. A node failure will not disrupt the system quality. The gossip protocol is also robust against transient network partitions. However, the gossip protocol is not robust against a malfunctioning node or a malicious gossip message unless the data is self-verified [8], [7].

A score-based reputation system for nodes can be used to prevent gossip system corruption by malicious nodes. Appropriate mechanisms and policies such as encryption, authentication, and authorization must be implemented to enforce the privacy and security of the gossip system [2].

Convergent Consistency

Consistency is the technique of ensuring the same state view across every node in the system. The different consistency levels such as strong, eventual, causal, and probabilistic consistency have different implications on the performance, availability, and correctness of the system [2]. The gossip protocol converges to a consistent state in logarithmic time complexity through the exponential spread of data [8], [5].

Decentralization

The gossip protocol offers an extremely decentralized model of information discovery through peer-to-peer communication [8], [4], [5].

Simplicity

Most variants of the gossip protocol can be implemented with very little code and low complexity [8], [5]. The symmetric nature of the nodes makes it trivial to execute the gossip protocol [7].

Integration and Interoperability

The gossip protocol can be integrated and interoperated with distributed system components such as the database, cache, and queue. Common interfaces, data formats, and protocols must be defined to implement the gossip protocol across different distributed system components [2].

Bounded Load

The classic distributed system protocols usually generate high surge loads that might overload individual distributed system components. The gossip protocol will produce only a strictly bounded worst-case load on individual distributed system components to avoid the disruption of service quality. The peer node selection in the gossip protocol can be tuned to reduce the load on network links. In practice, the load generated by the gossip protocol is not only bounded but also negligible compared to the available bandwidth [7].

Gossip Protocol Disadvantages

The disadvantages of the gossip protocol are the following [1], [5], [8], [2], [7]:

  • eventual consistency
  • unawareness of network partitions
  • relatively high bandwidth consumption
  • increased latency
  • difficulty in debugging and testing
  • membership protocol is not scalable
  • prone to computational errors

Eventually Consistent

The gossip protocol is inherently eventually consistent [1]. It is relatively slow compared to multicast [5]. There is also an overhead associated with gossip messages, and the gossip behavior depends on the network topology and node heterogeneity [2]. Therefore, there will be some delay before the cluster recognizes a new node or a node failure [12].

Network Partition Unawareness

When a network partition occurs, the nodes in the sub-partition will still gossip with each other. Hence, the gossip protocol is unaware of network partitions and might significantly delay message propagation [1], [7].

Bandwidth

The gossip protocol is not known for efficiency as the same message might be retransmitted to the same node multiple times consuming unnecessary bandwidth [5], [8]. Although the bandwidth usage by the gossip protocol is limited due to bounded message size and periodic exchange of messages, the effective fanout by gossip exchange might degrade when the amount of information that a node should gossip exceeds the bounded message size [7].

The saturation point of the gossip protocol depends on different parameters such as the rate of generation of messages, message size, fanout, and the type of gossip protocol [7], [8].

Latency

The usage of the gossip protocol results in increased latency because a node must wait for the next gossip cycle (interval) to transmit a message [5]. A message does not trigger a gossip exchange; the gossip protocol interval timer does. The time required to spread a message across the system grows logarithmically with the number of nodes [8], [4].

Debugging and Testing

Debugging is identifying and fixing the failures that cause the gossip protocol to deviate from the expected behavior. Testing is the ability to verify whether the gossip protocol meets functional and non-functional requirements such as performance, reliability, and security [2].

The inherent non-determinism and distributed nature of the gossip protocol make it hard to debug and reproduce the failures [8], [5], [2]. Tools and techniques such as simulation, emulation, logging, tracing, monitoring, and visualization can be used to test and debug the gossip system [2].

Scalability

Most variants of the gossip protocol rely on a non-scalable membership protocol [5].

Computational Error

The gossip protocol is prone to computational errors caused by malicious nodes. Nodes should implement a self-correcting mechanism because the robustness of the gossip protocol is limited to certain classes of failures [7]. Nevertheless, the gossip protocol is extremely reliable in practice, and convergence with probability approaching one is typical [8].

Summary

Gossiping in a distributed system is a boon, while gossiping in the meat world is a curse. The gossip protocol is employed in distributed systems such as Amazon Dynamo and in designs such as a distributed counter.



References

[1]: Prateek Gupta, Gossip Protocol in distributed systems (2022), medium.com

[2]: How do you integrate a gossip system with other distributed components and services?, Distributed Systems (LinkedIn.com)

[3]: Martin Kleppmann, Distributed Systems 4.3: Broadcast algorithms (2021), YouTube.com

[4]: Bhumika Dutta, A Gentle Introduction to Gossip Protocol (2022), analyticssteps.com

[5]: Gabriel Acuna, Parallel & Distributed Computing — Gossip Protocol (2020), YouTube.com

[6]: Architecture Gossip, Cassandra

[7]: Ken Birman, The Promise, and Limitations, of Gossip Protocols (2007), cornell.edu

[8]: Felix Lopez, Introduction to Gossip (2016), managementfromscratch.wordpress.com

[9]: Kumar Chandrakant, Fundamentals of Distributed Systems (2023), baeldung.com

[10]: Alan Demers et al., Epidemic Algorithms for Replicated Database Maintenance (1987), berkeley.edu

[11]: Todd Hoff, Using Gossip Protocols For Failure Detection, Monitoring, Messaging And Other Good Things (2011), highscalability.com

[12]: Unmesh Joshi, Gossip Dissemination (2021), martinfowler.com

Consistent Hashing Algorithm

https://highscalability.com/consistent-hashing-algorithm/ (Wed, 22 Feb 2023)

You can subscribe to the system design newsletter to excel in system design interviews and software architecture.

You can view the original article Consistent hashing explained on systemdesign.one website.

How does consistent hashing work?

At a high level, consistent hashing performs the following operations:

  1. The output of the hash function is placed on a virtual ring structure (known as the hash ring)
  2. The hashed IP addresses of the nodes are used to assign a position for the nodes on the hash ring
  3. The key of a data object is hashed using the same hash function to find the position of the key on the hash ring
  4. The hash ring is traversed in the clockwise direction starting from the position of the key until a node is found
  5. The data object is stored or retrieved from the node that was found
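
The five steps above translate into a small ring implementation (a sketch; MD5 and the class interface are illustrative choices, and the detailed design considerations are covered in the sections that follow):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal hash ring: node identifiers and data keys are hashed
    with the same function, and a key is served by the first node
    clockwise from the key's position."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(value):
        # Hash IP addresses and keys into the same numeric space.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        bisect.insort(self._ring, (self._position(node), node))

    def remove_node(self, node):
        self._ring.remove((self._position(node), node))

    def get_node(self, key):
        # Traverse clockwise from the key's position; wrap past the
        # largest position back to the smallest.
        if not self._ring:
            raise KeyError("empty ring")
        positions = [pos for pos, _ in self._ring]
        index = bisect.bisect_right(positions, self._position(key))
        return self._ring[index % len(self._ring)][1]

ring = ConsistentHashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
owner = ring.get_node("user:42")
ring.remove_node(owner)                   # simulate a node failure
assert ring.get_node("user:42") != owner  # only this node's keys are remapped
```

Removing a node remaps only the keys it owned; every other key keeps its owner, which is exactly the property static hash partitioning lacks.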

Terminology

The following terminology might be useful for you:

  • Node: a server that provides functionality to other services
  • Hash function: a mathematical function used to map data of arbitrary size to fixed-size values
  • Data partitioning: a technique of distributing data across multiple nodes to improve the performance and scalability of the system
  • Data replication: a technique of storing multiple copies of the same data on different nodes to improve the availability and durability of the system
  • Hotspot: A performance-degraded node in a distributed system due to a large share of data storage and a high volume of retrieval or storage requests
  • Gossip protocol: peer-to-peer communication technique used by nodes to periodically exchange state information

Requirements

Functional Requirements

  • Design an algorithm to horizontally scale the cache servers
  • The algorithm must minimize the occurrence of hotspots in the network
  • The algorithm must be able to handle internet-scale dynamic load
  • The algorithm must reuse existing network protocols such as TCP/IP

Non-Functional Requirements

  • Scalable
  • High availability
  • Low latency
  • Reliable

Introduction

A website can become extremely popular in a relatively short time frame. The increased load might swamp and degrade the performance of the website. The cache server is used to improve the latency and reduce the load on the system. The cache servers must scale to meet the dynamic demand as a fixed collection of cache servers will not be able to handle the dynamic load. In addition, the occurrence of multiple cache misses might swamp the origin server.

Figure 1: Cache replication

The replication of the cache improves the availability of the system. However, replication of the cache does not solve the dynamic load problem as only a limited data set can be cached [1]. The tradeoffs of the cache replication approach are the following:

  • only a limited data set is cached
  • consistency between cache replicas is expensive to maintain

The spread is the number of cache servers holding the same key-value pair (data object). The load is the number of distinct data objects assigned to a cache server. The optimal configuration for the high performance of a cache server is to keep the spread and the load at a minimum [2].

Figure 2: Dynamic hashing

The data set must be partitioned (shard) among multiple cache servers (nodes) to horizontally scale. The replication and partitioning of nodes are orthogonal to each other. Multiple data partitions are stored on a single node for improved fault tolerance and increased throughput. The reasons for partitioning are the following [1]:

  • a cache server is memory bound
  • increased throughput

Partitioning

The data set is partitioned among multiple nodes to horizontally scale out. The different techniques for partitioning the cache servers are the following [1]:

  • Random assignment
  • Single global cache
  • Key range partitioning
  • Static hash partitioning
  • Consistent hashing

Random assignment

Figure 3: Partitioning; Random assignment

The server distributes the data objects randomly across the cache servers. The random assignment of a large data set results in a relatively uniform distribution of data. However, the client cannot easily identify the node to retrieve the data due to random distribution. In conclusion, the random assignment solution will not scale to handle the dynamic load.

Single global cache

Figure 4: Partitioning; Single global cache

The server stores the whole data set on a single global cache server. The data objects are easily retrieved by the client at the expense of degraded performance and decreased availability of the system. In conclusion, the single global cache solution will not scale to handle the dynamic load.

Key range partitioning

Figure 5: Key range partitioning

The cache servers are partitioned using the key range of the data set. The client can easily retrieve the data from cache servers. The data set is not necessarily uniformly distributed among the cache servers as there might be more keys in a certain key range. In conclusion, the key range partitioning solution will not scale to handle the dynamic load.

Static hash partitioning

Figure 6: Static hash partitioning

The identifiers (internet protocol address or domain name) of the nodes are placed on an array of length N. The modulo hash service computes the hash of the data key and executes a modulo N operation to locate the array index (node identifier) to store or retrieve a key. The time complexity to locate a node identifier (ID) in static hash partitioning is constant O(1).

Static hash partitioning

node ID = hash(key) mod N

where N is the array’s length and the key is the key of the data object.

A collision occurs when multiple nodes are assigned the same position on the array. Techniques to resolve a collision include open addressing and chaining. The occurrence of a collision degrades the lookup time complexity of the cache nodes.

Figure 7: Static hash partitioning; Node failure

The static hash partitioning is not horizontally scalable. The removal of a node (due to a server crash) breaks the existing mappings between the keys and nodes. The keys must be rehashed to restore mapping between keys and nodes [3].

Figure 8: Static hash partitioning; Node added

New nodes must be provisioned to handle the increasing load. The addition of a node breaks the existing mappings between the keys and nodes. The following are the drawbacks of static hash partitioning:

  • nodes will not horizontally scale to handle the dynamic load
  • the addition or removal of a node breaks the mapping between keys and nodes
  • massive data movement when the number of nodes changes

Figure 9: Static hash partitioning; Data movement due to node failure

In conclusion, the data set must be rehashed or moved between nodes when the number of nodes changes. The majority of the requests in the meantime will result in cache misses. The requests are delegated to the origin server on cache misses. The heavy load on the origin server might swamp and degrade the service [3].

Further System Design Learning Resources

Subscribe to the system design newsletter and never miss a new blog post again. You will also receive the ultimate guide to approaching system design interviews on newsletter sign-up.

Consistent hashing

Figure 10: Consistent hashing

Consistent hashing is a distributed systems technique that operates by assigning the data objects and nodes a position on a virtual ring structure (hash ring). Consistent hashing minimizes the number of keys to be remapped when the total number of nodes changes [4].

Figure 11: Hash function mapping

The basic gist behind the consistent hashing algorithm is to hash both node identifiers and data keys using the same hash function. A uniform and independent hashing function such as message-digest 5 (MD5) is used to find the position of the nodes and keys (data objects) on the hash ring. The output range of the hash function must be of reasonable size to prevent collisions.

Figure 12: Consistent hash ring

The output space of the hash function is treated as a fixed circular space to form the hash ring. The largest hash value wraps around the smallest hash value. The hash ring is considered to have a finite number of positions [5].

Figure 13: Consistent hashing; Positioning the nodes on the hash ring

The following operations are executed to locate the position of a node on the hash ring [4]:

  1. Hash the internet protocol (IP) address or domain name of the node using a hash function
  2. Convert the hash code to a positive integer
  3. Modulo the hash code with the total number of available positions on the hash ring

Figure 14: Consistent hashing; Node position

Suppose the hash function produces an output space of 10 bits (2¹⁰ = 1024 positions); the hash ring is then a virtual circle numbered from 0 to 1023. The hashed value of the IP address of a node is used to assign the node a location on the hash ring.

Figure 15: Consistent hashing; Storing a data object (key)

The key of the data object is hashed using the same hash function to locate the position of the key on the hash ring. The hash ring is traversed in the clockwise direction starting from the position of the key until a node is found. The data object is stored on the node that was found. In simple words, the first node with a position value greater than the position of the key stores the data object [6].

Figure 16: Consistent hashing; Retrieving a data object (key)

The key of the data object is hashed using the same hash function to locate the position of the key on the hash ring. The hash ring is traversed in the clockwise direction starting from the position of the key until a node is found. The data object is retrieved from the node that was found. In simple words, the first node with a position value greater than the position of the key must hold the data object.

Each node is responsible for the region on the ring between the node and its predecessor node on the hash ring. The origin server must be queried on a cache miss. In conclusion, the following operations are performed for consistent hashing [7]:

  1. The output of the hash function such as MD5 is placed on the hash ring
  2. The IP address of the nodes is hashed to find the position of the nodes on the hash ring
  3. The key of the data object is hashed using the same hash function to locate the position of the key on the hash ring
  4. Traverse the hash ring in the clockwise direction starting from the position of the key until the next node to identify the correct node to store or retrieve the data object
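The four steps above can be sketched in Python, with a sorted list standing in for the search tree discussed later. The node identifiers and the MD5 choice are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring; a sorted list plays the role of a BST."""

    def __init__(self):
        self._positions = []  # sorted hash positions of the nodes
        self._owner = {}      # position -> node identifier

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node_id: str) -> None:
        pos = self._hash(node_id)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node_id

    def remove_node(self, node_id: str) -> None:
        pos = self._hash(node_id)
        self._positions.remove(pos)
        del self._owner[pos]

    def node_for_key(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping at the end.
        idx = bisect.bisect_right(self._positions, self._hash(key))
        return self._owner[self._positions[idx % len(self._positions)]]
```

Removing a node only reassigns the keys that the removed node owned; every other key keeps its owner, which is exactly the property that makes consistent hashing attractive.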

Figure 17: Consistent hashing; Deletion of a node

The failure (crash) of a node results in the movement of data objects from the failed node to the immediate neighboring node in the clockwise direction. The remaining nodes on the hash ring are unaffected [5].

Figure 18: Consistent hashing; Addition of a node

When a new node is provisioned and added to the hash ring, the keys (data objects) that fall within the range of the new node are moved out from the immediate neighboring node in the clockwise direction.

Consistent hashing

Average number of keys stored on a node = k/N

where k is the total number of keys (data objects) and N is the number of nodes.

The deletion or addition of a node moves, on average, only the k/N keys stored on a single node. Consistent hashing aids cloud computing by minimizing the movement of data when the total number of nodes changes due to dynamic load [8].

Figure 19: Consistent hashing; Non-uniform positioning of nodes

There is a chance that nodes are not uniformly distributed on the consistent hash ring. The nodes that receive a huge amount of traffic become hotspots, which can result in cascading failures of the nodes.

Figure 20: Consistent hashing; Virtual nodes

Each node is assigned multiple positions on the hash ring by hashing the node ID through distinct hash functions, which ensures a more uniform distribution of keys among the nodes. The technique of assigning multiple positions to a node is known as virtual nodes. Virtual nodes improve the load balancing of the system and prevent hotspots. The number of positions for a node is decided by the heterogeneity of the node. In other words, nodes with a higher capacity are assigned more positions on the hash ring [5].
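A minimal sketch of virtual nodes: instead of distinct hash functions, each physical node is hashed with replica suffixes ("node#0", "node#1", …), a common stand-in. With 200 positions per node, three illustrative nodes each end up owning roughly a third of 3,000 sample keys:

```python
import bisect
import hashlib
from collections import Counter

def _h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(node_ids, vnodes=200):
    """Place each physical node at `vnodes` ring positions via 'node#i'."""
    ring = sorted((_h(f"{node}#{i}"), node)
                  for node in node_ids for i in range(vnodes))
    positions = [pos for pos, _ in ring]
    owners = [node for _, node in ring]
    return positions, owners

def node_for_key(positions, owners, key):
    # First virtual node clockwise from the key resolves to its physical owner.
    idx = bisect.bisect_right(positions, _h(key)) % len(positions)
    return owners[idx]

positions, owners = build_ring(["node-a", "node-b", "node-c"])
counts = Counter(node_for_key(positions, owners, f"key-{i}")
                 for i in range(3000))
print(counts)  # each physical node owns roughly a third of the keys
```

Heterogeneous capacity falls out naturally: give a bigger node a larger `vnodes` count and it owns proportionally more of the ring.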

The data objects can be replicated on adjacent nodes to minimize the data movement when a node crashes or when a node is added to the hash ring. In conclusion, consistent hashing resolves the problem of dynamic load.

Consistent hashing implementation

Figure 21: Consistent hashing implementation; Binary search tree storing the node positions

The self-balancing binary search tree (BST) data structure is used to store the positions of the nodes on the hash ring. The BST offers logarithmic O(log n) time complexity for search, insert, and delete operations. The keys of the BST contain the positions of the nodes on the hash ring.

The BST data structure is stored on a centralized highly available service. As an alternative, the BST data structure is stored on each node, and the state information between the nodes is synchronized through the gossip protocol [7].

Figure 22: Consistent hashing implementation; Insertion of a data object (key)

In the diagram, suppose the hash of an arbitrary key ‘xyz’ yields the hash code output 5. The successor BST node is 6 and the data object with the key ‘xyz’ is stored on the node that is at position 6. In general, the following operations are executed to insert a key (data object):

  1. Hash the key of the data object
  2. Search the BST in logarithmic time to find the BST node immediately greater than the hashed output
  3. Store the data object in the successor node

Figure 23: Consistent hashing implementation; Insertion of a node

The insertion of a new node results in the movement of data objects that fall within the range of the new node from the successor node. Each node might store an internal or an external BST to track the keys allocated in the node. The following operations are executed to insert a node on the hash ring:

  1. Insert the hash of the node ID in BST in logarithmic time
  2. Identify the keys that fall within the subrange of the new node from the successor node on BST
  3. Move the keys to the new node

Figure 24: Consistent hashing implementation; Deletion of a node

The deletion of a node results in the movement of data objects that fall within the range of the decommissioned node to the successor node. An additional external BST can be used to track the keys allocated in the node. The following operations are executed to delete a node on the hash ring:

  1. Delete the hash of the decommissioned node ID in BST in logarithmic time
  2. Identify the keys that fall within the range of the decommissioned node
  3. Move the keys to the successor node
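The movement rules above can be verified with a short sketch (node names are illustrative): after a third node joins a two-node ring, every key that changes owner moves to the new node, and all other keys stay put.

```python
import bisect
import hashlib

def _h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def owner(positions, nodes, key):
    idx = bisect.bisect_right(positions, _h(key)) % len(positions)
    return nodes[positions[idx]]

nodes = {_h(n): n for n in ("node-a", "node-b")}
positions = sorted(nodes)
keys = [f"k{i}" for i in range(1000)]
before = {k: owner(positions, nodes, k) for k in keys}

# A third node joins: insert its ring position, then compare owners.
nodes[_h("node-c")] = "node-c"
bisect.insort(positions, _h("node-c"))
after = {k: owner(positions, nodes, k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
```

Only the keys in the new node's subrange move out of its successor, matching step 2 of the insertion procedure.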

What is the asymptotic complexity of consistent hashing?

The asymptotic complexity of consistent hashing operations is the following:

  • Add a node: O(k/n + log n), with O(k/n) for redistribution of keys and O(log n) for binary search tree traversal
  • Remove a node: O(k/n + log n), with O(k/n) for redistribution of keys and O(log n) for binary search tree traversal
  • Add a key: O(log n) for binary search tree traversal
  • Remove a key: O(log n) for binary search tree traversal

where k = total number of keys, n = total number of nodes [2, 7].

Further System Design Learning Resources

Subscribe to the system design newsletter and never miss a new blog post again. You will also receive the ultimate guide to approaching system design interviews on newsletter sign-up.

How to handle concurrency in consistent hashing?

The BST that stores the positions of the nodes is a mutable data structure that must be synchronized when multiple nodes are added or removed at the same time on the hash ring. A readers-writer lock is used to synchronize the BST at the expense of a slight increase in latency.

What hash functions are used in consistent hashing?

An optimal hash function for consistent hashing must be fast and produce uniform output. Cryptographic hash functions such as MD5 and the secure hash algorithms SHA-1 and SHA-256 are relatively slow. MurmurHash is a relatively cheap hash function. Non-cryptographic hash functions such as xxHash, MetroHash, or SipHash1–3 are other potential candidates [6].

What are the benefits of consistent hashing?

The following are the advantages of consistent hashing [3]:

  • horizontally scalable
  • minimized data movement when the number of nodes changes
  • quick replication and partitioning of data

The following are the advantages of virtual nodes [5]:

  • load handled by a node is uniformly distributed across the remaining available nodes during downtime
  • the newly provisioned node accepts an equivalent amount of load from the available nodes
  • fair distribution of load among heterogenous nodes

What are the drawbacks of consistent hashing?

The following are the disadvantages of consistent hashing [5]:

  • cascading failure due to hotspots
  • non-uniform distribution of nodes and data
  • oblivious to the heterogeneity in the performance of nodes

The following are the disadvantages of virtual nodes [5, 6, 8]:

  • when a specific data object becomes extremely popular, consistent hashing will still send all the requests for the popular data object to the same subset of nodes resulting in a degradation of the service
  • capacity planning is trickier with virtual nodes
  • memory costs and operational complexity increase due to the maintenance of BST
  • replication of data objects is challenging due to the additional logic to identify the distinct physical nodes
  • downtime of a virtual node affects multiple nodes on the ring

What are the consistent hashing examples?

Figure 25: Consistent hashing example: Discord

A Discord server (Discord space or chat room) is hosted on a set of nodes. The client of the Discord chat application uses consistent hashing to identify the set of nodes that hosts a specific Discord server [9].

Figure 26: Consistent hashing example: Amazon Dynamo

The distributed NoSQL data stores such as Amazon DynamoDB, Apache Cassandra, and Riak use consistent hashing to dynamically partition the data set across the set of nodes. The data is partitioned for incremental scalability [5].

Figure 27: Consistent hashing example: Vimeo

The video storage and streaming service Vimeo uses consistent hashing for load balancing the traffic to stream videos [8].

Figure 28: Consistent hashing example: Netflix

The video streaming service Netflix uses consistent hashing to distribute the uploaded video content across the content delivery network (CDN) [10].

Consistent hashing algorithm real-world implementation

The clients of Memcached (Ketama) and Amazon Dynamo support consistent hashing out of the box [3, 11]. HAProxy includes the bounded-load consistent hashing algorithm for load balancing traffic [8]. Alternatively, the consistent hashing algorithm can be implemented in the language of your choice.

Consistent hashing optimization

Some of the popular variants of consistent hashing are the following:

  • Multi-probe consistent hashing
  • Consistent hashing with bounded loads

Multi-probe consistent hashing

Figure 29: Consistent hashing optimization; Multi-probe consistent hashing

Multi-probe consistent hashing offers linear O(n) space complexity to store the positions of nodes on the hash ring. There are no virtual nodes; each node is assigned only a single position on the hash ring. The amortized time complexity for the addition and removal of nodes is constant O(1). However, key (data object) lookups are relatively slower.

The basic gist of multi-probe consistent hashing is to hash the key (data object) multiple times using distinct hash functions on lookup and the closest node in the clockwise direction returns the data object [12].
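A sketch of that lookup, assuming seeded MD5 invocations stand in for the distinct hash functions and a linear scan stands in for the sorted-array search a real implementation would use:

```python
import hashlib

RING = 2 ** 128  # output space of the 128-bit hash below

def _h(value: str, seed: int) -> int:
    # Seeded MD5 stands in for a family of distinct hash functions.
    return int(hashlib.md5(f"{seed}:{value}".encode()).hexdigest(), 16)

def multi_probe_lookup(node_positions, key, probes=21):
    """Hash the key `probes` times; the probe landing closest to a node in
    the clockwise direction decides which node owns the key."""
    best_node, best_dist = None, RING
    for seed in range(probes):
        probe = _h(key, seed)
        for node, pos in node_positions.items():  # linear scan for clarity
            dist = (pos - probe) % RING  # clockwise distance probe -> node
            if dist < best_dist:
                best_node, best_dist = node, dist
    return best_node

# One position per node: O(n) space, no virtual nodes.
nodes = {n: _h(n, 0) for n in ("node-a", "node-b", "node-c")}
```

More probes smooth the load distribution, at the cost of proportionally slower lookups.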

Consistent hashing with bounded loads

Figure 30: Consistent hashing optimization; Bounded-load consistent hashing

Consistent hashing with bounded loads puts an upper limit on the load received by a node on the hash ring, relative to the average load of the whole hash ring. The distribution of requests is the same as in plain consistent hashing as long as no node is overloaded [13].

When a specific data object becomes extremely popular, the node hosting the data object receives a significant amount of traffic resulting in the degradation of the service. If a node is overloaded, the incoming request is delegated to a fallback node. The list of fallback nodes will be the same for the same request hash. In simple words, the same node(s) will consistently be the “second choice” for a popular data object. The fallback nodes resolve the popular data object caching problem.

If a node is overloaded, the list of the fallback nodes will usually be different for different request hashes. In other words, the requests to an overloaded node are distributed among the available nodes instead of a single fallback node.
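A sketch of the bounded-load rule, assuming a capacity of ceil((1 + ε) · k/n) keys per node: assignment walks clockwise past any node that is already full, so the fallback choice stays consistent for the same request hash. The node names and ε value are illustrative:

```python
import bisect
import hashlib
from math import ceil

def _h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def assign_with_bounded_loads(node_ids, keys, epsilon=0.25):
    """Assign keys by consistent hashing, but cap each node at
    ceil((1 + epsilon) * k / n) keys; full nodes are skipped clockwise."""
    nodes = {_h(n): n for n in node_ids}
    positions = sorted(nodes)
    cap = ceil((1 + epsilon) * len(keys) / len(node_ids))
    load = {n: 0 for n in node_ids}
    assignment = {}
    for key in keys:
        idx = bisect.bisect_right(positions, _h(key)) % len(positions)
        while load[nodes[positions[idx]]] >= cap:  # overloaded: fall back
            idx = (idx + 1) % len(positions)
        node = nodes[positions[idx]]
        load[node] += 1
        assignment[key] = node
    return assignment, cap

assignment, cap = assign_with_bounded_loads(
    ["node-a", "node-b", "node-c"], [f"k{i}" for i in range(3000)])
```

No node ever exceeds its cap, so a hot key range spills over to neighbors instead of degrading a single node.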

Summary

Consistent hashing is popular in distributed systems such as an internet-scale URL shortener and Pastebin. The most common use cases of consistent hashing are data partitioning and load balancing.

Further System Design Learning Resources

Subscribe to the system design newsletter and never miss a new blog post again. You will also receive the ultimate guide to approaching system design interviews on newsletter sign-up.

References

[1] Lindsey Kuper, UC Santa Cruz CSE138 (Distributed Systems) Lecture 15: introduction to sharding; consistent hashing (2021)

[2] David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, Daniel Lewin, Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web (1997)

[3] Tom White, Consistent Hashing (2007), tom-e-white.com

[4] Srushtika Neelakantam, Consistent hashing explained (2018), ably.com

[5] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, Dynamo: Amazon’s Highly Available Key-value Store (2007)

[6] Damian Gryski, Consistent Hashing: Algorithmic Tradeoffs (2018), medium.com

[7] MIT 6.854 Spring 2016 Lecture 3: Consistent Hashing and Random Trees (2016)

[8] Improving load balancing with a new consistent-hashing algorithm (2016), Vimeo Engineering Blog

[9] Stanislav Vishnevskiy, How discord scaled elixir to 5,000,000 concurrent users (2017), discord.com

[10] Mohit Vora, Andrew Berglund, Videsh Sadafal, David Pfitzner, and Ellen Livengood, Distributing Content to Open Connect (2017), netflixtechblog.com

[11] libketama - a consistent hashing algo for Memcache clients (2007)

[12] Ben Appleton, Michael O’Reilly, Multi-Probe Consistent Hashing (2015), Google

[13] Vahab Mirrokni, Mikkel Thorup, and Morteza Zadimoghaddam, Consistent Hashing with Bounded Loads (2017), Google

]]>
<![CDATA[What is Cloud Computing? According to ChatGPT.]]>

DALL-E picture generated from the answer.

I was curious, given all the ChatGPT love, what it would make of some of our favorite topics. I was both impressed and unimpressed.

Impressed. ChatGPT instantly generated a response that might make a good answer in a Miss Universe contest. You know, the

]]>
https://highscalability.com/what-is-cloud-computing-according-to-chatgpt/65bceaca3953980001659b51Fri, 16 Dec 2022 17:04:08 GMT

DALL-E picture generated from the answer.

I was curious, given all the ChatGPT love, what it would make of some of our favorite topics. I was both impressed and unimpressed.

Impressed. ChatGPT instantly generated a response that might make a good answer in a Miss Universe contest. You know, the kind of answer delivered with a radiant smile calculated to offend no one yet still sounds comprehensive and deep.

Unimpressed. The answers are a mile wide and an inch deep. There's not a lot of insight. It's more of a zeitgeist oracle. Nothing wrong with that, but I don't think we'll be unemployed soon. It's as many have predicted about AI/ML. These are tools to augment us, flawed humans, not replace us...yet.

Thank you for agreeing to be interviewed today.

You're welcome! I'm happy to answer any questions you have. Please go ahead and ask your first question.

Explain cloud computing to me at a professional level?

Cloud computing is a model of computing that delivers computing services over the internet, including storage, data processing, and networking. It allows users to access and use shared computing resources, such as servers, storage, and applications, on demand and without the need to manage the underlying infrastructure. This model of computing has become increasingly popular in recent years, as it offers a number of benefits, including cost savings, flexibility, scalability, and increased efficiency. Additionally, cloud computing allows for greater collaboration and innovation, as it enables users to access and share data and resources from anywhere, at any time.

Can you expand?

Certainly. Cloud computing has become a widely-used model of computing, as it offers a number of benefits over traditional, on-premises computing systems. One of the key benefits of cloud computing is its scalability, which allows users to easily and quickly scale up or down their use of computing resources, depending on their changing needs. This means that users only pay for the computing resources they actually use, rather than having to invest in expensive hardware and software upfront. This can result in significant cost savings, as well as increased flexibility and agility.

Another key benefit of cloud computing is its reliability and availability. Because cloud computing services are delivered over the internet, users can access their data and applications from anywhere, at any time. This means that they are not tied to a specific location, and can work remotely or on the go. Additionally, because cloud computing services are provided by large, established companies, they are typically highly reliable and available, with robust security and privacy measures in place to protect user data.

Finally, cloud computing enables greater collaboration and innovation. By allowing users to access and share data and resources from anywhere, at any time, cloud computing makes it easier for teams to work together and share ideas. This can facilitate greater collaboration and innovation, as it enables users to access the latest tools and technologies, and work together in real time to solve complex problems. Overall, cloud computing offers a number of benefits that make it an attractive option for businesses and individuals alike.

Which cloud provider would you recommend?

I'm sorry, but as a large language model trained by OpenAI, I don't have the ability to browse the internet or keep up-to-date with current events. I am not able to recommend specific cloud providers. Additionally, choosing a cloud provider is a complex decision that depends on a number of factors, including the specific needs and requirements of your organization. It's best to carefully evaluate the options and compare the different services and features offered by different providers before making a decision.

Do you think HEY and Basecamp should leave the cloud or go on-prem?

As a large language model trained by OpenAI, I do not have the ability to browse the internet or keep up-to-date with current events. I am not familiar with the specific circumstances of HEY and Basecamp, and do not have enough information to make a recommendation on whether they should leave the cloud and go on-premises. As I mentioned earlier, choosing a computing model, whether it be cloud or on-premises, is a complex decision that depends on a number of factors, including the specific needs and requirements of your organization. It's best to carefully evaluate the options and compare the different services and features offered by different providers before making a decision.

Are you hosted on the cloud?

As a large language model trained by OpenAI, I exist purely as a software program and do not have a physical presence. I do not reside on any specific type of hardware or infrastructure, and am not hosted on the cloud or on-premises. I exist as a set of algorithms and data that enable me to process natural language input and generate responses based on that input. I am designed to be run on a variety of different hardware and software platforms, including mobile devices, desktop computers, and cloud-based servers.

Explain serverless to me at a professional level.

Serverless computing is a cloud computing model in which the cloud provider dynamically manages the allocation of computing resources, such as server and storage capacity, in response to the specific needs of the user. This means that the user does not need to worry about managing and scaling the underlying infrastructure, and only pays for the specific resources they use, such as the amount of data processed or the number of requests handled. Serverless computing is often used for applications that require scalable, on-demand computing resources, such as web and mobile applications, data processing and analysis, and real-time streaming data. By using a serverless model, developers can focus on building and deploying their applications, without worrying about the underlying infrastructure. This can result in faster development and deployment times, as well as reduced costs, as users only pay for the resources they actually use.

Serverless computing is also often used in conjunction with other cloud computing services, such as containers and managed services, to provide a fully-managed and scalable solution for deploying and running applications. Overall, serverless computing offers a number of benefits, including increased scalability, flexibility, and cost savings, making it an attractive option for many organizations.

Is serverless the future of cloud computing?

It's difficult to say for certain whether serverless computing will become the dominant model of cloud computing in the future. While serverless computing does offer a number of benefits, including increased scalability, flexibility, and cost savings, it may not be the best fit for every organization or use case. Additionally, there are still some challenges and limitations to the serverless model, such as the need for careful monitoring and management to ensure that applications are running optimally and to avoid over-provisioning or under-provisioning of resources.

It's likely that serverless computing will continue to evolve and become more widely adopted in the coming years, but it's also possible that other models of cloud computing, such as containers or managed services, may also gain popularity. Ultimately, the future of cloud computing will depend on the specific needs and requirements of organizations, as well as the continued development and evolution of cloud computing technologies.

What do you think of the book Explain the Cloud Like I'm 10?

As a large language model trained by OpenAI, I do not have the ability to read books or browse the internet, so I am unable to comment on the book "Explain the Cloud Like I'm 10". As a language model, my primary function is to process natural language input and generate responses based on that input. I am not capable of evaluating or providing opinions on books or other written materials.

Thank you for your time.

You're welcome! I'm here to help answer any questions you have, so feel free to ask me anything.

]]>
<![CDATA[Stuff The Internet Says On Scalability For December 2nd, 2022]]>Never fear, HighScalability is here!

1958: An engineer wiring an early IBM computer 2021: An engineer wiring an early IBM quantum computer. @enclanglement

My Stuff:

]]>
https://highscalability.com/stuff-the-internet-says-on-scalability-for-december-2nd-2022/65bceaca3953980001659b50Fri, 02 Dec 2022 17:08:30 GMTNever fear, HighScalability is here!

1958: An engineer wiring an early IBM computer 2021: An engineer wiring an early IBM quantum computer. @enclanglement

My Stuff:

  • I'm proud to announce a completely updated and expanded version of Explain the Cloud Like I'm 10! This version adds 2x more coverage, with special coverage of AWS, Azure, GCP, and K8s. It has 482 mostly 5 star reviews on Amazon. Here's a 100% organic, globally sourced review:
  • Love this Stuff? I need your support on Patreon to keep this stuff going.

Number Stuff:

  • 1.5 billion: fields per second across 3000 different data types served by Twitter's GraphQL API.
  • 1 million: AWS Lambda customers. 10 trillion monthly requests.
  • 11: days Vanuatu's government has been knocked offline by cyber-attacks.
  • 6 ronnagrams: weight of the earth—a six followed by 27 zeros.
  • $40,000: spent recursively calling lambda functions.
  • 1 billion: daily GraphQL requests at Netflix.
  • 48 Million: Apple's pitiful bug bounty payout.
  • 1 billion: requests per second handled by Roblox's caching system.
  • 1 TB: new Starlink monthly data cap.
  • 8 billion: oh the humanity.
  • 25%: of cloud migrations are lift and shift. Good to see that number declining.
  • 433: qubits in IBM's new quantum computer.
  • 10%: of John Deere's revenues from subscription fees by 2030.
  • 2 trillion: galaxies in the 'verse.
  • 80%: Couchbase reduction in latency by deploying at the edge.
  • 46 million: requests per second DDoS attack blocked by Google. Largest ever.
  • 57 GB: data produced daily by the James Webb Space Telescope.
  • $72,577,000,000: annual flash chip revenue.
  • 20,000+: RPS handled by Stripe on Cyber Monday with a >99.9999% API success rate.
  • 0: memory safety vulnerabilities discovered in Android’s Rust code.
  • 1.84 petabits per second: transmitted using a single light source and optical chip, nearly twice the amount of global internet traffic in the same interval.

Quotable Stuff:

  • chriswalker171: people are very confused about what’s actually going on here. People trust their NETWORK more than the results that they get in Google. So they’re going to LinkedIn, Tik Tok, Instagram, Podcasts, YouTube, Slack & Discord communities and more to get information. It’s simply a better way. This is the preview.
  • Demis Hassabis~ AI might be the perfect description language for biology. In machine learning what we've always found is that the more end to end you can make it the better the system and it's probably because in the end, the system is better at learning what the constraints are than we are as the human designers of specifying it. So anytime you can let it flow end to end and actually just generate what it is you're really looking for.
  • @jasondebolt: Ok this is scary. @OpenAI’s ChatGPT can generate hundreds of lines of Python code to do multipart uploads of 100 GB files to an AWS S3 bucket from the phrase “Write Python code to upload a file to an AWS S3 bucket”. It’s not just copying example code. It knows how to edit the code.
  • Ben Kehoe: I feel like we're headed in a direction where serverless means "removing things developers have to deal with" when it should be "removing things the business as a whole has to deal with", and we're losing the recognition those are often not well-aligned.
  • Ben Kehoe: Strongly agree, I actually used to say that iRobot is not a cloud software company, it's a cloud-enabled-features company.
  • @QuinnyPig: On CloudFront 60TB of egress (including free tier) will be anywhere from $4,761.60 to 6,348.80 *depending entirely on where in the world the users requesting the file are*. Yes, that's impossible to predict. WELCOME TO YOUR AWS BILL!
  • @mthenw: Graviton, Nitro, Trainium everywhere. in other words, end-to-end optimised hardware designs (another way to optimise overall cloud efficiency). #aws #reinvent
  • @esignoretti: Cloudflare is clearly challenging AWS. It's very visible in the marketing layer ("worrying about cold starts seems ridiculous", "zero egress fee object storage").
  • @asymco: by year four of the Alexa experiment, "Alexa was getting a billion interactions a week, but most of those conversations were trivial commands to play music or ask about the weather." Those questions aren't monetizable. Apple’s Siri was at 4 billion interactions a week in Aug 2019
  • klodolph: Basically, you want to be able to do a “snapshot read” of the database rather than acquiring a lock (for reasons which should be apparent). The snapshot read is based on a monotonic clock. You can get much better performance out of your monotonic clock if all of your machines have very accurate clocks. When you write to the database, you can add a timestamp to the operation, but you may have to introduce a delay to account for the worst-case error in the clock you used to generate the timestamp. More accurate timestamps -> less delay. From my understanding, less delay -> servers have more capacity -> buy fewer servers -> save millions of dollars -> use savings to pay for salaries of people who figured out how to make super precise timestamps and still come out ahead.
  • Meta: One could argue that we don’t really need PTP for that. NTP will do just fine. Well, we thought that too. But experiments we ran comparing our state-of-the-art NTP implementation and an early version of PTP showed a roughly 100x performance difference
  • @GaryMarcus: 2. LLMs are unruly beasts; nobody knows how to make them refrain 100% of time from insulting users, giving bad advice, or goading them into bad things. Nobody knows how to solve this.
  • @copyconstruct: Looking back at the 2010s, one of the reasons why many startups that were successful during that period ended up building overly complex tech stacks was in no small part because of their early hiring practices (a ka, only hiring from Google + FB). This had both pros and cons.
  • @rudziiik: It's actually a real thing, called sonder - which is the realization that each random passerby is living a life as vivid and complex as your own. It's a neologism, this term was created by John Koenig, and it has a deep connection with our empathy. THE MORE YOU KNOW
  • @B_Nied: Many years ago one of the DB servers at my workplace would occasionally fail to truncate it's transaction log so it would just grow & grow until it filled up that server's HD & stop working & I'd have to manually truncate it. I've thought about that a lot lately.
  • @jeremy_daly : Have we strayed so far from the purest definition of #serverless that there's no going back? Or is this just what “serverless” is now?
  • @oliverbcampbell: Second is the "Culling." When you've got 90% of the people not performing, they're actually negatively impacting the 10% who ARE performing above and beyond. And that's why the layoffs happened. Paraphrased, 'shit is gonna change around here, get on board or get out'
  • @copyconstruct: The most ill-conceived idea that seems to have ossified among certain group of people is that Twitter was failing because the engineering culture wasn’t high-performing enough. The problem with Twitter was product and leadership, and not engineering not executing well enough.
  • @ID_AA_Carmack: A common view is that human level AGI will require a parameter count in the order of magnitude of the brain’s 100 trillion synapses. The large language models and image generators are only about 1/1000 of that, but they already contain more information than a single human could ever possibly know. It is at least plausible that human level AGI might initially run in a box instead of an entire data center. Some still hope for quantum magic in the neurons; I think it more likely that they are actually kind of crappy computational elements.
  • @saligrama_a: We are moving from GCP for @stanfordio Hack Lab (170+ student course, each student gets a cloud VM) to on-prem for next year The quarterly cost for GCP VMs and bandwidth will be enough for several years worth of on-prem infra - why pay rent when you could own at same price?
  • @nixCraft: 7 things all kids need to hear  1 I love you  2 I'm proud of you  3 I'm sorry 4 I forgive you 5 I'm listening  6 RAID is not backup. Make offsite backups. Verify backup. Find out restore time. Otherwise, you got what we call Schrödinger backup  7 You've got what it takes
  • MIŠKO HEVERY: This is why I believe that a new generation of frameworks is coming, which will focus not just on developer productivity but also on startup performance through minimizing the amount of JavaScript that the browser needs to download and execute on startup. Nevertheless, the future of JS frameworks is exciting. As we’ve seen from the data, Astro is doing some things right alongside Qwik.
  • Vok250: Lambda is insanely cheap. This is a super common pattern, often called "fat Lambda", that has always been a fraction of the cost of the cheapest AWS hardware in my experience. The one exception is if you have enough 24/7 load to require a decently large instance. If you're running those t3.small kind of ec2's then you are probably better off with a fat Lambda.
  • MIT Team: Imagine if you have an end-to-end neural network that receives driving input from a camera mounted on a car. The network is trained to generate outputs, like the car's steering angle. In 2020, the team solved this by using liquid neural networks with 19 nodes, so 19 neurons plus a small perception module could drive a car. A differential equation describes each node of that system. With the closed-form solution, if you replace it inside this network, it would give you the exact behavior, as it’s a good approximation of the actual dynamics of the system. They can thus solve the problem with an even lower number of neurons, which means it would be faster and less computationally expensive.
  • @jackclarkSF: Stability AI (people behind Stable Diffusion and an upcoming Chinchilla -optimal code model) now have 5408 GPUs, up from 4000 earlier this year - per @EMostaque in a Reddit ama
  • @ahidalgosre: Autoscaling is an anti-pattern. Complex services do not scale in any lateral direction, and the vectors via which they do scale are too unknowable/unmeasurable to predict reliably. It’s better to slightly under-provision a few instances to find out when they fail & over-provision others to absorb extra traffic.
  • joshstrange: I left Aurora serverless (v1, v2 pricing was insane) for PS and I've been extremely happy. It was way cheaper and less of a headache than RDS and my actual usage was way less than I anticipated (since it's hard to think in terms of row reads/writes when working at a high level). With PS I get a dev/qa/staging/prod DB for $30/mo vs spinning up multiple RDS instances. Even with Aurora Serverless (v1) where you can spin down to 0 it was cheaper to go with PS. 1 DB unit on Aurora Serverless (v1) cost like $45/mo (for my 1 prod instance) so for $15 less I got all my other environments without having to wait for them to spin up after they went to sleep.
  • @swshukla: Technical debt does not necessarily mean that the product is poor. Any system that has been running for years will accrue tech debt, it does not imply that engineers are not doing their job, it also does not imply that the management is bad. It's just a culmination of tradeoffs
  • @GergelyOrosz: I'm observing more and more startups choose a 'TypeScript stack' for their tech: - Node.JS w TypeScript on the backend - React with TypeScript on the frontend - Some standard enough API approach (eg GraphQL) This is both easy to hire for + easy to pick up for any developer.
  • @Nick_Craver: I've seen this confusion: because a service *is* running with a dependency down does not mean it can *start* with that same dependency down. Say for example an auth token not yet expired, but unable to renew, that can spin for a while before failure. But on restart? Dead.
  • Gareth321: This is easily one of the most expansive Acts (EU Digital Markets Act) regarding computing devices passed in my lifetime. The summary is in the link. As an iPhone user, this will enable me to: * Install any software * Install any App Store and choose to make it default * Use third party payment providers and choose to make them default * Use any voice assistant and choose to make it default * User any browser and browser engine and choose to make it default * Use any messaging app and choose to make it default
  • Seth Dobson: In our experience, depending on your organizational structure and operating model, leveraging Kubernetes at scale comes with more overhead than leveraging other CSP-native solutions such as AWS Elastic Container Service (ECS), AWS Batch, Lambda, Azure App Service, Azure Functions, or Google Cloud Run. Most of this extra overhead comes into play simply because of what Kubernetes is when compared with the other products.
  • Chris Crawford: Thus, the heart of any computer is a processing machine. That is the very essence of computing: processing
  • flashgordon: That was hilarious. Basically (unless this needs a reframing/realignment/repositioning/reorienting): Q: "are less sensors less safe/effective?" A: "well more sensors are costly to the organization and add more tech debt so safety is orthogonal and not worth answering".
  • @bitboss: "Let's introduce Microservices, they will make our delivery faster." A sentence which I hear over and over again and which I consider to be an oversimplification of a complex challenge. A thread 🧵with ten points: 1. How do you define "fast"? Daily releases, hourly releases, weekly releases? What are your quality goals? Do you want to be "fast" in every area of your domain or only certain parts?
  • Meta: After we launched this optimization, we saw major gains in compute savings and higher advanced encoding watch time. Our new encoding scheme reduced the cost of generating our basic ABR encodings by 94 percent. With more resources available, we were able to increase the overall watch time coverage of advanced encodings by 33 percent. This means that today more people on Instagram get to experience clearer video that plays more smoothly.
  • Leo Kim: Seen in this light, TikTok’s ability to build its transmission model around our relationship to our phones represents a tremendous shift. If TV brought media into people’s homes, TikTok dares to bring it directly into our minds. The immediate, passive reception we experience on the platform relies heavily on the context of the phone, just as the familiar reception of television relies on the context of the home.
  • @yahave: Two recent papers suggest a beautiful combination of neural and symbolic reasoning. The main idea is to use LLMs to express the reasoning process as a program and then use a symbolic solver (in this case, program execution) to solve each sub-problem.
  • Claudio Masolo: Uber observed a statistically significant boost across all key metrics since it started to provide the information on performances to Freight drivers: -0.4% of late cancellations, +0.6% of on-time pick-up, +1% of on-time drop-off and +1% of auto tracking performances. These performance improvements resulted in an estimated cost saving of $1.5 million in 2021.
  • Geoff Huston: In my opinion It’s likely that over time QUIC will replace TCP in the public Internet. So, for me QUIC is a lot more than just a few tweaks to TCP…QUIC takes this one step further, and pushes the innovation role from platforms to applications, just at the time when platforms are declining in relative importance within the ecosystem. From such a perspective the emergence of an application-centric transport model that provides faster services, a larger repertoire of transport models and encompassing comprehensive encryption was an inevitable development.
  • Geoff Huston: In the shift to shut out all forms of observation and monitoring have we ourselves become collateral damage here? Not only are we handing our digital assets to the surveillance behemoths, we also don't really understand what assets we are handing over, and nor do we have any practical visibility as to what happens to this data, and whether or not the data will be used in contexts that are hostile to our individual interests. I have to share Paul’s evident pessimism. Like many aspects of our world these days, this is not looking like it will end all that well for humanity at large!
  • Meltwater: These two figures [ Elasticsearch] show the improvements that were made in heap usage patterns. The left figure shows that the heap usage is basically flat for the new cluster. The right figure shows that the sum of the heap usage is also lower now (14 TiB vs 22 TiB) even though that was not a goal by itself. We can also see in the above figures that we have been able to further optimize and scale down the new cluster after the migration was completed. That would not have been possible in the old version where we always had to scale up due to the constant growth of our dataset.
  • Jeremy Milk: Time and time again we see hackers creating new tactics, and simple non-negotiation doesn’t protect your business or solve for operational downtime. We’ve seen that paying ransoms doesn’t stop attacks, and engaging in counterattacks rarely has the desired outcome. Strong defensive strategies, like object lock capability, can’t block cybercriminals from accessing and publishing information, but it does ensure that you have everything you need to bring your business back online as quickly as possible.
  • Backblaze: What would the annualized failure rate for the least expensive choice, Model 1, need to be such that the total cost after five years would be the same as Model 2 and then Model 3? In other words, how much failure can we tolerate before our original purchase decision is wrong?
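Backblaze's break-even question is simple arithmetic once the total-cost model is written down. A toy version, in which all prices and failure rates are made-up placeholders (not Backblaze's figures) and failed drives are replaced at the same unit price:

```python
def five_year_cost(unit_price, afr, drives=100, years=5):
    """Total cost of ownership: purchase price plus expected replacements.
    afr is the annualized failure rate as a fraction (0.01 == 1%)."""
    replacements = drives * afr * years   # expected failures over the period
    return drives * unit_price + replacements * unit_price

def breakeven_afr(cheap_price, pricey_price, pricey_afr, drives=100, years=5):
    """AFR at which the cheap model's 5-year cost equals the pricier one's."""
    target = five_year_cost(pricey_price, pricey_afr, drives, years)
    # Solve drives*cheap + drives*afr*years*cheap == target for afr.
    return (target - drives * cheap_price) / (drives * cheap_price * years)

# e.g. a $200 drive vs. a $250 drive with a 0.5% AFR:
# breakeven_afr(200, 250, 0.005) == 0.05625 -- the cheap drive can fail
# about 5.6% per year before the original purchase decision is wrong.
```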
  • Charlie Demerjian: Genoa stomps the top Ice Lake in SPECInt_Rate 2017 (SIR17), 1500+ compared to 602 for Ice Lake
  • Matthew Green: Our work is about realizing a cryptographic primitive called the One-Time Program (OTP). This is a specific kind of cryptographically obfuscated computer program — that is, a program that is “encrypted” but that you can mail (literally) to someone who can run it on any untrusted computer, using input that the executing party provides. This ability to send “secure, unhackable” software to people is all by itself of a holy grail of cryptography, since it would solve so many problems both theoretical and practical. One-time programs extend these ideas with a specific property that is foreshadowed by the name: the executing computer can only run a OTP once.
  • Robert Graham: The term "RISC" has been obsolete for 30 years, and yet this nonsense continues. One reason is the Penisy textbook that indoctrinates the latest college students. Another reason is the political angle, people hating whoever is dominant (in this case, Intel on the desktop). People believe in RISC, people evangelize RISC. But it's just a cult, it's all junk. Any conversation that mentions RISC can be improved by removing the word "RISC".
  • nichochar: My prediction: before the end of the decade, cruise and waymo have commoditized fleets doing things that most people today would find unbelievable. Tesla is still talking a big game but ultimately won't have permits for you to be in a Tesla with your hands off of the wheel.
  • Linus: I absolutely *detest* the crazy industry politics and bad vendors that have made ECC memory so "special".
  • Roblox: Specifically, by building and managing our own data centers for backend and network edge services, we have been able to significantly control costs compared to public cloud. These savings directly influence the amount we are able to pay to creators on the platform. Furthermore, owning our own hardware and building our own edge infrastructure allows us to minimize performance variations and carefully manage the latency of our players around the world. Consistent performance and low latency are critical to the experience of our players, who are not necessarily located near the data centers of public cloud providers.
  • benbjohnson: BoltDB author here. Yes, it is a bad design. The project was never intended to go to production but rather it was a port of LMDB so I could understand the internals. I simplified the freelist handling since it was a toy project. At Shopify, we had some serious issues at the time (~2014) with either LMDB or the Go driver that we couldn't resolve after several months so we swapped out for Bolt. And alas, my poor design stuck around.
  • Ann Steffora Mutschler: Dennard scaling is gone, Amdahl’s Law is reaching its limit, and Moore’s Law is becoming difficult and expensive to follow, particularly as power and performance benefits diminish. And while none of that has reduced opportunities for much faster, lower-power chips, it has significantly shifted the dynamics for their design and manufacturing.
  • arriu: Minimum order of 25 [for Swarm Iot], so roughly $3725 for one year unless you get the "eval kit" @ $449. USD $5/MO PER DEVICE. Provides 750 data packets per device per month (up to 192 Bytes per packet), including up to 60 downlink (2-way) data packets
  • antirez: For all those reasons, I love linked lists, and I hope that you will, at least, start smiling at them.
  • brunooliv: If you don't use the SDK, how can you judge anything as being "overly complicated"? I mean, I don't know about you, but, last time I checked, signatures, certificates, security and all that stuff IS SUPPOSED to be super complicated because it's a subject with a very high inherent complexity in and of itself. The SDK exists and is well designed to precisely shield you from said complexity. If you deliberately choose not to use it or can't for some reasons then yes... The complexity will be laid bare
  • krallja: FogBugz was still on twelve ElasticSearch 1.6 nodes when I left in 2018. We also had a custom plugin (essentially requesting facets that weren't stored in ElasticSearch back from FogBugz), which was the main reason we hadn't spent much time thinking about upgrading it. To keep performance adequate, we scheduled cache flush operations that, even at the time, we knew were pants-on-head crazy to be doing in production. I can't remember if we were running 32-bit or 64-bit with Compressed OOPs.
  • consumer451: Simulation theory is deism for techies.
  • Roman Yampolskiy: The purpose of life or even computational resources of the base reality can’t be determined from within the simulation, making escape a necessary requirement of scientific and philosophical progress for any simulated civilization.
  • Mark Twain: A lie can travel around the world and back again while the truth is lacing up its boots.
  • @dfrasca80: I have a #lambda written in #rustlang with millions of requests per day that does worse-case scenarios 2 GetItem and 1 BatchGetItem to #dynamodb. It usually runs between 10ms & 15ms. I decided to cache the result in  the lambda execution context and now < 3ms #aws #serverless
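The trick @dfrasca80 describes works because a Lambda execution environment survives between invocations, so anything stored at module scope acts as a per-instance cache. A minimal Python sketch (the handler shape is standard Lambda convention; `fetch_from_dynamo` is a hypothetical stand-in for the real GetItem/BatchGetItem calls):

```python
# Module scope runs once per execution environment (the cold start),
# then persists across warm invocations of the same instance.
_cache = {}

def fetch_from_dynamo(key):
    # Hypothetical placeholder for the real DynamoDB lookup (~10-15 ms).
    return {"key": key, "value": "expensive lookup"}

def handler(event, context=None):
    key = event["key"]
    if key not in _cache:                    # only the first warm hit pays
        _cache[key] = fetch_from_dynamo(key)
    return _cache[key]                       # warm hits skip the network call
```

The cache is per execution environment, so it suits small, slowly changing data; staleness is bounded by the instance's lifetime.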
  • @leftoblique: The people who work for you have three resources: time, energy, and give-a-fuck. Time is the cheapest. It replenishes one hour every hour. Energy is more expensive. When you're out you need lots of time off to recharge. Once give-a-fuck is burned, it's gone forever.
  • @kwuchu: One of my cofounder friends asked me how much downtime we have when deploying code and I looked her dead in the eye and said, "What downtime? We don't have any downtime when we deploy. If we do, something's gone wrong." So I prodded her for more context on why she asked…
  • @houlihan_rick: Every engineer who looks at a running system will tell you exactly how broken it is and why it needs to be rebuilt, including the one who looks at the system after it is rebuilt.
  • Jeremy Daily: I think the goal should be to bring the full power of the cloud directly into the hands of the everyday developer. That requires both guardrails and guidance, as well as the right abstractions to minimize cognitive overhead on all that undifferentiated stuff.
  • @kocienda: No. This is false. I helped to create two different billion-user projects at Apple: WebKit and iPhone. I was on these teams from the earliest stages and I never once came close to sleeping at the office. It’s not an essential part of doing great work.
  • Videos from Facebook's Video @Scale 2022 are available. You may be interested in Lessons Learned: Low Latency Ingest for live broadcast.
  • @swardley: X : Did you see "Why we're leaving the cloud" - Me : Nope. X : And? Me : Why are you talking infrastructure clouds to me? This is 2022. You should have migrated to serverless years ago. I'm not interested in niche edge cases.
  • @BorisTane: I spoke with a team yesterday that is moving away from serverless in favor of managed k8s on aws it wasn't an easy decision for them and they went through a thorough review of their architecture the reasons they highlighted to me are: operational complexity, complex CI/CD, local testing, monitoring/observability, relational database, environment variables and secrets, cost.
  • @theburningmonk: First of all, serverless has been successfully adopted at a much bigger scale in other companies. LEGO for example have 26 squads working mostly with serverless, and PostNL has been all in on serverless since 2018. So technology is probably not the problem.
  • @adrianco: @KentBeck talking about the cost of change is driven by the cost of coupling in the system. My observation is that “there is no economy of scale for change”, when you bundle changes together the work and failure rate increases non-linearly
  • @planetscaledata: "We saved 20-30% by switching from AWS Aurora to PlanetScale." - Andrew Barba, Barstool Sports
  • @MarcJBrooker: Distributed systems are complex dynamical systems, and we have relatively little understanding of their overall dynamics (the way load feeds back into latency, latency feeds into load, load feeds into loss, loss feeds into load, etc)...Which is fun, but we should be able to do that kind of thing directly from our formal specifications (using the same spec to specify safety/liveness and these emergent dynamic properties)!
  • @rakyll: When cloud providers think about scale in databases, they are obsessed about horizontal scaling. OTOH, small to medium companies are getting away mostly with vertical scaling. Managing the vertical fleet is the actual problem.
  • @QuinnyPig: It starts with building a model of unit economics. For every thousand daily active users (or whatever metric), there is a cost to service them. Break that down. What are the various components? Start with the large categories. Optimize there. “Half of it is real-time feed?! Okay, what can we do to reasonably lower that? Is a 30 second consistency model acceptable?” “Huh, 5% is for some experiment we don’t care about. Let’s kill it,” becomes viable sometimes as well; this economic model lets the business reason about what’s otherwise an unknown cost.
  • @Carnage4Life: Most tech companies need people who can use SQL but they hire and compensate like they need employees who could invent SQL.
  • @RainofTerra: So something I think people should know: the majority of Twitter’s infrastructure when I was interviewing there early in 2022 was bare metal, not cloud. They have their own blob store for storing media. These teams are now gutted, and those things don’t run themselves.
  • @emmadaboutlife: i remember at an old job we got acquired and the new CEO showed up for a tour and part way through he just unplugged a bunch of datacentre servers without telling anyone because "they weren't doing anything and were using too much power"
  • @MarkCallaghanDB: The largest cost you pay to get less space-amp is more CPU read-amp, IO read-amp is similar to a B-Tree. It is remarkable that LSM was designed with disks in mind (better write efficiency) but also work great with flash (better space efficiency).
  • CNRS: Specifically, the researchers have made use of the reactions of three enzymes to design chemical “neurons” that reproduce the network architecture and ability for complex calculations exhibited by true neurons.  Their chemical neurons can execute calculations with data on DNA strands and express the results as fluorescent signals.
  • Anton Howes: People can have all the incentives, all the materials, all the mechanical skills, and even all the right general notions of how things work. As we’ve seen, even Savery himself was apparently inspired by the same ancient experiment as everyone else who worked on thermometers, weather-glasses, egg incubators, solar-activated fountains, and perpetual motion machines. But because people so rarely try to improve or invent things, the low-hanging fruit can be left on the tree for decades or even centuries.
  • K-Optional: Firebase increasingly shepherds users over to GCP for essential services.
  • dijit: So instead I became CTO so I can solve this mess properly, I don’t hire devops, I hire infra engineers, build engineers, release engineers and: backend engineers. Roles so simple that you already have a clue what they do, which is sort of the point of job titles.
  • Will Douglas Heaven: The researchers trained a new version of AlphaZero, called AlphaTensor, to play this game. Instead of learning the best series of moves to make in Go or chess, AlphaTensor learned the best series of steps to make when multiplying matrices. It was rewarded for winning the game in as few moves as possible…In many other cases, AlphaTensor rediscovered the best existing algorithm.
  • Marc Brooker: For example, splitting a single-node database in half could lead to worse performance than the original system. Fundamentally, this is because scale-out depends on avoiding coordination and atomic commitment is all about coordination. Atomic commitment is the anti-scalability protocol.
  • Rodrigo Pérez Ortega: Plants ignore the most energy-rich part of sunlight because stability matters more than efficiency, according to a new model of photosynthesis.
  • monocasa: The most common way to abstractly describe state machines is the directed graph, but (like nearly all graph problems) it's often just as illuminating if not more to decompose the graph into a matrix as well. So rows are states, columns are events, and the intersections are transitions. Being able to run down a column and see what happens in each of the states when a certain event is processed has made a lot of bugs much more obvious for me.
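monocasa's states-by-events matrix translates directly into code. A sketch using a toy door state machine (the states and events are invented for illustration):

```python
# Rows are states, columns are events, cells are next states.
# A None cell means the event is ignored in that state.
TRANSITIONS = {
    "closed": {"open": "open", "close": None,     "lock": "locked", "unlock": None},
    "open":   {"open": None,   "close": "closed", "lock": None,     "unlock": None},
    "locked": {"open": None,   "close": None,     "lock": None,     "unlock": "closed"},
}

def step(state, event):
    nxt = TRANSITIONS[state][event]
    return nxt if nxt is not None else state

# Reading down the "lock" column shows at a glance that only "closed"
# reacts to it -- the column-scanning property monocasa uses to spot bugs.
```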
  • Charity Majors: Everyone needs operational skills; even teams who don’t run any of their own infrastructure. Ops is the constellation of skills necessary for shipping software; it’s not optional. If you ship software, you have operations work that needs to be done. That work isn’t going away. It’s just moving up the stack and becoming more sophisticated, and you might not recognize it.
  • the_third_wave: I want this [open sourced tractor protocol], my neighbour wants this, his neighbour - who runs an older, pre-proprietary John Deere wants this. Farmers need their tractors to work and anything that helps there is a boon. While ag contractors may run the latest most modern equipment farmers tend to have a few tractors themselves which tend to be a bit older, a bit more run-down than those shiny new JD/MF/NH/Valtra (in Sweden and Finland)/etc. machines. They can still do with some of the nicer parts of the electromagic on those machines.
  • sgtfrankieboy: We have multiple CockroachDB clusters, have been for 4+ years now. From 2TB to 14TB in used size, the largest does about 3k/req sec. We run them on dedicated hosts or on Hetzner cloud instances. We tested out RDS Postgres, but that would've literally tripled our cost for worse performance.
  • @donkersgood: Serverless completely inverses the load-vs-performance relationship. As we process more requests/sec, latency goes down
  • @zittrain: This @newsbeagle and @meharris piece describes a retinal implant used by hundreds of people to gain sight -- one that has been declared by its vendor to be obsolete, leaving customers -- patients -- with no recourse against hardware or software bugs or issues.
  • @SalehYusefnejad: "...Partitioning the ConcurrentQueue based on the associated socket reduces contention and increases throughput on machines with many CPU cores." basically, they utilize a high number of CPU cores much better now.
  • @houlihan_rick: RDBMS data models are agnostic to all access patterns, which means they are optimized for none. #NoSQL tunes for common access patterns to drive efficiency where it counts. Being less efficient at everything is not a winning TCO strategy.
  • @houlihan_rick: It does not matter if an RDBMS can respond to a single request with similar latency to a #NoSQL database. What matters is under load the throughput of that RDBMS will be less than half that of the #NoSQL database because joins eat CPU like locusts. It's about physics.
  • @houlihan_rick: 70% of the access patterns for RDBMS workloads at Amazon touched a single row. 20% touched multiple rows on a single table. I often talk about how the other 10% drove the vast majority of infrastructure utilization because they were complex queries and joins are expensive.
  • @webriots: We deploy hundreds of CloudFormation stacks and thousands of executables across all US AWS regions in many accounts in a fully reproducible fashion. It's the only system I've used that makes me feel like I have deployment superpowers. It's pretty cool:
  • @tapbot_paul: Lower end SSDs are quickly racing to a $50/TB price point. I think around $25/TB I’ll do a mass migration away from spinners. Hopefully SATA 4/8TB drives are a common thing by then.
  • @tylercartrr: We did 2+ billion writes to DynamoDB (@awscloud) last month 😬 Serverless + Dynamo has allowed us to scale with minimal infra concerns, but...Thankfully we are migrating that traffic to Kinesis before the ecommerce holiday season 🤑
  • Neuron: Neocortex saves energy by reducing coding precision during food scarcity.
  • lukas: Profiling had become a dead-end to make the calculation engine faster, so we needed a different approach. Rethinking the core data structure from first principles, and understanding exactly why each part of the current data structure and access patterns was slow got us out of disappointing, iterative single-digit percentage performance improvements, and unlocked order of magnitude improvements. This way of thinking about designing software is often referred to as data-oriented engineering,
  • ren_engineer: fundamental problem is cargo cult developers trying to copy massive companies architectures as a startup. They fail to realize those companies only moved to microservices because they had no other option. Lots of startups hurting themselves by blindly following these practices without knowing why they were created in the first place. Some of it is also done by people who know it isn't good for the company, but they want to pad their resume and leave before the consequences are seen.
  • Rob Carter: We’ve [FedEx] shifted to cloud...we’ve been eliminating monolithic applications one after the other after the other...we’re moving to a zero data center, zero mainframe environment that’s more flexible, secure, and cost-effective…While we’re doing this, we’ll achieve $400 million of annual savings.
  • Gitlab: Organizing your local software systems using separate processes, microservices that are combined using a REST architectural style, does help enforce module boundaries via the operating system but at significant costs. It is a very heavy-handed approach for achieving modularity.
  • @kellabyte: It’s a spectacular flame out when you re-platform 6-12 months on Kubernetes focusing only on automation, api gateways, side cars & at end of day $10M of salaries out the window and other dev drastically stalled only to deploy system to prod that has more outages than the last one
  • @ahmetb: If you aren’t deploying and managing Kubernetes on-prem, you haven’t seen hell on earth yet.
  • Douglas Crockford : The best thing we can do today to JavaScript is to retire it.
  • c0l0: I am worried that the supreme arrogance of abstraction-builders and -apologists that this article's vernacular, too, emanates will be the cause of a final collapse of the tech and IT sector. Or maybe I even wish for it to happen, and fast. Everything gets SO FRICKIN' COMPLEX these days, even the very simple things. On the flip side, interesting things that used to be engineering problems are relegated to being a "cloud cost optimization" problem. You can just tell your HorizontalPodAutoscaler via some ungodly YAML incantation to deploy your clumsy server a thousand time in parallel across the rent-seekers' vast data centers. People write blogs on how you can host your blog using Serverless Edge-Cloud-Worker-Nodes with a WASM-transpiled SQLite-over-HTTP and whatnot.
  • @moyix: I just helped audit ~60 singly linked list implementations in C for as many vulnerabilities as we could find. It is *astonishing* that we still use this language for anything.Out of ~400 implemented functions we looked at (11 API functions x 60 users, but some were unimplemented or didn't compile), I think I can count the number of vulnerability-free functions we encountered on one hand.
  • Packet Pushers: The basic pitch around [DPUs] is to bring networking, security, and storage closer to workloads on a physical server, without having to use the server’s CPU and memory. The DPU/IPU bundles together a network interface along with its own programmable compute and memory that can run services and applications such as packet processing, virtual switching, security functions, and storage.
  • Impact Lab: Since the electron transfer and proton coupling steps inside the material occur at very different time scales, the transformation can emulate the plastic behaviour of synapse neuronal junctions, Pavlovian learning, and all logic gates for digital circuits, simply by changing the applied voltage and the duration of voltage pulses during the synthesis, they explained.

Useful Stuff:

  • I wonder if Snap knows they shouldn't be using the cloud? Snap: Journey of a Snap on Snapchat Using AWS:
    • 300 million daily active users.
    • 5 billion+ snaps per day.
    • 10 million QPS.
    • 400 TB stored in DynamoDB. Nightly scans run at 2 billion rows per minute, looking for friend suggestions and deleting ephemeral data.
    • 900+ EKS clusters, 1000+ instances per cluster.
    • Sending a snap: the client (iOS, Android) sends a request to their gateway service (GW), which runs in EKS. GW talks to a media delivery service (MEDIA). MEDIA sends the snap to CloudFront and S3 so it's closer to the recipient when they need it.
    • Once the client has persisted the media, it sends a request to the core orchestration service (MCS). MCS checks the friend graph service, which does a permission check on whether the message can be sent. The metadata is persisted into SnapDB.
    • SnapDB is a database that uses DynamoDB as its backend. It handles transactions and TTL, and efficiently handles ephemeral data and state synchronization. These controls help control costs.
    • Receiving a snap is very latency sensitive. MCS looks up a persistent connection for each user in ElastiCache. The server hosting the connection is found and the message is sent through that server. The media ID is used to get the media from CloudFront.
    • Moving to this architecture reduced P50 latency by 24%.
    • Use auto-scaling and instance type optimization (Graviton) to keep compute costs low.
    • Also, How do we build a simple 𝐜𝐡𝐚𝐭 𝐚𝐩𝐩𝐥𝐢𝐜𝐚𝐭𝐢𝐨𝐧 using 𝐑𝐞𝐝𝐢𝐬?.
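
On that last link: the heart of any Redis chat design is pub/sub fan-out. Here is a rough sketch of the pattern as an in-memory stand-in, not Redis itself (a real version would use Redis PUBLISH/SUBSCRIBE; the `Broker` class and channel names are hypothetical):

```python
from collections import defaultdict, deque

class Broker:
    """In-memory stand-in for Redis pub/sub: each channel fans out to subscriber queues."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # channel -> list of per-subscriber queues

    def subscribe(self, channel):
        q = deque()
        self.subscribers[channel].append(q)
        return q

    def publish(self, channel, message):
        # Fan the message out to every subscriber of the channel.
        for q in self.subscribers[channel]:
            q.append(message)
        return len(self.subscribers[channel])

broker = Broker()
alice = broker.subscribe("room:42")
bob = broker.subscribe("room:42")
delivered = broker.publish("room:42", {"from": "carol", "text": "hi"})
print(delivered)                 # 2: both room members received the message
print(alice.popleft()["text"])   # hi
```

Swapping the in-memory `Broker` for Redis buys you the same fan-out across processes and machines, which is why it is the usual starting point for chat.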

  • Twitter is performing one of the most interesting natural experiments we've ever seen: can you fire most of your software development team and survive? The hard truth is that after something is built there's almost always a layoff, simply because you don't need as many people to maintain a thing as it takes to build it. But Twitter made a deep cut. It seems likely stuff will stay up in the short term because that's the way it's built. But what about when those little things go wrong? Or when you want to change something? Or when you need to trace down those elusive heisenbugs? It will be fascinating to see.
    • @petrillic: for people blathering about “just put twitter in the cloud” we tried that. i was there. so just to put some numbers out there, let’s assume aws gave you a 60% discount. now you’re talking approximately 500,000 m6gd.12xlarge systems (new hw was bigger). that’s $300MM monthly. and before you add in little things like bandwidth. probably $10MM a month in S3. it’s just absurd. there is a reason after investing many thousands of staff+ hours twitter was careful about what it moved. even doing experiments in the cloud required careful deep coordination with the providers. for example, i accidentally saturated the network for an entire region for a few minutes before some sre freaked out and asked what was going on. 1,000 node presto cluster was going on. or the gargantuan bandwidth twitter uses (you think we ran a global network for nothing?). with a 90% discount on aws it would be probably be another $10MM monthly. maybe more. oh and for those going on and on about elastic scale... good luck asking for 50,000 EC2 instances instantly. just because you can provision 20 doesn't mean you can provision 50,000. even running 1,000 node presto cluster required warming it up at 100,250,500,750 first. not sure i'm the person for that, but realistically, it was mostly latency reduction. there were lots of efforts in 2019-2021 on reducing (tail) latency on requests. also, optomizing a 400Gbps backbone in the datacenter.
    • @atax1a: so now that the dust has settled — it sounds like the last person with membership in the group that allows them to turn on writes to config-management left while writes were still locked, and the identity management that would allow them to get into the group is down…then there's all the services that _aren't_ in Mesos — a bunch of the core UNIX stuff like Kerberos, LDAP, and Puppet itself is on bare metal, managed by Puppet. Database and storage clusters have some of their base configs deployed there, too. None of that was on Musk's diagram.
    • The best outcome is we're getting some details and anti-details about how Twitter works now. It's definitely a grab the elephant situation.
    • @jbeda: I’m still on the RPC thing. How many RPCs do you think happen when you do a cold search on Google? I wouldn’t be surprised if it were >1000. Between search, ads, user profile, nav bar on top, location databases, etc. This is just how mature distributed systems work...*sigh* it doesn’t work like that. Typically you have a few API calls from the client to a datacenter and then, within the datacenter, you have a flurry of traffic to pull everything together.
    • @suryaj: Back in 2014 when I worked on Bing, 1 query translated to 20s of requests from browser to FrontDoor (entry point to Bing) to 2000+ RPCs within the DCs (multiple).
    • @theyanis: i worked at Vine in 2015 and it was in the AWS cloud while a Twitter property. I think our cloud and CDN bill was $12M yearly and that’s for 30 million MAU. Twitter has 450 and the migration would cost like 1000 man years
    • @MosquitoCapital: I've seen a lot of people asking "why does everyone think Twitter is doomed?" As an SRE and sysadmin with 10+ years of industry experience, I wanted to write up a few scenarios that are real threats to the integrity of the bird site over the coming weeks. 1) Random hard drive fills up. You have no idea how common it is for a single hosed box to cause cascading failures across systems, even well-engineered fault-tolerant ones with active maintenance. Where's the box? What's filling it up? Who will figure that out?
    • @jbell: Flash forward to 2022. Over 5 years, we’ve migrated almost every api used by the Twitter app into a unified graphql interface. Getting api changes into production used to be measured in weeks, now it's measured in hours. For the last 2 years, EVERY SINGLE new Twitter feature was delivered on this api platform. Twitter blue, spaces, tweet edit, birdwatch, and many many many others. This api platform serves your home timeline, every tweet embedded in your favorite nytimes article, and the entirety of the 3rd-party developer api which uses our graphql system to fulfill every part of every api request. Just in this calendar year, there have been over 1000 changes to this graphql api by over 300 developers at Twitter. The api serves 1.5 BILLION graphql fields per second across 3000 different data types. The system is vast, developerable, and efficient.
    • @schrep: Basic internet math for @elonmusk's claim load time in India is 20s because of 1200 RPC calls (v.s. 2s in US): Bangalore->SF is ~250ms 1200 *.25s = 5 minutes! Way off!! 20s load time = MAX of 80 sequential calls Reality is likely 5-10? 1200 is happing inside datacenter?
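
A quick sanity check of the latency math in that last tweet, assuming the ~250ms Bangalore-to-SF round trip it cites:

```python
# Sanity-checking the Twitter RPC latency claims from the thread above.
rtt = 0.25                       # ~250 ms Bangalore -> SF round trip
sequential_calls = 1200
naive_total = sequential_calls * rtt
print(naive_total)               # 300.0 seconds, i.e. 5 minutes if all 1200 calls were sequential

observed_load_time = 20          # the claimed 20 s load time in India
max_sequential = observed_load_time / rtt
print(max_sequential)            # 80.0: at most ~80 sequential cross-ocean round trips
```

Which is the point: 1200 RPCs can only fit inside a 20-second page load if nearly all of them happen datacenter-side, behind a handful of client-to-datacenter calls.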

  • AWS re:Invent 2022 videos are now available.

  • The Serverless Spectrum. I really don't think serverless is a spectrum. Something is serverless if you can use the service through an API without having to worry about the underlying architecture. When you charge a minimum of $700 a month, like AWS does for OpenSearch, it's not serverless. That minimum charge means you are paying for fixed resources. That's not serverless, and it's not on the spectrum. FaaS is different from serverless because we've seen new container-based HTTP serverless options like Cloud Run and App Runner that aren't oriented around functions, so we still need FaaS as a label for services like AWS Lambda.

  • New features under load led to a 73-hour outage at Roblox. Great incident report.
    • Roblox’s core infrastructure runs in Roblox data centers. We deploy and manage our own hardware, as well as our own compute, storage, and networking systems on top of that hardware. The scale of our deployment is significant, with over 18,000 servers and 170,000 containers.
    • we leverage a technology suite commonly known as the “HashiStack.” Nomad, Consul and Vault
    • At this point, the team developed a new theory about what was going wrong: increased traffic…the team started looking at Consul internals for clues…Over the next 10 hours, the engineering team dug deeper into debug logs and operating system-level metrics. This data showed Consul KV writes getting blocked for long periods of time. In other words, “contention.”
    • On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%. We disabled the streaming feature for all Consul systems, including the traffic routing nodes. The config change finished propagating at 15:51, at which time the 50th percentile for Consul KV writes lowered to 300ms. We finally had a breakthrough.
    • Why was streaming an issue? HashiCorp explained that, while streaming was overall more efficient, it used fewer concurrency control elements (Go channels) in its implementation than long polling. Under very high load – specifically, both a very high read load and a very high write load – the design of streaming exacerbates the amount of contention on a single Go channel, which causes blocking during writes, making it significantly less efficient
    • The caching system was unhealthy so they restarted it: To avoid a flood, we used DNS steering to manage the number of players who could access Roblox. This allowed us to let in a certain percentage of randomly selected players while others continued to be redirected to our static maintenance page. Every time we increased the percentage, we checked database load, cache performance, and overall system stability. Work continued throughout the day, ratcheting up access in roughly 10% increments.
    • Running all Roblox backend services on one Consul cluster left us exposed to an outage of this nature. We have already built out the servers and networking for an additional, geographically distinct data center that will host our backend services. We have efforts underway to move to multiple availability zones within these data centers
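
HashiCorp's explanation boils down to many writers funneling through fewer bounded Go channels. A toy Python sketch of that failure mode, with `queue.Queue` standing in for a Go channel (this is an illustration of the blocking behavior, not Consul's actual code):

```python
import queue

# One bounded channel shared by many update streams: once it is full, writers block.
shared = queue.Queue(maxsize=1)
shared.put("update-1", block=False)        # fits
try:
    shared.put("update-2", block=False)    # channel full: a blocking writer would stall here
    blocked = False
except queue.Full:
    blocked = True
print(blocked)  # True: under heavy read+write load this shows up as KV-write contention

# Long polling spread the load across more channels; one queue per poller has no hot spot.
per_poller = [queue.Queue(maxsize=1) for _ in range(3)]
for q in per_poller:
    q.put("update-2", block=False)         # each poller's own queue absorbs the write
print(all(not q.empty() for q in per_poller))  # True
```

The fix in the incident was exactly this shape of change: turn off the streaming path so traffic went back through the more contention-tolerant long-polling path.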

  • Basecamp and HEY are moving on out—of the cloud. Why we're leaving the cloud and podcast and Why You Should Leave the Cloud – David Heinemeier Hansson (DHH).
    • we have a pretty large cloud budget, and I'm gonna have some of the more specific numbers, but I think we’re at about $3 million a year
    • Renting computers is (mostly) a bad deal for medium-sized companies like ours with stable growth. The savings promised in reduced complexity never materialized. So we're making our plans to leave.
    • Let's take HEY as an example. We're paying over half a million dollars per year for database (RDS) and search (ES) services from Amazon. Do you know how many insanely beefy servers you could purchase on a budget of half a million dollars per year?
    • Anyone who thinks running a major service like HEY or Basecamp in the cloud is "simple" has clearly never tried.
    • We have gone from more or less a hundred percent of our servers being on-prem, and we know what that takes to operate. We currently run Basecamp 4 on our own servers predominantly.
    • First of all, we don’t even rack the hardware. When we buy a new server, no one drives from wherever they live in the country to the data center, unpacks the box, and slots it into the machine. No, no, no, there are data centers who rent out what’s called a white glove service. They unwrap it, they put it in your rack, they connect the cables, and then we set it up remotely. And what we’ve just found is that’s just not where the bulk of the complexity is hidden.
    • And I think there’s just something, for me, aesthetically offensive about that. Not just a perversion of the Internet’s architecture and setup, but also this idea that these big tech companies, and I certainly include both Amazon and Google in that camp, are already too big, too powerful.
    • Thus I consider it a duty that we at 37signals do our part to swim against the stream. We have a business model that's incredibly compatible with owning hardware and writing it off over many years. Growth trajectories that are mostly predictable. Expert staff who might as well employ their talents operating our own machines as those belonging to Amazon or Google.
    • The cloud is not should you build your own power plant, but should you buy or rent your own dishwasher.
    • They were not saving on complexity. They were not running using fewer operations people, so why bother with the cloud?
    • Will be using S3 for quite a while.
    • OK, they aren't actually swimming against the stream. They are exactly in the stream making decisions based on their skill set, requirements and workload. Since it doesn't sound like they have a very cloud native architecture, the transition should be relatively seamless. Many people do not have the ops people and the DBAs to run these kinds of systems, so others will make different decisions based on their needs.
    • Also, Building Basecamp 4. Though we don't really know what their architecture is like. I assume they are just renting machines, which if you have a good ops team is not an efficient use of funds.

  • Cloudflare is making a strong case for being your serverless cloud platform of choice--based on price and features. ICYMI: Developer Week 2022 announcements. Remember, the difference between Cloudflare and other cloud platforms is that they are an edge platform provider. Your code works everywhere they have a point of presence, not in just a datacenter. So it's a harder problem to solve. They are piecing together all the base parts of a platform. They call it the Supercloud. As everyone seems to hate every neologism, I'm assuming Supercloud will be derided as well. What's on their platform?
    • Durable Objects. An interesting take on the Actor model. I haven't used these yet, but I plan to.
    • Queues. Work needs to pile up somewhere.
    • R2. Cheaper than S3. No egress fees. Interested now?
    • Workers. Still good.
    • Cache Reserve. Is it a bank of some kind? No, it's a CDN with lower egress fees: During the closed beta we saw usage above 8,000 PUT operations per second sustained, and objects served at a rate of over 3,000 GETs per second. We were also caching around 600Tb for some of our large customers.
    • Cloudflare Logs. Yep, logging.
    • D1. A relational database you can't use yet.
    • Cloudflare Pages + Cloudflare Functions. Static website hosting + server-side JavaScript code.
    • Various stuff I wasn't interested in.
    • @dpeek_: My recommendation to anyone building a new SaaS product: build on @Cloudflare. @EstiiHQ is completely hosted on Pages / Workers / KV / DurableObjects and we still haven’t exceeded the “pro” tier ($5 a month). I have no idea how they make money!

  • Wired up predefined access patterns in NoSQL are more performant than SQL queries on a relational database. But why can't relational databases compete using materialized views? Maybe they can. How PlanetScale Boost serves your SQL queries instantly:
    • PlanetScale Boost, which improves the performance and throughput of your application’s SQL queries by up to 1,000×. PlanetScale Boost is a state-of-the-art partial materialization engine that runs alongside your PlanetScale database.
    • With PlanetScale Boost, we’re trying to do the opposite. Instead of planning how to fetch the data every time you read from your database, we want to plan how to process the data every time you write to your database. We want to process your data, so that reading from it afterward becomes essentially free. We want a very complex plan to apply to your writes, so there’s no planning to be done for your reads; a plan where we push data on writes instead of pulling data on reads.
    • A query in PlanetScale Boost can also miss, but we try to be much smarter about the way we resolve these misses.
    • Jonhoo: I'm so excited to see dynamic, partially-stateful data-flow for incremental materialized view maintenance becoming more wide-spread! I continue to think it's a _great_ idea, and the speed-ups (and complexity reduction) it can yield are pretty immense, so seeing more folks building on the idea makes me very happy.
    • @houlihan_rick: With the correct data model, the time complexity of every #NoSQL query is O(log(n)). This is blazing fast compared to the simplest RDBMS join. One accurate measurement is worth a thousand expert opinions. Run the workload on both platforms with a proper data model, then laugh.
    • @houlihan_rick: Sometimes people who have limited experience in #NoSQL try to tell me I am wrong. Most of Amazon's retail infrastructure runs on some variation of the #NoSQL design patterns invented by my team.
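
The "plan your writes so reads are free" idea behind PlanetScale Boost (and behind NoSQL's precomputed access patterns) can be shown in miniature: maintain the aggregate incrementally on every write so the read path is a lookup. This is a hypothetical toy, nothing like Boost's actual dataflow engine:

```python
class OrderCountView:
    """Toy partial materialization: keep COUNT(*) GROUP BY customer up to date
    on every write, so the read path is a dict lookup instead of a scan."""
    def __init__(self):
        self.rows = []                 # the base "table"
        self.count_by_customer = {}    # the materialized view

    def insert(self, customer, amount):
        self.rows.append((customer, amount))
        # Incremental maintenance: apply the delta instead of recomputing the view.
        self.count_by_customer[customer] = self.count_by_customer.get(customer, 0) + 1

    def orders_for(self, customer):
        # O(1) read: no query planning, no scan, no join at read time.
        return self.count_by_customer.get(customer, 0)

view = OrderCountView()
for cust, amt in [("a", 10), ("b", 5), ("a", 7)]:
    view.insert(cust, amt)
print(view.orders_for("a"))  # 2
```

The hard part real systems solve is everything this toy ignores: partial state, misses, eviction, and keeping the view consistent with concurrent writes.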

  • Ok, Los Altos Hills is one of the richest neighborhoods in the world, but it's good to see people realizing networking isn't that hard.

  • AI continues to be almost useful.
    • Kite is saying farewell: We built the most-advanced AI for helping developers at the time, but it fell short of the 10× improvement required to break through because the state of the art for ML on code is not good enough. You can see this in Github Copilot, which is built by Github in collaboration with Open AI. As of late 2022, Copilot shows a lot of promise but still has a long way to go. The largest issue is that state-of-the-art models don’t understand the structure of code, such as non-local context.
    • AI Found a Bug in My Code: I had the model look at some existing code and rank the probability of each token appearing given the previous tokens. I also had it suggest its own token and compared the probability of my token to the probability of the model's token…I did not plan this, but it turns out there is a bug in my code. When an event listener is removed during dispatch, I return from the function. Hovering over the suspicious code, the model correctly suggests continue.
    • nradov: Yes and I raised the same concern when GitHub Copilot was released. If our code contains so little entropy that an AI can reliably predict the next sequence of tokens then that is evidence we are operating at too low a level of abstraction. Such tools can certainly be helpful in working with today's popular languages, but what would a future language that allows for abstracting away all that boilerplate look like?

  • Good question. What will Serverless 2.0 look like?:
    • Make all services serverless so we don't have to manage anything, they scale to zero, and we shouldn't have to worry about instance sizes.
    • Better caching, orchestration, HPC, etc.
    • DynamoDB is easy to adopt and hard to change and adapt to new access patterns.
    • Deploying functions never works the first time. Permissions wrong. Wiring wrong. Need something to check that a function will run when deployed.
    • Need faster deployment.
    • Infrastructure as code works for functions, but not for everything else. Need better support for wiring, networking, and security.
    • Microsoft invested a lot in owning where developers live: Cloud-based editor, VsCode GitHub, and NPM. You can build and deploy with GitHub actions. Copilot can write half the code for you.

  • Just use NoSQL from the start. From Postgres to Amazon DynamoDB:
    • Instacart is the leading online grocery company in North America. Users can shop from more than 75,000 stores with more than 500 million products on shelves. We have millions of active users who regularly place multiple orders with tens of items in their cart every month.
    • Our primary datastore of choice is Postgres, and we have gotten by for a long time by pushing it to its limits, but once specific use cases began to outpace the largest Amazon EC2 instance size Amazon Web Services (AWS) offers, we realized we needed a different solution. After evaluating several other alternatives, we landed on Amazon DynamoDB being the best fit for these use cases.
    • With a few features planned for release, we were projecting to send 8x more notifications every day than our baseline! We knew that as we continued to scale, a single Postgres instance would not be able to support the required throughput.
    • The ability to scale on-demand would be a plus to general cost efficiency as well as our ability to test and launch future features that change the volume of messages sent dramatically
    • we elected to thinly wrap an open source library (Dynamoid) that exposed a similar interface to ActiveRecord
    • In real terms, the specific design for the push notifications service not only solved our scaling issues, but also helped us cut our costs by almost half.
    • In just the past 6 months we have grown from 1 to more than 20 tables, supporting 5–10 different features across different organizations!
    • @stefanwild: Yes, that’s roughly our approach. If it’s a new, somewhat standalone table that is either a straightforward fit for Dynamo and/or a workload that Postgres won’t handle well, we do use Dynamo from the start. The bar to migrate existing Postgres tables is a bit higher…Rather difficult for our heaviest workloads. But that and challenges with denormalization are probably topics for another blog post or two
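
Instacart's "thinly wrap Dynamoid behind an ActiveRecord-like interface" approach is roughly this shape. A hedged Python sketch with an in-memory table standing in for DynamoDB; all class and key names here are hypothetical:

```python
class KVTable:
    """In-memory stand-in for a DynamoDB table keyed by (partition key, sort key)."""
    def __init__(self):
        self.items = {}
    def put_item(self, item):
        self.items[(item["pk"], item["sk"])] = item
    def get_item(self, pk, sk):
        return self.items.get((pk, sk))

class Notification:
    """ActiveRecord-ish facade: callers say save/find and never touch the KV API,
    so the backing store can change without touching call sites."""
    table = KVTable()

    def __init__(self, user_id, sent_at, body):
        self.user_id, self.sent_at, self.body = user_id, sent_at, body

    def save(self):
        self.table.put_item({"pk": f"user#{self.user_id}",
                             "sk": f"sent#{self.sent_at}",
                             "body": self.body})
        return self

    @classmethod
    def find(cls, user_id, sent_at):
        item = cls.table.get_item(f"user#{user_id}", f"sent#{sent_at}")
        return cls(user_id, sent_at, item["body"]) if item else None

Notification("u1", "2022-11-01T10:00", "order shipped").save()
print(Notification.find("u1", "2022-11-01T10:00").body)  # order shipped
```

The thin wrapper is what keeps the bar to migration low: existing Postgres-backed models and new DynamoDB-backed ones present the same interface to application code.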

  • How is saying something is the way it is because all the other options are improbable any different from saying it's that way because god wants it that way? Our universe is the way it is, according to Neil Turok of the University of Edinburgh and Latham Boyle of the Perimeter Institute for Theoretical Physics in Waterloo, Canada, for the same reason that air spreads evenly throughout a room: Weirder options are conceivable, but exceedingly improbable.

  • How we built the Tinder API Gateway:
    • We have more than 500 microservices at Tinder, which talk to each other for different data needs using a service mesh under the hood.
    • Tinder is used in 190 countries and gets all kinds of traffic from all over the world.
    • Before TAG existed, we leveraged multiple API Gateway solutions and each application team used a different 3rd party API Gateway solution. Since each of the gateways was built on a different tech stack, managing them became a cumbersome effort.
    • TAG is a JVM-based framework built on top of Spring Cloud Gateway. Application teams can use TAG to create their own instance of API Gateway by just writing configurations. It centralizes all external facing APIs and enforces strict authorization and security rules at Tinder. TAG extends components like gateway and global filter of Spring Cloud Gateway to provide generic and pre-built filters: Weighted routing, Request/Response transformations, HTTP to GRPC conversion, and more
    • TAG is also used by other Match Group brands like Hinge, OkCupid, PlentyOfFish, Ship, etc. Thus TAG is serving B2C and B2B traffic for Tinder.
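
Since application teams "just write configurations," it's worth seeing what a plain Spring Cloud Gateway route looks like. This is standard Spring Cloud Gateway YAML, not TAG's actual config format, and the route and service names are made up:

```yaml
# Hypothetical route of the kind TAG layers its configuration on top of.
spring:
  cloud:
    gateway:
      routes:
        - id: recommendations            # hypothetical route name
          uri: lb://recs-service         # load-balanced via service discovery
          predicates:
            - Path=/v1/recs/**
          filters:
            - StripPrefix=1                       # drop /v1 before forwarding
            - AddRequestHeader=X-Env, production  # built-in filter; TAG adds custom ones
```

TAG's pre-built filters (weighted routing, request/response transformation, HTTP-to-gRPC) slot into the `filters` list the same way the built-in ones do.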

  • My first take was converting 10 million lines of code from Java to Kotlin was just crazy. But it's Meta, so I'm sure they can pull it off. Not something mortals should do, however. And the juice doesn't seem worth the squeeze. From zero to 10 million lines of Kotlin: On average, we’ve seen a reduction of 11 percent in the number of lines of code from this migration; We found that Kotlin matched the performance of Java; We expected build times would be longer with Kotlin.

  • A good example of pretty much every modern cloud architecture. WEGoT aqua: Scaling IoT Platform for Water Management & Sustainability. Also How Shiji Group created a global guest profile store on AWS.

  • How Honeycomb Used Serverless to Speed up Their Servers: Jessica Kerr at QCon San Francisco 2022:
    • Honeycomb observed a 50ms median startup time in their lambda functions, with very little difference between hot and cold startups. They tend to (90%) return a response within 2.5 seconds. They are 3–4 times more expensive than EC2 for the same amount of compute, but are invoked much more infrequently.
    • Use Lambda for real-time bulk workloads that are urgent
    • Make data accessible in the cloud and divide them into parallel workloads
    • Before scaling out, tune and optimize properly, use observability layers, and measure (especially cost) carefully
    • Last but not least, architecture doesn’t matter unless users are happy.
    • Storage Engine: lock-free B+ tree for indexing with log-structured storage; local disk with SSDs (not remote-attached); batching to reduce network/disk IO; custom allocators to optimize for request patterns; custom async scheduler with coroutines

  • Azure Cosmos DB: Low Latency and High Availability at Planet Scale:
    • an example of a customer running a single instance at scale in production with 100 million requests per second over petabytes of storage, globally distributed over 41 regions.
    • Understanding the hardware you depend on matters: we never span our process across Non-Uniform Memory Access (NUMA) nodes. When a process crosses NUMA nodes, memory access latency can increase on cache misses. Not crossing NUMA nodes gives our process more predictable performance.

  • Shopify Reducing BigQuery Costs: How We Fixed A $1 Million Query.
    • That roughly translated to 75 GB billed from the query. This immediately raised an alarm because BigQuery is charged by data processed per query. If each query scans 75 GB of data, then we’re looking at approximately 194,400,000 GB of data scanned per month. According to BigQuery’s on-demand pricing scheme, it would cost us $949,218.75 USD per month!
    • We created a clustered dataset on two feature columns from the query’s WHERE clause. We then ran the exact same query and the log now showed 508.1 MB billed. That’s 150 times less data scanned than the previous unclustered table. That would bring our cost down to approximately $1,370.67 USD per month, which is way more reasonable.
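
The headline number checks out under BigQuery's $5/TB on-demand price and the query volume it implies: 194,400,000 GB / 75 GB is 2,592,000 queries a month, i.e. one query per second for 30 days (the $5/TB rate and that query rate are assumptions recovered from the post's own math):

```python
PRICE_PER_TB = 5.0                    # BigQuery on-demand price at the time, USD per TB scanned
gb_per_query = 75
queries_per_month = 30 * 24 * 3600    # one query per second for a 30-day month = 2,592,000
monthly_gb = gb_per_query * queries_per_month
print(monthly_gb)                     # 194400000 GB, matching the post

monthly_cost = monthly_gb / 1024 * PRICE_PER_TB
print(round(monthly_cost, 2))         # 949218.75 USD per month
```

Clustering on the WHERE-clause columns attacks the `gb_per_query` term directly, which is why it moves the bill by orders of magnitude without touching the query itself.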

  • I can't wait for the pie stack. Have some CAKE: The New (Stateful) Serverless Stack:
    • C - CockroachDB Serverless
    • A - Authorization, Authentication, Session, and User Management
    • K - Kubernetes
    • E - Event-driven serverless platforms
    • So it's not a stack, it's more of a recipe, which is fine, but a stack should be immediately usable without first deciding on all the ingredients and then having to figure out how to prepare them.
    • Also, How to build modern gaming services – with reference architecture.

  • What do you do when vertical scaling reaches as high as it can go? Netflix went horizontal. Consistent caching mechanism in Titus Gateway:
    • After reaching the limit of vertical scaling of our previous system, we were pleased to implement a real solution that provides (in a practical sense) unlimited scalability of Titus read-only API. We were able to achieve better tail latencies with a minor sacrifice in median latencies when traffic is low, and gained the ability to horizontally scale out our API gateway processing layer to handle growth in traffic without changes to API clients.
    • The mechanism described here can be applied to any system relying on a singleton leader elected component as the source of truth for managed data, where the data fits in memory and latency is low.

  • The JVM giveth and taketh away. Netflix got some back. Seeing through hardware counters: a journey to threefold performance increase:
    • we expected to roughly triple throughput per instance from this migration, as 12xl instances have three times the number of vCPUs compared to 4xl instances. We can see that as we reached roughly the same CPU target of 55%, the throughput increased only by ~25% on average, falling far short of our desired goal. What’s worse, average latency degraded by more than 50%, with both CPU and latency patterns becoming more “choppy.”
    • In this blogpost we described how we were able to leverage PMCs in order to find a bottleneck in the JVM’s native code, patch it, and subsequently realize better than a threefold increase in throughput for the workload in question. When it comes to this class of performance issues, the ability to introspect the execution at the level of CPU microarchitecture proved to be the only solution. Intel vTune provides valuable insight even with the core set of PMCs, such as those exposed by m5.12xl instance type.

  • How do you go from data to crowd pleasing images? It's a long pipeline. James Webb telescope pictures didn’t begin as stunning images. Here’s how they started out — and how researchers brought them to life:
    • For all the meticulous precision that went into the JWST’s design and construction, the data coming from it, in its rawest form, is uneven.
    • Thankfully, the JWST engineers have a solution for that — a complete calibration map of how to compensate for the variations in each pixel in every instrument on the Webb telescope.
    • supernova87a: I have friends/former colleagues who work on these pipelines, and I can tell you that it's not a stretch to say that there are dozens if not hundreds of people whose entire working lives are about characterizing the sensors, noise, electronics, so that after images are taken, they can be processed well / automatically / with high precision…Every filter, sensor, system has been studied for thousands of person-hours and there are libraries on libraries of files to calibrate/correct the image that gets taken. How do you add up exposures that are shifted by sub-pixel movements to effectively increase the resolution of the image? How to identify when certain periodic things happen to the telescope and add patterns of noise that you want to remove? What is the pattern that a single point of light should expect to be spread out into after traveling through the mirror/telescope/instrument/sensor system, and how do you use that to improve the image quality?

  • NimConf 2022 videos are available.

  • Events: Fat or Thin. Agree, you generally want fat events because you want to do as much work up front as you can. Events should not drive backend load to rehydrate data that was already available at the time the event was created. It's not a coupling issue because all the code needs to be in sync in any case. If your events are going across system boundaries then your decision criteria may be different. Great discussion on HackerNews.
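
The fat/thin distinction in one snippet (these event shapes are hypothetical): a fat event carries the state the producer already had in hand, so consumers never call back into the source system:

```python
# A thin event forces every consumer back to the source service to rehydrate state:
thin_event = {"type": "order_placed", "order_id": "o-123"}  # consumer must call the orders API

# A fat event carries the state that was already available when it was emitted:
fat_event = {
    "type": "order_placed",
    "order_id": "o-123",
    "customer_id": "c-9",
    "items": [{"sku": "sku-1", "qty": 2}],
    "total_cents": 1998,
}

def handle(event, fetch_order=None):
    """Consumer: a fat event is self-sufficient; a thin one needs a backend read."""
    if "total_cents" in event:
        return event["total_cents"]
    # Thin path: extra read load on the source system for every consumer.
    return fetch_order(event["order_id"])["total_cents"]

print(handle(fat_event))  # 1998, no callback to the orders service
```

Multiply the thin path by every consumer of every event and you get exactly the rehydration load the discussion warns about.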

  • It's Time to Replace TCP in the Datacenter
    • It is time to recognize that TCP's problems are too fundamental and interrelated to be fixed; the only way to harness the full performance potential of modern networks is to introduce a new transport protocol into the datacenter. Homa demonstrates that it is possible to create a transport protocol that avoids all of TCP's problems.
    • Excellent discussion on HackerNews. We could do better than TCP/IP in controlled network scenarios. Same within appliances and racks.
    • Also, Aquila: A unified, low-latency fabric for datacenter networks, CliqueMap: Productionizing an RMA-Based Distributed Caching System, Snap: a Microkernel Approach to Host Networking.
    • Amazon agrees. They've created SRD (scalable reliable datagram). It focuses more on performance and less on reliability because, you know, a datacenter isn't the internet. SRD offers multi-pathing and microsecond-level retries, and runs on dedicated hardware (Nitro). SRD is an Ethernet-based transport. SRD reduces EBS tail latency, which is key because average latency doesn't matter for data. So better overall performance and a massive improvement in edge cases.

  • It's all in the setup. Scaling PostgresML to 1 Million ML Predictions per Second on commodity hardware.

  • Scaling Mastodon is Impossible. Is that really a bad thing? Though it does seem likely, since centralization has always won, we'll see centralized virtual layers on top of the fediverse.

  • How to reduce 40% cost in AWS Lambda without writing a line of code!:
    • Higher memory configurations bring very little benefit and high risk.
    • The batch size was unchanged. We observed that it did not affect performance or costs.
    • The parallelizationFactor was changed from 3 to 2. This setting caused a small delay in queue ingestion, but it's at a three-digit ms level.
    • The maximumBatchingWindowInSeconds was changed from 0 to 3. We observed that it only adds a two-digit ms delay to ingestion time and brings a significant cost reduction.
    • The runtime architecture was changed from X86_64 to ARM64. This resulted in no visible performance change for our case, but AWS charges 25% less for ARM64.
    • We didn't change the code at all. Welcome to the age of the cloud.
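
The ARM64 line item above is easy to sanity-check with a back-of-envelope cost model. The rates below are assumptions (roughly us-east-1-style list prices, ignoring the free tier), not authoritative; check the AWS pricing page for current numbers:

```python
# Rough Lambda cost model for comparing configurations.
# All pricing numbers are illustrative assumptions, not current AWS rates.

GB_SECOND_PRICE = {
    "x86_64": 0.0000166667,  # assumed $/GB-second
    "arm64":  0.0000133334,  # assumed $/GB-second (~20% less per GB-s)
}
REQUEST_PRICE = 0.20 / 1_000_000  # assumed $ per invocation

def monthly_cost(invocations, avg_duration_ms, memory_mb, arch="x86_64"):
    """Estimate monthly Lambda compute + request cost (free tier ignored)."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * GB_SECOND_PRICE[arch] + invocations * REQUEST_PRICE

x86 = monthly_cost(100_000_000, 120, 512, "x86_64")
arm = monthly_cost(100_000_000, 120, 512, "arm64")
print(f"x86_64: ${x86:,.2f}  arm64: ${arm:,.2f}  saving: {1 - arm/x86:.1%}")
```

Note the blended saving is smaller than the compute-only discount, because the per-request charge is the same on both architectures.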

  • When Is Serverless More Expensive Than Containers? At 66 requests per second. That's over 170.2 million requests per month, so you have a lot of headroom. App Runner is about 25% the cost of EC2. A good thing to keep in mind is that serverless costs are linear with usage. You won't hit that inflection in the curve where you need another level of complexity to cope. They didn't include API Gateway costs, which would likely change the numbers.
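
The break-even rate is just the fixed monthly container cost divided by the per-request serverless price. A sketch with assumed placeholder prices (which happen to land near the article's figure of ~66 req/s):

```python
# Where does pay-per-request serverless cost the same as a fixed-price
# container? Both prices below are hypothetical placeholders.

SECONDS_PER_MONTH = 60 * 60 * 24 * 30          # ~30-day month
cost_per_million_requests = 3.50               # assumed serverless $ per 1M req
container_monthly_cost = 600.0                 # assumed fixed container $ / month

break_even_requests = container_monthly_cost / cost_per_million_requests * 1_000_000
break_even_rps = break_even_requests / SECONDS_PER_MONTH
print(f"Break-even: {break_even_requests:,.0f} req/month (~{break_even_rps:.0f} req/s)")
```

Below the break-even rate, serverless is cheaper; above it, the fixed-price container wins.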

  • Cumulus on How We Decreased Cloud Costs and Increased System Resilience:
    • After seeing over 400% increase in our monthly bill, it was clear that we had to completely rethink how we were using the cloud.
    • Overall, we lowered our monthly bill by 71%. Perhaps most importantly, the improvements we made yielded a highly resilient system.
    • The legacy technologies used in the original system were priced at a premium by service providers.
    • Our customers have “bursty” usage patterns.
    • We had to anticipate and provision for peak usage levels since scaling our services took too much time.
    • Leveraging serverless technologies, we decided to evolve our system from a monolithic structure to serverless microservices
    • We shifted our primary storage from a relational database to DynamoDB tables
    • Replaced shared-secret authentication with granular IAM policies
    • Leveraged AWS Backup to create Point-In-Time-Recovery
    • Eliminated AppSync cache by improving backend resources
    • Reduced verbose data logging

  • Want your Intel processor to be able to do addition this week? You'll soon be able to buy a weekly, monthly or yearly subscription for that. Intel Finalizes 'Intel on Demand' Pay-As-You-Go Mechanism for CPUs.

  • Questions to ask on a date with a software engineer. LOL.

  • How we reduced our annual server costs by 80% — from $1M to $200k — by moving away from AWS:
    • a few years of growth later, we’re handling over 70,000 pages per minute, storing around 560 million pages, and paying well over $1,000,000 per year.
    • we were able to cut costs by 80% in a little over three months
    • Migrate the cached pages and traffic onto Prerender’s own internal servers and cut our reliance on AWS as quickly as possible. When we did a cost projection we estimated that we could reduce our hosting fees by 40%, and decided on a server migration.
    • When the writes to S3 had been stopped completely, Prerender saved $200 a day on S3 API costs
    • In the last four weeks, we moved most of the cache workload from AWS S3 to our own Cassandra cluster. The daily cost of AWS was reduced to $1.1K per day, projecting to $35K per month, and the new servers’ monthly recurring cost was estimated to be around $14K.
    • The true hidden price for AWS is coming from the traffic cost. They sell reasonably priced storage, and it’s even free to upload, but when you get data out, you pay an enormous cost. In our case, it was easily around $30k to $50k per month. By the end of phase two, we had reduced our total monthly server costs by 41.2%.
    • This step involved moving all the Amazon RDS instances shard by shard. Our monthly server fees dropped past our initial estimate of 40%, reaching a full 80% reduction by the time all the cached pages were redirected.
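
The egress bill is easy to estimate in the same back-of-envelope style. Everything here except the 70,000 pages/minute figure is an assumed placeholder:

```python
# Back-of-envelope S3 egress estimate. The $/GB rate and page size are
# assumptions (egress tiers vary); plug in your own numbers.

egress_price_per_gb = 0.09        # assumed $/GB out to the internet
avg_page_kb = 75                  # assumed average cached page size
pages_per_minute = 70_000         # from the article

gb_per_month = pages_per_minute * 60 * 24 * 30 * avg_page_kb / 1_000_000
monthly_egress_cost = gb_per_month * egress_price_per_gb
print(f"~{gb_per_month:,.0f} GB/month -> ~${monthly_egress_cost:,.0f}/month egress")
```

With these assumptions the estimate lands in the same ballpark as the $30k-$50k/month the article reports, which is the point: at this traffic level, egress dominates storage.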

  • P99 Conf videos are available. Looks like lots of good content.

  • How we built Pingora, the proxy that connects Cloudflare to the Internet:
    • Today we are excited to talk about Pingora, a new HTTP proxy we’ve built in-house using Rust that serves over 1 trillion requests a day, boosts our performance, and enables many new features for Cloudflare customers, all while requiring only a third of the CPU and memory resources of our previous proxy infrastructure.
    • We chose Rust as the language of the project because it can do what C can do in a memory safe way without compromising performance.
    • The next design decision was around our workload scheduling system. We chose multithreading over multiprocessing in order to share resources, especially connection pools, easily. We also decided that work stealing was required to avoid some classes of performance problems mentioned above.
    • We decided to implement a “life of a request” event based programmable interface similar to NGINX/OpenResty
    • Overall traffic on Pingora shows 5ms reduction on median TTFB and 80ms reduction on the 95th percentile. The savings come from our new architecture which can share connections across all threads. This means a better connection reuse ratio, which spends less time on TCP and TLS handshakes. Pingora makes only a third as many new connections per second compared to the old service. For one major customer, it increased the connection reuse ratio from 87.1% to 99.92%, which reduced new connections to their origins by 160x.
    • Pingora consumes about 70% less CPU and 67% less memory compared to our old service with the same traffic load.
    • Pingora crashes are so rare we usually find unrelated issues when we do encounter one.
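
The 160x figure follows directly from the reuse ratios: only requests that miss the connection pool need a new origin connection, so the reduction is the ratio of the miss rates. A quick check:

```python
# Sanity-check Cloudflare's ~160x claim: the fraction of requests needing
# a *new* origin connection is (1 - reuse_ratio).

old_reuse, new_reuse = 0.871, 0.9992
reduction = (1 - old_reuse) / (1 - new_reuse)
print(f"New-connection reduction: ~{reduction:.0f}x")
```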

  • Walmart shows all the optimizations that can be applied to any write-heavy API. Scaling the Walmart Inventory Reservations API for Peak Traffic.
    • Scatter-Gather the API requests with a sticky session so a database partition is always processed by the same instance.
    • In-Memory concurrency using actor pattern with mailbox to restrict the processing of a single partition to a single thread. This also helps with batch processing of the same partition requests.
    • In-Memory snapshot state caching to reduce the number of reads.
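
The actor-with-mailbox idea can be sketched in a few lines: one queue and one worker thread per partition, with sticky routing by key, so requests for the same partition are serialized without locks. This is just the pattern, not Walmart's implementation; all names are made up:

```python
# Minimal single-writer-per-partition sketch: each partition has one
# mailbox (queue) drained by exactly one thread, so requests for the
# same partition never race and need no locking.
import threading, queue, collections

NUM_PARTITIONS = 4
mailboxes = [queue.Queue() for _ in range(NUM_PARTITIONS)]
inventory = collections.defaultdict(int)       # partition-owned state

def worker(pid):
    # Drain this partition's mailbox until the shutdown sentinel arrives.
    while True:
        msg = mailboxes[pid].get()
        if msg is None:
            return
        sku, qty = msg
        inventory[sku] += qty                  # safe: only this thread serves pid

threads = [threading.Thread(target=worker, args=(pid,)) for pid in range(NUM_PARTITIONS)]
for t in threads:
    t.start()

def reserve(sku, qty):
    # Sticky routing: every request for a SKU goes to the same partition.
    mailboxes[hash(sku) % NUM_PARTITIONS].put((sku, qty))

for _ in range(1000):
    reserve("sku-42", 1)

for mb in mailboxes:
    mb.put(None)                               # one shutdown sentinel per worker
for t in threads:
    t.join()

print(inventory["sku-42"])                     # prints 1000
```

Because a mailbox batches up pending messages, the worker could also drain several at once and apply them in one write, which is the batch-processing win the article mentions.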

  • Videos from ServerlessDays New York 2022 are available.

  • A useful discussion about what zero trust means and why you want it. The US government certainly does. OMB Zero Trust Memo, with Eric Mill. It means removing implicit trust inside the network. You don't log in at layer 3 or layer 4 and then have trust inside the network. You still have single sign-on, but you log in at the application layer. Zero trust is about least privilege and getting more information about the users using the system. Pushing unencrypted traffic on the internal network is putting a lot of implicit trust in the system. HTTPS everywhere.

Soft Stuff:

  • Kiteco: Kite open sourced a lot of their code for AI code generation.
  • Dark: a new way of building serverless backends. Just code your backend, with no infra, framework or deployment nightmares.
  • gluesql: a SQL database library written in Rust.
  • This looks very nice. FlashDB: an ultra-lightweight embedded database with both key-value and time-series modes.
  • Supabase: an open source Firebase alternative. Start your project with a Postgres database, Authentication, instant APIs, Edge Functions, Realtime subscriptions, and Storage.
  • hasura/graphql-engine: a streaming GraphQL API (subscriptions) on your Postgres database, so that you can securely stream data to apps and services with minimal setup. Benchmarked to handle 1M concurrent GraphQL clients streaming from 10000 channels with 1M events ingested every minute - on a single instance of Postgres RDS.
  • TinyBase: a smart new way to structure your local app data.
  • surrealdb: an end-to-end cloud native database for web, mobile, serverless, jamstack, backend, and traditional applications.
  • EdgeDB: a new database that extends the relational model to eliminate the object-relational impedance mismatch.

Pub Stuff:

  • Human-level play in the game of Diplomacy by combining language models with strategic reasoning: Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
  • Scalable algorithm simplification using quantum AND logic: Here we implement a resource-efficient construction for a quantum version of AND logic that can reduce the compilation overhead, enabling the execution of key quantum circuits.
  • The Security Design of the AWS Nitro System: This paper provides a high-level introduction to virtualization and the fundamental architectural change introduced by the Nitro System. It discusses each of the three key components of the Nitro System, and provides a demonstration of how these components work together by walking through what happens when a new Amazon Elastic Block Store (Amazon EBS) volume is added to a running EC2 instance. The whitepaper discusses how the Nitro System, by design, eliminates the possibility of administrator access to an EC2 server, the overall passive communications design of the Nitro System, and the Nitro System change management process. Finally, the paper surveys important aspects of the EC2 system design that provide mitigations against potential side-channel issues that can arise in compute environments.
  • Legged Locomotion in Challenging Terrains using Egocentric Vision: In this paper, we present the first end-to-end locomotion system capable of traversing stairs, curbs, stepping stones, and gaps. We show this result on a medium-sized quadruped robot using a single front-facing depth camera. The small size of the robot necessitates discovering specialized gait patterns not seen elsewhere.
  • Monolith: Real Time Recommendation System With Collision: In this paper, we present Monolith, a system tailored for online training. Our design has been driven by observations of our application workloads and production environment that reflect a marked departure from other recommendation systems. Our contributions are manifold: first, we crafted a collisionless embedding table with optimizations such as expirable embeddings and frequency filtering to reduce its memory footprint; second, we provide a production-ready online training architecture with high fault-tolerance; finally, we proved that system reliability could be traded off for real-time learning. Monolith has successfully landed in the BytePlus Recommend product.
  • MultiRay: Optimizing efficiency for large-scale AI models: State of the art AI model running at 20 million queries per second, 800 billion queries per day.
  • Abuse and Fraud Detection in Streaming Services Using Heuristic-Aware Machine Learning: This work presents a fraud and abuse detection framework for streaming services by modeling user streaming behavior. The goal is to discover anomalous and suspicious incidents and scale the investigation efforts by creating models that characterize the user behavior.
  • Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems: We show that all hardware types such as disk, SSD, CPU, memory, and network components can exhibit performance faults. We made several important observations such as faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.
  • A World Wide View of Browsing the World Wide Web: we also observe significant differences in the specific sites visited based on where users are located. Beyond the small set of globally popular sites, most websites are country-specific. For example, of sites appearing in the top 1K for at least one country, over half do not rank in the top 10K for any other country, and even among the top 20 sites for a given country, at least half are only nationally popular.
  • Owl: Scale and Flexibility in Distribution of Hot Content: In contrast to prior assumptions about peer-to-peer distribution, Owl shows that centralizing the control plane is not a barrier to scalability: Owl distributes over 800 petabytes of data per day to millions of client processes. Owl improves download speeds by a factor of 2–3 over both BitTorrent and a prior decentralized static distribution tree used at Meta, while supporting 106 use cases that collectively employ 55 different distribution policies.
]]>
<![CDATA[The Cloud is Not a Railroad - An Argument Against the Vertical Separation of Cloud Providers]]>

There's a move to regulate cloud providers by vertically separating the services they offer.

Like railroads of yore, who were not allowed to provide freight services on top of their base services, cloud providers would not be allowed to provide services on top of their base platform services.

]]>
https://highscalability.com/the-cloud-is-not-a-railroad-an-argument-against-the-vertical/65bceaca3953980001659b4fMon, 24 Oct 2022 10:52:23 GMT

There's a move to regulate cloud providers by vertically separating the services they offer.

Like railroads of yore, who were not allowed to provide freight services on top of their base services, cloud providers would not be allowed to provide services on top of their base platform services.

Vertical separation would be new to the cloud industry. Is it a good idea? Would it actually solve any problems? My answers are no and no, but probably not for the reasons you think. Let's dive in.

Here are a few useful resources for exploring this argument:

  • Maintaining monopolies with the cloud by Cory Doctorow. https://pluralistic.net/2022/09/28/other-peoples-computers.
  • A group advocating for fair licensing. https://www.fairsoftwarelicensing.com/our-principles.
  • Cloud Infrastructure Services -- An analysis of potentially anti-competitive practices by Professor Frédéric Jenny. https://www.fairsoftwarestudy.com/.

Now Cory is about 100x smarter than I am, but this analogy is wrong:

That's why the trustbusters were so big on "structural separation": the principle that a business can own a platform or use the platform, but not both.

So this remedy is also wrong:

Rather than trying to construct rules that kept the referee honest even when they played on one of the teams, they ordered the companies to choose one role – either you were the ref, or you were a player, but you couldn't referee a match where you were also in the competition.

Because this idea, while trivially true, misses the point:

There is no cloud, there is only other people's computers.

This is like saying there is no human body, it's just a collection of cells. While true, it is the lesser truth that fails to recognize the importance of the body, organs, and the adaptive function of the brain. That's what a cloud provider is; it's the body electric for the cloud.

The report by Professor Jenny seems to be the main support for cleaving cloud providers in two.

I've read the report. Is there a smoking gun? Not that I found. But then, I'm not interested in enterprise licensing woes with Oracle or Office 365. What I'm concerned about, as a developer, is the collateral damage such ham-fisted regulation might cause.

The job of a cloud is to run workloads. Workloads are implemented using services. A cloud offers those services. The more high-quality services a cloud offers, the better products we can build.

Limit the cloud services a cloud provider can offer and you limit the quality of the software we can build. It would be like removing legos from a lego set. You can't do a proper build without the full set.

There is no arbitrary dividing line that can be made such that platform services are on one side of the line and non-platform services are on the other side. A cloud is a platform all the way down. Without cloud services, there is no cloud because a cloud is literally defined by the services it offers.

A cloud provider packages up vast pools of compute resources and resells them as cloud services. There is no structural separation as there is with a train service. What some consider infrastructure or platform is just another cloud service.

Would you really trust some committee or government agency to draw this line correctly? I would not because I don't think the people who think this way actually understand how the cloud works.

Does anyone really want to go back to the VM-centric days when we rolled everything ourselves? No.

What it means to be cloud-native has gone through several evolutions: VM to container to serverless. Each cloud-native evolution is about using the hardware more efficiently.

Do you think those evolutions would have happened with a structural separation in place? No.

Where's the incentive unless you derive a benefit from optimizing the whole stack? We would still be building higher and higher on top of VMs because that's where platform innovation would have stopped evolving.

This is because the cloud is not a railroad. They are different in kind. Building software is not moving freight. A railroad is fixed. The cloud is ever evolving.

This is my best argument for why the cloud should not be vertically separated. It would make the job of building quality software even harder and slower and more expensive.

Admittedly, it's a developer focussed argument, but there are more technical arguments against the move. Let's walk through them.

Is The Cloud a Natural Monopoly?

Wikipedia:

A natural monopoly is a monopoly in an industry in which high infrastructural costs and other barriers to entry relative to the size of the market give the largest supplier in an industry, often the first supplier in a market, an overwhelming advantage over potential competitors.

Let's look at the size of the market. For all the attention the cloud gets, it's still a child or maybe a pre-teen in terms of the overall IT industry.

The market value of the worldwide IT industry is $5 trillion. The global cloud computing industry was valued at $371 billion in 2020, about 7% of the total.

So, the cloud is not a monopoly. Why should it be regulated as one?

Examples of traditional natural monopolies are dams, railroads, power plants, water supply, and sewer services. The idea is it's practical to only have one of these for a given area. And even that list is not invulnerable. With solar power, the centralization of power generation is giving way to a more decentralized model.

Does "there can be only one" apply to the cloud? No. Network effects are not the same as monopoly control.

Apple is not a monopoly in the smartphone market simply because people like their smartphones. Neither are clouds. People choose to run their workloads on the cloud from a large set of options. It's a choice. It's not because there are no other options available to them because of structural issues.

While high infrastructure costs do create a barrier to entry to creating a cloud provider, this misses an important point: the benefits of the cloud come from the cloud model, not any particular cloud implementation.

Here's an informal definition of a cloud provider taken from my book Explain the Cloud Like I'm 10:

A cloud provider installs a bunch of computers in a bunch of datacenters spread throughout the world, pools all those datacenter resources together, offers them over the internet as cloud services through an API, and rents them to the public in an on-demand, pay-for-what-you-use basis.

The cloud model means accessing services over a network through an API, safely running multiple workloads on a single server, cloud storage, cloud computing, auto-scaling, elasticity, resource pooling, on-demand pay-for-what-you-use billing, load balancing, fault tolerance, leveraging managed services, and building software using a cloud-native paradigm.

The cloud model has rightly attracted customers because it's a powerful way to create, distribute, and use software, but the cloud model is not relegated to only the cloud.

The cloud model and, thus, cloud-native software development can work in any context—on-prem, colo, hybrid, edge, bare-metal. It’s the model that matters. Not who implements it or how it’s implemented.

This is what I mean when I say people aren't understanding the cloud. They over-index on specific cloud implementation instances and not the value and flexibility of the cloud model itself.

Does the Free Market Work for the Cloud?

Another way to address the natural monopoly question is to ask: does the free market work for the cloud?

This is where we have a problem. Like telecom, the cloud provider market slipped into oligopoly. There is little incentive for the Big 3 to drive costs lower, but that's not the same as having a monopoly that justifies splitting up and regulating an industry.

There are many alternatives for running workloads. Take a look at this still incomplete list of cloud providers from Wikipedia. The problem is there are no easy alternatives for running workloads, and that's because there is no standard cloud API.

We have a cloud model that offers a cloud-native method of software development, but it may take several more technological evolutions for the model to advance enough to have a minimal viable portability layer. With container-native PaaS (App Runner and Cloud Run) we are getting close.

We have several players like Cloudflare innovating to become alternative lower-cost cloud providers. And we have k8s striving to become a cloud in a box droppable into any datacenter. You can switch between clouds with effort. It has been done. You can even go old school and use non cloud-native architectures. Anyone can rent a rack in a colo and stand up a system. Did anyone promise you a rose garden? I don't think so.

The larger point is there is a large competitive market amongst a group of competing enterprises. There's just no easy solution to the perceived problem of fungible cloud alternatives.

Is there a way to create greater competition and expand economic choice for cloud customers? I don't know, but high switching costs aren't a proper test for regulating an industry. It's a technical issue that's fixable with effort and investment. Maybe start investing in open source alternatives rather than complaining they don't already exist? Invent the future you want.

A Cloud Provider Absorbs Huge Fixed and Sunk Costs

Cloud providers incur huge fixed costs for creating and maintaining a network of datacenters spread throughout the world. Let's say you regulate cloud providers by requiring them to decouple platform services from their other services, whatever those services may be.

Then we must rely on third parties to fill in the gap with managed services of all varieties that must operate out of every datacenter while surviving datacenter level and region level failures.

That's easy to do, right? Of course not. Very few organizations in the world have the technological know-how to create such managed services, especially without low-level control of the entire system. But let's ignore that.

The fixed and sunk costs incurred by the cloud provider will dwarf any investment from third party service providers.

How do you create the right incentives for the cloud provider to keep investing huge amounts of capital in infrastructure while recovering its fixed costs? If competitive service-level pricing does not cover the cost of the infrastructure, then prices would have to be raised to a level that would price out lower-margin users.

We've seen telecoms rest and vest on their installed infrastructure: rent-seeking through a combination of regulatory capture and avoidance of the infrastructure investments that would improve and maintain the system.

The incentive to game service pricing levels in order to create short-term profits at the expense of long-term investment will be overwhelming. It will be impossible to properly set service prices at a level that provides the ROI required by cloud providers and third-party service providers while minimizing costs for customers so they can also earn an adequate ROI. There's simply not enough room for everyone to get their piece of the pie. What we'll inevitably see are the constantly rising prices and declining service levels experienced in other industries that are regulated as utilities.

This is the single greatest weakness of the proposal for vertically separating cloud providers.

Is that the kind of cloud you want?

Benefits of Vertical Integration

Where is the optimal point of vertical separation in a cloud provider's stack of services? What are the consequences of making the wrong decision?

Any deliberation must recognize the technological interdependence between cloud infrastructure technology and the cloud services that can be offered.

There's a huge short-term and long-term efficiency of services that depends on the successful coordination of cloud services and infrastructure.

For example, AWS created Nitro. Nitro is a revolutionary combination of purpose-built hardware and software designed to provide performance and security. Would Nitro have been invented if AWS was restricted to being a platform provider? Unlikely. It would have had no way of propagating Nitro across an entire vertical stack of hardware and software services. So why bother innovating?

Vertically-integrated cloud providers can optimize the entire system by ensuring that each component is combined in sub-systems that work together and are optimized as whole.

One consequence of this tight integration is the overall high reliability offered by clouds. How reliable would the cloud be if a cloud provider did not control their entire stack of services? Given the experience with finger pointing in other highly interdependent systems, it would be less reliable and outages would no doubt last longer.

Are there any positive externalities that might justify leaving cloud providers vertically integrated? Very large integrated cloud providers might reduce the overall number of servers required to run the world's workloads, but that alone isn't my reason for opposing separation. I think my other arguments are reason enough.

Please leave our legos alone.

]]>
<![CDATA[Sponsored Post: G-Core Labs, Close, Wynter, Pinecone, Kinsta, Bridgecrew, IP2Location, StackHawk, InterviewCamp.io, Educative, Stream, Fauna, Triplebyte]]>

Who's Hiring?

  • Close is building the sales communication platform of the future and we're looking for a Site Reliability Engineer to help us accomplish that goal. If you have a passion for databases (both SQL and NoSQL), significant experience building, managing, and monitoring infrastructure, databases, and
]]>
https://highscalability.com/sponsored-post-g-core-labs-close-wynter-pinecone-kinsta-brid-65bcda83ffae163275932644/65bceaca3953980001659b4eMon, 18 Jul 2022 13:38:51 GMT

Who's Hiring?

  • Close is building the sales communication platform of the future and we're looking for a Site Reliability Engineer to help us accomplish that goal. If you have a passion for databases (both SQL and NoSQL), significant experience building, managing, and monitoring infrastructure, databases, and backend services at scale and want to work with a rag-tag team of hardworking, but humble humans, then come check us out!
  • Wynter is looking for system administrators, engineers, and developers to join its research panel. Participate in research surveys, get paid ($90-$180/hr) for your feedback and comments.  Super low key commitment, 10-15 mins per survey. Learn more and sign up.
  • DevOps Engineer: At Kinsta, we set out to create the best managed hosting platform in the world. If you are an experienced DevOps Engineer who is constantly looking for ways to innovate and improve, we might just be the place for you! As Kinsta’s DevOps Engineer, you will be instrumental in making sure that our infrastructure is always on the bleeding edge of technology, remaining stable and high-performing at all times. If you love working with Linux, have a background in PHP, and have worked with cutting-edge technologies, such as Ansible and LXC, check us out and apply here.
  • SysOps Engineer: As Kinsta’s SysOps Engineer, you’ll play a key role in caring for the health of our servers, preventing issues, and responding immediately to mitigate any problems in our infrastructure.  If you have experience in hosting and with the WordPress stack, have worked with Ubuntu or Debian-based systems, and cutting-edge technologies, such Ansible and LXC, you should definitely check us out and apply here!
  • JavaScript Developer: Kinsta’s Development Team aims to create truly effective solutions for our customers and our internal teams. We believe the only way to maintain and scale our standards is to focus on quality code. If you are a talented JavaScript developer and you would like to work with an ownership-embracing, highly skilled team in a relaxed environment where culture is as important as technical excellence, apply here.
  • InterviewCamp.io has hours of system design content. They also do live system design discussions every week. They break down interview prep into fundamental building blocks. Try out their platform.
  • Triplebyte lets exceptional software engineers skip screening steps at hundreds of top tech companies like Apple, Dropbox, Mixpanel, and Instacart. Make your job search O(1), not O(n). Apply here.
  • Need excellent people? Advertise your job here!

Cool Products and Services

  • G-Core Labs CDN is a next-generation content delivery network that helps you accelerate the dynamic and static content of your websites or applications. CDN has many built-in web security features and operates through a rapidly growing Anycast/DNS infrastructure with over 140 points of presence worldwide. Learn more here.
  • Search like you mean it. Pinecone is a vector database that makes it easy to add semantic search to your applications. It combines state-of-the-art vector search libraries like Faiss, advanced features such as filtering, and fully managed distributed infrastructure for high performance and reliability at any scale. Give your users better search results and recommendations and make them happy. Get started for free.
  • Bridgecrew is the cloud security platform for developers. By leveraging automation and delivering security-as-code, Bridgecrew empowers teams to find, fix, and prevent misconfigurations in deployed cloud resources and in infrastructure as code. Get started for free!
  • IP2Location has been an IP address geolocation service provider since 2002. The geolocation database or API detects location, proxy, and more than 20 other parameters. The technology has been cited in more than 700 research papers and is trusted by many Fortune 500 companies. Try it today!
  • Developers care about shipping secure applications. Application security products and processes, however, have not kept up with advances in software development. There are a new breed of tools hitting the market that enable developers to take the lead on AppSec. Learn how engineering teams are using products like StackHawk and Snyk to add security bug testing to their CI pipelines.
  • Stateful JavaScript Apps. Effortlessly add state to your Javascript apps with FaunaDB. Generous free tier. Try now!
  • Learn to balance architecture trade-offs and design scalable enterprise-level software. Check out Educative.io's 5-part learning path: Scalability and System Design for Developers. In addition to other content written by seasoned industry experts, this path includes core material from our bestselling course Grokking the System Design Interview. Join more than 500,000 other learners from top tech companies around the world.
  • Build, scale and personalize your news feeds and activity streams with getstream.io. Try the API now in this 5 minute interactive tutorial. Stream is free up to 3 million feed updates so it's easy to get started. Client libraries are available for Node, Ruby, Python, PHP, Go, Java and .NET. Stream is currently also hiring Devops and Python/Go developers in Amsterdam. More than 400 companies rely on Stream for their production feed infrastructure, including apps with 30 million users. With your help we'd like to add a few zeros to that number.

Fun and Interesting Events

  • Advertise here!

If you are interested in a sponsored post for an event, job, or product, please contact us for more information.


Make Your Job Search O(1) — not O(n)

Triplebyte is unique because they're a team of engineers running their own centralized technical assessment. Companies like Apple, Dropbox, Mixpanel, and Instacart now let Triplebyte-recommended engineers skip their own screening steps.

We found that High Scalability readers are about 80% more likely to be in the top bracket of engineering skill.

Take Triplebyte's multiple-choice quiz (system design and coding questions) to see if they can help you scale your career faster.


If you are interested in a sponsored post for an event, job, or product, please contact us for more information.

]]>
<![CDATA[Stuff The Internet Says On Scalability For July 11th, 2022]]>Never fear, HighScalability is here!
]]>
https://highscalability.com/stuff-the-internet-says-on-scalability-for-july-11th-2022/65bceaca3953980001659b4dMon, 11 Jul 2022 15:11:55 GMTNever fear, HighScalability is here!


Every cell a universe. Most detailed image of a human cell to date. @microscopicture

Other images considered: one byte of RAM in 1946; a visual guide to troubleshooting Kubernetes; Cloudflare using lava lamps to generate cryptographic keys; what 5MB of data looked like in 1966

My Stuff:

  • Love this Stuff? I need your support on Patreon to help keep this stuff going.
  • Know anyone who needs to go from zero to cloud? My book teaches them all they need to know about the cloud. Explain the Cloud Like I'm 10. It has 440 mostly 5 star reviews on Amazon. Here's a 100% sincere, heartfelt review:
  • Do you like Zone 2, Zone 5, interval, REHIT,  HIIT, and bodyweight workouts? I made an app for that. Max Workout. I’m not just the programmer, I’m a client. I use it 4 times a week and if you want to retrain your mitochondria, get fit, get healthy, and live longer, I think you’ll like it too.

Number Stuff:

  • 9: on-prem servers to run Stack Overflow. 200 sites with 6000 requests/second, 2B views/month, millisecond latency, <10% utilization.
  • 3: hours a day of thinking is all it takes to get a Fields Medal.
  • .73: our Kardashev Scale rating. We have a long ways to go.
  • 20+: shards for Riot Games’ millions of users. 500,000 events per second at peak. 8 terabytes of data generated per day. Kafka buffers data on-prem, which is then moved to AWS on the backend.
  • 4-bit: microcontroller on plastic. Using 4-bit and 8-bit logic instead of 16-bit or 32-bit helped. So did separating the memory that stores instructions from the memory that stores data. But they also cut down on the number and complexity of the instructions the processor is capable of executing. The team simplified further by designing the processor so it executes an instruction in a single clock cycle.
  • 100 trillion: digits in Google’s calculation of pi in 157 days of calculation time, which is just a little more than a month longer than the 121 days it took for the 31.4 trillion-digit calculation in 2019.
  • 40%: increase in performance using Go generics in Google B-Tree implementation.
  • 103: bits of advice from Kevin Kelley after 70 years spent balancing on this spinning blue-green globe.
  • 1,000,000: concurrent connections is possible, you just have to tune everything correctly.
  • 95%: IRShield's effectiveness against passive WiFi eavesdropping attacks.
  • $100 million: GDPR fines in the first half of 2022, an increase of 92% over H1 2021.
  • 18: months a botnet hid inside the rotting corpse of corporate security.
  • 0.2: watts used for computation in the human cerebral cortex, though communication costs roughly 20 times as much as computation.
  • 2/3: of subsea fiber-optic cable faults caused by fishing vessels and ships dragging anchors.
  • 1 TW: solar to be deployed annually by 2030.
  • $3.4 million: value of unpaid work by reddit moderators.
  • $34 million: locked into the AkuDreams contract forever because of a bug. Do you really want software to control the world with no other compensating layers? Really?
  • Five Billion: image-text pair dataset from LAION for open-source large-scale machine learning.
  • 100+ Mbps: Starlink download speed.
  • 20 trillion: events processed per day by Netflix (2021).
  • 200 trillion: objects held in S3. That's about 29,000 objects per Earthling. S3 averages over 100 million requests per second and has grown by a factor of 250,000 in less than 16 years.
  • 100 terawatt-hours: electricity used by the Bitcoin network in 2021, more than the typical annual energy budget of Finland.
  • ~47%: speed increase of PHP 8.1 over PHP 8.0.
  • $5000: monthly spend on Linode servers for the Overcast podcasting app.
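A couple of the S3 figures above can be sanity-checked with quick arithmetic. This is just a back-of-the-envelope sketch using only the numbers quoted (200 trillion objects, ~29,000 objects per Earthling, 250,000x growth in under 16 years), no external data:

```python
# Sanity-check the quoted S3 figures with back-of-the-envelope arithmetic.

objects_total = 200e12          # "200 trillion objects held in S3"
objects_per_person = 29_000     # "about 29,000 objects per Earthling"

# Implied world population: should land near 7 billion people.
implied_population = objects_total / objects_per_person
print(f"implied population: {implied_population / 1e9:.1f} billion")

# 250,000x growth over ~16 years implies the compound annual growth rate below.
annual_growth = 250_000 ** (1 / 16)
print(f"annual growth: {annual_growth:.1f}x per year")
```

This prints an implied population of about 6.9 billion and roughly 2.2x annual growth, both consistent with the quoted figures.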

Quotable Stuff:

  • @functi0nZer0: BREAKING: Shock rippled across Solana today as everyone realized that their secret sauce for increasing transactions per second was to just make seconds last longer
  • Pinterest: Today, Pinterest’s memcached fleet spans over 5000 EC2 instances across a variety of instance types optimized along compute, memory, and storage dimensions. Collectively, the fleet serves up to ~180 million requests per second and ~220 GB/s of network throughput over a ~460 TB active in-memory and on-disk dataset, partitioned among ~70 distinct clusters.
  • Simon Willison: One of my favourite uses for the GPT-3 AI language model is generating explanations of how code works. It’s shockingly effective at this: its training set clearly includes a vast amount of source code.
  • @brianleroux: AWS Lambda is many things but it is *not expensive*. Its $1 for 6M invocations. The whole point is stateless on-demand to achieve 100% utilization. Said another way, provisioned server capacity is not 100% utilization. We over provisioned them to allow for spikes. Comparisons where the Lambda is being invoked 24/7 is totally disingenuous. A cluster of servers running 24/7 mostly idle is *not* the same thing. Statelessness is a feature. It enables 100% utilization and, more often than not, that efficiency will translate directly to saving big $$$ not costing more.
  • Leslie Lamport: Well, they didn’t have an algorithm, just a bunch of code. Very few programmers think in terms of algorithms. When trying to write a concurrent system, if you just code it without having algorithms, there’s no way that your program is not going to be full of bugs…Yes, the importance of thinking and writing before you code needs to be taught in undergraduate computer science courses and it’s not. And the reason is that there’s no communication between the people who teach programming and the people who teach program verification.
  • DSHR: Moving all storage to the NVMe interface is an interesting trend.
  • @adrianco: This is also useful info: "According to Intel, the CPU still represents about 32 percent of the IT equipment power budget, with memory only burning 14 percent, peripheral costs around 20 percent, the motherboard around 10 percent, and disk drives 5 percent."
  • @chrismunns: I think one of the biggest lies we tell ourselves in IT/tech is that we need to prepare for "scale" of traffic. Generally speaking 99% of apps never break 1000 rps. Even with that you can likely serve millions of active users in a month. 1k rps comes easily with Lambda, or Fargate, or DIY with EC2/ECS/EKS/etc. You won't stress a load balancer with this much traffic. you likely won't stress a database of any major recent engine either. Internally facing enterprise apps (a surprising amount of the IT footprint for most companies) barely break 10s of RPS. Again why would you even run a server for any of that. As soon as we find a way for people to really understand async patterns, to realize that cached data gets you pretty far for a lot of apps, and that not running a server 24/7 just to HA a 2 rps app can be ok, then we'll see true sustainability of IT
  • @aeduhm: I do this for a multi-region setup where I want a single custom domain with latency routing to hit the closest AppSync API. I did some global speed testing, and with my very unscientific approach I was seeing a ~30-50ms latency increase with the APIGW proxy.
  • @josh_wills: OH: "This microservice could have been a SQL query"
  • adrian_b: In conclusion, to avoid the huge amount of resources needed by a CPU for guessing about the programs, the solution would be a high-level language compiler able to transparently allocate the data into a hierarchy of heterogeneous memories and schedule transfers between them when needed, like the compilers do now for register allocation, loading and storing. Unfortunately nobody has succeeded to demonstrate a good compiler of this kind.
  • @werner: I still get annoyed by every piece of AWS that is not #serverless!
  • @benschwarz: This is true, at least in my experience. In the earlier days, for a long while @Calibreapp handled tens of millions of API requests per month on a $7 heroku dyno. You don’t need k8s.
  • @guzmonne: Use tags. Don't be afraid to populate them with information. Names, dates, URL of pipelines, project information, etc. And, if what you are creating is disposable mark it so.
  • @JoeEmison: Checking back in on the Branch AWS bill; I think I last posted about 2 years ago, when it was still under $1,000/month. In the meantime, Branch has grown ~15X YoY for two years, and so the bill has also grown, to just around $10K/month. As a reminder, Branch is fully serverless, and carries a complete copy of production into every staging/test/etc environment, as well as every developer environment. Every developer has his/her own AWS account (also too with every other environment). Biggest spend is on DynamoDB at $4k/mo. We do buy capacity for some tables in prod, but otherwise go on demand. The majority of our DynamoDB cost is in backups, not in usage. Next is Lambda, at around $2200/mo. We do buy some provisioned capacity for Lambda, but might stop soon (not sure it's doing much for us). If we do that, the bill should be closer to $1700/mo. We have zero devops full time. Our more senior developers do all our devops in PRs. It’s a minimal amount of time.
  • @doctorow: In other words, Amazon has converted its "customer-centric" personalization system, which emphasized the products it predicted you would like best, into an auction house, where the products that have paid the most come first. Amazon sellers - 60% of sales and rising - must choose between spending on  better products or better placement on Amazon. If they choose better products, you'll never find them. Amazon made $32b last year contributing to the crapification of everything it sold.
  • @t3rabytes: For the last 6 months I've been planning a mega infra migration for the  @EpicGames Store and this morning we pulled it off without a hitch. That's: - a new domain for the Store (www. -> store.) - a new AWS account - raw EC2 to Kubernetes - additional CDN connectivity. The biggest drivers for this were SEO and single point of failure reduction. Beyond that though, we get some insane eng quality of life improvements -- prod deploy time drops from 45 minutes to 6, ephemeral/branch envs, faster auto-scaling, etc.
  • @_Amanda_Walker: As a Xoogler, I did a bunch of the opposite.  "We don't need X, we're not Google." X = stuff like "microservice architecture", "hot failover", "ability to handle millions of qps", etc.  At one point I said "I could run production on my laptop.  Don't worry about scaling yet."
  • huksley: made a mistake and accidentally created a serverless function that called itself. In a recursive loop, with a 30s timeout. I thought I fixed it and deployed the code to the dev environment. Imagine the terror when I woke up the next day to see the AWS Billing alert email saying I already owed $1,484! I removed a function and deployed it again in 30 minutes, but it was too late. It has already run for 24 hours, using over 70 million Gb-Second!
  • Ably: Round trip latency from any of our 205 global points of presence (PoPs) that receive at least 1% of our traffic: < 65ms for the 99th percentile (measured at the PoP boundary within the Ably access network, which will be closer than a datacenter).
  • Plausible: We moved from our PostgreSQL database to ClickHouse to make the app much faster and handle much bigger websites. Moving to ClickHouse was the best technical decision we’ve made, and we’re now able to count more than one billion page views per month while still providing a very fast-loading dashboard.
  • Rick Houlihan: Sooner or later when us-east-1 has problems enough "well architected" workloads will be deployed for failovers to brown out other AWS regions and cause cascade failures.
  • Elizabeth Kerr: In 2022, for the first time, Microsoft Azure surpassed Amazon Web Services (AWS). Azure increased its adoption rate from 73 percent the previous year to 77 percent this year, while AWS dropped from a 77 percent adoption rate to 76 percent.
  • Mike Shatzkin: We have gone from about 5000 bookstores in 1990 to more like 1250 today, and, on average, today’s stores carry a fraction of the number of titles the bigger ones did back then. We have gone from about 500,000 titles available in 1990 to about 20 million titles available today. We have gone from perhaps 15 or 20 publishers with their own distribution organization that were real competitors for most books an agent might sell in 1990 to five (soon to be four) today. We have gone from a world where nobody could effectively deliver a book without a publishing organization in 1990 to one where anybody can today.
  • @GergelyOrosz: Years of working in Uber’s payments team changed my view on distributed systems where participants can make money. I don’t believe any such distributed system can be as efficient as a centralised one. A centralised system spends SO much on fraud reduction and customer support. In ~10 years there’s been much talk on building a P2P version of Uber. Yet it never gained momentum. Any such system is doomed to fail as many drivers & riders would abuse the system to maximise short-term profits. Incredible what both parties do to make or save money. What changed my belief is seeing all the things drivers do to make money (using every single arbitrage opportunity - eg forcing riders to cancel & collect the cancel fee, as this is more profitable at eg airports) and also riders (from money laundering to car theft etc).
  • glenngillen: as for the service dynamic: the whole "2 pizza team" thing is both a blessing and a curse IMO. Once you launch a service at AWS’s scale you’ve immediately got tens of thousands of users. And AWS takes stability/availability/scalability incredibly seriously. I think it’s more likely that team is now in a permanent operations mode fixing weird edge cases you find at scale and keeping the lights on.
  • Daniel Spielman: That’s true with a lot of my research. A lot of the time, the way I’ve done something is not because it was the problem I was setting out to solve. I was setting out to solve something else, and it didn’t work. But I was aware enough of the intellectual landscape to figure out what I could use it for.
  • @WillManidis: In April, a team of cyberattackers attempted to breach an undersea cable off the coast of Hawaii. The cable is part of a network that carries over 95% of the world’s data.
  • @fchollet: To put the "scale" narrative into perspective... The brain runs on 15 watts, at 8-35 hertz. And while we have ~90B neurons, usually only ~1B are active at any given time. The brain is very slow and does a lot with very little. The brain's energy consumption is similar to that of a very bright LED bulb -- or 1/5th of that of a standard incandescent bulb. It's not exactly a data center.
  • Hiroyuki K. M. Tanaka: The cosmic time synchronizer works by synchronizing devices around cosmic ray events detected by those devices. This could bring accurate timing abilities to remote sensing stations, or even underwater, places that other methods cannot serve.
  • @hkarthik: The longer an engineer works in a monolith, the more they yearn to decompose it into micro-services. The longer an engineer works across a ton of micro-services, the more they yearn for the days of working in a monolith. The grass is always greener.
  • @jeremy_daly: Yesterday @MongoDB announced the GA of Atlas Serverless Instances! The usage based discounts might just be the greatest feature of this! The future of #serverless is now!
  • @fredbrancz: Recently heard from a @ParcaDev user: "We had no idea 55% of our CPU time was spent in Istio/Envoy, but Parca made it obvious!" 55%!! More than half their infrastructure cost!
  • @GergelyOrosz: Happens every single time: 1. Company realizes: "We pay ~2-3% of our revenue to payment processors! This accounts for more than $1M/year!" 2. "We should cut out the middleman." 3. They hire a team, build their own processing... to realize it's more complex & way more expensive
  • @houlihan_rick: RDBMS relies on CPU efficiency to offset the time complexity of indexed joins. Since 2014 not even the Top500 supercomputers have been able to match Moore's law. The trend is clear...TCO of RDBMS is rising as CPU performance is flattening.
  • David Rosenthal: Thus we need to ask in whose interest a vast increase in the supply of transactions would be? It pretty clearly isn't in the miners' interest, as we see with the history of attempts to increase the Bitcoin block size. Limiting the supply of transactions allows the miners to profit handsomely in periods of high demand. At first glance low fees would appear to be in the users' interest, but in practice it isn't. Low fees lead to flooding, and outages like Solana's. In practice, miners control the system, so even if it were in the users' interest it wouldn't matter. Catch-22.
  • Kevin Mitchell: Evolution is in the design game. But it produces functional designs by exploration and selection, not by raw creation from the void. It’s not just luck that living systems have the functionalities they do. The principles that enable them exist and can be found by evolution
  • Small Datum: The impact from fixing a p50 perf bug is easy to predict. Everything gets slightly faster and/or more efficient. But p99 perf bugs are outliers -- they don't make everything slower so fixing them doesn't make everything faster.
  • Jordana Cepelewicz: Faced with a threat, the brain has to act fast, its neurons making new connections to learn what might spell the difference between life and death. But in its response, the brain also raises the stakes: As an unsettling recent discovery shows, to express learning and memory genes more quickly, brain cells snap their DNA into pieces at many key points, and then rebuild their fractured genome later.
  • @asymco: The App Store has 1.8 million apps, and it is visited by 600 million people every week. The earnings of smaller UK app developers have grown by 84% in the last two years, with the number of smaller developers joining the App Store up by 40%.
  • Slack: What caused us to go from a stable serving state to a state of overload? The answer turned out to lie in complex interactions between our application, the Vitess datastores, caching system, and our service discovery system.
  • IIAOPSW: This is quantum woo fueled neo cartesian dualist bullocks. Sorry not sorry. For starters, if events in space and any point in time are accessible as "memories", the laws of physics that enable that shouldn't constrain the only application to memory.
  • Backblaze: The 6TB Seagate (model: ST6000DX000) continues to defy time with zero failures during Q1 2022 despite an average age of nearly seven years (83.7 months).
  • toeknee123: Implementing a simple and effective tagging strategy for visibility and monitoring is one of the most important aspect of cloud cost control. From our experience, this is where most of our own users start.
  • myrandomcomment: What is old is new again. IBM mainframes had a concept called a "channel controller". Everything connected to the mainframe basically was a computer itself that offloaded the main system. Every DASD (disk), communication link was its own system.
  • SOLAR_FIELDS: defining the logic of checks themselves in YAML is a trap. We went down this DSL route first and it basically just completely falls apart once you want to add moderately complex logic to your check. AirBnB will almost certainly discover this eventually. YAML does work well for the specification of how the check should behave though (eg metadata of the data check). The solution we were eventually able to scale up with was coupling specifications in a human readable but parseable file with code in a single unit known as the check. These could then be grouped according to various pipeline use cases.
  • pbalau: There is no tool in the world that will make you good at something you don't know.
  • ZeroCool2u: My main gig runs workloads primarily on AWS, but I work with a small company as well that is completely on GCP and I gotta say the difference is night and day in instance allocation and startup time. GCE is so much faster it's infuriating when I've gotta go back to work and sometimes have to wait more than 10 minutes in the worst case for an EC2 instance to finish booting in an EKS cluster that's already running!
  • @HNStatus: For the record, HN is a single process on a single server. I can't snap my fingers and magically make it redundant.
  • @jmhodges: GraphQL is a trap. GraphQL makes your public API equal to a generic database and -- worse -- a generic graph database. The amount of maintenance work is sky high. Locking the query capabilities down means you're just doing normal APIs but not locking it down means infinite performance work
  • @Carnage4Life: Today I learned 99% of cryptocurrency transactions happen on centralized exchanges like Coinbase & Binance which dont write to the blockchain but instead to their own SQL databases. Obvious in hindsight since blockchains are too slow & unscalable to build a product like Coinbase
  • @Gankra_: "We still do not have actual code failures that have been traced to this problem; we do now have small sections of contrived code that can result in the problems discussed here. And we do not have any way to convince ourselves that real code is immune from such problems." -Boehm
  • @sheeshee: fascinating. you could as of today really build awesome micro stacks with tools like firecracker, sqlite, k3s etc etc - seems that cloudflare is also opensourcing their serverless V8-based thing.. the new startup empire is needing one machine, vasily, one machine only :)
  • @castrojo: There are now over 152,000 people on the Kubernetes slack in case you were wondering if cloud native is slowing down.
  • @nicktornow: But I don’t know how we (Twitter, society) will manage the moderation problem for "dangerous" information and thoughts and the real damage that can happen to us all through appeals to our baser instincts, as is on full display in the gruesome history of the 20th century, etc.
  • @GergelyOrosz: I was a happy Google Apps Engine customer 2010 spending ~$2,000/month on an app that grew to ~1M users. In 2011, Google decided to increase prices ~5-10x changing billing fundamentals, gave 30 days notice and no tooling to prepare. I’m never going back to GCP / GAE again.
  • @rakyll: There is a real toil on working on infrastructure that is not visible. You see a region launch from a cloud provider and dismissing it like "yea, whatever" even though the immense scale of work that went into it.
  • Ken Shirriff: The Apollo missions to the Moon required complex hardware to communicate between Earth and the spacecraft, sending radio signals over hundreds of thousands of miles. The premodulation processor was a key component of this system, combining voice, scientific data, TV, and telemetry for transmission to Earth.1 It was built from components that were welded together and tightly packed into a 14.5-pound box.
  • @houlihan_rick: I have had this #SingleTableDesign/@GraphQL conversation a few times. Document/Item == GraphQL Type. GraphQL Response == A structure of Type instances. Single Table Query == An array of Type instances. GraphQL Resolver == Iterate array, map instances to Response
  • @GergelyOrosz: Blockchain, applied in the real world with a case study: To handle the ‘staggering’ amount of data of 500K loads per year (~1,500/day), Walmart Canada contracted an innovative blockchain solution using a private cloud, running 600 VMs to handle thousands of transactions per day.
  • @pati_gallardo: To all the ones saying "for a large enough N Big O comes back into play" The problem is that since the mental model doesn’t include caches folks don’t even think to measure the performance differences between their O(log n), O(n log n) or O(n) algorithms.
  • Rob Verger: In June of 2020, a pilot flying an F-22 in Alaska reportedly became disoriented, and the aircraft likely would have crashed were it not for the intervention of a software system on the fighter jet.
  • @JoeEmison: "Our technical debt is terrible and so we have to rewrite everything" and "We are going to use the same team we have been using to develop the code for years to do the rewrite" are wild things to say back to back but literally 100% of the time that’s what I hear.
  • @houlihan_rick: Small reads/writes are not good for S3 when they are happening frequently. It is much cheaper to use DynamoDB for high frequency access to small Items.
  • @KevinNaughtonJr: my typical day at Google: 9am - reverse a linked list 11am - count unique ways to climb a staircase with dp 12pm - lunch 3pm - help animal escape NxM matrix efficiently 4pm - invert a binary tree 5pm - commute home using Dijkstra's
  • @thdxr: company I advise moved to serverless from EKS - their old bill for just application hosting was ~$4000 a month. new bill is $70…cost savings with lambda can be counter intuitive. doing 60 requests a minute feels like a lot. if the average req takes 20ms to complete, you're only using 1.2s of that minute. 29 minutes a day. with a container, you pay for 24 hours to use it for 29min - 98% over provisioned
  • @bernhardsson: This might be controversial but I'm increasingly convinced that the future of infrastructure is to put all the infra inside the app code. What Pulumi & CDK does to infra but one step further and you just put everything together in code.
  • @royvanrijn: 💻 Developers: "I hate meetings" "Just let me code, I'll skip this one" "Sigh, not *another* meeting" ☹️ Also developers: "This project has no clear vision." "Management has no idea what's going on."
  • @jpetazzo: OH: "we have over 170 microservices, because our principal engineer is very knowledgeable about distributed systems"
  • Brent Mitchell: I have been using serverless computing and storage for nearly five years and I'm finally tired of it. I do feel like it has become a cult.
  • Neel Chauhan: I’m not the first person to tell you this, and certainly not the last, but Web3 is in fact centralized, just as Web2 was. Web3 is just a worse version of Web2.
  • @stewartbrand: Interesting: how much bad news is anecdotal and good news is statistical.
  • jedberg: I'm a huge proponent of microservices, having worked on one of the earliest and largest ones in the cloud. And I absolutely think that they provide huge advantages to large companies -- smaller teams, easier releases, independently scaling, separation of concerns, a different security posture that I personally think is easier to secure, and so on…every startup I advise I tell them don't do microservices at the start. Build your monolith with clean hard edges between modules and functions so that it will be easier later, but build a monolith until you get big enough that microservices is actually a win.
  • Hugo Rocha: Event-driven services and the ability to route events to specific service instances propose an elegant way to remove concurrency from a solution, by solving concurrency by design, we provide the foundations for a truly horizontally scalable system.
  • @colmmacc: Not to give away our super secret strategy but something that people often under appreciate about AWS is that new regions are by far the biggest needle mover for our business. The speed of light, sovereignty, and data locality all matter! Right now is a very interesting time at AWS because we're completing one of our key super-scaling journeys. When I joined, regions were pets, but we've been making them cattle. The numbers are always mind-blowing, but it's not something people cover or see a lot of.
  • @penberg: With V8 providing virtualization, deployments on edge effectively become either JavaScript or WebAssembly serverless functions. However, the next frontier in edge computing infrastructure is solving database access, which still has a latency problem. 7/
  • Yasemin Saplakoglu: Contrary to expectation, the synaptic strengths in the pallium remained about the same regardless of whether the fish learned anything. Instead, in the fish that learned, the synapses were pruned from some areas of the pallium — producing an effect "like cutting a bonsai tree," Fraser said — and replanted in others.
  • @houlihan_rick: Scale is enabled by efficiency, and efficiency lowers cost. Designing for scale usually means more dollars in your wallet.
  • @houlihan_rick: The most important lesson I learned at Amazon is that whenever you don't think you need to worry about designing for scale you are wrong.
  • @KarlBode: I'm sorry you can't ride your $2500 exercise bike because the server is down, welcome to the future
  • @muratdemirbas: A distributed system is a collection of autonomous nodes, communicating over a network to address a problem collectively, with no shared memory, with no common physical clock.
  • @Grady_Booch: The read latency of the @aws Elastic File System is 600 microseconds, The read latency of the first commercial hard drive - the  @ibm 350 Disk Storage Unit, released in 1956 - was 600 milliseconds.
  • @317070: Very happy to share my secret project for the last 3 years: we were busy getting AI to run a nuclear fusion reactor. Spoiler alert: it works! Deep reinforcement learning is pretty great at working with sci-fi things where human intuitions tend to break down.
  • Guy Sela: When DynamoDB is 10x Cheaper Than S3. Never assume that S3 would necessarily be cheaper than a NoSQL solution, look at the numbers for your specific use case and you might reach surprising results. Remember to compress data before calculating the size. Consider the simplicity of your solution and potential future use cases as part of your "hidden" cost calculation.
  • @ccsdevio: This is a short AWS appreciation thread. For two dollars a month (knock on wood), I have a site running on an EC2, behind a load balancer, served by CloudFront on a CDN with HTTPS, with automatic health checks performed 16 times a minute in 8 regions across the world. I have a cloudwatch alarm enabled that hits pagerduty if those healthchecks start failing, and another that stops the EC2 and hits pagerduty if CPU usage gets too high (
  • @mjtech01: Our children's @philips electric toothbrush contains a 32-bit ARM processor, clocked at 48MHz. I've even got the development tools for the Cypress PSoC 4, and they've nicely labelled the debug pads. It's just a toothbrush. Really.
  • hermannj314: If the contractor finds asbestos, he's allowed to quote me more to do the work I hired him for. If my mechanic says my car has old parts not in inventory, they may cost more and take longer to replace, or maybe they charge more to do the work. When the laws change, your lawyer gets billed to review and update your contracts. As new legal precedents are formed, new legalese is created and updated and contracts are renegotiated. Few things in the world are future proof for every conceivable externality. No one considers these professionals idiots simply because their domain has 'technical debt' too. It is ok to have technical debt, to mention it, to bill to fix it, and to talk about it. And you only look like a fool if you try to cover it up and pretend it isn't there because 'it's not the customer's problem'
  • Whitney Claven: "If I drop my coffee mug and it falls, I’d like to think that’s gravity," says Adhikari. "But, in the same way that temperature is not ‘real’ but describes how a bunch of molecules are vibrating, spacetime might not be a real thing. We see flocks of birds and schools of fish undertake coherent motion in groups, but they are really made up of individual animals. We say that the group behavior is emergent. It may be that something that arises out of the pixelation of spacetime has just been given the name gravity because we don’t yet understand what the guts of spacetime are."
  • @AdamHisley: Went from K8s to Lambda.  The main motivation at the time was devex/dev ownership.  K8s seemed to work best as a walled garden "platform."  Lambda was simpler for devs to grok and after a few months to pilot + write CDK constructs we had eng teams owning their infra provisioning.
  • @reubenbond: Any sufficiently large distributed system is always in a state of partial failure
  • @matthewpskelton: "Cloud versus Datacenter" is mostly missing the point. Is your infrastructure programmable via code? Is your infrastructure testable via code? "Cloud" is programmable as standard. That's why cloud is a force multiplier, not just because someone else is running things.
  • Steven Lemon: Ultimately, rather than separate our monolith into separate services, we started to break our solution into separate projects within the existing monolith. This division gave us a bit of additional structure and a better indication of where coupling and duplication existed, without the extra weight and challenges of microservices.
  • @BorisTane: if your team is building abstraction layers on top of managed cloud services in an effort to remain "vendor-agnostic", you're seriously doing this wrong instead, embrace the services at your disposal and use all their relevant features. don't drive the lowest common denominator
  • Geoff Huston: So perhaps an address these days is just an ephemeral transport token that distinguishes one packet flow from another, and really has little more in the way of semantic meaning. In this world what’s holding the Internet together is not IP addresses and routing: it's the DNS and the common name system.
  • temuze: Back at my old job in ~2016, we built a cheap homegrown data warehouse via Postgres, SQLite and Lambda. Basically, it worked like this: - All of our data lived in compressed SQLite DBs on S3. - Upon receiving a query, Postgres would use a custom foreign data wrapper we built. - This FDW would forward the query to a web service. - This web service would start one lambda per SQLite file. Each lambda would fetch the file, query it, and return the result to the web service. - This web service would re-issue lambdas as needed and return the results to the FDW. - Postgres (hosted on a memory-optimized EC2 instance) would aggregate. It was straight magic. Separated compute + storage with basically zero cost and better performance than Redshift and Vertica. All of our data was time-series data, so it was extraordinarily easy to partition.  Also, it was also considerably cheaper than Athena. On Athena, our queries would cost us ~$5/TB (which hasn't changed today!), so it was easily >$100 for most queries and we were running thousands of queries per hour.
  • jiggawatts: This is mere kilobytes of data over the length of the entire movie, and can tolerate hours of latency for most aspects (e.g.: analysis, charge-back, ML, etc...). A lot of the time I look at "modern cloud architectures" and I see a firehose of raw data being spewed out in hideously inefficient formats. Think 1 kilobyte of JSON to represent a metric, often a single number or data point. It's not just NetFlix, this is the default for most of Azure Monitor and its agents as an example that would affect a lot of people. The write amplification is just insanity. Orders of magnitude of overhead, not data. It's as if the entire world forgot that binary formats exist. Google in no small part benefits from remembering this lost art -- they default to gRPC and similar packed binary formats. One reason Kubernetes (written mostly by ex-Googlers) is so fast is because internally it's all compiled code and gRPC. Compared to, say, Azure itself it is ludicrously faster. Literally hundreds of times lower latencies for common operations.
  • Doug: So this is actually one of the nice things that I also really like about Concord is that if you take a single process and look at it, it’s not just a state machine. It’s actually a state machine that’s composed of smaller state machines who themselves might be composed of smaller state machines. What that means is that you can kind of draw your boundaries in lots of different places. You can take a single program that is maybe performing too slowly and split it into two smaller ones or you can take a swath of programs that all have similar kind of needs in terms of data and stitch them together, and it lets you play with the performance profiles of them in a really nice way.
  • The Victorian Internet: the construction of a global telegraph network was widely expected, by Briggs and Maverick among others, to result in world peace: ‘‘It is impossible that old prejudices and hostilities should longer exist, while such an instrument has been created for the exchange of thought between all the nations of the earth.’’...So much for universal peace and understanding. The telegraph was providing new ways to cheat, steal, lie, and deceive.
  • The Victorian Internet: BY THE EARLY 1870S, the Victorian Internet had taken shape: A patchwork of telegraph networks, submarine cables, pneumatic tube systems, and messengers combined to deliver messages within hours over a vast area of the globe. New cables were being laid all over the world. Malta had been linked to Alexandria in 1868, and a direct cable was laid from France to Newfoundland in 1869. Cables reached India, Hong Kong, China, and Japan in 1870; Australia was connected in 1871, and South America in 1874. In 1844, when Morse had started building the network, there were a few dozen miles of wire and sending a message from, say, London to Bombay and back took ten weeks. But within thirty years there were over 650,000 miles of wire, 30,000 miles of submarine cable, and 20,000 towns and villages were on-line—and messages could be telegraphed from London to Bombay and back in as little as four minutes. ‘‘Time itself is telegraphed out of existence,’’
  • Cal Newport: The Deep Work Hypothesis: The ability to perform deep work is becoming increasingly rare at exactly the same time it is becoming increasingly valuable in our economy. As a consequence, the few who cultivate this skill, and then make it the core of their working life, will thrive.
  • Matt Rickard: Early-stage startups shouldn't run on Kubernetes yet. But eventually, growth-stage and large companies should be running on Kubernetes in some form. Kubernetes Maximalism doesn't mean one-size-fits-all. Infrastructure should progressively grow with your workloads and team. How can you choose the right technology now so that you can maximize growth and minimize pain later when you inevitably outgrow it?
  • CharlieDigital: The purpose of a startup isn't to play around with tech: it's to find product-market fit and zero in on the business value. If you think of a startup as a place to play around with tech, you have no place in a startup. You should almost always choose the "dumbest" possible technology used in the "dumbest" possible way so that you can hire anyone and they can step in and be productive when you need to scale. It minimizes ramp up time. It minimizes deployment and operational complexity. It minimizes the ways the system can fail. Literally build the dumbest thing that solves the core business problem that creates value for the user and then figure it out from there.

Useful Stuff:

  • Hybrid Cloud is BS? Maybe not if you’re Walmart. Walmart Amps Up Cloud Capabilities, Reducing Reliance on Tech Giants:
    • developed the capability to switch seamlessly between cloud providers and its own servers, saving millions of dollars
    • hybrid cloud allows us to be able to draw the best that the public cloud providers can offer and to be able to combine that with something that is really purpose-built for us,
    • 10,000 edge nodes across a network of servers that sit at Walmart stores and in its distribution facilities; it has also built custom software that allows it to run its back end operations across any cloud system,
    • Walmart said its cloud system has saved the company between 10% and 18% in annual cloud spending and has helped reduce cloud outages
    • They didn't say how the money was saved. And is that worth the CAPEX that must have been invested in the development of such a complex system?
    • @forrestbrazeal: 5 things I've learned the hard way about multi-region / multi-cloud: 1. Building your own control plane, whether cross-region or cross-cloud, involves cost and complexity that you will have to deal with every day - not just when a managed service has downtime…And a final reminder: No matter how many clouds, regions, or zones you use, the ultimate accountability for building systems that are resilient when you need them always rests with you.

  • It's relations all the way down. Carlo Rovelli defined relational quantum mechanics (RQM) as meaning something like: all the properties of a physical object exist only in relation to another physical object. This resonated with me because after years of deep diving into relational and object modeling, I came to a similar conclusion in the realm of software. When I was teaching a relational database class, the typical way of introducing normalization was as a way of reducing update anomalies. But when you consider the nature of relations it's deeper than that. The only thing that holds a relation together is identity. All other properties are a product of some relationship. Even a natural property like created_at is a relationship with time. Same with location, and every other property. In practice, that doesn't matter because relational models are for solving problems, not for plumbing the mysteries of the itness of what something actually is. And if you've ever been in a design meeting, conversations often turn into more of a philosophical inquiry than a practical problem solving session. When people say something is "over designed" this is often the reason. We turn design into a psychoanalytic therapy session with our client. We plumb their depths searching for insight. In the end time runs out, so we slap some properties in a table, add some foreign keys, make some queries, and call it a day, knowing that whatever we did missed the mark. Of course the ground shifts later because the model never matches the territory, and epicycles keep being added to get the right result as defined by practicality. Object modeling produces the same insights. In an object model what exists is object identity. All other ISA, HASA, USES decisions are ultimately relational. This is also why no two people or groups will produce the same design for the same problem. They're grabbing different parts of the elephant. 
This relates to RQM because the essential idea behind RQM is that different observers may give different accurate accounts of the same system. Most of what we call technical debt stems from the fact that nothing in software has a true essence. Nothing we build is built on a true and everlasting nature. Our conceit is that we are building something based on a deeper truth, but when we look at something it's from a prism of relations, none of which are objectively right from a software perspective. Whole networks of relations shift over time as problems change, people change, technology changes, understanding changes, as everything changes from that heroic point in time when a human mind took a stand, created code, and moved on. The antidote is said to be simplicity. But in the same way Zen and the Art of Motorcycle Maintenance rediscovers Taoism through the concept of quality, simplicity is just another relation in the vast possibility space of relations. 

  • Triple OG Stack Overflow still marching to the beat of their own servers. Engineering Stack Overflow with Roberta Arcoverde:
    • 1.3B page views / month
    • we have a code base that is 14 years old. We run on prem, we run in our own data center. We haven't gone to the cloud. We also have a monolithic application, so we have not broken it down into services or microservices. We have a multi-tenant, .NET-based web app running in a single app pool across nine web servers only. And that same application, a single app pool on IIS, is handling 6,000 requests per second at the moment
    • But so far, for our engineering teams working on the Q&A platform, I think we are at around 50 engineers at this point.
    • we could, in theory, be running on a single web server. We wouldn't want to do that, but theoretically it would be possible.
    • we always start from asking the question, what problem are you trying to solve? And in our case, the problems that these tools and bandwagons try to solve are not problems that we were facing ourselves. So when you think about things like a monolith, for example, right? Why do you break down a monolith into microservices or services? Typically because you want to scale to separate teams.
    • You want to have multiple teams working on the same project without stepping on each other's toes. You want to have fast deploys, for example. Fast deploys have never been a problem for us. We put Stack Overflow in production multiple times a day in four minutes. That's the time it takes to build and deploy Stack Overflow. If we had to revert a change, for example, it was always super fast. In a matter of minutes, we could revert our deployments, even though it is still a monolith, not a bunch of really small applications, but a single big application, and invest in efficiency on those things.
    • And it's important to keep in mind that Stack Overflow was built to scale that way. Right? We are designed for low latency. We are designed to grab requests, execute a few queries, and return as soon as possible so that we can pick up the next one. We cannot handle a lot of memory pressure, so we also designed for low allocations everywhere. We try to avoid creating objects that will have to be collected very often. We try to avoid memory pressure on those nine web servers so that we don't have to stall on garbage collections, because a stall on a garbage collection is terrifying for these web servers. So we try to run smoothly and with a very low memory footprint, because that's how that scalability model was designed to work on the infrastructure that we have.
    • 80% of our traffic is anonymous. People go to the question show page, which is the page that we show when you are Googling something and you get a result that takes you directly to stack overflow the solution to your problem, right
    • we have two different levels of cache on the front: in memory on the web servers, and we also have our SQL Server machines. They have 1.5 terabytes of RAM, so a third of the entire database can be accessed very quickly in RAM. @alexcwatt: giving SQL more RAM is better than caching page fragments with Redis.
    • So we removed all caching, and that was like three or four years ago. We stopped caching that page. We stopped caching the content, and little did we know, it didn't really have any measurable effect on performance. We were still able to handle requests and send responses super fast because of how the architecture was built at the time. So currently, for example, the average time to render on that page is around 20 milliseconds. So even without caching.
    • we did this regular exercise where we would try to understand how much it would cost to run Stack Overflow in the cloud. And it was just never worth it, which is why we never did it. These days, when we think about the cloud, we are thinking less about the power that it would take and more about latency. The other thing that we optimize for a lot with the current design is low latency. So we have an infrastructure that has single hops between nodes, and those hops are connected via 10-gigabit network links, and Nick Craver will tell you all about it.
    • Well, pragmatism is what I'm hearing. Just comes down to productivity and pragmatism. Those are the words of the day.
    • We have rolling builds. So we have those nine web servers. They are behind an HAProxy front end, and every time that we need to deploy a new version of the site, we take a server out of rotation, update it there, and put it back in rotation.
    • Shopify also runs on monolith. Deconstructing the Monolith: Designing Software that Maximizes Developer Productivity.
    • Dan Luu: Wave is a $1.7B company with 70 engineers whose product is a CRUD app that adds and subtracts numbers. In keeping with this, our architecture is a standard CRUD app architecture, a Python monolith on top of Postgres. Starting with a simple architecture and solving problems in simple ways where possible has allowed us to scale to this size while engineers mostly focus on work that delivers value to users.
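The rolling-build scheme described above (pull one server out of the HAProxy rotation, update it, put it back) can be sketched as a small simulation. The server names, version labels, and the bare dictionary-assignment "deploy" step are all hypothetical stand-ins, not Stack Overflow's actual tooling:

```python
# Sketch of a rolling deploy behind a load balancer: update one server at a
# time so the pool never drops below N-1 servers in rotation.
def rolling_deploy(servers, new_version, versions, in_rotation):
    """Deploy new_version to every server, one at a time.

    Returns the minimum number of servers that were in rotation at any point.
    """
    min_live = len(in_rotation)
    for server in servers:
        in_rotation.discard(server)                # drain: stop routing traffic
        min_live = min(min_live, len(in_rotation))
        versions[server] = new_version             # update the drained server
        in_rotation.add(server)                    # put it back in rotation
    return min_live

servers = [f"web{i}" for i in range(1, 10)]        # nine web servers
versions = {s: "v1" for s in servers}
in_rotation = set(servers)
min_live = rolling_deploy(servers, "v2", versions, in_rotation)
```

The point of the invariant is why a monolith can still deploy in minutes with zero downtime: only one of nine servers is ever out of rotation.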

  • But Airbnb moved from a monolith to a microservices. Results? Mixed. Airbnb’s Microservices Architecture Journey To Quality Engineering.
    • Main philosophy: architecture shouldn't impede business growth.
    • 2008-2017. They used a monolith and monorepo until they reached about $2.6B in revenue, so you have some runway before you need to switch. At first, small teams worked on features with limited dependencies.
    • What problems did they have? Velocity of software changes decreased. Hard to work on features in parallel. Component ownership is confused.
    • 2017-2020. Microservices. The monorepo was split with a dedicated service migration team responsible for component transition. Yes, it helps to have resources.
    • They reached $5B in revenue, but features are a cross-cutting concern, so implementing a feature required changes in multiple services involving multiple teams. Fine-grained services distribute complexity across services, making it hard to collaborate across teams, and making even small changes disruptive.
    • Thrift was replaced by GraphQL as a unified data access layer. Annotations and codegen were used to improve the code. Love me some codegen.
    • 2020- present. Micro + Macroservices.
    • The main changes were unified APIs, central data aggregator, and service block facade APIs. It's not exactly clear what this means at a code and team level.
    • At this point people will say the problem was they weren't doing microservices right. This is the self-fulfilling prophecy problem every philosophy has. If it didn't work, you weren't doing it right. In practice right-sizing services is always a problem because there are rarely natural service boundaries.
    • Keep in mind a monolith should still be service based on the inside; it shouldn't just be a tangle of code. So in theory, splitting a monolith should just be a matter of taking those already existing service APIs and making them remote. Ownership, or lack thereof, wouldn't have to change. And I think the monorepo is a misdirection. What matters is code organization; the number of repos means nothing.

  • Videos from Strange Loop now available.

  • Which cloud rules them all?
    • According to the Cockroach Labs 2022 Cloud Report the answer is…nobody. All three cloud providers offer price-competitive options. While there are no winners, they have some good advice on making choices:
    • Thanks to AMD, GCP and Azure instances are outpacing AWS instances. GCP’s t2d instance type (which uses AMD Milan processors) took the top overall spot for both large and small nodes in our OLTP testing.
    • For even relatively small amounts of persistent block storage, the cost of running a particular workload is much more influenced by the cost of the storage than it is the cost of the instance. For persistent workloads, it is extremely important to optimize cost calculations based on this consideration. This means preferring mid-tier storage options (pd-ssd on GCP, gp3 on AWS, premium-disk on Azure) unless your workload requires either an extremely high number of IOPS or very low storage latency.
    • In both the OLTP and CPU benchmarks, we saw a per-vCPU performance advantage to running on smaller instances of a particular instance type, regardless of CPU platform, cloud, or instance type.
    • while you may save a bit of money by choosing instance types with a lower vCPU to RAM ratio, you will likely see more consistent performance from instance types with more available memory, at least for the purposes of running CockroachDB.
    • How do their oligopolistic pricing policies compare? Cloud Pricing Comparison: AWS vs. Azure vs. Google Cloud Platform in 2022. Of course, pricing is too complex to compare directly, but you know bandwidth will always break the bank.
    • And with Announcing Pub/Sub: Programmable MQTT-based Messaging, Workers, KV, Durable Objects, and R2, CloudFlare is making a case to become a simpler AWS for developers. As is Fastly: Fastly + Fanout: why real-time messaging and edge computing are an amazing combination.
    • @NayutSitachitt: 1/ Understanding AWS - A perspective from a manager who has $AMZN as one of the largest positions in the portfolio for over 7 years. In the thread, I will also discuss $MSFT's Azure, $GOOG's GCP. Let's dive in.
    • @mike_julian: A friend asked me recently what I think the big things are that make GCP, Azure, and AWS different from each other. I think the biggest thing is how they view customers. GCP, organizationally, seems to have disdain for customers. What little I've seen from Azure toward customers can best be characterized as "inaction". AWS has the most interesting--and rare!--approach to customer feedback of all.
    • Daniel Compton: The main alignment here is Google adding data transfer charges to match AWS and breaking architectural promises they’ve made to their customers in the process. This is an incredibly short-sighted move and will damage customer trust in Google Cloud for many years.
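The Cockroach Labs point about persistent block storage dominating workload cost can be made concrete with a back-of-the-envelope calculation. All prices below are made-up illustrative numbers, not actual AWS/GCP/Azure rates:

```python
# Illustrative (invented) monthly prices to show how storage cost can dwarf
# instance cost for a persistent workload.
INSTANCE_PER_VCPU_MONTH = 25.0   # hypothetical $/vCPU/month
TOP_TIER_SSD_PER_GB = 0.17       # hypothetical top-tier (io2-like) $/GB/month
MID_TIER_SSD_PER_GB = 0.08       # hypothetical mid-tier (gp3-like) $/GB/month

def monthly_cost(vcpus, storage_gb, ssd_per_gb):
    """Total monthly cost = compute portion + block-storage portion."""
    return vcpus * INSTANCE_PER_VCPU_MONTH + storage_gb * ssd_per_gb

# A modest 8-vCPU node carrying 10 TB of persistent data:
top_tier = monthly_cost(8, 10_000, TOP_TIER_SSD_PER_GB)  # 200 + 1700 = 1900
mid_tier = monthly_cost(8, 10_000, MID_TIER_SSD_PER_GB)  # 200 + 800 = 1000
storage_portion = 10_000 * MID_TIER_SSD_PER_GB           # 800 > 200 compute
```

Even at the cheaper tier, storage is the majority of the bill, which is the report's argument for preferring mid-tier disks unless you truly need the IOPS.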

  • Or should you just go bare metal?
    • @martin_casado: Three growth startups ($20m ARR+) in the last week have mentioned to me that part of their operations are on Equinix bare metal with K8. Perhaps a sampling anomaly. But I do hear this more and more.
    • @jrhunt: Riot's and Netflix's talks from re:invent last year highlight some of the key developments in maintaining local infra + cloud infra. Reality is that local infra gives the latency wins. The cost effective performance wins are good on paper but the real win is on latency.
    • @manishrjain: So much money saved. 2 month rental of an AWS server = outright purchase of an equivalent server. Hardware has advanced enough to not break as much, and be swapped easily. And k8s has leveled the playing ground for dealing with fault tolerance. Makes sense to skip public cloud…at Dgraph, we were able to set up the entire "first" rack, full of servers, with storage and redundancy, with k8s -- with 2 SREs working part time over 2 months. Second time would have been much faster. The website, the staging, the QA, and some prod were all running there.
    • @alex_saroyan: One of our customers got 40 racks across 5 locations. Only on egress traffic they save 90% compared to their early days in the public cloud.
    • @subhashchy: Our core platform runs on bare-metal servers and a lot of edge locations are powered by Equinix Metal. The only drawback of BM was very slow speed of delivery and no on-demand scaling
    • @chirayupatel: Bunch of alternatives here. We run server clusters on OVH with SSD only storage and containers. An order of magnitude cheaper than AWS + no more VM's, gp1, gp2, IOPs etc.
    • @colemanscott1: Heard several enterprises going multi-cloud are using collo providers as centers of gravity to host operational capabilities of their choosing vs. a tangle of native services across clouds. Combo of cost, capability, and end-run on tech debt. Also a small sample.
    • @GirlwDaMostCake: Executives that think everything needs to run in the cloud, regardless of cost. We run the equivalent of 300 p2 larges steady state every day. Plus we do our model training in OCI and some on prem as well.

  • One question about Apple’s AR/VR ambitions is: do they have the technical chops to pull off a global network of one billion users using VR? Can they do the backend? Here’s how Meta is planning to handle their cloud gaming infrastructure, in Under the hood: Meta’s cloud gaming infrastructure:
    • we rely on edge computing, where we bring the cloud gaming infrastructure closer to players by deploying in edges that are in metropolitan areas close to large populations.
    • we partnered with NVIDIA to build a hosting environment on top of NVIDIA Ampere architecture-based GPUs
    • we use Twine, our cluster management system, on top of our edge computing operating system. We build orchestration services to manage the streaming signals and use Twine to coordinate the game servers on edge.
    • We built and used container technologies for both Windows and Android games.
    • we landed on WebRTC with Secure Real-Time Transport Protocol (SRTP) technology as our solution to streaming user inputs and video/audio frames for games.
    • Whenever a player clicked to perform an action in a game (e.g., to make their character jump), we captured the click event and sent it to the server, where the game received the event. The game then rendered a frame that contained the result of that action (i.e., the character jumping). We captured that rendered frame, copied it out, and encoded it using a video encoder. The frame was then packetized so it could fit into a User Datagram Protocol (UDP) packet and sent through the network to the player. Finally, on the player’s side, there was a jitter buffer to help smooth play as the packets came in. We decoded the packets into frames, then rendered the frame for the player. Our current model significantly improves on this and improves latency using GPU encoding.
    • On the networking side, having edge computing sites close to the player also helps reduce video and audio latency — even more significantly than the savings from moving the entire streaming pipeline to GPUs.
    • Over on the player side, we also now use hardware decoding to reduce the decode time.
    • Meanwhile, Microsoft says 10 million users have streamed games on Xbox Cloud Gaming.
    • What is Edge Compute?:
    • loudmax: Stadia's problem was never technical, it was entirely due to bad management. The fact that they got games to play reasonably well at reasonably high resolutions is an impressive achievement, and it set the foundation to build a dominant gaming platform appealing to everyone. Then Stadia's management figured out a way to emphasize all of Stadia's weaknesses and play down its advantages and price it in such a way to appeal to nobody.
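The packetize/reassemble step in the streaming pipeline above (splitting an encoded frame into UDP-sized packets, then reassembling them on the player's side of the jitter buffer) can be sketched roughly. The 1,200-byte payload size is a commonly used WebRTC-ish choice assumed here, not a figure from Meta's post:

```python
MTU_PAYLOAD = 1200  # assumed maximum bytes carried per UDP packet

def packetize(frame_id: int, frame: bytes):
    """Split one encoded frame into (frame_id, seq, total, chunk) packets."""
    chunks = [frame[i:i + MTU_PAYLOAD] for i in range(0, len(frame), MTU_PAYLOAD)]
    total = len(chunks)
    return [(frame_id, seq, total, chunk) for seq, chunk in enumerate(chunks)]

def reassemble(packets):
    """Jitter-buffer-style reassembly: order by sequence number, then join."""
    packets = sorted(packets, key=lambda p: p[1])
    total = packets[0][2]
    if len(packets) != total:
        return None  # lost packets: this frame cannot be rendered
    return b"".join(chunk for _, _, _, chunk in packets)

frame = bytes(range(256)) * 20   # a fake 5,120-byte encoded video frame
pkts = packetize(7, frame)       # 5 packets: 4 full + 1 partial
```

Packets can arrive out of order over UDP, which is why each one carries its sequence number and the total count for the frame.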

  • Serverless architecture for optimizing Amazon Connect call-recording archival costs. Long-term retention requirements cost a lot in storage. But you probably don’t have to keep data stored in its original high-fidelity encoding. Recode and downsample it to save big bucks.
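As a rough illustration of why recoding pays off (the formats and bitrate below are illustrative assumptions, not Amazon Connect's actual encodings): telephone-quality PCM at 8 kHz, 16-bit, stereo costs 32 KB per second, while a speech codec at 16 kbps costs 2 KB per second, a 16x reduction:

```python
def pcm_bytes_per_sec(sample_rate_hz: int, bytes_per_sample: int, channels: int) -> int:
    """Raw PCM storage rate: samples/sec * bytes/sample * channels."""
    return sample_rate_hz * bytes_per_sample * channels

def compressed_bytes_per_sec(bitrate_bps: int) -> int:
    """Storage rate of a constant-bitrate compressed stream."""
    return bitrate_bps // 8

original = pcm_bytes_per_sec(8_000, 2, 2)    # 32,000 B/s of raw PCM
recoded = compressed_bytes_per_sec(16_000)   # 2,000 B/s at a 16 kbps speech rate
savings = 1 - recoded / original             # fraction of archival storage saved
```

For a one-hour call that is roughly 115 MB of PCM versus 7 MB recoded, which compounds quickly under multi-year retention requirements.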

  • Information from Google I/O 2022 is now available.

  • Meta on Transparent memory offloading: more memory at a fraction of the cost and power:
    • Transparent memory offloading (TMO) saves 20 percent to 32 percent of memory per server across millions of servers in our data center fleet
    • But alternative technologies such as NVMe-connected solid state drives (SSDs) offer higher capacity than DRAM at a fraction of the cost and power. Transparently offloading colder memory to such cheaper memory technologies via kernel or hypervisor techniques offers a promising approach to curb the appetite for DRAM.
    • LinkedIn is doing something similar. Scaling storage in the datacenter with RoCE: we’ll discuss how we solved this challenge by providing a flexible, scalable, performant and reliable block storage service using Remote Direct Memory Access (RDMA).
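A toy model of the offloading idea, assuming a simple "idle longer than N seconds" coldness rule (TMO's actual kernel heuristics, based on pressure stall information, are considerably more involved):

```python
def offload_cold_pages(last_access, now, idle_threshold_s):
    """Partition pages into DRAM-resident (hot) and SSD-offloaded (cold).

    last_access maps page id -> timestamp of last access, in seconds.
    """
    hot = {p for p, t in last_access.items() if now - t < idle_threshold_s}
    cold = set(last_access) - hot
    return hot, cold

# Page id -> last access time; values are made up for illustration.
last_access = {0: 99.0, 1: 12.0, 2: 98.5, 3: 40.0, 4: 5.0}
hot, cold = offload_cold_pages(last_access, now=100.0, idle_threshold_s=30.0)
dram_saved = len(cold) / len(last_access)  # fraction of DRAM freed up
```

The economics work because the cold pages keep their capacity on NVMe at a fraction of DRAM's cost and power, at the price of a slower path when one is touched again.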

  • If it doesn’t go to zero it’s not serverless. Period.
    • Serverless Aurora v2 still isn’t serverless.
    • @jeremy_daly: There are two major problems with Serverless Aurora v2 that fundamentally miss the mark of true #serverless(ness). One is the inability to auto-pause (as Mike mentions), and the other is the missing Data API. These might make it unusable (for now). ☹️ Some thoughts… 🧵
    • But not all agree: Datadog reports on The State of Serverless. In any case serverless is still growing.
    • Let’s Architect! Serverless architecture on AWS
    • Build a Twitter Leaderboard app with Redis and AWS Lambda (part 1)
    • Dynamic Kubernetes Cluster Scaling at Airbnb: An important part of running Airbnb’s infrastructure is ensuring our cloud spending automatically scales with demand, both up and down. Our traffic fluctuates heavily every day, and our cloud footprint should scale dynamically to support this. To support this scaling, Airbnb utilizes Kubernetes, an open source container orchestration system. We also utilize OneTouch, a service configuration interface built on top of Kubernetes
    • Paul Johnston: Serverless systems should be (but rarely are) thought of as a large number of decoupled but very small systems, that are completely independent of each other, connected by events.
    • @simonw: "Serverless means per-request billing" is my new favorite definition of that term. I'd previously been using "scale to zero" but that always requires extra explanation. "Per-request billing" contains everything you need to know.
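The "per-request billing" definition above can be illustrated with a quick cost model. The prices here are invented for illustration, not any provider's actual rates; the property that matters is that the serverless bill goes to zero with zero traffic while the always-on instance bills regardless:

```python
# Illustrative (made-up) prices contrasting per-request billing with an
# always-on instance.
ALWAYS_ON_PER_HOUR = 0.10    # hypothetical instance $/hour
PER_REQUEST = 0.0000002      # hypothetical $/request

def always_on_cost(hours: float) -> float:
    """An instance bills for every hour it exists, busy or idle."""
    return ALWAYS_ON_PER_HOUR * hours

def serverless_cost(requests: int) -> float:
    """Per-request billing: no requests means no bill (scale to zero)."""
    return PER_REQUEST * requests

MONTH_HOURS = 730
idle = serverless_cost(0)                # the scale-to-zero case: $0
busy = serverless_cost(500_000_000)      # 500M requests in the month
fixed = always_on_cost(MONTH_HOURS)      # same bill at any traffic level
```

At zero traffic, serverless wins outright; at sustained high traffic, the per-request model can cost more than the fixed instance, which is the usual break-even argument.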

  • I think IP is the biggest value creator. Erik Bernhardsson: If you rank every major software innovation (OOP, containers, version control, deep learning, consensus algorithms, asymmetric encryption, whatever...) by the value it created, the relational database would come out at the top and it's not even close

  • 6 months of effort went into one article and it takes a mere 45 minutes to read. It's the asymmetry of creation that burns people out. Scaling containers on AWS in 2022.
    • Fargate is now faster than EC2. I am seeing a massive migration to ECS on Fargate — it's so much easier!
    • ECS on Fargate improved so much and is the perfect example for why offloading engineering effort to AWS is a good idea
    • App Runner is on the way to becoming a fantastic service
    • Up to a point, EKS on Fargate is faster than EKS on EC2
    • EKS on EC2 scales faster when using karpenter rather than cluster-autoscaler, even in the worst possible scenario
    • EKS on EC2 is a tiny bit faster when using IPv6
    • Lambda with increased limits scales ridiculously fast. With a legitimate workload and increased limits, as we just saw, we are now living in a world where we can instantly get 150 TB of RAM and 90,000 vCPUs for our apps
    • What I am seeing most often with my customers, is the largest part of the worker plane using EKS on EC2 Spot which is good enough for most applications.
    • bleuarff: at work we recently experienced traffic spikes like 35-fold in the span of 30 seconds for a few minutes, then faded regularly for a few hours. Our scaling was too slow to be of any use for such pattern. There are legitimate needs for things that scale as fast as possible.

  • It's a databasepalooza.
    • Redis vs. KeyDB vs. Dragonfly vs. Skytable: Redis: 112,100 / 99,892; KeyDB: 288,931 / 282,997; Dragonfly: 408,322 / 392,446; Skytable: 619,992 / 676,091.
    • Dragonfly: a modern in-memory datastore, fully compatible with Redis and Memcached APIs. Dragonfly implements novel algorithms and data structures on top of a multi-threaded, shared-nothing architecture.
    • Even more on Distributed Databases Compared.
    • Cloudflare released D1, their SQL database built on SQLite.
    • Anna: A Key-Value Store For Any Scale: shows how a key-value storage (KVS) system can be architected to scale across many orders of magnitude. (Spoiler: Anna can give you only up to causal consistency; it cannot provide strong consistency at the key level, and nothing stronger than read-committed at the multi-key level.)
    • fabian2k: I still don't see how [PlanetScale's] pricing model is usable in a database with a query planner. The user is not in control over the query plan, the database is. This is a recipe for disaster, and I'd never feel comfortable with this. Even with a good understanding of how the database works there is pretty much no way to ensure that the database doesn't do some surprising full table scans.
    • tigerbeetle: a financial accounting database designed for mission critical safety and performance to power the future of financial services.
    • Warp: Lightweight Multi-Key Transactions for Key-Value Stores: a simple yet powerful idea to provide efficient multi-key transactions with ACID semantics on top of a sharded NoSQL data store.
    • PranaDB: a distributed streaming database, designed from the outset to be horizontally scalable.
    • FrostDB: an embeddable columnar database written in Go.
    • FoundationDB: A Distributed Unbundled Transactional Key Value Store (Sigmod 2021)

  • This is a deep cut. Graviton 3: First Impressions:
    • The design seems to be very narrowly targeted at maximizing compute density in the cloud. To that end, AWS has chosen very conservative core clocks. At 2.6 GHz, Graviton 3 clocks a mere 100 MHz over its predecessor, Graviton 2…Such a core running at 2.6 GHz on 5 nm should absolutely sip power. That in turn lets AWS pack three of these chips into a single node, increasing compute density. The final result is a chip that lets AWS sell each Graviton 3 core at a lower price, while still delivering a significant performance boost over their previous Graviton 2 chip.
    • James Hamilton also has some thoughts: Graviton3 has the performance needed for modern compute intensive applications such as gaming, video encoding, machine learning, and high-performance computing while also delivering the power-performance and price-performance customers seek for less demanding, general-purpose workloads. Formula 1 ran their Computational Fluid Dynamics workloads on C7g and found them to deliver 40% better performance than C6gn. Honeycomb.io, an early adopter of Graviton, saw a 35% performance improvement and a 30% reduction in latency compared with C6g for their telemetry ingestion workload.

  • Good to see language flame wars are still a thing, oddly though this one doesn’t make much of a case. Building a Cloud Database from Scratch: Why We Moved from C++ to Rust:
    • As an early-stage database startup, we entirely deleted our C++ codebase after 7 months of development and rewrote everything from scratch in Rust programming language.
    • Rust guarantees zero-cost abstractions, so performance isn't capped, and it doesn't require runtime garbage collection, so latency spikes potentially caused by memory management stay under our control
    • Rust also has downsides: a fragmented async ecosystem, cumbersome error handling, and insufficient support for AsyncIterator
    • Rust: A Critical Retrospective: This next point dovetails into why Rust is not yet suitable for a fully auditable kernel: the language isn’t finished.
    • Lies we tell ourselves to keep using Golang: Or you can be horrified, as you realize that those complex problems only exist because Go is being used. Those complex problems would not exist in other languages, not even in C

  • Small Datum with a great explanation of how RocksDB efficiently manages access to its block cache with mutexes. RocksDB internals: LRU.
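The core mechanism is easy to sketch: an LRU list guarded by a mutex. The toy Python below is not RocksDB's implementation (RocksDB shards its block cache into many LRU lists, each with its own mutex, precisely to cut lock contention); it only illustrates the basic get/put-with-eviction pattern behind it:

```python
from collections import OrderedDict
from threading import Lock

class ConcurrentLRU:
    """Toy LRU cache guarded by one mutex. A real block cache shards
    itself into many such lists, each with its own mutex, so that hot
    lookups don't all contend on a single lock."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._lock = Lock()
        self._entries = OrderedDict()  # insertion order doubles as recency

    def get(self, key):
        with self._lock:
            if key not in self._entries:
                return None
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]

    def put(self, key, value):
        with self._lock:
            self._entries[key] = value
            self._entries.move_to_end(key)
            if len(self._entries) > self.capacity:
                self._entries.popitem(last=False)  # evict least recently used

cache = ConcurrentLRU(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.put("c", 3)  # cache is full, so the LRU entry "b" is evicted
```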

  • Good advice from Stack Overflow on how to limit damage from aggressive DDOS attacks targeting expensive SQL queries. Stack under attack: what we learned about handling DDoS attacks:
    • Insist that every API call be authenticated. This will help identify malicious users.
    • Minimize the amount of data a single API call can return.
    • Rate-limit all API calls.
    • Filter malicious traffic before it hits your application.
    • Block weird URLs.
    • Block malicious IPs even if legitimate traffic can originate from them.
    • Automate your blocklist.
    • Tar pitting is a great way to slow down botnets and mitigate volume based attacks.
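Several of those mitigations (rate-limit all API calls, tar pitting) reduce to giving each client a budget and refusing calls once it's spent. A minimal token-bucket sketch in Python, with made-up rates, looks like this:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter; the rates are illustrative.
    In practice each client gets its own bucket, keyed by API credential."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec   # steady-state refill rate
        self.capacity = burst      # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over budget: reject, or tar-pit the caller

bucket = TokenBucket(rate_per_sec=10, burst=3)
results = [bucket.allow() for _ in range(5)]  # burst of 3 allowed, then throttled
```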

  • It turns out when you turn a row oriented OLTP database into a columnar OLAP database, optimizations are possible. AlloyDB for PostgreSQL under the hood: Columnar engine.
    • But that's not the interesting part. Google has long advocated the use of ML over configuration to improve software. And here it is: The embedded machine learning in AlloyDB makes accessing this technology easier than ever. AlloyDB automatically organizes your data between row-based and columnar formats, choosing the right columns and tables based on learning your workload, and converting them to columnar format automatically. The query planner smartly chooses between columnar and row-based execution plans, so transactional performance is maintained. This allows AlloyDB to deliver excellent performance for a wide range of queries, with minimal management overhead.
    • @ahachete: I'm impressed with @googlecloud's AlloyDB, both in terms of tech and pricing. An example of the latter, for 4vCPU + 32GB RAM, hourly, in EU West: * AlloyDB: $0.685 * Aurora Postgres on-demand (db.r6g.xlarge): $0.519 + I/Os * Aurora RI, no upfront: $0.340 + I/Os
    • Machine Learning Operations (MLOps): Overview, Definition, and Architecture
    • Designing Arithmetic Circuits with Deep Reinforcement Learning

  • Like Facebook before it with PHP, Shopify is taking their favorite language Ruby, and dragging it into the future. Amazing what you can do with money and talent. Shopify Invests in Research for Ruby at Scale.

  • You might imagine Pinterest has a lot of read traffic and so they would be experts at caching. You would be right. Improving Distributed Caching Performance and Efficiency at Pinterest.
    • Pinterest leverages both synthetic load generation and production shadow traffic to evaluate the impact of tuning and optimizations.
    • While memcached can be arbitrarily horizontally scaled in and out to address a particular cluster’s bottleneck, vertically scaling individual hardware dimensions allows for greater cost efficiency for specific workloads.
    • using extstore to expand storage capacity beyond DRAM into a local NVMe flash disk tier increases per-instance storage efficiency by up to several orders of magnitude, and it reduces the associated cluster cost footprint proportionally.
    • Approximately half of all caching workloads at Pinterest are compute-bound (i.e. purely request throughput-bound). Successful optimizations in compute efficiency translate into the ability to downsize clusters without compromising serving capacity.
    • At Pinterest, most workloads (including the distributed cache fleet) run on dedicated EC2 virtual machines.
    • running memcached under a real-time scheduling policy, SCHED_FIFO, with a high priority — instructing the kernel to, effectively, allow memcached to monopolize the CPU by preempting other work. This one-line change, after rollout to all compute-bound clusters, drove client-side P99 latency down by anywhere between 10% and 40%, in addition to eliminating spurious spikes in P99 and P999 latency across the board.
    • TCP Fast Open (TFO) is a mechanism for reducing the latency overhead of establishing a TCP connection by optimizing away one RTT in an otherwise costly TCP 3WHS (3-way handshake), while also allowing eager transmission of early data during the handshake itself.
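On Linux, enabling TFO on a server socket is a single setsockopt, though whether it actually takes effect also depends on the kernel's net.ipv4.tcp_fastopen sysctl. A hedged, best-effort sketch:

```python
import socket

def listen_with_tfo(port, qlen=5):
    """Best-effort TCP Fast Open on a listening socket.

    Returns (socket, enabled). TFO needs Linux with the TCP_FASTOPEN
    option; even then the kernel's net.ipv4.tcp_fastopen sysctl must
    allow it, so failures are swallowed rather than treated as fatal.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", port))  # port 0 lets the OS pick one
    enabled = False
    if hasattr(socket, "TCP_FASTOPEN"):
        try:
            # qlen caps the number of pending TFO connection requests
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_FASTOPEN, qlen)
            enabled = True
        except OSError:
            pass
    sock.listen()
    return sock, enabled
```

Clients then need to send early data with sendto(MSG_FASTOPEN) style calls for the saved round trip to materialize; the server-side option alone only advertises the capability.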

  • Does Laravel scale?:
    • When asking that question you might begin by defining what Laravel is: a PHP web application framework with expressive, elegant syntax.
    • @usefathom runs Laravel on infrastructure that can handle over 157 billion requests per month.
    • It starts off saying you don't really need to scale and your benchmarks are crap, so nothing yet.
    • Then the rest is good yet standard advice on how to scale a backend, because, you know, frameworks aren't the bottleneck.

  • Scale out vs scale up? Some things that used to be only solved with scale up can now be solved by scaling out. Because sometimes scaling doesn't matter, but sometimes it does. The Unique Architecture behind Amazon Games’ Seamless MMO New World.
    • New World became the fifth highest-played game by concurrent users in the history of the video game distribution service Steam, with more than 900,000 concurrent users in its first weekend.
    • Amazon DynamoDB handles about 800,000 writes every 30 seconds to store game state.
    • While a traditional MMO simulates, redraws, analyzes, and processes user states and actions around five times per second, New World does these 30 times a second. "In the old world, if things needed to scale up, you bought bigger hardware"
    • As a player moves through the game, state information is handed from hub instance to hub instance, spreading the compute load over virtual servers hosted globally in multiple AWS Regions, physical locations that contain clusters of data centers.
    • the Amazon EC2 instances for a single world in Aeternum can simulate more than 7,000 artificial intelligence entities and hundreds of thousands of objects for 2,500 players. Each server set often processes millions of state changes per second.
    • To power the known core game play activities, New World uses AWS Lambda, a serverless, event-driven compute service that lets developers run code for the backend services.

  • Uber's Better Load Balancing: Real-Time Dynamic Subsetting:
    • Uber infrastructure operates in multiple zones in multiple regions, across both on-prem data centers and public cloud providers. Most of the control plane components are zonal, and each typically consists of several sub-components. At a very high level, Uber’s service mesh architecture is centered around an on-host L7 reverse proxy
    • Subsetting in the context of load balancing means dividing the set of potential backend tasks into overlapping "subsets," so that while all backends receive traffic in aggregate, each proxy performs the load balancing on a limited set of tasks.
    • The basic idea is: if an on-host proxy knows how much QPS a callee service is receiving, it can derive the ratio of load it is contributing to the overall traffic. With this information, the proxy can decide its subset size dynamically based on that ratio (i.e., it should expand its subset if it starts to make more requests to a destination service).
    • Dynamic Subsetting system where aperture parameters of the mesh scale automatically and dynamically with the changing traffic. The system has been running in production for nearly 2 years for millions of containers.
    • The primary success we’d like to call out is the reduction in maintenance cost: since the rollout (~18-12 months ago) we have had 0 cases of service owners complaining about subsetting. With 8 larger services manually tuned, they reported 15-30% p99 CPU utilization reduction.
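The ratio-based sizing idea can be sketched as a pure function. This is an illustration of the concept only, not Uber's actual formula; the function name, the rounding, and the minimum-size floor are all invented here:

```python
def dynamic_subset_size(own_qps, total_qps, num_backends, min_size=4):
    """Sketch of dynamic subsetting: a proxy contributing a larger share
    of a callee's traffic load-balances across proportionally more of
    that callee's backends. min_size keeps even tiny callers spread
    over a few tasks for fault tolerance."""
    if total_qps <= 0:
        return min(min_size, num_backends)
    share = own_qps / total_qps  # this proxy's fraction of total traffic
    size = round(share * num_backends)
    return max(min(min_size, num_backends), min(num_backends, size))

heavy = dynamic_subset_size(own_qps=80, total_qps=100, num_backends=50)  # 40
light = dynamic_subset_size(own_qps=1, total_qps=100, num_backends=50)   # 4
```

A heavy caller sending 80% of the callee's traffic balances over 40 of 50 tasks, while a light caller falls back to the floor of 4, so no single proxy hot-spots a small set of backends.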

  • And they all fall down...Cascading failures in large-scale distributed systems:
    • A cascading failure is a failure that increases in size over time due to a positive feedback loop. The typical behavior is initially triggered by a single node or subsystem failing. This spreads the load across fewer nodes of the remaining system, which in turn increases the likelihood of further system failures resulting in a vicious circle or snowball effect
    • The most common cause is server overload or a consequence of it. When that happens, the drop in system performance often affects other areas of the system.
    • Resources of a server are limited. If the load increases above a certain threshold, the server’s performance metrics, such as latency or error rates, deteriorate. This translates into a higher risk of a crash
    • When resource exhaustion causes a server to crash, traffic spreads to other servers, increasing the likelihood that those will crash as well. A self-reinforcing cycle of crashing servers sets in.
    • Now, to address the cascading scenario, various measures can be taken. The first and most intuitive option is to increase resources.
    • Try to avoid health check failures/deaths
    • Restart your servers in case of thread-blocking requests or deadlocks.
    • Drop traffic significantly and then slowly increase the load so that the servers can gradually recover.
    • Switch to a degraded mode by dropping certain types of traffic.
    • Eliminate batch/bad traffic to reduce system load
    • Move from an orchestration pattern to a choreography pattern by applying a publish–subscribe design.
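The feedback loop described above is easy to model: a fixed total load spread evenly over a shrinking pool of servers. The toy Python below (capacities and loads are invented) shows how a pool that is fine at full strength collapses completely once it drops below a threshold:

```python
def redistribute(total_load, capacity, healthy):
    """Toy model of the cascading-failure feedback loop: a fixed total
    load is spread evenly over however many servers remain healthy.
    Whenever the per-server share exceeds capacity, one server crashes
    and the same load respreads over fewer survivors. Returns the pool
    size observed at each round."""
    history = []
    while healthy > 0:
        history.append(healthy)
        if total_load / healthy <= capacity:
            break  # survivors can absorb the load; the system stabilizes
        healthy -= 1  # an overloaded server crashes; the loop tightens
    return history

# Ten servers at capacity 100 comfortably carry a total load of 250...
stable = redistribute(250, 100, 10)   # [10]
# ...but once the pool shrinks to two, every survivor is overloaded in
# turn and the whole pool collapses.
collapse = redistribute(250, 100, 2)  # [2, 1]
```

This is also why "drop traffic significantly and then slowly increase the load" works: it moves total_load back under healthy * capacity so the loop can exit.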

Soft Stuff:

  • Litestream: a standalone streaming replication tool for SQLite.
  • SplinterDB: a key-value store designed for high performance on fast storage devices.
  • Zig (Zig at Uber): a general-purpose programming language and toolchain for maintaining robust, optimal and reusable software.
  • Inform: a programming language for creating interactive fiction, using natural language syntax.
  • Trino: a fast distributed SQL query engine for big data analytics.
  • kic-reference-architectures: This repository has the basics for a common way to deploy and manage modern apps.
  • ATL: This is a working space for sketching a tensor language.

Pub Stuff:

  • Convergence of multiple synthetic paradigms in a universally programmable chemical synthesis machine: For the last decade people have been telling me the digitization of chemistry was impossible. They said it was too complicated, sensitive and expensive. Our chemical programming language runs on cheap hardware and can make any molecule or material. #chemputation. Here’s an example.
  • Automated performance prediction of microservice applications using simulation: Our approach accurately predicts the deployment performance of large-scale microservice applications in various configurations from a single execution trace. This provides valuable insights on the performance of an application prior to its deployment on a real platform.
  • Zero-Overhead Metaprogramming: Our evaluation with self-optimizing interpreters shows that unrestricted metaobject protocols can be realized for the first time without runtime overhead, and that this optimization is applicable for just-in-time compilation of interpreters based on meta-tracing as well as partial evaluation.
  • Powering a microprocessor by photosynthesis: Here, we describe a bio-photovoltaic energy harvester system using photosynthetic microorganisms on an aluminium anode that can power an Arm Cortex M0+, a microprocessor widely used in Internet of Things applications. The proposed energy harvester has operated the Arm Cortex M0+ for over six months in a domestic environment under ambient light.
  • Google SRE Books: Building Secure & Reliable Systems; The Site Reliability Workbook; Site Reliability Engineering
  • Succinct Data Structures and Delta Encoding for Modern Databases: TerminusDB makes a radical departure from historical architectures to address these changes. First, we implement a graph database with a strong schema so as to retain both simplicity and generality of design. Second, we implement this graph using succinct immutable data structures which enable more sparing use of main memory resources.
  • Monarch: Google’s Planet-Scale In-Memory Time Series Database: Google’s system for storing time-series metrics. (Time-series data describes data points that occur over time; storing and using this type of information is an active area of research and industry development.) Time series metrics are used for alerting, graphing performance, and ad-hoc diagnosis of problems in production. (Metrics are one of the main components in an observability stack, among tracing, events, and logging; the paper Towards Observability Data Management at Scale has more information on the other components of the stack.)
  • A Probabilistic Chemical Programmable Computer (video): We present a hybrid digitally programmable chemical array as a probabilistic computational machine that uses chemical oscillators partitioned in interconnected cells as a computational substrate. This hybrid architecture performs efficient computation by distributing between chemical and digital domains together with error correction.
  • Algorithms for Modern Hardware: This is an upcoming high performance computing book titled "Algorithms for Modern Hardware" by Sergey Slotin. Its intended audience is everyone from performance engineers and practical algorithm researchers to undergraduate computer science students who have just finished an advanced algorithms course and want to learn more practical ways to speed up a program than by going from O(n log n) to O(n log log n).
  • Decoupled Transactions: Low Tail Latency Online Transactions Atop Jittery Servers (video: https://www.youtube.com/watch?v=72UZ8DxPa8o): We present a thought experiment for a decoupled transactions database system that avoids cascading slowdown when a subset of its servers are sick but not necessarily dead. The goal is to provide low tail latency online transactions atop servers and networks that may sometimes go slow
  • Designing Event-Driven Systems: Author Ben Stopford explains how service-based architectures and stream processing tools such as Apache Kafka® can help you build business-critical systems.
]]>
<![CDATA[Sponsored Post: G-Core Labs, Close, Wynter, Pinecone, Kinsta, Bridgecrew, IP2Location, StackHawk, InterviewCamp.io, Educative, Stream, Fauna, Triplebyte]]>

Who's Hiring?

  • Close is building the sales communication platform of the future and we're looking for a Site Reliability Engineer to help us accomplish that goal. If you have a passion for databases (both SQL and NoSQL), significant experience building, managing, and monitoring infrastructure, databases, and
]]>
https://highscalability.com/sponsored-post-g-core-labs-close-wynter-pinecone-kinsta-brid/65bceaca3953980001659b4cFri, 10 Jun 2022 14:39:03 GMT

Who's Hiring?

  • Close is building the sales communication platform of the future and we're looking for a Site Reliability Engineer to help us accomplish that goal. If you have a passion for databases (both SQL and NoSQL), significant experience building, managing, and monitoring infrastructure, databases, and backend services at scale and want to work with a rag-tag team of hardworking, but humble humans, then come check us out!
  • Wynter is looking for system administrators, engineers, and developers to join its research panel. Participate in research surveys, get paid ($90-$180/hr) for your feedback and comments.  Super low key commitment, 10-15 mins per survey. Learn more and sign up.
  • DevOps Engineer: At Kinsta, we set out to create the best managed hosting platform in the world. If you are an experienced DevOps Engineer who is constantly looking for ways to innovate and improve, we might just be the place for you! As Kinsta’s DevOps Engineer, you will be instrumental in making sure that our infrastructure is always on the bleeding edge of technology, remaining stable and high-performing at all times. If you love working with Linux, have a background in PHP, and have worked with cutting-edge technologies, such as Ansible and LXC, check us out and apply here.
  • SysOps Engineer: As Kinsta’s SysOps Engineer, you’ll play a key role in caring for the health of our servers, preventing issues, and responding immediately to mitigate any problems in our infrastructure. If you have experience in hosting and with the WordPress stack, have worked with Ubuntu or Debian-based systems, and cutting-edge technologies, such as Ansible and LXC, you should definitely check us out and apply here!
  • JavaScript Developer: Kinsta’s Development Team aims to create truly effective solutions for our customers and our internal teams. We believe the only way to maintain and scale our standards is to focus on quality code. If you are a talented JavaScript developer and you would like to work with an ownership-embracing, highly skilled team in a relaxed environment where culture is as important as technical excellence, apply here.
  • InterviewCamp.io has hours of system design content. They also do live system design discussions every week. They break down interview prep into fundamental building blocks. Try out their platform.
  • Triplebyte lets exceptional software engineers skip screening steps at hundreds of top tech companies like Apple, Dropbox, Mixpanel, and Instacart. Make your job search O(1), not O(n). Apply here.
  • Need excellent people? Advertise your job here!

Cool Products and Services

  • G-Core Labs CDN is a next-generation content delivery network that helps you accelerate the dynamic and static content of your websites or applications. CDN has many built-in web security features and operates through a rapidly growing Anycast/DNS infrastructure with over 140 points of presence worldwide. Learn more here.
  • Search like you mean it. Pinecone is a vector database that makes it easy to add semantic search to your applications. It combines state-of-the-art vector search libraries like Faiss, advanced features such as filtering, and fully managed distributed infrastructure for high performance and reliability at any scale. Give your users better search results and recommendations and make them happy. Get started for free.
  • Bridgecrew is the cloud security platform for developers. By leveraging automation and delivering security-as-code, Bridgecrew empowers teams to find, fix, and prevent misconfigurations in deployed cloud resources and in infrastructure as code. Get started for free!
  • IP2Location has been an IP address geolocation service provider since 2002. The geolocation database or API detects location, proxy, and more than 20 other parameters. The technology has been cited in more than 700 research papers and trusted by many Fortune 500 companies. Try it today!
  • Developers care about shipping secure applications. Application security products and processes, however, have not kept up with advances in software development. There is a new breed of tools hitting the market that enables developers to take the lead on AppSec. Learn how engineering teams are using products like StackHawk and Snyk to add security bug testing to their CI pipelines.
  • Stateful JavaScript Apps. Effortlessly add state to your Javascript apps with FaunaDB. Generous free tier. Try now!
  • Learn to balance architecture trade-offs and design scalable enterprise-level software. Check out Educative.io's 5-part learning path: Scalability and System Design for Developers. In addition to other content written by seasoned industry experts, this path includes core material from our bestselling course Grokking the System Design Interview. Join more than 500,000 other learners from top tech companies around the world.
  • Build, scale and personalize your news feeds and activity streams with getstream.io. Try the API now in this 5 minute interactive tutorial. Stream is free up to 3 million feed updates so it's easy to get started. Client libraries are available for Node, Ruby, Python, PHP, Go, Java and .NET. Stream is currently also hiring Devops and Python/Go developers in Amsterdam. More than 400 companies rely on Stream for their production feed infrastructure, this includes apps with 30 million users. With your help we'd like to add a few zeros to that number.

Fun and Interesting Events

  • Advertise here!

If you are interested in a sponsored post for an event, job, or product, please contact us for more information.


Make Your Job Search O(1) — not O(n)

Triplebyte is unique because they're a team of engineers running their own centralized technical assessment. Companies like Apple, Dropbox, Mixpanel, and Instacart now let Triplebyte-recommended engineers skip their own screening steps.

We found that High Scalability readers are about 80% more likely to be in the top bracket of engineering skill.

Take Triplebyte's multiple-choice quiz (system design and coding questions) to see if they can help you scale your career faster.


If you are interested in a sponsored post for an event, job, or product, please contact us for more information.

]]>
<![CDATA[Sponsored Post: Close, Wynter, Pinecone, Kinsta, Bridgecrew, IP2Location, StackHawk, InterviewCamp.io, Educative, Stream, Fauna, Triplebyte]]>

Who's Hiring?

  • Close is building the sales communication platform of the future and we're looking for a Site Reliability Engineer to help us accomplish that goal. If you have a passion for databases (both SQL and NoSQL), significant experience building, managing, and monitoring infrastructure, databases, and
]]>
https://highscalability.com/sponsored-post-close-wynter-pinecone-kinsta-bridgecrew-ip2lo/65bceaca3953980001659b4bWed, 04 May 2022 07:35:24 GMT

Who's Hiring?

  • Close is building the sales communication platform of the future and we're looking for a Site Reliability Engineer to help us accomplish that goal. If you have a passion for databases (both SQL and NoSQL), significant experience building, managing, and monitoring infrastructure, databases, and backend services at scale and want to work with a rag-tag team of hardworking, but humble humans, then come check us out!
  • Wynter is looking for system administrators, engineers, and developers to join its research panel. Participate in research surveys, get paid ($90-$180/hr) for your feedback and comments.  Super low key commitment, 10-15 mins per survey. Learn more and sign up.
  • DevOps Engineer: At Kinsta, we set out to create the best managed hosting platform in the world. If you are an experienced DevOps Engineer who is constantly looking for ways to innovate and improve, we might just be the place for you! As Kinsta’s DevOps Engineer, you will be instrumental in making sure that our infrastructure is always on the bleeding edge of technology, remaining stable and high-performing at all times. If you love working with Linux, have a background in PHP, and have worked with cutting-edge technologies, such as Ansible and LXC, check us out and apply here.
  • SysOps Engineer: As Kinsta’s SysOps Engineer, you’ll play a key role in caring for the health of our servers, preventing issues, and responding immediately to mitigate any problems in our infrastructure. If you have experience in hosting and with the WordPress stack, have worked with Ubuntu or Debian-based systems, and cutting-edge technologies, such as Ansible and LXC, you should definitely check us out and apply here!
  • JavaScript Developer: Kinsta’s Development Team aims to create truly effective solutions for our customers and our internal teams. We believe the only way to maintain and scale our standards is to focus on quality code. If you are a talented JavaScript developer and you would like to work with an ownership-embracing, highly skilled team in a relaxed environment where culture is as important as technical excellence, apply here.
  • InterviewCamp.io has hours of system design content. They also do live system design discussions every week. They break down interview prep into fundamental building blocks. Try out their platform.
  • Triplebyte lets exceptional software engineers skip screening steps at hundreds of top tech companies like Apple, Dropbox, Mixpanel, and Instacart. Make your job search O(1), not O(n). Apply here.
  • Need excellent people? Advertise your job here!

Cool Products and Services

  • Search like you mean it. Pinecone is a vector database that makes it easy to add semantic search to your applications. It combines state-of-the-art vector search libraries like Faiss, advanced features such as filtering, and fully managed distributed infrastructure for high performance and reliability at any scale. Give your users better search results and recommendations and make them happy. Get started for free.
  • Bridgecrew is the cloud security platform for developers. By leveraging automation and delivering security-as-code, Bridgecrew empowers teams to find, fix, and prevent misconfigurations in deployed cloud resources and in infrastructure as code. Get started for free!
  • IP2Location has been an IP address geolocation service provider since 2002. The geolocation database or API detects location, proxy, and more than 20 other parameters. The technology has been cited in more than 700 research papers and trusted by many Fortune 500 companies. Try it today!
  • Developers care about shipping secure applications. Application security products and processes, however, have not kept up with advances in software development. There is a new breed of tools hitting the market that enables developers to take the lead on AppSec. Learn how engineering teams are using products like StackHawk and Snyk to add security bug testing to their CI pipelines.
  • Stateful JavaScript Apps. Effortlessly add state to your Javascript apps with FaunaDB. Generous free tier. Try now!
  • Learn to balance architecture trade-offs and design scalable enterprise-level software. Check out Educative.io's 5-part learning path: Scalability and System Design for Developers. In addition to other content written by seasoned industry experts, this path includes core material from our bestselling course Grokking the System Design Interview. Join more than 500,000 other learners from top tech companies around the world.
  • Build, scale and personalize your news feeds and activity streams with getstream.io. Try the API now in this 5 minute interactive tutorial. Stream is free up to 3 million feed updates so it's easy to get started. Client libraries are available for Node, Ruby, Python, PHP, Go, Java and .NET. Stream is currently also hiring Devops and Python/Go developers in Amsterdam. More than 400 companies rely on Stream for their production feed infrastructure, this includes apps with 30 million users. With your help we'd like to add a few zeros to that number.

Fun and Interesting Events

  • Advertise here!

If you are interested in a sponsored post for an event, job, or product, please contact us for more information.


Make Your Job Search O(1) — not O(n)

Triplebyte is unique because they're a team of engineers running their own centralized technical assessment. Companies like Apple, Dropbox, Mixpanel, and Instacart now let Triplebyte-recommended engineers skip their own screening steps.

We found that High Scalability readers are about 80% more likely to be in the top bracket of engineering skill.

Take Triplebyte's multiple-choice quiz (system design and coding questions) to see if they can help you scale your career faster.


If you are interested in a sponsored post for an event, job, or product, please contact us for more information.

]]>