Network Next is a network accelerator for multiplayer games. Since the start of Season 1, Network Next has been accelerating play in REMATCH around the world.
Today, with SLOCLAP's permission, I am able to share some results of this acceleration and how we have improved the player experience in REMATCH.
First, let's start with the problem. As with all multiplayer games, some people who play REMATCH get terrible network performance. This sucks. As game developers we want everybody to have the best experience, but the internet doesn't always deliver it. These players then rush to Reddit or Discord and complain about lag. For them, the game is completely unplayable. This is of course totally reasonable, and I fully believe them.
So what's going on?
The first thing to realize is that your network performance while playing a game doesn't just depend on your internet connection and the game servers; it also depends on the route your packets take between your ISP and where the game servers are hosted.
You see, there isn't actually a direct connection between your ISP and the game servers. The internet is a network of networks, meaning your game packets have to cross several different networks between your ISP and the game server. These networks are often transit providers running fiber optic cables, and they have no business relationship with the game developer.
Their only job is to make a "best effort" to send your packets in the right direction, and they don't coordinate to make sure your packets always take the lowest latency and lowest packet loss route to the game server. Usually they do a decent job, but not always. I like to think of it as multiple disinterested strangers sitting between your ISP and the game server, throwing packets over the wall and wiping their hands clean afterwards. Job done!
Also, especially on weekends and during peak play times at night when most people are playing, there can be congestion, causing really big spikes of latency and packet loss for the unfortunate players given that route (too much data being sent through the same link, so something has to give...).
Sometimes cables get cut, as happened recently in the Middle East, and afterwards packets have to take a longer route, so players get higher latency than usual. Sometimes an ISP just has a misconfiguration or a bad link on their network on a Friday night, and every player on that ISP gets high latency and packet loss for the rest of the weekend, until it's fixed sometime the next week.
Sometimes the ISP is load balancing across multiple upstream routes, and one of them happens to have higher latency or more congestion than the others. Players get a good ping most of the time, but every now and then (according to the hash of their client IP address and port and the game server IP address and port) they take the higher latency route. We see this often: a good ping most of the time, but every nth game becomes unplayable.
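To make the hashing concrete, here's a toy sketch of how an ECMP-style load balancer pins a flow to one of several equal-cost upstream routes. This is illustrative only, not any real router's algorithm; it uses FNV-1a as a stand-in for whatever hash real routing hardware uses:

```c
#include <stdint.h>
#include <string.h>

// FNV-1a as a stand-in for whatever hash real routing hardware uses
static uint32_t fnv1a( const uint8_t * data, int size )
{
    uint32_t hash = 0x811c9dc5u;
    for ( int i = 0; i < size; i++ )
    {
        hash ^= data[i];
        hash *= 0x01000193u;
    }
    return hash;
}

// hash the flow 4-tuple to pick one of several equal-cost upstream routes.
// every packet in a flow takes the same route, but change the client port
// (eg. a new game session) and the flow may land on a worse route
int select_route( uint32_t client_ip, uint16_t client_port, uint32_t server_ip, uint16_t server_port, int num_routes )
{
    uint8_t tuple[12];
    memcpy( tuple + 0,  &client_ip,   4 );
    memcpy( tuple + 4,  &server_ip,   4 );
    memcpy( tuple + 8,  &client_port, 2 );
    memcpy( tuple + 10, &server_port, 2 );
    return (int) ( fnv1a( tuple, 12 ) % (uint32_t) num_routes );
}
```

The point: the route choice is deterministic per flow, so a player stuck on the congested route stays stuck for the whole session, until their port changes and the dice are rolled again.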
Network Next was created to fix all these things. We actively monitor your connection while you play, and if your connection is already taking the lowest latency route and you don't have packet loss, we don't do anything. But, if we see that we can reduce latency or packet loss by sending game packets down a different route, we override the routing decisions made by the internet and force your game traffic to take the lowest latency and lowest packet loss route instead.
How do we do this exactly? By sending packets across a relay network.
This means that when we accelerate a player, instead of just sending packets from the client to the server IP address, we send the packets to a relay, which then forwards the packets to another relay, and then maybe another relay, and then to the game server, and the same in reverse from server to client. This way we're able to control the route packets take end-to-end between the client and server.

We couple this with a route optimization system that selects the relays packets are sent across, making sure only routes with the lowest possible latency, jitter and packet loss are taken, and with multi-path, where we send packets across multiple routes at the same time to reduce packet loss.
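Multi-path means the receiver will see the same packet arrive more than once, so each packet carries a sequence number and duplicates are dropped on arrival. Here's a minimal sketch of receive-side dedupe; the struct and sizes are illustrative, not Network Next's actual implementation:

```c
#include <stdint.h>

#define DEDUPE_SLOTS 256

typedef struct
{
    // most recent sequence seen per slot, stored +1 so zero means empty.
    // old entries are simply overwritten as newer sequences arrive
    uint64_t slots[DEDUPE_SLOTS];
} dedupe_t;

// returns 1 if this packet should be processed, or 0 if it's a duplicate
// that already arrived via another path
int packet_is_new( dedupe_t * dedupe, uint64_t sequence )
{
    uint64_t * slot = &dedupe->slots[ sequence % DEDUPE_SLOTS ];
    if ( *slot == sequence + 1 )
        return 0;
    *slot = sequence + 1;
    return 1;
}
```

The first copy of a packet to arrive wins, whichever path it took, which is exactly why multi-path reduces packet loss: both copies have to be lost before the game notices anything.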
The end result is this:

This image was captured during the free play weekend for REMATCH at the end of September 2025, showing players being accelerated at that moment in time.
Here you see absolutely massive latency reductions of 100ms+, up to 300ms. These are players going from literally unplayable network conditions down to the lowest possible latency given their distance to the game server.
Players getting fixed aren't limited to South America and the Middle East. During peak play times in the USA we see huge latency reductions there too:

In fact, we see huge improvements all around the world.
A player in Palestine, playing on a server in Istanbul:

A player in Libya, playing on a server in Frankfurt:

A player in India being accelerated to servers in Dubai:

A player in Buenos Aires, being accelerated to a server in São Paulo:

A player in Mexico being accelerated to a server in Dallas:

A player in the UK, being accelerated to a game server in Frankfurt:

A player in Bermuda, being accelerated to a game server in Virginia, USA:

A player in Texas, playing on a server in Dallas:

Somebody with Google Fiber in Los Angeles, playing on a server in LA, proving that having a good internet connection doesn't always equal low latency!

We even saw somebody playing in the Riot LA offices, playing on a server in LA. They have their own private internet, Riot Direct, and still this happens:

I could go on forever. We just see so many massive improvements, every single day. The internet really doesn't care about games.
Now let's switch to statistics.
Since going live with REMATCH, we've seen more than one million unique players, and have accelerated more than half of them (52.33%) at least once.
In less than a month we have:
In particular we have focused our efforts on fixing high latency in South America, Central America and the Caribbean, Mexico, Turkey and the Middle East. Our goal is to reduce the amount of playtime spent above 60ms of latency, and especially the amount of playtime above 100ms of latency.
For example, for players in Turkey playing on Istanbul servers, we reduced the percentage of playtime above 60ms from 4.7% down to 0.3%, and playtime above 100ms from 0.29% down to just 0.04%.
We saw similar results for players in Brazil playing on servers in São Paulo, and for players across the Middle East playing on servers in Dubai.
And for players in other regions, especially in the USA and Europe, even if your latency is pretty good most of the time, we saved you from a surprisingly large number of packet loss and latency spikes:
Of course, if the server is far away from you we cannot always get your latency below 60ms or 100ms. The speed of light isn't just a good idea, it's the law. We understand that sometimes you want to play with your friends in another region, and rest assured that in this case we work just as hard to find you the lowest latency possible. That's right, Network Next doesn't just accelerate you to your local region, we accelerate you to wherever you want to play, no matter how far away that is.
In summary, thank you very much to SLOCLAP for graciously allowing us to share these statistics with you. If you are a REMATCH player, I hope you can see just how much effort has gone in over the last month to reduce latency and packet loss for you and give you the best in-game experience possible. It's no exaggeration to say we have spent most of our waking hours over the last month setting up, running and optimizing this system to get you the most acceleration possible.
If you are a game developer about to launch your own multiplayer game, please consider that you can use Network Next too. It's source available and free to use if you host it yourself for your own game and have fewer than 10K CCU, and of course we are also happy to provide a full service hosted offering with support for a reasonable price, if that's what you prefer.
Please see the source code on GitHub for more details: https://github.com/networknext/next
I'm Glenn Fiedler and welcome to Más Bandwidth, my new blog at the intersection of game network programming and scalable backend engineering.
It's not often that something this worthy comes along...
NeuroAnimation, a company that develops virtual life forms for stroke recovery and rehabilitation, reached out and asked if I could help network their virtual creatures.
It's just so easy to see the win here. A therapist can now be inside the virtual reality simulation with the child they are helping recover from a stroke or other traumatic brain injury. A parent can now be with their non-verbal child and communicate with them inside the simulation. Children undergoing therapy can now play together.
I'm incredibly excited to be helping NeuroAnimation with this worthy cause.
More details here: NeuroAnimation.com
See you all in a few months!
- Glenn
XDP is an amazing way to receive millions of packets per-second, bypassing the Linux networking stack and letting you write eBPF programs that operate on packets as soon as they come off the NIC. But, you can only write packets in response to packets received – you can't generate your own stream of packets.
In this article I'm going to show you how to use AF_XDP to generate and send millions of UDP packets per-second. You can use these packets to test components you've written in XDP/eBPF by sending packets close to line rate, making sure your XDP/eBPF programs work properly under load.
AF_XDP is a (relatively) new type of Linux socket that lets you send and receive raw packets directly from userspace programs. It's incredibly efficient because instead of traditional system calls like sendto and recvfrom, it sends and receives packets via lock-free ring buffers.
On the receive path, AF_XDP works together with an XDP/eBPF program that decides which packets should be sent down to the AF_XDP socket, so you can make decisions like "is this packet valid?" before passing it down to userspace. But on the send side – which we'll focus on in this article – AF_XDP is completely independent of XDP/eBPF programs. It's really just an efficient way to send packets with ring buffers.
First, you create an area of memory called a UMEM where all packets are stored. This memory is shared between your userspace program and the kernel, so they can both read and write to it.
The UMEM is broken up into frames, each frame sized to the maximum packet size, for example 1500 bytes. With 4096 frames, the UMEM is just a contiguous array of 4096 buffers, each 1500 bytes large. It's nothing complicated.
Next, you create an AF_XDP socket linked to this UMEM and associate it with two ring buffers: TX and Complete. Yes, there are two additional ring buffers used for receiving packets but let's ignore them, because thinking of all four ring buffers at the same time fries brains.
The TX ring buffer is the send queue. It just says, "hey kernel, here's a new packet to send, at this offset in the UMEM and it's this many bytes long". The kernel driver reads this TX data from the queue, and sends the packet in the UMEM at that offset and length directly to the network interface card (NIC).
Obviously, it's extremely important that you don't reuse a frame in the UMEM to send another packet until the packet in it has actually been sent, so the Complete ring buffer is a queue that feeds back the set of completed packet sends to the userspace program. You read from this queue and mark frames in the UMEM as available for new packets.
And that's really all there is to it. You write raw packets in the UMEM, send them via the TX queue, and the kernel notifies you when packet sends are completed via the Complete queue. You can read more about the AF_XDP ring buffers here.
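The AF_XDP rings are single-producer, single-consumer queues: free-running producer and consumer indices into a power-of-two array of descriptors, masked on access. Here's a simplified model of the idea. The real AF_XDP rings live in kernel-mapped memory and use atomic loads and stores with memory barriers, which this sketch omits:

```c
#include <stdint.h>

#define RING_SIZE 4096 // must be a power of two

typedef struct
{
    uint64_t descriptors[RING_SIZE]; // eg. frame offsets into the UMEM
    uint32_t producer;               // written only by the producer side
    uint32_t consumer;               // written only by the consumer side
} ring_t;

// producer side: returns 0 if the ring is full
int ring_push( ring_t * ring, uint64_t descriptor )
{
    if ( ring->producer - ring->consumer == RING_SIZE )
        return 0;
    ring->descriptors[ ring->producer & ( RING_SIZE - 1 ) ] = descriptor;
    ring->producer++;
    return 1;
}

// consumer side: returns 0 if the ring is empty
int ring_pop( ring_t * ring, uint64_t * descriptor )
{
    if ( ring->consumer == ring->producer )
        return 0;
    *descriptor = ring->descriptors[ ring->consumer & ( RING_SIZE - 1 ) ];
    ring->consumer++;
    return 1;
}
```

Your program is the producer on the TX ring and the consumer on the Complete ring; the kernel is the other side of each.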
The packet constructed in the UMEM is sent directly to the NIC without modification, so you have to construct a raw packet including ethernet, IPv4 and UDP headers in front of your UDP payload.
Thankfully, this is relatively easy to do. Linux headers provide convenient struct definitions for each of these headers, and you can use them to write the headers directly into the frame.
Here's my code for writing a raw UDP packet:
// eth, ip and udp point to the ethernet, ip and udp headers inside this
// packet's frame in the UMEM, eg. eth = (struct ethhdr*) frame, with
// ip and udp following it at their respective offsets

// generate ethernet header
memcpy( eth->h_dest, SERVER_ETHERNET_ADDRESS, ETH_ALEN );
memcpy( eth->h_source, CLIENT_ETHERNET_ADDRESS, ETH_ALEN );
eth->h_proto = htons( ETH_P_IP );
// generate ip header
ip->ihl = 5;
ip->version = 4;
ip->tos = 0x0;
ip->id = 0;
ip->frag_off = htons(0x4000); // do not fragment
ip->ttl = 64;
ip->tot_len = htons( sizeof(struct iphdr) + sizeof(struct udphdr) + payload_bytes );
ip->protocol = IPPROTO_UDP;
ip->saddr = htonl( 0xc0a80000 | ( counter & 0xFFFF ) ); // 192.168.*.*
ip->daddr = SERVER_IPV4_ADDRESS;
ip->check = 0;
ip->check = ipv4_checksum( ip, sizeof( struct iphdr ) );
// generate udp header
udp->source = htons( CLIENT_PORT );
udp->dest = htons( SERVER_PORT );
udp->len = htons( sizeof(struct udphdr) + payload_bytes );
udp->check = 0;
// generate udp payload
uint8_t * payload = (uint8_t*) udp + sizeof( struct udphdr );
for ( int i = 0; i < payload_bytes; i++ )
{
payload[i] = i;
}

One thing I do above: instead of using the actual LAN IP address of the client sending packets, I increment a counter with each packet sent and use it to fill the lower 16 bits of 192.168.[x].[y].
This distributes received packets across all receive queues for multi-queue NICs, because on Linux the queue is selected using a hash of the source and destination IP addresses. Without this, throughput will be limited by the receiver, because all packets arrive on the same NIC queue.
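The code above calls `ipv4_checksum`, which isn't shown in this excerpt. It's the standard internet checksum from RFC 1071: the ones' complement sum of the header taken as 16-bit words, with the checksum field zeroed first. A minimal implementation:

```c
#include <stdint.h>
#include <stddef.h>

// RFC 1071 internet checksum over the IPv4 header. The header's check
// field must be zero when this is called. Summing 16-bit words in native
// byte order gives a correct checksum on both little and big endian
uint16_t ipv4_checksum( const void * data, size_t header_bytes )
{
    uint32_t sum = 0;
    const uint16_t * words = (const uint16_t*) data;
    for ( size_t i = 0; i < header_bytes / 2; i++ )
        sum += words[i];
    while ( sum >> 16 )
        sum = ( sum & 0xFFFF ) + ( sum >> 16 );
    return (uint16_t) ~sum;
}
```

A useful property for testing: recompute the checksum over a header whose check field has already been filled in, and the result is zero.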
You can see the full source code for my AF_XDP test client and server here.
I'm running this test over 10G ethernet:
I can send 6 million 100 byte UDP packets per-second on a single core.
If I reduce to 64 byte UDP packets (on wire, not payload – see this article) I should be able to send a total of 14.88 million per-second at 10G line rate, but I can only get up to around 10-11 million. I suspect there is something NUMA related going on with ksoftirqd running on different CPUs than the packets were queued for send on, but I don't know how to solve it.
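For reference, the 14.88M figure comes from the fact that each minimum-size 64 byte frame also costs 8 bytes of preamble and 12 bytes of inter-frame gap on the wire:

```c
// max packets per second at a given link rate, accounting for the
// 8 byte preamble and 12 byte inter-frame gap each frame costs on the wire
double max_packets_per_second( double link_bits_per_second, int frame_bytes )
{
    const int preamble_bytes = 8;
    const int interframe_gap_bytes = 12;
    return link_bits_per_second / ( ( frame_bytes + preamble_bytes + interframe_gap_bytes ) * 8.0 );
}
```

So 10e9 / ((64 + 20) * 8) gives roughly 14.88 million packets per second at 10G line rate.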
If you know what's up here, please email [email protected]. I'd love to be able to hit line rate with AF_XDP and share it here.
One of the most difficult parts about launching a multiplayer game is planning where to put your game servers. This is particularly challenging because before launch you probably have no idea where your players will be, what density of players to expect in each region, or how internet topology maps to real world locations.
In this article I aim to share with you all the relevant geographical information I've learned over the last five years operating the network accelerator Network Next.
So strap in and let's find out where your game servers should be.
To pick server locations you first need to decide on your strategy.
Are you going to have lots of server locations around the world, like first person shooters such as Valorant, Counter-Strike or Overwatch – or – will you have centrally located servers in the middle of large regions, for a less latency sensitive game like League of Legends?
The key trade-off here is latency vs. player fragmentation. The more server locations you have, the lower the average latency, but the smaller the player pool per location. Conversely, with a larger geographical area playing in one location, there are more opportunities for better skill based matchmaking, because there are more players to choose from.
The correct decision of course depends on your game, but the best advice I can give is to make your game as latency insensitive as possible. If your game is latency insensitive from [0,50ms] can you extend it so that it also plays well up to 100ms?
Why is this so important? Because even if you try to minimize latency by deploying game servers in major cities everywhere, at best you'll obtain an average latency of 25-30ms. And even then approximately 50% of your players will have latency above this.
So think less about "how can I get the absolute minimum latency for my game?", and more about "how can I engineer my game so it plays well at the widest possible range of latencies, so that the MOST players can play together?". Then, once you know the acceptable range of latency for your game, build your server locations and matchmaking strategy around it.
In this article we'll start with the regional strategy.
Here the general assumption is that you have a game which plays perfectly well in the [0,100ms] round trip latency range, only starting to degrade above ~100ms.
Your goal is to have one really good server location per-region, where everybody in the region has a strong chance of playing on the server below 100ms. This server location should be centrally located for fairness, and you might even consider equalizing latency within the region in some way, so that players close to the central location don't get too much of an advantage over those further away.









If you are just getting started developing a multiplayer game, my recommendation is to stop worrying too much about minimizing latency and just design your game so it plays well at the largest latency range possible, ideally at least [0,100ms], then host your game servers in the major regional cities listed above.
Don't chase the latency dragon by having tens of server locations per-region. You'll just end up with latency divergence as you whiplash players from one city to another. Instead, focus on providing the most consistent experience from one match to the next.
Unfortunately, the internet doesn't always deliver game traffic across the lowest latency route and this can be a real problem with a regional matchmaking strategy. Consider using a network accelerator like Network Next to fix bad routes that take players outside your ideal latency range of [0,100ms].
You can configure Network Next to only accelerate players when their latency is outside of your ideal latency range, so you only pay for network acceleration when it provides a meaningful improvement for your players.
Please contact us if you'd like to learn more.
You've spent years of your life working on a multiplayer game, then you launch, and your subreddit is full of complaints about lag. Should you just go and integrate a network accelerator like Network Next to fix it?
Absolutely not.
Lag doesn't always mean latency. These days players use "lag" to describe anything weird or unexpected. The player might just have had low frame rate, whiffed an attack, or got hit when they thought they blocked.
Before considering a network accelerator for your multiplayer game, it's important to first exclude any causes of "lag" that have nothing to do with the network. A network accelerator can't fix these problems. That's up to you.
In this article I share a comprehensive list of things I've personally experienced masquerading as lag in production, and what you can do to fix them. I hope this helps you fix some cases of "lag" in your game!
Problem: You're sending your game network traffic over TCP.
While acceptable for turn-based multiplayer games, TCP should never be used for latency sensitive multiplayer games. Why? Because in the presence of latency and packet loss, head of line blocking in TCP induces large hitches in packet delivery. There is no fix for this when using TCP, it's just a property of reliable-ordered delivery.
Note that if you're using WebSockets or HTTP for your game, these are built on top of TCP and have the same problem. Consider switching to WebRTC which is built on top of UDP.
As a side-note, it's possible to terminate TCP at the edge, so the TCP connection and any resent packets are only between the player and an edge node close to them. While this improves things, it doesn't fix everything. Some players, especially in South America, the Middle East, APAC and Africa, will still get high latency and packet loss between their client and the edge node, causing hitches. For example, I've seen players in São Paulo get 250ms latency and 20% packet loss to servers in... São Paulo. The internet is just not great at always taking the lowest latency route, even when the server is right there.
Solution: Use a UDP-based protocol for your game. UDP is not subject to head of line blocking and you can design your network protocol so any essential data is included in the very next packet after one packet is lost. This way you don't have to wait for retransmission. For example, you can include all unacked inputs or commands in every UDP packet.
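As a sketch of the redundancy idea (names and sizes here are illustrative, not from any particular game): keep a sliding window of recent inputs and write every input newer than the last one the server acked into each outgoing packet, so a single lost packet costs nothing:

```c
#include <stdint.h>
#include <string.h>

#define INPUT_WINDOW 64 // sliding window of recent inputs, indexed by sequence

typedef struct
{
    uint64_t sequence;
    uint8_t buttons; // stand-in for whatever your per-tick input state is
} input_t;

// write all inputs from first_sequence..last_sequence into the packet.
// the server applies any it hasn't seen yet and ignores the rest
int write_redundant_inputs( uint8_t * packet, const input_t * window, uint64_t first_sequence, uint64_t last_sequence )
{
    int bytes = 0;
    for ( uint64_t sequence = first_sequence; sequence <= last_sequence; sequence++ )
    {
        const input_t * input = &window[ sequence % INPUT_WINDOW ];
        memcpy( packet + bytes, &input->sequence, 8 );
        bytes += 8;
        packet[bytes++] = input->buttons;
    }
    return bytes;
}
```

Each packet a few bytes bigger, but as long as any one packet in the window gets through, the server has every input.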
Problem: You have low frame rate on your client.
This is something I see a lot on games that support both current and previous gen consoles. On the older consoles, clients can't quite keep up with simulation or rendering and frame rate drops. Low frame rate increases the delay between inputs and corresponding action in your game, which some players feel as lag, especially if the low frame rate occurs in the middle of a fight.
Solution: Make sure you hit a steady 60fps on the client. Implement analytics that track client frame rate so you can exclude it as a cause of "lag". Be aware that it's common for new content to be released post-launch (like a new level or items) that tank frame rate, especially on older platforms, so make sure you launch with instrumentation that lets you track this over time and have enough contextual information so you can find cases where particular characters, levels or items cause low frame rate.
Problem: You have sporadic hitches on the client or server.
Hitches can happen in your game for many reasons. There could be a delay loading in a streaming asset, contention on mutex, or just a bug where some code takes a lot more time to run than expected.
Hitches on the client interrupt play and cause a visible stutter that players feel as lag. Long hitches can also shift the client back in time relative to the server, causing inputs to be delivered too late for the server to use them, leading to rollbacks and corrections on the client.
Hitches on the server are worse. Packets sent from the server are delayed, causing a visible stutter across all clients. If the hitch is long enough, clients are shifted forward in time relative to the server, causing inputs from clients to be delivered too early for the server to use them – once again leading to rollbacks and corrections.
Solution: Implement debug functionality to trigger hitches on the client and server, and make sure your netcode handles them well. Track not just low frame rate in your analytics, but also when hitches occur on the client and server. Track the total amount of hitches across your game over time so you can see if you've broken something, or fixed them in your latest patch.
Problem: You get bad performance because you are running too many game instances on a bare metal machine or VM.
Capacity planning is an important part of launching a multiplayer game. This includes measuring the CPU, memory and bandwidth used per-game instance, so you can estimate how many game server instances you can run per-machine.
This becomes especially difficult when your game runs across multiple clouds, or a mix of bare metal and cloud, for example with a hosting provider like Multiplay. Now your server fleet is heterogeneous, a mix of different server types, each with different performance.
I've seen teams address this differently. Some teams prefer to standardize on a single VM as their spec with a certain amount of CPU and memory allocated, and refuse to think about multithreading at the game instance level. Other teams run multiple instances in separate processes on the same machine, trusting the OS to take care of it. I've also seen teams run multiple instances of the server in the same process with shared IO that distributes packets to game instances pinned to specific CPUs. This gives developers greater control over scheduling between game instances, but if the multi-instance server process crashes, all instances of the game server go down.
Solution: Make sure you have analytics visibility not only at the game server instance level, but also to the underlying bare metal or VM. If you see game instances running on the same underlying hardware having low frame rate or hitches at the same time, check to see if it's caused by having too many game server instances running on that machine.
Problem: Random bad performance on some server VMs.
You might have a noisy neighbor problem. Another customer is using too much CPU or networking resources on a hardware node shared with your VM.
Solution: Configure a single tenant arrangement with your cloud provider, or rent larger instance types that map to the actual hardware and slice them up into multiple game server instances yourself.
Problem: A bare metal machine in your fleet is overheating, causing bad performance for all matches played on it.
I've seen this frequently, and although you might think that bare metal hosting providers like Multiplay would perform their own server health checks for you, the "trust, but verify" adage definitely applies here.
I've seen bare metal servers in the fleet that overheat as matches are played on them because their fan is broken. Players keep getting sent to them in the overheated state and they experience hitching, low frame rates and bad performance that looks a lot like lag.
Solution: Implement your own server health checks to identify badly performing hardware, not only when it's initially added to the server pool, but also if it starts performing badly mid-match. Remember that if something can go wrong, with enough game servers in your fleet, it probably will. Automatically exclude bad servers from running matches until you can work with your hosting company to resolve the problem.
Problem: You have a low tick rate on your server.
Did you know that on average more latency comes from the server tick rate than the network?
For example, Titanfall 1 shipped with 10HZ tick rate. This means that a Titanfall 1 game server steps the world forward in discrete 100ms steps. A packet arriving from a client just after the server pumps recvfrom at the start of the frame has to wait 100ms before being processed, even if the client is sending packets to the server at a high rate like ~144HZ.
This quantization to server tick rate adds up to 100ms of additional latency on top of whatever the network delay is. The same applies on the client. For example, a low frame rate like 30HZ adds up to 33ms additional latency.
Solution: Increase your client frame rate and server tick rate as much as possible. At 60HZ ticks, the server adds only 16.67ms of additional latency worst case. At 144HZ, a maximum of ~7ms of additional latency is added. To go even further you can implement subticks, which means you process player packets immediately when they come in from the network, without quantizing socket reads to your server tick rate. See this article for details.
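The arithmetic behind those numbers is simple: a packet that arrives just after the server reads the socket waits one full tick before being processed, so the worst-case added latency is one tick period. (This is a rough model; a subtick server processes packets as they arrive instead.)

```c
// worst-case latency added by quantizing packet processing to the tick rate:
// a packet arriving just after recvfrom waits one full tick period
double worst_case_added_latency_ms( double tick_rate_hz )
{
    return 1000.0 / tick_rate_hz;
}
```

10HZ gives 100ms, 60HZ gives ~16.7ms, and 144HZ gives ~6.9ms.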
Problem: Your team spent the entire development play testing on LAN with zero latency, no packet loss and minimal jitter. Suddenly the game launches, and everything sucks, because now players are experiencing the game under real-world network conditions.
All multiplayer games play great on the LAN. To make sure your game feels great at launch, it's essential that you playtest with realistic network conditions during development.
Of course, this frustrates designers because now the game doesn't feel as good. But as my good friend Jon Shiring says: Yes. Get mad. The game doesn't play well. So change the game code and design so it does, because otherwise you're fooling yourself.
Solution: Playtests should always occur under simulated network conditions matching what you expect to have at launch. I recommend at least 50ms round trip latency, several frames worth of jitter, and a mix of steady packet loss at 1% combined with bursts of packet loss at least once per-minute. The designers must not have a way of disabling this. You have to force the issue.
Problem: Network conditions change mid-match leading to problems with your time synchronization and de-jitter buffers on client and server.
I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion. I watched C-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain.
If it can go wrong, with enough players, it will. The internet is best effort delivery only. I've seen many matches where players initially have good network performance and it turns bad mid-match.
Latency that goes from low to high and back again, jitter that oscillates between 0 and 100ms over 10 seconds. Weird stuff that happens depending on the time of day. Somebody microwaves a burrito and the Wi-Fi goes to hell.
Solution: Implement debug functionality to toggle different types of bad network performance during development. Toggle between 50ms and 150ms latency. Toggle between 0ms and 100ms jitter. Make sure your time synchronization, interpolation buffers and de-jitter buffers are able to handle these transitions without freaking out too much. Add analytics so you can count the number of times packets are lost because they arrive too late for your various buffers. Be less aggressive with short buffer lengths and accept that a wide variety of network conditions demands deeper buffers than you would otherwise prefer.
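A development-build network simulator doesn't need to be fancy. Something along these lines (a sketch, not any shipping implementation) is enough to exercise your buffers; toggle the parameters at runtime to test the transitions:

```c
#include <stdlib.h>

// toy network condition simulator for development builds. toggle the values
// at runtime, eg. 50ms <-> 150ms latency, 0ms <-> 100ms jitter, loss on/off
typedef struct
{
    double latency_ms;
    double jitter_ms;
    double packet_loss_percent;
} network_simulator_t;

static double random_01( void )
{
    return (double) rand() / ( (double) RAND_MAX + 1.0 ); // [0,1)
}

// returns -1.0 to drop the packet, otherwise the number of milliseconds
// to hold it before delivering it to the game
double simulate_packet( const network_simulator_t * sim )
{
    if ( random_01() * 100.0 < sim->packet_loss_percent )
        return -1.0;
    double jitter = ( 2.0 * random_01() - 1.0 ) * sim->jitter_ms;
    return sim->latency_ms + jitter;
}
```

Hook this in at the socket send/receive layer so the rest of the game code has no way to know it's being lied to, and make sure designers can't turn it off.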
Problem: The player sees occasional pops and warps because the client and server player code aren't close enough in sync.
In the first person shooter network model with client side prediction, the client and server run the same player simulation code, including movement, inventory and shooting, for the same set of delta times and inputs, expecting the same (or a close enough) result on both sides.
If the client and server disagree about the player state, the server wins. The correction on the client in this case can feel pretty violent. Pops, warps and being pulled back to the correct state over several seconds can feel a lot like "lag".
Solution: It's vital that your client and server code are close enough in sync for client side prediction to work. I've seen some games only do the rollback if the client and server state in the past are different by some amount, eg. only if client and server positions are different by more than x centimeters. Don't do this. Always apply the server correction and re-simulation, even if the result will be the same. This way you exercise the rollback code all the time, and you'll catch any desyncs and pops as early as possible. Instrument your game so you track when corrections result in a visible pop or warp for the player, so you know if some code you pushed has increased or decreased the frequency of mispredictions.
Problem: The player shoots and hits another player but doesn't get credit for the shot.
Nothing gets serious first person shooters screaming "lag" faster than filling another player with bullets but not getting credit for the kill.
Because of this, top-tier first person shooters like Counter-Strike, Valorant and Apex Legends implement latency compensation so players get credit for shots that hit from their point of view.
Behind the scenes this is implemented with a ring buffer on the server containing historical state for all players. When the server executes shared player code for the client, it interpolates between samples in this ring buffer to reconstruct the player's point of view on the server, temporarily moving other players' objects to the positions they were in on the client at the time the bullet was fired.
If this code isn't working correctly, players become incredibly frustrated when they don't get credit for their shots, especially pro players and streamers. And you don't want this. P.S. I once accidentally broke lag comp in Titanfall 2 during development, and the designers noticed the very same day, in singleplayer.
Solution: Implement a bunch of visualizations and debug tools you can use to make sure your lag compensation is working correctly. If possible, implement analytics to detect "fake hits" which are hits made on the client that don't get credit on the server. If these go up in production, you've probably broken lag comp.
Problem: Players are using lag switches.
Somebody is being a dick and using a lag switch in your game. When they press the button, packets are dropped. In the remote view their player warps around and is hard to attack.
Solution: The fix depends on your network model. Buy some lag switches and test with them during development and make any changes necessary so they don't give assholes any advantage.
Problem: Players are getting a bad experience because there aren't any servers near them.
The speed of light is not just a good idea. It's the law.
While a network accelerator certainly can help reduce latency to a minimum, players in São Paulo, Brazil are never going to get as good an experience playing on servers in Miami as they do on servers in São Paulo. Similarly, players in the Middle East could play on Frankfurt, but they'll get a much better experience with servers in Dubai, Bahrain or Turkey.
Solution: Instrument your game with analytics that include the approximate player location (latitude, longitude) using an ip2location database like maxmind.com so you can see where your players are, then deploy servers in major cities near players.
Problem: Your matchmaker sends players to the wrong datacenter, giving them higher latency than necessary.
While players in Colombia, Ecuador and Peru are technically in South America, you shouldn't send them to play on servers in São Paulo, Brazil, because the Amazon rainforest is in the way.
Several games I've worked with end up hardcoding the upper north east portion of South America to play in the Central America region (Miami), instead of sending them to South America (São Paulo). Somewhat paradoxical, but it's true. Don't assume that selecting the correct region from a player's point of view (South America) always results in the best matches.
Solution: Instrument your game with ip2location (lat, long) per-player. Implement analytics that let you look on a per-country basis and see which datacenters players in that country play in, including the % of players going to each datacenter and the corresponding average latency they receive. Keep an open mind and tune your matchmaker logic as necessary.
Problem: A player in New York parties up with their best friend in Sydney, Australia and plays your game.
File this one under "impossible problems". You really can't tell players they can't party up with a friend on the other side of the world, even though it might not be the smartest idea in your hyper-latency sensitive fighting game – you just have to accommodate it as best you can.
Solution: Make your game as latency tolerant as possible. Regularly run playlists with one player having 250ms latency so you know how it plays in the worst case, both from that player's point of view, and from the point of view of other players in the game with them. Make sure it's at least playable, even though it's probably not possible for the experience to be a great one. Sometimes the only thing you can do is put up a bad connection icon above players with less than ideal network conditions, so the person's connection gets the blame instead of your netcode.
Problem: Players are whiplashed between different server locations in the same region, each with radically different network performance, and get a pattern of good-bad-good-bad matches when playing your game.
I once helped a multiplayer game that was load balancing players in South America across game servers in Miami and São Paulo. Players in São Paulo, depending on the time of day and the player pool available, would get pulled up to Miami to play one match at 180ms latency, and then the next match they'd be back on a São Paulo server with 10ms latency. This was happening over and over.
Solution: Latency divergence is the difference between the minimum and maximum latency experienced by a player in a given day. Extend your analytics to track average latency divergence on a per-country or state basis, and if you see countries with high latency divergence, adjust your matchmaker logic to fix it.
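As a sketch, computing this metric could look like the following (Python, with a made-up session format of per-player-day latency samples):

```python
def latency_divergence(samples_ms):
    """Latency divergence: max minus min latency a player saw in a day."""
    return max(samples_ms) - min(samples_ms)

def average_divergence_by_country(sessions):
    """sessions: list of (country, [latency samples in ms]) per player-day.
    Returns country -> average latency divergence, to spot matchmaker
    whiplash between nearby and faraway datacenters."""
    totals = {}
    for country, samples in sessions:
        totals.setdefault(country, []).append(latency_divergence(samples))
    return {c: sum(v) / len(v) for c, v in totals.items()}
```

A player bouncing between a 10ms São Paulo server and a 180ms Miami server shows up immediately as a ~170ms divergence, even though their average latency might look acceptable.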
Problem: "Low ping bastards" have too much of an advantage. Regular players get frustrated and quit your game.
It's generally assumed that players with lower pings have an advantage in competitive games, and will win more often than not in a one-on-one fight with a higher ping player. It's your job to make sure this is not the case.
Ideally, your game plays identically within some window: [0,50ms] latency at a minimum, and perhaps not quite as well but still acceptably at [50ms,100ms], only degrading above 100ms.
In this case, the goal is to make sure that the kill vs. death ratios for latency buckets below 100ms are close to identical. Players above 100ms might have some disadvantage, but do your best to ensure that disadvantage isn't too large.
Solution: Instrument your game with analytics that track key metrics like damage done, damage taken, kills, deaths, wins, losses, and bucket them according to player latency at the time of each event. Look for any trend that indicates an advantage for low latency players and fix it with game design and netcode changes.
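A minimal sketch of the bucketing in Python (the event format and the 50ms bucket width are assumptions for illustration):

```python
def kd_by_latency_bucket(events, bucket_ms=50):
    """events: list of (latency_ms, kind) where kind is 'kill' or 'death'
    and latency_ms is the player's latency at the time of the event.
    Returns bucket -> kill/death ratio, to spot low-ping advantage."""
    kills, deaths = {}, {}
    for latency, kind in events:
        bucket = int(latency // bucket_ms) * bucket_ms
        if kind == 'kill':
            kills[bucket] = kills.get(bucket, 0) + 1
        else:
            deaths[bucket] = deaths.get(bucket, 0) + 1
    # guard against divide by zero in buckets with no deaths
    return {b: kills.get(b, 0) / max(1, deaths.get(b, 0))
            for b in set(kills) | set(deaths)}
```

If the ratio for the [0,50ms) bucket is consistently higher than the [50ms,100ms) bucket, low ping players have an advantage that needs fixing.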
Problem: Really high ping players have too much advantage.
This problem is common in first person shooters with lag compensation. In these games, the server reconstructs the world as seen by the client (effectively in the past) when they fired their weapon.
The problem here is that if a player is really lagging with 250ms+ latency, they are shooting so far in the past, they end up having too much of an advantage over regular players, who feel that they are continually getting shot behind cover and cry "lag!".
Solution: Cap your maximum lag compensation window. I recommend 100ms, but it should probably be no more than 250ms worst case. It's simply not fair for high latency players to be able to shoot other players so far in the past.
Problem: You chose the wrong network model for your game.
If you choose the wrong network model, some problems that feel like lag can't really be solved. You'll have to re-engineer your entire game to fix it, and until then your players are left with a subpar experience.
Two classic examples would be:
Solution: Read this article before starting your next multiplayer game.
Problem: You have tightly coupled player-player interactions but don't have any input delay.
You can't client side predict your way out of everything.
If your game has mechanics like block, dodge and parry, chances are that players are going to notice inconsistencies in the local prediction vs. what really happened on the server and cry "lag".
Solution: Consider adding several frames of input delay to improve consistency in games with tightly coupled player-player interactions. Major fighting games do this. Don't feel bad about it.
Problem: Gameplay elements are causing players to feel lag.
Legend has it that the Halo team at Bungie would perform playtests with a lag button that players could smash anytime they felt "lag" in the game.
Lag button usage was recorded and timestamped, so after the play test they could go back and watch a video of the player's display leading up to the button press to see what happened.
Sometimes the lag button would correspond to putting up a shield but getting hit anyway. So the Bungie folks would go and fix this, and then test again. And again. And again. Until eventually the players would only press the lag button when they died.
Solution: When lag corresponds only to dying in the game, you know that you have truly won the battle!
How's that for zen?
I'm Glenn Fiedler and welcome to Más Bandwidth, my new blog at the intersection of game network programming and scalable backend engineering.
A good matchmaker is vitally important to your multiplayer game experience.
But it's difficult to test your matchmaker properly ahead of launch. Testing often misses problems that arise with real-world distributions of players, and you really need to be 100% confident that your matchmaker will do a good job from day 1.
In this article I present a matchmaking simulator built on player data captured from a real game. This data set reconstructs one virtual day's worth of player joins across the world with the correct distribution of (latitude, longitude) according to the time of day – accurately reproducing the timing and pattern of player joins you'll experience at launch.
Create a matchmaker that quickly finds low latency matches. Ideally, players find a good quality match almost instantly.
The game plays identically from 0 - 50ms RTT by design, so there is no point preferring a 10ms server over 50ms. From 50 - 100ms the game still plays well, but 50ms or below is preferable. Above 100ms performance starts to degrade.
The server fleet for the game is distributed across datacenters around the world. If there are multiple server hosting companies in the same city, each is considered to be logically distinct because they can have different network performance.
There are 4 players per-match.
Each datacenter has its own matchmaking queue. Players can be in multiple datacenter queues at the same time. Every second the set of datacenters is walked in random order.
For each datacenter, the matchmaker shuffles the queue to avoid the same players being matched together repeatedly (more should probably be done to avoid this), then walks the queue and assembles players into groups of 4. After all datacenters are processed, any leftover players feed into the next iteration.
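One iteration of a single datacenter queue, as described above, could be sketched like this in Python:

```python
import random

def run_queue(queue, players_per_match=4, rng=None):
    """One matchmaker iteration for one datacenter queue: shuffle to avoid
    repeat pairings, then assemble groups of players_per_match. Leftover
    players feed into the next iteration."""
    rng = rng or random.Random(0)
    rng.shuffle(queue)
    matches = [queue[i:i + players_per_match]
               for i in range(0, len(queue) - players_per_match + 1,
                              players_per_match)]
    leftover = queue[len(matches) * players_per_match:]
    return matches, leftover
```

Ten players queued at one datacenter yields two full matches of 4, with two players left over to try again on the next pass (or in another datacenter's queue).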
For each new player entering the matchmaking pool:
For players in state NEW:
Players in IDEAL state should quickly find a game and go to PLAYING state.
Players in EXPAND state should quickly find a game and go to PLAYING state.
Players in WARMBODY state donate themselves to all matchmaking queues. They just need to play somewhere.
Players in PLAYING or BOTS state go to BETWEEN MATCH state at the end of the match.
Players in BETWEEN MATCH go to NEW state once the time between matches has elapsed (30 seconds).
IMPORTANT: Break up players at the end of each match and send them back to matchmaking. Otherwise, little islands of players are formed and when you are in the down sloped portion of the CCU curve, new players won't find matches.
The key input into the algorithm is the estimated round trip latency between a player and each datacenter. This is what guides players to low latency datacenters, so it's important that it's accurate. If the input is not accurate, the matchmaker will send players to the wrong place.
One option is to use the haversine formula to calculate the distance between the player and datacenter, then use the speed of light in fiber optic cables (2/3rds speed of light in a vacuum) plus some fudge factor to estimate the round trip time in milliseconds.
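A sketch of that estimate in Python; the 1.25 fudge factor is a hypothetical value, not a tuned one:

```python
import math

def estimate_rtt_ms(lat1, lon1, lat2, lon2, fudge=1.25):
    """Estimate round trip time between player and datacenter using the
    haversine great-circle distance and the speed of light in fiber
    (~2/3 the speed of light in a vacuum)."""
    R = 6371.0  # earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    distance_km = 2 * R * math.asin(math.sqrt(a))
    speed_km_per_ms = 200.0  # ~2/3 of 300,000 km/s, expressed in km per ms
    return 2 * distance_km / speed_km_per_ms * fudge
```

For example, New York to Los Angeles is roughly 3,900km great-circle, which this model estimates at around 50ms RTT.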
The problem with this approach is that in many cases the best datacenter isn't actually the closest. A great example is players in Peru, for whom the closest datacenter by distance is often São Paulo, but since there are no fiber optic cables directly through the Amazon rainforest, it's not actually the lowest latency datacenter for them to play on.
One way to get around this is to put ping servers in each datacenter you host game servers in, and send low frequency pings to all ping servers on game startup. This way the round trip time to each datacenter can be determined in just one second.
But even this approach has problems. The route packets take over the internet often depends on the hash of the source and destination IP addresses and port numbers, and the game server the player will ultimately connect to has a different IP address and port from the ping server in the datacenter.
This can lead to the best datacenter for that player being excluded because it has false positive high RTT due to a bad route between the client and the ping server that doesn't occur when the client connects to the game server. Conversely, the ping server RTT might look good, but when the player connects to the game server, the route has a much higher RTT. See this article for more details on why this happens.
Because of this, my preferred approach is to calculate latency maps as greyscale image files per-datacenter in a batch process run daily over the last 30 days. This lets you look up the RTT to a datacenter in constant time, and because it's an average, the RTT value isn't subject to internet fluctuations.
This way the player is always sent to the datacenters with the best chance of having low latency. I always pair this with a network accelerator like Network Next to fix any bad routes between the game client and server.

You can see the full source code and dataset for the matchmaking simulator here.
It takes an average of two seconds to find a match, with 30-40ms average RTT depending on the time of day. It seems that the matchmaker is working well.

To confirm correct operation, let's visualize the set of active players in matches, so we can verify the distribution of players is correct. It looks good!

While developing the simulator I found that for the same set of players joining, the peak CCU is incredibly sensitive to the % chance that a player will play another match.
What can you do to increase this % for your game?
Consider this scenario. You've just spent the last 3 years of your life working on a multiplayer game. Tens of millions have been spent developing it, plus it's competitive, so you still have the cost of dedicated servers ahead of you.
You've planned your server locations, budgeted for server costs, and considering the cost of egress bandwidth from clouds and the amount of bandwidth your game uses, you've selected an appropriate mix of bare metal and cloud.
You have retention, engagement and monetization strategies in place so you should be able to pay for your servers and turn a healthy profit, as long as the game is sticky enough and players don't churn out.
Launch is coming up. You're nervous, but at the same time you're confident you've done everything you can:
There's an initial burst of players and success is within your grasp! But, retention isn't where it needs to be, players are churning out, and your subreddit is full of complaints about lag.
Players are furious and demand that you "fix the servers".
You look at your server bill and you're pretty sure they're not low quality servers.
A chorus grows, and the engagement and monetization metrics come in. They're not tracking where they need to be. Your game is at risk of failure.
You look at your analytics and see that:
But players are still complaining about lag. What's going on?
Approximately 50% of players have above average latency.
You generate the distribution of latency around the world, sampled every 10 seconds during play and sum it up according to the amount of play time spent at each latency bracket:

It's a right-skewed distribution rather than a normal distribution and it has a long tail. This means that a significant percentage of playtime is spent at latencies above where your game plays best.
Although packet loss on average is 0.15%, you find that some players have minimal or no packet loss, and many players have intermittent bursts of packet loss. Only rarely do you see a player with high packet loss for the entire game.
Because most of the time packet loss is zero and it occurs in clumps, the average packet loss % is not really representative of its effect on your game. You change your metric to count the number of packet loss spikes in player sessions and it's clear that packet loss is having a significant impact.
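The spike-counting metric can be as simple as counting contiguous runs of above-threshold loss samples; a sketch in Python (the 5% threshold is an assumption):

```python
def count_loss_spikes(loss_samples, threshold=0.05):
    """Count packet loss spikes in a session: each contiguous run of
    samples (say, per-second loss fractions) above the threshold counts
    as one spike. Captures clumped loss that a session average hides."""
    spikes, in_spike = 0, False
    for loss in loss_samples:
        if loss > threshold and not in_spike:
            spikes += 1       # entering a new spike
            in_spike = True
        elif loss <= threshold:
            in_spike = False  # spike has ended
    return spikes
```

A session with two short bursts of 20-30% loss averages out to almost nothing, but counts as two spikes here, which matches how the player actually experienced it.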
Looking at jitter, you see that many people have perfectly reasonable jitter (<16ms), but others have large spikes of jitter mid game, and some players have high jitter for the entire match. Interestingly, although average jitter on Wi-Fi is more than twice that of wired, many players you see with high jitter are playing over wired connections.
Digging in further:
You start looking on a per-player basis and you notice:
From this point of view, it's fair to say it's affecting your entire player base.
It's easy to dismiss all of this as just issues at the edge. Perhaps some players are just unlucky and have bad internet connections. But if that were the case, why is a player's connection good one day, but terrible the next?
The thing to understand here is that the Internet is not the system that we imagine it to be, one that consistently delivers packets with the lowest latency. Instead, it's more like an amorphous blob comprised of 100,000+ different networks (ASNs) that don't really coordinate in any meaningful way to ensure that packets always take the lowest latency path. Packets take the wrong path, routes are congested, and misconfigurations and hairpins happen all the time.
The bad network performance is a property of the Internet itself.
You can confirm this yourself by running an experiment. Host your game servers with different hosting companies. Measure performance for each hosting company across your player base (you'll need 10k+ CCU over a month to reproduce this result). No matter which hosting company you choose, you'll find roughly 10% of players worldwide getting bad network performance at any time, and around 60% of players get bad network performance every month.
This is a lot to take in. The internet is not an efficient routing machine and makes mistakes. Yes, everything is distributed and nobody is to blame, but the end result is clear – best effort delivery doesn't take any amount of care to make sure your game packets are delivered with the lowest possible latency, jitter and packet loss.
Let's build up this understanding with an example, so you can see how easy it is for the internet to make bad routing decisions, without any ill intent.
Imagine you have three ISPs at your home:
You want to use all your internet connections, you're paying for them after all, so you set up multi-WAN and your router distributes connections across all three ISPs using a hash of the destination IP address and ports.
Then you play some multiplayer games and quickly notice that around 1 in 3 play sessions feel great, but the rest just aren't as good. What's going on? Simple. When you finish each match, you connect to a new server and your destination IP address and port change, resulting in a different hash value. This hash value modulo 3 is used by your router to select which ISP to use, and each of your ISPs has different properties; they don't have exactly the same network performance!
In this case, it's an easy fix. You create a new router rule to map UDP traffic to your fiber internet connection, splitting the rest between cable internet and Starlink. Problem solved!
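The hashing behavior is easy to model. A sketch in Python, assuming the router hashes the flow 4-tuple and takes it modulo the number of links:

```python
import hashlib

def pick_link(src_ip, src_port, dst_ip, dst_port, num_links=3):
    """Which WAN link a flow hashes to: hash the flow 4-tuple, then take
    it modulo the number of links. Connecting to a new server (new
    destination IP/port) can land the flow on a different link with
    different network performance."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links
```

The choice is deterministic per flow, so one match stays on one link, but the next match (different destination) can hash to a different link entirely, which is why session quality varies match to match.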
Now step back even further. Imagine you're an ISP. You have multiple upstream transit provider links that can be used, and you have no way to know which one is really best for the destination IP address. In fact, your incentive is perverse: get the packet off your network as quickly and cheaply as possible, so it doesn't cost you money.
Even worse, it's now basically impossible to detect which applications are games and which are throughput oriented applications like large downloads, so even if you did know which upstream links were the fastest for a particular destination, you wouldn't know which packets should be sent down them, and you can't just send all UDP traffic down the fastest link.
So you load balance. You take a hash and you modulo n.
Extend this for every hop in the trace route between your client and server.
And now you understand why the Internet has such strange behavior, where bad network performance moves around like weather and players get good network performance one match, and terrible the next.
As game developers, we all want players to have the best experience: the lowest possible latency, jitter and packet loss and consistency from one match to the next.
And it's no surprise that bad network performance frustrates and churns players, so there's a financial incentive for us to solve this as well.
There's even an existence proof for a solution. In the early days of League, bad network performance was so severe that Riot built their own private Internet, Riot Direct to fix it.
Today, Riot Direct accelerates network traffic for League of Legends and Valorant and helps protect against DDoS attacks, greatly improving the player experience. I think in this case the results clearly speak for themselves.
For everybody else who can't afford to build their own private internet, there's a few things we can do:
Network Next is my startup. We implement all of the strategies above to fix bad network performance for your multiplayer game. Not only do we see huge improvements around the world in regions like South America, Central America, Middle East and Asia Pacific, but also in the USA, Canada and Europe. Bad network performance is everywhere.
Network Next is available as an open source SDK in C/C++ or as a drop-in UE5 plugin that integrates with your game and takes over sending and receiving UDP packets.
When a better route is found than the default internet route (subject to hashing and modulo n), the SDK automatically steers your game packets across relays to fix it, including multipath if you enable it – to reduce latency, jitter and packet loss for your players.
The Network Next backend is implemented in Golang and runs in Google Cloud. It's load tested up to 25 million CCU and is extremely robust and mature. We've been doing this since 2017 and have accelerated more than 50 million unique players, including professional play for ESL Counterstrike Leagues (we are the technology behind ESEA FastPath).
We can run a Network Next backend and relay fleet for your game with white glove service for a small monthly fee, or for larger studios you can license the backend and get full source code and operate the whole system yourself.
Please contact us if you'd like to try out Network Next with your game.
I'm fresh back from lovely Malmö, Sweden where I hung out with the team from Coherence at their offices for the week around the Nordic Game conference.
I'm happy to report that Coherence is an excellent multiplayer product for Unity with a very strong and talented team behind it. I spent a lot of time with their CTO Tadej (a close personal friend of mine) and over the week we had many discussions about network models and the pros and cons for the networking strategy they have chosen for their product.
And this leads to a very important point that I think most people are unaware of.
There are just so many different ways to network a game.
In the industry we call these network models. Each network model represents a different strategy for keeping your game in sync, and comes with significant pros and cons. Something absolutely trivial in one network model might be difficult or even impossible to implement in another, so it's extremely important to choose the right network model for your game.
So if you're a game developer trying to decide which network model to use – read on, and I'll share some helpful tips to help you pick the correct network model for your next game!
Let's start with the key inputs that help you make a decision.
These include:
In this network model objects are distributed across player machines to achieve beneficial things like latency hiding and load balancing.
At its simplest, input delay is removed by simulating each player locally on their own machine, making each client have "authority" (acting as the server) for their own player character. Non-player objects like AIs can also be distributed, sharing the workload of simulating the world across all player machines.
This approach can also be extended to take authority over vehicles the player drives, turrets manned by the player, physics objects picked up and thrown by players, and even stacks of objects touched by the player or other objects under the authority of that player.
Examples:
When to consider:
When to exclude:
Pros:
Cons:
Further reading:
In this network model the game runs exclusively on the server. Each client sends inputs to the server, and displays an interpolated view of the world reconstructed from a time series of snapshots sent back to it (state of all visual objects in the world at time t).
Examples:
When to consider:
When to exclude:
Pros:
Cons:
Further reading:
Start with the Quake network model and modify it such that each player is now in their own time stream. Now, instead of the whole world stepping forward together in fixed time steps, step players forward on the server only when player input + delta time is received from their client.
All remote objects are still interpolated on the client (and possibly extrapolated although I prefer not to), but local player objects are special cased and run with full simulation using local player inputs on the client machine. This removes the feeling of latency on player actions, eg. players can now move and shoot with no delay.
To keep the server authoritative, the server regularly sends back "deep state" corrections (containing internal non-visual state, eg. inventory, # of bullets in ammo clip, current weapon, firing delay between shots and so on) to each client, and the client applies that correction (which is effectively in the past) and invisibly re-simulates back up to current predicted time on the client. This is called client side prediction, or outside of game development, optimistic execution with rollback.
Weapon firing and damage are applied exclusively on the server. Predicted weapon firing on the client is purely cosmetic. Since the client and the server have a strong concept of shared time, the server is able to reconstruct the state of the world (lag compensation) from the client's point of view when they fired their weapon, crediting hits to the player according to their point of view.
Reconstructing the world according to player point of view on the server is done using a ring buffer containing all visual state for the last ~second (eg. position, orientation and bone orientations) sent down to the client in snapshots and interpolating between them to match the interpolated state displayed on the client.
Lag compensation avoids needing to lead targets and creates more precision in firing shots. For example, if another player is running and you shoot them in the knee at a specific point in their run cycle, it will hit. The cost is that the player being shot sometimes feels that, from their point of view, they were shot from behind cover, because effectively the shot was fired in their past.
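A minimal sketch of the server-side history buffer in Python, using 1D positions for brevity (real code stores full transforms and bone orientations):

```python
class LagCompensationHistory:
    """Ring buffer of historical player state on the server. To judge a
    shot fired at client view time t, interpolate between the two samples
    bracketing t to reconstruct where the target was on the shooter's
    screen."""

    def __init__(self, max_samples=60):
        self.max_samples = max_samples
        self.samples = []  # list of (time, position), oldest first

    def record(self, time, position):
        self.samples.append((time, position))
        if len(self.samples) > self.max_samples:
            self.samples.pop(0)  # drop the oldest sample (ring buffer)

    def position_at(self, t):
        """Interpolated position at time t; clamped outside the window."""
        if t <= self.samples[0][0]:
            return self.samples[0][1]
        for (t0, p0), (t1, p1) in zip(self.samples, self.samples[1:]):
            if t0 <= t <= t1:
                alpha = (t - t0) / (t1 - t0)
                return p0 + alpha * (p1 - p0)
        return self.samples[-1][1]
```

Capping the maximum rewind (how far back `t` is allowed to reach) is also where you'd enforce the lag compensation window discussed earlier.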
Examples:
When to consider:
When to exclude:
Pros:
Cons:
Further reading:
Instead of having remote objects on the client interpolated as in the Quake model, a subset of the simulation runs on the client for remote objects, such that objects continue simulating forward on the client between network updates.
Object updates are prioritized such that not every object is included in each packet, and objects are updated at varying rates. Typically, objects closer to the player are updated more frequently. This is done by having a priority accumulator per-object, and increasing the priority accumulator value for an object each frame it is not sent, scaled by some multiplier (which can be some function of the type of object, the state of the object and how far away it is from the player, and so on.)
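The priority accumulator scheme can be sketched in a few lines of Python (the object format here is hypothetical):

```python
def select_objects(objects, budget):
    """Priority accumulator sketch: each frame, every object's accumulator
    grows by its priority multiplier; the top `budget` objects go in the
    packet and have their accumulators reset to zero.
    objects: dict name -> {'accumulator': float, 'priority': float}."""
    for obj in objects.values():
        obj['accumulator'] += obj['priority']
    sent = sorted(objects, key=lambda n: objects[n]['accumulator'],
                  reverse=True)[:budget]
    for name in sent:
        objects[name]['accumulator'] = 0.0  # reset objects that were sent
    return sent
```

Objects with higher multipliers (e.g. closer to the player) are sent more often, but low priority objects still accumulate and eventually win a slot, so nothing starves.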
Because the client view of the remote players is no longer as accurate as in games like Counterstrike, you usually need to lead the shots. Sometimes this is compensated for by giving the client credit for some shots that are "credible" within some amount of tolerance of the server position, although this is not as robust as the lag compensation solution in Counterstrike style games.
This is the default mode of networking in Unreal Engine.
Examples:
When to consider:
When to exclude:
Pros:
Cons:
Further reading:
In this network model players connect directly to each other peer-to-peer and send inputs or commands to each other. Peers only step the simulation forward when they have received inputs/commands from all players for the current frame, hence the origin of the term "lock-step". In order for this approach to work, the game must be completely deterministic such that a checksum of game state at the end of the frame can be compared between peers at the end of each frame to check for desync.
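A toy Python sketch of the frame-end checksum comparison, with a trivially deterministic "simulation" standing in for the real game:

```python
import zlib

def step(state, inputs):
    """Toy deterministic simulation: each player's input moves their unit."""
    return {player: pos + inputs.get(player, 0)
            for player, pos in state.items()}

def checksum(state):
    """Checksum of game state, compared between peers each frame to
    detect desync. Sorting makes the encoding order-independent."""
    return zlib.crc32(repr(sorted(state.items())).encode())

def lockstep_frame(state_a, state_b, inputs):
    """Both peers step with the same inputs; checksums must match."""
    state_a, state_b = step(state_a, inputs), step(state_b, inputs)
    return state_a, state_b, checksum(state_a) == checksum(state_b)
```

In a real game the hard part is keeping `step` bit-exact across machines (floating point, iteration order, uninitialized memory); the checksum only tells you *that* you desynced, not why.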
Pros:
Cons:
When to avoid:
Examples:
Start with the deterministic lockstep model, but instead of connecting players peer-to-peer, have them exchange packets via a relay server. Give this relay server a concept of time instead of just being a dumb reflector. Now if a client tries using a lag switch or messes around with input timing too much, it can be detected and they can be banned from the game.
Examples:
When to consider:
When to avoid:
Pros:
Cons:
All the benefits of deterministic lockstep without the lag. Each frame, the client processes frames with inputs from the relay server up to the most recent server time. Then it copies the game state (forks it) and simulates the copy forward with local inputs to the present (predicted) client time. Continue predicting the fork until the next update arrives from the network, then discard the predicted fork, rinse and repeat.
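A sketch of the fork-and-predict step in Python, with a toy deterministic simulation standing in for the real game:

```python
import copy

def advance(state, frame_inputs):
    """Toy deterministic step: apply one frame of inputs to the state."""
    return {p: v + frame_inputs.get(p, 0) for p, v in state.items()}

def predicted_view(confirmed_state, confirmed_frame,
                   local_inputs_by_frame, client_frame):
    """Predicted lockstep sketch: fork the confirmed state received from
    the relay server, then re-simulate forward with local inputs up to
    the current client frame. The fork is discarded and rebuilt when the
    next server update arrives."""
    fork = copy.deepcopy(confirmed_state)  # never mutate confirmed state
    for frame in range(confirmed_frame + 1, client_frame + 1):
        fork = advance(fork, local_inputs_by_frame.get(frame, {}))
    return fork
```

The key property is that the confirmed state is never touched: prediction always runs on a disposable copy, so a server correction just means forking again from newer confirmed state.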
When to consider:
When to exclude:
Pros:
Cons:
Further reading:
Upgrade your relay server so that it doesn't just handle player inputs and time, but it also acts as an invisible player in your game and runs the same simulation with all player inputs. Things that happen in this headless server instance of the game can now securely call out to backend systems and grant players progression and items in the meta game without being vulnerable to modified code or memory hacking on the client.
Examples:
Pros:
Cons:
It's not possible. The two network models are fundamentally different. Pick one.
I'm working on this here, but it's likely to be very expensive to run.
The CPU cost makes this impossible today. Come back in 10 years.
I hate you.
I hate you even more.

Back in high school I played id Software games from Wolfenstein to DOOM and Quake. Quake was truly a revelation. I still remember Q1Test, when you could hide under a bridge in the level and hear the footsteps of a player walking above you.
That was 1996. Today it's 2024. My reflexes are much slower, I have slightly less hair, and to be honest, I really don't play first person shooters anymore – but I'm 25 years into my game development career and now I'm one of the world's top experts in game netcode. I've even been fortunate enough to have worked on several top tier first person shooters: Titanfall and Titanfall 2, and some of my code is still active in Apex Legends.
Taking a look at first person shooters in 2024, there seem to be two main genres: large-scale battle royale games like PUBG, Fortnite and Apex Legends (60-100 players), and team combat games like Counterstrike (5v5), Overwatch (6v6) and Valorant (5v5).
Notably absent are the first person shooters with thousands of players promised by Improbable back in 2014:
"Imagine playing a first-person shooter like Call of Duty alongside thousands of other players from across the world without having to worry about latency. Or a game where your actions can trigger persistent reactions in the universe that will affect all other players. This is what gaming startup Improbable hopes to achieve."
- Wired (https://www.wired.com/story/improbable/)
What's going on? Is it really not possible, or was Improbable's technology just a bunch of hot air? In this article we're going to find out if it's possible to make a first person shooter with thousands of players – and we're going to do this by pushing things to the absolute limit.
"Would someone tell me how this happened? We were the fucking vanguard of shaving in this country. The Gillette Mach3 was the razor to own. Then the other guy came out with a three-blade razor. Were we scared? Hell, no. Because we hit back with a little thing called the Mach3Turbo. That's three blades and an aloe strip. For moisture. But you know what happened next? Shut up, I'm telling you what happened—the bastards went to four blades. Now we're standing around with our cocks in our hands, selling three blades and a strip. Moisture or no, suddenly we're the chumps. Well, fuck it. We're going to five blades."
- The Onion
When scaling backend systems a great strategy is to aim for something much higher than you really need. For example, my startup Network Next is a network acceleration product for multiplayer games, and although it's incredibly rare for any game to have more than a few million players at the same time, we load test Network Next up to 25M CCU.
Now when a game launches using our tech and it hits a few hundred thousand or million players it's all very simple and everything just works. There's absolutely no fear that some issue will show up due to scale because we've already pushed it much, much further than it will ever be used in production.
Let's apply the same approach to first person shooters. We'll aim for 1M players and see where we land. After we scale to 1M players, I'm sure that building a first person shooter with thousands of players will seem really easy.
The first thing we need is a way to perform player simulation at scale. In this context simulation means the game code that takes player inputs and moves the player around the world, collides with world geometry, and eventually also lets the player aim and shoot weapons.
Looking at both CPU usage and bandwidth usage, it's clearly not possible to have one million players on one server. So let's create a new type of server, a player server.
Each player server handles player input processing and simulation for n players. We don't know what n is yet, but this way we can scale horizontally to reach the 1M player mark by having 1M / n player servers.
The assumptions necessary to make this work are:
Player servers take the player input and delta time (dt) for each client frame and step the player state forward in time. Players are simulated forward only when input packets arrive from the client that owns the player, and these packets correspond to actual display frames on the client machine. There is no global tick on a player server. This is similar to how most first person shooters in the Quake netcode model work. For example, Counterstrike, Titanfall and Apex Legends.
Let's assume player inputs are sent on average at 100HZ, and each input is 100 bytes long. These inputs are sent over UDP because they're time series data and are extremely latency sensitive. All inputs must arrive, or the client will see mis-predictions (pops, warping and rollback) because the server simulation is authoritative.
We cannot use TCP for input reliability, because head of line blocking would cause significant delays in input delivery under both latency and packet loss. Instead, we send the most recent 10 inputs in each input packet, thus we have 10X redundancy in case of any packet loss. Inputs are relatively small so this strategy is acceptable, and if one input packet is dropped, the very next packet 1/100th of a second later contains the dropped input PLUS the next input we need to step the player forward.
First person shooters rely on the client and server running the same simulation on the same initial state, input and delta time and getting (approximately) the same result. This is known as client side prediction, or outside of game development, optimistic execution with rollback. So after the player input is processed, we send the most recent player state back to the client. When the client receives these player state packets, it rolls back and applies the correction in the past, and invisibly re-simulates the player back up to present time with stored inputs.
You can see the R&D source code for this part of the FPS here: https://github.com/mas-bandwidth/fps/blob/main/001/README.md
The results are fascinating. Not only is an FPS style player simulation for 1M players possible, it's cost effective, perhaps even more cost effective than a current generation FPS with 60-100 players per-level.
Testing indicates that we can process 8k players per-32 CPU bare metal machine with XDP and a 10G NIC, so we need 125 player servers to reach 1 million players.
At a cost of just $1,870 USD per-month per-player server (datapacket.com), this gives a total cost of $233,750 USD per-month,
Or... just 23.4c per-player per-month.
Not only this, but the machines have plenty of CPU left to perform more complicated player simulations:

Verdict: DEFINITELY POSSIBLE.
First person shooter style player simulation with client side prediction at 1M players is solved, but each client is only receiving their own player state back from the player server. To actually see and interact with other players, we need a new concept, a world server.
Once again, we can't just have one million players on one server, so we need to break the world up in some way that makes sense for the game. Perhaps it's a large open world and the terrain is broken up into a grid, with each grid square being a world server. Maybe the world is unevenly populated and dynamically adjusting Voronoi regions make more sense. Alternatively, it could be sections of a large underground structure, or even cities, planets or solar systems in some sort of space game. Maybe the distribution has nothing to do with physical location.
The details don't really matter, but the important thing is we need to distribute the load of players interacting with other objects in the world across a number of servers somehow, and we need to make sure players are evenly distributed and don't all clump up on the same server.
For simplicity let's go with the grid approach. Assume that design has a solution for keeping players fairly evenly distributed around the world. We have a 10km by 10km world and each grid cell is 1km squared. You would be able to do an amazing persistent survival FPS like DayZ in such a world.
Each world server could be a 32 CPU bare metal machine with a 10G NIC, like the player servers, so they will each cost $1,870 USD per-month. For a 10x10 grid of 1km squared we need 100 world servers, so this costs $187,000 USD per-month, or 18.7c per-player, per-month. Add this to the player server cost, and we have a total cost of 42.1c per-player per-month. Not bad.
How does this all work? First, the player servers need to regularly update the world server with the state of each player. The good news is that while the "deep" player state sent back to the owning client for client side prediction rollback and re-simulation is 1000 bytes, the "shallow" player state required for interpolation and display in the remote view of a player is much smaller. Let's assume 100 bytes.
So 100 times per-second after the player simulation completes, the player server forwards the shallow player state to the world server. We have 10,000 players per-world server, so we can do the math and see that this gives us 10,000 x 100 x 100 = 100 million bytes per-second, or 800 million bits per-second, or 800 megabits per-second, let's round up and assume it's 1Gbit/sec. This is easily done.
Now the world server must track a history, let's say a one second history of all player states for all players on the world server. This history ring buffer is used for lag compensation and potentially also delta compression. In a first person shooter, when you shoot another player on your machine, the server simulates the same shot against the historical position of other players near you, so they match their positions as they were on your client when you fired your shot. Otherwise you need to lead your shot by the amount of latency you have with the server, and this is not usually acceptable in top-tier competitive FPS.
Let's see if we have enough memory for the player state history. 10,000 players per-world server, and we need to store a history of each player for 1 second at 100HZ. Each player state is 100 bytes, so 100 x 100 x 10,000 = 100,000,000 bytes of memory required or just 100 megabytes. That's nothing. Let's store 10 seconds of history. 1 gigabyte. Extremely doable!
Now we need to see how much bandwidth it will cost to send the player state down from the world server directly down to the client (we can spoof the packet in XDP so it appears that it comes from the player server address, so we don't have NAT issues).
10,000 players with 100 byte state each, and we have 1Gbit/sec for player state. This is probably a bit high for a game shipping today (to put it mildly), but with delta compression let's assume we can get on average an order of magnitude reduction in snapshot packet sizes, so we can get it down to 100mbps.
And surprisingly... it's now quite possible. It's still definitely on the upper end of bandwidth to be sure, but according to Nielsen's Law, bandwidth available for high end connections increases each year by 50% CAGR, so in 10 years you should have around 57 times the bandwidth you have today.
Verdict: DEFINITELY POSSIBLE.
So far we have solved both for player simulations with FPS style client side prediction and rollback, and we can see 10,000 other players near us – all with a reasonable cost and requiring 1mbps up and 100mbps down to synchronize player inputs, predicted player state, and the state of other players.
But something is missing. How do players interact with each other?
Typically a first person shooter would simulate the player on the server, see the "fire" input held down during a player simulation step and shoot out a raycast in the view direction. This raycast would be tested against other players in the world and the first object in the world hit by that raycast would have damage applied to it. Then the next player state packet sent back to the client that got hit would include health of zero and would trigger a death state. Congratulations. You're dead. Cue the kill replay.
But now when the player simulation is updated on a player server, in all likelihood the other players that are physically near it in the game world are not on the same player server. Player servers are simply load balanced – evenly distributing players across a number of player servers, not assigning players to player servers according to where they are in the world.
So we need a way for players to interact with each other, even though players are on different machines. And this is where the world server steps in.
The world server has a complete historical record of all other player states going back one second. If the player simulation can asynchronously call out to one or more world servers with "raycast and find first hit", and then "apply damage to <object id>", then I think we can actually make this whole thing work.
So effectively what we have now is something that looks like Redis for game servers: an in-memory world database that scales horizontally, with a rich set of primitives so players can interact with the world and the other objects in it. Like Redis, calls to this world database would be asynchronous. Because of this, player servers would need one goroutine (green thread) per-player. This goroutine would block on async IO to the world database and yield so other player goroutines can do their work while waiting for a response.
Think about it. Each player server has just 8,000 players on it. These players are distributed across 32 CPUs. Each CPU has to deal with only 250 players total, and these CPUs have 90% of CPU still available to do work because they are otherwise IO bound. Running 250 goroutines per-CPU and blocking on some async calls out to world servers would be incredibly lightweight relative to even the most basic implementation of an HTTP web server in Golang. It's totally possible, somebody just needs to make this world database.
Verdict: DEFINITELY POSSIBLE.
Each world server generates a stream of just 100 mbit/sec containing all player state for 10,000 players @ 100HZ (assuming delta compression against a common baseline once every second), and we have 100 world servers.
100 mbit x 100 = 10 gbit generated per-second.
But there are 10,000 players per-world server. And each of these 10,000 players need to receive the same 100mbit/sec stream of packets from the world server. Welcome to O(n^2).
If we had real multicast over the internet, this would be a perfect use case. But there's no such thing. Potentially, a relay network like Network Next or Steam Datagram Relay could be the foundation of a multicast packet delivery system with 100G+ NICs on relays at major points around the world, or alternatively a traditional network supporting multicast could be built internally with PoPs around the world like Riot Direct.
Verdict: Expensive and probably the limiting factor for the moment. Back of the envelope calculations show that this increases the cost per-player per-month from cents to dollars. But if you already had the infrastructure available and running, it should be cost effective to implement this today. Companies that could do this include Amazon, Google, Valve, Riot, Meta and Microsoft.
Not only is it possible to create a first person shooter with a million players, but the per-player bandwidth is already within reach of high end internet connections in 2024.
The key limiting factor is the distribution of game state down to players due to lack of multicast support over the internet. A real-time multicast packet delivery system would have to be implemented and deployed worldwide. This would likely be pretty expensive, and not something that an average game developer could afford to do.
On the game code side there is good news. The key missing component is a flexible, game independent world database that would enable the leap from hundreds to at least thousands of players. I propose that creating this world database would be a fantastic open source project.
When this world database is combined with the architecture of player servers and world servers, games with 10,000 to 1M players become possible. When a real-time multicast network emerges and becomes widely available, these games will become financially viable.
The actual question looks easy:
You are tasked with creating a client/server application in Golang that runs in Google Cloud. The client in this application must communicate with the server over UDP.
Each client sends 100 requests per-second. Each request is 100 bytes long. The server processes each request it receives and forwards it over HTTP to the backend.
The backend processes the request, and returns a response containing the FNV1a 64 bit hash of the request data. The server returns the response it receives from the backend down to the client over UDP.
Implement the client, server and backend in Golang. Provide an estimate of the cost to run the solution each month at a steady load of 1M clients, as well as some options you recommend as next steps to reduce the cost.
I think most programmers familiar with Golang could implement the code above over a weekend; it doesn't seem too hard. Time to get coding, right?
No. Not at all. Turns out, it's basically fucking impossible.
The Kobayashi Maru is a training exercise in the Star Trek franchise designed to test the character of Starfleet Academy cadets by placing them in a no-win scenario. The Kobayashi Maru test was first depicted in the 1982 film Star Trek II: The Wrath of Khan, and it has since been referred to and depicted in numerous other Star Trek media.
The nominal goal of the exercise is to rescue the civilian fuel ship Kobayashi Maru, which is damaged and stranded in neutral territory between the Federation and the Klingon Empire. The cadet being evaluated must decide whether to attempt to rescue the Kobayashi Maru—endangering their ship and crew—or leave the Kobayashi Maru to certain destruction. If the cadet chooses to attempt a rescue, an insurmountable enemy force attacks their vessel. By reprogramming the test itself, James T. Kirk became the only cadet to overcome the Kobayashi Maru simulation.
The phrase "Kobayashi Maru" has entered the popular lexicon as a reference to a no-win scenario. The term is also sometimes used to invoke Kirk's decision to "change the conditions of the test."
-- Wikipedia
I've given this test to many candidates, and I've heard pretty much every response you could imagine, from "That's easy! I could code that in a few hours" to "That's impossible!". You can even see a bunch of people getting worked up over this question on the hacker news thread from when I posted it. https://news.ycombinator.com/item?id=39979078
From the answers that people give during interviews I can tell a lot about what sort of programmers they are. A particularly strong tell comes from programmers who think that they should just write it in <insert language/technology of choice> and everything would scale perfectly. It's easy if you just do X they say as hands are waved. Turns out if you want to actually solve this problem you really need to roll your sleeves up and code.
When you do this, you'll find it's such a difficult problem that even getting to 10k clients is hard. I can guarantee even an experienced programmer would work for days, potentially even weeks on a solution that hits 10k clients in Golang. It's also a very sneaky question in that if you are not absolutely disciplined in tracking expected packets sent vs. actual packets sent and received, you can easily fool yourself that you are scaling up when in fact most packets you send are dropped.
There are many tricks along the way to 10K clients: socket send and receive buffer sizes, SO_REUSEPORT, batching, sharding and being stateless, how to handle being CPU vs. IO bound, identifying bottlenecks, horizontal scaling, load balancing in Google Cloud, how the Linux kernel networking works and the many ways that UDP packets can get dropped. If you know all these things you are at an advantage – or perhaps, a disadvantage – since you'll get further along before you realize it's impossible.
Just like the Kobayashi Maru, you can learn a lot about somebody based on how they respond to the no-win scenario, and the very best programmers break the rules to pass the test.
So here is my answer. It's worth reviewing if you are curious about how to obtain high UDP request/response throughput for an application. It's not the only way to solve this problem, but it is mine:
https://github.com/mas-bandwidth/udp/blob/main/001/README.md
How would you approach it?
UPDATE:
Here is Sandeep Nambiar's solution from Coherence:
I'm Glenn Fiedler and welcome to Más Bandwidth, my new blog at the intersection of game network programming and scalable backend engineering.
Imagine you have a system you need to code and it needs to scale up to many millions of requests per-second. The well-trodden path is to implement this in HTTPS with a load balancer in front of some VMs that autoscale according to demand. For example, you could implement this in Google Cloud with a load balancer in front of a MIG and implement the HTTPS handlers in Golang (trivial). There's plenty of information about how to do this online, and it's relatively easy to scale this up to many millions of requests per-second.
But what if you needed to do the same thing in UDP instead of HTTPS?
Well, now you're off the beaten path, and you'll find very little information about how to do this. This is actually something we needed to do at Network Next, my startup that provides network acceleration technology for multiplayer games. Our SDK runs on game consoles like PS4, PS5, Nintendo Switch and XBox as well as Windows, MacOS and Linux, and avoiding the overhead and complexity of porting and maintaining something like libcurl or mbedtls on consoles is beneficial to our customers.
To be clear, I'm not advocating that you stop using HTTPS and switch your backend to UDP. If you're happy with HTTPS and it's doing what you need, awesome! Stay on the well trodden path. But if, like me, you have some use case that is better with UDP, or even if you are just curious about how such a strange approach can work, read on.
This idea of building a scalable backend with UDP is so out there that I've used it as an interview question at Network Next for years. You simply cannot just google this and find example source code showing how to do it. To solve this problem, you need to take in many sources and creatively synthesize your own result. Exactly what I'm looking for from engineers at Network Next.
The actual question itself is deceptively easy:
You are tasked with creating a client/server application in Golang that runs in Google Cloud. The client in this application must communicate with the server over UDP.
Each client sends 100 requests per-second. Each request is 100 bytes long. The server processes each request it receives and forwards it over HTTP to the backend.
The backend processes the request, and returns a response containing the FNV1a 64 bit hash of the request data. The server returns the response it receives from the backend down to the client over UDP.
Implement the client, server and backend in Golang. Provide an estimate of the cost to run the solution each month at a steady load of 1M clients, as well as some options you recommend as next steps to reduce the cost.
While I'm confident that an experienced senior engineer could find a solution over a weekend, I gave engineers as much time to research and solve the problem at home as they needed. What matters is the thinking process of the engineer, not how quickly they implement it. And of course, I wanted to respect that engineers may be implementing this in their spare time while working another job.
If you'd like to have a go at solving this yourself, now is the time.
I'll publish the full solution April 16, 2024, one week from today.
I'm Glenn Fiedler and welcome to Más Bandwidth, my new blog at the intersection of game network programming and scalable backend engineering.
What's this new blog about? Games with thousands of players. Virtual worlds. Performance in virtual spaces to an audience of millions. The Metaverse (no, not THAT metaverse, the real metaverse, sans blockchain). Overlay worlds in AR. Telepresence and remote working in virtual reality. Game streaming (eww gross!). OK, everything but that last one and I'm going to explain why in a later article.
But seriously. Within 10 years, everybody is going to have 10gbps internet. How will this change how games are made? How are we going to use all this bandwidth? 5v5 shooters just aren't going to cut it anymore. What's next?
As a fitting first post to this blog, if ever you find yourself, as I did, needing the absolute maximum bandwidth for an application, then you'll need to use a kernel bypass technology. Why? Because otherwise, the overhead of processing each packet in the kernel and passing it down to user space and back up to the kernel and out to the NIC limits the throughput you can achieve. We're talking 10gbps and above here.
The good news is that in the last 5 years the Linux kernel bypass technology known as XDP/eBPF has matured enough that it has moved from the domain of kernel hackers, to now at the beginning of 2024, being generally usable by normal people like you and me.
So in this article I'm going to give you a quick overview of how XDP/eBPF works, show you what you can actually do with XDP/eBPF, and give you some working example code (https://github.com/mas-bandwidth/xdp) for some simple XDP programs so you can start using this technology in your applications.
You can do some truly amazing things in XDP so read on!
XDP stands for "express data path" and it's basically a way for you to write a function that gets called at the very earliest point when a packet comes off the NIC, right before the Linux kernel does any allocations or processing for the packet.
This is incredible. This is powerful. This is programmer crack. You can write a function that runs inside the Linux kernel and you can do almost anything you want. You can:

- drop the packet (XDP_DROP)
- modify the packet and send it back out the same network interface (XDP_TX)
- redirect the packet to another network interface or CPU (XDP_REDIRECT)

OR

- pass the packet through to the Linux kernel for regular processing (XDP_PASS)
The last one is key. With other kernel bypass technologies like DPDK you needed to install a second NIC to run your program or basically implement (or license) an entire TCP/IP network stack to make sure that everything works correctly under the hood (the NIC is doing a lot more than just processing UDP packets for your game...).
Now you can just laser focus your XDP program to apply, for example, only to IPv4 UDP packets sent to port 40000, and pass everything else on to the Linux kernel for regular processing. Easy.
Correction: Apparently these days you can use a "bifurcated driver" with DPDK now to pass certain packets back to the OS. This wasn't available the last time I worked with DPDK, which was quite some time ago. I still prefer XDP over DPDK though.
eBPF stands for "extended Berkeley Packet Filter" and it's the technology that lets you compile, link and run your XDP program in the Linux Kernel.
In short, eBPF is a byte code and lightweight VM that runs functions inside the Linux Kernel. There are many different places that eBPF functions can be inserted, and XDP is just one of them.
Because eBPF functions run inside the Linux kernel, they must not crash and they absolutely must halt. To make sure this is true, BPF functions have to pass a verifier before they can be loaded into the Kernel.
In practice, this means that XDP functions are very slightly limited in what they can do. They're not Turing complete (halting problem), and you have to do a lot of dancing about to prove to the verifier that you aren't writing outside of bounds. But in practice, as long as you keep things simple, and you're willing to creatively do battle with the verifier, you can usually convince it that your program is safe.
Before you can get started writing XDP programs, you need to set up your machine so that it can compile, link and run eBPF programs, and load them in the kernel.
Starting from an Ubuntu 22.04 LTS distribution.
First you need to make sure you have the 6.5 Linux kernel:
uname -r
If the output from this isn't version 6.5, update your kernel with:
sudo apt install linux-generic-hwe-22.04 -y
Run the following in your command line:
# install necessary packages
sudo NEEDRESTART_SUSPEND=1 apt autoremove -y
sudo NEEDRESTART_SUSPEND=1 apt update -y
sudo NEEDRESTART_SUSPEND=1 apt upgrade -y
sudo NEEDRESTART_SUSPEND=1 apt dist-upgrade -y
sudo NEEDRESTART_SUSPEND=1 apt full-upgrade -y
sudo NEEDRESTART_SUSPEND=1 apt install libcurl3-gnutls-dev build-essential vim wget libsodium-dev flex bison clang unzip libc6-dev-i386 gcc-12 dwarves libelf-dev pkg-config m4 libpcap-dev net-tools -y
sudo NEEDRESTART_SUSPEND=1 apt install linux-headers-`uname -r` linux-tools-`uname -r` -y
sudo NEEDRESTART_SUSPEND=1 apt autoremove -y
# install libxdp and libbpf from source
cd ~
wget https://github.com/xdp-project/xdp-tools/releases/download/v1.4.2/xdp-tools-1.4.2.tar.gz
tar -zxf xdp-tools-1.4.2.tar.gz
cd xdp-tools-1.4.2
./configure
make -j && sudo make install
cd lib/libbpf/src
make -j && sudo make install
sudo ldconfig
# setup vmlinux btf
sudo cp /sys/kernel/btf/vmlinux /usr/lib/modules/`uname -r`/build/
In summary, the key step is building libxdp from source, and then building and installing the exact version of libbpf that is included in libxdp.
I can only guess why this is necessary, but without this, I have found no other way to get XDP fully working on Ubuntu 22.04, including all functionality like BTFs, kfuncs and kernel modules. More on those later.
Now we'll build and run a simple XDP program. In this program we'll just reflect UDP packets sent to us on port 40000 back to the sender. All other packets are passed to the kernel for regular processing.
First, clone my XDP example repo from GitHub:
git clone https://github.com/mas-bandwidth/xdp
Change into the reflect dir and make the program:
cd xdp/reflect && make
Run the UDP reflect program, passing in the name of the network interface to attach the program to. You can use ifconfig to list the network interfaces on your Linux machine.
sudo ./reflect enp4s0
Open up another terminal window to watch logs from the XDP program:
sudo cat /sys/kernel/debug/tracing/trace_pipe
Next clone the XDP repo again on another machine, and run the corresponding client for the reflect program, replacing 192.168.1.40 with the IP address of your Linux machine running the XDP program:
git clone https://github.com/mas-bandwidth/xdp
cd xdp/reflect && go run client.go 192.168.1.40
If everything is working, you should see logs like this:
gaffer@batman reflect % go run client.go
sent 256 byte packet to 192.168.1.40:40000
sent 256 byte packet to 192.168.1.40:40000
sent 256 byte packet to 192.168.1.40:40000
sent 256 byte packet to 192.168.1.40:40000
sent 256 byte packet to 192.168.1.40:40000
sent 256 byte packet to 192.168.1.40:40000
sent 256 byte packet to 192.168.1.40:40000
sent 256 byte packet to 192.168.1.40:40000
sent 256 byte packet to 192.168.1.40:40000
sent 256 byte packet to 192.168.1.40:40000
received 256 byte packet from 192.168.1.40:40000
received 256 byte packet from 192.168.1.40:40000
received 256 byte packet from 192.168.1.40:40000
received 256 byte packet from 192.168.1.40:40000
received 256 byte packet from 192.168.1.40:40000
received 256 byte packet from 192.168.1.40:40000
received 256 byte packet from 192.168.1.40:40000
received 256 byte packet from 192.168.1.40:40000
received 256 byte packet from 192.168.1.40:40000
received 256 byte packet from 192.168.1.40:40000
Congratulations, you've just built and run your first XDP program, and it's no toy. If you simply comment out the #define DEBUG 1 line in reflect_xdp.c, it's capable of reflecting packets at line rate on a 10G NIC.
Next we'll run a program that listens on a UDP port and drops packets that don't match a pattern. This type of XDP program can be useful to harden your game server against DDoS, although it's certainly not a panacea.
The general idea is to hash key packet data, for example: packet length, source and dest addresses and ports, and potentially also some rolling magic number that changes every minute if you want to get fancy. While it's not perfect, and it doesn't protect against packet replay attacks, at least a randomly generated UDP packet will fail the pattern check.
The trick is to shmear this 8 byte hash across 15 bytes at the start of the packet in a reversible way, effectively doing the opposite of compression. We're going to store this data very inefficiently, such that each byte in the header has a very narrow range of values that are actually valid, and most are not. Now we have an extremely low entropy pattern we can check for, without even calculating the hash.
Here's an example that takes the hash and fills a 16 byte header, with 15 bytes of a low entropy encoding of the hash, and the first byte reserved for packet type:
func GeneratePacketHeader(packet []byte, sourceAddress *net.UDPAddr, destAddress *net.UDPAddr) {
	// hash the packet type byte, the payload past the header, the addresses and the packet length
	var packetLengthData [2]byte
	binary.LittleEndian.PutUint16(packetLengthData[:], uint16(len(packet)))
	hash := fnv.New64a()
	hash.Write(packet[0:1])
	hash.Write(packet[16:])
	hash.Write(sourceAddress.IP.To4())
	hash.Write(destAddress.IP.To4())
	hash.Write(packetLengthData[:])
	hashValue := hash.Sum64()
	var data [8]byte
	binary.LittleEndian.PutUint64(data[:], uint64(hashValue))
	// smear the 8 byte hash across bytes [1,15] of the packet,
	// so each header byte has only a narrow range of valid values
	packet[1] = ((data[6] & 0xC0) >> 6) + 42
	packet[2] = (data[3] & 0x1F) + 200
	packet[3] = ((data[2] & 0xFC) >> 2) + 5
	packet[4] = data[0]
	packet[5] = (data[2] & 0x03) + 78
	packet[6] = (data[4] & 0x7F) + 96
	packet[7] = ((data[1] & 0xFC) >> 2) + 100
	if (data[7] & 1) == 0 {
		packet[8] = 79
	} else {
		packet[8] = 7
	}
	if (data[4] & 0x80) == 0 {
		packet[9] = 37
	} else {
		packet[9] = 83
	}
	packet[10] = (data[5] & 0x07) + 124
	packet[11] = ((data[1] & 0xE0) >> 5) + 175
	packet[12] = (data[6] & 0x3F) + 33
	value := data[1] & 0x03
	if value == 0 {
		packet[13] = 97
	} else if value == 1 {
		packet[13] = 5
	} else if value == 2 {
		packet[13] = 43
	} else {
		packet[13] = 13
	}
	packet[14] = ((data[5] & 0xF8) >> 3) + 210
	packet[15] = ((data[7] & 0xFE) >> 1) + 17
}
To run the xdp drop program just change into the 'drop' directory and run it on your network interface:
cd xdp/drop && sudo ./drop enp4s0
And then on another computer, run the drop client, replacing the addresses with the client and drop XDP program addresses respectively:
cd xdp/drop && go run client.go 192.168.1.20 192.168.1.40
On the XDP machine you'll see in the logs that the packets passed the filter:
sudo cat /sys/kernel/debug/tracing/trace_pipe
Try modifying client.go to send randomly generated packet data without the header. You'll see in the logs that the packet filter now drops the packets. The encoding of the hash is so low entropy that it's virtually impossible for a randomly generated packet to pass the packet filter.
If you end up using this technique in production, please make sure to change the low entropy encoding to something unique for your game, because script kiddies read these articles too. In addition, make sure your encoding is reversible, so you can reconstruct the hash on the receiver side and drop the packet in XDP when the hash doesn't match the expected value. Now people can't spoof their source address or port!
What about an even simpler approach? Why not just maintain a list of IP addresses that are allowed to communicate with the game server, and drop any packets that aren't from a whitelisted address?
Sure, you need to do some work in the backend to "open" client addresses on the server prior to connect, and you have to do some work to "close" the address when clients disconnect... but this is do-able, and now, packets thrown at your game server from random addresses will be dropped by XDP, before the Linux kernel does any work to process them.
To do this we need a way to communicate the whitelist to the XDP program. And here we get to use a new feature of BPF: Maps.
Maps are an incredibly rich set of data structures that you can use from BPF. Arrays, Hashes, Per-CPU arrays, Per-CPU hashes and so on. All of these data structures are lockless, and you can read and write to them both from inside BPF programs, and from user space programs.
If you see where I'm going here, you now have a way to communicate from your BPF program back out to user space and vice-versa. It's almost too easy now: just call functions in your user space program to add and remove entries from a whitelist hash map.
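For example, a whitelist keyed on IPv4 source address might be declared in the XDP C program like this. This is a sketch in the libbpf BTF-defined map style; the map name and value type here are illustrative, not necessarily what the example source uses:

```c
// Illustrative BTF-defined map declaration (libbpf style)
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);   // IPv4 source address
    __type(value, __u64); // e.g. an expiry timestamp
} whitelist_map SEC(".maps");

// ...then inside the XDP program, after parsing up to the IP header:
__u64 *entry = bpf_map_lookup_elem(&whitelist_map, &ip->saddr);
if (!entry)
    return XDP_DROP;
```

From user space, the same map can be updated with bpf_map_update_elem() and bpf_map_delete_elem() on the map's file descriptor, which is how connect and disconnect events would open and close client addresses.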
Run the whitelist XDP program, replacing the interface name with your own:
cd xdp/whitelist && sudo ./whitelist enp4s0
And then on another computer, run the whitelist client, replacing the addresses with the client and whitelist XDP program addresses respectively:
cd xdp/whitelist && go run client.go 192.168.1.20 192.168.1.40
If you look at the logs from the XDP program:
sudo cat /sys/kernel/debug/tracing/trace_pipe
You'll see that it prints out that it's dropping the packets because they're not in the whitelist. Edit whitelist/whitelist.c and add the address of the machine running client.go, then reload the XDP program. Run client.go again, and the packets should pass. At this point, if you bind a UDP socket to port 40000 on the XDP machine, it would receive only packets that pass the whitelist check.
If you use this in production, you'll need to write your own system to add and remove whitelist entries. Maybe your server hits a backend periodically for the list of open addresses? Maybe it subscribes to some queue? In addition, the whitelist hash values in this example are empty, but you could put data in there. What about a secret key per-client to make the packet filter hash more secure? You can combine whitelist with packet filters and hash checks to stop attackers from spoofing their IPv4 source address to get through the whitelist.
What if your game server still gets overloaded by DDoS attacks, even though XDP is dropping packets with whitelists, packet filter checks and hash checks?
The DDoS attacks are getting bigger. Much bigger. Congratulations, your game is super successful. Why not put a relay in front of your game server, that only forwards valid packets, hiding the IP address of your game server entirely? You could have relays in each datacenter protecting your game servers, and these relays could have 10, 40 or 100gbps NICs.
I'm leaving this one as an exercise for the reader. Combine the whitelist approach above with enough information in the whitelist hash value entry for the relay to forward the packet from the client to the server and vice-versa.
Now drop any packets from addresses that aren't in the whitelist as quickly as possible. Bonus points: track sequence numbers per-client connection to avoid replay attacks and rate limit the client connection to some maximum bandwidth envelope per-client. This is starting to become a pretty solid system.
At this point you basically have your own version of Steam Data Relay (SDR) that isn't free. It's quite possibly even better than SDR. Well done! If you have infinite resources and your own money printing machine in your basement like Valve has, you can afford to run this system at scale too.
Did you know that at any time around 5-10% of your players are experiencing bad network performance like much higher latency than usual, high jitter or high packet loss? It's difficult to believe, but it's true. I have the data from more than 50 million unique players to prove it.
Even more interesting is the fact that this bad network performance moves around and affects ~90% of your players every month. It's not the same 5-10% of players every day.
This is not just some small minority of players with bad internet connections, this is a systemic problem. An inconsistency in network performance from one match to the next that affects the majority of your players.
My company Network Next fixes this problem by steering game packets through relays, and yes... we even beat Google cloud to their own datacenters with premium transit, Amazon to their own datacenters with Amazon Global Accelerator, and we massively outperform Steam Data Relay (SDR).
And the Network Next relays are implemented in XDP.
One problem I encountered when implementing the Network Next relay was: how can I access crypto from inside my XDP program? Sure, I can forward or drop packets quickly, but the decision to forward or drop a packet was based not only on whitelists, packet filters and hash checks, but also on cryptographic checks like sha256 and chachapoly.
While it's certainly possible that I could fight the verifier and write my own cryptographic primitives in BPF directly, it seemed counterproductive. I'd spend a lot of time fighting the verifier and, in the end, it might not even be possible to implement a given crypto primitive within the limitations of the verifier. It has an uncanny ability to not see why your code is completely fucking safe. Seriously, by the end of writing an XDP program you really want to punch it in the face.
Coming to the rescue are the last two features of BPF that we'll discuss in this article: BTF and kfuncs.
In short, you can write your own kernel module, export functions from that module called kernel functions (kfuncs), and call them from inside XDP. You can even annotate those functions so that the BPF verifier knows: OK, this function parameter is a void pointer data, and here is the length of that data, int data__sz. These annotations are done via BTF, a sort of lightweight type system that exports type data from the Linux kernel, including kernel modules, so it can be accessed from BPF.
Using this, we can implement a function inside our own custom crypto_module kernel module that performs sha256 using the existing Linux kernel crypto primitives.
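Here's a sketch of what that can look like in the kernel module. The function and module names are illustrative, not the actual module source; the key points are the __bpf_kfunc annotation, the data__sz naming convention, and registering the kfunc set for XDP:

```c
// Illustrative sha256 kfunc in a custom kernel module (kernel 6.5 era).
#include <linux/module.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>
#include <crypto/hash.h>

static struct crypto_shash *sha256_tfm;

// The "__sz" suffix tells the BPF verifier that the int parameter
// is the size of the preceding void pointer.
__bpf_kfunc int bpf_crypto_sha256(void *data, int data__sz, void *output, int output__sz)
{
    if (output__sz != 32)
        return -EINVAL;
    return crypto_shash_tfm_digest(sha256_tfm, data, data__sz, output);
}

// Register the kfunc via BTF so XDP programs are allowed to call it
BTF_SET8_START(crypto_kfunc_ids)
BTF_ID_FLAGS(func, bpf_crypto_sha256)
BTF_SET8_END(crypto_kfunc_ids)

static const struct btf_kfunc_id_set crypto_kfunc_set = {
    .owner = THIS_MODULE,
    .set   = &crypto_kfunc_ids,
};

static int __init crypto_module_init(void)
{
    sha256_tfm = crypto_alloc_shash("sha256", 0, 0);
    if (IS_ERR(sha256_tfm))
        return PTR_ERR(sha256_tfm);
    return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &crypto_kfunc_set);
}

static void __exit crypto_module_exit(void)
{
    crypto_free_shash(sha256_tfm);
}

module_init(crypto_module_init);
module_exit(crypto_module_exit);
MODULE_LICENSE("GPL");
```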
To see this in action, first build and load the kernel module:
cd xdp/crypto
make module
Next, build and run the XDP program, replacing the network interface name with your own:
make && sudo ./crypto enp4s0
Now on another computer, change into the crypto dir and run the client, replacing the address with the IP address of the machine running the XDP program:
cd xdp/crypto && go run client.go 192.168.1.40
If everything goes correctly, you'll see the XDP program respond with a 32 byte packet for each packet you send, containing the sha256 of the first 256 bytes of your packet. Why only the first 256 bytes? Well, to understand this, you need to understand the limitations of the BPF verifier...
With kfuncs, it would seem that you should now be able to pass the entire packet (void * packet and int packet__sz) from XDP into a kfunc and process it entirely inside your kernel module.
Well, not so fast. The BPF verifier has some limitations and these restrict what you can do (at least in 2024). Hopefully these get fixed in the future by the Linux BPF developers.
The best way I can describe the limitations of the BPF verifier for XDP programs is that it's extremely "anchored" around processing a packet from left to right. Typically, you start at the beginning of the packet and check to see if there are enough bytes in the packet to read the ethernet header, then move the pointer right by some constant amount, read the IP header, move a constant amount to the right again and read the UDP header, and so on.
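Concretely, the pattern the verifier is happy with looks like this. It's a generic sketch of the standard XDP parsing idiom, not code taken from the example repo:

```c
SEC("xdp") int parse_left_to_right(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // check there are enough bytes for the ethernet header, then advance
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // check there are enough bytes for the IP header, then advance
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    // check there are enough bytes for the UDP header
    struct udphdr *udp = (void *)(ip + 1);
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    // ...and so on: constant offsets, always moving right
    return XDP_PASS;
}
```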
But if you try to write code that reads the last 2 bytes in a packet, I simply cannot find any way to make it pass the verifier, even though the code is completely safe and does not read memory out of bounds.
The next limitation is that it seems (in 2024) that you can only pass constant sized portions of the packet data into kfuncs. For example, I could check to see if there is at least 256 bytes in the UDP packet payload, then call a kfunc with a pointer to the packet data and a constant size of 256 bytes, and this passes the verifier. But if you pass in the actual size of the packet derived from the XDP context, there just seems to be no way to convince the verifier it's safe.
This is a huge shame, because if we could simply pass the XDP packet into a kernel module and do some stuff there, we'd really be cooking with gas. Again, hopefully this gets fixed in a future version of BPF.
In this article we've explored XDP and eBPF, a kernel bypass technology in Ubuntu 22.04 LTS with Kernel 6.5. Previously unstable and the domain of only neck-bearded kernel hackers, it's now stable and mature enough for general use by game developers.
It's a surprisingly powerful and easy to use system once you get the hang of it. You can write XDP programs that reflect packets, drop them or forward them; you can run packet filters, perform whitelist checks and even do some crypto. You can do all this at line rate of 10gbps and above with minimal work. View the example source code for the article (https://github.com/mas-bandwidth/xdp) and see for yourself.
I know it sounds crazy, but in the future I'm actually exploring implementing entire backend systems and scalable UDP request/response game servers almost entirely inside XDP. With maps, kernel modules and kfuncs, you can do almost anything you want, and if you can't, in the worst case you can pass the packet down to user space for processing. If you were creating a new MMO or hyper player count game in 2024, I can't think of a better foundational technology than XDP/eBPF.
I hope this article helps you get started writing XDP programs. They're extremely powerful and fun to write, even if the verifier drives you crazy. I look forward to seeing what you'll create with them!
Best wishes,
- Glenn