Thoughts on defensive programming

2026-01-25T04:00:00+00:00

One of the few programming books I’ve read is Effective Java by Joshua Bloch. Although it’s a book about Java, not all the advice it gives is Java-specific, and it changed the way I think about programming in general. Effective Java introduced me to the idea of “invariants,” in particular “class invariants.” At the time, it wasn’t obvious to me that it’s absolutely okay to not defend against object states that are guaranteed impossible by invariants. One of the code snippets in the book presents an example of this:

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class Favorites {
    private final Map<Class<?>, Object> favorites = new HashMap<>();

    public <T> void putFavorite(Class<T> type, T instance) {
        favorites.put(Objects.requireNonNull(type), instance);
    }

    public <T> T getFavorite(Class<T> type) {
        return type.cast(favorites.get(type));
    }
}

Strictly speaking, type.cast throws a ClassCastException if its argument is not an instance of the type, but we know that getFavorite can never do this, because every key-value pair in favorites maintains an invariant: the key is a type token, and the value is an instance of the type that token represents. Even so, type.cast is favored over the unchecked cast (T), which is erased at compile time and never throws at the cast itself: if a bug ever violated the invariant, the exception would be thrown inside the class rather than later in the caller. This is a toy example, but it demonstrates what I have observed in practice: projects I have worked on where defensive programming was discouraged had fewer bugs, because bugs manifested sooner, closer to where they actually were.
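To make the difference concrete, here is a small sketch (not from the book) that deliberately breaks the invariant through a raw type and compares where the failure surfaces. CheckedFavorites mirrors the snippet above; UncheckedFavorites, FavoritesDemo, and the helper methods are hypothetical names introduced only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Mirrors the snippet above: the cast is checked at runtime inside the class.
class CheckedFavorites {
    private final Map<Class<?>, Object> favorites = new HashMap<>();

    <T> void putFavorite(Class<T> type, T instance) {
        favorites.put(Objects.requireNonNull(type), instance);
    }

    <T> T getFavorite(Class<T> type) {
        return type.cast(favorites.get(type));
    }
}

// Hypothetical variant: the unchecked cast is erased, so nothing is verified here.
class UncheckedFavorites {
    private final Map<Class<?>, Object> favorites = new HashMap<>();

    <T> void putFavorite(Class<T> type, T instance) {
        favorites.put(Objects.requireNonNull(type), instance);
    }

    @SuppressWarnings("unchecked")
    <T> T getFavorite(Class<T> type) {
        return (T) favorites.get(type);
    }
}

class FavoritesDemo {
    // Breaks the invariant on purpose by going through a raw Class.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static void corrupt(CheckedFavorites checked, UncheckedFavorites unchecked) {
        checked.putFavorite((Class) Integer.class, "not an integer");
        unchecked.putFavorite((Class) Integer.class, "not an integer");
    }

    // True if the exception's stack trace passes through the named class.
    static boolean thrownInside(ClassCastException e, String className) {
        for (StackTraceElement frame : e.getStackTrace()) {
            if (frame.getClassName().equals(className)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        CheckedFavorites checked = new CheckedFavorites();
        UncheckedFavorites unchecked = new UncheckedFavorites();
        corrupt(checked, unchecked);

        try {
            Integer i = checked.getFavorite(Integer.class);
        } catch (ClassCastException e) {
            System.out.println("type.cast failed inside the class: "
                    + thrownInside(e, CheckedFavorites.class.getName()));
        }
        try {
            // The compiler-inserted cast to Integer fails here, in the caller.
            Integer i = unchecked.getFavorite(Integer.class);
        } catch (ClassCastException e) {
            System.out.println("(T) failed inside the class: "
                    + thrownInside(e, UncheckedFavorites.class.getName()));
        }
    }
}
```

With type.cast the stack trace points into the class that owns the invariant. With (T) it points at whichever caller first assigns the result to an Integer, which may be far from the actual bug.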

While “class invariants” are matter-of-fact, I have not heard an analogous expression used when talking about programs consisting of many classes or systems consisting of many programs. But that doesn’t mean they don’t exist - the vocabulary simply changes to “method contracts,” “API contracts,” “architecture,” or “agreements,” all of which can refer to state invariants at a larger scale.

An example still remains fresh in my memory. One of the projects I’ve worked on was heavily dependent on a blob of data that configured the application. The data consisted of many pieces which could be referred to by ID, and such IDs were spread throughout the system - in the data itself and in our databases. However, as a team we had agreed that these references could never be broken, and consequently, most application code never entertained the possibility that an ID could refer to missing data. I say most because some parts of the system were dedicated to maintaining that invariant, so that the rest of the system did not have to worry about it.

As a result we could write code like this:

class App {
    private final AppData data;

    // ...

    void doSomethingWithData() {
        var a = data.getThingA();
        var b = data.getThingB();
        var c = b.getThingCByIdOfA(a.getId());
        // do something with c
        // ...
    }
}

Instead of code like this:

class App {
    private final AppData data;

    // ...

    void doSomethingWithData() {
        var a = data.getThingA();
        var b = data.getThingB();
        var c = b.getThingCByIdOfA(a.getId());
        if (c == null) {
            // no invariant, so entertain possibility of null
        } else {
            // do what the app is actually supposed to do
        }
    }
}

This doesn’t seem that bad, but multiplied hundreds of times across a large team of developers, it can get pretty bad:

Hundreds of places in the code are dedicated to working around a possibility that could have instead been eliminated by an invariant.
If the chosen failure mode is not visible, inconsistent data may go unnoticed.
Time is wasted on implementing a failure mode in the first place, resulting in a lot more code that is less readable.

This isn’t an argument against defensive programming; it’s an argument against defensive programming that could instead be replaced by invariants. The problem can then be reduced to choosing and enforcing invariants at the system level, which is not just a software problem, but (I believe) an organizational and cultural one.
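One way the “dedicated” part of such a system might look is a single load-time check that rejects any configuration containing a dangling ID, so that everything downstream can follow IDs without null checks. This is only a sketch under assumptions: ConfigEntry, ConfigValidator, and the shape of the data are hypothetical and not taken from the real project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical config entry: an ID plus the IDs of other entries it refers to.
record ConfigEntry(String id, List<String> references) {}

class ConfigValidator {
    // Describes every reference that points at a missing entry.
    static List<String> findDanglingReferences(Map<String, ConfigEntry> entries) {
        List<String> problems = new ArrayList<>();
        for (ConfigEntry entry : entries.values()) {
            for (String ref : entry.references()) {
                if (!entries.containsKey(ref)) {
                    problems.add(entry.id() + " -> " + ref);
                }
            }
        }
        return problems;
    }

    // Fails fast, so data that violates the invariant never reaches application code.
    static Map<String, ConfigEntry> requireValid(Map<String, ConfigEntry> entries) {
        List<String> problems = findDanglingReferences(entries);
        if (!problems.isEmpty()) {
            throw new IllegalStateException("Dangling references: " + problems);
        }
        return Map.copyOf(entries);
    }
}
```

The check runs once, at the boundary where data enters the system, instead of being repeated at every place where an ID is followed.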

防御的プログラミングについて

2026-01-25T04:00:00+00:00

これまで読んできたプログラミング本の中では、Joshua Bloch 著書の「Effective Java」が未だに印象に残っています。Java を中心にした内容ですが、オブジェクト指向プログラミング全般に当てはめられるアドバイスが詰まっています。この本を読んで、「クラス不変条件」という概念を初めて知りました。不変条件によって排除される、不可能なオブジェクト状態に対して、防御的なコードを書かなくても良いと。本の中では、以下の Java コードがその例として挙げられています。

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class Favorites {
    private final Map<Class<?>, Object> favorites = new HashMap<>();

    public <T> void putFavorite(Class<T> type, T instance) {
        favorites.put(Objects.requireNonNull(type), instance);
    }

    public <T> T getFavorite(Class<T> type) {
        return type.cast(favorites.get(type));
    }
}

厳密に言えば、引数の型が間違っていれば、type.castは例外をスローしますが、それでもgetFavoriteは絶対に例外をスローしないと断言できます。なぜなら、favoritesはクラス不変条件を守っています：キーが型のトークンで、値がその型のインスタンスです。よって、例外をスローしない(T)よりも、もし不変条件がバグが原因で守られなかった場合、呼び出し元じゃなくクラス内から例外をスローするtype.castが採用されています。少々非現実的な例ではありますが、様々なプロジェクトに関わってきて僕が見てきた事実を示しています：防御的プログラミングを良しとしないプロジェクトの方がバグが少ないです。

「クラス不変条件」は表現としてよく使われますが、クラスの規模を超えて、複数のクラスを含むプログラム、あるいは複数のプログラムを含むシステムとなると、「不変条件」という表現が目に入らなくなります。が、不変条件が当てはまらなくなるというわけではなく、それを示す用語が変わってくるだけです。「メソッドの契約」、「API の契約」、「アーキテクチャー」、「チームの合意」が全て、ある種の不変条件を示しています。

まだ記憶に残っている例が挙げられます。アプリの設定として、一つのデータのブロブを採用していたプロジェクトの話です。そのデータの塊が小さなエントリーによって構成されており、それぞれのエントリーにはそれを参照するための ID が付いていました。そういった ID が、他のエントリーだったり、データベースだったり、システム全体に散らばっていました。ところが、チーム全員で「ID の参照先のエントリーが絶対に存在する」と合意したため、アプリケーションコードのほとんどが、ID の参照先が正しいことを前提として書かれていました。「ほとんど」というのは、壊れた参照を意識せざるを得ない、不変条件を守るためのシステムの一部があったからです。おかげで、それ以外のコードは、不変条件の恩恵を受けることができました。

結果として、以下のようなコードを書くことができました：

class App {
    private final AppData data;

    // ...

    void doSomethingWithData() {
        var a = data.getThingA();
        var b = data.getThingB();
        var c = b.getThingCByIdOfA(a.getId());
        // cをnullチェックしないでそのまま使います
        // ...
    }
}

ちなみに、不変条件がなかったら、以下のようなコードになってしまいます：

class App {
    private final AppData data;

    // ...

    void doSomethingWithData() {
        var a = data.getThingA();
        var b = data.getThingB();
        var c = b.getThingCByIdOfA(a.getId());
        if (c == null) {
            // 不変条件がないため、nullの可能性に対応せざるを得ません
        } else {
            // 実際のアプリの動作はここです
        }
    }
}

nullチェックは別にいいんじゃないかと思われるかもしれませんが、開発者全員が何百回も同じ対応を重ねると、少々よろしくない状況が生まれてきます：

不変条件で排除することができた問題に何百回も対応しなければなりません。
可視化されていない対応策が選択されたら、データの不整合が気づかれない恐れがあります。
対応策を書くための時間が費やされます。コードが余計に多くなり、読みにくくなります。

防御的プログラミングを一方的に否定するつもりはありませんが、不変条件によって置き換えられる防御的プログラミングを避けた方が良いんじゃないかと思っています。そうすれば、どの不変条件を、どう守れば良いのか、という組織やカルチャーの問題に絞られるでしょう。

Effective engineering on Tetris

2025-11-24T07:30:00+00:00

The coolest project I’ve had the privilege to work on is the mobile version of Tetris at N3TWORK. It was ambitious - featuring a timezone-based realtime multiplayer mode where players in a single region logged on at exactly the same time to compete with the same pieces. It also featured a Royale multiplayer mode - basically the mobile version of Tetris 99. This wasn’t a fast-follow. N3TWORK had already prototyped the game mode before Tetris 99 was released.

Not only was the project ambitious, we managed to pull it off with an international team across South and North America, on an aggressive schedule where we rebuilt the game for worldwide release in months, not years.

How were we able to do this? Even though it wasn’t always roses and sunshine, our team had a lot going for us that made us extremely effective.

Low architectural debt

N3TWORK had already delivered, to great success, a game called Legendary: Game of Heroes. Thanks to the brilliance of the engineering leaders in our org, the game’s architecture was prescriptive enough to standardize how features were built, while being flexible enough to make almost anything possible. Tetris inherited much of that game’s architecture.

Why is prescription a good thing? Because it allows developers to not spend time on problems that the architecture already solves. Problems like:

How are we going to configure this feature?
Where is the server going to put new state?
How do we guarantee state consistency between server and client?
How is the client going to receive new state?
How is the client going to receive new assets?
How is the client going to project state onto the player’s screen?

These may seem like obvious questions with obvious answers, but having been through a few companies I can say that not everyone can come up with elegant solutions to them. We were lucky to have people that did. As a result, our projects benefited immeasurably from the time developers saved by not having to solve these problems on their own, with solutions of varying quality.

Great team

The majority of the team that developed Tetris was in Chile. I was one of a few engineers based in San Francisco. There must be something about Chilean culture, because our team just worked really well together. There was trust, respect, and dialog - all values of the company culture - but with Tetris it felt like something that was already there, and you didn’t need a company to prescribe it to you.

Everyone was also very strong and accountable in their roles, relaxing the need for oversight and synchronous communication, both of which are more difficult in a remote environment.

Almost no code review

“No code review” may seem like an engineering antipattern but I can say confidently that our velocity benefited from having almost no code review:

Obviously, there was no time spent waiting for someone to review. On a small, capable team with low architectural debt, it’s possible to do this without creating bugs that would cancel out the time saved.
Increases accountability and ownership. You are responsible for your changes, and no one else!
No barrier to change. Created a bug? Fix it now. See something that could be improved? Just improve it.

Of course, this created other kinds of issues:

Lower awareness of code you don’t own.
Less opportunity to learn from others.
Low bus factor: knowledge of each area was concentrated in very few people.

Despite the drawbacks, it did help us ship a lot faster, which is usually what a startup needs. Different companies of different sizes may have different priorities.

Empowered individuals

The common thread in all of the points above is that we had all of the pieces in place to empower individuals to give their very best.

We had a bug-resistant and somewhat prescriptive architecture, allowing lower oversight. Lower oversight means lower barriers to change, allowing individuals to be more effective. When individuals are more effective, they will see that they are making an impact, and ultimately care more about what they’re doing. In my view, we had all of these things going for us, and that’s what made Tetris a great team to be on.

While the company no longer exists, I’m thankful to N3TWORK for making a team like this possible.

テトリスを支えた強いエンジニアリング

2025-11-24T07:30:00+00:00

これまで関わってきたプロジェクトの中で、テトリスのモバイル版をチームの一員として一番誇らしく思っています。同じ地域のプレイヤーが決まった時間にログインし、同じピースを使って競い合う、といったマルチプレイヤーゲームモードもあり、テトリス99と似たようなゲームモードもありました。モノマネではありませんでした。テトリス99がリリースされる以前にも、N3TWORKがテトリスのリアルタイムマルチプレイヤー機能のプロトタイプ開発に成功しました。

要件の難易度が高かっただけではなく、南と北アメリカを渡った国際的なチームで、数ヶ月でゲームを大幅に再開発し、全世界のリリースもできました。

どうやって成し遂げられたんでしょうか。何でもかんでもうまくいってたわけではありませんが、チームとしての効率や生産性の要因はいくつかありました。

低い技術的負債

テトリスを開発することになるまで、N3TWORKがパズドラと似たような、Legendaryというモバイルゲームをリリースしました。ソフトウェアとしても、ビジネスとしても、非常にうまく回っていたゲームです。経験豊富なエンジニアリングリーダーたちが考えてくれたソフトウェアアーキテクチャが、機能の開発過程を統一化させながら、可能性を狭めるほど制限的でありませんでした。テトリスはそんなアーキテクチャーを受け継ぎました。

統一化はなぜ良いことかというと、アーキテクチャがすでに解決してくれた問題に、開発者の時間を費やすことがなくなるからです。例えば：

機能をどうやって設定するのか。
サーバーが状態をどこに保存するのか。
サーバーとクライアントの状態一貫性をどうやって守るのか。
クライアントがどうやって新しい状態を取得するのか。
クライアントがどうやって新しいアセットをダウンロードするのか。
状態をどうやって端末の画面で表示するのか。

誰でも答えられる問題かもしれませんが、綺麗に解決できる人は少ないと身をもって言えます。N3TWORKには、運が良いことにそんな人がいました。おかげで、開発者たちはそれぞれの問題を自分で下手に解決することに膨大な時間を使わないで済みました。

素晴らしいチーム

テトリスの開発者たちは主にチリにいました。僕はサンフランシスコに住んでいた数人のエンジニアの一人でした。チリの文化なのか、チームがすごくうまく回っていたのです。信頼性と尊重性と協力性は全て会社のカルチャーが抱いていた価値観でしたが、会社がそんなものを押し付けなくても、テトリスではすでに存在していたものだと感じていました。

一人一人のメンバーは実力が高く信頼できたので、リモート環境では難しい管理や同期的なコミュニケーションの必要性は最小限にできました。

コードレビューがほとんどなかった

コードを全くレビューしないことは、アンチパターンに思われるかもしれませんが、僕たちの場合はコードレビューをなくすことでチーム全体のスピードが大幅に上がりました：

いうまでもありませんが、貢献するには誰かのレビューを待つ必要はありませんでした。少人数で技術的な負債が少ないチームでは、レビューをなくすことで取り戻せた時間が、かえってバグによって費やされることはあまりありません。
責任感が増えます。他人が貢献を見てくれないので、自分自身が責任を持って貢献するしかありません。
変更の妨げがなくなります。バグを書いたら、すぐ直せます。改善したいものがあれば、勝手にどうぞ。

もちろん、メリットもあればデメリットもあります：

自分が書かなかったコードへの意識が薄くなります。
他人から学ぶ機会が減ります。
バスファクターが低くなります。

とはいえ、コードレビューをなくすことでプロダクトの開発速度が高くなり、スタートアップとしてはもってこいでした。会社によっては目的が変わり、コードレビューの必要性も変わります。

自信を持って貢献できる開発者たち

以上の内容のテーマとしてあるのは、開発者一人一人に自信を持って自由に行動させるための要因が揃っていたことです。

バグへの忍耐性が高く、開発を統一化させたアーキテクチャーがあり、管理の必要性を最小限にできました。管理がなければ、貢献への妨げもなくなり、開発者の生産性も上がります。生産性が高ければ、チームへの影響が感じやすくなり、やっていることに意味を感じるようにもなります。僕から見て、テトリスはそんな素晴らしいチームでした。

会社がもうなくなりましたが、そんなチームを可能にしたN3TWORKには感謝しています。

Driving school in Japan: conclusion

2025-10-26T09:15:00+00:00

I finally graduated Japanese driving school around mid-September. I’m sharing my notes in case it’s helpful for others considering doing this.

Second stage

The second stage consists of ~15 lectures and ~20 more driving practice sessions, making it around twice as long as the first stage. Most of the practice sessions are on actual roads with an instructor, and a few are on-premises sessions dedicated to special topics like parking.

Tokyo roads are quite crowded and naturally produce a lot of opportunities for practicing good judgment. Overall I found the second stage material to be extremely useful and I am much more confident in my skills as a result.

Graduation exam

Once I finished all of the course material I was able to reserve time for the graduation exam. The school I was attending conducted graduation exams nearly every day. I was able to reserve one on a national holiday.

Students who reserved the same timeslot are split up into different cars with instructors. I was in a group with two other students. Students take turns driving pre-determined courses and are evaluated on the safety and correctness of their driving. The final part of the exam is on-premises where students are tested either on parallel parking or reversing out of tight corners.

During my turn I screwed up at the final moment by driving into the opposite lane, but thankfully this didn’t count against me because my exam finished the moment I successfully parallel parked. Perhaps I was just exceptionally lucky.

The results were announced soon after the exam. However, I had to come back to the school in the afternoon to receive the graduation certificate. This was the last time I would set foot in the school.

The exam itself took up most of the morning, and finished between 11AM and 12PM. I received the graduation certificate at around 2PM.

Written exam

The graduation exam did not earn me a driver’s license. In Japan, graduating a government-recognized driving school is merely a way to bypass the driving exam at the DMV. I still needed to take the written exam.

Unfortunately, the DMV’s schedule is much less flexible and I had to sacrifice some work hours to reserve an exam slot in the morning. Most slots for at least the next couple of weeks were completely filled. On top of that, I had to reschedule once due to a conflict with a company event. So the exam ended up being a month and a half after I had already graduated driving school.

Each day there are only two possible slots, morning and afternoon. I reserved a morning slot. The entire appointment looked like this:

I arrived at ~8AM on the day of the exam.
8AM - 9AM was spent on submitting the official application for the driver’s license. ~50 other people were lined up with me, so it really did take almost the whole hour.
9AM - 10AM was spent taking the written test. The test was conducted on a tablet, and was a slightly easier version of the practice tests I took online via the driving school’s online portal.
10AM - 11AM was spent announcing results and taking photos for the driver’s license.
11AM - 12PM was spent waiting for the driver’s licenses to be made.
12PM onwards was spent receiving the driver’s license.

The entire process was run like a well-oiled machine. From what I could tell, on the order of 100 or more drivers were being produced by this building every day.

Test taking

While I do not recommend it to others, as it’s a shallow way to study for something as serious as driving, my test study method consisted of practicing test questions online through the online MUSASI system. Each test is 95 questions and I took (and retook) all 6 of them, and also practiced the “difficult question” compilation.

Scheduling discipline

8 months passed between signing up for driving school and receiving the license. I took my classes at a fairly relaxed pace, and when I started at a new job with real salaryman hours I had to do a lot of hours on weekends or the rare weekday when I could reserve the earliest slot. It took some discipline to make sure I would graduate with some buffer, so as not to blow the school’s 9 month time limit.

Language hurdles

Compared to many other non-natives I’ve met, I do not consider myself to be particularly amazing at Japanese. However, in order to not be completely lost I would still recommend to others considering Japanese driving school (in Japanese) that they have fairly high listening and reading comprehension of day-to-day Japanese.

It’s worth noting that having high language comprehension does not necessarily indicate having high language fluency. For example, while I can read most day-to-day Japanese I still read at the speed of an elementary school student, and speak with even lower fluency.

Going forward

Even though it’s nothing special for most people living here, getting a Japanese driving license is probably my proudest accomplishment since moving to Japan. It exercised my ability to tolerate discomfort, being in an environment intended for Japanese natives, on top of having a brain not particularly suited to driving or irregular schedules. 16 years after getting my USA license, I feel one step closer to graduating from “paper driver.”

自動車学校：感想

2025-10-26T09:15:00+00:00

9月中旬にようやく自動車学校を卒業しました。自動車学校を検討している方々の参考になればと思い、その流れや感想を共有したいと思います。

第２段階

第２段階は、約15の学科教習と、約２０の技能教習で、第１段階の2倍の長さです。技能教習は、教員の指導を受けながら実際の東京の路上で行われ、また、特別項目にフォーカスを当てた校内の技能教習も複数あります。

東京の路上は混雑しており、運転における判断を練習する機会を多く与えてくれました。全体的に、第２段階の内容が非常に役に立ったと思っており、おかげで自信が大分ついてきたと感じます。

卒業検定

第２段階を終わらせたら、卒業検定の時間を予約できるようになりました。僕が通っていた自動車学校では、ほぼ毎日卒業検定が行われており、祝日も予約可能でした。

僕を含めた同じ時間に卒業検定を受ける生徒たちが、数台の車と教員が割り当てられ、順番に検定のコースを運転し、運転の正しさや安全性を評価されました。最後に、校内の部分で、縦列駐車か、方向変換のどちらかをさせられ、評価されました。

最後の最後で、校内の対向車線に入ってしまい、しくじったんですが、採点の範囲外と言われ、奇跡的に合格しました。日本特有の細かさか、教員が優しかっただけなのか、もしくは運が良かっただけなのか、分かりません。

試験が終わってすぐ、試験の結果が発表されました。が、卒業証明書をもらうのが数時間後でした。学校に足を運ぶのはそれが最後になりました。

まとめてみると、検定自体は朝に行われて午前１１時から１２時の間に終わり、卒業証明書は午後２時にもらいました。

学科試験

卒業証明書が手に入りましたが、運転免許を取得するにはまだ手続きが残っていました。日本では、政府に指定された自動車学校を卒業しても、警察庁の試験所での技能試験を免除されるだけで、まだ実際に試験所に行って、正式に運転免許の申請を提出し、学科試験を受けなければなりません。

政府に運営されているからか、試験所のスケジュールが自動車学校ほどフレキシブルではなく、仕事を数時間休ませてもらい、平日の時間を予約するしかありませんでした。卒業後の数週間が完全に埋まっていたし、会社のイベントと被って一回リスケをしないといけなかったので、実際試験所に行けたのが自動車学校を卒業して約１ヶ月半でした。

毎日予約可能な時間帯は、朝と昼があり、僕は朝にしました。全体の流れが以下の通りでした：

試験当日の８時ぐらいに試験所につきました。
８時から９時までは申請の手続き。数十人が一緒に並んでいて、実質ほぼ一時間かかりました。
９時から１０時までは試験を受けました。タブレットを利用した、モダンな試験形態でした。
１０時から１１時までは結果の発表と、免許のための写真の撮影。
１１時から１２時までは、免許の作成を待ちました。
１２時からは、並んで免許を順番にもらいました。

毎日100人以上の運転手たちを世に出しているだけに、全体の流れが機械みたいに、順調に進みました。

試験戦略

オススメはしませんが、僕の試験戦略はMUSASIのオンライン模擬試験をひたすら受けることでした。「卒業前」の模擬試験を全部受け、「みんな苦手問題集」もたくさん解いてみました。

スケジューリング

自動車学校の入校から運転免許の取得まで、８ヶ月が経ちました。一日に教習をあまり詰めないようにしましたが、転職して普通のサラリーマンの生活を始めたら、さらに学校に通える時間帯が少なくなりました。学校の９ヶ月の時間制限をオーバーしないように、スケジューリングに気をつけないといけませんでした。

言語の壁

今まで会ってきた他の日本語の非ネイティブたちと比べて、僕は日本語が上手いほうではないと分かっています。が、非ネイティブで日本語の自動車学校への入校を検討している方々には、日本語が流暢じゃなくても、自動車学校を通うには比較的高い聴解力と読解力が必要だと理解して欲しいです。

これからのこと

社会人として大したことないと分かっていますが、自動車学校を卒業して運転免許を取得したことを誇りに思っています。なかなか落ち着くことができない場所や空間、みたことのない状況などを受け入れて対応する練習の機会を多く与えてくれました。特に、運転という行為や、規則正しくない生活にとても不向きな脳を持っている僕は、なおさら苦労して達成した出来事なんです。アメリカで免許を取得して16年が経った今は、ペーパードライバーの卒業に一歩近づいたと感じます。

Breaking prod chapter 2: disappearing friends

2025-10-19T03:25:00+00:00

This is another incident that is memorable because of the unwavering hate for PHP it aroused in me.

Disappearing friend lists on Zynga Poker

Zynga Poker is one of Zynga’s oldest games. When I worked on it years ago, it had a Frankenstein backend consisting of sedimentary layers of PHP, strapped with duct tape to a Java-based TCP socket server. These stood atop a home-grown user storage layer, lovingly called “Sexy,” that used Memcached as a front cache backed by MySQL. Custom clients for Sexy were implemented in both PHP and Java, so that each part of the stack could interact directly with it.

Among many other pieces of data, we stored user friend lists in Sexy. We served the data from our Java server, but for some reason there was an extra bit of indirection that went through PHP. So serving the friend list took a path like this:

Flash game client -> Java -> PHP -> Sexy -> PHP -> Java -> Flash game client

I don’t remember exactly why, but I decided to take out the PHP part of that path so that it would look like this:

Flash game client -> Java -> Sexy -> Java -> Flash game client

It seemed like a straightforward change: I just needed to port some PHP into Java. What could possibly go wrong?

Incidentally, the friend list was stored in a data structure like this:

{
  // social network -> user ID list
  "1": ["user id 1", "user id 2"], // "facebook" friends
  "29": ["user id 3", "user id 4"] // "some other social network" friends
}

But sometimes the data looked like this:

{
  "0": [], // probably some pre-historic data migration bug
  "1": ["user id 1", "user id 2"], // "facebook" friends
  "29": ["user id 3", "user id 4"] // "some other social network" friends
}

Which is fine - it’s not user-facing - but what if a user only has Facebook friends and no friends from social network 29?

PHP’s json_encode happily gives you this:

[[], ["user id 1", "user id 2"]]

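Thanks to a peculiarity of PHP arrays, our serialized data can now be either an array or an object! A PHP array is an ordered map, and json_encode emits a JSON array whenever the keys happen to be the sequential integers 0, 1, 2, and so on; otherwise it emits a JSON object. This is fine and completely invisible if you only read and write the data in PHP.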

After porting the code to Java, everything looked perfectly fine. Until the code reached production, of course, where some users suddenly lost their friend lists!

Aftermath

It was easy enough to stop the bleeding - if memory serves me right I used some feature of the very excellent and flexible Jackson JSON API to maintain compatibility with the PHP-JSON-serialized arrays.

But I was not able to do much more! My very capable tech lead had to jump in and work with our analytics team to reconstruct friend lists from analytics.

At the time, I did not have the communication skills, nor the resourcefulness to step outside of our day-to-day process and proactively work with the right people to manage this unique production incident to complete resolution. However, having broken prod quite a few times since then, I’d like to think that I’ve improved in this area.
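Returning to the technical fix for a moment, here is a sketch of the kind of compatibility shim described above. It is not the original code and does not use Jackson: it assumes the JSON has already been decoded into plain Java collections (a Map for a JSON object, a List for a JSON array), and FriendListCompat is a hypothetical name. When PHP emitted a JSON array, each element’s position is its original key, so the index is turned back into a social network ID.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FriendListCompat {
    // Accepts either decoded shape and always returns "network ID -> user IDs".
    @SuppressWarnings("unchecked")
    static Map<String, List<String>> normalize(Object decoded) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        if (decoded instanceof Map) {
            // Object form: keys are already social network IDs.
            for (Map.Entry<?, ?> e : ((Map<?, ?>) decoded).entrySet()) {
                result.put(String.valueOf(e.getKey()), (List<String>) e.getValue());
            }
        } else if (decoded instanceof List) {
            // Array form: PHP dropped the keys because they were 0, 1, 2, ...
            List<?> array = (List<?>) decoded;
            for (int i = 0; i < array.size(); i++) {
                result.put(String.valueOf(i), (List<String>) array.get(i));
            }
        } else {
            throw new IllegalArgumentException("Unexpected friend list shape: " + decoded);
        }
        return result;
    }
}
```

Both shapes now come out as the same map, so the rest of the Java code only has to handle one representation.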

If it ain’t broke, don’t fix it?

It may be tempting to use this incident as proof of the mantra “if it ain’t broke, don’t fix it,” but I think that this is a simplistic and dangerous statement. It instills fear when instead (in my humble opinion) skillful and calculated risk-taking ought to be encouraged, both for the happiness of the developer and the maintainability of the software project.

「本番環境を壊した」第二章：消えるフレンドリスト

2025-10-19T03:25:00+00:00

揺らがぬ PHP への怒りが芽生えたきっかけとして、まだ記憶に残っている本番環境障害の話です。

Zynga Poker で消えるフレンドリスト

Zynga Poker は Zynga の最も古いゲームです。何年も前にそのチームの一員として働かせていただいていた時、古代から積んできた PHP の堆積層が危うくダクトテープで繋いだ Java ベースの TCP ソケットサーバーのフランケンシュタインが、ゲームのバックエンドでした。「セクシー」という愛称で呼ばれた、Memcached を MySQL のフロントキャッシュにした自家製データーレイヤがこれら二つのアプリサーバーを支えていました。どちらからもセクシーを直接アクセスできるように、セクシーのクライアントは PHP と Java に対応した実装が存在しました。

他のユーザーデータと同じく、フレンドリストもセクシーに保存されており、ゲームクライアントがそのデータを Java サーバーから取得していました。ところが、データの流れが遠回りで、なぜか PHP の部分も含まれていたのです。大まかにこんな感じでした：

Flashゲームクライアント -> Java -> PHP -> セクシー -> PHP -> Java -> Flashゲームクライアント

理由は詳しく思い出せませんが、なぜか PHP の部分を抜こうと思いました：

Flashゲームクライアント -> Java -> セクシー -> Java -> Flashゲームクライアント

PHP を Java に書き直すという単純労働が、うまくいかないはずがないんじゃないかと。

ちなみに、フレンドリストのデータ構造がこんな感じでした：

{
  // SNS -> ユーザーIDの配列
  "1": ["ユーザーID 1", "ユーザーID 2"], // "Facebook"フレンド
  "29": ["ユーザーID 3", "ユーザーID 4"] // "某SNS"フレンド
}

が、ごく一部のユーザーのデータは「0」インデックスの配列も謎に含まれていたのです：

{
  "0": [], // おそらく、前史のデータ移行バグ
  "1": ["ユーザーID 1", "ユーザーID 2"], // "Facebook"フレンド
  "29": ["ユーザーID 3", "ユーザーID 4"] // "某SNS"フレンド
}

ユーザーに影響がなかったので、それでも良かったんですが、Facebook のフレンドしかいないユーザーのデータは不思議な現象が起きます！ PHP のjson_encodeが黙々とこんなものを返してくれます：

[[], ["ユーザーID 1", "ユーザーID 2"]]

PHP 配列の特徴のおかげで、シリアライズ後のデータが、配列としてもオブジェクトとしても存在しうるのです！PHP を通してデータの読み書きを行えばなんの問題もありません。

待ち構えている災害に全く気づかず、コードを Java に書き直し、本番環境に出しました。

後処理

出血を止めるのは簡単でした。記憶が正しければ、使い勝手が良く、柔軟性に富んでいる Jackson JSON をうまく使い、PHP によってシリアライズされたデータとの互換性を維持することに成功しました。

が、まだ青い僕はそれ以上何もできなかったのです！とても頼れるテックリードがアナリティクスチームと連携を取り、フレンドリストの再構築に成功し、助けていただきました。

その頃の僕は、非常時に他チームと連携を取るためのコミュ力と行動力が不足しており、今回の特徴的な本番環境障害を最後の最後まで解決に向かわせることができなかったのです。とはいえ、その時から数多くの本番環境障害の経験を経てきた今の僕は、少しぐらいは成長してきたんじゃないかと思っています。

「壊れていないものを直すな」

英語で「If it ain’t broke, don’t fix it」ということわざがありますが、「壊れていないものを直すな」を意味します。今回の出来事はその一例ではないかと考えるのが自然かもしれませんが、僕が思うに、ことわざ自体は単純すぎて、誤解しやすいのです。ソフトウェアの変化に対する恐怖心を煽り、開発者の満足度と、ソフトウェアプロジェクトの保守性の妨げになりかねないのです。

Breaking prod chapter 1: chat via polling

2025-10-18T03:25:00+00:00

I’ve been working on live service games for a while and have broken production plenty of times, both as an inexperienced intern and even as a slightly more experienced middle-aged man. I thought it would be interesting to write about some of the memorable and interesting ones before they fade from my sleep-deprived brain.

Breaking prod at gloops

This incident is memorable because it was the very first one I caused, in my very first engineering role as an intern at a Japanese games company called gloops. In fact, it happened not once, but twice! It knocked the game down during its busiest time of day, when the company made the most money, resulting in a loss of tens of thousands of dollars.

The game was basically a server-side rendered web application with some of the flashier UI built in Flash. It featured a bulletin board where team members could communicate with each other to coordinate during battle events.

As a young and energetic intern, I thought it was lame that the users had to refresh the entire page to see new posts on the bulletin board. Wouldn’t it be cool if the bulletin board updated in realtime? AJAX was all the rage in 2012, so why not just poll the server for new messages? Of course, I never seriously considered the question of why not, and sprang into action.

The result? When the battle event started, servers were almost instantly overloaded and nobody was able to play the game! And I did this, not only once, but twice, thinking for sure that Redis would save the feature. Needless to say, we did not try a third time.

Looking back

It was a combination of a hungry but inexperienced intern (me), and a team that was okay with load-testing on prod that allowed this to happen. Was there a possible implementation that would not have broken the game? Maybe, but I did not have the skill or experience to find out.

This article is not intended to say anything about technical design choices. I have successfully used polling for receiving asynchronous, realtime notifications in other projects, and I would argue that it’s much easier to maintain eventually consistent state with polling because you are forced to think about that problem, rather than throwing everything at a WebSocket and hoping for the best.
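As an illustration of what being forced to think about that problem can look like, here is a minimal in-memory sketch of cursor-based polling. All names (Message, MessageBoard, BoardClient, fetchSince) are hypothetical and there is no networking. The point is only that the client asks for everything after the last message it has applied, so a skipped, delayed, or repeated poll still converges on the same state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical message with a server-assigned, strictly increasing ID.
record Message(long id, String text) {}

// Stand-in for the server: hands out everything newer than a given cursor.
class MessageBoard {
    private final List<Message> messages = new ArrayList<>();

    synchronized Message post(String text) {
        Message m = new Message(messages.size() + 1, text);
        messages.add(m);
        return m;
    }

    synchronized List<Message> fetchSince(long cursor) {
        List<Message> newer = new ArrayList<>();
        for (Message m : messages) {
            if (m.id() > cursor) {
                newer.add(m);
            }
        }
        return newer;
    }
}

// Stand-in for the client: remembers the ID of the last message it applied.
class BoardClient {
    private final List<Message> seen = new ArrayList<>();
    private long cursor = 0;

    // One poll. Calling it late, early, or twice in a row gives the same end state.
    void poll(MessageBoard board) {
        for (Message m : board.fetchSince(cursor)) {
            seen.add(m);
            cursor = m.id();
        }
    }

    List<Message> seen() {
        return List.copyOf(seen);
    }
}
```

This sketch says nothing about server load, which is what took the game down at gloops; the polling interval and the cost of each fetch still have to be budgeted separately.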

「本番環境を壊した」第一章：チャット

2025-10-18T03:25:00+00:00

長年ライブサービスゲームに携わってきた僕は、未経験のインターンとしても、経験を積んできたおじさんでも、数えきれないほど本番環境を壊した経験があります。その記憶が睡眠不足の脳から消える前に、より印象に残った件について書いてみたいと思います。

gloops で本番環境を壊した話

この件は僕がインターンだったとはいえ、初めてのエンジニアとしての仕事で、初めて本番環境を壊した件なので、13 年後になっても忘れていない話です。障害は 1 回だけでなく、2 回も起こしてしまい、ゲームが一番ユーザー数が多く、売り上げが高い時間帯にゲームをダウンさせました。

ゲーム自体は、UI がサーバー側でレンダリングされていたウェブアプリで、キャラクターを動かすシーンなど、より重い UI は Flash で作られていました。チーム連携を取るための掲示板もありました。

元気でまだ青いインターンだった僕は、掲示板の最新投稿を見るためにユーザーがページを再読み込みしないといけないのが非常にダサいと思い、リアルタイムで掲示板が自動的に更新されたら、かっこいいではないかと思いました。AJAX が流行っている 2012 年だし、新しい投稿を非同期的にポーリングしない理由なんてないんじゃないかと。もちろん、そんな理由を真剣に考えようともせず、行動に移ったのみです。

結果として、バトルのイベントが始まった途端、サーバーが負荷に耐えきれず、ゲームが動かなくなってしまいました。Redis を採用した再挑戦も同じ結果になってしまい、流石にそれ以上は機能をリリースすることはありませんでした。

反省

思い返してみると、行動力がありながら未経験だったインターンの僕が、本番環境で負荷テストをしても良いチームに入っていたという事情が、障害の一番の原因だったと思っています。ゲームをダウンさせないで済む実装方法はあったかもしれませんが、それを知る経験と実力はなかったのです。

以上の内容はポーリングがどうとか意見を述べようという意図は全くありません。他のプロジェクトでリアルタイム通信を受け取る方法としてポーリングを採用したこともあり、WebSocket にデータを丸投げするより、ポーリングを使うと結果整合性の問題と直接向き合わないといけないので、ポーリングのほうが技術的選択肢として全然アリだと思っています。