GitHub Engineering, Author at The GitHub Blog

Vulcanizer: a library for operating Elasticsearch

GitHub Engineering — Tue, 05 Mar 2019 18:00:28 +0000

At GitHub, we use Elasticsearch as the main technology backing our search services. In order to administer our clusters, we use ChatOps via Hubot. As of 2017, those commands were a collection of Bash and Ruby-based scripts.

Although this served our needs for a time, it was becoming increasingly apparent that these scripts lacked composability and reusability. It was also difficult to contribute back to the community by open sourcing any of these scripts due to the fact they are specific to bespoke GitHub infrastructure.

Why build something new?

There are plenty of excellent Elasticsearch libraries, both official and community driven. For Ruby, GitHub has already released the Elastomer library and for Go we make use of the Elastic library by user olivere. However, these libraries focus primarily on indexing and querying data. This is exactly what an application needs to use Elasticsearch, but it’s not the same set of tools that operators of an Elasticsearch cluster need. We wanted a high-level API that corresponded to the common operations we took on a cluster, such as disabling allocation or draining the shards from a node. Our goal was a library that focused on these administrative operations and that our existing tooling could easily use.

Full speed ahead with Go…

We started looking into Go and were inspired by GitHub’s success with freno and orchestrator.

Go’s structure encourages the construction of composable (self-contained, stateless, components that can be selected and assembled) software, and we saw it as a good fit for this application.

… Into a wall

We initially scoped the project out to be a packaged chat app and planned to open source only what we were using internally. During implementation, however, we ran into a few problems:

GitHub uses a simple protocol based on JSON-RPC over HTTPS called ChatOps RPC. However, ChatOps RPC is not widely adopted outside of GitHub. This would make integration of our application into ChatOps infrastructure difficult for most parties.
The internal REST library our ChatOps commands relied on was not open sourced. Some of the dependencies of this REST library would also need to be open sourced. We’ve started the process of open sourcing this library and its dependencies, but it will take some time.
We relied on Consul for service discovery, which not everyone uses.

Based on these factors we decided to break out the core of our library into a separate package that we could open source. This would decouple the package from our internal libraries, Consul, and ChatOps RPC.

The package would only have a few goals:

Access the REST endpoints on a single host.
Perform an action.
Provide results of the action.

This module could then be open sourced without being tied to our internal infrastructure, so that anyone could use it with the ChatOps infrastructure, service discovery, or tooling they choose.

To that end, we wrote vulcanizer.

Vulcanizer

Vulcanizer is a Go library for interacting with an Elasticsearch cluster. It is not meant to be a full-fledged Elasticsearch client. Its goal is to provide a high-level API to help with common tasks that are associated with operating an Elasticsearch cluster such as querying health status of the cluster, migrating data off of nodes, updating cluster settings, and more.

Examples of the Go API

Elasticsearch is great in that almost all things you’d want to accomplish can be done via its HTTP interface, but you don’t want to write JSON by hand, especially during an incident. Below are a few examples of how we use Vulcanizer for common tasks and the equivalent curl commands. The Go examples are simplified and don’t show error handling.

Getting nodes of a cluster

You’ll often want to list the nodes in your cluster to pick out a specific node or to see how many nodes of each type you have in the cluster.

$ curl localhost:9200/_cat/nodes?h=master,role,name,ip,id,jdk
- mdi vulcanizer-node-123 172.0.0.1 xGIs 1.8.0_191
* mdi vulcanizer-node-456 172.0.0.2 RCVG 1.8.0_191

Vulcanizer exposes typed structs for these types of objects.

v := vulcanizer.NewClient("localhost", 9200)

nodes, err := v.GetNodes()

fmt.Printf("Node information: %#v\n", nodes[0])
// Node information: vulcanizer.Node{Name:"vulcanizer-node-123", Ip:"172.0.0.1", Id:"xGIs", Role:"mdi", Master:"-", Jdk:"1.8.0_191"}

Update the max recovery cluster setting

The index recovery speed is a common setting to update when you want balance time to recovery and I/O pressure across your cluster. The curl version has a lot of JSON to write.

$ curl -XPUT localhost:9200/_cluster/settings -d '{ "transient": { "indices.recovery.max_bytes_per_sec": "1000mb" } }'
{
"acknowledged": true,
"persistent": {},
"transient": {
"indices": {
"recovery": {
"max_bytes_per_sec": "1000mb"
}
}
}
}

The Vulcanizer API is fairly simple and will also retrieve and return any existing setting for that key so that you can record the previous value.

v := vulcanizer.NewClient("localhost", 9200)
oldSetting, newSetting, err := v.SetSetting("indices.recovery.max_bytes_per_sec", "1000mb")
// "50mb", "1000mb", nil

Move shards on to and off of a node

To safely update a node, you can set allocation rules so that data is migrated off a specific node. In the Elasticsearch settings, this is a comma-separated list of node names, so you’ll need to be careful not to overwrite an existing value when updating it.

$ curl -XPUT localhost:9200/_cluster/settings -d '
{
"transient" : {
"cluster.routing.allocation.exclude._name" : "vulcanizer-node-123,vulcanizer-node-456"
}
}'

The Vulcanizer API will safely add or remove nodes from the exclude settings so that shards won’t allocate on to a node unexpectedly.

v := vulcanizer.NewClient("localhost", 9200)

// Existing exclusion settings:
// vulcanizer-node-123,vulcanizer-node-456

exclusionSetttings1, err := v.DrainServer("vulcanizer-node-789")
// vulcanizer-node-123,vulcanizer-node-456,vulcanizer-node-789

exclusionSetttings2, err := v.FillOneServer("vulcanizer-node-456")
// vulcanizer-node-123,vulcanizer-node-789

Command-line application

Included is a small CLI application that leverages the library:

$ vulcanizer -h
Usage:
  vulcanizer [command]

Available Commands:
  allocation  Set shard allocation on the cluster.
  drain       Drain a server or see what servers are draining.
  fill        Fill servers with data, removing shard allocation exclusion rules.
  health      Display the health of the cluster.
  help        Help about any command
  indices     Display the indices of the cluster.
  nodes       Display the nodes of the cluster.
  setting     Interact with cluster settings.
  settings    Display all the settings of the cluster.
  snapshot    Interact with a specific snapshot.
  snapshots   Display the snapshots of the cluster.

Flags:
  -c, --cluster string      Cluster to connect to defined in config file
  -f, --configFile string   Configuration file to read in (default to "~/.vulcanizer.yaml")
  -h, --help                help for vulcanizer
      --host string         Host to connect to (default "localhost")
  -p, --port int            Port to connect to (default 9200)

Use "vulcanizer [command] --help" for more information about a command.

ChatOps

ChatOps is important for GitHub and our geographically distributed workforce. Vulcanizer enables us to build ChatOps tooling around Elasticsearch quickly and easily for common tasks:

jessbreckenridge

.vulcan-go health testbed

Hubot

The cluster is green: All primary and replica shards are allocated. This does NOT mean the cluster is otherwise healthy.
+--------+------------+------+------------+----------+
| STATUS | RELOCATING | INIT | UNASSIGNED | ACTIVE % |
+--------+------------+------+------------+----------+
| green  | 0          | 0    | 0          | 100      |
+--------+------------+------+------------+----------+

jessbreckenridge

.vulcan-go nodes testbed

Hubot

testbed nodes
+--------+------+------------------------+-----------+-----------+
| MASTER | ROLE |          NAME          |    JDK    |      IP   |
+--------+------+------------------------+-----------+-----------+
| *      | mi   | search-arbiter-aaaaaaa | 1.8.0_111 | 172.0.0.1 |
| -      | mi   | search-arbiter-bbbbbbb | 1.8.0_111 | 172.0.0.2 |
| -      | mi   | search-arbiter-ccccccc | 1.8.0_111 | 172.0.0.3 |
| -      | di   | search-storage-ddddddd | 1.8.0_111 | 172.0.0.4 |
| -      | di   | search-storage-eeeeeee | 1.8.0_111 | 172.0.0.5 |
| -      | di   | search-storage-eeeeeee | 1.8.0_111 | 172.0.0.6 |
+--------+------+------------------------+-----------+-----------+

nickcanz

.vulcan-go drain status testbed

Hubot

IP's filtered: None
Name Excluded: None
Host Excluded: None

nickcanz

.vulcan-go drain server search-storage-zzzzzzz testbed

Hubot

Server "search-storage-zzzzzzz" added to list of servers to drain.

Complete list servers draining:
    Ips: None
    Names: search-storage-zzzzzzz
    Hosts: None

nickcanz

.vulcan-go settings testbed

Hubot

testbed settings
No persistent settings are set.
+------------------------------------------+------------------------+
|            TRANSIENT SETTINGS            |         VALUE          |
+------------------------------------------+------------------------+
| cluster.routing.allocation.exclude._name | search-storage-zzzzzzz |
+------------------------------------------+------------------------+

Closing

We stumbled a bit when we first started down this path, but the end result is best for everyone:

Since we had to regroup about what exact functionality we wanted to open source, we made sure we were providing value to ourselves and the community instead of just shipping something.
Internal tooling doesn’t always follow engineering best practices like proper release management, so developing Vulcanizer in the open provides an external pressure to make sure we follow all of the best practices.
Having all of the Elasticsearch functionality in its own library allows our internal applications to be very slim and isolated. Our different internal applications have a clear dependency on Vulcanizer instead of having different internal applications depend on each other or worse, trying to get ChatOps to talk to other ChatOps.

Visit the Vulcanizer repository to clone or contribute to the project. We have ideas for future development in the Vulcanizer roadmap.

Authors

nickcanz

jessbreckenridge

The post Vulcanizer: a library for operating Elasticsearch appeared first on The GitHub Blog.

Towards Natural Language Semantic Code Search

GitHub Engineering — Tue, 18 Sep 2018 07:00:00 +0000

This blog post complements a live demonstration on our recently announced site: experiments.github.com

Motivation

Searching code on GitHub is currently limited to keyword search. This assumes either the user knows the syntax, or can anticipate what keywords might be in comments surrounding the code they are looking for. Our machine learning scientists have been researching ways to enable the semantic search of code.

To fully grasp the concept of semantic search, consider the below search query, “ping REST api and return results”:

Note that the demonstrated semantic search returns reasonable results even though there are no keywords in common between the search query and the text (the code & comments found do not contain the words “Ping”, “REST” or “api”)! The implications of augmenting keyword search with semantic search are profound. For example, such a capability would expedite the process of on-boarding new software engineers onto projects and bolster the discoverability of code in general.

In this post, we want to share how we are leveraging deep learning to make progress towards this goal. We also share an open source example with code and data that you can use to reproduce these results!

Introduction

One of the key areas of machine learning research underway at GitHub is representation learning of entities, such as repos, code, issues, profiles and users. We have made significant progress towards enabling semantic search by learning representations of code that share a common vector space as text. For example, consider the below diagram:

In the above example, Text 2 (blue) is a reasonable description of the code, whereas Text 1 (red) is not related to the code at all. Our goal is to learn representations where (text, code) pairs that describe the same concept are close neighbors, whereas unrelated (text, code) pairs are further apart. By representing text and code in the same vector space, we can vectorize a user’s search query and lookup the nearest neighbor that represents code. Below is a four-part description of the approach we are currently using to accomplish this task:

1. Learning Representations of Code

In order to learn a representation of code, we train a sequence-to-sequence model that learns to summarize code. A way to accomplish this for Python is to supply (code, docstring) pairs where the docstring is the target variable the model is trying to predict. One active area of research for us is incorporating domain specific optimizations like tree-based LSTMs, gated-graph networks and syntax-aware tokenization. Below is a screenshot that showcases the code summarizer model at work. In this example, there are two python functions supplied as input, and in both cases the model produces a reasonable summary of the code as output:

It should be noted that in the above examples, the model produces the summary by using the entire code blob, not merely the function name.

Building a code summarizer is a very exciting project on its own, however, we can utilize the encoder from this model as a general purpose feature extractor for code. After extracting the encoder from this model, we can fine-tune it for the task of mapping code to the vector space of natural language.

We can evaluate this model objectively using the BLEU score. Currently we have been able to achieve a BLEU score of 13.5 on a holdout set of python code, using the fairseq-py library for sequence to sequence models.

2. Learning Representations of Text Phrases

In addition to learning a representation for code, we needed to find a suitable representation for short phrases (like sentences found in Python docstrings). Initially, we experimented with the Universal Sentence Encoder, a pre-trained encoder for text that is available on TensorFlow Hub. While the embeddings from worked reasonably well, we found that it was advantageous to learn embeddings that were specific to the vocabulary and semantics of software development. One area of ongoing research involves evaluating different domain-specific corpuses for training our own model, ranging from GitHub issues to third party datasets.

To learn this representation of phrases, we trained a neural language model by leveraging the fast.ai library. This library gave us easy access to state of the art architectures such as AWD LSTMs, and to techniques such as cyclical learning rates with random restarts. We extracted representations of phrases from this model by summarizing the hidden states using the concat pooling approach found in this paper.

One of the most challenging aspects of this exercise was to evaluate the quality of these embeddings. We are currently building a variety of downstream supervised tasks similar to those outlined here that will aid us in evaluating the quality of these embeddings objectively. In the meantime, we sanity check our embeddings by manually examining the similarity between similar phrases. The below screenshot illustrates examples where we search the vectorized docstrings for similarity against user-supplied phrases:

3. Mapping Code Representations To The Same Vector-Space as Text

Next, we map the code representations we learned from the code summarization model (part 1) to the vector space of text. We accomplish this by fine-tuning the encoder of this model. The inputs to this model are still code blobs, however the target variable the model is now the vectorized version of docstrings. These docstrings are vectorized using the approach discussed in the previous section.

Concretely, we perform multi-dimensional regression with cosine proximity loss to bring the hidden state of the encoder into the same vector-space as text.

We are actively researching alternate approaches that directly learn a joint vector space of code and natural language, borrowing from some ideas outlined here.

4. Creating a Semantic Search System

Finally, after successfully creating a model that can vectorize code into the same vector-space as text, we can create a semantic search mechanism. In its most simple form, we can store the vectorized version of all code in a database, and perform nearest neighbor lookups to a vectorized search query.

Another active area of our research is determining the best way to augment existing keyword search with semantic results and how to incorporate additional information such as context and relevance.
Furthermore, we are actively exploring ways to evaluate the quality of search results that will allow us to iterate quickly on this problem. We leave these topics for discussion in a future blog post.

Summary

The below diagram summarizes all the steps in our current semantic-search workflow:

We are exploring ways to improve almost every component of this approach, including data preparation, model architecture, evaluation procedures, and overall system design. What is described in this blog post is only a minimal example that scratches the surface.

Open Source Examples

Our open-source end-to-end tutorial contains a detailed walkthrough of the approach outlined in this blog, along with code and data you can use to reproduce the results.

This open source example (with some modifications) is also used as a tutorial for the kubeflow project, which is implemented here.

Limitations and Intended Use Case(s)

We believe that semantic code search will be most useful for targeted searches of code within specific entities such as repos, organizations or users as opposed to general purpose “how to” queries. The live demonstration of semantic code search hosted on our recently announced Experiments site does not allow users to perform targeted searches of repos. Instead, this demonstration is designed to share a taste of what might be possible and searches only a limited, static set of python code.

Furthermore, like all machine learning techniques, the efficacy of this approach is limited by the training data used. For example, the data used to train these models are (code, docstring) pairs. Therefore, search queries that closely resemble a docstring have the greatest chance of success. On the other hand, queries that do not resemble a docstring or contain concepts for which there is little data may not yield sensible results. Therefore, it is not difficult to challenge our live demonstration and discover the limitations of this approach. Nevertheless, our initial results indicate that this is an extremely fruitful area of research that we are excited to share with you.

There are many more use cases for semantic code search. For example, we could extend the ideas presented here to allow users to search for code using the language of their choice (French, Mandarin, Arabic, etc.) across many different programming languages simultaneously.

Get In Touch

This is an exciting time for the machine learning research team at GitHub and we are looking to expand. If our work interests you, please get in touch!

Contributors

hamelsmu

hohsiangwu

The post Towards Natural Language Semantic Code Search appeared first on The GitHub Blog.

Removing jQuery from GitHub.com frontend

GitHub Engineering — Thu, 06 Sep 2018 07:00:00 +0000

We have recently completed a milestone where we were able to drop jQuery as a dependency of the frontend code for GitHub.com. This marks the end of a gradual, years-long transition of increasingly decoupling from jQuery until we were able to completely remove the library. In this post, we will explain a bit of history of how we started depending on jQuery in the first place, how we realized when it was no longer needed, and point out that—instead of replacing it with another library or framework—we were able to achieve everything that we needed using standard browser APIs.

Why jQuery made sense early on

GitHub.com pulled in jQuery 1.2.1 as a dependency in late 2007. For a bit of context, that was a year before Google released the first version of their Chrome browser. There was no standard way to query DOM elements by a CSS selector, no standard way to animate visual styles of an element, and the XMLHttpRequest interface pioneered by Internet Explorer was, like many other APIs, inconsistent between browsers.

jQuery made it simple to manipulate the DOM, define animations, and make “AJAX” requests— basically, it enabled web developers to create more modern, dynamic experiences that stood out from the rest. Most importantly of all, the JavaScript features built in one browser with jQuery would generally work in other browsers, too. In those early days of GitHub when most of its features were still getting fleshed out, this allowed the small development team to prototype rapidly and get new features out the door without having to adjust code specifically for each web browser.

The simple interface of jQuery also served as a blueprint to craft extension libraries that would later serve as building blocks for the rest of GitHub.com frontend: pjax and facebox.

We will always be thankful to John Resig and the jQuery contributors for creating and maintaining such a useful and, for the time, essential library.

Web standards in the later years

Over the years, GitHub grew into a company with hundreds of engineers and a dedicated team gradually formed to take responsibility for the size and quality of JavaScript code that we serve to web browsers. One of the things that we’re constantly on the lookout for is technical debt, and sometimes technical debt grows around dependenices that once provided value, but whose value dropped over time.

When it came to jQuery, we compared it against the rapid evolution of supported web standard in modern browsers and realized:

The $(selector) pattern can easily be replaced with querySelectorAll();
CSS classname switching can now be achieved using Element.classList;
CSS now supports defining visual animations in stylesheets rather than in JavaScript;
$.ajax requests can be performed using the Fetch Standard;
The addEventListener() interface is stable enough for cross-platform use;
We could easily encapsulate the event delegation pattern with a lightweight library;
Some syntactic sugar that jQuery provides has become reduntant with the evolution of JavaScript language.

Furthermore, the chaining syntax didn’t satisfy how we wanted to write code going forward. For example:

$('.js-widget')
  .addClass('is-loading')
  .show()

This syntax is simple to write, but to our standards, doesn’t communicate intent really well. Did the author expect one or more js-widget elements on this page? Also, if we update our page markup and accidentally leave out the js-widget classname, will an exception in the browser inform us that something went wrong? By default, jQuery silently skips the whole expresion when nothing matched the initial selector; but to us, such behavior was a bug rather than a feature.

Finally, we wanted to start annotating types with Flow to perform static type checking at build time, and we concluded that the chaining syntax doesn’t lend itself well to static analysis, since almost every result of a jQuery method call is of the same type. We chose Flow over alternatives because, at the time, features such as @flow weak mode allowed us to progressively and efficiently start applying types to a codebase which was largely untyped.

All in all, decoupling from jQuery would mean that we could rely on web standards more, have MDN web docs be de-facto default documentation for our frontend developers, maintain more resilient code in the future, and eventually drop a 30 kB dependency from our packaged bundles, speeding up page load times and JavaScript execution times.

Incremental decoupling

Even with an end goal in sight, we knew that it wouldn’t be feasible to just allocate all resources we had to rewriting everything from jQuery to vanilla JS. If anything, such a rushed endeavor would likely lead to many regressions in site functionality that we would later have to weed out. Instead, we:

Set up metrics that tracked ratio of jQuery calls used per overall line of code and monitored that graph over time to make sure that it’s either staying constant or going down, not up.
We discouraged importing jQuery in any new code. To facilitate that using automation, we created eslint-plugin-jquery which would fail CI checks if anyone tried to use jQuery features, for example $.ajax.
There were now plenty of violations of eslint rules in old code, all of which we’ve annotated with specific eslint-disable rules in code comments. To the reader of that code, those comments would serve as a clear signal that this code doesn’t represent our current coding practices.
We created a pull request bot that would leave a review comment on a pull request pinging our team whenever somebody tried to add a new eslint-disable rule. This way we would get involved in code review early and suggest alternatives.
A lot of old code had explicit coupling to external interfaces of pjax and facebox jQuery plugins, so we’ve kept their interfaces relatively the same while we’ve internally replaced their implementation with vanilla JS. Having static type checking helped us have greater confidence around those refactorings.
Plenty of old code interfaced with rails-behaviors, our adapter for the Ruby on Rails approach to “unobtrusive” JS, in a way that they would attach an AJAX lifecycle handler to certain forms:
```
// LEGACY APPROACH
$(document).on('ajaxSuccess', 'form.js-widget', function(event, xhr, settings, data) {
  // insert response data somewhere into the DOM
})
```
Instead of having to rewrite all of those call sites at once to the new approach, we’ve opted to trigger fake ajax* lifecycle events and keep these forms submitting their contents asynchronously as before; only this time fetch() was used internally.
We maintained a custom build of jQuery and whenever we’ve identified that we’re not using a certain module of jQuery anymore, we would remove it from the custom build and ship a slimmer version. For instance, after we have removed the final usage of jQuery-specific CSS pseudo-selectors such as :visible or :checkbox, we were able to remove the Sizzle module; and when the last $.ajax call was replaced with fetch(), we were able to remove the AJAX module. This served a dual purpose: speeding up JavaScript execution times while at the same time ensuring that no new code is created that would try using the removed functionality.
We kept dropping support for old Internet Explorer versions as soon as it would be feasible to, as informed by our site analytics. Whenever use of a certain IE version dropped below a certain threshold, we would stop serving JavaScript to it and focus on testing against and supporting more modern browsers. Dropping support for IE 8–9 early on allowed us to adopt many native browser features that would otherwise be hard to polyfill.
As part of our refined approach to building frontend features on GitHub.com, we focused on getting away with regular HTML foundation as much as we could, and only adding JavaScript behaviors as progressive enhancement. As a result, even those web forms and other UI elements that were enhanced using JS would usually also work with JavaScript disabled in the browser. In some cases, we were able to delete certain legacy behaviors altogether instead of having to rewrite them in vanilla JS.

With these and similar efforts combined over the years, we were able gradually reduce our dependence on jQuery until there was not a single line of code referencing it anymore.

Custom Elements

One technology that has been making waves in the recent years is Custom Elements: a component library native to the browser, which means that there are no additional bytes of a framework for the user to download, parse and compile.

We had created a few Custom Elements based on the v0 specification since 2014. However, as standards were still in flux back then, we did not invest as much. It was not until 2017 when the Web Components v1 spec was released and implemented in both Chrome and Safari that we began to adopt Custom Elements on a wider scale.

During the jQuery migration, we looked for patterns that would be suitable for extraction as custom elements. For example, we converted our facebox usage for displaying modal dialogs to the element.

Our general philosophy of striving for progressive enhancement extends to custom elements as well. This means that we keep as much of the content in markup as possible and only add behaviors on top of that. For example, shows the raw timestamp by default and gets upgraded to translate the time to the local timezone, while , when nested in the

element, is interactive even without JavaScript, but gets upgraded with accessibility enhancements.

Here is an example of how a custom element could be implemented:

// The local-time element displays time in the user's current timezone
// and locale.
//
// Example:
//   Sep 6, 2018
//
class LocalTimeElement extends HTMLElement {
  static get observedAttributes() {
    return ['datetime']
  }

  attributeChangedCallback(attrName, oldValue, newValue) {
    if (attrName === 'datetime') {
      const date = new Date(newValue)
      this.textContent = date.toLocaleString()
    }
  }
}

if (!window.customElements.get('local-time')) {
  window.LocalTimeElement = LocalTimeElement
  window.customElements.define('local-time', LocalTimeElement)
}

One aspect of Web Components that we’re looking forward to adopting is Shadow DOM. The powerful nature of Shadow DOM has the potential to unlock a lot of possibilities for the web, but that also makes it harder to polyfill. Because polyfilling it today incurs a performance penalty even for code that manipulates parts of the DOM unrelated to web components, it is unfeasible for us to start using it in production.

Polyfills

These are the polyfills that helped us transition to using standard browser features. We try to serve most of these polyfills only when absolutely necessary, i.e. to outdated browsers as part of a separate “compatibility” JavaScript bundle.

Authors

Application Engineer

Application Engineer

Application Engineer

Application Engineer

The post Removing jQuery from GitHub.com frontend appeared first on The GitHub Blog.

Mitigating replication lag and reducing read load with freno

GitHub Engineering — Fri, 13 Oct 2017 07:00:00 +0000

At GitHub, we use MySQL as the main database technology backing our services. We run classic MySQL master-replica setups, where writes go to the master, and replicas replay master’s changes asynchronously. To be able to serve our traffic we read data from the MySQL replicas. To scale our traffic we may add more replica servers. Reading from the master does not scale and we prefer to minimize master reads.

With asynchronous replication, changes we make to the master are not immediately reflected on replicas. Each replica pulls changes from its master and replays them as fast as it can. There is a nonzero delay between the point in time where changes are made visible on a master and the time where those changes are visible on some replica or on all replicas. This delay is the replication lag.

The higher the replication lag on a host, the more stale its data becomes. Serving traffic off of a lagging replica leads to poor user experience, as someone may make a change and then not see it reflected. Our automation removes lagging replicas from the serving pool after a few seconds, but even those few seconds matter: we commonly expect sub-second replication lag.

Maintaining low replication lag is challenging. An occasional INSERT or UPDATE is nothing, but we routinely run massive updates to our databases. These could be batched jobs, cleanup tasks, schema changes, schema change followups or otherwise operations that affect large datasets. Such large operations may easily introduce replication lag: while a replica is busy applying a change to some 100,000 rows, its data quickly becomes stale, and by the time it completes processing it’s already lagging and requires even more time to catch up.

Running subtasks

To mitigate replication lag for large operations we use batching. We never apply a change to 100,000 rows all at once. Any big update is broken into small segments, subtasks, of some 50 or 100 rows each.

As an example, say our app needs to purge some rows that satisfy a condition from a very large table. Instead of running a single DELETE FROM my_table WHERE some_condition = 1 we break the query to smaller subqueries, each operating on a different slice of the table. In its purest form, we would get a sequence of queries such as:

DELETE FROM my_table WHERE some_condition = 1 AND (id >= 0 AND id < 100);
DELETE FROM my_table WHERE some_condition = 1 AND (id >= 100 AND id < 200);
DELETE FROM my_table WHERE some_condition = 1 AND (id >= 200 AND id < 300);
DELETE FROM my_table WHERE some_condition = 1 AND (id >= 300 AND id < 400);
DELETE FROM my_table WHERE some_condition = 1 AND (id >= 400 AND id < 500);
...

These smaller queries can each be processed by a replica very quickly, making it available to process more events, some of which may be the normal site’s update traffic or the next segments.

However, the numbers still add up. On a busy hour a heavily loaded replica may still find it too difficult to manage both read traffic and massive changes coming from the replication stream.

Throttling

We recognize that most of the large volume operations come from background jobs, such as archiving or schema migrations. There is no particular user of API requests waiting on those operations to complete. It is fine for these operations to take a little while longer to complete.

In order to apply these large operations, we break them into smaller segments and throttle in between applying those segments. Each segment is small enough and safe to send down the replication stream, but a bunch of those segments can be too much for the replicas to handle. In between each segment we pause and ask: is replication happy and in good shape?

There is no direct mechanism in MySQL to do that. Closest would be semisynchronous replication, but even that doesn’t guarantee replication lag to be caught up nor be within reasonable margins.

It is up to us to be able to identify our relevant production traffic serving replicas and ask: “What is your current lag?”

If lag on all relevant replicas is within good margins (we expect subsecond), we proceed to run the next segment, then ask again. If lag is higher than desired, we throttle: we stall the operation, and keep polling lag until we’re satisfied it is low enough for us.

This flow suggests:

The app needs a way to identify production serving replicas.
The app needs to be able to gather lag information from those replicas.
It needs to do so from whatever hosts it’s running on, and potentially from multiple hosts concurrently.

The old GitHub throttler

Our site runs Ruby on Rails. Over time, we built into our Rails app a series of abstractions to identify the set of replicas that were active in a cluster, ask them their current replication delay value, and determine whether that value was
low enough to continue writing on the master. In its simplest form, the api looks like this:

big_dataset.each_slice do |subset|
  GitHub::Throttler::MyCluster.throttle do
    write(subset)
  end
end

We would routinely inspect HAProxy for the list relevant replicas and populate a special table with that list. The table was made available to the throttler.
When .throttle was called, the throttler pulled the list of replicas and polled replication delay metrics from the MySQL servers.

A common tool part of Percona Toolkit called pt-heartbeat inserts a timestamp each 100ms in the master. That timestamp is replicated along with the rest of the information to the replicas, and as a consequence, the following query returns the replication delay in seconds from each of the replicas.

select unix_timestamp(now(6)) - unix_timestamp(ts) as replication_delay from heartbeat order by ts desc limit 1

The aggregated value of replication delay was the maximum of the different metrics polled from each replica.

The throttler also had local configuration to determine when that aggregated value was low-enough. If it wasn’t, the code block above would sleep for a second and check the replication delay again; if it instead was good, it would run the block, thus writing the subset.

Non ruby throttlers

As we’ve grown, we’ve introduced write workloads outside of our main Rails application, and they’ve needed to be throttled as well.

We routinely archive or purge old data via pt-archiver. This is a Perl script, and fortunately comes with its own implementation for replication lag based throttling. The tool crawls down the topology to find replicating servers, then periodically checks their replication lag.

Last year we introduced gh-ost, our schema migration tool. gh-ost, by definition, runs our most massive operations: it literally rebuilds and rewrites entire tables. Even if not throttled, some of our tables could take hours or days to rebuild. gh-ost is written in Go, and could not use the Ruby throttler implementation nor the Perl implementation. Nor did we wish for it to depend on either, as we created it as a general purpose, standalone solution to be used by the community. gh-ost runs its own throttling mechanism, checking first and foremost the very replica on which it operates, but then also the list of --throttle-control-replicas. gh-ost‘s interactive commands allow us to change the list of throttle control replicas during runtime. We would compute the list dynamically when spawning gh-ost, and update, if needed, during migration.

Then Spokes brought more massive writes. As our infrastructure grew, more and more external Ruby and non-Ruby services began running massive operations on our database.

An operational anti-pattern

What used to work well when we were running exclusively Ruby on Rails code and in smaller scale didn’t work so well as we grew. We increasingly ran into operational issues with our throttling mechanisms.

We were running more and more throttling tasks, many in parallel. We were also provisioning, decommissioning, and refactoring our MySQL fleets. Our traffic grew substantially. We realized our throttling setups had limitations.

Different apps were getting the list of relevant replicas in different ways. While the Ruby throttler always kept an updated list, we’d need to educate pt-archiver and gh-ost about an initial list, duplicating the logic the Ruby throttler would use. And while the Ruby throttler found out in real time about list changes (provisioning, decommissioning servers in production), gh-ost had to be told about such changes, and pt-archiver‘s list was immutable; we’d need to kill it and restart the operation for it to consider a different list. Other apps were mostly trying to operate similarly to the Ruby throttler, but never exactly.

As result, different apps would react differently to ongoing changes. One app would be able to gain the upper hand on another, running its own massive operation while starving the other.

The databases team members would have more complicated playbooks, and would need to run manual commands when changing our rosters. More importantly, the database team had no direct control of the apps. We would be able to cheat the apps into throttling if we wanted to, but it was all-or-nothing: either we would throttle everything or we would throttle nothing. Occasionally we would like to prioritize one operation over another, but had no way to do that.

The Ruby throttler provided great metric collection, but the other tools did not; or didn’t integrate well with the GitHub infrastructure, and we didn’t have visibility into what was being throttled and why.

We were wasting resources. The Ruby throttler would probe the MySQL replicas synchronously and sequentially per server host. Each throttle check by the app would introduce a latency to the operation by merely iterating the MySQL fleet, a wasted effort if no actual lag was found. It invoked the check for each request, which implied dozens of calls per second. That many duplicate calls were wasteful. As result we would see dozens or even hundreds of stale connections on our replica servers made by the throttler from various endpoints, either querying for lag or sleeping.

Introducing freno

We built freno, GitHub’s central throttling service, to replace all our existing throttling mechanisms and solve the operational issues and limitations described above.

freno is Spanish for brake, as in car brakes. The name throttled was already taken and freno was just the next most sensible name.

In its very basic essence, freno runs as a standalone service that understands replication lag status and can make recommendations to inquiring apps. However, let’s consider some of its key design and operational principles:

Continuous polls

freno continuously probes and queries the MySQL replicas. It does so independently of any app wanting to issue writes. It does so asynchronously to any app.

Knowledgable

freno continuously updates the list of servers per cluster. Within a 10 second timeframe freno will recognize that servers were removed or added. In our infrastructure, freno polls our GLB servers (seen as HAProxy) to get the roster of production replicas.

An app doesn’t need to (and actually just doesn’t) know the identity of backend servers, their count or location. It just needs to know the cluster it wants to write to.

Highly available

Multiple freno services form a raft cluster. Depending on cluster size some boxes can go down and still the freno service would be up, with a leader to serve traffic. Our proxies direct traffic to the freno leader (we have highly available proxies setup, but one may also detect the leader’s identity directly).

Cooperative

freno use is voluntary. It is not a proxy to the MySQL servers, and it has no power over the applications. It provides recommendations to applications that are interested in replication lag status. Those applications are expected to cooperate with freno‘s recommendations.

An app issues a HEAD request to freno, and gets a 200 OK when it is clear to write, or a different code when it should throttle.

Throttle control

We are able to throttle specific apps, for a predefined duration and to a predefined degree. For example, an engineer can issue the following in chat:

shlomi-noach

.freno throttle gh-ost ttl 120 ratio 0.5

Hubot

gh-ost: ratio 0.5, expires at 2017-08-20T08:01:34-07:00

We may altogether refuse an app’s requests or only let it operate in low volumes.

Metrics

freno records all requests by all apps. We are able to see which app requested writes to which cluster and when. We know when it was granted access and when it was throttled.

Visibility

We are able to query freno and tell which metric hasn’t been healthy in the past 10 minutes, or who requested writes to a particular cluster, or what is being forcefully throttled, or what would be the response right now for a pt-archiver request on a cluster.

miguelff

.freno sup

Hubot

All existing metrics:  
  mysql/clusterA lag: 0.095545s
  mysql/clusterB lag: 0.070422s
  mysql/clusterC lag: 0.04162s
  mysql/clusterD lag: 0.39256s
  ...

shlomi-noach

.freno health 300

Hubot

all metrics were reported healthy in past 300 seconds

miguelff

.freno check clusterA as pt-archiver

Hubot

{
 "StatusCode": 200,
 "Value": 0.085648,
 "Threshold": 0.8,
 "Message": ""
}

freno brought a unified, auditable and controllable method for throttling our app writes. There is now a single, highly available, entity querying our MySQL servers.

Querying freno is as simple as running:

$ curl -s -I http://my.freno.com:9777/check/myscript/mysql/clusterB

where the myscript app requests access to the clusterB cluster. This makes it easy to use by any client.

Open source

freno has been open source since early in its development and is available under the MIT license.

While the simplest client may just run a HTTP GET request, we’ve also made available more elaborate clients:

freno-client is our Ruby client for freno, which we use at GitHub.
A Perl module to be used as plugin for pt-archiver, see also this doc page.
gh-ost is freno aware. Pass --throttle-http=, see doc page.

freno was built for monitoring MySQL replication, but may be extended to collect and serve other metrics.

And then things took a turn

We wanted freno to serve those massive write operation jobs, typically initiated by background processes. However, we realized we could also use it to reduce read load from our masters.

Master reads

Reads from the master should generally be avoided, as they don’t scale well. There is only one master and it can only serve so many reads. Reads from the master are typically due to the consistent-read problem: a change has been made, and needs to be immediately visible in the next read. If the read goes to a replica, it may hit the replica too soon, before the change was replayed there.

There are various ways to solve the consistent-read problem, that include blocking on writes or blocking on reads. At GitHub, we have a peculiar common flow that can take advantage of freno.

Before freno, web and API GET requests were routed to the replicas only if the last write happened more than five seconds ago. This pseudo-arbitrary number had some sense: If the write was five seconds ago or more, we can safely read from the replica, because if the replica is lagging above five seconds, it means that there are worse problems to handle than a read inconsistency, like database availability being at risk.

When we introduced freno, we started using the information it knows about replication lag across the cluster to address this. Now, upon a GET request after a write, the app asks freno for the maximum replication lag across the cluster, and if the reported value is below the elapsed time since the last write (up to some granularity), the read is known to be safe and can be routed to a replica.

By applying that strategy, we managed to route to the replicas ~30% of the requests that before were routed to the master. As a consequence, the number of selects and threads connected on the master was reduced considerably, leaving the master with more free capacity:

We applied similar strategies to other parts of the system: Another example is search indexing jobs.

We index in Elasticsearch when a certain domain object changes. A change is a write operation, and as indexing happens asynchronously, we sometimes need to re-hydrate data from the database.

The time it takes to process a given job since the last write is in the order of a hundred milliseconds. As replication lag is usually above that value, we were also reading from the master within indexing jobs to have a consistent view of the data that was written. This job was responsible for another 11% of the reads happening on the master.

To reduce reads on the master from indexing jobs, we used the replication delay reported by freno to delay the execution of each indexing job until the data has been replicated. To do it, we store in the job payload the timestamp at which the write operation that triggered the job occurred, and based on the replication delay reported by freno, we wait until we are sure the data was replicated. This happens in less than 600ms 95% of the time.

The above two scenarios account for ~800 rps to freno, but replication delay cannot grow faster than the clock, and we used this fact to optimize access and let freno scale to growing usage demands. We implemented client side caching over memcache, using the same replication delay values for 20ms, and adding a cushioning time to compensate both freno sampling rates and the caching TTL. This way, we capped to 50rps to freno from the main application.

Scaling freno

Seeing that freno now serves production traffic, not just backend processes, we are looking into its serving capacity. At this time a single freno process is well capable of serving what requests we’re sending its way, with much room to spare. If we need to scale it, we can bootstrap and run multiple freno clusters: either used by different apps and over different clusters (aka sharding), or just for sharing read load, much like we add replicas to share query read load. The replicas themselves can tolerate many freno clients, if needed. Recall that we went from dozens/hundreds of throttler connections to a single freno connection; there’s room to grow.

Motivated by the client-side caching applied in the web application, we now can tell freno to write results to memcached, which can be used to decouple freno’s probing load from the app’s requests load.

Looking into the future

freno & freno-client make it easy for apps to add throttling to massive operations. However, the engineer still needs to be aware that their operation should use throttling, and it takes a programmatic change to the code to call upon throttling. We’re looking into intelligently identifying queries: the engineer would run “normal” query code, and an adapter or proxy would identify whether the query requires throttling, and how so.

Authors

Miguel Fernández
Senior Platform Engineer

Shlomi Noach
Senior Infrastructure Engineer

The post Mitigating replication lag and reducing read load with freno appeared first on The GitHub Blog.

MySQL infrastructure testing automation at GitHub

GitHub Engineering — Thu, 06 Jul 2017 07:00:00 +0000

Our MySQL infrastructure is a critical component to GitHub. MySQL serves GitHub.com, GitHub’s API, authentication and more. Every git request touches MySQL in some way. We are tasked with keeping the data available, and maintaining its integrity. Even while our MySQL clusters serve traffic, we need to be able to perform tasks such as heavy duty cleanups, ad-hoc updates, online schema migrations, cluster topology refactoring, pooling and load balancing and more. We have the infrastructure to automate away such operations; in this post we share a few examples of how we build trust in our infrastructure through continuous testing. It is essentially how we sleep well at night.

Backups

It is incredibly important to take backups of your data. If you are not taking backups of your database, it is likely a matter of time before this will become an issue. Percona Xtrabackup is the tool we have been using for issuing full backups for our MySQL databases. If there is data that we need to be certain is saved, we have a server that is backing up the data.

In addition to the full binary backups, we run logical backups several times a day. These backups allow our engineers to get a copy of recent data. There are times that they would like a complete set of data from a table so they can test an index change on a production sized table or see data from a certain point of time. Hubot allows us to restore a backed up table and will ping us when the table is ready to use.

tomkrouper

.mysql backup-list locations

Hubot

+-----------+------------+---------------+---------------------+---------------------+----------------------------------------------+
| Backup ID | Table Name | Donor Host    | Backup Start        | Backup End          | File Name                                    |
+-----------+------------+---------------+---------------------+---------------------+----------------------------------------------+
|   1699494 | locations  | db-mysql-0903 | 2017-07-01 22:09:17 | 2017-07-01 22:09:17 | backup-mycluster-locations-1498593122.sql.gz |
|   1699133 | locations  | db-mysql-0903 | 2017-07-01 16:11:37 | 2017-07-01 16:11:39 | backup-mycluster-locations-1498571521.sql.gz |
|   1698772 | locations  | db-mysql-0903 | 2017-07-01 10:09:21 | 2017-07-01 10:09:22 | backup-mycluster-locations-1498549921.sql.gz |
|   1698411 | locations  | db-mysql-0903 | 2017-07-01 04:12:32 | 2017-07-01 04:12:32 | backup-mycluster-locations-1498528321.sql.gz |
|   1698050 | locations  | db-mysql-0903 | 2017-06-30 22:18:23 | 2017-06-30 22:18:23 | backup-mycluster-locations-1498506721.sql.gz |
| ...
|   1262253 | locations  | db-mysql-0088 | 2016-08-01 01:58:51 | 2016-08-01 01:58:54 | backup-mycluster-locations-1470034801.sql.gz |
|   1064984 | locations  | db-mysql-0088 | 2016-04-04 13:07:40 | 2016-04-04 13:07:43 | backup-mycluster-locations-1459494001.sql.gz |
+-----------+------------+---------------+---------------------+---------------------+----------------------------------------------+

tomkrouper

.mysql restore 1699133

Hubot

A restore job has been created for the backup job 1699133. You will be notified in #database-ops when the restore is complete.

Hubot

@tomkrouper: the locations table has been restored as locations_2017_07_01_16_11 in the restores database on db-mysql-0482

The data is loaded onto a non-production database which is accessible to the engineer requesting the restore.

The last way we keep a “backup” of data around is we use delayed replicas. This is less of a backup and more of a safeguard. For each production cluster we have a host that has replication delayed by 4 hours. If a query is run that shouldn’t have, we can run mysql panic in chatops. This will cause all of our delayed replicas to stop replication immediately. This will also page the on-call DBA. From there we can use delayed replica to verify there is an issue, and then fast forward the binary logs to the point right before the error. We can then restore this data to the master, thus recovering data to that point.

Backups are great, however they are worthless if some unknown or uncaught error occurs corrupting the backup. A benefit of having a script to restore backups is it allows us to automate the verification of backups via cron. We have set up a dedicated host for each cluster that runs a restore of the latest backup. This ensures that the backup ran correctly and that we are able to retrieve the data from the backup.

Depending on dataset size, we run several restores per day. Restored servers are expected to join the replication stream and to be able to catch up with replication. This tests not only that we took a restorable backup, but also that we correctly identified the point in time at which it was taken and can further apply changes from that point in time. We are alerted if anything goes wrong in the restore process.

We furthermore track the time the restore takes, so we have a good idea of how long it will take to build a new replica or restore in cases of emergency.

The following is an output from an automated restore process, written by Hubot in our robots chat room.

Hubot

gh-mysql-backup-restore: db-mysql-0752: restore_log.id = 4447

gh-mysql-backup-restore: db-mysql-0752: Determining backup to restore for cluster ‘prodcluster’.

gh-mysql-backup-restore: db-mysql-0752: Enabling maintenance mode

gh-mysql-backup-restore: db-mysql-0752: Setting orchestrator downtime

gh-mysql-backup-restore: db-mysql-0752: Disabling Puppet

gh-mysql-backup-restore: db-mysql-0752: Stopping MySQL

gh-mysql-backup-restore: db-mysql-0752: Removing MySQL files

gh-mysql-backup-restore: db-mysql-0752: Running gh-xtrabackup-restore

gh-mysql-backup-restore: db-mysql-0752: Restore file: xtrabackup-notify-2017-07-02_0000.xbstream

gh-mysql-backup-restore: db-mysql-0752: Running gh-xtrabackup-prepare

gh-mysql-backup-restore: db-mysql-0752: Starting MySQL

gh-mysql-backup-restore: db-mysql-0752: Update file ownership

gh-mysql-backup-restore: db-mysql-0752: Upgrade MySQL

gh-mysql-backup-restore: db-mysql-0752: Stopping MySQL

gh-mysql-backup-restore: db-mysql-0752: Starting MySQL

gh-mysql-backup-restore: db-mysql-0752: Backup Host: db-mysql-0034

gh-mysql-backup-restore: db-mysql-0752: Setting up replication

gh-mysql-backup-restore: db-mysql-0752: Starting replication

gh-mysql-backup-restore: db-mysql-0752: Replication catch-up

gh-mysql-backup-restore: db-mysql-0752: Restore complete (replication running)

gh-mysql-backup-restore: db-mysql-0752: Enabling Puppet

gh-mysql-backup-restore: db-mysql-0752: Disabling maintenance mode

gh-mysql-backup-restore: db-mysql-0752: Setting orchestrator downtime

gh-mysql-backup-restore: db-mysql-0752: Restore process complete.

One thing we use backups for is adding a new replica to an existing set of MySQL servers. We will initiate the build of a new server, and once we are notified it is ready, we can start a restore of the latest backup for that particular cluster. We have a script in place that runs all of the restore commands that we would otherwise have to do by hand. Our automated restore system essentially uses the same script. This simplifies the system build process and allows us to have a host up and running with a handful of chat commands opposed to dozens of manual processes. Shown below is a restore kicked manually in chat:

jessbreckenridge

.mysql backup-restore -H db-mysql-0007 -o -r magic_word=daily_rotating_word

Hubot

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Determining backup to restore for cluster ‘mycluster’.

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: restore_log.id = 4449

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Enabling maintenance mode

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Setting orchestrator downtime

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Disabling Puppet

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Stopping MySQL

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Removing MySQL files

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Running gh-xtrabackup-restore

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Restore file: xtrabackup-mycluster-2017-07-02_0015.xbstream

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Running gh-xtrabackup-prepare

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Update file ownership

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Starting MySQL

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Upgrade MySQL

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Stopping MySQL

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Starting MySQL

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Setting up replication

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Starting replication

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Backup Host: db-mysql-0201

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Replication catch-up

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Replication behind by 4589 seconds, waiting 1800 seconds before next check.

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Restore complete (replication running)

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Enabling puppet

@jessbreckenridge gh-mysql-backup-restore: db-mysql-0007: Disabling maintenance mode

Failovers

We use orchestrator to perform automated failovers for masters and intermediate masters. We expect orchestrator to correctly detect master failure, designate a replica for promotion, heal the topology under said designated replica, make the promotion. We expect VIPs to change, pools to change, clients to reconnect, puppet to run essential components on promoted master, and more. A failover is a complex task that touches many aspects of our infrastructure.

To build trust in our failovers we set up a production-like, test cluster, and we continuously crash it to observe failovers.

The production-like cluster is a replication setup that is identical in all aspects to our production clusters: types of hardware, operating systems, MySQL versions, network environments, VIP, puppet configurations, haproxy setup, etc. The only thing different to this cluster is that it doesn’t send/receive production traffic.

We emulate a write load on the test cluster, while avoiding replication lag. The write load is not too heavy, but has queries that are intentionally contending to write on same datasets. This isn’t too interesting in normal times, but proves to be useful upon failovers, as we will shortly describe.

Our test cluster has representative servers from three data centers. We would like the failover to promote a replacement replica from within the same data center. We would like to be able to salvage as many replicas as possible under such constraint. We require that both apply whenever possible. orchestrator has no prior assumption on the topology; it must react on whatever the state was at time of the crash.

We, however, are interested in creating complex and varying scenarios for failovers. Our failover testing script prepares the grounds for the failover:

It identifies existing master
It refactors the topology to have representatives of all three data centers under the master. Different DCs have different network latencies and are expected to react in different timing to master’s crash.
It chooses a crash method. We choose from shooting the master (kill -9) or network partitioning it: iptables -j REJECT (nice-ish) or iptables -j DROP (unresponsive).

The script proceeds to crash the master by chosen method, and waits for orchestrator to reliably detect the crash and to perform failover. While we expect detection and promotion to both complete within 30 seconds, the script relaxes this expectation a bit, and sleeps for a designated time before looking into failover results. It will then:

Check that a new (different) master is in place
There is a good number of replicas in the cluster
The master is writable
Writes to the master are visible on the replicas
Internal service discovery entries are updated (identity of new master is as expected; old master removed)
Other internal checks

These tests confirm that the failover was successful, not only MySQL-wise but also on our larger infrastructure scope. A VIP has been assumed; specific services have been started; information got to where it was supposed to go.

The script further proceeds to restore the failed server:

Restoring it from backup, thereby implicitly testing our backup/restore procedure
Verifying server configuration is as expected (the server no longer believes it’s the master)
Returning it to the replication cluster, expecting to find data written on the master

Consider the following visualization of a scheduled failover test: from having a well-running cluster, to seeing problems on some replicas, to diagnosing the master (7136) is dead, to choosing a server to promote (a79d), refactoring the topology below that server, to promoting it (failover successful), to restoring the dead master and placing it back into the cluster.

What would a test failure look like?

Our testing script uses a stop-the-world approach. A single failure in any of the failover components fails the entire test, disabling any future automated tests until a human resolves the matter. We get alerted and proceed to check the status and logs.

The script would fail on an unacceptable detection or failover time; on backup/restore issues; on losing too many servers; on unexpected configuration following the failover; etc.

We need to be certain orchestrator connects the servers correctly. This is where the contending write load comes useful: if set up incorrectly, replication is easily susceptible to break. We would get DUPLICATE KEY or other errors to suggest something went wrong.

This is particularly important as we make improvements and introduce new behavior to orchestrator, and allows us to test such changes in a safe environment.

Coming up: chaos testing

The testing procedure illustrated above will catch (and has caught) problems on many parts of our infrastructure. Is it enough?

In a production environment there’s always something else. Something about the particular test method that won’t apply to our production clusters. They don’t share the same traffic and traffic manipulation, nor the exact same set of servers. The types of failure can vary.

We are designing chaos testing for our production clusters. Chaos testing would literally destroy pieces in our production, but on expected schedule and under sufficiently controlled manner. Chaos testing introduces a higher level of trust in the recovery mechanism and affects (thus tests) larger parts of our infrastructure and application.

This is delicate work: while we acknowledge the need for chaos testing, we also wish to avoid unnecessary impact to our service. Different tests will differ in risk level and impact, and we will work to ensure availability of our service.

Schema migrations

We use gh-ost to run live schema migrations. gh-ost is stable, but also under active developments, with major new features being added or planned.

gh-ost migrates tables by copying data onto a ghost table, applying ongoing changes intercepted by the binary logs onto the ghost table, even as the original table is being written to. It then swaps the ghost table in place of the original table. At migration completion GitHub proceeds to work with a table generated and populated by gh-ost.

At this time almost all of GitHub’s MySQL data has been recreated by gh-ost, and most of it multiple times. We must have high trust in gh-ost to let it tamper with our data over and over again, even in face of active development. Here’s how we gain this trust.

gh-ost provides a testing-in-production capability. It supports running a migration on a replica, in much the same way as it would run on the master: gh-ost would connect to the replica and treat it as if it were the master. It would parse its binary logs the same way it would for a real master migration. However it would copy rows and apply binlog events to the replica, and avoid making writes onto the master.

We run gh-ost-dedicated replicas in production. These replicas do not serve production traffic. Each such replica retrieves the current list of production tables and iterates them in random order. One by one it picks a table and performs a replica-migration on that table. The migration doesn’t actually modify table structure, but instead runs a trivial ENGINE=InnoDB. The test runs the migration even as the table is being used in production, thus copying real production data and applying true production traffic off the binary logs.

These migrations can be audited. Here’s how we can inspect status of running tests from chat:

ggunson

.migration test-status

Hubot

# Migrating `prod`.`pull_requests`; Ghost table is `prod`.`_pull_requests_gho`
# Migrating ghost-db-mysql-0007:3306; inspecting ghost-db-mysql-0007:3306; executing on ghost-db-mysql-0007
# Migration started at Mon Jan 30 02:13:39 -0800 2017
# chunk-size: 2500; max-lag-millis: 1500ms; max-load: Threads_running=30; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# panic-flag-file: /tmp/ghost-test-panic.flag
# Serving on unix socket: /tmp/gh-ost.test.sock
Copy: 57992500/86684838 66.9%; Applied: 57708; Backlog: 1/100; Time: 3h28m38s(total), 3h28m36s(copy); streamer: mysql-bin.000576:142993938; State: migrating; ETA: 1h43m12s

When a test migration completes copying of table data it stops replication and performs the cut-over, replacing the original table with the ghost table, and then swaps back. We’re not interested in actually replacing the data. Instead we are left with both the original table and the ghost table, which should both be identical. We verify that by checksumming the entire table data for both tables.

A test can complete with:

success: All went well and checksum is identical. We expect to see this.
failure: Execution problem. This can occasionally happen due to the migration process being killed, a replication issue etc., and is typically unrelated to gh-ost itself.
checksum failure: table data inconsistency. For a tested branch, this call for fixes. For an ongoing master branch test, this would imply immediate blocking of production migrations. We don’t get the latter.

Test results are audited, sent to robot chatrooms, sent as events to our metrics systems. Each vertical line in the following graph represents a successful migration test:

These tests run continuously. We are notified by alerts in case of failures. And of course we can always visit the robots chatroom to know what’s going on.

Testing new versions

We continuously improve gh-ost. Our development flow is based on git branches, which we then offer to merge via pull requests.

A submitted gh-ost pull request goes through Continuous Integration (CI) which runs basic compilation and unit tests. Once past this, the PR is technically eligible for merging, but even more interestingly it is eligible for deployment via Heaven. Being the sensitive component in our infrastructure that it is, we take care to deploy gh-ost branches for intensive testing before merging into master.

shlomi-noach

.deploy gh-ost/fix-reappearing-throttled-reasons to prod/ghost-db-mysql-0007

Hubot

@shlomi-noach is deploying gh-ost/fix-reappearing-throttled-reasons (baee4f6) to production (ghost-db-mysql-0007).

@shlomi-noach’s production deployment of gh-ost/fix-reappearing-throttled-reasons (baee4f6) is done! (2s)

@shlomi-noach, make sure you watch for exceptions in haystack

jonahberquist

.deploy gh-ost/interactive-command-question to prod/ghost-db-mysql-0012

Hubot

@jonahberquist is deploying gh-ost/interactive-command-question (be1ab17) to production (ghost-db-mysql-0012).

@jonahberquist’s production deployment of gh-ost/interactive-command-question (be1ab17) is done! (2s)

@jonahberquist, make sure you watch for exceptions in haystack

shlomi-noach

.wcid gh-ost

Hubot

shlomi-noach testing fix-reappearing-throttled-reasons 41 seconds ago: ghost-db-mysql-0007

jonahberquist testing interactive-command-question 7 seconds ago: ghost-db-mysql-0012

Nobody is in the queue.

Some PRs are small and do not affect the data itself. Changes to status messages, interactive commands etc. are of lesser impact to the gh-ost app. Others pose significant changes to the migration logic and operation. We would tests these rigorously, running through our production tables fleet until satisfied these changes do not pose data corruption threat.

Summary

Throughout testing we build trust in our systems. By automating these tests, in production, we get repetitive confirmation that everything is working as expected. As we continue to develop our infrastructure we also follow up by adapting tests to cover the newest changes.

Production always surprises with scenarios not covered by tests. The more we test on production environment, the more input we get on our app’s expectations and our infrastructure’s capabilities.

Authors

Tom Krouper
Staff Software Engineer

Shlomi Noach
Senior Infrastructure Engineer

The post MySQL infrastructure testing automation at GitHub appeared first on The GitHub Blog.

How Four Native Developers Wrote An Electron App

GitHub Engineering — Tue, 16 May 2017 07:00:00 +0000

Today we released the new GitHub Desktop Beta, rewritten on Electron.

Electron is a well-known on-ramp for web developers to build desktop apps using familiar web technologies: HTML, CSS, and JavaScript. Our situation was different. Everyone on the GitHub Desktop team is a native developer by trade—three from the .NET world and one from Cocoa. We knew how to make native apps, so how and why did we end up here?

Why Rewrite?

First, the elephant in the room: why? Rewrites are rarely a good idea. Why did we decide to walk away from two codebases and rewrite?

From the start, GitHub Desktop for macOS and Windows were two distinct products, each with their own team. We worked in two separate tech stacks using two different skill sets. To maintain parity across the codebases, we had to implement and design the same features twice. If we ever wanted to add Linux support, we’d have to do it all a third time. All this meant we had twice the work, twice the bugs, and far less time to build new features.

As it turns out, building native apps for multiple platforms doesn’t scale.

Why Electron?

This isn’t a new or unique problem. Over the years we explored various ways for evolving the existing applications towards a shared codebase, including Electron, Xamarin, a shared C++, or in our wildest dreams, Haskell.

We’d already experimented with web technologies to share work. The gravitational pull of the web was too strong to resist.

Beyond our own experience, we had to acknowledge the critical mass accumulating around web technologies. Companies like Google, Microsoft, and Facebook, not to mention GitHub, are investing incredible amounts of time, money, and engineering effort in the web as a platform. With Electron, we leverage that investment.

Tradeoffs

The web isn’t a perfect platform, but native apps aren’t built on perfect platforms either. Rewriting on Electron does mean swapping one set of tradeoffs for another.

Now we can share our logic and UI across all platforms, which is fantastic. But Windows and macOS are different and their users have different expectations. We want to meet those expectations as much as possible, while still sharing the same UI. This manifests in a number of ways, some big and some small.

In the small, some button behaviors vary depending on your platform. On macOS, buttons are Title Case. On Windows they are Sentence case. Or on Windows, the default button in dialogs is on the left, where on macOS it’s the right. We enforce the latter convention both at runtime and with a custom lint rule.

On macOS, Electron gives us access to the standard app menu bar, but on Windows, the menu support is less than ideal. The menu is shown in the window frame, which doesn’t work with our frameless window design. The built-in menu’s usability is also rough around the edges and doesn’t support the keyboard accessibility Windows users expect.

We worked hard to recreate a Windows-appropriate menu in web technologies, complete with access keys and appropriate focus states.

There were times when the web platform or Electron didn’t provide us with the APIs we needed. But in contrast to building a web app, building on Electron meant that we weren’t stuck. We could do something about it.

We added support in Electron for importing certificates into the user’s certificate store, using the appropriate platform-specific APIs.

The web’s internationalization support doesn’t provide fine-grained localization information. For example, we can’t format numbers in a locale-aware manner, distinct from the preferred language defined by the user. Similar to adding the import certificate support, we plan to add better localization support to Electron.

Development Experience

If you’re making a native app, your tech stack choices are pretty limited. On the macOS side, you’ll use Xcode, Swift, and AppKit. On Windows, you’ll use Visual Studio, C#, and WPF or UWP. But in the web world, choices abound. React? Angular? CSS? SASS? CSS in JS? Some compile-to-JavaScript language? Browserify? Webpack? Gulp? Grunt? The ecosystem is enormous. The relative openness of the web platform also means developers are able to experiment and innovate, independent from the constraints of any single vendor. This cuts both ways: the array of choices can be overwhelming, but it also means you’re more able to pick the right tool for the job.

We were coming from C#, Objective-C, and Swift where static type systems let the compiler watch our back and help us along the way. For us, the question wasn’t if we’d choose a compile-to-Javascript language but rather which one.

At the same time, one of the big benefits to writing an Electron app is JavaScript itself. It’s the lingua franca of programming. This lowers the barrier of entry for an open source project like ours. So while languages like Elm and PureScript are interesting and would scratch our static types itch, they were too far outside the mainstream for us to consider.

Our best two options were Flow and TypeScript. We landed on TypeScript. At the time we started the project, Flow’s Windows support lagged far behind macOS. For a team like ours where more than half of the engineers live on Windows, this was an instant deal-breaker. Thankfully, TypeScript has been fantastic. Its type system is incredibly expressive, the team at Microsoft moves fast and is responsive to the community, and the community grows every day.

We learned to take the long feedback cycles of native development as a given. Change the code, compile, wait, launch the app, wait, see the change. This doesn’t seem like much, but it adds up. But every minute spent waiting for the compiler to compile or the app to launch is waste. It is time where we could lose our focus, fall out of the flow, and get distracted.

Using web technologies tightens up our feedback cycle. We can tweak designs live, in the app, as it shows real data. Code changes reload in place. Our feedback cycle went from minutes to seconds. It keeps us motivated!

We’ve been working on GitHub Desktop Beta for just over a year. We’re happy with how far we’ve been able to come in that time, but it’s far from done. Check it out, leave us your feedback, and get involved!

Authors

William Shepherd
Application Engineer
GitHub Profile

Josh Abernathy
Engineering Manager
GitHub Profile | Twitter Profile

Markus Olsson
Application Engineer
GitHub Profile | Twitter Profile

Brendan Forster
Application Engineer
GitHub Profile | Twitter Profile

The post How Four Native Developers Wrote An Electron App appeared first on The GitHub Blog.

Integrating Git in Atom

GitHub Engineering — Tue, 16 May 2017 07:00:00 +0000

The Atom team has been working to bring the power of Git and GitHub as close to your cursor as possible. With today’s release of the GitHub package for Atom, you can now perform common Git operations without leaving the editor: stage changes, make commits, create and switch branches, resolve merge conflicts, and more.

In this post, we’ll look at the evolution of how the Atom GitHub package interacts with the .git folder in your project.

Interacting with Git

GitHub is a core contributor to a library called libgit2, which is a reentrant C implementation of Git’s core methods and is used to power the backend of GitHub.com via Ruby bindings. Our initial approach to the development of this new Atom package used Nodegit, a Node module that provides native bindings to libgit2.

Months into development we started to question whether this was the optimal approach for our Atom integration. libgit2 is a powerful library that implements the core data structures and algorithms of the Git version control system, but it intentionally implements only a subset of the system. While it is very effective as a technology that powers the backend of GitHub.com, our use case is sufficiently different and more akin to the Git command-line experience.

Compare what we had to do with Nodegit/libgit2 versus shelling out:

Nodegit/libgit2

Read the current index file
Update the files that have changed
Create a tree with this state
Write the updated index back to disk
Manually run pre-commit hooks
Create the new commit with the tree
Manually sign the commit if necessary
Update the currently active branch to point to the commit
Manually run post-commit hook

Shelling out

Run git commit, the command-line tool made for our exact use case

Shelling out to Git simplifies development, gives us access to the full set of commands, options, and formatting that Git core provides, and enables us to use all of the latest Git features without having to reimplement custom logic or wait for support in libgit2. For these reasons and more, we made the switch.

Reboot to shelling out

We bundled a minimal version of Git for Mac, Windows, and Linux into a package called dugite-native and created a lightweight library called dugite for making Node execFile calls. Bundling Git makes package installation easier for the user and gives us full control over the Git API we are interacting with.

As much as possible, we keep your Git data in Atom in sync with the actual state of your local repo to allow for maximal flexibility. You can partially stage a file in Atom, switch to the command line and find the state of your repo exactly as you’d expect. Additionally, any changes you make outside of Atom will be detected by a file watcher and the Git data in your editor will be refreshed automatically.

Overall, the transition from Nodegit to shelling out went pretty well. However, there were noticeable performance tradeoffs and overhead costs associated with spawning a new process every time we asked for Git data.

Performance concerns and optimizations

Recent Atom releases have delivered numerous performance improvements, and we wanted this new package to demonstrate our continued focus on responsiveness. After core functionality was in place, we introduced a series of optimizations. To inform and measure progress on this front, we created a custom waterfall view to visualize the time spent shelling out to Git; the red section shows the time an operation spent waiting in the queue for its turn to run, while the yellow and green represent the time the operation took to actually execute.

Here’s what it looked like before and after we parallelized read operations based on the number of cores on a user’s computer:

We also noticed that for larger repos we would get file-watching update events in several batches, each causing a model update to be scheduled. A merge with conflicts in github/github, the GitHub.com codebase, would queue up 12 updates. To address this we redesigned our ModelObserver to never schedule more than a single pending fetch if new update requests come in while a fetch is in progress, preventing ModelObserver update backlogs.

Aggressive caching and selectively invalidating cached repository state reduced the number of times we shell out to Git so that we avoid the performance penalty of launching a new process:

Even though we spawn subprocesses asynchronously, there is still a small synchronous overhead to shelling out to Git. Normally, this is no more than a couple milliseconds. On rare occasions, however, the application would get into a strange state, and this time would begin to grow; this overhead is represented in the waterfall views above by the yellow sections. The additional time spent in synchronous code would block the UI thread long enough to degrade the user experience. The issue would persist until the Atom window was refreshed.

After investigating the root cause of this issue, we realized that a proper fix for it would have involved changing Node or libuv. Since our launch date was looming on the horizon, we needed a more immediate solution and made the decision to work around the problem by making Git calls in a separate process. This would keep the main thread free and prevent locking the UI when this issue arises.

Shelling out in a dedicated side process

Our first approach used forked Node processes, but benchmarking revealed that IPC time grows quadratically relative to message size, which could become an issue when reading large diffs from stdout. This issue seems to be fixed in future versions of Node, but again, time was of the essence and we couldn’t afford to wait. Thankfully, IPC times using Electron renderer processes were much more reasonable, so our short term solution involved using a dedicated renderer process to run Git commands.

We introduced a WorkerManager which creates a Worker that wraps a RendererProcess which shells out to Git and sends results back over IPC. If the renderer process is not yet ready, we fall back to shelling out in process. We track a running average of the time it takes to make a spawn call and if this exceeds a specified threshold, the WorkerManager creates a new Worker and routes all new Git data requests to it. With this approach, if the long spawn call issue manifests, users will experience no freezing due to a blocked main thread. At worst, they may experience slower UI updates, but once a new renderer process is up the spawn times should drop back down and normal responsiveness should be restored.

As with most decisions, there are tradeoffs. Here we prevent indefinite locking of the UI, but there is now extra time spent in IPC and overhead costs associated with creating new Electron renderer processes. In the timeline below, the pink represents the IPC time associated with each Git command.

Once we upgrade Atom to Electron v1.6.x in the next release cycle, we’ll be able to re-implement this system using Chromium Web Workers with Node integration. Using the SharedArrayBuffer object, we can read shared memory and bypass IPC, cutting down overall operation time. And using native Web Workers rather than Electron Renderer Processes will reduce the overhead associated with these side processes and save on computing resources for shelling out to Git.

Continuing the vision for Atom

In addition to more performance improvements, you can look forward to more Git features, UI/UX improvements, and more comprehensive and in-depth GitHub integration.

As developers, much of our work is powered by a few key tools that enable us to write software and collaborate. We spend most of our days in our editors, periodically pausing to take version control snapshots of our code, and soliciting input and feedback from our colleagues. It’s every developer’s dream to be able to do all of these things with minimal friction and maximal ease. With these new integrations the Atom team is working to make those dreams more of a reality.

Want to help the Atom team make developers’ lives easier? We’d love for you to join us. Keep an eye out for a job posting coming soon!

Authors

Katrina Uychaco
Application Engineer, Atom
GitHub Profile | Twitter Profile

Michelle Tilley
Application Engineer, Atom
GitHub Profile | Twitter Profile

Ash Wilson
Application Engineer, Atom
GitHub Profile | Twitter Profile

The post Integrating Git in Atom appeared first on The GitHub Blog.

A formal spec for GitHub Flavored Markdown

GitHub Engineering — Tue, 14 Mar 2017 07:00:00 +0000

We are glad we chose Markdown as the markup language for user content at GitHub. It provides a powerful yet straightforward way for users (both technical and non-technical) to write plain text documents that can be rendered richly as HTML.

Its main limitation, however, is the lack of standardization on the most ambiguous details of the language. Things like how many spaces are needed to indent a line, how many empty lines you need to break between different elements, and a plethora of other trivial corner cases change between implementations: very similar looking Markdown documents can be rendered as wildly different outputs depending on your Markdown parser of choice.

Five years ago, we started building GitHub’s custom version of Markdown, GFM (GitHub Flavored Markdown) on top of Sundown, a parser which we specifically developed to solve some of the shortcomings of the existing Markdown parsers at the time.

Today we’re hoping to improve on this situation by releasing a formal specification of the syntax for GitHub Flavored Markdown, and its corresponding reference implementation.

This formal specification is based on CommonMark, an ambitious project to formally specify the Markdown syntax used by many websites on the internet in a way that reflects its real world usage. CommonMark allows people to continue using Markdown the same way they always have, while offering developers a comprehensive specification and reference implementations to interoperate and display Markdown in a consistent way between platforms.

The Specification

Taking the CommonMark spec and re-engineering our current user content stack around it is not a trivial endeavour. The main issue we struggled with is that the spec (and hence its reference implementations) focuses strictly on the common subset of Markdown that is supported by the original Perl implementation. This does not include some of the extended features that have been always available on GitHub. Most notably, support for tables, strikethrough, autolinks and task lists are missing.

In order to fully specify the version of Markdown we use at GitHub (known as GFM), we had to formally define the syntax and semantics of these features, something which we had never done before. We did this on top of the existing CommonMark spec, taking special care to ensure that our extensions are a strict and optional superset of the original specification.

When reviewing the GFM spec, you can clearly tell which parts are GFM-specific additions because they’re highlighted as such. You can also tell that no parts of the original spec have been modified and therefore should remain fully compliant with all other implementations.

The Implementation

To ensure that the rendered Markdown in our website is fully compliant with the CommonMark spec, the new backend implementation for GFM parsing on GitHub is based on cmark, the reference implementation for CommonMark developed by John MacFarlane and many other fantastic contributors.

Just like the spec itself, cmark focuses on parsing a strict subset of Markdown, so we had to also implement support for parsing GitHub’s custom
extensions on top of the existing parser. You can find these changes on our fork of cmark; in order to track the always-improving upstream project, we continuously rebase our patches on top of the upstream master. Our hope is that once a formal specification for these extensions is settled, this patchset can be used as a base to upstream the changes in the original project.

Besides implementing the GFM-specific features in our fork of cmark, we’ve also contributed many changes of general interest to the upstream. The vast majority of these contributions are focused around performance and security. Our backend renders a massive volume of Markdown documents every day, so our main concern lies in ensuring we’re doing these operations as efficiently as possible, and making sure that it’s not possible to abuse malicious Markdown documents to attack our servers.

The first Markdown parsers in C had a terrible security history: it was feasible to cause stack overflows (and sometimes even arbitrary code execution) simply by nesting particular Markdown elements sufficiently deep. The cmark implementation, just like our earlier parser Sundown, has been designed from scratch to be resistant to these attacks. The parsing algorithms and its AST-based output are thought out to gracefully handle deep recursion and other malicious document formatting.

The performance side of cmark is a tad more rough: we’ve contributed many optimizations upstream based on performance tricks we learnt while implementing Sundown, but despite all these changes, the current version of cmark is still not faster than Sundown itself: Our benchmarks show it to be between 20% to 30% slower on most documents.

The old optimization adage that “the fastest code is the code that doesn’t run” applies here: the fact is that cmark just does more things than Sundown ever did. Amongst other functionality, cmark is UTF8 aware, has better support for references, cleaner interfaces for extension, and most importantly: it doesn’t translate Markdown into HTML, like Sundown did. It actually generates an AST (Abstract Syntax Tree) out of the source Markdown, which we can transform and eventually render into HTML.

If you consider the amount of HTML parsing that we had to do with Sundown’s original implementation (particularly regarding finding user mentions and issue references in the documents, inserting task lists, etc), cmark‘s AST-based approach saves us a tremendous amount of time and complexity in our user content stack. The Markdown AST is an incredibly powerful tool, and well worth the performance cost that cmark pays to generate it.

The Migration

Changing our user content stack to be CommonMark compliant is not as simple as switching the library we use to parse Markdown: the fundamental roadblock we encountered here is that the corner cases that CommonMark specifies (and that the original Markdown documentation left ambiguous) could cause some old Markdown content to render in unexpected ways.

Through synthetic analysis of GitHub’s massive Markdown corpus, we determined that less than 1% of the existing user content would be affected by the new implementation: we gathered these stats by rendering a large set of Markdown documents with both the old (Sundown) and the new (cmark, CommonMark compliant) libraries, normalizing the resulting HTML, and diffing their trees.

1% of documents with minor rendering issues seems like a reasonable tradeoff to swap in a new implementation and reap its benefits, but at GitHub’s scale, 1% is a lot of content, and a lot of affected users. We really don’t want anybody to check back on an old issue and see that a table that was previously rendering as HTML now shows as ASCII — that is bad user experience, even though obviously none of the original content was lost.

Because of this, we came up with ways to soften the transition. The first thing we did was gathering separate statistics on the two different kinds of Markdown user content we host on the website: comments by the users (such as in Gists, issues, Pull Requests, etc), and Markdown documents inside the Git repositories.

There is a fundamental difference between these two kinds of content: the user comments are stored in our databases, which means their Markdown syntax can be normalized (e.g. by adding or removing whitespace, fixing the indentation, or inserting missing Markdown specifiers until they render properly). The Markdown documents stored in Git repositories, however, cannot be touched at all, as their contents are hashed as part of Git’s storage model.

Fortunately, we discovered that the vast majority of user content that was using complex Markdown features were user comments (particularly Issue bodies and Pull Request bodies), while the documents stored in Git repositories were rendering properly with both the old and the new renderer in the overwhelming majority of cases.

With this in mind, we proceeded to normalize the syntax of the existing user comments, as to make them render identically in both the old and the new implementations.

Our approach to translation was rather pragmatic: Our old Markdown parser, Sundown, has always acted as a translator more than a parser. Markdown content is fed in, and a set of semantic callbacks convert the original Markdown document into the corresponding markup for the target language (in our use case, this was always HTML5). Based on this design approach, we decided to use the semantic callbacks to make Sundown translate from Markdown to CommonMark-compliant Markdown, instead of HTML.

More than translation, this was effectively a normalization pass, which we had high confidence in because it was performed by the same parser we’ve been using for the past 5 years, and hence all the existing documents should be parsed cleanly while keeping their original semantic meaning.

Once we updated Sundown to normalize input documents and sufficiently tested it, we were ready to start the transition process. The first step of the process was flipping the switch on the new cmark implementation for all new user content, as to ensure that we had a finite cut-off point to finish the transition at. We actually enabled CommonMark for all new user comments in the website several months ago, with barely anybody noticing — this is a testament to the CommonMark team’s fantastic job at formally specifying the Markdown language in a way that is representative of its real world usage.

In the background, we started a MySQL transition to update in-place the contents of all Markdown user content. After running each comment through the normalization process, and before writing it back to the database, we’d render it with the new implementation and compare the tree to the previous implementation, as to ensure that the resulting HTML output was visually identical and that user data was never destroyed in any circumstances. All in all, less than 1% of the input documents were modified by the normalization process, matching our expectations and again proving that the CommonMark spec really represents the real-world usage of the language.

The whole process took several days, and the end result was that all the Markdown user content on the website was updated to conform to the new Markdown standard while ensuring that the final rendered output was visually identical to our users.

The Conclusion

Starting today, we’ve also enabled CommonMark rendering for all the Markdown content stored in Git repositories. As explained earlier, no normalization has been performed on the existing documents, as we expect the overwhelming majority of them to render just fine.

We are really excited to have all the Markdown content in GitHub conform to a live and pragmatic standard, and to be able to provide our users with a clear and authoritative reference on how GFM is parsed and rendered.

We also remain committed to following the CommonMark specification as it irons out any last bugs before a final point release. We hope GitHub.com will be fully conformant to the 1.0 spec as soon as it is released.

To wrap up, here are some useful links for those willing to learn more about CommonMark or implement it on their own applications:

The CommonMark website, with information on the project.
The CommonMark discussion forum, where questions and changes to the specification can be proposed.
The CommonMark specification
The reference C Implementation
Our fork with support for all GFM extensions
The GFM specification, based on the original spec.
A list of CommonMark implementations in many programming languages

Author

Vicent Martí
Systems Engineer
GitHub Profile | Twitter Profile

Ashe Connor
Senior Systems Engineer
GitHub Profile | Twitter Profile

The post A formal spec for GitHub Flavored Markdown appeared first on The GitHub Blog.

How we made diff pages three times faster

GitHub Engineering — Tue, 06 Dec 2016 08:00:00 +0000

We serve a lot of diffs here at GitHub. Because it is computationally expensive to generate and display a diff, we’ve traditionally had to apply
some very conservative limits on what gets loaded. We knew we could do better, and we set out to do so.

Historical approach and problems

Before this change, we fetched diffs by asking Git for the diff between two commit objects. We would then parse the output, checking it against the various limits we had in place. At the time they were as follows:

Up to 300 files in total.
Up to 100KB of diff text per file.
Up to 1MB of diff text overall.
Up to 3,000 lines of diff text per file.
Up to 20,000 lines of diff text overall.
An overall RPC timeout of up to eight seconds, though in some places it would
be adjusted to fit within the remaining time allotted to the request.

These limits were in place to both prevent excessive load on the file servers, as well as prevent the browser’s DOM from growing too large and making the web page less responsive.

In practice, our limits did a pretty good job of protecting our servers and users’ web browsers from being overloaded. But because these limits were applied in the order Git handed us back the diff text, it was possible for a diff to be truncated before we reached the interesting parts. Unfortunately, users had to fall back to command-line tools to see their changes in these cases.

Finally, we had timeouts happening far more frequently than we liked. Regardless of the size of the requested diff, we shouldn’t force the user to wait up to eight seconds before responding, and even then occasionally with an error message.

Our Goals

Our main goal was to improve the user experience around (re)viewing diffs on GitHub:

Allow users to (re)view the changes that matter, rather than just whatever
appears before the diff is truncated.
Reduce request timeouts due to very large diffs.
Pave the way for previously inaccessible optimizations (e.g. avoid loading
suppressed diffs).
Reduce unnecessary load on GitHub’s storage infrastructure.
Improve accuracy of diff statistics.

A new approach

To achieve the aforementioned goals, we had to come up with a new and better approach to handling large diffs. We wanted a solution that would allow us to get a high-level overview of all changes in a diff, and then load the patch texts for the individual changed files “progressively”. These discrete sections could later be assembled by the user’s browser.

But to achieve this without disrupting the user experience, our new solution also needed to be flexible enough to load and display diffs identically to how we were doing it in production to date. We wanted to verify accuracy and monitor any performance impact by running the old and new diff loading strategies in production, side-by-side, before changing to the new progressive loading strategy.

Lucky for us, Git provides an excellent plumbing command called git-diff-tree.

Diff “table of contents” with `git-diff-tree`

git-diff-tree is a low-level (plumbing) git command that can be used to compare the contents of two tree objects and output the comparison result in different ways.

The default output format is --raw, which prints a list of changed files:

> git-diff-tree --raw -r --find-renames HEAD~ HEAD
:100644 100644 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 441624ae5d2a2cd192aab3ad25d3772e428d4926 M  fileA
:100644 100644 5716ca5987cbf97d6bb54920bea6adde242d87e6 4ea306ce50a800061eaa6cd1654968900911e891 M  fileB
:100644 100644 7c4ede99d4fefc414a3f7d21ecaba1cbad40076b fb3f68e3ca24b2daf1a0575d08cd6fe993c3f287 M  fileC

Using git-diff-tree --raw we could determine what changed at a high level very quickly, without the overhead of generating patch text. We could then later paginate through this list of changes, or “deltas”, and load the exact patch data for each “page” by specifying a subset of the deltas’ paths to git-diff-tree --patch.

To better understand the obvious performance overhead of calling two git commands instead of one, and to ensure that we wouldn’t cause any regressions in the returned data, we initially focused on generating the same output as a plain call to git-diff-tree --patch, by calling git-diff-tree --raw and then feeding all returned paths back into git-diff-tree --patch.

We started a Scientist experiment which ran both algorithms in parallel, comparing accuracy and timing. This gave us detailed information on cases where results were not as expected, and allowed us to keep an eye on performance.

As expected, our new algorithm, which was replacing something that hadn’t been materially refactored in years, had many mismatches and performance was worse than before.

Most of the issues that we found were simply unexpected behaviors of the old code under certain conditions. We meticulously emulated these corner cases, until we were left only with mismatches related to rename detection in git diff.

Fetching diff text with `git-diff-pairs`

Loading the patch text from a set of deltas sounds like it should have been a pretty straightforward operation. We had the list of paths that changed, and just needed to look up the patch texts for these paths. What could possibly go wrong?

In our first attempt we loaded the diffs by passing the first 300 paths from our deltas to git-diff-tree --patch. This emulated our existing behaviour, – and we unexpectedly ran into rare mismatches. Curiously, these mismatches were all related to renames, but only when multiple files containing the same or very similar contents got renamed in the same diff.

This happened because rename detection in git is based on the contents of the tree that it is operating on, and by looking at only a subset of the original tree, git’s rename detection was failing to match renames as expected.

To preserve the rename associations from the initial git-diff-tree --raw run, @peff added a git-diff-pairs command to our fork of
Git. Provided a set of blob object IDs (provided by the deltas) it returns the corresponding diff text, exactly what we needed.

On a high level, the process for generating a diff in Git is as follows:

Do a tree-wide diff, generating modified pairs, or added/deleted paths (which
are just considered pairs with a null before/after state).
Run various algorithms on the whole set of pairs, like rename detection. This
is just linking up adds and deletes of similar content.
For each pair, output it in the appropriate format (we’re interested in
--patch, obviously).

git-diff-pairs lets you take the output from step 2, and feed it individually into step 3.

With this new function in place, we were finally able to get our performance and accuracy to a point where we could transparently switch to this new diff method without negative user impact.

If you’re interested in viewing or contributing to the source for git-diff-pairs we submitted it upstream here.

Change statistics with `git-diff-tree --numstat --shortstat`

GitHub displays line change statistics for both the entire diff and each delta. Generating the line change statistics for a diff can be a very
costly operation, depending on the size and contents of the diff. However, it is very useful to have summary statistics on a diff at a glance so that the user can have a good overview of the changes involved.

Historically we counted the changes in the patch text as we processed it so that only one diff operation would need to run to display a diff. This operation and its results were cached so performance was optimal. However, in the case of truncated diffs there were changes that were never seen and therefore not included in these statistics. This was done to give us better performance at the cost of slightly inaccurate total counts for large diffs.

With our move to progressive diffs, it would become increasingly likely that we would only ever be looking at a part of the diff at any one time so the counts would be inaccurate most of the time instead of rarely.

To address this problem we decided to collect the statistics for the entire diff using git-diff-tree --numstat --shortstat. This would not only solve the problem of dealing with partial diffs, but also make the counts accurate in cases where they would have been incorrect before.

The downside of this change is that Git was now potentially running the entire diff twice. We determined this was acceptable, however as the remaining diff processing for presentation was far more resource intensive. Also, with progressive diffs, it was entirely probable that many larger diffs would never have the second pass since those deltas might never be loaded anyway.

Due to the nature of how git-diff-tree works, we were even able to combine the call for these statistics with the call for deltas into a single command, to further improve performance. This is because Git already needed to perform a full diff in order to determine what the statistics were, so having it also print the tree diff information is essentially free.

Patches in batches: a whole new diff

For the initial request of a page containing a diff, we first fetched the deltas along with the diff statistics. Next we fetched as much diff text as we could, but with significantly reduced limits compared to before.

To determine optimal limits, we turned to some of our copious internal metrics. We wanted results as quickly as possible, but we also wanted a solution which would display the full diff in “most” cases. Some of the information our metrics revealed was:

81% of viewed diffs contain changes to fewer than ten files.
52% of viewed diffs contain only changes to one or two files.
80% of viewed diffs have fewer than 20KB of patch text.
90% of viewed diffs have fewer than 1000 lines of patch text.

From these, it was clear a great number of diffs only involved a handful of changes. If we set our new limits with these metrics in mind, we could continue to be very fast in most cases while significantly improving performance in previously slow or inaccessible diffs.

In the end, we settled on the following for the initial request for a diff page:

Up to 400 lines of diff text.
Up to 20KB of diff text.
A request cycle dependent timeout.
A maximum individual patch size of 400 lines or 20KB.

This allowed the initial request for a large diff to be much faster, and the rest of the diff to automatically load after the first batch of patches was already rendered.

After one of the limits on patch text was reached during asynchronous batch loading, we simply render the deltas without their diff text and a “load diff” button to retrieve the patch as needed.

Overall, the effective limits we enforce for the entire diff became:

Up to 3,000 files.
Up to 60,000,000 lines (not loaded automatically).
Up to 3GB of diff text (also not loaded automatically).

With these changes, you got more of the diff you needed in less time than ever before. Of course, viewing a 60,000,000 line diff would require the user to press the “load diff” button more than a couple thousand times.

The benefits to this approach were a clear win. The number of diff timeouts dropped almost immediately.

Additionally, the higher percentile performance of our main diffs pages improved by nearly 3x!

Our diff pages pages were traditionally among our worst performing, so the performance win was even noticeable on our high percentile graph for overall requests’ performance across the entire site, shaving off around 3.5s from the 99.9th percentile:

Looking to the future

This new approach opens the door to new types of optimizations and interface ideas that weren’t possible before. We’ll be continuing to improve how we fetch and render diffs, making them more useful and responsive.

Authors

Brian Lopez
Application Engineering Manager
GitHub Profile | Twitter Profile

Thomas Maurer
Application Engineer
GitHub Profile | Twitter Profile

Matt Clark
Application Engineer
GitHub Profile

Arthur Schreiber
Application Engineer
GitHub Profile | Twitter Profile

The post How we made diff pages three times faster appeared first on The GitHub Blog.

GLB part 2: HAProxy zero-downtime, zero-delay reloads with multibinder

GitHub Engineering — Thu, 01 Dec 2016 17:00:31 +0000

Recently we introduced GLB, the GitHub Load Balancer that powers GitHub.com. The GLB proxy tier, which handles TCP connection and TLS termination is powered by HAProxy, a reliable and high performance TCP and HTTP proxy daemon. As part of the design of GLB, we set out to solve a few of the common issues found when using HAProxy at scale.

Prior to GLB, each host ran a single monolithic instance of HAProxy for all our public services, with frontends for each external IP set, and backends for each backing service. With the number of services we run, this became unwieldy, our configuration was over one thousand lines long with many interdependent ACLs and no modularization. Migrating to GLB we decided to split the configuration per-service and support running multiple isolated load balancer instances on a single machine. Additionally, we wanted to be able to update a single HAProxy configuration easily without any downtime, additional latency on connections or disrupting any other HAProxy instance on the host. Today we are releasing our solution to this problem, multibinder.

HAProxy almost-safe reloads

HAProxy uses the SO_REUSEPORT socket option, which allows multiple processes to create LISTEN sockets on the same IP/port combination. The Linux kernel then balances new connections between all available LISTEN sockets. In this diagram, we see the initial stage of an HAProxy reload starting with a single process (left) and then causing a second process to start (right) which binds to the same IP and port, but with a different socket:

This works great so far, until the original process terminates. HAProxy sends a signal to the original process stating that the new process is now accept()ing and handling connections (left), which causes it to stop accepting new connections and close its own socket before eventually exiting once all connections complete (right):

Unfortunately there’s a small period between when this process last calls accept() and when it calls close() where the kernel will still route some new connections to the original socket. The code then blindly continues to close the socket, and all connections that were queued up in that LISTEN socket get discarded (because accept() is never called for them):

For small scale sites, the chance of a new connection arriving in the few microseconds between these calls is very low. Unfortunately at the scale we run HAProxy, a customer impacting number of connections would hit this issue each and every time we reload HAProxy. Previously we used the official solution offered by HAProxy, dropping SYN packets during this small window, causing the client to retry the SYN packet shortly afterwards. Other potential solutions to the same problem include using tc qdisc to stall the SYN packets as they come in, and then un-stall the queue once the reload is complete. During development of GLB, we weren’t satisfied with either solution and sought out one without any queue delays and sharing of the same LISTEN socket.

Supporting zero-downtime, zero-delay reloads

The way other services typically support zero-downtime reloads is to share a LISTEN socket, usually by having a parent process that holds the socket open and fork()s the service when it needs to reload, leaving the socket open for the new process to consume. This creates a slightly different situation, where the kernel has a single LISTEN socket and clients are queued for accept() by either process. The file descriptors in each process may be different, but they will point to the same in-kernel socket structure.

In this scenario, a new process would be started that inherits the same LISTEN socket (left), and when the original pid stops calling accept(), connections remain queued for the new process to process because the kernel LISTEN socket and queue are shared (right):

Unfortunately, HAProxy doesn’t support this method directly. We considered patching HAProxy to add built-in support but found that the architecture of HAProxy favours process isolation and non-dynamic configuration, making it a non-trivial architectural change. Instead, we created multibinder to solve this problem generically for any daemon that needs zero-downtime reload capabilities, and integrated it with HAProxy by using a few tricks with existing HAProxy configuration directives to get the same result.

Multibinder is similar to other file-descriptor sharing services such as einhorn, except that it runs as an isolated service and process tree on the system, managed by your usual process manager. The actual service, in this case HAProxy, runs separately as another service, rather than as a child process. When HAProxy is started, a small wrapper script calls out to multibinder and requests the existing LISTEN socket to be sent using Ancillary Data over an UNIX Domain Socket. The flow looks something like the following:

Once the socket is provided to the HAProxy wrapper, it leaves the LISTEN socket in the file descriptor table and writes out the HAProxy configuration file from an ERB template, injecting the file descriptors using file descriptor binds like fd@N (where N is the file descriptor received from multibinder), then calls exec() to launch HAProxy which uses the provided file descriptor rather than creating a new socket, thus inheriting the same LISTEN socket. From here, we get the ideal setup where the original HAProxy process can stop calling accept() and connections simply queue up for the new process to handle.

Example & multiple instances

Along with the release of multibinder, we’re also providing examples of running multiple HAProxy instances with multibinder leveraging systemd service templates. Following these instructions you can launch a set of HAProxy servers using separate configuration files, each using the same system-wide multibinder instance to request their binds and having true zero-downtime, zero-delay reloads.

Authors

Joe Williams
Senior Infrastructure Engineer
GitHub Profile | Twitter Profile | Blog

Theo Julienne
Senior Production Engineer
GitHub Profile

The post GLB part 2: HAProxy zero-downtime, zero-delay reloads with multibinder appeared first on The GitHub Blog.