From single-tenant to multi-tenant
2025-09-21 · robert lin · https://bobheadxi.dev/multi-tenant

In February 2025, Sourcegraph quietly launched the ability to click a few buttons, leave your credit card details, and get the full-fledged Sourcegraph code search experience for yourself and your team. This was only possible thanks to a project internally dubbed “multi-tenant Sourcegraph”.

The Sourcegraph code search product had been a single-tenant, self-hosted product for nearly a decade. Because individual Sourcegraph instances are expensive and complex to operate - code search at scale is hard! - it can be quite an investment to try it out, and even our managed single-tenant offering requires reaching out to sales and all that. For most people, our public code search deployment at sourcegraph.com/search was the only way to easily get access to reliable Sourcegraph code search, and this was limited to public code.

We wanted to bring the code search experience that Sourcegraph’s big enterprise customers loved to a wider self-serve audience, at a much lower price. Bringing multi-tenancy to Sourcegraph would allow us to offer code search at a dramatically lower cost, and allow users to get started in just seconds, instead of hours and days. Externally, we called this new multi-tenant Sourcegraph platform “Sourcegraph Workspaces”1.

Sourcegraph code search, just a click and your credit card away!

Working on the 6-month project to bring Sourcegraph Workspaces to fruition will always be one of the highlights of my career. It was an exciting experience to collaborate with some extremely talented colleagues, meeting up in Montreal and Berlin to bring the many pieces of the project together. As a bonus, several components of the project represented a satisfying culmination of things that I’ve worked on at Sourcegraph over the past few years.

My direct contributions to Sourcegraph Workspaces during the project mostly pertained to the coordination machinery required to make this happen and guiding some of the architectural decisions, which I will focus on in this post. The changes that enabled multi-tenancy without an insurmountable amount of rewrites were made by other talented engineers across the company, and I’ll try to cover the strategy here lightly as well.

The requirements

A few requirements for multi-tenancy in Sourcegraph were decided upon from the outset:

  1. Data isolation between each workspace must be extremely strict and robust by design, and there must be minimal engineering overhead for day-to-day feature development to handle multi-tenant special cases
  2. Creating a workspace must be seamless, and users must be able to start setting up their workspace within 10 seconds of clicking “purchase”
  3. The local development story for the entire multi-tenant experience, including creating and testing multiple workspaces locally, must be rock-solid and easy to set up

The strategy

The core concept was to enable each Sourcegraph instance “host” to run in a “multi-tenant mode”, housing many “tenants”, with strict data isolation between tenants living on the same host. The multi-single-tenant Sourcegraph Cloud platform would be the key technology to allow us to confidently scale the fleet horizontally to accommodate more tenants.

There are a few capabilities, however, that necessitated another level of abstraction on top of “tenants”:

  • Creating a tenant, and providing your billing details, must happen somewhere outside of any tenant, as this needed to be done before the user has a tenant to work with at all.
  • Our multi-tenancy model would be predicated upon a fleet of many hosts that could scale horizontally to accommodate more tenants as needed - this necessitated a coordination mechanism to decide where new tenants go, and track the host that each tenant lives on.

This is where the user-facing concept of a “workspace” came in: it would abstract away the concepts of “hosts” and “tenants”, so that users would only interact with a workspace that they access at my-workspace.sourcegraph.app/search.

This also abstracted the coordination across multiple hosts, and product requirements like billing, away from the design of the on-host tenant isolation mechanism, simplifying the workspaces-specific capabilities that needed to be baked into the core Sourcegraph product.

We built this workspaces coordination layer around a standalone “Workspaces service” (referred to with a capital “W” because naming is hard) operated on our services platform. This service would serve as a source-of-truth for workspace state, handle billing concerns, keep track of all hosts in the fleet, and assign newly created workspaces as tenants on available hosts.

flowchart TD
    %% Reverse direction + no arrows as a hack to get better positioning
    sourcegraph1 ===|routes requests to| router
    sourcegraph2 ===|routes requests to| router

    requests([🧑‍🦲 Visit a workspace]) 
    requests ==> router[Router]

    subgraph instances[Host fleet]
        subgraph sourcegraph1[Sourcegraph instance 1]
            tenant1
            tenant2
        end

        subgraph sourcegraph2[Sourcegraph instance 2]
            tenant3
            tenant4
        end
    end

    subgraph workspaces[Workspaces service]
        workspace1
        workspace2
        workspace3
        workspace4
    end

    workspace1-.->|represented by|tenant1
    workspace2-.->|represented by|tenant2
    workspace3-.->|represented by|tenant3
    workspace4-.->|represented by|tenant4

    create([🧑‍🦲 Create a workspace]) 
    create ==> workspaces

Workspace and tenant coordination

I drew heavily from my work on multi-single-tenant Sourcegraph Cloud in designing the architecture for the workspaces coordination layer. I proposed that tenants should be provisioned and managed in an eventually-consistent reconciliation model, with each host serving as the reconciler of the tenants assigned to it. This side-stepped issues around potential host downtime, particularly during Sourcegraph version upgrades, or general instability.

First, we needed awareness of all available hosts within a fleet. This was done via a registration and heartbeat process: hosts would frequently report liveness state to the central Workspaces service, and would only be eligible for tenant assignment if they had reported in as healthy within the last N seconds. This was also used to report capacity pressure, if the host became overpopulated or came under heavy load.

service IntegrationsService {
  // ...

  // ReportInstanceState should be used by Sourcegraph instances to report their
  // status and eligibility as a host for workspaces.
  rpc ReportInstanceState(ReportInstanceStateRequest) returns (ReportInstanceStateResponse) {
    option idempotency_level = IDEMPOTENT;
    option (sams_required_scopes) = "workspaces::instance::write";
  }

  // ...
}

message ReportInstanceStateRequest {
  // Static, globally unique, self-reported instance ID.
  string instance_id = 1;
  // The current state of the instance.
  InstanceState instance_state = 2;
  // The class of the instance. Should be a string of the format 'cloudv1://...'
  // denoting the class of the Cloud instance.
  string instance_class = 3;
}

message ReportInstanceStateResponse {}

In our design, hosts would make requests to the central Workspaces service to retrieve desired workspace state, while the Workspaces service would only communicate to the host via GCP Pub/Sub messages. This kept the data and control flow in one direction only, from the Workspaces service to hosts.

Pub/Sub messages published by the Workspaces service were the primary trigger3 for tenant reconciliation, with individual hosts requesting the desired state from the central Workspaces service to ensure a single source of truth. During each tenant reconciliation, a host could create tenants, add users to a tenant, and generally bring the tenant closer to the desired state dictated by the Workspaces service.

For example, consider what happens when a workspace owner purchases an additional seat and adds a member4, user A. The seat purchase and membership change is persisted to the Workspaces service, and now we broadcast a message: this is our Pub/Sub trigger.

Hello INSTANCE_X! Something has changed within a tenant you host, TENANT_Y. Please make sure everything is up to date and report back when done.

All hosts will receive the message, but only the indicated host will attempt to reconcile the corresponding tenant. It pulls the tenant’s desired state from the Workspaces service, and compares it to the state of the tenant locally. The reconciler notices that user A is not in the tenant, and corrects the diff to grant user A access to the tenant.

During this process, we needed a system for each host to report the state of each tenant in case something goes wrong - somewhat analogous to ReportInstanceState, but for individual workspaces - which we use for alerting, retries, and informing users of various error states:

service IntegrationsService {
  // ...

  // ReportWorkspaceState should be used by Sourcegraph instances to report the
  // status of workspaces they host.
  rpc ReportWorkspaceState(ReportWorkspaceStateRequest) returns (ReportWorkspaceStateResponse) {
    option idempotency_level = IDEMPOTENT;
    option (sams_required_scopes) = "workspaces::workspace::write";
  }

  // ...
}

message ReportWorkspaceStateRequest {
  // ID of the Sourcegraph instance hosting the workspace and reporting on the
  // workspace's state.
  string instance_id = 1;
  // The ID of the relevant workspace, of the format 'ws_...'
  string workspace_id = 2;
  // The state of the workspace.
  WorkspaceState workspace_state = 3;
}

message ReportWorkspaceStateResponse {}

We used a state machine library5 to ensure that workspaces can only make predictable, known state transitions. For example, the reconciler can say that “workspace was in provision pending, but is now provision failed”, but it cannot say that “workspace is now pending deletion” - that is something that can only be done by the customer, or by a human operator. Every state transition would be validated against the state machine’s rules to ensure it was legal, and each transition would be recorded to a durable audit log. For example, the state machine for the host’s tenant reconciler looks like the following - note how certain state transitions are not possible:

stateDiagram-v2
    ACTIVE_RECONCILE_FAILED
    ACTIVE_RECONCILE_SUCCESS
    DESTROY_FAILED
    DESTROY_PENDING
    DESTROY_SUCCESS
    DORMANT_RECONCILE_FAILED
    DORMANT_RECONCILE_SUCCESS
    PROVISION_FAILED
    PROVISION_PENDING
    SUSPENSION_FAILED
    SUSPENSION_PENDING
    SUSPENSION_SUCCESS
    state ACTIVE {
            ACTIVE_RECONCILE_FAILED
            ACTIVE_RECONCILE_SUCCESS
    }

    state DESTROY {
            DESTROY_FAILED
            DESTROY_PENDING
            DESTROY_SUCCESS
    }

    state DORMANT {
            DORMANT_RECONCILE_FAILED
            DORMANT_RECONCILE_SUCCESS
    }

    state PROVISION {
            PROVISION_FAILED
            PROVISION_PENDING
    }

    state SUSPENSION {
            SUSPENSION_FAILED
            SUSPENSION_PENDING
            SUSPENSION_SUCCESS
    }

    ACTIVE_RECONCILE_FAILED --> ACTIVE_RECONCILE_SUCCESS
    ACTIVE_RECONCILE_FAILED --> DORMANT_RECONCILE_SUCCESS
    ACTIVE_RECONCILE_SUCCESS --> ACTIVE_RECONCILE_FAILED
    ACTIVE_RECONCILE_SUCCESS --> DORMANT_RECONCILE_SUCCESS
    DESTROY_FAILED --> DESTROY_SUCCESS
    DESTROY_PENDING --> DESTROY_FAILED
    DESTROY_PENDING --> DESTROY_SUCCESS
    DESTROY_SUCCESS --> [*]
    DORMANT_RECONCILE_FAILED --> ACTIVE_RECONCILE_FAILED
    DORMANT_RECONCILE_FAILED --> ACTIVE_RECONCILE_SUCCESS
    DORMANT_RECONCILE_FAILED --> DORMANT_RECONCILE_SUCCESS
    DORMANT_RECONCILE_SUCCESS --> ACTIVE_RECONCILE_FAILED
    DORMANT_RECONCILE_SUCCESS --> ACTIVE_RECONCILE_SUCCESS
    DORMANT_RECONCILE_SUCCESS --> DORMANT_RECONCILE_FAILED
    PROVISION_FAILED --> ACTIVE_RECONCILE_SUCCESS
    PROVISION_PENDING --> ACTIVE_RECONCILE_SUCCESS
    PROVISION_PENDING --> PROVISION_FAILED
    SUSPENSION_FAILED --> SUSPENSION_SUCCESS
    SUSPENSION_PENDING --> SUSPENSION_FAILED
    SUSPENSION_PENDING --> SUSPENSION_SUCCESS
    [*] --> PROVISION_PENDING

This made for some case-heavy code in our host-side tenant reconciler, as various error and success paths must carefully take into account the current state of the workspace to evaluate the appropriate error or success state to report, but it kept things explicit.

Putting everything together, we have the following lifecycle for changes to workspaces, where each change would:

  1. Start with a user action, which would update the state of the workspace in the Workspaces service
  2. Trigger a Pub/Sub message, prompting a tenant reconciliation within the tenant’s host
  3. ReportWorkspaceState would indicate the end of the reconcile, whether successful or not

Example workspace lifecycle

As mentioned earlier, the local development experience needed to be rock solid for this complicated back-and-forth. We built the Workspaces service on our managed services platform, which came with a number of local development conventions that integrated nicely with our standard developer tooling for running services locally. To emulate the Pub/Sub component, we built a wrapper around the official pstest package to mock a GCP Pub/Sub API that had the desired topics and subscriptions configured.

In combination with using shared credentials for the live staging instance of our accounts provider, “Sourcegraph Accounts”, this allowed a simple command (sg start multitenant) to give you the full workspaces experience, from creation to purchasing and more, locally with zero additional configuration.

Workspace creation

While eventual consistency has many nice properties, it was somewhat contrary to one of our core goals: provisioning a workspace must feel seamless and frictionless to the user. There should not be a step where we tell the user to hang in there and just check back eventually. We needed certain time-sensitive events - namely workspace creation, and joining a workspace - to be able to synchronously await the resolution of an event within our trigger-reconcile-report lifecycle.

workspace creation screen

To solve this, the design included the ability to subscribe to changes within the Workspaces service as well, when a Pub/Sub notification is issued to instance hosts. This was implemented with PostgreSQL’s LISTEN and NOTIFY commands, and the lifecycle of a synchronous series of events would go:

  1. User initiates action - in this example, “take my money and create my workspace”
  2. Workspaces service processes this request:
    1. Create a workspace entity and assign it to a host
    2. Broadcast a Pub/Sub message with the workspace ID, and the host the workspace should be assigned to
    3. Register a LISTEN on a notification channel for the workspace, and wait for an update
  3. While the Workspaces service waits for an update, the target host:
    1. Receives the Pub/Sub message
    2. Creates a tenant to represent the workspace
    3. Uses ReportWorkspaceState to indicate completion
  4. Workspaces service processes the ReportWorkspaceState to validate and commit the workspace state transition, using NOTIFY to let the waiting LISTEN know the workspace is ready and the request can proceed
  5. User gets redirected to active workspace

This flow had the speed properties we needed, and it also gave reasonable error-handling opportunities in the event the tenant creation fails, as the workspace entity is a durable one (agnostic of tenant) and we can be alerted to manually handle any issues.

To put all the concepts explained so far into a single illustration:

sequenceDiagram
    actor User
    participant WorkspacesService
    participant Instance1
    participant Instance2
    loop Liveness report every few seconds
        Instance1-->>+WorkspacesService: [RPC] I'm alive!
        Instance2-->>+WorkspacesService: [RPC] I'm alive!
      end
    User-->>+WorkspacesService: "[RPC] Give me a workspace!"
    Note over WorkspacesService: I know that Instance 2 is alive<br/>and eligible to host a new tenant.

    WorkspacesService->>+Instance1: [Pub/Sub] New workspace for Instance2!
    Note over Instance1: I was not chosen - message is not relevant to me
    WorkspacesService->>+Instance2: [Pub/Sub] New workspace for Instance2!
    Note over Instance2: Create tenant for the workspace...
    Note over WorkspacesService: Waiting for a state transition<br/>in the workspace to be reported...

    Instance2-->>-WorkspacesService: [RPC] State change: workspace is ready!
    WorkspacesService-->>-User: [Response] Here ya go!

Note that this doesn’t cover the many things that happen behind the scenes in the seconds between the purchase event and the user being seamlessly directed to their workspace:

  1. Basic validation for bad names, existence checks, discount codes, etc.
  2. Stripe integration: we need to validate the provided payment method and track it for future billing needs6 before we could finalize instance provisioning
  3. Cloudflare workers: we need to ensure that request routing is configured correctly for the workspace
  4. Host-side tenant provisioning: we’ve alluded to this, but creating an isolated tenant is a whole thing

It’s very fun imagining all this back-and-forth happening in the short time you see the loading spinner during workspace creation (which I did many times in manual testing), completely invisible to the user.

Treating each workspace as a full Sourcegraph instance

Because of Sourcegraph’s single-tenant history, everything is built around the concept of Sourcegraph instances living on a URL like sourcegraph.my-company.com or mycompany.sourcegraphcloud.com (for Sourcegraph Cloud customers): our editor extensions, telemetry reporting, application URL paths, and more. Making sure that each workspace in multi-tenant Sourcegraph looked and behaved more or less exactly the same as single-tenant Sourcegraph instance helped us plug in neatly into existing assumptions.

A key part of this is ensuring each workspace gets its own domain, namely my-workspace.sourcegraph.app. To route users correctly, we needed to be able to efficiently and dynamically route users of my-workspace to the host where my-workspace lives as a tenant. Michael selected Cloudflare Workers for this task - latency was crucial, as this routing must happen for every single request to a workspace, and Cloudflare Workers’ at-the-edge properties, combined with the equally-at-the-edge Cloudflare Workers KV for data storage, made them a good choice7.

The implementation of this routing layer was simple:

  1. The Workspaces service, which knows which host each workspace lives in, writes workspace-to-host mappings to the “routes table” in Cloudflare Workers KV
  2. Each request to a workspace goes through our Cloudflare Worker, which would check the KV store for where the request should go and redirect it to the appropriate host
  3. The host would look up the tenant corresponding to the requested workspace, and attach the correct tenant context to the request

This basically allowed all Sourcegraph integrations, and even unofficial Raycast extensions, to “just work” by pointing them to a workspace’s URL.

sequenceDiagram
    actor User
    %% this is Cloudflare orange lol
    box rgb(244, 129, 32) Cloudflare Network
        participant Workers as Cloudflare Workers
        participant KV as Cloudflare Workers KV
    end
    box Instance1
        participant Frontend1
        participant DB1
    end

    User->>+Workers: GET tenant-01.sourcegraph.app
    Workers->>+KV: LookupRoute tenant-01.sourcegraph.app
    KV-->>-Workers: host-01.sourcegraph.app
    Workers->>+Frontend1: Forward request<br/>Origin: host-01.sourcegraph.app<br/>Host: tenant-01.sourcegraph.app
    Frontend1->>+DB1: LookupTenant tenant-01
    DB1-->>-Frontend1: tenant(id=1, name=tenant-01)
    Frontend1-->>-Workers: Response
    Workers-->>-User: Response

Prior to the multi-tenant Sourcegraph project, I also led a redesign of Sourcegraph’s telemetry system to enable collection of usage telemetry from (nearly) all Sourcegraph instances into a central telemetry ingestion service. This made it easy to extend the telemetry ingestion system to account for a new source, “workspaces”, analogous to sources like “licensed instance”, and have all our telemetry collection “just work” in workspaces. Our analytics team could treat workspaces just like all our other single-tenant Sourcegraph instances, and all our existing dashboarding for analysing usage could be used to study workspace usage with minimal changes. Later, thanks to this property, we were able to extend our customer-facing analytics service to support workspaces with minimal effort as well.

A number of other abstractions also had to be redesigned to accommodate workspaces - for example, I also hacked on building a new “gating” abstraction on top of our traditional license enforcement code, so that we could have feature gating backed by either licenses, or be “granted” by the workspace corresponding to the tenant. This contributed to our goal of making workspaces support as simple as possible when building new features.

Tenant isolation

This is not a component that I worked on significantly, but without tenant isolation, we wouldn’t have a multi-tenant offering to speak of, so I will try my best to summarise it here. The core strategy, designed largely by my talented colleague Erik8, was to push enforcement of data isolation as close to stored data as possible, with varying approaches based on the data storage system:

  1. Row-level security in Postgres
  2. Namespacing in Redis, e.g. tnt_42:somekey:...
  3. Stateful services use directory namespacing, e.g. /data/tenant/42/...
  4. Mandatory use of standardised tenancy-enforcing in-memory data structure libraries

This was coupled in-application with an audit and expansion of our usage of Go’s context.Context propagation9. Since all the microservices within a Sourcegraph instance are written in Go, we could leverage context propagation both within and between services to enforce tenant data isolation, by attaching the acting tenant to the context.

Importantly, through some Go packaging trickery10, we don’t provide an easy way to create a new context with a specific tenant — we only allow this in places like the HTTP middleware that decides which tenant a request is designated for, or the HTTP middleware where tenant context must be propagated over the wire between Sourcegraph services (but not from external traffic). This prevents on-demand tenant impersonation outside the places that absolutely need it, ensuring that requests are processed only in the context of the tenant they pertain to, and therefore preventing cross-tenant data leakage.

Utilities that need to support multi-tenancy explicitly - for example, enforcing tenant isolation in in-memory data structures - will retrieve tenant information from context:

type keyWithTenant[K comparable] struct {
  tenant int // ensure identical keys are separated by tenant ID
  key    K
}

func newKey[K comparable](tenantID int, key K) keyWithTenant[K] {
  return keyWithTenant[K]{tenant: tenantID, key: key}
}

// Get looks up a key's value from the cache.
func (c *LRU[K, V]) Get(ctx context.Context, key K) (value V, ok bool, err error) {
  tnt, err := tenant.FromContext(ctx) // retrieve tenant from context
  if err != nil {
    return value, false, err
  }
  v, ok := c.cache.Get(newKey(tnt.ID, key)) // attach tenant to data key
  return v, ok, nil
}

To most application components, tenancy concerns do not have to be handled explicitly, as long as the approved data management mechanisms are used (which we enforce in linters). This was important for our requirement that building new features with tenancy in mind should require minimal engineering overhead. For example, when using the above cache, note the lack of any explicit tenancy checks - callsites just need to pass along the request context:

import (
  // ...
  "github.com/sourcegraph/sourcegraph/internal/memcache"
)

// Multi-tenant in-memory LRU cache
var globalRevAtTimeCache = memcache.NewLRU[revAtTimeCacheKey, api.CommitID](/* ... */)

func (g *gitCLIBackend) RevAtTime(ctx context.Context, spec string, t time.Time) (api.CommitID, error) {
  // ...

  // Don't need to worry about the tenant; assume we are acting in a tenant context.
  key := revAtTimeCacheKey{g.repoID, sha, t}
  entry, ok, err := g.revAtTimeCache.Get(ctx, key)
  // ...
}

The same goes for database access: for all the many, many database queries used throughout Sourcegraph, as long as the shared database connection infrastructure is used, the appropriate session variable is automatically added based on tenant context to satisfy PostgreSQL row-level security (RLS) policies before data access is granted:

func (n *extendedConn) setTenant(ctx context.Context) error {
  tnt, err := tenant.FromContext(ctx) // retrieve tenant from context
  if err != nil { /* ... */ }

  _, err = n.execerContext.ExecContext(ctx,
    // Set `app.current_tenant` for RLS policies, which are enforced on all tables:
    //
    //   ((tenant_id = (SELECT (current_setting('app.current_tenant'::text))::integer AS current_tenant)))
    //
    // where `tenant_id` is a column required on every single table. This means that this
    // session can only access data where `tenant_id = app.current_tenant`.
    fmt.Sprintf("SET app.current_tenant = '%s'", strconv.Itoa(tnt.ID)),
    nil,
  )
  // ...
}

Note that all the descriptions above - particularly the consistent enforcement of row-level security policies in our databases - are very simplified. Adding a tenant_id column to all tables alone came with the discovery of all sorts of edge cases, performance issues, and difficult migrations. A lot of work had to be put into polishing the implementation of background jobs, for example: how do we extend our background work framework to efficiently and fairly queue and process jobs on a per-tenant basis, for potentially thousands of tenants on a single host, without making future background jobs hard to build? How do we make sure this whole system is leak-proof, and what tools11 can we add to make plugging the gaps as simple as possible?

I didn’t get to work very closely on most of this side of things - and there would be too many things to cover in one blog post anyway - but it was really cool to see the team systematically build libraries, tools, and processes to cover everything we needed to offer tenant isolation with confidence.

The launch

As launch day approached, the excitement within the team was palpable. Everyone working on every aspect of Sourcegraph Workspaces - from tenant isolation, to workspace coordination, to host management, to billing integrations, to the local development experience, to a fully functional staging environment for QA, to the in-product experience, to the Linear issue trackers that managed this sprawling project, to the operational playbooks for every conceivable scenario - had worked very hard over the preceding 6 months to deliver the most robust, polished experience we could offer with the resources we had on hand. We knew Sourcegraph had a solid product that our customers already loved. We were hopeful that we were about to turn a new page in Sourcegraph’s history by making this product more accessible than ever.

Much to our relief, the launch went smoothly: we had no show-stopping issues, and everything seemed to chug along more or less exactly as designed. It “just worked”, and I was very happy to see the product come together.

Unfortunately, the launch also became a (perhaps obvious) learning experience for me: that even the most polished product might not do well without a lot of other factors also aligning in just the right way. Sourcegraph Workspaces was not the game-changer we hoped for, but it continues to serve some customers well, and I’m proud of what we built.


  1. February 2025 launch post: https://sourcegraph.com/blog/introducing-sourcegraph-enterprise-starter 

  2. In particular, the standalone Sourcegraph Accounts service from my colleague Joe was an instrumental piece of the story, as it gave us the cross-tenant identity provider we needed to get started. 

  3. Host-wide reconciles are also triggered on a cron as a fallback. 

  4. Workspace membership was implemented as a concept tracked by the Workspaces service, even though there is a rudimentary in-Sourcegraph user management system for Sourcegraph administrators. The decision not to use the in-Sourcegraph user management system largely came from a need for “workspace billing admin” state to live in the Workspaces service as a source of truth, and to make it easier to build the more product-led-growth-oriented invitation flows that we anticipated building. 

  5. https://github.com/hishamk/statetrooper 

  6. We implemented our own payment scheduler, also designed by my colleague Joe, using Stripe only for processing invoices as determined by the Workspaces service. This gave us a lot more flexibility. 

  7. Michael made a very in-depth evaluation of a number of options, comparing pricing, latency, operational overhead, implementation complexity, and more before landing on this choice. Like many aspects of this project mentioned here, there’s a lot of details that I probably can’t get into as much as I would like to in a public post! 

  8. MVP of this whole project, really. 

  9. In case you aren’t familiar, context.Context is one of my favourite things about Go: it “carries deadlines, cancellation signals, and other request-scoped values across API boundaries and between processes”. There’s a good example of propagating values with context.Context in https://go.dev/blog/context#package-userip. This is very useful for propagating authentication/authorization state for enforcement throughout an application. 

  10. There’s a lot of ways to structure code using Go’s /internal/ subpackages to ensure that only certain components have access to certain things. We built a lot of utilities into the tenant subpackage, which exports Sourcegraph-wide APIs like retrieving tenant from context and various HTTP middlewares, but not a way to directly create a context with a tenant. That is reserved for tenant/internal/tenantcontext, making it accessible only to multi-tenancy-related utilities in the tenant/... subpackages. 

  11. For example, using runtime/pprof to record callsites where tenant context is not correctly propagated, which I never thought of as a profiling use case. 

Building and operating online services around an on-prem, single-tenant product
2025-09-20 · robert lin · https://bobheadxi.dev/managed-services-platform

The advent of LLMs and “AI-enabled” products led to a rapid pivot within Sourcegraph from a purely single-tenant product into new product surfaces with managed, multi-tenant components. This required rethinking how we built and operated multi-tenant services.

For a very long time, the public deployment of Sourcegraph’s single-tenant, self-hosted code search product at sourcegraph.com/search - internally dubbed “dotcom” - was the only user-facing service that we operated. As a result, anything that required an online, central service was simply glued onto our self-hosted product and conditionally enabled. This was easier in some ways, but later began to present a significant liability.

Use cases for these glued-on capabilities varied: there was an early attempt from many years ago at a Stripe integration for purchasing Sourcegraph licenses, which we would obviously not want to ship to our self-deployed customers. There were also core product needs like license key management, telemetry collection, and more that provided functionality required for our self-deployed product, but had to live in a service operated by the company - so into “dotcom” it went.

As the LLM hype really started to kick off in 2023-ish, we began building Cody in earnest. This also meant that even our most security-conscious enterprise customers were interested in accessing LLM-enabled features, necessitating an easy way to get them onboarded - which is where this story starts.

Managed LLM Proxy

The advent of the big LLM providers like OpenAI and Anthropic meant that to get the latest and greatest LLM functionality, one had to send data to a third party and be charged on a metered basis. This was a significant paradigm shift in Sourcegraph’s enterprise customer base, where many customers signed annual commitments and would sometimes even opt for a completely air-gapped deployment that could not make any requests outside of the customer’s defined network boundaries. Now, we suddenly needed an option for these customers to make LLM requests using our pre-negotiated agreements with various LLM providers.

Given the self-hosted nature of many of Sourcegraph’s largest customers at the time, issuing Anthropic and OpenAI API keys for them was a non-starter, both for operational and security reasons. We decided to build a LLM proxy service1: this proxy would accept an API key derived from the license keys we already issue to each customer, and use that to determine the LLM usage quota allocated to the customer.
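The core of that idea - derive a proxy token from the license key, then gate LLM usage on a per-customer quota - can be sketched in a few lines of Go. Everything here is illustrative: the derivation scheme, names, and quota model are assumptions, not Cody Gateway’s actual implementation:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// deriveGatewayToken derives a proxy API key from a customer's license key,
// so the license key itself never needs to be shared with the proxy.
// (Hypothetical scheme: an HMAC over the license key with a versioned salt.)
func deriveGatewayToken(licenseKey string) string {
	mac := hmac.New(sha256.New, []byte("gateway-token-v1"))
	mac.Write([]byte(licenseKey))
	return "sgd_" + hex.EncodeToString(mac.Sum(nil))
}

// quotaStore maps a derived token to the customer's remaining LLM usage quota.
type quotaStore map[string]int

// allow reports whether a request for the given number of tokens fits within
// the quota allocated to the customer identified by the proxy token.
func (q quotaStore) allow(token string, tokensRequested int) bool {
	remaining, ok := q[token]
	return ok && tokensRequested <= remaining
}

func main() {
	tok := deriveGatewayToken("customer-license-key")
	quotas := quotaStore{tok: 1000}
	fmt.Println(quotas.allow(tok, 500))  // within quota: true
	fmt.Println(quotas.allow(tok, 5000)) // exceeds quota: false
}
```

Because the token is derived deterministically, the proxy can validate it against known license keys without issuing and storing a separate credential per customer.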

By this point, the long tradition of tacking functionality like this onto “dotcom” had started to show its flaws:

  • Code search at scale is hard, and “dotcom” hosted huge numbers of repositories that would cause instability that cascaded to some of its add-on capabilities
  • Deploying “dotcom” continuously was a poor experience, as the self-hosted code search product was designed for once-a-month deployments, adding to the instability of all services within
  • Configuration options and feature flags for “dotcom”-only functionality contributed to complexity and technical debt
  • The “Sourcegraph monolith”2 was designed for high-trust, single-tenant environments, which often resulted in design choices and mechanisms that ran contrary to the needs of a multi-tenant service3

I joined the project to build out this LLM proxy - later dubbed “Cody Gateway”4 - as Sourcegraph’s first ever standalone customer-facing online service. The technology was ultimately quite simple: we picked an infrastructure platform, GCP Cloud Run5, built out a bunch of Terraform around it, and wrote a service.

Straightforward as it sounded, the greenfield nature of the project meant we sank a lot of time into answering questions that previously weren’t really a concern within the core “monolith” codebase frameworks: do we borrow patterns for X, Y, Z? Where do we need to diverge, and what must we avoid importing? What security patterns and requirements do we need to follow? What can we simplify, or do better, in an independent service? Documentation and standard operating procedures also needed to be written down and defined for this custom infrastructure.

The investment paid off, however! The standalone nature of Cody Gateway allowed us to quickly develop a small service that satisfied our requirements, and we had a successful rollout that allowed all Sourcegraph customers to begin using our new AI product with zero additional configuration. In August 2023, I authored an internal document titled “Post-GA Cody Gateway: takeaways for future Sourcegraph-managed services” which opened with a paragraph highlighting some of the big throughput numbers we were already serving, and then an idea:

Since this is Sourcegraph’s first major managed service that isn’t a Sourcegraph instance, Cody Gateway’s infrastructure setup is completely new. This document outlines how Cody Gateway is operated and deployed, and takeaways and ideas we can apply towards building future Sourcegraph-managed services, including a proposal that we generalise Cody Gateway’s infrastructure setup and use a generative approach to allow it to be re-used by future managed Sourcegraph services. This platform can be used to deploy the new services slated to be built by Core Services: Core Services team scope for FY24 Q3+, as proposed in RFC 805 PRIVATE: Sourcegraph Managed Services Platform.

The document featured a cost analysis, overview of the technical components that seemed obviously reusable, and some of the benefits of operating a service with this new methodology.

I remember this document well because the proposed “Managed Services Platform” - which I will refer to as “MSP”6 - became one of the Core Services team’s cornerstone projects and the foundation of how we build and operate online services at Sourcegraph.

Building on the fly

As usually happens, the need for such a platform surfaced at the same time as a number of projects that would benefit from it, two of which fell within my team’s ownership. One of these projects was a time-sensitive push to build “Sourcegraph Accounts”7, a standalone user accounts management service that was needed to more robustly back the company’s venture into a self-serve “PLG” (product-led growth) product, “Cody Pro”8.

These other projects helped cement the need for a platform like MSP, but I knew that adoption from the get-go would be key to the platform’s success. If any of these had started up either as “dotcom” add-ons, as before, or on some other hand-crafted infrastructure, migrating to the envisioned platform would likely be time-consuming and difficult to justify. This meant that the ability to adopt MSP components incrementally as they reached readiness was crucial.

As a result, the first piece of MSP that came together was a bare-bones infrastructure generator. From a simple service specification, we could run a command to generate a bunch of Terraform. The generated Terraform configuration could be applied using our standard Terraform Cloud setup, and even augmented with custom Terraform if the use case called for it.

This made MSP about as easy to adopt as hand-crafting infrastructure for each project’s needs, except that instead of writing plain Terraform, we wrote code to generate equivalent Terraform and encoded each requirement into the service specification.

# service.yaml
service:
  id: accounts
  # ...
environments:
  - id: dev
    domain:
      # Where this service will be served, e.g. accounts.sourcegraph.com
    resources:
      # Declare requirements for a Postgres database or Redis instance
    # ... other options like scaling, etc

Ensuring each addition to MSP did not block progress on dependent projects required very active involvement, but as a result, MSP was able to “own” the infrastructure for several services from the very start. The approach saved us from potentially frustrating migrations down the road and presented a fantastic dogfooding opportunity to ensure the platform actually served all our needs.

Scaling changes

The Terraform generation tooling is built on CDKTF, the “Cloud Development Kit for Terraform”. We use the official SDK and generated bindings9 for existing Terraform providers to build a single entrypoint for “given this service specification, give me all the infrastructure I should have”. This was a tech stack first introduced for our managed multi-single-tenant platform, and existing experience helped us hit the ground running.

For example, the following Terraform:

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 4.0"
    }
  }
}

variable "project_id" {
  description = "The GCP project ID"
  type        = string
  default     = "my-project-123"
}

variable "project_name" {
  description = "The GCP project name"
  type        = string
  default     = "My Project"
}

provider "google" {
  project = var.project_id
  region  = "us-central1"
}

resource "google_project" "main" {
  name                = var.project_name
  project_id          = var.project_id
  billing_account     = "123456-ABCDEF-789012"
  auto_create_network = false
}

Looks like the following in our Terraform generator, written in Go using generated CDKTF bindings - note the similarities:

package main

import (
    "github.com/aws/constructs-go/constructs/v10"
    "github.com/aws/jsii-runtime-go"
    "github.com/hashicorp/terraform-cdk-go/cdktf"
    "github.com/sourcegraph/controller-cdktf/gen/google/project"
    google "github.com/sourcegraph/controller-cdktf/gen/google/provider"
)

type ProjectConfig struct {
    ProjectID string
    Name      string
}

func NewProjectStack(scope constructs.Construct, name string, config ProjectConfig) cdktf.TerraformStack {
    stack := cdktf.NewTerraformStack(scope, &name)

    // Configure Google provider
    google.NewGoogleProvider(stack, jsii.String("google"), &google.GoogleProviderConfig{
        Project: jsii.String(config.ProjectID),
        Region:  jsii.String("us-central1"),
    })

    // Create the project
    project.NewProject(stack, jsii.String("main"), &project.ProjectConfig{
        Name:              jsii.String(config.Name),
        ProjectId:         jsii.String(config.ProjectID),
        BillingAccount:    jsii.String("123456-ABCDEF-789012"),
        AutoCreateNetwork: jsii.Bool(false),
    })

    return stack
}

func main() {
    app := cdktf.NewApp(&cdktf.AppOptions{})
    
    // Configure the app
    NewProjectStack(app, "project-stack", ProjectConfig{
        ProjectID: "my-project-123",
        Name:      "My Project",
    })
    
    // Generate Terraform!
    app.Synth()
}

The similarities allowed me, with some luck, to feed OpenAI or Anthropic LLM models some existing Terraform as reference, and use that to generate some CDKTF Go equivalents as starter code. Once implemented in CDKTF Go, having a full programming language available for templating Terraform with type-safe bindings gave us a ton of flexibility, and is significantly easier to work with than the semantics that HCL or Terraform modules offer. The concept is similar to Pulumi, another very popular tool for declaring infrastructure in mainstream programming languages.

For example, we used this Terraform generator such that:

  • based on the “tier” of the service (test, internal, or external/customer-facing), stricter access control rules were conditionally applied in line with our security policies, without any intervention from the service operator
  • when GCP Cloud Run made a more efficient VPC connection mechanism available, we were able to update our generator to use this new mechanism for all services that needed it, and apply the change across the MSP fleet
  • for services that had to meet SOC2 requirements for testing changes, service owners can deploy multiple service environments and configure a multi-stage deployment pipeline for them
  • recently - more than 2 years since the start of the MSP project - a new audit logging strategy for MSP databases was introduced: any service that had a database automatically got an upgrade that applied the new strategy, and all that was needed was a single code change within the generator
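The first point can be sketched roughly as follows. The tier names match the ones above, but the policy fields and rules here are illustrative, not the actual generator logic:

```go
package main

import "fmt"

// Tier of an MSP service, as declared in its specification.
type Tier string

const (
	TierTest     Tier = "test"
	TierInternal Tier = "internal"
	TierExternal Tier = "external" // customer-facing
)

// accessPolicy is a hypothetical set of generated access-control rules.
type accessPolicy struct {
	RequireTimeBoundAccess bool // operators must request temporary access
	AllowDirectDBAccess    bool
}

// accessPolicyFor derives access rules from the service tier alone, so the
// service operator never has to configure (or get wrong) these settings.
func accessPolicyFor(tier Tier) accessPolicy {
	switch tier {
	case TierExternal:
		return accessPolicy{RequireTimeBoundAccess: true, AllowDirectDBAccess: false}
	case TierInternal:
		return accessPolicy{RequireTimeBoundAccess: true, AllowDirectDBAccess: true}
	default: // test services get the loosest defaults
		return accessPolicy{RequireTimeBoundAccess: false, AllowDirectDBAccess: true}
	}
}

func main() {
	// The generator would feed a policy like this into the emitted Terraform.
	fmt.Printf("%+v\n", accessPolicyFor(TierExternal))
}
```

The key property is that the rule lives in one place in the generator, so tightening a policy later propagates to the whole fleet on the next regeneration.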

This approach has also been very helpful when building integrations with other company functions, like data analytics: all that great data stored within service databases is very interesting to our analytics team, who build dashboards that provide important insights for decision-making processes across the company. Collaborating with the analytics team, we added a few configuration fields that would provision a PostgreSQL publication for tables of choice, integrate them into GCP Datastream, and plumb data into BigQuery for analysis. The setup is fairly involved, but doing it once in our Terraform generator made this easy to reproduce and familiar for our analytics team: this feature was eventually used in nearly a third of the MSP fleet.
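In the service specification, an opt-in like this ends up as a small declarative block. The field and table names below are made up for illustration - only the shape matters:

```yaml
# service.yaml (illustrative fields, not the exact MSP schema)
environments:
  - id: prod
    resources:
      postgreSQL:
        databases:
          - name: main
            # Hypothetical analytics opt-in: provision a PostgreSQL
            # publication for these tables, wire it into GCP Datastream,
            # and land the data in BigQuery for the analytics team.
            analyticsTables:
              - users
              - subscriptions
```

From there, the generator owns all the Datastream and BigQuery plumbing, so every adopting service gets an identical, familiar setup.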

This year, we reached a state where we had so many security features available on an on-by-default or opt-in basis in MSP that completing a SOC2 audit has never been easier: time-bound access, staged rollouts, audit trails, observability, debugging, profiling, and more was bundled in a single solution. Our investments were put to the test when we decided to include Amp in our SOC2 audit: by flipping on a bunch of MSP features and performing some light application auditing to tidy up a few things, we were able to get Amp into a SOC2-compliant state with just a week’s heads up.

The explicit update-spec-and-regenerate-infrastructure-then-commit flow can definitely be clunky at times, and it is nowhere near as polished as a fully magical, self-managing control plane. As a result, major service changes still require handholding from someone on my team, especially during initial setup. However, even after several years, we’ve still found this compromise acceptable, and it makes “hacking” on MSP infrastructure a lot easier for teammates well-versed in MSP internals.

Developer experience

My ambitions for MSP did not stop at just an infrastructure specification and generator. My time working on Sourcegraph Cloud and the Cody Gateway’s build-operate-transfer project10 informed another three key components I wanted to offer:

  1. Operational tooling: the Sourcegraph Cloud CLI offered a wide range of commands for interacting with instances and performing common operations, like connecting to the database with all the right IAP tunnelling and gcloud CLI invocations. It made things like checking data and performing manual surgery to fix broken states just as easy as if it were in local development, as long as you had the required time-bound access.
  2. Generated documentation: I introduced a comprehensive documentation generator for Sourcegraph Cloud instances, which allowed us to create in-depth operational guidance with copy-pastable commands and links directly to the resources relevant to that instance. I wanted each service to have service-specific guidance like this, for free, just by opting in to MSP.
  3. All-batteries-included SDK: we had to reinvent a few wheels to get Cody Gateway out the door. I wanted MSP to provide a “runtime” that engineers could build their service around, and encode best practices for accessing deployed resources (like databases), setting up observability, and more into this “runtime”.

The vision was for engineers to focus on building cool things, and let us handle the plumbing. Once MSP had its initial tenants confirmed, I quickly built out the foundation for all of the above:

  1. I extended Sourcegraph’s developer tool sg with a new set of tools under sg msp. This toolchain already included our new Terraform generator, but it eventually became home to commands like:
    • sg msp db connect $SERVICE $ENV to connect to the database with the appropriate IAP tunnelling and configuration. We later added an analogous sg msp redis connect, which was another common use case.
    • sg msp logs $SERVICE $ENV and other resource quick-links to get service operators to exactly where they needed to go, without having to figure things out themselves.
  2. I added a MSP service operation guide generator that would write documentation for each service through Notion’s API11, so that engineers could easily figure out what they might need to do to work with their deployed service. Generated documentation is highly tailored to each service specification, so that only relevant guidance is included.
    • This was also available in Markdown via sg msp operations, which came in handy when it later turned out that LLMs are really good at figuring things out if provided some Markdown documentation.
Part of the generated documentation for a MSP service, as viewed in the terminal.
A teammate even introduced generated architecture diagrams based on the resources provisioned in the service specification, which came in handy in explanations!

For our all-batteries-included SDK, I introduced a new package in our monorepo, lib/managedservicesplatform/runtime, which services could adopt to ensure they had the latest and greatest in service initialization conventions. For example, with the following:

// main.go
package main

import (
  "github.com/sourcegraph/sourcegraph/lib/managedservicesplatform/runtime"

  // Your implementation!
  "github.com/sourcegraph/my-service/service"
)

func main() {
  runtime.Start(service.Service{})
}

You could rest easy knowing that everything you might need for error reporting, profiling, tracing, and more would be set up in accordance with our suggested best practices. In your service implementation, you would also be able to easily access all the resources you’ve provisioned via your MSP service specification using the corresponding runtime SDK:

// Interface required for runtime.Start(...)
func (s Service) Initialize(ctx context.Context, logger log.Logger, contract runtime.ServiceContract, config Config) (background.Routine, error) {
  // Connect to Redis
  redisClient, err := contract.Redis.NewClient()
  if err != nil {
    return nil, errors.Wrap(err, "failed to create Redis client")
  }

  // MSP conventions for running database migrations
  if err := contract.PostgreSQL.ApplyMigrations(/** ... */); err != nil {
    return nil, errors.Wrap(err, "failed to apply migrations")
  }
  pool, err := contract.PostgreSQL.GetConnectionPool(ctx, databaseName)
  if err != nil {
    return nil, errors.Wrap(err, "failed to get connection pool")
  }

  // ... use redisClient and pool to build and return the service's background routine
}

Over time, more and more elaborate features were added to the runtime, giving most service requirements a straightforward “golden path”. The runtime eventually included integrations with other core services like telemetry and our central Sourcegraph Accounts service, and we even added a wrapper around OpenFGA so that a full-blown fine-grained authorization integration could be provisioned for the few services that needed it.

With infrastructure, tooling, documentation, processes, and even the general shape of service implementations all guided by MSP, providing support and guidance to MSP users across the company became much easier: we could simply guide people to resources and options that we’d already invested in. Whenever a gap was raised, we would improve the relevant aspect of MSP, and the improvement would become available to all MSP services. New features and integrations for standalone managed services now had an obvious home in the MSP stack, no matter what layer of the application stack the addition pertained to.

Adoption

When we first set out to build this “managed services platform”, I had hoped to see it adopted in maybe 2 or 3 production services and maybe a handful of internal utilities to justify our initial investment in the platform.

Today, the platform hosts over 30 services, just under half of which are denoted “external” or customer-facing. Some of these projects required a decent amount of hand-holding, encouragement, and influence to get them to adopt MSP, but especially in the past year I’ve seen many clever and useful internal services get themselves up and running on MSP. Even more exciting than adoption has been the eagerness from some teams to contribute new integrations to MSP to make everyone’s lives easier, and the efforts of my own teammates to advocate for the platform and provide hands-on support for our users and integrators. The impact of this fairly simple project has far exceeded my expectations.

Getting the ball rolling, however, undeniably required quite a lot of advocacy. I wrote documentation of various shapes and forms in Notion to advertise the headache-solving potential of MSP, and to convince engineers that we had a solid, well-thought-out platform that they could trust. The occasional rushed feature addition was required to demonstrate that the platform had our full commitment. I spoke at company offsites, recorded demo videos, and got caught up in many a Slack debate. But I’m glad we put in the time, and I’m grateful for the many contributions from across the company to make MSP a valuable component of Sourcegraph’s engineering ecosystem.

One example of my attempts at relatable humour during a company offsite.

  1. The first commit in April 2023! 

  2. This is a term I use to refer to all the bits and bobs that go into the core self-hosted Sourcegraph product. We have a monorepo that contained nothing else for a long time, so a special term felt necessary. 

  3. My former colleague wrote briefly about the disconnect between Sourcegraph’s single-tenant designs and its multi-tenant ambitions in his blog post: https://unknwon.io/posts/250910-babysitting-and-ai-agents/#why-now 

  4. https://sourcegraph.com/docs/cody/core-concepts/cody-gateway 

  5. A respected colleague was a Xoogler and advocated strongly for this, and it’s served us well (a few hiccups aside) 

  6. Not to be confused with the medical services plan of British Columbia, Canada 

  7. https://accounts.sourcegraph.com, designed by my colleague Joe 

  8. This was recently spun down in favour of Sourcegraph’s Amp efforts: https://sourcegraph.com/blog/changes-to-cody-free-pro-and-enterprise-starter-plans 

  9. We use a custom CDKTF binding generator built by a talented colleague: https://github.com/sourcegraph/cdktf-provider-gen 

  10. After building the initial Cody Gateway service and using its takeaways to kick off the MSP project, the Cody team took over further development and operation. 2 years later, in 2025, we finally managed to migrate Cody Gateway onto the infrastructure it inspired! 

  11. I did not, and still do not, have a good time with Notion’s API. Prior to adopting Notion, we used Markdown docsites: I frequently find myself wishing we were still using those. 

]]>
robert
Scaling Sourcegraph’s managed multi-single-tenant product2024-08-23T00:00:00+00:002024-08-23T00:00:00+00:00https://bobheadxi.dev/multi-single-tenantAs the customer base for Sourcegraph’s “multi-single tenant” Sourcegraph Cloud offering grew, I had the opportunity to join the team to scale out the platform to support the hundreds of instances the company aimed to reach - which it does today!

Sourcegraph’s first stab at a managed offering of our traditionally self-hosted, on-premises code search product started way back during my internship at Sourcegraph. Dubbed “managed instances”, this was a “multi-single tenant” product where each “instance” was a normal Sourcegraph deployment operated on isolated infrastructure managed by the company. A rushed implementation was built to serve the very small number of customers that were initially interested in a managed Sourcegraph offering.

Managed Sourcegraph instances proved to be a good model for customers and Sourcegraph engineers alike: customers did not need to deal with the hassle of managing infrastructure and upgrades, and Sourcegraph engineers had direct access to diagnose problems and ensure a smooth user experience. The multi-single-tenant model ensured customer data remained securely isolated.

The decision was made to invest more in the “managed instances” platform with the goal of bringing “Sourcegraph Cloud” to general availability, and eventually make it the preferred option for all customers onboarding to Sourcegraph. A team of talented engineers took over to build what was internally referred to as “Cloud V2”.

I’m pretty proud of the work I ended up doing on this project, the “Cloud control plane”, and am very happy to see what the project has enabled since I left the Sourcegraph Cloud team in September 2023. So I thought it might be cool to write a little bit about what we did!

The prototype

The first “managed instances” were managed by copy-pasting Terraform configuration and some basic VM setup scripts. I maintained and worked on this briefly before rejoining Sourcegraph as a full-time engineer on the Dev Experience team. Operating these first “managed instances” was a very manual ordeal: at a scale of less than a dozen instances, a fleet-wide upgrade would take several days of painstakingly performing blue-green deploys for each, copy-pasting Terraform configurations and applying them directly, one instance at a time. The only automation to speak of was some gnarly Terraform-rewriting scripts that I built using TypeScript and Comby to make the task marginally less painful, and even this was prone to breaking on any unexpected formatting of the hand-written Terraform configurations.

The state of the first “managed instances” was a necessary first step to quickly serve the customers that first asked for the offering, and validate that customers were willing to allow a small company like Sourcegraph to hold the keys to their private code and secret sauce. As the customer base grew, however, upgrades were needed.

Version 2

By the time I joined the newly formed “Cloud team” that had inherited the first “managed instances” platform, the sweeping upgrades that comprised “Cloud V2” had been built, and the migration was already underway. These upgrades, largely driven by the talented and knowledgeable Michael Lin, were sorely needed: they moved us to operating individual Sourcegraph instances with Kubernetes and Helm instead of docker-compose, and to leveraging off-the-shelf solutions like GCP Cloud SQL and Terraform Cloud to operate the prerequisite infrastructure. CDKTF was also adopted so that Terraform manifests could be generated using a Go program, instead of being hand-written. Each instance got a YAML configuration file that was used to generate Terraform with CDKTF based on the desired attributes, which all got committed to a centralised configuration repository. These upgrades were the pieces needed to kickstart the company’s transition to bring the Cloud platform to general availability and encourage customers to consider “managed Sourcegraph” as the preferred option to self-hosting.

This infrastructure was managed by a CLI tool we called mi2, based on its predecessor mi, which stood for “managed instances”. The tool was generally run by a human operator to perform operations on the fleet of instances by manipulating its infrastructure-as-code components, such as the aforementioned CDKTF and Kubernetes manifests, based on each instance’s YAML specification. It was also used to configure “runtime” invariants such as application configuration, also based on each instance’s YAML specification.

“Cloud V2” wasn’t the end of the planned upgrades, however: defining each instance as a YAML configuration was a hint at what Michael’s grand vision for the “Cloud V2” platform was: to treat instances as Kubernetes custom resources, and manage each instance with individual “instance specifications”, just like any other native Kubernetes resource. The design of the “Cloud V2” instance specifications also featured Kubernetes-like fields, such as spec and status, similar to native Kubernetes resources like Pods, for example:

In the Kubernetes API, Pods have both a specification and an actual status. The status for a Pod object consists of a set of Pod conditions.

In other words, each instance:

  • …was defined by its spec: the desired state and attributes. For example, the version of Sourcegraph, the domain the instance should be served on, or the number of replicas for a particular services it should have (for services that can’t scale automatically).
  • …reports its status: the actual deployed state, as well as details that are only known after deployment, such as randomly generated resource names or IP addresses. A difference between spec and status for attributes that are reflected in both would indicate that a configuration change has not been applied yet.
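Concretely, a drifted instance configuration might look something like this (the field names and values are illustrative, not the real schema):

```yaml
spec:
  sourcegraphVersion: "5.1.0"   # desired by the operator
  domain: customer.sourcegraph.com
status:
  sourcegraphVersion: "5.0.3"   # actually deployed: upgrade not applied yet
  externalIP: 34.120.0.12       # only known after deployment
```

Here the version mismatch between spec and status signals a pending upgrade, while the IP exists only in status because it cannot be known before deployment.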

When the team first launched “Cloud V2”, both spec and status were set in the configuration file, such that spec was generally handwritten, and status would be set and written to the file by the platform’s CLI, mi2. In addition, there were some generated Kustomize and Helm assets that also required a human to run some generation command with the mi2 CLI.

This meant that Git diffs representing a change usually had to be made after the changes had already been applied to GCP and other infrastructure, so that the status of the instance could be correctly reflected in the repository. This approach was error-prone and constantly caused drift between the repository state (where configurations were committed) and the actual state of an instance in our infrastructure. Because the changes between specification and status were closely intertwined, pull requests with updates usually required review, further adding latency to the drift between the actual status and the recorded status while left unmerged.

To complicate matters further, there were various other “runtime configurations” that were applied by hand using the mi2 tool. These were needed in scenarios where we did not have an infrastructure-as-code offering off-the-shelf, so we built ad-hoc automation to make API calls against various dependencies to make required configuration changes. This included configuration changes for Sourcegraph itself, and external dependencies like email delivery vendors1.

The key problems this situation posed were:

  1. The possibility of accidents and conflicts was very real. The consequences of mistakes were also very real, as we were highly reliant on customer trust that the service they paid for would be secure and reliable.
  2. The overhead required to operate the fleet, though much improved from the first “managed instances”, was still high: it was very unlikely the small team could handle a fleet size in the hundreds of instances with the tooling we had.
    1. To compound the problem further, instances had to be created and torn down on a frequent basis to enable customers to trial the product - this was partially automated but had to be manually triggered, and would frequently require intervention.
  3. We started relying heavily on GitHub Actions for automation. This worked well for simple processes like “create a specification from a template and run the necessary commands to apply it”, but the number of workflows grew, and some of them got very complex. These were difficult to test and prone to typos and conflicts due to the way our “Git ops” setup worked.

To enable the Cloud V2 platform to scale out to more customers reliably, we had to take it further. Michael and I started discussing our next steps in earnest sometime in January 2023. Together, we circulated 2 RFCs within the team: RFC 775: GitOps in Sourcegraph Cloud with Cloud control plane by myself, and RFC 774: Task Framework for Sourcegraph Cloud Orchestration by Michael.2

These two RFCs formed the key building blocks of the “Cloud control plane” project.

Taking things to the control plane

In my RFC, I drew this diagram to try and illustrate the desired architecture:

There’s a lot to unpack here, but the overall gist of the plan was:

  1. There would be no writing-to-the-repository by state changes. Operators (and operator-triggered automations) would commit changes to instances specifications (denoted by the blue box), and the required changes would somewhat opaquely happen in the “control plane”.
    • This would significantly reduce conflicts we were seeing in our existing Cloud infrastructure-as-code repository, because changes would now only occur in one direction when a change is merged, without needing to write back what changed to the repository.
  2. The platform would have a central “control plane”, denoted by the green boxes (“Cloud Manager” and “Tasks”).
    • “Tasks” are an internal abstraction for serverless jobs using Cloud Run. They allow us to run arbitrary tasks that mirror the mi2 commands a human operator would run today.
    • The “Cloud Manager” is the Kubernetes “controllers” service that would manage our Sourcegraph instances. We called it “manager” since that is the terminology used in kubebuilder - in the sense that a single manager service implements multiple “controllers”, and each controller owns the reconciliation of one Kubernetes custom resource type.
  3. We would continue to rely on off-the-shelf components, like existing dependencies on Terraform Cloud and Kubernetes + Helm (illustrated by the brown boxes).

In the central “control plane”, each instance specification would be “applied” as a custom resource in Kubernetes. This is enabled by kubebuilder, which makes it easy to write custom resource definitions (CRDs) and “controllers” for managing each custom resource type.

By defining a custom resource definition, operators can interact with the instance specifications via the Kubernetes API just like any other Kubernetes resource, including using kubectl. For example:

kubectl apply -f environments/dev/deployments/src-1234/config.yaml
kubectl get instance.cloud.sourcegraph.com/src-1234 -o yaml    

Would dump the custom resource from Kubernetes:

apiVersion: cloud.sourcegraph.com/v1
kind: Instance
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {...}
  creationTimestamp: "2023-01-24T00:19:35Z"
  generation: 1
  labels:
    instance-type: trial
  name: src-1234
spec:
  # ...
status:
  # ...

I proposed a design that would build on Michael’s “Tasks” abstraction by representing each “Task” type (for example, “apply these changes to the cluster” or “update the node pool to use another machine type”) with a “subresource” in the control plane. Each subresource would be another custom resource we define, and each subresource type’s sole job would be to detect whether the resources it owns need to be reconciled, and execute the required “Task” to bring the relevant resources to the desired state.

graph TD
  Instance --> InstanceTerraform
  Instance --> InstanceKubernetes
  Instance --> InstanceInvariants
  Instance --> UpgradeInstanceTask
  subgraph Subresources
    InstanceTerraform --> t1[Tasks] --> tfc[(Terraform Cloud)]
    InstanceKubernetes --> t2[Tasks] --> gke[(GKE)]
    InstanceInvariants --> t3[Tasks] --> src[(Misc. APIs)]
    UpgradeInstanceTask --> t4[Tasks] --> etc[(...)]
  end

In the diagram above, InstanceTerraform is one of our “subresource” types. It manages changes to an instance’s underlying infrastructure. For example, an infrastructure change would flow like this:

  1. Human operator updates the Instance spec to use a new machine type
  2. Instance controller propagates the change to instance’s child InstanceTerraform spec
  3. InstanceTerraform would detect that its current spec differs from the last known infrastructure state. It would then regenerate the updated Terraform manifests using CDKTF and apply them via Terraform Cloud using “Tasks”.
  4. Once the “Task” execution completes, InstanceTerraform would update its own status, which would be reflected by the Instance. This may cause cascading changes to apply to “subsequent” subresources that depend on the modified subresource.

Operators would rarely interact directly with these subresources - instead, they would only interact with the top-level Instance definition to request changes to the underlying infrastructure. Changes to the instance specification would automatically propagate to these subresources through the top-level Instance controller. Each subresource implemented an abstraction called “task driver” that generalised the ability for the top-level Instance controller to poll for completion or errors in a uniform manner.

Updated diagram adapted from my original RFC illustrating how the parent "Instance" controller creates child "subresources", which each own reconciling a specific component of an instance's state.

Reconciliation

This was a pretty new concept for me, though Kubernetes experts out there will probably find this familiar. The idea is to achieve “eventual consistency” by repeatedly “reconciling” an object until its observed state (status) matches its desired state (spec). I think the most relevant dictionary definition is:3

[…] make (one account) consistent with another, especially by allowing for transactions begun but not yet completed.

Each reconcile should be idempotent - the trigger of a reconciliation cannot be used to change its behaviour. The goal of Reconcile implementations should be to bring the actual state closer to the desired state. This means that you don’t need to do everything in a single reconcile: you can do one thing, and then requeue for an update - the next reconciliation should proceed to the next thing, and so on. There may be a difference between the actual state and the desired state for some time, but the system will eventually shift to the correct configuration.

For example, consider reconciling object O, where O.x and O.y are not yet in the desired state.

  1. Reconcile on object O. Fix O.x and requeue for another update immediately.
  2. Reconcile on object O (again). O.x is now fixed, so fix O.y and requeue for another update immediately (again).
  3. Reconcile on object O (again!). Everything is in the desired state! Do not requeue for update immediately, because all is now right in this (particular) world.

After the steps above, where O is reconciled several times, all attributes of O are now in the desired state. Nice!
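The loop above can be sketched in a few lines of Go. This is a toy illustration of the “do one thing, then requeue” pattern, not kubebuilder’s API - the types and names here are made up:

```go
package main

import "fmt"

// obj is a toy object with two attributes that may drift from the desired
// state (here, "desired" simply means both flags are true).
type obj struct{ x, y bool }

// reconcile performs at most one corrective action per call and reports
// whether a requeue is needed, mirroring the idempotent, incremental
// reconciliation pattern described above.
func reconcile(o *obj) (requeue bool) {
	switch {
	case !o.x:
		o.x = true // fix O.x, then ask to be requeued
		return true
	case !o.y:
		o.y = true // fix O.y, then ask to be requeued
		return true
	default:
		return false // everything is in the desired state - do not requeue
	}
}

func main() {
	o := &obj{}
	passes := 0
	for reconcile(o) {
		passes++
	}
	fmt.Println(passes, o.x, o.y) // 2 true true
}
```

Each pass fixes exactly one attribute, so the object converges after two corrective reconciles and the third pass observes a fully reconciled state.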

Writing the code

In Kubebuilder code terms (the SDK we use to build custom Kubernetes CRDs), reconciliations are effectively the Reconcile method of a controller implementation being called repeatedly on an object in the cluster. Reconcile implementations can get pretty long, however, even from examples I looked at from other projects. Using gocyclo to evaluate the “cyclomatic complexity” (a crude measure of “how many code paths are in this function”) of the top-level Instance controller today, we get a cyclomatic complexity score almost twice as high as the rule-of-thumb “good” score of 15:

$ gocyclo ./cmd/manager/controllers/instance_controller.go
31 controllers (*InstanceReconciler).Reconcile ./cmd/manager/controllers/instance_controller.go:107:1

Even with a cyclomatic complexity score of 31, this is already fairly abstracted, as a lot of the complicated reconciliation that needs to take place by executing and tracking “Tasks” is delegated to subresource controllers. The top-level Instance controller only handles interpreting what subresources need to be updated to bring the Cloud instance to the desired state.

To keep this complexity under control, I developed a pattern for making “sub-reconcilers”: using package functions <some resource>.Ensure, these mini reconcilers would accept a variety of interfaces, with a touch of generics, that help us reuse similar behaviour over many subresources. The largest of these is taskdriver.Ensure, which encapsulates most of the logic required to dispatch task executions, track their progress, and collect their output.

$ gocyclo ./cmd/manager/controllers/taskdriver/taskdriver.go
57 taskdriver Ensure ./cmd/manager/controllers/taskdriver/taskdriver.go:123:1

With a cyclomatic complexity score of 57, this implementation spans around 550 lines, and is covered by nearly 1000 lines of tests providing 72% coverage on taskdriver.Ensure - not bad for a component dealing extensively with integrations.

This investment in a robust, re-usable component has paid dividends: the abstraction serves 5 “subresources” today, each handling a different aspect of Cloud instance management, and generalises the implementation of:

  • Diff detection: During reconciliation you cannot (by design) refer to a “previous version” of your resource. taskdriver.Ensure handles detecting if a task execution has already been dispatched, and whether a new one needs to be dispatched for the current inputs.
  • Tracking Task executions: taskdriver.Ensure handles creating Task executions, tracking their status, and collecting their outputs across many reconciles. Notable events are tracked in “conditions”, an ephemeral state field that records the last N interesting events to a subresource.
Sequence of TaskDriver events as viewed in ArgoCD, from creation, to checking for completion, to detected completion.
  • Concurrency control: Subresources often need global concurrency management (to throttle the rate at which we hit external resources like Terraform Cloud) as well as per-instance concurrency management (e.g. an upgrade can’t happen at the same time as a kubectl apply). taskdriver.Ensure consumes a configurable concurrency controller that can be tweaked based on the workload.
  • Teardown and orphaned resource management: On deletion of a subresource, taskdriver.Ensure can handle “finalisation” of task resources, deleting past executions in GCP Cloud Run. This is most useful for one-time-use subresources like instance upgrades - over time, we can delete our records of past upgrades for an instance. taskdriver.Ensure has also since been extended to handle picking up and clearing orphaned Task executions.
  • Uniform observability: Logs and metrics emitted by taskdriver.Ensure allow our various subresources to be monitored the same way for alerting and debugging.

To illustrate how this works in code, because I like interfaces, here’s an abbreviated version of what the abstraction looks like:

// Object is the interface a CRD must implement for managing tasks with Ensure.
//
// Generally, each CRD should only operate one Task type.
type Object[S any] interface {
	object.Kubernetes

	// object.Specified implements the ability to retrieve the driver resource's
	// specification, which should be exactly the Task's explicit inputs.
	object.Specified[S]

	// taskdrivertypes.TaskDriver implements the ability to read condition events for Tasks.
	taskdrivertypes.TaskDriver

	// AddTaskCondition should add cond as the first element in conditions -
	// cond will be the latest condition. This interface is unique to
	// taskdriver.Object, as this package is the only place we should be adding
	// conditions.
	AddTaskCondition(cond cloudv1.TaskCondition)
}

// EnsureOptions denotes parameters for taskdriver.Ensure. All fields are required.
type EnsureOptions[
	// S is the type of subresource spec
	S any,
	// TD is the type of subresource that embeds the spec
	TD Object[S],
] struct {
	// Task is the type of task runs to operate over.
	Task task.Task
	// OwnerName is used when acquiring locks, and should denote the name of the
	// owner of Resource.
	OwnerName string
	// Resource is the resource that drives tasks runs of this type, changes to
	// the generation (spec) of which should drive a re-run of this
	// reconciliation task.
	Resource TD
	// Events must be provided to record events on Resource.
	Events events.Recorder
  // ...
}

// Ensure creates a reconciliation task run if there isn't one known in
// conditions, or retrieves its status. Both return values may be nil if the
// task is in progress with no error and no result.
//
// The caller MUST call handle.Update on resource if *result.Combined is not nil.
// The caller MUST apply a Status().Update() on resource if a result is returned.
func Ensure[SpecT any, TD Object[SpecT]](
	ctx context.Context,
	logger log.Logger,
	runs task.RunProvider,
	limiter concurrency.Checker,
	opts EnsureOptions[SpecT, TD],
) (_ any, _ result.ObjectUpdate, res *result.Combined) {
  // ...
}

The big hodgepodge of interfaces allows us to do a few things:

  1. Easy mocking in tests: Integration components can easily be provided as mock implementations for robust testing of every aspect of the taskdriver.Ensure lifecycle, which is pretty important given the complexity and business-critical nature of this one function. The taskdriver.Ensure test spans 20+ cases over 1000+ lines of assertions.
  2. Composable interfaces: In other parts of the codebase, we will leverage small parts of a complex implementation to do other sorts of work. For example, taskdrivertypes.TaskDriver indicates that it exposes interfaces for reading a task driver’s conditions - this is a critical part of taskdriver.Ensure, but is also useful for summarization capabilities elsewhere.
  3. Clearly express dependencies: It doesn’t matter too much what a “task run” really means in the context of taskdriver.Ensure, but it’s important to understand that the implementation needs to be able to dispatch runs and check on their status. For that we accept a task.RunProvider, and similarly, we accept a concurrency.Checker, and so on.
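To make the mocking and dependency points concrete, here is a stdlib-only sketch of the shape of this design. The interface and function names echo task.RunProvider, concurrency.Checker, and taskdriver.Ensure from the text, but the signatures and logic are invented for illustration - the real implementation is far more involved:

```go
package main

import "fmt"

// RunProvider abstracts dispatching task runs and checking their status.
// Callers depend only on these two methods, so tests can substitute a mock.
type RunProvider interface {
	Dispatch(input string) (runID string, err error)
	Status(runID string) (done bool, err error)
}

// Checker abstracts concurrency admission control.
type Checker interface {
	Acquire(owner string) bool
}

// ensure sketches the outer shape of the pattern: check concurrency limits,
// then dispatch a task run for the given input.
func ensure(runs RunProvider, limiter Checker, owner, input string) (string, error) {
	if !limiter.Acquire(owner) {
		return "", fmt.Errorf("throttled: %s", owner)
	}
	return runs.Dispatch(input)
}

// mockRuns is a trivial in-memory RunProvider for tests.
type mockRuns struct{ nextID int }

func (m *mockRuns) Dispatch(input string) (string, error) {
	m.nextID++
	return fmt.Sprintf("run-%d", m.nextID), nil
}

func (m *mockRuns) Status(runID string) (bool, error) { return true, nil }

// allowAll and denyAll are mock Checkers covering both admission outcomes.
type allowAll struct{}

func (allowAll) Acquire(string) bool { return true }

type denyAll struct{}

func (denyAll) Acquire(string) bool { return false }

func main() {
	id, err := ensure(&mockRuns{}, allowAll{}, "src-1234", "apply")
	fmt.Println(id, err) // run-1 <nil>

	_, err = ensure(&mockRuns{}, denyAll{}, "src-1234", "apply")
	fmt.Println(err) // throttled: src-1234
}
```

Because ensure only sees small interfaces, every integration point can be swapped for a mock in tests, and the function’s dependencies are visible at a glance in its signature.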

An abbreviated version of the callsite, a particular subresource’s reconciler, would then look like:

// Reconcile is part of the main kubernetes reconciliation loop which aims to
// move the current state of an upgrade instance task closer to the desired state.
//
// For more details, check Reconcile and its Result here:
// - https://pkg.go.dev/sigs.k8s.io/[email protected]/pkg/reconcile
func (r *UpgradeInstanceTaskReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrlResult ctrl.Result, err error) {
	// Get the resource being reconciled
	var resource cloudv1.UpgradeInstanceTask
	if err := r.Get(ctx, req.NamespacedName, &resource); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Find the parent resource.
	instance, logger, err := taskdriver.MustResolveOwner(ctx, logger, r.Client, &resource)
	if err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Set up task execution. Upgrades are immutable task drivers, so we use
	// resource.GetName() for convenience, since our name is unique.
	runProvider, err := r.TaskRunProvider(ctx, logger, *instance, resource.GetName())
	if err != nil {
		return ctrl.Result{}, err
	}
	defer runProvider.Close()

	// Run the taskdriver loop
	_, update, resultErr := taskdriver.Ensure(ctx, logger, runProvider,
		concurrency.NewSubresourceChecker(logger, r.Client, instance.GetName(), &resource,
			// Low concurrency - we are heavily limited by TFC
			concurrency.WithGlobalTypeConcurrency(...)),
		UpgradeInstanceTaskEnsureTaskOptions{
			Task:      upgradeinstance.Task,
			OwnerName: instance.Name,
			Resource:  &resource,
			Events:    r.Events,
		})
	if resultErr != nil {
		update.Handle(ctx, logger, r.Client, &resource)
		return resultErr.Handle(logger, "EnsureTask")
	}

	return ctrl.Result{}, r.Status().Update(ctx, &resource)
}

This allows the system to be easily extended to accommodate more types of subresources to handle different tasks, allowing implementors to focus on the Task execution that gets the work done, before plugging it into the control plane with a fairly small integration surface.

Control plane lifecycle summary

Putting it all together, here’s a diagram I wrote up for some internal dev docs showing the lifecycle of a change a human operator might make to an instance:

sequenceDiagram
    participant Instance
    participant SubResource1
    participant SubResource2
    participant task.RunProvider
    Note right of Instance: Instance spec is updated
    loop InstanceController.Reconcile
      activate Instance
      Note right of Instance: Instance is continuously queued<br/>for reconciliation based on updates,<br/> requeues, or SubResource updates
      alt subresource.Ensure: SubResource1 needs update
        Note right of Instance: Updates are determined by generating<br/>desired SubResource spec and diffing<br/>against the current SubResource spec
        Instance->>SubResource1: Apply updated SubResource1 spec
        activate SubResource1
        loop SubResource1.Reconcile
          Note left of SubResource1: Spec update triggers<br/>SubResource1 reconciliation
          alt taskdriver.Ensure: Task input does not match spec
            Note right of SubResource1: Input diffs are identified<br/>by recording the subresource<br/>generation and a hash of<br/>the annotations provided
            SubResource1->>task.RunProvider: Create new Task execution
            activate task.RunProvider
            SubResource1->>SubResource1: Update status conditions
            Note right of SubResource1: Conditions record execution<br/>state and metadata
            SubResource1->>SubResource1: Requeue for reconcile
          else Task input matches spec
            SubResource1->>task.RunProvider: Is task still running?
            deactivate task.RunProvider
            task.RunProvider-->>SubResource1: Update Status with result
            SubResource1->>SubResource1: Update status conditions
            deactivate SubResource1
            Note left of SubResource1: Instance owns SubResource1,<br/>so a status update will queue<br/>an Instance reconciliation
          end
        end
      else SubResource1 up-to-date
        Instance->>SubResource1: Is SubResource1 ready?
        Note right of Instance: We determine readiness based on the<br/>subresource status and conditions
        SubResource1-->>Instance: Update Status with result
        Note right of Instance: We do not proceed to next SubResource<br/>unless the previous is ready
        alt subresource.Ensure: SubResource2 needs update
          Instance->>SubResource2: Apply updated SubResource2 spec
          Note right of Instance: Not every Instance spec change will<br/>cause every SubResource to update,<br/>because SubResources are subsets<br/>of Instance spec
          activate SubResource2
          loop SubResource2.Reconcile
            alt taskdriver.Ensure: Task input does not match spec
              SubResource2->>task.RunProvider: Create new Task execution
              activate task.RunProvider
              SubResource2->>SubResource2: Update status conditions
              SubResource2->>SubResource2: Requeue for reconcile
            else Task input matches spec
              SubResource2->>task.RunProvider: Is task still running?
              deactivate task.RunProvider
              task.RunProvider-->>SubResource2: Update Status with result
              SubResource2->>SubResource2: Update status conditions
              deactivate SubResource2
            end
          end
        else SubResource2 up-to-date
          Instance->>SubResource2: Is SubResource2 ready?
          SubResource2-->>Instance: Update Status with result
        end
      end
      Note right of Instance: Full reconcile complete!
      deactivate Instance
    end

I don’t know if that helps much, but I think it looks nice!

The future

Sadly, I no longer work on the Sourcegraph Cloud platform, but since its launch, this system has delivered on our goals: today, the Cloud control plane operates over 150 completely isolated single-tenant Sourcegraph instances - nearly double the size of the fleet when we started this project - with a core team of just 2 to 3 engineers.

The Cloud control plane has also proven extensible: I’ve seen some pretty nifty extensions built since I departed the project, like an automatic disk resizer and “ephemeral instances”, which can be used internally to deploy a branch of a Sourcegraph codebase to a temporary Cloud instance with just a few commands. Various features have also been added to accommodate scaling needs and specific customer requirements.

The rollout of the Cloud control plane, and the adoption of Cloud by customers, have battle-tested the platform, and a lot of work has been done to cover more edge cases and improve its resilience. There are also DX improvements, such as robust support for our internal concepts in ArgoCD, allowing health and progress summaries to be surfaced in a friendly interface:

Note the parent resource ("instance") and the subresources it owns ("instanceinvariants", "instancekubernetes", and friends).

The design of the Cloud control plane has allowed all these additions to be built in a sustainable fashion for the small Cloud team that operates it. The core concepts we initially designed for the Cloud control plane have largely remained intact, which is a relief for sure. I’m very excited to see where else the team goes with the Sourcegraph Cloud offering, both internally and externally!



  1. I built and launched this (“managed SMTP”), which configured an external email vendor automatically so that Cloud instances could start sending emails “off the shelf”. 

  2. RFCs at Sourcegraph used to be primarily published as public Google Documents. This has become a bit rarer over the years, but hopefully this link doesn’t stop working! 

  3. I just found this with a Google search - the provided definition should have a permalink here. 

]]>
robert
Investing in the development of the developer experience2022-10-10T00:00:00+00:002022-10-10T00:00:00+00:00https://bobheadxi.dev/investing-in-development-of-devxAt Sourcegraph we have a developer tool called sg, which has become the way we ensure the development of tooling continues to scale at Sourcegraph. But why invest in ensuring contributions to your dev tooling scale?

Imagine you’re developing a sizable application spanning multiple services - say, a code search and code intelligence platform like Sourcegraph. You’ll want to be able to spin up everything to some degree locally to help you experiment.

So you pick up an off-the-shelf tool like goreman, a Procfile runner we used to use - but this could be any tool, really, like docker-compose or something else. A tool like this usually takes a bit of configuration, but it works well enough to start off!

goreman -f dev/Procfile

Inevitably you add a few layers of configuration specific to your project for your tool of choice:

export SRC_LOG_LEVEL=${SRC_LOG_LEVEL:-info}
export SRC_LOG_FORMAT=${SRC_LOG_FORMAT:-condensed}

goreman --set-ports=false --exit-on-error -f dev/Procfile

This ends up going in a script or Makefile, to encode this setup as the de-facto way of running things that you can share with your team.

Then you realise your tool doesn’t have hot-reloading, or some other feature, which you end up writing some automation for.

Your little start script ends up with several hundred lines of configuration options, which you can only discover by reading it, and alongside that you have dozens of scripts that do various dev-related tasks:

  • running adjacent services,
  • generating code,
  • or running linters,
  • or just parts of linters in particular ways in CI,
  • or combining scripts and configuring them in mysterious ways…

This eventually leads to a frustrating and brittle developer experience.

It’s nearly impossible to find out which development tasks I can run. It’s really hard to run them standalone without knowing about some global state they depend on. It’s really hard to extend these, because who knows which global state might influence them or depend on their global state.

— Thorsten Ball, RFC 348: Lack of conventions

It became hard to find out what tooling was available, how each script was configured, and how to extend them and add to them - hindering progress in our tooling.

That’s why we started sg, Sourcegraph’s developer tool, to become the centralised home for all development tasks.

sg started as a single command to run Sourcegraph locally in March 2021 - today it features over 60 commands covering all sorts of functionality and utilities that you might need throughout your development lifecycle:

  • dev environment setup
  • linters
  • RFC/ADR browser
  • migrations tooling
  • CI status checker, flakes investigation tooling, etc.
  • monitoring tooling
  • and more!

The tool is built in Go, and thus has the usual good Go stuff - it’s self-contained and portable, which makes self-updating easy to build in. Installation is a simple one-liner, making sg very easy to distribute to teammates:

curl --proto '=https' --tlsv1.2 -sSLf https://install.sg.dev | sh

Introducing Go also enables more powerful, type-safe programming on top of just running commands - programming that is trickier to do in Bash, where you need to account for a more limited syntax and variants of unix commands and so on.

Using a CLI library with commands to represent tasks effectively encodes the available scripts in a powerful structured format, making documentation and configuration options easier to discover and access:

dbCommand = &cli.Command{
    Name:  "db",
    Usage: "Interact with local Sourcegraph databases for development",
    UsageText: `
# Reset the Sourcegraph 'frontend' database
sg db reset-pg

# Reset the 'frontend' and 'codeintel' databases
sg db reset-pg -db=frontend,codeintel

# Reset all databases ('frontend', 'codeintel', 'codeinsights')
sg db reset-pg -db=all

# Reset the redis database
sg db reset-redis

# Create a site-admin user whose email and password are [email protected] and sourcegraph.
sg db add-user -name=foo
`,
    Category: CategoryDev,
    Subcommands: []*cli.Command{
        {
            Name:        "reset-pg",
            Usage:       "Drops, recreates and migrates the specified Sourcegraph database",
            Description: `If -db is not set, then the "frontend" database is used (what's set as PGDATABASE in env or the sg.config.yaml). If -db is set to "all" then all databases are reset and recreated.`,
            Flags: []cli.Flag{
                &cli.StringFlag{
                    Name:        "db",
                    Value:       db.DefaultDatabase.Name,
                    Usage:       "The target database instance.",
                    Destination: &dbDatabaseNameFlag,
                },
            },
            Action: dbResetPGExec,
        },
    },
}

But to make this kind of tool effective, you need more than just converting scripts into a Go program. In developing sg, I’ve noticed some patterns come up that I believe are crucial to its utility - tooling should:

Tooling should be approachable

Firstly, tooling should be approachable - easy to learn, and easy to discover. The goal is to abstract implementation details away behind a friendly, usable interface.

For example, with documentation, you might want to meet your users where they are, and provide options for learning - whether it be through complete single-page references in the browser, or directly in the command line.

A structured CLI makes all this easy to generate from a single source of truth so that your documentation is available everywhere and always up-to-date.

Using the tool should be intuitive - to help with this, you can provide usability features like autocompletions, which in sg is configured for you during installation. This makes it easy to figure out what you can do on the fly!

When developing new sg commands, adding custom completions is also easy for commands that have a fixed set of possible arguments:

	BashComplete: cliutil.CompleteOptions(func() (options []string) {
		config, _ := getConfig()
		if config == nil {
			return
		}
		for name := range config.Commands {
			options = append(options, name)
		}
		return
	}),

Tooling should work with your tools

Secondly, tooling should interoperate with your existing tools - one of sg’s goals is specifically to not become a build system or container orchestrator, but to provide a uniform and programmable layer on top of them that is specific to Sourcegraph’s needs.

Take sg start, the command that replaced the goreman setup we talked about earlier, for example. sg start just uses whatever tools each service normally uses to build, run, and update itself, and provides some additional features on top that are specific to how Sourcegraph works. A service configuration might look like:

  oss-frontend:
    cmd: .bin/oss-frontend
    install: |
      if [ -n "$DELVE" ]; then
        export GCFLAGS='all=-N -l'
      fi
      go build -gcflags="$GCFLAGS" -o .bin/oss-frontend github.com/sourcegraph/sourcegraph-public-snapshot/cmd/frontend
    checkBinary: .bin/oss-frontend
    env:
      CONFIGURATION_MODE: server
      USE_ENHANCED_LANGUAGE_DETECTION: false
      # frontend processes need this to be so that the paths to the assets are rendered correctly
      WEBPACK_DEV_SERVER: 1
    watch:
      - lib
      - internal
      - cmd/frontend

You’re not constrained to using sg start - you can run all these steps yourself still with tools of your choice, but sg start combines everything for you into tidied up output, complete with configuration, colours, hot-reloading, and everything you might need to start experimenting with your new features!

Tooling should codify standards

Lastly, tooling should codify standards. Automation and scripting encodes best practices that, when shared, builds on past learnings to provide a smooth experience for everyone.

Consider the typical process of setting up your development environment, we’ve all been there - a big page of things to install and set up in certain ways:

### Prerequisites

- Install `A`
- Configure the thing
- Install `B`
- Install `C` (but not that version!)

Instead, at Sourcegraph we have sg setup, which automatically figures out what’s missing on your machine…

…and sg will take the steps required to get you set up!

Programming these fixes enables us to standardise installations over time, automatically addressing issues teammates run into so that future teammates won’t have to.

For example, we can configure PATH for you, or make sure things are installed in the right place and configured in the appropriate manner - building on top of other tool managers like Homebrew and asdf to provide a smooth experience.
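The check-and-fix pattern behind this is simple to sketch. The following is a simplified stdlib-only illustration of the idea - sg’s real checks are far richer, and every name here is made up for the example:

```go
package main

import (
	"fmt"
	"os/exec"
)

// check pairs a detection step with an automated remediation.
type check struct {
	name string
	ok   func() bool
	fix  func() error
}

// runChecks reports each check, attempts fixes for failures, and returns the
// number of fixes that succeeded.
func runChecks(checks []check) (fixed int) {
	for _, c := range checks {
		if c.ok() {
			fmt.Printf("[ok]   %s\n", c.name)
			continue
		}
		fmt.Printf("[fail] %s - attempting fix\n", c.name)
		if err := c.fix(); err != nil {
			fmt.Printf("       fix failed: %v\n", err)
			continue
		}
		fixed++
	}
	return fixed
}

func main() {
	runChecks([]check{{
		// "hypothetical-tool" is a placeholder binary name that won't exist,
		// so this check deterministically fails and triggers the fix path.
		name: "hypothetical-tool on PATH",
		ok: func() bool {
			_, err := exec.LookPath("hypothetical-tool")
			return err == nil
		},
		fix: func() error {
			fmt.Println("       would run: brew install hypothetical-tool")
			return nil
		},
	}})
}
```

Once each environment requirement is expressed as a check with a programmatic fix, a single command can replace the whole “big page of prerequisites”.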

Wrap-up

Enabling the development of good tooling, scripting, and automation makes a difference. There’s a lot that can be done to improve how tooling is developed and maintained, like the ideas I’ve brought up in this post - we don’t have to settle for cryptic tooling everywhere!

If you’re interested in how all this is implemented, sg is open source - come check us out on GitHub!

Note - I had originally hoped to present this as a lightning talk at Gophercon Chicago 2022, but I was too late to queue up on the day of the presentations, so I figured might as well turn it into a post.


]]>
robert
Anatomy of a logger2022-05-21T00:00:00+00:002022-05-21T00:00:00+00:00https://bobheadxi.dev/anatomy-of-a-loggerZap is a structured logging library from Uber that is built on top of a “reflection-free, zero-allocation JSON encoder” to achieve some very impressive performance compared to other popular logging libraries for Go. As part of developing integrations for it at Sourcegraph, I thought I’d take the time to look at what goes on under the hood.

Logging seems like a simple thing that should be tangential to your application’s concerns - how complicated could writing some output be? Why bother making logging faster at all? The first item in Zap’s FAQ provides a brief explanation:

Of course, most applications won’t notice the impact of a slow logger: they already take tens or hundreds of milliseconds for each operation, so an extra millisecond doesn’t matter.

On the other hand, why not make structured logging fast? […] Across a fleet of Go microservices, making each application even slightly more efficient adds up quickly.

In my personal experience, I’ve seen logging cause some very real issues - a debug statement I left in a Sourcegraph service once caused a customer instance to stall completely!

Metrics indicated jobs were timing out, and a look at the logs revealed thousands upon thousands of lines of random comma-delimited numbers. It seemed that printing all this junk was causing the service to stall, and sure enough setting the log driver to none to disable all output on the relevant service allowed the sync to proceed and continue. […] At scale these entries could contain many thousands of entries, causing the system to degrade. Be careful what you log!

At Sourcegraph we currently use the cheekily named log15 logging library. Of course, a faster logger likely would not have prevented the above scenario from occurring (though we are in the process of migrating to our new Zap-based logger), but here’s a set of (very unscientific) profiles that compare a somewhat “average” scenario of logging 3 fields with 3 fields of existing context in JSON format to demonstrate just how differently Zap and log15 handle rendering a log entry behind the scenes:

const iters = 100000

var (
	thing1 = &thing{Field: "field1", Date: time.Now()}
	thing2 = &thing{Field: "field2", Date: time.Now()}
)

func profileZap(f *os.File) {
	// Create JSON format l with fields, normalised against log15 features
	cfg := zap.NewProductionConfig()
	cfg.Sampling = nil
	cfg.DisableCaller = true
	cfg.DisableStacktrace = true
	l, _ := cfg.Build()
	l = l.With(
		zap.String("1", "foobar"),
		zap.Int("2", 123),
		zap.Any("3", thing1),
	)

	// Start profile and log a lot
	pprof.StartCPUProfile(f)
	for i := 0; i < iters; i += 1 {
		l.Info("message",
			zap.String("4", "foobar"),
			zap.Int("5", 123),
			zap.Any("6", thing2),
		)
	}
	l.Sync()
	pprof.StopCPUProfile()
}

func profileLog15(f *os.File) {
	// Create JSON format l with fields
	l := log15.New(
		"1", "foobar",
		"2", 123,
		"3", thing1,
	)
	l.SetHandler(log15.StreamHandler(os.Stdout, log15.JsonFormat()))

	// Start profile and log a lot
	pprof.StartCPUProfile(f)
	for i := 0; i < iters; i += 1 {
		l.Info("message",
			"4", "foobar",
			"5", 123,
			"6", thing2,
		)
	}
	pprof.StopCPUProfile()
}

The resulting call graphs, generated using go tool pprof -prune_from=^os -png, with log15 on the left and Zap on the right:

Profiles showing CPU time spent throughout log calls, up until it reaches package os code where work begins to write data to disk - log15 is on the left, and zap is on the right. You might have to zoom in a bit.

Check out the pprof documentation on interpreting the call graph to learn more.

It is not immediately evident how the Zap logger is supposed to be better than the log15 logger: both finish running pretty quickly, have similar-looking call graphs, and ultimately have I/O as the major bottleneck (the big red os.(*File).write blocks). However, a closer look (like, really close - you gotta zoom all the way in!) reveals a key hint: both loggers spend enough time in their JSON encoding stages for the profiler to pick up, but the details of their JSON encoding differ somewhat:

  • log15 quickly delegates what appears to be the entire log entry to json.Marshal, which accounts for ~6ms.
  • Zap delegates fields to several different handlers: we see an AddString and an AddReflected, and only the latter ends up in the json library, accounting for just ~2ms. Presumably, it is handling certain fields differently than others, in some cases skipping encoding with the json library entirely!

Zap’s documentation provides a brief explanation of why delegating to json is an issue:

For applications that log in the hot path, reflection-based serialisation and string formatting are prohibitively expensive — they’re CPU-intensive and make many small allocations. Put differently, using encoding/json and fmt.Fprintf to log tons of interface{}s makes your application slow.

As a more scientific approach to demonstrating the benefits of Zap’s implementation, here’s a snapshot of the advertised benchmarks against some other popular libraries (as of v1.21.0), emphasis mine:

Log a message and 10 fields:

Package              Time         Time % to zap   Objects Allocated
:zap: zap            2900 ns/op   +0%             5 allocs/op
:zap: zap (sugared)  3475 ns/op   +20%            10 allocs/op
zerolog              10639 ns/op  +267%           32 allocs/op
go-kit               14434 ns/op  +398%           59 allocs/op
logrus               17104 ns/op  +490%           81 allocs/op
apex/log             32424 ns/op  +1018%          66 allocs/op
log15                33579 ns/op  +1058%          76 allocs/op

Log a message with a logger that already has 10 fields of context:

Package              Time         Time % to zap   Objects Allocated
:zap: zap            373 ns/op    +0%             0 allocs/op
:zap: zap (sugared)  452 ns/op    +21%            1 allocs/op
zerolog              288 ns/op    -23%            0 allocs/op
go-kit               11785 ns/op  +3060%          58 allocs/op
logrus               19629 ns/op  +5162%          70 allocs/op
log15                21866 ns/op  +5762%          72 allocs/op
apex/log             30890 ns/op  +8182%          55 allocs/op

In these scenarios, log15 can be a whopping 10 to 50 times slower - very cool! Evidently Zap’s approach has impressive results, and we know roughly what it doesn’t do to achieve this performance - but how does it work in practice?

A writer for log entries

The README suggests the following as the preferred way to create and start using a Zap logger, very similar to what I did when profiling logging calls earlier:

logger, _ := zap.NewProduction()
defer logger.Sync()

Internally, this takes a default, high-level configuration and builds a logger from it using the following components:

  • a zapcore.Core, which is constructed from:
    • a zapcore.Encoder
    • a zapcore.WriteSyncer (also referred to as a “sink”)
  • a bunch of Options

For brevity, let’s forget about the Options for now and focus on the first component: zapcore.Core, described as the real logging interface beneath Zap. The *zap.Logger built on top of it exports the more traditional logging methods like .Info(), .Warn(), and so on - a Core is the equivalent of an io.Writer, but for structured logging instead of generic output.

zapcore.Core splits the logging of a message, such as .Info("message", fields...), into the following distinct steps:

  1. Check: Check(Entry, *CheckedEntry) *CheckedEntry that determines if the message should be logged at all. This is where the traditional level filtering comes in (i.e. when you want to only log messages above a certain level, like discarding .Debug() messages), or discarding repeated messages through sampling.
    1. In this interface, we get a read-only Entry and a mutable *CheckedEntry that a core registers itself onto if it decides the given Entry should be logged.
  2. Write: Write(Entry, []Field) error, where the rendering of a log entry into the destination occurs.

In addition, we have distinct steps for:

  1. Adding fields to the logger (as opposed to just a specific entry): With([]Field) Core - this allows Core implementations to render fields once and avoid repeating that work for subsequent log entries. We’ll get to how this works later!
    1. It’s not noted on the interface documentation, but because of the above, the fields provided to With() are not provided to Write().
  2. Flushing output: Sync() error allows for buffering output and batching writes together, minimising instances of being bottlenecked by I/O, or allowing Core implementations to handle logs in an asynchronous manner.

We can see this in action in the default *zap.Logger implementation. Let’s check out the seemingly innocuous .Info() function:

func (log *Logger) Info(msg string, fields ...Field) {
	if ce := log.check(InfoLevel, msg); ce != nil {
		ce.Write(fields...)
	}
}

Check

First up we have log.check, a whopping 102-line function that implements the check step of writing a log entry: it constructs a zapcore.Entry and calls the core.Check function:

func (log *Logger) check(lvl zapcore.Level, msg string) *zapcore.CheckedEntry {
	// ... omitted for brevity

	// Create basic checked entry thru the core; this will be non-nil if the
	// log message will actually be written somewhere.
	ent := zapcore.Entry{
		LoggerName: log.name,
		Time:       log.clock.Now(),
		Level:      lvl,
		Message:    msg,
	}
	ce := log.core.Check(ent, nil)

	// ...

	return ce
}

Note that log.core.Check(ent, nil) is pretty elaborate here - we noted previously that in this function, Core implementations should register themselves on the second argument, CheckedEntry. How does that work if the CheckedEntry argument is a nil pointer? Taking a look at CheckedEntry.AddCore(), we can see the first hints of some pretty aggressive optimization:

// AddCore adds a Core that has agreed to log this CheckedEntry. It's intended to be
// used by Core.Check implementations, and is safe to call on nil CheckedEntry
// references.
func (ce *CheckedEntry) AddCore(ent Entry, core Core) *CheckedEntry {
	if ce == nil {
		ce = getCheckedEntry()
		ce.Entry = ent
	}
	ce.cores = append(ce.cores, core)
	return ce
}

var _cePool = sync.Pool{New: func() interface{} {
	// Pre-allocate some space for cores.
	return &CheckedEntry{
		cores: make([]Core, 4),
	}
}}

func getCheckedEntry() *CheckedEntry {
	ce := _cePool.Get().(*CheckedEntry)
	ce.reset()
	return ce
}

In short, CheckedEntry instances are created or reused on demand from a global sync.Pool - this way, if no cores register themselves to write an Entry, no CheckedEntry is ever allocated:

A Pool is a set of temporary objects that may be individually saved and retrieved […] Pool’s purpose is to cache allocated but unused items for later reuse, relieving pressure on the garbage collector. […] Pool provides a way to amortise allocation overhead across many clients.

If many log entries are written in a short time, allocated memory can be recycled by the Pool, which is faster than having the Go runtime constantly allocate new memory and garbage-collect unused CheckedEntry instances.

Write

Then we move on to the write step, done in ce.Write. This is where the *zapcore.CheckedEntry we mentioned before performs a write on all registered cores:

func (ce *CheckedEntry) Write(fields ...Field) {
	if ce == nil {
		return
	}

	// ... omitted for brevity

	var err error
	for i := range ce.cores {
		err = multierr.Append(err, ce.cores[i].Write(ce.Entry, fields))
	}

	// ...

	putCheckedEntry(ce)

	// ...
}

func putCheckedEntry(ce *CheckedEntry) {
	if ce == nil {
		return
	}
	_cePool.Put(ce)
}

Note the call to putCheckedEntry - after the entry has been written, it is no longer needed, and this call places it back into the pool for reuse. Nifty!

What gets sent into Write is still just an Entry and Fields, however - we’ve yet to see how our message ends up as text, which is where the performance gains are supposed to be.

Encoding and writing output

Looking back, we have two components that are used to create a Core earlier on: zapcore.Encoder and zapcore.WriteSyncer.

	log := New(
		zapcore.NewCore(enc, sink, cfg.Level),
		cfg.buildOptions(errSink)...,
	)

Encoder exports a function, EncodeEntry, that seems to mirror the signature of Core.Write, and also embeds the ObjectEncoder interface:

// Encoder is a format-agnostic interface for all log entry marshalers. Since
// log encoders don't need to support the same wide range of use cases as
// general-purpose marshalers, it's possible to make them faster and
// lower-allocation.
type Encoder interface {
	ObjectEncoder

	// EncodeEntry encodes an entry and fields, along with any accumulated
	// context, into a byte buffer and returns it. Any fields that are empty,
	// including fields on the `Entry` type, should be omitted.
	EncodeEntry(Entry, []Field) (*buffer.Buffer, error)

	// ...
}

In ObjectEncoder we see the promise of a “reflection-free, zero-allocation JSON encoder” in the form of a giant interface, shortened for brevity:

// ObjectEncoder is a strongly-typed, encoding-agnostic interface for adding a
// map- or struct-like object to the logging context. Like maps, ObjectEncoders
// aren't safe for concurrent use (though typical use shouldn't require locks).
type ObjectEncoder interface {
	// Logging-specific marshalers.
	AddObject(key string, marshaler ObjectMarshaler) error

	// Built-in types.
	AddBool(key string, value bool)
	AddDuration(key string, value time.Duration)
	AddInt(key string, value int)
	AddString(key, value string)
	AddTime(key string, value time.Time)

	// AddReflected uses reflection to serialise arbitrary objects, so it can be
	// slow and allocation-heavy.
	AddReflected(key string, value interface{}) error

	// ...
}

This seemingly crazy interface allows messages to be incrementally built in the desired format without ever hitting json.Marshal. For example, we can look at what the JSON encoder does to add a string field:

func (enc *jsonEncoder) AddString(key, val string) {
	enc.addKey(key)
	enc.AppendString(val)
}

We start with adding the key, then the value:

func (enc *jsonEncoder) addKey(key string) {
	enc.addElementSeparator()
	enc.buf.AppendByte('"')
	enc.safeAddString(key)
	enc.buf.AppendByte('"')
	enc.buf.AppendByte(':')
}

Reading this carefully, given a key you’ll end up with the following being added to enc.buf (a bytes buffer):

"key":
^ ^ ^^
| | ||
| | |└ AppendByte(':')
| | └ AppendByte('"')
| └ safeAddString(key)
└ AppendByte('"')

Presumably what comes next is a value, for example a string:

func (enc *jsonEncoder) AppendString(val string) {
	enc.addElementSeparator()
	enc.buf.AppendByte('"')
	enc.safeAddString(val)
	enc.buf.AppendByte('"')
}
"key":"val"
      ^ ^ ^
      | | |
      | | |
      | | └ AppendByte('"')
      | └ safeAddString(val)
      └ AppendByte('"')

Encoding the entire entry in EncodeEntry works similarly, with your typical JSON opening and closing braces being written first:

final.buf.AppendByte('{')

// ... render log entry

final.buf.AppendByte('}')
final.buf.AppendString(final.LineEnding)
{"key":"val"}\n
^           ^ ^
|           | └ AppendString(final.LineEnding)
|           └ AppendByte('}')
└ AppendByte('{')

Stepping back up a bit, we can now better understand how zapcore.Field works, again condensed for brevity:

type Field struct {
	Key       string
	Type      FieldType
	Integer   int64
	String    string
	Interface interface{}
}

func (f Field) AddTo(enc ObjectEncoder) {
	var err error
	switch f.Type {
	case ObjectMarshalerType:
		err = enc.AddObject(f.Key, f.Interface.(ObjectMarshaler))
	case BoolType:
		enc.AddBool(f.Key, f.Integer == 1)
	case DurationType:
		enc.AddDuration(f.Key, time.Duration(f.Integer))
	case StringType:
		enc.AddString(f.Key, f.String)
	case ReflectType:
		err = enc.AddReflected(f.Key, f.Interface)

	// ...
	}

	// ...
}

Here we can see that for most cases, when one creates a strongly typed field with e.g. zap.String(key string, val string) Field, Zap can track the type information and pass the Field directly to the most appropriate function on the underlying encoder. Together with the fact that the entire log message is constructed incrementally, this means that it’s possible for most log messages to never encounter the need to reflect or use the json package to serialise the message. Nifty! This explains why we spend less time in json in the profile at the start of this post - most of the log message can be serialised directly, except for one field:

l.Info("message",
	zap.String("4", "foobar"),
	zap.Int("5", 123),
	zap.Any("6", thing2), // this goes to AddReflected, which uses JSON to marshal the field
)

To get around this, we could implement ObjectMarshaler, which we saw on the ObjectEncoder interface previously (as the argument to AddObject). If implemented, we can serialise our object directly in an efficient manner:

type thing struct {
	Field string
	Date  time.Time
}

func (t *thing) MarshalLogObject(enc zapcore.ObjectEncoder) error {
	enc.AddString("Field", t.Field)
	enc.AddTime("Date", t.Date)
	return nil
}

We can re-run the profiling script from the start of the post to see that there’s no more usage of json!

Going back a bit, we can see that this also simplifies the encoding of fields added to the logger itself via the Core.With we saw earlier. Looking at the ioCore.With implementation, it immediately encodes the given fields:

func (c *ioCore) With(fields []Field) Core {
	clone := c.clone()
	for i := range fields {
		fields[i].AddTo(clone.enc) // encode onto the cloned core's encoder
	}
	return clone
}

EncodeEntry checks if there are fields already encoded, and adds the partial JSON into the message directly - no additional work needed.

tl;dr

Turns out, seemingly simple things can be kind of complicated! However, in this case the result is a neat exhibit of a variety of optimization techniques and a logging implementation that can outpace other libraries by an order of magnitude.

Zap’s design also provides some interesting ways to hook into its behaviour - Zap itself offers some examples, such as zaptest, which creates a logger with a custom Writer that sends output to Go’s standard testing library.

At Sourcegraph, our new Zap-based logger offers utilities to hook into our configured logger using Zap’s WrapCore API to assert against log output (mostly for testing the log library itself), partly built on the existing zaptest utilities. We’re also working on custom Core implementations to automatically send logged errors to Sentry, and we wrap Field constructors to define custom behaviours (we disallow importing directly from Zap for this reason). Pretty nifty to still have such a high degree of customizability in an implementation so focused on optimizations!

robert
Dynamic and stateless Kubernetes Jobs for stable CI

2022-04-18 · https://bobheadxi.dev/stateless-ci

Sourcegraph’s continuous integration infrastructure uses Buildkite, a platform for running pipelines on CI agents we operate. After using the default approach of scaling persistent agent deployments for a long time, we’ve recently switched over to completely stateless agents on dynamically dispatched Kubernetes Jobs to improve the stability of our CI pipelines.

In Buildkite, events (such as a push to a repository) trigger “builds” on a “pipeline”, each consisting of multiple “jobs” that correspond to “pipeline steps”. All of this is managed by the hosted Buildkite service, which dispatches Buildkite jobs onto any Buildkite agents live on our infrastructure that meet each job’s “queue” requirements.

Previously, our Buildkite agent fleet was operated as a simple Kubernetes Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: buildkite-agent
  # ...
spec:
  replicas: 5
  # ...
  template:
    metadata:
      # ...
    spec:
      containers:
        - name: buildkite-agent
          # ...

A separate deployment, running a custom service called buildkite-autoscaler, would poll the Buildkite API for a list of running and scheduled jobs and scale the fleet accordingly by making a Kubernetes API call to update the spec.replicas value in the base Deployment:

sequenceDiagram
    participant ba as buildkite-autoscaler
    participant k8s as Kubernetes
    participant bk as Buildkite

    loop
        ba->>bk: list running, pending jobs
        activate bk
        bk-->>ba: job queue counts
        deactivate bk

        activate ba
        ba->>ba: determine desired agent count

        ba->>k8s: get Deployment 
        deactivate ba
        activate k8s
        k8s-->>ba: active Deployment
        ba->>k8s: list Deployment Pods
        k8s-->>ba: active Pods
        deactivate k8s

        ba->>k8s: set spec.replicas to desired
    end

As long as there are jobs in the Buildkite queue, deployed agent pods would remain online until the autoscaler deems it appropriate to scale down. As such, multiple jobs could be dispatched onto the same agent before the fleet gets scaled down.

While Buildkite has mechanisms for mitigating state issues across jobs, and most Sourcegraph pipelines have cleanup steps and best practices for mitigating them as well, we occasionally still run into “botched” agents. These are particularly prevalent after jobs where tools are installed globally, or where Docker containers are started but not correctly cleaned up (for example, if directories are mounted), and so on. We’ve also had issues where certain pods encounter network issues, causing them to fail all the jobs they accept. We also have jobs that work “by accident”, especially in some of our more obscure repositories, where jobs rely on tools installed by other jobs and suddenly stop working if they land on a “fresh” agent, or if those tools get upgraded unexpectedly.

All of these issues eventually lead us to decide to build a stateless approach to running our Buildkite agents.

Preparing for the switch

The main Sourcegraph mono-repository, sourcegraph/sourcegraph, generates its Buildkite pipelines on the fly. Thanks to this, we could easily implement a flag within the generator to redirect builds to the new agents on a gradual basis.

var FeatureFlags = featureFlags{
	StatelessBuild: os.Getenv("CI_FEATURE_FLAG_STATELESS") == "true" ||
		// Roll out to 50% of builds
		rand.NewSource(time.Now().UnixNano()).Int63()%100 < 50,
}

This feature flag could be used to apply queue configuration and environment variables on builds, allowing us to easily test out larger loads on the new agents and roll back changes with ease.

Static Kubernetes Jobs

The initial approach undertaken by the team used a single persistent Kubernetes Job. Agents would start up with --disconnect-after-job, indicating that they should consume a single job from the queue and immediately disconnect.

A new autoscaler service, job-autoscaler, was set up that did pretty much the same thing as the old buildkite-autoscaler, except instead of adjusting spec.replicas, it updated spec.parallelism, setting spec.completions and spec.backoffLimit to arbitrarily large values to prevent the Job from ever completing and shutting down.

This initial approach was used to iterate on some refinements to our pipelines to accommodate stateless agents (namely improved caching of resources). Upon rolling this out on a larger scale, however, we immediately ran into issues resulting in major CI outages, after which I outlined my thoughts in sourcegraph#32843 dev/ci: stateless autoscaler: investigate revamped approach with dynamic jobs. It turns out, we probably should not be applying a stateful management approach (scaling a single Job entity up and down) to what should probably be a stateless queue processing mechanism. I decided to take point on re-implementing our approach.

Dynamic Kubernetes Jobs

In sourcegraph#32843 I proposed an approach where we dispatch agents by creating new Kubernetes Jobs with spec.parallelism and spec.completions set to roughly the number of agents needed to process all the jobs within the Buildkite job queue. This means that as soon as all the agents within a dispatched Job are “consumed” (have processed a Buildkite job and exited), Kubernetes can clean up the Job and related resources, and that would be that. If more agents are needed, we simply keep dispatching more Jobs. This is done by a new service called buildkite-job-dispatcher.

Luckily, all the setup for stateless agents had already been done with the existing Buildkite Job, so the dispatcher works by fetching the deployed Job and resetting a variety of fields used internally by Kubernetes:

  • in metadata: UID, resource version, and labels
  • in the Job spec: selector and template.metadata.labels

Making a few changes:

  • setting parallelism = completions = number of jobs in queue + buffer
    • this means that we are dispatching agents to consume the queue, and exit when done
  • setting activeDeadlineSeconds, ttlSecondsAfterFinished to reasonable values
    • activeDeadlineSeconds prevents stale agents from sitting around for too long in case, for example, a build gets cancelled
    • ttlSecondsAfterFinished ensures resources are freed after use
  • adjusting the BUILDKITE_AGENT_TAGS environment variable on the Buildkite agent container

And deploying the adjusted spec as a new Job!

sequenceDiagram
    participant ba as buildkite-job-dispatcher
    participant k8s as Kubernetes
    participant bk as Buildkite
    participant gh as GitHub

    loop
      gh->>bk: enqueue jobs
      activate bk

      ba->>bk: list queued jobs and total agents
      bk-->>ba: queued jobs, total agents

      activate ba
      ba->>ba: determine required agents 
      alt queue needs agents
        ba->>k8s: get template Job
        activate k8s
        k8s-->>ba: template Job
        deactivate k8s

        ba->>ba: modify Job template

        ba->>k8s: dispatch new Job
        activate k8s
        k8s->>bk: register agents
        bk-->>k8s: assign jobs to agents

        loop while % of Pods not online or completed
          par deployed agents process jobs
            k8s-->>bk: report completed jobs
            bk-->>gh: report pipeline status
            deactivate bk
          and check previous dispatch
            ba->>k8s: list Pods from dispatched Job
            k8s-->>ba: Pods states
          end
        end
      end
      deactivate ba

      k8s->>k8s: Clean up completed Jobs

      deactivate k8s
    end

As noted in the diagram above, there’s also a “cooldown” mechanism where the dispatcher waits for the previous dispatch to roll out at least partially before dispatching a new Job to account for delays in our infrastructure. Without it, the dispatcher could continuously create new agents as the visible agent count appears low, leading to overprovisioning. We do this by simply listing the Pods associated with the most recently dispatched Job, which is easy enough to track within the dispatcher.

Observability

buildkite-job-dispatcher runs on a loop, with each run associated with a dispatchID, a simplified UUID with all special characters removed. Everything that happens within a dispatch iteration is associated with this ID, starting with log entries, built on go.uber.org/zap:

import "go.uber.org/zap"

func (d *Dispatcher) run(ctx context.Context, k8sClient *k8s.Client, dispatchID string) error {
	// Allows us to key in on a specific dispatch run when looking at logs
	runLog := d.log.With(zap.String("dispatchID", dispatchID))
	runLog.Debug("start run", zap.Any("config", config))
	// {"msg":"start run","dispatchID":"...","config":{...}}
}

Dispatched agents have the dispatch ID attached to their name and labels as well:

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    description: Stateless Buildkite agents for running CI builds.
    kubectl.kubernetes.io/last-applied-configuration: # ...
  creationTimestamp: "2022-04-18T00:04:34Z"
  labels:
    app: buildkite-agent-stateless
    dispatch.id: 3506b2adb17945d7b690bd5f9e6a6fb0
    dispatch.queues: stateless_standard_default_job

This means that when something unexpected happens - for example, when agents are underprovisioned or overprovisioned - we can easily look at the Jobs dispatched and link back to the log entries associated with their creation.

The dispatcher’s structured logs also allow us to leverage Google Cloud’s log-based metrics by generating metrics from numeric fields within log entries. These metrics form the basis for our at-a-glance overview dashboard of the state of our Buildkite agent fleet and how the dispatcher is responding to demand, as well as alerting for potential issues (for example, if Jobs are taking too long to roll out).

Based on these metrics, we can make adjustments to the numerous knobs available for fine-tuning the behaviour of the dispatcher: target minimum and maximum agents, the frequency of polling, the ratio of agents to require to come online before starting a new dispatch, agent TTLs, and more.

Git mirror caches

During the initial stateless agent implementation, my teammates @jhchabran and @davejrt developed some nifty mechanisms for caching asdf (a tool management tool) and Yarn dependencies. These use a Buildkite plugin for caching under the hood, and expose a simple API for use with Sourcegraph’s generated pipelines:

func withYarnCache() buildkite.StepOpt {
	return buildkite.Cache(&buildkite.CacheOptions{
		ID:          "node_modules",
		Key:         "cache-node_modules-{{ checksum 'yarn.lock' }}",
		RestoreKeys: []string{"cache-node_modules-{{ checksum 'yarn.lock' }}"},
		Paths:       []string{"node_modules", /* ... */},
		Compress:    false,
	})
}
func addPrettier(pipeline *bk.Pipeline) {
	pipeline.AddStep(":lipstick: Prettier",
		withYarnCache(),
		bk.Cmd("dev/ci/yarn-run.sh format:check"))
}

A lingering problem continued to be the initial clone step, however, especially in the main sourcegraph/sourcegraph monorepo, which can take upwards of 30 seconds to perform a shallow clone. We can’t entirely depend on shallow clones either, since our pipeline generator depends on performing diffs against our main branch to determine how to construct a pipeline. This is especially painful for short steps, where the time to run a linter check might be around the same amount of time it takes to perform a clone.

Buildkite supports a feature that allows all jobs on a single host to share a single git clone, using git clone --mirror. Subsequent clones after the initial clone can leverage the mirror repository with git clone --reference:

If the reference repository is on the local machine, […] obtain objects from the reference repository. Using an already existing repository as an alternate will require fewer objects to be copied from the repository being cloned, reducing network and local storage costs.

On our old stateful agents, this meant that while some jobs could take the same 30 seconds to clone the repository, most jobs that landed on “warm” agents would see a much faster clone time - roughly 5 seconds.

To recreate this feature on our stateless agents, I created a daily cron job that:

  1. Creates a disk in Google Cloud, with gcloud compute disks create buildkite-git-references-"$BUILDKITE_BUILD_NUMBER"
  2. Deploys a Kubernetes PersistentVolume and PersistentVolumeClaim corresponding to the new disk
  3. Deploys a Kubernetes Job that mounts the generated PersistentVolumeClaim and creates a clone mirror
  4. Updates the PersistentVolumeClaim to be labelled state: ready

We generate resources to deploy using envsubst <$TEMPLATE >$GENERATED on a template spec. For example, the PersistentVolume template spec looks like:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: buildkite-git-references-$BUILDKITE_BUILD_NUMBER
  namespace: buildkite
  labels:
    deploy: buildkite
    for: buildkite-git-references
    state: $PV_STATE
    id: '$BUILDKITE_BUILD_NUMBER'
spec:
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  claimRef:
    name: buildkite-git-references-$BUILDKITE_BUILD_NUMBER
    namespace: buildkite
  gcePersistentDisk:
    fsType: ext4
    # the disk we created with 'gcloud compute disks create'
    pdName: buildkite-git-references-$BUILDKITE_BUILD_NUMBER
  capacity:
    storage: 16G
  persistentVolumeReclaimPolicy: Delete
  storageClassName: buildkite-git-references

PersistentVolumes are created with accessModes: [ReadWriteOnce, ReadOnlyMany] - the idea is that we mount the volume as ReadWriteOnce to populate the disk with a mirror repository, before allowing all our agents to mount it as ReadOnlyMany:

apiVersion: batch/v1
kind: Job
metadata:
  name: buildkite-git-references-populate
  namespace: buildkite
  annotations:
    description: Populates the latest buildkite-git-references disk with data.
spec:
  parallelism: 1
  completions: 1
  ttlSecondsAfterFinished: 240 # allow us to fetch logs
  template:
    metadata:
      labels:
        app: buildkite-git-references-populate
    spec:
      containers:
        - name: populate-references
          image: alpine/git:v2.32.0
          imagePullPolicy: IfNotPresent
          command: ['/bin/sh']
          args:
            - '-c'
            # Format:
            # git clone git@github.com:sourcegraph/$REPO /buildkite-git-references/$REPO.reference;
            - |
              mkdir /root/.ssh; cp /buildkite/.ssh/* /root/.ssh/;
              git clone git@github.com:sourcegraph/sourcegraph.git \
                /buildkite-git-references/sourcegraph.reference;
              echo 'Done';
          volumeMounts:
            - mountPath: /buildkite-git-references
              name: buildkite-git-references
      restartPolicy: OnFailure
      volumes:
        - name: buildkite-git-references
          persistentVolumeClaim:
            claimName: buildkite-git-references-$BUILDKITE_BUILD_NUMBER

The buildkite-job-dispatcher can now simply list all the available PersistentVolumeClaims that are ready:

var gitReferencesPVC *corev1.PersistentVolumeClaim
var listGitReferencesPVCs corev1.PersistentVolumeClaimList
if err := k8sClient.List(ctx, config.TemplateJobNamespace, &listGitReferencesPVCs,
  k8s.QueryParam("labelSelector", "state=ready,for=buildkite-git-references"),
); err != nil {
  runLog.Error("failed to fetch buildkite-git-references PVCs", zap.Error(err))
} else {
  gitReferencesPVCs := PersistentVolumeClaims(listGitReferencesPVCs.GetItems())
  pvcCount := zapMetric("pvcs", len(gitReferencesPVCs))
  if len(gitReferencesPVCs) > 0 {
    sort.Sort(gitReferencesPVCs)
    gitReferencesPVC = gitReferencesPVCs[0]
  } else {
    runLog.Warn("no buildkite-git-references PVCs found", pvcCount)
  }
}

And apply it to the agent Jobs we dispatch:

if gitReferencePVC != nil {
  job.Spec.Template.GetSpec().Volumes = append(job.Spec.Template.GetSpec().GetVolumes(),
    &corev1.Volume{
      Name: stringPtr("buildkite-git-references"),
      VolumeSource: &corev1.VolumeSource{
        PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
          ClaimName: gitReferencesPVC.GetMetadata().Name,
          ReadOnly:  boolPtr(true),
        },
      },
    })
  agentContainer.VolumeMounts = append(agentContainer.GetVolumeMounts(),
    &corev1.VolumeMount{
      Name:      stringPtr("buildkite-git-references"),
      ReadOnly:  boolPtr(true),
      MountPath: stringPtr("/buildkite-git-references"),
    })
}

And that’s it! We now have repository clone times that are consistently within the 3-7 second range, depending on how much your branch has diverged from main. As new disks become available, newly dispatched agents will automatically leverage more up-to-date mirror repositories.

Within the same daily cron job that deploys these disks, we can also prune disks that are no longer used by any agents:

kubectl describe pvc -l for=buildkite-git-references,id!="$BUILDKITE_BUILD_NUMBER" |
  grep -E "^Name:.*$|^Used By:.*$" | grep -B 2 "<none>" | grep -E "^Name:.*$" |
  awk '$2 {print$2}' |
  while read -r vol; do kubectl delete pvc/"${vol}" --wait=false; done

Interestingly enough, there is no way to easily detect if a PersistentVolumeClaim is completely unused. We can detect unbound disks easily, but that doesn’t mean the same thing - in this setup PersistentVolumes are always bound to a claim, regardless of whether that PersistentVolumeClaim is actually in use by a pod. kubectl describe has this information though1, which is what the above script (based on this StackOverflow answer) uses.
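If that pruning logic ever needs to move out of shell, the same "Used By: &lt;none&gt;" filtering is easy to sketch in Go - this is a hypothetical helper operating on captured kubectl describe output, not part of the real dispatcher (and note that real describe output can list multiple pods on continuation lines, which this sketch ignores):

```go
package main

import (
	"fmt"
	"strings"
)

// unusedPVCs scans `kubectl describe pvc` output and returns the names of
// claims whose "Used By:" field is "<none>", i.e. claims no pod mounts.
func unusedPVCs(describeOutput string) []string {
	var unused []string
	var current string
	for _, line := range strings.Split(describeOutput, "\n") {
		switch {
		case strings.HasPrefix(line, "Name:"):
			// Remember the claim name until we see its "Used By:" field.
			current = strings.TrimSpace(strings.TrimPrefix(line, "Name:"))
		case strings.HasPrefix(line, "Used By:"):
			if strings.TrimSpace(strings.TrimPrefix(line, "Used By:")) == "<none>" && current != "" {
				unused = append(unused, current)
			}
		}
	}
	return unused
}

func main() {
	out := `Name:          buildkite-git-references-100
Used By:       agent-abc123
Name:          buildkite-git-references-99
Used By:       <none>`
	fmt.Println(unusedPVCs(out)) // → [buildkite-git-references-99]
}
```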

Stateless agents

So far, we have already seen a drastic reduction in tool-related flakes in CI, and the switch to stateless agents has helped us maintain confidence that lingering issues are not caused by botched state or poor isolation. There are probably other mechanisms for maintaining isolation between builds, but for our case this one seemed to have the easiest migration path.



  1. A quick Sourcegraph search for "Used By" quickly reveals this line as the source of the output. A custom getPodsForPVC is the source of the pods listed here, and looking for references reveals that no kubectl command exposes this functionality except kubectl describe, so lengthy script it is! 

]]>
robert
Extending Sourcegraph search2022-04-10T00:00:00+00:002022-04-10T00:00:00+00:00https://bobheadxi.dev/extending-searchSourcegraph recently held a brief internal hackathon where we got to work on a variety of ideas related to our freshly minted “Sourcegraph use cases”. One idea that was raised was extending Sourcegraph’s core code search functionality to allow queries over search notebooks, a new product that enables live and persistent documentation based on code search, to aid in content discovery for onboarding.

The minimum viable product of this project was to implement the ability to do the following search within the Sourcegraph search language:

type:notebook my notebook query select:notebook.block.md
_____________ _________________ ________________________
       |                |                 └ render Markdown sections of the notebook match
       |                └ query string
       └ type filter

And render search notebooks (and/or selected “blocks”, or sections) within search results! For some context, this is what Sourcegraph’s code search results usually look like:

And this is what search notebooks look like, with each section being a separate notebook block:

In this post, I’ll walk through a brief overview of what I learned about how Sourcegraph search works and what we did to implement an additional search and search result type!

A sneak peek of the end result:

End-to-end notebook block search!

Note that all the code internals mentioned in this post may change - you can view the Sourcegraph repository at 73a484e for an accurate picture of what the codebase looked like at the time! I’d also like to thank @tsenart who both proposed the original idea and worked with me through several brainstorming sessions to discuss the implementation.

Additionally, I am basically a complete outsider when it comes to our search internals, and the search code I interact with in this post was built by Sourcegraph’s fantastic search teams, so kudos1 to the teams for making this hack possible in the first place!

Introducing a search job

When you enter a query into sourcegraph.com/search:

  1. A client makes a request to (typically) the /.api/stream endpoint - see how it is done in the raycast-sourcegraph extension for a simplified example.
  2. The query makes its way to sourcegraph-frontend, which converts the query text into a search plan composed of search jobs to execute against various backends (such as Zoekt).
  3. Jobs get executed and the results get streamed back over the wire to the client.

For example, a typical query foobar will evaluate to a plan of jobs like the following, calling out to a variety of search backends (ZoektGlobalSearch, RepoSearch, ComputeExcludedRepos) within certain limits2, which are enforced on child jobs by dedicated wrapper jobs.

flowchart TB
0([TIMEOUT])
  0---1
  1[20s]
  0---2
  2([LIMIT])
    2---3
    3[500]
    2---4
    4([PARALLEL])
      4---5
      5([ZoektGlobalSearch])
      4---6
      6([RepoSearch])
      4---7
      7([ComputeExcludedRepos])

The typical example here is a search job that reaches out to our Zoekt backends. A Job could also combine multiple search jobs, such as to run a set of jobs in parallel or to prioritise results from certain jobs before others.

The evaluated search job varies based on your search query - an exhaustive commit search (foo type:commit count:all) will create the following job instead, with a longer timeout and higher limit:

flowchart TB
0([TIMEOUT])
  0---1
  1[1m0s]
  0---2
  2([LIMIT])
    2---3
    3[99999999]
    2---4
    4([PARALLEL])
      4---5
      5([Commit])
      4---6
      6([ComputeExcludedRepos])

Each search job within these plans are implemented behind the Job interface:

// Job is an interface shared by all individual search operations in the
// backend (e.g., text vs commit vs symbol search are represented as different
// jobs) as well as combinations over those searches (run a set in parallel,
// timeout). Calling Run on a job object runs a search.
type Job interface {
  Run(context.Context, database.DB, streaming.Sender) (*search.Alert, error)
  Name() string
}
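To make the combinator idea concrete, here is a stripped-down sketch of how TIMEOUT and PARALLEL nodes might compose child jobs. The signatures are simplified - the real Run also takes a database handle and a result stream - so this illustrates the pattern, not Sourcegraph's actual implementation:

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
	"time"
)

// Simplified Job: the real interface also threads through a database handle
// and a streaming.Sender for results.
type Job interface {
	Run(ctx context.Context) error
	Name() string
}

// TimeoutJob mirrors the TIMEOUT node: it bounds its child with a deadline.
type TimeoutJob struct {
	Timeout time.Duration
	Child   Job
}

func (j *TimeoutJob) Run(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, j.Timeout)
	defer cancel()
	return j.Child.Run(ctx)
}
func (j *TimeoutJob) Name() string { return "Timeout" }

// ParallelJob mirrors the PARALLEL node: children run concurrently and the
// first error (if any) is reported.
type ParallelJob struct{ Children []Job }

func (j *ParallelJob) Run(ctx context.Context) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(j.Children))
	for _, c := range j.Children {
		wg.Add(1)
		go func(c Job) { defer wg.Done(); errs <- c.Run(ctx) }(c)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}
func (j *ParallelJob) Name() string { return "Parallel" }

// recordingJob stands in for a leaf search job like ZoektGlobalSearch.
type recordingJob struct {
	name string
	mu   *sync.Mutex
	ran  *[]string
}

func (j recordingJob) Run(ctx context.Context) error {
	j.mu.Lock()
	defer j.mu.Unlock()
	*j.ran = append(*j.ran, j.name)
	return nil
}
func (j recordingJob) Name() string { return j.name }

// runPlan builds the 20s TIMEOUT -> PARALLEL plan from the diagram above and
// returns the (sorted) names of the leaf jobs that ran.
func runPlan() []string {
	var mu sync.Mutex
	var ran []string
	plan := &TimeoutJob{
		Timeout: 20 * time.Second,
		Child: &ParallelJob{Children: []Job{
			recordingJob{"ZoektGlobalSearch", &mu, &ran},
			recordingJob{"RepoSearch", &mu, &ran},
		}},
	}
	if err := plan.Run(context.Background()); err != nil {
		panic(err)
	}
	sort.Strings(ran)
	return ran
}

func main() {
	fmt.Println(runPlan()) // → [RepoSearch ZoektGlobalSearch]
}
```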

So how do these jobs in the query plan get created? Poking around for constructors of the Job interface reveals (I think) the following flow for Job creation after a query.Plan is created (primarily with query.Pipeline, which handles query parsing, validation, transformation, and so on):

graph TD
  FromExpandedPlan --> ToEvaluateJob

  ToEvaluateJob --> ToSearchJob
  ToEvaluateJob -- "has pattern (AND or OR)" --> toPatternExpressionJob

  toPatternExpressionJob --> ToSearchJob
  toPatternExpressionJob --> toOrJob
  toPatternExpressionJob --> toAndJob

  toOrJob --> toPatternExpressionJob
  toAndJob --> toPatternExpressionJob

  ToSearchJob --> Job
  ToSearchJob -- has pattern --> optimizeJobs
  optimizeJobs --> Job

The ToSearchJob function appears to handle the bulk of search job creation, with the additional layers applying a variety of processing.

// ToSearchJob converts a query parse tree to the _internal_ representation
// needed to run a search routine. To understand why this conversion matters, think
// about the fact that the query parse tree doesn't know anything about our
// backends or architecture. It doesn't decide certain defaults, like whether we
// should return multiple result types (pattern matches content, or a file name,
// or a repo name). If we want to optimise a Sourcegraph query parse tree for a
// particular backend (e.g., skip repository resolution and just run a Zoekt
// query on all indexed repositories) then we need to convert our tree to
// Zoekt's internal inputs and representation. These concerns are all handled by
// toSearchJob.
func ToSearchJob(jargs *Args, q query.Q, db database.DB) (Job, error) {
  b, err := query.ToBasicQuery(q)
  if err != nil {
    return nil, err
  }
  types, _ := q.StringValues(query.FieldType)
  resultTypes := search.ComputeResultTypes(types, b.PatternString(), jargs.SearchInputs.PatternType)

  // ...

  var requiredJobs, optionalJobs []Job
  addJob := func(required bool, job Job) {
    if required {
      requiredJobs = append(requiredJobs, job)
    } else {
      optionalJobs = append(optionalJobs, job)
    }
  }

  // ... various conditional calls to addJob
}

So to start off, we add a new result type result.TypeNotebook = "notebook", and attach a new Job when a query includes type:notebook:

if resultTypes.Has(result.TypeNotebook) {
  notebookSearchJob := &notebook.SearchJob{
    PatternString: b.PatternString(),
  }
  addJob(true, notebookSearchJob)
}

For now, we want to create a stub implementation that sends a few hard-coded notebook results over to the streaming.Sender provided in the (Job).Run interface. This requires implementing the result.Match interface:

type Match interface {
  ResultCount() int

  // Limit truncates the match such that, after limiting,
  // `Match.ResultCount() == limit`. It should never be called with
  // `limit <= 0`, since a single match cannot be truncated to zero results.
  Limit(int) int

  Select(filter.SelectPath) Match
  RepoName() types.MinimalRepo

  // Key returns a key which uniquely identifies this match.
  Key() Key
}
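As a rough illustration of how a LIMIT node might consume these methods (assumed mechanics, simplified from the real implementation): each match counts against a shared result budget, and the match that exhausts the budget is truncated in place via Limit:

```go
package main

import "fmt"

// countedMatch is a stand-in implementing just the counting half of the
// Match interface above.
type countedMatch struct{ count int }

func (m *countedMatch) ResultCount() int { return m.count }

// Limit truncates the match to at most `limit` results and returns the
// remaining budget after this match is counted.
func (m *countedMatch) Limit(limit int) int {
	if m.count <= limit {
		return limit - m.count
	}
	m.count = limit
	return 0
}

// truncate applies a result limit across a batch of matches, the way a LIMIT
// node might: matches past the exhausted budget are dropped entirely.
func truncate(matches []*countedMatch, limit int) []*countedMatch {
	var kept []*countedMatch
	for _, m := range matches {
		if limit <= 0 {
			break
		}
		limit = m.Limit(limit)
		kept = append(kept, m)
	}
	return kept
}

func main() {
	matches := []*countedMatch{{count: 3}, {count: 4}, {count: 5}}
	kept := truncate(matches, 5)
	// A 5-result budget keeps the first match whole (3 results) and
	// truncates the second to the 2 remaining results.
	fmt.Println(len(kept), kept[1].ResultCount()) // → 2 2
}
```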

Right off the bat, it becomes clear that Sourcegraph’s search internals are heavily geared towards repository-oriented results, with the top-level RepoName being part of the Match interface. Repository matches, file content results, symbols, commits, diffs, and so on all return results that are part of a repository. Notebooks, on the other hand, are an entirely separate entity within the Sourcegraph application, and notebooks that are tracked in the database (it is also possible to create notebooks with .snb.md files within repositories, but we ignore that case for now) are not strictly associated with any repository.

This is even more evident within the Key type, which requires a unique combination of Repo, Rev, Commit, Path, AuthorDate, and TypeRank - none of which are fields that we can use to uniquely identify a search notebook. We could use Path as the notebook name, but that’s not strictly unique either.

To work around these issues for now, we just return a zero-value RepoName and add a new field ID to the Key type:

type Key struct {
  // ...

  // ID is an arbitrary identifier that can be used to distinguish this result,
  // e.g. if the result type is not associated with a repository.
  ID string

  // ...
}
type NotebookMatch struct {
  ID int64

  Title     string
  Namespace string
  Private   bool
  Stars     int
}

func (n NotebookMatch) RepoName() types.MinimalRepo {
  // This result type is not associated with any repository.
  return types.MinimalRepo{}
}

func (n NotebookMatch) Limit(limit int) int {
  // Always represents one result and limit > 0 so we just return limit - 1.
  return limit - 1
}

func (n *NotebookMatch) URL() *url.URL {
  return &url.URL{Path: "/notebooks/" + n.marshalNotebookID()}
}

func (n *NotebookMatch) Key() Key {
  return Key{
    ID:       n.marshalNotebookID(),
    TypeRank: rankRepoMatch,
  }
}

// other interface functions no-op for now

With our new types, we can create a stub job for searching search notebooks:

type SearchJob struct {
  // PatternString is populated in ToSearchJob above.
  PatternString string
}

func (s *SearchJob) Run(ctx context.Context, db database.DB, stream streaming.Sender) (*search.Alert, error) {
  stream.Send(streaming.SearchEvent{
    Results: result.Matches{
      &result.NotebookMatch{
        Title:     "FOOBAR",
        Namespace: "sourcegraph",
        ID:        1,
        Stars:     64,
        Private:   false,
      },
      &result.NotebookMatch{
        Title:     "BAZ",
        Namespace: "robert",
        ID:        2,
        Stars:     0,
        Private:   true,
      },
    },
  })
  return nil, nil
}

func (*SearchJob) Name() string { return "NotebookSearch" }

The workarounds above caused some funky behaviour, such as repository permissions post-processing rejecting notebook results as not being associated with a repository the current actor (user) has access to, so I just hacked in a condition to ignore zero-value RepoNames in those checks to avoid dropping our notebook results.
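The shape of that workaround - assumed here, since the actual permission-filtering code isn't shown - is roughly a guard that lets repository-less matches bypass the repository permission check:

```go
package main

import "fmt"

// match is a stand-in for result.Match; only the repository name matters here.
type match struct {
	repoName string
	title    string
}

// filterByRepoPermissions drops matches whose repository the actor cannot
// see, but - the hack described above - lets matches with a zero-value
// repository name (like notebooks) through untouched.
func filterByRepoPermissions(matches []match, canSee func(repo string) bool) []match {
	var kept []match
	for _, m := range matches {
		if m.repoName == "" || canSee(m.repoName) {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	matches := []match{
		{repoName: "github.com/sourcegraph/secret", title: "file match"},
		{repoName: "", title: "notebook match"}, // zero-value RepoName
	}
	// Even with an actor that can see no repositories, the notebook survives.
	kept := filterByRepoPermissions(matches, func(string) bool { return false })
	fmt.Println(len(kept), kept[0].title) // → 1 notebook match
}
```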

We can test the evaluation of the query type:notebook select:notebook.block.md foobar to see our new search job type being registered (after implementing the appropriate printers):

flowchart TB
0([TIMEOUT])
  0---1
  1[20s]
  0---2
  2([LIMIT])
    2---3
    3[500]
    2---4
    4([SELECT])
      4---5
      5[notebook.block.md]
      4---6
      6([PARALLEL])
        6---7
        7([NotebookSearch])
        6---8
        8([ComputeExcludedRepos])

In this case, the select: term is just thrown in to demonstrate that it’s a job that occurs on top of a child job, which contains the NotebookSearch job we created. This will be important later!

Sending results over the wire

That’s not the end of it! Distinct from plans, jobs, and matches, we also have event types, which are the types that get transmitted over the wire to search clients.

For the most part, this is a very thin layer that just simplifies the internal match types for consumption, and hydrates events with repository metadata from a cache (such as how many stars the associated repository has, and when the repository was last updated) or decorations. For our new notebook results, we don’t really need to support any of that yet - we can simply map results more or less directly to a new event type.

func fromNotebook(notebook *result.NotebookMatch) *streamhttp.EventNotebookMatch {
  return &streamhttp.EventNotebookMatch{
    Type:      streamhttp.NotebookMatchType,
    ID:        notebook.Key().ID,
    Title:     notebook.Title,
    Namespace: notebook.Namespace,
    URL:       notebook.URL().String(),
    Stars:     notebook.Stars,
    Private:   notebook.Private,
  }
}

At this point, we basically have everything we need to see our results in the API results! We can confirm by spinning up Sourcegraph locally with sg start, executing a search, and inspecting the response of the network request to /.api/stream within a browser for our placeholder notebook results:

Look closely at the 'matches' entry for our hard-coded notebooks!

Querying the database for real results

Notebooks live in the Sourcegraph database, so to replace our stub results we can make a query to look for notebooks that returns relevant matches based on the provided query string.

SELECT
  notebooks.id,
  notebooks.title,
  NOT public as private, -- invert for consistency with other match types

  -- apply post-processing after query to merge namespace_user and  namespace_org into a
  -- single 'Namespace' field (only one can be set at a time)
  users.username as namespace_user,
  orgs.name as namespace_org,

  (
    SELECT COUNT(*)
    FROM notebook_stars
    WHERE notebook_id = notebooks.id
  ) as stars
FROM
  notebooks
  LEFT JOIN users on users.id = notebooks.namespace_user_id
  LEFT JOIN orgs on orgs.id = notebooks.namespace_org_id
WHERE
  (%s) -- permission conditions
  AND (%s) -- query conditions
ORDER BY
  stars DESC
LIMIT
  25

To generate query conditions, we use the notebook.SearchJob evaluated in ToSearchJob as the sole parameter. The idea is to extend SearchJob to contain all the parameters that can be used to adjust the generated query (such as pattern types, e.g. regexp, or additional fields, such as inclusion and exclusion of notebooks with notebook: and -notebook, and so on). For now, we generate simple queries solely based on the PatternString parameter:

func makeQueryConds(job *SearchJob) *sqlf.Query {
  conds := []*sqlf.Query{}

  // Allow querying against the 'full title'
  const concatTitleQuery = "CONCAT(users.username, orgs.name, notebooks.title)"
  if job.PatternString != "" {
    titleQuery := "%(" + job.PatternString + ")%"
    conds = append(conds, sqlf.Sprintf("%s ILIKE %s",
      concatTitleQuery, titleQuery))
  }

  if len(job.PatternString) > 0 {
    // Query against notebook contents, embedded as a tsvector field.
    conds = append(conds, sqlf.Sprintf("notebooks.blocks_tsvector @@ to_tsquery('english', %s)",
      toPostgresTextSearchQuery(job.PatternString)))
  }

  if len(conds) == 0 {
    // If no conditions are present, append a catch-all condition to avoid a SQL syntax error
    conds = append(conds, sqlf.Sprintf("1 = 1"))
  }

  return sqlf.Join(conds, "\n OR")
}
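The toPostgresTextSearchQuery helper isn't shown above; a minimal assumed version would turn each whitespace-separated term into a prefix-matching tsquery lexeme (the real helper likely also escapes tsquery syntax characters):

```go
package main

import (
	"fmt"
	"strings"
)

// toPostgresTextSearchQuery converts a raw pattern like "foo bar" into a
// to_tsquery-compatible expression, "foo:* & bar:*" - each term becomes a
// prefix match, and all terms must be present. (Assumed implementation.)
func toPostgresTextSearchQuery(pattern string) string {
	terms := strings.Fields(pattern)
	for i, t := range terms {
		terms[i] = t + ":*"
	}
	return strings.Join(terms, " & ")
}

func main() {
	fmt.Println(toPostgresTextSearchQuery("multi tenant")) // → multi:* & tenant:*
}
```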

The CONCAT means that we cannot use indexes to speed up the query, but this is a hackathon so oh well. I decided to keep it in because a query like $namespace $topic felt like a very natural query to want to make, and I wanted the demo to support that.

After writing a bit more boilerplate to execute the database query and scan the resulting rows, we can update our search job to return real results instead:

func (s *SearchJob) Run(ctx context.Context, db database.DB, stream streaming.Sender) (*search.Alert, error) {
  store := Search(db)
  notebooks, err := store.SearchNotebooks(ctx, s)
  if err != nil {
    return nil, errors.Wrap(err, "NotebookSearch")
  }
  matches := make([]result.Match, len(notebooks))
  for i, n := range notebooks {
    matches[i] = n
  }
  stream.Send(streaming.SearchEvent{
    Results: matches,
  })
  return nil, nil
}

We can test this out by creating a few notebooks in our local Sourcegraph instance and inspecting the network requests in-browser again to see real notebooks being returned!

Implementing notebook blocks results

Seeing the notebook titles that match your query is great and all, but to demonstrate the potential of this capability we wanted to make sure users can also see notebook content results - in other words, the matching notebook blocks - for their query.

For now, we decided to implement this such that notebook blocks only get returned with the select:notebook.block parameter. The Sourcegraph query language already features selections like select:repo or select:commit.diff.added, so this approach felt like it fitted in with how other search types are implemented.

Selections are part of the Match interface we previously implemented, and they work via selectJob, which wraps the streaming.Sender with another streaming.Sender that calls Select on each result it receives before passing it to the underlying stream.
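That wrapping can be pictured as a decorator over the sender. This is a sketch with heavily simplified types - the real streaming.Sender and filter.SelectPath carry more context - but the mechanics are the same: select each match, drop the nils, forward the rest:

```go
package main

import "fmt"

// Simplified stand-ins: the real types live in the streaming and filter packages.
type Match interface {
	Select(path []string) Match // returns nil when nothing matches the selection
}
type Sender func(matches []Match)

// withSelect wraps a Sender so that every match is passed through Select
// before reaching the underlying stream; matches that select to nil are dropped.
func withSelect(underlying Sender, path []string) Sender {
	return func(matches []Match) {
		var selected []Match
		for _, m := range matches {
			if s := m.Select(path); s != nil {
				selected = append(selected, s)
			}
		}
		underlying(selected)
	}
}

// notebookish selects successfully only for paths rooted at "notebook".
type notebookish struct{ title string }

func (n notebookish) Select(path []string) Match {
	if len(path) > 0 && path[0] == "notebook" {
		return n
	}
	return nil
}

func main() {
	var received []Match
	stream := withSelect(func(ms []Match) { received = ms }, []string{"notebook"})
	stream([]Match{notebookish{"FOOBAR"}})
	fmt.Println(len(received)) // → 1
}
```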

This means that all we have to do is also query for blocks within our notebooks database query, and only expose the blocks within the Select implementation. To start off, we extend our NotebookMatch with a Blocks field, and implement Select such that we generate a new NotebookBlocksMatch type:

type NotebookMatch struct {
  // ... as before

  Blocks NotebookBlocks `json:"-"`
}

/// ... as before

func (n *NotebookMatch) Select(path filter.SelectPath) Match {
  // Only support 'select:notebook.*' on this result type
  if path.Root() != filter.Notebook {
    return nil
  }

  switch len(path) {
  case 1:
    return n // This is just 'select:notebook', so return self

  case 2, 3: // Support 'select:notebook.block' and 'select:notebook.block.*'
    if path[1] == "block" {
      if len(n.Blocks) == 0 {
        return nil // No results!
      }

      return (&NotebookBlocksMatch{
        Notebook: *n,
        Blocks:   n.Blocks,
      }).Select(path) // Allow blocks to continue selecting for 'select:notebook.block.*'
    }
  }

  return nil
}

To support select:notebook.blocks.$TYPE, where $TYPE is a block type (such as Markdown, query, symbol, and so on), the NotebookBlocksMatch type must also implement Select to only provide blocks of the requested type:

func (n *NotebookBlocksMatch) Select(path filter.SelectPath) Match {
  // Only support 'select:notebook.*' on this result type
  if path.Root() != filter.Notebook {
    return nil
  }

  switch len(path) {
  case 2:
    if path[1] == "block" {
      return n // This is just 'select:notebook.block', so return self
    }

  case 3:
    // Filter by the requested block type, which is the third path parameter. For example,
    // 'select:notebook.block.md' will filter for blocks of type 'md'.
    blockType := path[2]
    var blocks NotebookBlocks
    for _, b := range n.Blocks {
      if b["type"] == blockType {
        blocks = append(blocks, b)
      }
    }
    if len(blocks) == 0 {
      return nil // No results!
    }
    return &NotebookBlocksMatch{
      Notebook: n.Notebook,
      Blocks:   blocks,
    }
  }

  return nil
}

And as before, we need to implement an event type EventNotebookBlockMatch and the relevant adapters as well.

func fromNotebookBlocks(blocks *result.NotebookBlocksMatch) *streamhttp.EventNotebookBlockMatch {
  return &streamhttp.EventNotebookBlockMatch{
    Type:     streamhttp.NotebookBlockMatchType,
    Notebook: *fromNotebook(&blocks.Notebook),
    Blocks:   blocks.Blocks,
  }
}

For the database layer, we now need to add blocks to our result type. Blocks are currently stored as a JSON blob within the notebooks.blocks column, so adding that to our SELECT and including it in the result scan is fairly straightforward.

However, this does mean that we can’t select only the relevant blocks within the database query. A better long-term solution is likely to split notebooks.blocks out into a separate table and join it at query time, but that’s a lot of work for a hackathon, so I decided to go for a cheap hack: post-filtering! This isn’t too bad for now because the notebooks.blocks_tsvector @@ to_tsquery in our query conditions means that the returned notebooks are likely to have a matching block, but it definitely isn’t very pretty.

Even worse, blocks of various types have varying shapes (i.e. there’s no single block.text field we can filter on), and I didn’t want to special-case each block type for now. A closer look at notebooks.blocks_tsvector reveals it is backed by a magic Postgres feature that indexes all fields of type string within the notebooks.blocks JSON:

ALTER TABLE
  notebooks
ADD
  COLUMN
IF NOT EXISTS
  blocks_tsvector TSVECTOR
GENERATED ALWAYS AS
  (jsonb_to_tsvector('english', blocks, '["string"]')) STORED;

It is a neat implementation that does not require any knowledge of block fields, but sadly there does not seem to be an equivalent Go function for us to post-filter with. So I just marshal each block as JSON and do a regexp search over the whole thing:

func (s *notebooksSearchStore) SearchNotebooks(ctx context.Context, job *SearchJob) ([]*result.NotebookMatch, error) {
  // ... query for notebooks

  // do our post-filtering
  if len(job.PatternString) > 0 {
    searchRe, err := regexp.Compile("(?i).*(" + job.PatternString + ").*")
    if err != nil {
      return nil, err
    }
    for _, n := range notebooks {
      var matchBlocks result.NotebookBlocks
      // filter notebook blocks
      for _, block := range n.Blocks {
        b, err := json.Marshal(block)
        if err != nil {
          continue
        }
        // regexp match over the marshalled block
        if searchRe.Match(b) {
          matchBlocks = append(matchBlocks, block)
        }
      }
      n.Blocks = matchBlocks
    }
  }

  return notebooks, nil
}

Hey, it’s a hackathon!

Similarly to before, we can verify this works end-to-end by running a type:notebook select:notebook.block query and inspecting the response:

Rendering search notebook results

Rendering results in the network tab is great and all, but we want to demo something pretty as well! We start off by adding types in the web app that correspond to our new event types:

export type SearchType = /* ... */ | 'notebook' | null
export type SearchMatch = /* ... */ | NotebookMatch | NotebookBlocksMatch

export interface NotebookMatch {
    type: 'notebook'
    id: string
    title: string
    namespace: string
    url: string
    stars?: number
    private: boolean
}

export interface NotebookBlocksMatch {
    type: 'notebook.block'
    notebook: NotebookMatch
    // TODO lots of variants of these types, leave as any for now and massage the data
    // as needed
    blocks: any[]
}

To extend type: completions in the search bar, we update FILTERS:

export const FILTERS: Record<NegatableFilter, NegatableFilterDefinition> &
    Record<Exclude<FilterType, NegatableFilter>, BaseFilterDefinition> = {
    /* ... */
    [FilterType.type]: {
        description: 'Limit results to the specified type.',
        discreteValues: () => [/* ... */, 'notebook'].map(value => ({ label: value })),
    },
    /* ... */
}

And similarly for select: completions, we update SELECTORS:

export const SELECTORS: Access[] = [
  /* ... */
  {
    name: 'notebook',
    fields: [
      {
        name: 'block',
        fields: [{ name: 'md' }, { name: 'query' }, { name: 'file' }, { name: 'symbol' }],
      },
    ],
  },
]
Suggestions!

And now things get a bit hacky. For plain notebook results, we can leverage the same components used for repository matches with reasonable results by extending the StreamingSearchResultsList component:

export const StreamingSearchResultsList: React.FunctionComponent<StreamingSearchResultsListProps> = ({
    /* ... */
}) => {
    /* ... */

    const renderResult = useCallback(
        (result: SearchMatch, index: number): JSX.Element => {
            switch (result.type) {
                /* ... */
                case 'notebook':
                    return (
                        <SearchResult
                            icon={NotebookIcon}
                            result={result}
                            repoName={`${result.namespace} / ${result.title}`}
                            platformContext={platformContext}
                            onSelect={() => logSearchResultClicked(index, 'notebook')}
                        />
                    )
            }
        }
    )

    return (/* ... */)
}

For notebook blocks, things started to get really hacky. I had originally expected to just render the parameters encoded in the block (for example, the query in a query block). However, @tsenart pointed out that maybe we could render the blocks exactly as they are rendered within a notebook. I thought this would be brilliant! Surely it would be as easy as simply importing the correct component and providing it with the blocks in a block match - how messy could this be?

Well, using NotebookComponent ended up looking like this:

  case 'notebook.block':
      return (
          <ResultContainer
              icon={NotebookIcon}
              title={
                  <Link to={result.notebook.url}>
                      {result.notebook.namespace} / {result.notebook.title}
                  </Link>
              }
              collapsible={false}
              defaultExpanded={true}
              resultType={result.type}
              onResultClicked={noop}
              expandedChildren={
                  <div className={styles.notebookBlockResult}>
                      <NotebookComponent
                          key={`${result.notebook.id}-blocks`}
                          isEmbedded={true}
                          noRunButton={true}
                          // TODO HACK: DB, component, and GraphQL block types
                          // don't align so we need to massage it into a type
                          // this component finds acceptable
                          blocks={result.blocks.map(b => {
                              if (b.queryInput) {
                                  return { ...b, input: { query: b.queryInput.text } }
                              }
                              return {
                                  ...b,
                                  input:
                                      b.markdownInput || b.fileInput || b.symbolInput || b.computeInput,
                              }
                          })}
                          authenticatedUser={null}
                          globbing={false}
                          isReadOnly={true}
                          extensionsController={extensionsController}
                          hoverifier={hoverifier}
                          platformContext={platformContext}
                          exportedFileName={result.notebook.title}
                          onSerializeBlocks={noop}
                          onCopyNotebook={() => NEVER}
                          streamSearch={() => NEVER} // TODO make this jump to new search page instead
                          isLightTheme={isLightTheme}
                          telemetryService={telemetryService}
                          fetchHighlightedFileLineRanges={fetchHighlightedFileLineRanges}
                          searchContextsEnabled={searchContextsEnabled}
                          settingsCascade={settingsCascade}
                          isSourcegraphDotCom={isSourcegraphDotCom}
                          showSearchContext={showSearchContext}
                      />
                  </div>
              }
          />
      )

Gnarly, eh? All these fields required me to do all sorts of things to StreamingSearchResultsListProps to get the props needed. Full disclaimer: I am far from a professional when it comes to web apps and React, so I’m sure there’s a better way to do this than prop drilling, but oh well. The NotebookComponent also doesn’t feel like it was meant for this kind of import and use, given that notebooks are a pretty new product and the whole philosophy of iterating fast and polishing later and all.

That said, once the compiler stopped complaining the results were great - everything kind of just worked, and looked pretty good after some CSS adjustments! Even running query blocks worked nicely.

Of course, this begs the question - what if you make a notebook search, within a search notebook? Well, that works too!

Search-notebooks-ception?

You can also check out a brief final demo I made of the state of the project at the end of the hackathon for how this all ties together:

demo

You can also check out the (messy) (and incomplete) code here: sourcegraph#33316


Wrap-up

Thanks for reading! I hope this was an interesting glimpse at how search works at Sourcegraph. I’m not sure if this will ever make it into the product, but regardless, this was a really fun foray into a part of the codebase I’ve only interacted with at a surface level through my Sourcegraph for Raycast extension project, and learning about the abstractions used to power code search (and more!) was fascinating, and a nice change of pace from my usual work!


  1. So somewhat embarrassingly, on one of my iterations of this project I complained a bit about the tedium of the many layers in the search backend, at which point I was educated by Comby (structural search) creator @rvantonder on how cleaning up the search internals is an ongoing effort and has improved significantly over the past year. One of my biggest takeaways from this project is that search is a very complex system and that building a suitable abstraction for the myriad of types of search that Sourcegraph already features is a monumental undertaking! 

  2. By default, Sourcegraph search is limited to optimise for fast results. The exhaustiveness of a search is configurable through count: and timeout:, as well as a special count:all mode, as described in our documentation: Exhaustive search. 

]]>
robert
Self-documenting and self-updating tooling2022-02-20T00:00:00+00:002022-02-20T00:00:00+00:00https://bobheadxi.dev/self-documenting-self-updatingIn a rapidly moving organization, documentation drift is inevitable as the underlying tools undergo changes to suit changing needs, especially for internal tools, where leaning on tribal knowledge can often be more efficient in the short term. As each component grows in complexity, however, this introduces debt that makes for a confusing onboarding process and a poor developer experience, and makes building integrations more difficult.

One approach for keeping documentation debt at bay is to choose tools that come with automated documentation generation built-in. You can design your code in such a way that code documentation generators can also double as user guides (which I explored with my rewrite of the UBC Launch Pad website’s generated configuration documentation), or use specifications that can generate both code and documentation (which I tried with Inertia’s API reference). Some libraries, like Cobra, a Go library for building CLIs, can also generate reference documentation for commands (such as Inertia’s CLI reference). This allows you to meet your users where they are - for example, the less technically oriented can check out a website while the more hands-on users can find what they need within the code or in the command line - while maintaining a single source of truth that keeps everything up to date.

Of course, in addition to generated documentation you still need to write documentation to tie the pieces together - for example, the UBC Launch Pad website still had a brief intro guide, and we did put together a usage guide for Inertia - but generated documentation ensures the nitty gritty stays up to date, letting you focus on high-level guidance in your handcrafted writing.

At Sourcegraph, I’ve been exploring avenues for taking this even further. Once you move away from off-the-shelf generators and invest in leveraging your code to generate exactly what you need, you can build a pretty neat ecosystem of not just documentation generators, but also interesting integrations and tooling that is always up to date by design. In this article, I’ll talk about some of the things we’ve built with this approach in mind: Sourcegraph’s observability ecosystem and continuous integration pipelines.


Observability ecosystem

The Sourcegraph product has shipped with Prometheus metrics and Grafana dashboards for quite a while, used both by Sourcegraph for Sourcegraph Cloud and by self-hosted customers to operate Sourcegraph instances. These have been created from our own Go-based specification since before I started working here. The spec would look something like this (truncated for brevity):

func GitServer() *Container {
	return &Container{
		Name:        "gitserver",
		Title:       "Git Server",
		Description: "Stores, manages, and operates Git repositories.",
		Groups: []Group{{
			Title: "General",
			Rows: []Row{{
				// Each dashboard panel and alert is associated with an "observable"
				Observable{
					Name:        "disk_space_remaining",
					Description: "disk space remaining by instance",
					Query:       `(src_gitserver_disk_space_available / src_gitserver_disk_space_total)*100`,
					// Configure Prometheus alerts
					Warning: Alert{LessOrEqual: 25},
					// Configure Grafana panel
					PanelOptions: PanelOptions().LegendFormat("{{instance}}").Unit(Percentage),
					// Some options, like this one, make changes to both how the panel
					// is rendered as well as when the alert fires
					DataMayNotExist: true,
					// Configure documentation about possible solutions if the alert fires
					PossibleSolutions: `
						- **Provision more disk space:** Sourcegraph will begin deleting...
					`,
				},
			}},
		}},
	}
}
Explore what our monitoring generator looked like in Sourcegraph 3.17 (circa mid-2020)

From here, a program imports the definitions and generates the appropriate Prometheus recording rules, Grafana dashboard specs, and a simple customer-facing “alert solutions” page. Any changes that engineers make to their monitoring definitions using the specification automatically update everything that needs to be updated - no additional work needed.

For example, the Grafana dashboard spec generation automatically calculates appropriate widths and heights for each panel you add, ensuring that panels are evenly distributed, include lines indicating Prometheus alert thresholds, have a uniform look and feel, and more.

I loved this idea, so I ran with it and worked on a series of changes that expanded the capabilities of this system significantly. Today, our monitoring specification powers:

  • Grafana dashboards that now automatically include links to the generated documentation, annotation layers for generated alerts, improved alert overview graphs, and more.
Version and alert annotations in Sourcegraph's generated dashboards. Dashboard like these are automatically provided by defining observables using our monitoring specification, alongside everything else mentioned previously.
  • Prometheus integration that now generates more granular alert rules that include additional metadata such as the ID of the associated generated dashboard panel, the team that owns the alert, and more.
  • An entirely new Alertmanager integration (related blog post) that allows you to easily configure alert notifications via the Sourcegraph application, which automatically sets up the appropriate routes and configures messages to include relevant information for triaging alerts: a helpful summary, links to documentation, and links to the relevant dashboard panel in the time window of the alert. This leverages the aforementioned generated Prometheus metrics!
Automatically configured alert notification messages feature a helpful summary and links to diagnose the issue further for a variety of supported notification services, such as Slack and OpsGenie.

The API has changed as well to improve its flexibility and enable many of the features listed above. Nowadays, a monitoring specification might look like this (also truncated for brevity):

// Definitions are separated from the API so everything is imported from 'monitoring' now,
// which allows for a more tightly controlled API.
func GitServer() *monitoring.Container {
    return &monitoring.Container{
        Name:        "gitserver",
        Title:       "Git Server",
        Description: "Stores, manages, and operates Git repositories.",
        // Easily create template variables without diving into the underlying JSON spec
        Variables: []monitoring.ContainerVariable{{
            Label:        "Shard",
            Name:         "shard",
            OptionsQuery: "label_values(src_gitserver_exec_running, instance)",
            Multi:        true,
        }},
        Groups: []monitoring.Group{{
            Title: "General",
            Rows: []monitoring.Row{{
                {
                    Name:        "disk_space_remaining",
                    Description: "disk space remaining by instance",
                    Query:       `(src_gitserver_disk_space_available / src_gitserver_disk_space_total)*100`,
                    // Alerting API expanded with additional options to leverage more
                    // Prometheus features
                    Warning: monitoring.Alert().LessOrEqual(25).For(time.Minute),
                    Panel: monitoring.Panel().LegendFormat("{{instance}}").
                        Unit(monitoring.Percentage).
                        // Functional configuration API that allows you to provide a
                        // callback to configure the underlying Grafana panel further, or
                        // use one of the shared options to share common options
                        With(monitoring.PanelOptions.LegendOnRight()),
                    // Owners can now be defined on observables, which allows support
                    // to help triage customer queries and is used internally to route
                    // pager alerts
                    Owner: monitoring.ObservableOwnerCoreApplication,
                    // Documentation fields are still around, but an 'Interpretation' can
                    // now also be provided for more obscure background on observables,
                    // especially if they aren't tied to an alert
                    PossibleSolutions: `
                        - **Provision more disk space:** Sourcegraph will begin deleting...
                    `,
                },
            }},
        }},
    }
}           
Explore what our monitoring generator looks like today!

Since the specification is built on a typed language, the API itself is self-documenting: authors of monitoring definitions can see what options are available and what each does through generated API docs or code intelligence in Sourcegraph or their IDE, making it very easy to pick up and work with.

Example Sourcegraph API docs of the monitoring API, though similar docs can also be generated by other language-specific tools.

We also now have a tool, dubbed sg, that enables us to spin up just the monitoring stack - complete with hot-reloading of Grafana dashboards and Prometheus configuration - with a single command: sg start monitoring. You can even easily test your dashboards against production metrics! This is all enabled by having a single tool and set of specifications as the source of truth for all our monitoring integrations.

This all comes together to form a cohesive monitoring development and usage ecosystem that is tightly integrated, encodes best practices, is self-documenting (both in the content it generates and in the APIs available), and is easy to extend.

You can check out the monitoring generator source code here.


Continuous integration pipelines

At Sourcegraph, our core continuous integration pipelines are - you guessed it - generated! Our pipeline generator program analyses a build’s variables (changes, branch names, commit messages, environment variables, and more) in order to create a pipeline to run on our Buildkite agent fleet.

Typically, Buildkite pipelines are specified similarly to GitHub Action workflows - by committing a YAML file to your repository that build agents pick up and run. This YAML file will specify what commands should get run over your codebase, and will usually support some simple conditions.

These conditions are not very ergonomic to specify, however, and will often be limited in functionality - so instead, we generate the entire pipeline on the fly:

steps:
  - group: "Pipeline setup"
    steps:
      - label: ':hammer_and_wrench: :pipeline: Generate pipeline'
        # Prioritise generating pipelines so that jobs can get generated and queued up as soon
        # as possible, so as to better assess pipeline load e.g. to scale the Buildkite fleet.
        priority: 10
        command: |
          echo "--- generate pipeline"
          go run ./enterprise/dev/ci/gen-pipeline.go | tee generated-pipeline.yml
          echo "--- upload pipeline"
          buildkite-agent pipeline upload generated-pipeline.yml

The pipeline generator has also been around at Sourcegraph since long before I joined, but I’ve since done some significant refactors to it, including refactoring some of its core functionality - what we call “run types” and “diff types”, which are used to determine the appropriate pipeline to generate for any given build. This allows us to do a ton of cool things.

First, some background on the technical details. A run type is specified as follows:

// RunTypeMatcher defines the requirements for any given build to be considered a build of
// this RunType.
type RunTypeMatcher struct {
    // Branch loosely matches branches that begin with this value, unless a different type
    // of match is indicated (e.g. BranchExact, BranchRegexp)
    Branch       string
    BranchExact  bool
    BranchRegexp bool
    // BranchArgumentRequired indicates the path segment following the branch prefix match is
    // expected to be an argument (does not work in conjunction with BranchExact)
    BranchArgumentRequired bool

    // TagPrefix matches tags that begin with this value.
    TagPrefix string

    // EnvIncludes validates if these key-value pairs are configured in environment.
    EnvIncludes map[string]string
}

When a matcher matches, the corresponding RunType (an iota constant) is associated with the build, which can then be leveraged to determine what kinds of steps to include. For example:

  • Pull requests run a bare-bones pipeline generated from what has changed in your pull requests (read on to learn more) - this enables us to keep feedback loops short on pull requests.
  • Tagged release builds run our full suite of tests, and publishes finalised images to our public Docker registries.
  • The main branch runs our full suite of tests, and publishes preview versions of our images to internal Docker registries. It also generates notifications that can notify build authors if their builds have failed in main.
  • Similarly, a “main dry run” run type is available by pushing to a branch prefixed with main-dry-run/ - this runs almost everything that gets run on main. Useful for double-checking your changes will pass when merged.
  • Scheduled builds are run with specific environment variables for browser extension releases and release branch health checks.
A search notebook walkthrough of how run types are used!

A “diff type” is generated by a diff detector that works similarly to GitHub Actions’ on.paths, but enables a lot more flexibility. For example, we detect basic “Go” diffs like so:

if strings.HasSuffix(p, ".go") || p == "go.sum" || p == "go.mod" {
    diff |= Go
}

However, engineers can also define database migrations that might not change Go code - in these situations, we still want to run Go tests, and we also want to run migration tests. We can centralise this detection like this:

if strings.HasPrefix(p, "migrations/") {
    diff |= (DatabaseSchema | Go)
}

Our Diff = 1 << iota type is constructed by bit-shifting an iota type, so we can easily check for what diffs have been detected with diff&target != 0, which is done by a helper function, (*DiffType).Has.

A search notebook walkthrough of how diff types are used!

The programmatic generation approach allows for some complex step generation that would be very tedious to manage by hand. Take this example:

if diff.Has(changed.DatabaseSchema) {
    ops.Merge(operations.NewNamedSet("DB backcompat tests",
        addGoTestsBackcompat(opts.MinimumUpgradeableVersion)))
}

In this scenario, a group of checks (operations.NewNamedSet) is created to check that migrations being introduced are backwards-compatible. To make this check, we provide it MinimumUpgradeableVersion - a variable that is updated automatically by the Sourcegraph release tool to indicate what version of Sourcegraph all changes should be compatible with. The tests being added look like this:

func addGoTestsBackcompat(minimumUpgradeableVersion string) func(pipeline *bk.Pipeline) {
    return func(pipeline *bk.Pipeline) {
        buildGoTests(func(description, testSuffix string) {
            pipeline.AddStep(
                fmt.Sprintf(":go::postgres: Backcompat test (%s)", description),
                bk.Env("MINIMUM_UPGRADEABLE_VERSION", minimumUpgradeableVersion),
                bk.Cmd("./dev/ci/go-backcompat/test.sh "+testSuffix),
            )
        })
    }
}

buildGoTests is a helper that generates a set of commands to be run against each of the Sourcegraph repository’s Go packages. It is configured to split out more complex packages into separate jobs so that they can be run in parallel across multiple agents. Right now, the generated commands for addGoTestsBackcompat look like this:

 • DB backcompat tests
      • :go::postgres: Backcompat test (all)
      • :go::postgres: Backcompat test (enterprise/internal/codeintel/stores/dbstore)
      • :go::postgres: Backcompat test (enterprise/internal/codeintel/stores/lsifstore)
      • :go::postgres: Backcompat test (enterprise/internal/insights)
      • :go::postgres: Backcompat test (internal/database)
      • :go::postgres: Backcompat test (internal/repos)
      • :go::postgres: Backcompat test (enterprise/internal/batches)
      • :go::postgres: Backcompat test (cmd/frontend)
      • :go::postgres: Backcompat test (enterprise/internal/database)
      • :go::postgres: Backcompat test (enterprise/cmd/frontend/internal/batches/resolvers)

With just the pretty minimal configuration above, each step is generated with a lot of baked-in configuration, much of which is applied automatically to every build step we have.

  - agents:
      queue: standard
    command:
    - ./tr ./dev/ci/go-backcompat/test.sh only github.com/sourcegraph/sourcegraph-public-snapshot/internal/database
    env:
      MINIMUM_UPGRADEABLE_VERSION: 3.36.0
    key: gopostgresBackcompattestinternaldatabase
    label: ':go::postgres: Backcompat test (internal/database)'
    timeout_in_minutes: "60"

In this snippet, we have:

  • A default queue to run the job on - this can be feature-flagged to run against experimental agents.
  • The shared MINIMUM_UPGRADEABLE_VERSION variable that gets used for other steps as well, such as upgrade tests.
  • A generated key, useful for identifying steps and creating step dependencies.
  • Commands prefixed with ./tr: this script creates and uploads traces for our builds!
Build traces help visualise and track the performance of various pipeline steps. Uploaded traces are automatically linked from builds via Buildkite annotations for easy reference, and can also be queried directly in Honeycomb.

Features like build step traces were implemented without having to make sweeping changes to pipeline configuration, thanks to the generated approach - we just had to adjust the generator to inject the appropriate scripting, and now it just works across all commands in the pipeline.

Additional functions are also available that tweak how a step is created. For example, with bk.AnnotatedCmd one can indicate that a step will generate annotations by writing to ./annotations - a wrapper script is configured to make sure these annotations get picked up and uploaded via Buildkite’s API:

// AnnotatedCmd runs the given command, picks up files left in the `./annotations`
// directory, and appends them to a shared annotation for this job. For example, to
// generate an annotation file on error:
//
//	if [ $EXIT_CODE -ne 0 ]; then
//		echo -e "$OUT" >./annotations/shfmt
//	fi
//
// Annotations can be formatted based on file extensions, for example:
//
//  - './annotations/Job log.md' will have its contents appended as markdown
//  - './annotations/shfmt' will have its contents formatted as terminal output on append
//
// Please be considerate when generating annotations, since they can cause a lot of
// visual clutter in the Buildkite UI. When creating annotations:
//
//  - keep them concise and short, to minimize the space they take up
//  - ensure they are actionable: an annotation should enable you, the CI user, to know
//    where to go and what to do next.
func AnnotatedCmd(command string, opts AnnotatedCmdOpts) StepOpt {
    var annotateOpts string
    // ... set up options
    // './an' is a script that runs the given command and uploads the exported annotations
    // with the given annotation options before exiting.
    annotatedCmd := fmt.Sprintf("./an %q %q %q",
        tracedCmd(command), fmt.Sprintf("%v", opts.IncludeNames), strings.TrimSpace(annotateOpts))
    return RawCmd(annotatedCmd)
}

The author of a pipeline step can then easily opt in to having their annotations uploaded by changing bk.Cmd(...) to bk.AnnotatedCmd(...). This allows all steps to easily create annotations by simply writing content to a file, and get them uploaded, formatted, and grouped nicely without having to learn the specifics of the Buildkite annotations API:

Annotations can help guide engineers on how to fix build issues.

The usage of iota types for both RunType and DiffType enables us to iterate over available types for some useful features. For example, turning a DiffType into a string gives a useful summary of what is included in the diff:

var allDiffs []string
ForEachDiffType(func(checkDiff Diff) {
    diffName := checkDiff.String()
    if diffName != "" && d.Has(checkDiff) {
        allDiffs = append(allDiffs, diffName)
    }
})
return strings.Join(allDiffs, ", ")

We can take that a bit further and iterate over all our run types and diff types to generate a reference page of what each pipeline does - since this page gets committed, it is also a good way to visualise changes to generated pipelines caused by code changes!

// Generate each diff type for pull requests
changed.ForEachDiffType(func(diff changed.Diff) {
    pipeline, err := ci.GeneratePipeline(ci.Config{
        RunType: runtype.PullRequest,
        Diff:    diff,
    })
    if err != nil {
        log.Fatalf("Generating pipeline for diff type %q: %s", diff, err)
    }
    fmt.Fprintf(w, "\n- Pipeline for `%s` changes:\n", diff)
    for _, raw := range pipeline.Steps {
        printStepSummary(w, "  ", raw)
    }
})

// For the other run types, we can also generate detailed information about what
// conditions trigger each run type!
for rt := runtype.PullRequest + 1; rt < runtype.None; rt += 1 {
    m := rt.Matcher()
    if m.Branch != "" {
        matchName := fmt.Sprintf("`%s`", m.Branch)
        if m.BranchRegexp {
            matchName += " (regexp match)"
        } else if m.BranchExact {
            matchName += " (exact match)"
        }
        conditions = append(conditions, fmt.Sprintf("branches matching %s", matchName))
        if m.BranchArgumentRequired {
            conditions = append(conditions, "requires a branch argument in the second branch path segment")
        }
    }
    if m.TagPrefix != "" {
        conditions = append(conditions, fmt.Sprintf("tags starting with `%s`", m.TagPrefix))
    }
    // etc.
}
You can also check out the docs generation code directly!

Taking this even further, with run type requirements available we can also integrate run types into other tooling - for example, our developer tool sg can help you create builds of various run types from a command like sg ci build docker-images-patch to build a Docker image for a specific service:

// Detect what run-type someone might be trying to build
rt := runtype.Compute("", fmt.Sprintf("%s/%s", args[0], branch), nil)
// From the detected matcher, we can see if an argument is required and request it
m := rt.Matcher()
if m.BranchArgumentRequired {
    var branchArg string
    if len(args) >= 2 {
        branchArg = args[1]
    } else {
        branchArg, err = open.Prompt("Enter your argument input:")
        if err != nil {
            return err
        }
    }
    branch = fmt.Sprintf("%s/%s", branchArg, branch)
}
// Push to the branch required to trigger a build
branch = fmt.Sprintf("%s%s", rt.Matcher().Branch, branch)
gitArgs := []string{"push", "origin", fmt.Sprintf("%s:refs/heads/%s", commit, branch)}
if *ciBuildForcePushFlag {
    gitArgs = append(gitArgs, "--force")
}
run.GitCmd(gitArgs...)
// Query Buildkite API to get the created build
// ...

Using a similar iteration over the available run types we can also provide tooltips that automatically list out all the supported run types that can be created this way:

Check out the sg ci build source code directly, or the discussion behind the inception of this feature.

So now we have generated pipelines, documentation about them, the capability to extend pipeline specifications with additional features like tracing, and tooling that is integrated and automatically kept in sync with pipeline specifications - all derived from a single source of truth!

You can check out the pipeline generator source code here.


Wrap-up

The generator approach has helped us build a low-maintenance and reliable ecosystem around parts of our infrastructure. Tailor-making such an ecosystem is a non-trivial investment at first, but as an organization grows and business needs become more specific, the investment pays off by making systems easy to learn, use, extend, integrate, validate, and more.

Also, it’s a lot of fun!

]]>
robert
Mirroring GitHub permissions at scale2021-10-08T00:00:00+00:002021-10-08T00:00:00+00:00https://bobheadxi.dev/mirroring-github-permissions-at-scaleAs a tool for searching over all your code, accurately mirroring repository permissions defined in the relevant code hosts is a core part of Sourcegraph’s functionality. Typically, the only way to do this is through the APIs of code hosts, where rate limits can mean it takes several weeks to work through a large number of users and repositories.

This article goes over some of the work I did on improving GitHub permissions mirroring at Sourcegraph, with the help of several co-workers - primarily Joe Chen (who wrote most of Sourcegraph’s original permissions mirroring code and helped me get up to speed - and is also the author of some big open-source projects like gogs/gogs and go-ini/ini) and Ben Gordon (who helped a ton on the customer-facing side of things).

GitHub rate limits

The GitHub API has a base rate limit of 5000 requests an hour. Let’s look at what it takes to provide access lists for a user: with page size limits of 100 items per page, iterating over all users can take up to the following number of requests, all of which should ideally fall under the rate limit constraints:

\[\dfrac{\text{users} \times \text{repositories}}{100} < 5000\]

This means that we will need $\text{users} \times \text{repositories}$ to be greater than 500000 to hit rate limiting.

To come up with a hopefully representative example for this post, I found a random article that claims some companies are hiring upwards of 3000 to 5000 developers, so let’s consider a case of 4000 developers and 5000 repositories (Microsoft has about 4.5k public repos alone, not including anything private or hosted in different organizations), and we get the following time to sync:

\[\left(\dfrac{\text{4000} \times \text{5000}}{100} \times 2 \right) / 5000 = 80 \text{ hours}\]

Three days is okay, but definitely encroaching into the territory of “cannot be done in a weekend”. In practice, implementation details mean that realistically we will consume far more requests than this, since we currently perform several types of sync1, so the process will likely take longer than 80 hours.

The time to sync increases dramatically for even larger numbers of users and repositories - such as one customer that was projected to take upwards of an entire month to perform a full sync. Imagine paying thousands of dollars for a software product, only to have it unusable for the first month! Excessive rate limiting also means that permissions are far more likely to go stale, and can cause issues with other parts of Sourcegraph that also leverage GitHub APIs. The issue became a blocker for this particular customer, so we had to devise a solution to this issue.

Sourcegraph and repository authorization

I got my first hands-on experience with Sourcegraph’s authorization providers when expanding p4 protect support for the Perforce integration.

In a nutshell, Sourcegraph internally defines an interface authorization providers can implement to provide access lists for users (user-centric permissions) and repositories (repo-centric permissions) - authz.Provider - to populate a single source-of-truth table for permissions. This happens continuously and passively in the background. The populated table is then queried by various code paths that use the data to decide what content can and cannot be shown to a user.

Sourcegraph's repository permissions sync state indicator shows when the last sync occurred. Site administrators can also trigger a manual sync.

⚠️ Update: Since the writing of this post, I’ve contributed an improved and more in-depth description of how permissions sync works in Sourcegraph, if you are interested in a better overview: Repository permissions - Background permissions syncing.


For something like Perforce, user-centric sync is as simple as building a list of patterns from the Perforce protections table that work with PostgreSQL’s SIMILAR TO operator, like so:

// For the following p4 protect:
//    open user alice * //Sourcegraph/Engineering/.../Frontend/...
//    open user alice * //Sourcegraph/.../Handbook/...
// FetchUserPerms would return:
repos := []extsvc.RepoID{
    "//Sourcegraph/Engineering/%/Frontend/%",
    "//Sourcegraph/%/Handbook/%",
}

Repo-centric sync is left unimplemented in this case.

For GitHub, we query for all private repositories a user can explicitly access via their OAuth token, and return a list in a similar manner:

hasNextPage := true
for page := 1; hasNextPage; page++ {
	var err error
	var repos []*github.Repository
	repos, hasNextPage, _, err = client.ListAffiliatedRepositories(ctx, github.VisibilityPrivate, page, affiliations...)
	if err != nil {
		return perms, errors.Wrap(err, "list repos for user")
	}
	for _, r := range repos {
		addRepoToUserPerms(extsvc.RepoID(r.ID))
	}
}

Note that for public repositories, Sourcegraph simply doesn’t enforce permissions, so authorization only needs to care about explicit permissions.

The above is where we bump into GitHub’s rate limits easily - in an organization with 5000 repositories, that’s up to 50 API requests for each and every user to page through all their repositories. The GitHub authorization implementation also does the same thing for repo-centric permissions by listing all users with access to each repository.

Introducing a cache

Caches don’t solve all problems, but in this case there was an opportunity to save significant amounts of work through caching. GitHub repository permissions at companies are typically distributed through teams and organizations - membership in either grants you access to relevant repositories, and teams are strict subsets of organizations. There are still instances of direct permissions - where a user is explicitly added to a repository - but it is unlikely to find repositories with thousands of users added explicitly.

This means that in the vast majority of cases, when querying for user Foo’s repositories, we are actually asking what teams and organizations Foo is in. At a high level, we could do the following instead:

  1. Get Foo’s direct repository affiliations
  2. Get the organizations Foo is in
    1. Get the teams a user is in within this organization
  3. For each organization and team:
    1. If the organization allows read permissions on all repositories, or Foo is an organization administrator, get all organization repositories from the cache as part of Foo’s access list
    2. Get all team repositories from the cache as part of Foo’s access list

Cache misses would prompt a new query to GitHub to mirror access lists for specific teams and organizations. In the best-case scenario, where all users are part of large teams and organizations and there are very few instances of being directly granted access to a repository, cache hits should be very frequent and greatly reduce the amount of work required. Going back to the earlier example of 4000 developers and 5000 repositories, we get a best case performance of:

\[\dfrac{(\text{teams} + \text{organizations}) \times \text{5000}}{100} = (\text{teams} + \text{organizations}) \times 50\]

Even if we had 100 teams and organizations, this would fall under the hourly rate limit - a huge improvement from the previously projected 80 hours. Even in the worst case, this would only be marginally less efficient than the existing implementation.

To mitigate outdated caches, a flag was added to the provider interface to allow partial cache invalidation along the path of a sync (important because you don’t want every single team and organization queued for a sync all at once), tying it into the various ways of triggering a sync (notably webhook receivers and the API).

The approach was promising, and a feature-flagged2 user-centric sync backed by a Redis cache was implemented in sourcegraph#23978 authz/github: user-centric perms sync from team/org perms caches.

Two-way sync

As mentioned earlier, Sourcegraph’s authorization providers provide two-way sync: user-centric and repo-centric. To make the cache-backed sync complete, equivalent functionality had to be implemented for repo-centric sync.

Because GitHub organizations are conveniently supersets of teams (unlike some code hosts), user-centric cache was implemented with either organization or organization/team as keys and a big list of repositories as its value:

org/team: {
    repos: [repo-foo, repo-bar]
}

To make this cache work both ways, I simply added users to the cache values, and implemented a similar approach to finding a repository’s relevant organizations and teams. In this case, a relevant organization would be one that has default-read access (otherwise members of an organization do not necessarily have access to said repository).

This makes for somewhat large cache values, but also makes it easy to perform partial cache updates. For example, if user user-foo is created and added to org/team, the user can be added to the cache for org/team during user-centric sync, and subsequent syncs of repo-foo and repo-bar will include the new user without having to perform a full sync, and vice versa.

org/team: {
    repos: [repo-foo, repo-bar]
    users: [user-bar, user-foo]
}

On paper, the performance improvements gained here are similar to the ones when implementing caching for user-centric sync, except scaling off the number of users in teams and organizations instead of repositories.

This was implemented in sourcegraph#24328 authz/github: repo-centric perms sync from team/org perms caches.

Scaling in practice

Throughout the implementation of the cache-backed GitHub permissions mirroring, a large number of unit tests were included, as well as a few integration tests, that tested the behaviour of various combinations of cache hits and misses.

To write integration tests, we use “golden testing”, where we record network interactions to a file (called “VCRs”). Tests then use the recorded network interactions instead of reaching out to external services by default, unless explicitly asked to update the recordings. Interestingly, despite the significant improvements of this approach for larger numbers of users and repositories, this also made clear just how inefficient the cache-based approach is for smaller instances:

  • with caching disabled, the integration test recorded just 2 network requests for repo-centric sync
  • with caching enabled, the integration test recorded a whopping 22 network requests for repo-centric sync with the exact same number of repositories and users

This is why we continue to leave the cache-backed sync as an opt-in behaviour.

However, despite reasonably robust testing of the behaviour of the code, we had no way to easily perform an end-to-end test at the scale of thousands of repositories and users with the appropriate teams and organizations. In hindsight, I could have invested some effort into generating VCRs to emulate such an environment and test against it, but with the agreement of the customer requesting this work, the decision was made to ship the changes and ask them to try it out.

Debug logging

All was well at first in the trial run - the backlog of repositories queued for an initial permissions sync was being worked through very rapidly, with a projected 3-day time to full sync - a huge improvement from the previously projected 30 days. However, with just a few thousand repositories left to process, the sync stalled.

Metrics indicated jobs were timing out, and a look at the logs revealed thousands upon thousands of lines of random comma-delimited numbers. It seemed that printing all this junk was causing the service to stall, and sure enough, setting the log driver to none to disable all output on the relevant service allowed the sync to proceed.

Where did the log come from? I had left a stray log.Printf("%+v\n", group) in there while debugging cache entries. At scale, these groups could contain many thousands of entries, causing the system to degrade. Be careful what you log!
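One cheap hedge against this class of problem is to truncate large values before logging them. A minimal sketch (not the logging approach Sourcegraph actually uses; `logSample` is a hypothetical helper):

```go
package main

import "fmt"

// logSample renders at most max elements of a slice for logging, noting how
// many were elided - cheap insurance against dumping thousands of entries
// into the log output.
func logSample(vals []int, max int) string {
	if len(vals) <= max {
		return fmt.Sprintf("%v", vals)
	}
	return fmt.Sprintf("%v ... (+%d more)", vals[:max], len(vals)-max)
}

func main() {
	big := make([]int, 10000)
	for i := range big {
		big[i] = i
	}
	fmt.Println(logSample(big, 3)) // a few elements plus a count, not 10000 lines
}
```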

Postgres parameter limits

A service we call repo-updater has an internal component called PermsSyncer that manages a queue of jobs requesting updated access lists from these authorization providers. Jobs are queued for users and repositories based on a variety of heuristics, such as permissions age, as well as on events like webhooks and repository visits (diagram). Access lists returned by authorization providers are upserted into a single repo_permissions table that is the source of truth for all repositories a Sourcegraph user can access, and vice versa.

Entries can also be upserted into a table called repo_pending_permissions, which is home to permissions that do not yet have a Sourcegraph user attached. When a user logs in to Sourcegraph via a code host's OAuth mechanism, the user's Sourcegraph identity is attached to their identity on that code host (this allows a Sourcegraph user to be associated with multiple code hosts), and relevant entries in repo_pending_permissions are "granted" to the user.

This means that once the massive number of repositories in the trial run was fully mirrored from GitHub, a user attempting to log in could have a huge set of pending permissions granted to it all at once. Of course, this broke with a fun-looking error:

execute upsert repo permissions batch query: extended protocol limited to 65535 parameters

I was able to reproduce this in an integration test of the relevant query by generating a set of 17000 entries:

{
	name:     postgresParameterLimitTest,
	updates: func() []*authz.UserPermissions {
		user := &authz.UserPermissions{
			UserID: 1,
			Perm:   authz.Read,
			IDs:    toBitmap(),
		}
		for i := 1; i <= 17000; i += 1 {
			user.IDs.Add(uint32(i))
		}
		return []*authz.UserPermissions{user}
	}(),
	expectUserPerms: func() map[int32][]uint32 {
		repos := make([]uint32, 17000)
		for i := 1; i <= 17000; i += 1 {
			repos[i-1] = uint32(i)
		}
		return map[int32][]uint32{1: repos}
	}(),
	expectRepoPerms: func() map[int32][]uint32 {
		repos := make(map[int32][]uint32, 17000)
		for i := 1; i <= 17000; i += 1 {
			repos[int32(i)] = []uint32{1}
		}
		return repos
	}(),
},

This would break because we were performing an insert of 4 values per row, and at 17000 rows we reach 68000 parameters bound to a query. Postgres uses Int16 codes to denote bind variables, which caps a query at $2^{16} - 1 = 65535$ parameters (hence the seemingly magic number in the error).

INSERT INTO repo_permissions 
	(repo_id, permission, user_ids_ints, updated_at)
VALUES
	%s
ON CONFLICT ON CONSTRAINT
	/* ... */
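The limit translates directly into a maximum batch size: with 4 bound parameters per row, the largest page that fits is $\lfloor 65535 / 4 \rfloor = 16383$ rows. A quick check (a small worked calculation, with a hypothetical `maxRowsPerPage` helper):

```go
package main

import "fmt"

// maxParams is the Postgres extended protocol's bind-parameter limit,
// imposed by the Int16 parameter-count field.
const maxParams = 65535

// maxRowsPerPage computes the largest insert page that stays under the limit
// for a given number of bound parameters per row.
func maxRowsPerPage(paramsPerRow int) int {
	return maxParams / paramsPerRow
}

func main() {
	fmt.Println(maxRowsPerPage(4))   // largest page for a 4-column insert
	fmt.Println(17000*4 > maxParams) // why 17000 rows broke: 68000 params
}
```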

Funnily enough, you can get around this by providing columns as arrays: if you can provide each of the 4 columns here as an array, that would only count as 4 parameters, allowing this insert to scale indefinitely!

Sadly, one of the columns here is of type INT[]. When I attempted to perform an UNNEST on an INT[][], it completely unwrapped the array instead of just unwrapping it by a single dimension like one might expect:

SELECT * FROM unnest(ARRAY['hello','world']::TEXT[], ARRAY[[1,2],[3,4]]::INT[][])

Frustratingly returns:

| unnest | unnest |
|--------|--------|
| hello  | 1      |
| world  | 2      |
|        | 3      |
|        | 4      |

When the desired result was just a one-dimensional unwrapping:

| unnest | unnest |
|--------|--------|
| hello  | [1, 2] |
| world  | [3, 4] |

I briefly toyed with the idea of hacking around this by combining the array type as a single string and splitting it on the fly:

SELECT
 a,
 string_to_array(b,',')::INT[]
FROM
 unnest(ARRAY['hello','world']::TEXT[], ARRAY['1,2,3','4,5,6']::TEXT[]) AS t(a, b)

An EXPLAIN ANALYZE on the 5000-row sample query that didn’t hit the parameter limit, however, indicated that the performance of this was about 5x worse than before (with a cost of 337.51, compared to the previous cost of 62.50). It was also a bit of a dirty hack anyway, so I ended up resorting to simply paging the insert instead to avoid hitting the parameter limit. This was implemented in sourcegraph#24852 database: page upsertRepoPermissionsBatchQuery.

However, it seemed that this was not the only instance of us exceeding the parameter limits. Another query was running into a similar issue on a different customer instance. This time, there were no array types in the values being inserted, so I was able to try out the insert-as-arrays workaround:

INSERT INTO user_pending_permissions 
  (service_type, service_id, bind_id, permission, object_type, updated_at) 
- VALUES
- %s
+ (
+ (
+   SELECT %s::TEXT, %s::TEXT, UNNEST(%s::TEXT[]), %s::TEXT, %s::TEXT, %s::TIMESTAMPTZ 
+ )
ON CONFLICT ON CONSTRAINT
  /* ... */

This implementation of the query was slower for smaller cases, but for larger datasets was either on par or faster than the original query:

| Case   | Accounts | Cost         | Clock        | Comparison  |
|--------|----------|--------------|--------------|-------------|
| Before | 100      | 0.00..1.75   | 287.071 ms   |             |
| After  | 100      | 0.02..1.51   | 430.941 ms   | ~50% slower |
| Before | 5000     | 0.00..87.50  | 7199.440 ms  |             |
| After  | 5000     | 0.02..75.02  | 7218.860 ms  | ~same       |
| Before | 10000    | 0.00..175.00 | 16858.613 ms |             |
| After  | 10000    | 0.02..150.01 | 14566.492 ms | ~13% faster |
| Before | 15000    | fail         | fail         |             |
| After  | 15000    | 0.02..225.01 | 22938.112 ms | success     |

I originally had the function decide which query to use based on the size of the insert, but during code review it was recommended that we just stick to one implementation for simplicity, since permissions mirroring happens asynchronously and is not particularly latency-sensitive.

This was implemented in sourcegraph#24972 database: provide upsertUserPendingPermissionsBatchQuery insert values as array.

Results

After working through the issues mentioned in this article as well as a variety of other minor fixes, the customer was finally able to run a full permissions mirror to completion with everything working as expected. The final result was roughly 2.5 days to full sync, a more than 10x improvement to the previously projected 30 days. The improved performance unblocked the customer in question on this front and will hopefully open the door for Sourcegraph to function fully in even larger environments in the future!


  1. See Two-way sync. 

  2. Well, admittedly, it was only feature-flagged to off by default in a follow-up PR when I realised this required additional authentication scopes we do not request by default against the GitHub API (in order to query organizations and teams). 

robert
June 2021 updates for bobheadxi.dev
2021-06-20T00:00:00+00:00
https://bobheadxi.dev/introducing-dark-mode

With dark mode on every website nowadays, my website seems to have fallen a bit behind the times. I decided it was about time to give my website a bit of a facelift - and over-hype it with a blog post!

This round of improvements didn't strictly happen this month, but a lot of it was spurred on by my recent reading of the iA Design Blog. I think their website is absolutely gorgeous, and it made the lacklustre state of bobheadxi.dev all the more apparent.

For the unfamiliar, my site started off over 2 years ago with the indigo Jekyll theme. I have since made quite a number of changes to it, mostly in random spurts of effort, and started writing about these periods of changes last year.

I quite like how things turned out for this set of changes - hope you do as well!

Refinements

Updated typography

A big part of bobheadxi.dev is my blog posts, even though I’m unsure how many people read them (Google Analytics indicates a lot of traffic, particularly on my really old Object Casting in Javascript post). Anyway, I’ve always been rather dissatisfied with the reading experience on my site, but could never quite put my finger on what exactly was wrong with it.

All I knew was that I didn't like the previous fonts - Helvetica Neue - but until I started using iA Writer recently, I didn't have much of an inkling of what font I would like.

iA Writer uses these gorgeous fonts - aptly named Mono, Duo, and Quattro - that I think look so nice when typing and reading. They have a neat blog post introducing these fonts, and while I'm not really sure what this stuff means, I decided to make the switch.

This site now uses Quattro as its serif font, and Mono as its monospaced font. I think the results are quite nice.

Outdented heading anchors

While editing in iA Writer, headings get nicely outdented ‘#’s like so:

When I started thinking about it, I’m pretty sure this is a very common style in many websites already. Either way, I quite like how it looks, so I tried to replicate it on my site. I currently generate somewhat similar-looking (but not outdented) anchor links using allejo/jekyll-anchor-headings, which allows a little bit of customization - I can give the anchor link elements a class, for example, and style it through that.

<div class="post-content">
    {% include anchor_headings.html html=content anchorBody='#' anchorClass='heading-anchor' beforeHeading=true %}
</div>

Turns out the outdenting can be achieved using the handy translateX transformation, and a bit of @media helps me scale this effect for smaller screens (where outdenting could position the anchors very close to the edge of your screen).

h1, h2, h3, h4
	// ... some CSS
	> .heading-anchor
		position: absolute
		transform: translateX(-2rem)
		@media #{$tablet}, #{$mobile}
			position: inherit
			transform: none

Sadly, I wasn’t able to figure out a nontrivial way to have the number of ‘#’s correspond to the depth of the heading, but I figured this was close enough, and is definitely an improves the look of headings (in my opinion).

Bold introductions

Some books and blogs get big first letters for the first paragraph of a chapter or article. The effect looks nice on books, but I was never really sold on its usage in blog posts - though the look of an emphasised introduction is certainly striking. As I browsed through iA Design Blog, I noticed that their first paragraphs were big, and it made each essay feel much more compelling.

However, as I went about considering different options for making my intros real big as well, I realised a lot of my introductory paragraphs were complete garbage. While sometimes that was the intent - leading with a tangent before diving into the article’s main topic - they definitely did not age well.

So perhaps a fortunate side effect is that this prompted me to go back through my posts and make the bare minimum effort to make them a bit more interesting. At least I look like I know what I’m talking about now!

Exciting listings

I just learned about Jekyll’s post.excerpt feature that gives you the first paragraph of a blog post. Again inspired by the iA Design Blog, which uses excerpts instead of custom descriptions to great effect, I decided to use them here as well.

I think this gives a far better preview into the content of each post, and kind of makes them look more important. Thankfully my updating of each post’s first paragraphs to accommodate bigger introductions meant that the excerpts are at least somewhat meaningful.

I also made minor improvements such as adding an on-hover effect to the clickable tags, which previously had no indication they were clickable.

The big picture

I like to include all sorts of media in my blog posts - images, code snippets, diagrams, quotes, and more. Unfortunately, I also like somewhat narrow widths for my content, which makes for a poor viewing experience for various forms of media.

On articles in the Sourcegraph Blog (and I recall that you can do this on Medium as well), I noticed that images were “blown up” - wider than the content - and I thought the effect looked quite nice, giving an expansive canvas for media to be enjoyed while still maintaining a nice reading experience for all the other stuff.

To do this myself, I turned images I wanted to be blown up into <figure> elements, and gave them expanded widths, along with <figcaption>. This also served nicely to standardise the raw HTML I’d been previously using to give images captions.

Big!!!!

Code blocks ran into similar problems: snippets I didn't carefully adjust to adhere to an 80-character line limit would have to be scrolled to be viewed, even on very wide screens. So I made them massive.

I’ve also always liked the big quotes used in magazine and newspaper sites to give quotes an even more authoritative and dramatic feel - so quotes joined the big club.

Mermaid diagrams and some other things I might have forgotten also got this treatment. Hopefully these changes make the reading experience more exciting!

Dark mode

And last but not least, the star of today’s show… dark mode! Because no site is complete without one.

The site now switches to dark mode if you have dark mode enabled on your device!

Luckily for me, the theme my site was based on made decent use of SASS variables for colours (though the naming of the colours left quite a bit to be desired, as you’ll see in a moment).

I found to my dismay that because these variables are compiled away at build time, they cannot be used to respond to prefers-color-scheme: dark, which seems to be the standard way to detect what theme you should show to the user.

Instead, I found some blog posts talking about CSS variables, which turns out to be the only way to have properly variable variables in stylesheets. To be honest this is the first time I’ve had to do something like this myself, and this was news to me!

My implementation ended up pretty straightforward, using universal selectors and setting the theme in JavaScript, though I'm sure there are other ways to do this too (maybe even JavaScript-free?).

[data-theme="theme-light"]
    --background: #ffffff
    --alpha: #333
    --beta: #222
    --gama: #aaa
    --delta: #5A85F3
    --epsilon: #ededed
    --zeta: #666

[data-theme="theme-dark"]
    --background: #141414
    --alpha: #aaa
    --beta: #eeeeee
    --gama: #474747
    --delta: #5A85F3
    --epsilon: #202020
    --zeta: #929292

var prefersDark = false;
function setDarkMode(isDark) {
    const theme = `theme-${isDark ? 'dark' : 'light'}`;
    document.querySelector('html').dataset.theme = theme;
    prefersDark = isDark;
    console.log(`Set ${theme}`);
}

// set the initial theme
const prefersDarkMatch = window.matchMedia('(prefers-color-scheme: dark)');
setDarkMode(prefersDarkMatch.matches);

// watch for changes to the user's dark mode configuration
prefersDarkMatch.addEventListener('change', (e) => setDarkMode(e.matches));

Having the setDarkMode function available is useful for development, allowing me to switch between the modes via console, and I added the prefersDark variable… just because, I guess. Maybe handy if I want to add a button to toggle dark mode?
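If I ever do add that toggle, it could be as small as flipping the stored preference and reusing setDarkMode. A hypothetical sketch (the pure nextTheme helper is mine, not something currently on the site):

```javascript
// Pure helper: compute the next theme name from the current preference.
function nextTheme(prefersDark) {
  return prefersDark ? 'theme-light' : 'theme-dark';
}

// A toggle button would then just flip the flag and reuse setDarkMode:
//   toggleButton.addEventListener('click', () => setDarkMode(!prefersDark));
console.log(nextTheme(true));  // currently dark -> switch to light
console.log(nextTheme(false)); // currently light -> switch to dark
```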

In the end, despite picking the colours semi-randomly and not making an awful lot of adjustments, I'm pretty happy with how this (in my opinion) quick effort turned out! I'm particularly pleased with how the blog listings look:

Up next

There are still a lot of issues with dark mode - most noticeably the company logos I’m using that don’t have transparent backgrounds, but also a few contrast issues in code highlighting.

There also seems to be an issue with the tags page where posts from different collections do not get included, which I definitely want to fix now that interaction with tags is more prominent.

I recently wrote a newsletter featuring a ludicrous number of footnotes, and at some point I want to get Tufte “sidenotes” here so that I can abuse footnotes in my blog posts as well. Sadly, I haven’t found a particularly elegant solution to this, so I’m putting it off for the time being.

And, of course, I’m hoping to do more blog-writing as well.

That's all for now - feel free to highlight anything on this post if you have comments or questions!

robert