This repo contains the research artifact (demo) for our paper "Pilot Execution: Simulating Failure Recovery In Situ for Production Distributed Systems" — NSDI'26.
PILOT is a tool that enables operators to safely "dry-run" recovery actions on a live production system before committing to them. It instruments the target system's bytecode to support pilot execution—an isolated trial of a recovery procedure that runs alongside the real system.
If the pilot run succeeds, the operator proceeds with the real recovery. If it fails, PILOT provides detailed feedback (error information and a context tree) so the operator can diagnose the issue and adjust the recovery strategy before committing anything to production.
PILOT works in two phases:
- Offline (Instrumentation): A static bytecode transformer instruments the target system to support pilot execution.
- Online (Pilot Run): The operator triggers pilot execution of their intended recovery. PILOT reports success or failure and produces a context tree for debugging.
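The online phase amounts to a try-before-commit loop for the operator. The following is a minimal sketch of that control flow; the function names are ours for illustration, not the artifact's real commands:

```shell
#!/bin/sh
# Hypothetical sketch of the PILOT operator loop.
# These are illustrative stubs, not the artifact's actual scripts.
run_pilot()    { echo "pilot: trying recovery in isolation"; return 1; }  # stub: fails
run_recovery() { echo "recovery: committed to production"; }
inspect_tree() { echo "inspect: context tree for debugging"; }

if run_pilot; then
    run_recovery         # pilot succeeded: commit the real recovery
else
    inspect_tree         # pilot failed: diagnose, adjust, retry the pilot
fi
```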
| Component | Description | Path |
|---|---|---|
| PILOT Instrumentation Engine | Transforms bytecode of target systems to enable pilot execution | PilotExecution/src |
| PILOT Runtime Library | Manages pilot execution at runtime | PilotExecution/Pilot |
- OS: Ubuntu 20.04 / 22.04
- JDK: OpenJDK 8
- Git: >= 2.16.2
- Apache Maven: >= 3.6.3
- Apache Ant: >= 1.10.9
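To sanity-check the version requirements above, a small helper can compare dotted version strings. This script is our illustration, not part of the artifact; it relies on GNU `sort -V`, which is available on Ubuntu 20.04/22.04:

```shell
#!/bin/sh
# version_ge A B: succeeds if dotted version A >= version B.
# Uses GNU `sort -V` (version sort), present on Ubuntu 20.04/22.04.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | tail -n1)" = "$1" ]
}

# Example: check the installed Maven against the required 3.6.3.
mvn_ver=$(mvn -version 2>/dev/null | awk 'NR==1 {print $3}')
if [ -n "$mvn_ver" ] && version_ge "$mvn_ver" 3.6.3; then
    echo "Maven $mvn_ver OK"
else
    echo "Maven missing or too old (need >= 3.6.3)"
fi
```

The same `version_ge` check applies to the Git and Ant minimums.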
> [!TIP]
> We highly recommend using our CloudLab profile, which provisions a ready-to-use three-node cluster with all dependencies pre-installed.
We provide a CloudLab profile that automatically starts a three-node cluster (node0, node1, node2) with all dependencies pre-configured.
- Instantiate the profile via this link: PILOT-cloudlab. Keep hitting Next to create the experiment.
- Access the machines via SSH from the CloudLab Web UI.
On all nodes (node0, node1, node2):
```bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/.bashrc
export PATH="/usr/lib/jvm/java-8-openjdk-amd64/bin:$PATH"
echo 'export PATH="/usr/lib/jvm/java-8-openjdk-amd64/bin:$PATH"' >> ~/.bashrc
```

Verify:

```bash
echo $JAVA_HOME
# Expected: /usr/lib/jvm/java-8-openjdk-amd64
```

> [!IMPORTANT]
> All remaining steps are performed on node0 only, unless otherwise noted.
```bash
cd ~ && git clone -b main --recurse-submodules https://github.com/LiftLab-UVA/PilotExecution
cd ~/PilotExecution
chmod +x clone_build.sh
./clone_build.sh instrumentationengine
```

You should see `All tasks completed` on success.

```bash
./clone_build.sh runtimelib
```

You should see `BUILD SUCCESS` on success.
PILOT uses a ZooKeeper cluster as a status registry during pilot runs.
```bash
cd ~/PilotExecution/experiments/zookeeper_setup/
./setupzookeeper.sh
```

You should see `ZooKeeper Cluster Deployment Completed` on success.
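To confirm the registry cluster came up healthy, ZooKeeper's four-letter `srvr` command reports each server's mode (leader, follower, or standalone). A small parser for that output might look like this; the helper name is ours, port 2181 is ZooKeeper's default, and note that some deployments restrict four-letter-word commands via an allowlist:

```shell
#!/bin/sh
# Extract the "Mode:" field (leader / follower / standalone) from the
# output of ZooKeeper's four-letter-word command `srvr`.
zk_mode() {
    awk -F': ' '/^Mode:/ {print $2}'
}

# Usage against a live server (requires netcat):
#   echo srvr | nc localhost 2181 | zk_mode
```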
This walkthrough demonstrates PILOT's full workflow using SOLR-17515, a real-world recovery failure in Apache Solr.
Scenario: One node in a Solr cluster fails. The operator attempts recovery by having the failed node sync with an alive node and rejoin the cluster. On the buggy version, this recovery fails with a NullPointerException (NPE) due to an incorrect configuration.
> [!IMPORTANT]
> Complete all Getting Started steps before proceeding. All commands run on node0.
Compiles Solr from source and integrates the PILOT runtime library:
```bash
cd ~/PilotExecution/experiments/solr17515/managesolr
./build_solr.sh
```

You should see `Build and Deploy Completed`.
Instruments Solr's bytecode and repackages it with pilot execution enabled:
```bash
cd ~/PilotExecution/experiments/solr17515
./generate_original.sh && ./instrument_pilot.sh
```

You should see `Update task completed`.
```bash
cd ~/PilotExecution/experiments/solr17515
./reproduce.sh bug
./execute_recovery.sh pilot
```

Because the code is buggy, pilot execution detects the failure and reports an error along with a context tree for debugging:
```
========================================
PILOT execution results
========================================
Error detected
... java.lang.NullPointerException ...
========================================
context tree is
========================================
[pilotexecution] ROOT (tid=0, ts=0)
├── [node1] getCore$instrumentation (tid=64, ts=...)
│   └── ...
├── [node1] doRecovery$instrumentation (tid=64, ts=...)
│   └── [node1] run$instrumentation (tid=65, ts=...)
│       └── ...
```
Each node in the context tree is formatted as `[hostname] functionName (threadId, timestamp)`.
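When a context tree is large, it can help to pull individual entries apart programmatically. The helper below splits one entry of that format into its fields; it is our illustration (not part of the artifact) and assumes `tid` and `ts` are numeric:

```shell
#!/bin/sh
# Parse one context-tree entry of the form
#   [hostname] functionName (tid=N, ts=M)
# into space-separated fields: hostname, function, tid, ts.
# Illustrative helper, not part of the artifact; assumes numeric tid/ts.
parse_node() {
    sed -E 's/^.*\[([^]]+)\] ([^ ]+) \(tid=([0-9]+), ts=([0-9]+)\).*$/\1 \2 \3 \4/'
}
```

For example, piping a `[node1] doRecovery$instrumentation (tid=64, ts=...)` line (with a concrete timestamp) through `parse_node` yields the hostname, function, thread id, and timestamp on one line.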
The root cause of SOLR-17515 is a misconfigured system property that causes an NPE during replica recovery. As described in the issue, SolrCloud users can work around it by unsetting that property on the affected node so that recovery succeeds. For simplicity, we inject the error manually for reproduction; to simulate the workaround, we simply remove the fault marker:

```bash
./tweak_conf.sh
./execute_recovery.sh normal
```

Expected output: `Finished recovery process, successful=[true]`
After the bug was reported, the Solr developers pushed a fix. On the fixed version, pilot execution succeeds without any tweaks.
```bash
cd ~/PilotExecution/experiments/solr17515
./reproduce.sh fix
./execute_recovery.sh pilot
```

This time pilot execution succeeds:

```
========================================
PILOT execution results
========================================
Recovery successful:
... Finished recovery process, successful=[true]
```
All Pilot I/O is redirected to /opt/ShadowDirectory for I/O isolation. To see the redirected files, run:
```bash
cd ~/PilotExecution/experiments
./print_shadow_file.sh
```

The expected output is:

```
/opt/ShadowDirectory
└── opt
    └── SolrData
        ├── filestore
        ├── mycollection_shard1_replica_n1
        │   └── data
        │       └── index
        │           ├── _0.fdm
        │           ├── _0.fdt
        │           └── ...
```
After a successful pilot run, the operator proceeds with the real recovery:
```bash
cd ~/PilotExecution/experiments/solr17515
./execute_recovery.sh normal
```

Expected output: `Finished recovery process, successful=[true]`
