This repo contains the research artifact (demo) for our paper "Pilot Execution: Simulating Failure Recovery In Situ for Production Distributed Systems" — NSDI'26.
PILOT is a tool that enables operators to safely "dry-run" recovery actions on a live production system before committing to them. It instruments the target system's bytecode to support pilot execution—an isolated trial of a recovery procedure that runs alongside the real system.
If the pilot run succeeds, the operator proceeds with the real recovery. If it fails, PILOT provides detailed feedback (error information and a context tree) so the operator can diagnose the issue and adjust the recovery strategy before committing anything to production.
PILOT works in two phases:
- Offline (Instrumentation): A static bytecode transformer instruments the target system to support pilot execution.
- Online (Pilot Run): The operator triggers pilot execution of their intended recovery. PILOT reports success or failure and produces a context tree for debugging.
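The online phase amounts to a try-before-commit loop for the operator. The following is a minimal sketch of that control flow; the function names are ours for illustration, not the artifact's real commands:

```shell
#!/bin/sh
# Hypothetical sketch of the PILOT operator loop.
# These are illustrative stubs, not the artifact's actual scripts.
run_pilot()    { echo "pilot: trying recovery in isolation"; return 1; }  # stub: fails
run_recovery() { echo "recovery: committed to production"; }
inspect_tree() { echo "inspect: context tree for debugging"; }

if run_pilot; then
    run_recovery         # pilot succeeded: commit the real recovery
else
    inspect_tree         # pilot failed: diagnose, adjust, retry the pilot
fi
```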
| Component | Description | Path |
|---|---|---|
| PILOT Instrumentation Engine | Transforms bytecode of target systems to enable pilot execution | PilotExecution/src |
| PILOT Runtime Library | Manages pilot execution at runtime | PilotExecution/Pilot |
- OS: Ubuntu 20.04 / 22.04
- JDK: OpenJDK 8
- Git: >= 2.16.2
- Apache Maven: >= 3.6.3
- Apache Ant: >= 1.10.9
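To sanity-check the version requirements above, a small helper can compare dotted version strings. This script is our illustration, not part of the artifact; it relies on GNU `sort -V`, which is available on Ubuntu 20.04/22.04:

```shell
#!/bin/sh
# version_ge A B: succeeds if dotted version A >= version B.
# Uses GNU `sort -V` (version sort), present on Ubuntu 20.04/22.04.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | tail -n1)" = "$1" ]
}

# Example: check the installed Maven against the required 3.6.3.
mvn_ver=$(mvn -version 2>/dev/null | awk 'NR==1 {print $3}')
if [ -n "$mvn_ver" ] && version_ge "$mvn_ver" 3.6.3; then
    echo "Maven $mvn_ver OK"
else
    echo "Maven missing or too old (need >= 3.6.3)"
fi
```

The same `version_ge` check applies to the Git and Ant minimums.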
> [!TIP]
> We highly recommend using our CloudLab profile, which provisions a ready-to-use three-node cluster with all dependencies pre-installed.
We provide a CloudLab profile that automatically starts a three-node cluster (node0, node1, node2) with all dependencies pre-configured.
- Instantiate the profile via this link: PILOT-cloudlab. Keep hitting Next to create the experiment.
- Access the machines via SSH from the CloudLab Web UI.
On all nodes (node0, node1, node2):
```bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/.bashrc
export PATH="/usr/lib/jvm/java-8-openjdk-amd64/bin:$PATH"
echo 'export PATH="/usr/lib/jvm/java-8-openjdk-amd64/bin:$PATH"' >> ~/.bashrc
```

Verify:

```bash
echo $JAVA_HOME
# Expected: /usr/lib/jvm/java-8-openjdk-amd64
```

> [!IMPORTANT]
> All remaining steps are performed on node0 only, unless otherwise noted.
```bash
cd ~ && git clone -b main --recurse-submodules https://github.com/LiftLab-UVA/PilotExecution
cd ~/PilotExecution
chmod +x clone_build.sh
./clone_build.sh instrumentationengine
```

You should see `All tasks completed` on success.

```bash
./clone_build.sh runtimelib
```

You should see `BUILD SUCCESS` on success.
PILOT uses a ZooKeeper cluster as a status registry during pilot runs.
```bash
cd ~/PilotExecution/experiments/zookeeper_setup/
./setupzookeeper.sh
```

You should see `ZooKeeper Cluster Deployment Completed` on success.
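To confirm the registry cluster came up healthy, ZooKeeper's four-letter `srvr` command reports each server's mode (leader, follower, or standalone). A small parser for that output might look like this; the helper name is ours, port 2181 is ZooKeeper's default, and note that some deployments restrict four-letter-word commands via an allowlist:

```shell
#!/bin/sh
# Extract the "Mode:" field (leader / follower / standalone) from the
# output of ZooKeeper's four-letter-word command `srvr`.
zk_mode() {
    awk -F': ' '/^Mode:/ {print $2}'
}

# Usage against a live server (requires netcat):
#   echo srvr | nc localhost 2181 | zk_mode
```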
This walkthrough demonstrates PILOT's full workflow using SOLR-17515, a real-world recovery failure in Apache Solr.
Scenario: One node in a Solr cluster fails. The operator attempts recovery by having the failed node sync with an alive node and rejoin the cluster. On the buggy version, this recovery fails with a NullPointerException (NPE) due to an incorrect configuration.
> [!IMPORTANT]
> Complete all Getting Started steps before proceeding. All commands run on node0.
Compiles Solr from source and integrates the PILOT runtime library:
```bash
cd ~/PilotExecution/experiments/solr17515/managesolr
./build_solr.sh
```

You should see `Build and Deploy Completed`.
Instruments Solr's bytecode and repackages it with pilot execution enabled:
```bash
cd ~/PilotExecution/experiments/solr17515
./generate_original.sh && ./instrument_pilot.sh
```

You should see `Update task completed`.
```bash
cd ~/PilotExecution/experiments/solr17515
./reproduce.sh bug
./execute_recovery.sh pilot
```

Because the code is buggy, pilot execution detects the failure and reports an error along with a context tree for debugging:
```
========================================
PILOT execution results
========================================
Error detected
... java.lang.NullPointerException ...
========================================
context tree is
========================================
[pilotexecution] ROOT (tid=0, ts=0)
├── [node1] getCore$instrumentation (tid=64, ts=...)
│   └── ...
├── [node1] doRecovery$instrumentation (tid=64, ts=...)
│   └── [node1] run$instrumentation (tid=65, ts=...)
│       └── ...
```
Each node in the context tree is formatted as `[hostname] functionName (threadId, timestamp)`.
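When a context tree is large, it can help to pull individual entries apart programmatically. The helper below splits one entry of that format into its fields; it is our illustration (not part of the artifact) and assumes `tid` and `ts` are numeric:

```shell
#!/bin/sh
# Parse one context-tree entry of the form
#   [hostname] functionName (tid=N, ts=M)
# into space-separated fields: hostname, function, tid, ts.
# Illustrative helper, not part of the artifact; assumes numeric tid/ts.
parse_node() {
    sed -E 's/^.*\[([^]]+)\] ([^ ]+) \(tid=([0-9]+), ts=([0-9]+)\).*$/\1 \2 \3 \4/'
}
```

For example, piping a `[node1] doRecovery$instrumentation (tid=64, ts=...)` line (with a concrete timestamp) through `parse_node` yields the hostname, function, thread id, and timestamp on one line.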
The root cause of SOLR-17515 is a misconfigured system property that causes an NPE during replica recovery. As described in the issue, SolrCloud users can work around it by unsetting that property on the affected node so that recovery succeeds. For simplicity, we inject the error manually for reproduction; to simulate the workaround, we simply remove the fault marker:

```bash
./tweak_conf.sh
./execute_recovery.sh normal
```

Expected output: `Finished recovery process, successful=[true]`
After the bug was reported, the Solr developers pushed a fix. On the fixed version, pilot execution succeeds without any tweaks.
```bash
cd ~/PilotExecution/experiments/solr17515
./reproduce.sh fix
./execute_recovery.sh pilot
```

This time pilot execution succeeds:

```
========================================
PILOT execution results
========================================
Recovery successful:
... Finished recovery process, successful=[true]
```
All Pilot I/O is redirected to /opt/ShadowDirectory for I/O isolation. To see the redirected files, run:
```bash
cd ~/PilotExecution/experiments
./print_shadow_file.sh
```

The expected output is:

```
/opt/ShadowDirectory
└── opt
    └── SolrData
        ├── filestore
        ├── mycollection_shard1_replica_n1
        │   └── data
        │       └── index
        │           ├── _0.fdm
        │           ├── _0.fdt
        │           └── ...
```
After a successful pilot run, the operator proceeds with the real recovery:
```bash
cd ~/PilotExecution/experiments/solr17515
./execute_recovery.sh normal
```

Expected output: `Finished recovery process, successful=[true]`
