These scripts analyze the predictability of server failures caused by DRAM errors, as well as the factors that affect server failures.
Python 3: please install numpy and pandas:
+ pip3 install -r requirement.txt
- The raw data is stored under ../data/
- Run the analysis with: python3 measurement.py ../data/
- The results will be stored under ./result/
overall_distribution: outputs the percentage of servers with CEs and the percentage of servers with server failures per month (Finding 1).
- Results: overall_distribution.txt
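As an illustration of the kind of per-month percentage reported here, the following is a minimal sketch and not the code in measurement.py. It assumes a hypothetical event table with `server_id` and `timestamp` columns and a known total server count; the real input schema may differ.

```python
import pandas as pd


def monthly_percentage(events: pd.DataFrame, total_servers: int) -> pd.Series:
    """Percentage of servers with at least one event in each calendar month.

    `events` is a hypothetical table with one row per event (for example,
    one row per CE) and the assumed columns `server_id` and `timestamp`.
    """
    # Map every event to its calendar month.
    months = events["timestamp"].dt.to_period("M")
    # Count distinct servers that saw at least one event in each month.
    affected = events.groupby(months)["server_id"].nunique()
    # Express the count as a percentage of the whole fleet.
    return affected * 100.0 / total_servers
```

The same function could be applied once to a CE log and once to a failure log to obtain the two series.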
predictable_analysis: outputs the relative percentage of predictable server failures for different prediction windows (Finding 2).
- Results: predictable_analysis.txt
num_ce_analysis: outputs the average number of CEs for different types of server failures with different prediction windows (Finding 3).
- Results: num_ce_analysis.txt
mtbe_analysis: outputs the median of the mean time between errors (MTBE) per predictable server failure for different types of failures with different prediction windows (Finding 4).
- Results: mtbe_analysis.txt
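To make the "median MTBE" statistic concrete, here is a minimal sketch under stated assumptions. It is not the implementation in measurement.py. It assumes a hypothetical CE log with `server_id` and `timestamp` columns, takes the MTBE of a server to be the mean gap between its consecutive CEs, and then takes the median across servers. Filtering by failure type and prediction window is omitted.

```python
import pandas as pd


def median_mtbe(ce_log: pd.DataFrame) -> float:
    """Median across servers of the mean gap (in seconds) between consecutive CEs.

    `ce_log` is a hypothetical table with one row per CE and the assumed
    columns `server_id` and `timestamp`.
    """
    ce_log = ce_log.sort_values(["server_id", "timestamp"])
    # Gap between each CE and the previous CE on the same server.
    gaps = ce_log.groupby("server_id")["timestamp"].diff().dt.total_seconds()
    # Mean gap per server; a server with a single CE has no gap and is dropped.
    per_server = gaps.groupby(ce_log["server_id"]).mean().dropna()
    return float(per_server.median())
```

Taking the median across servers keeps a few servers with very long or very short gaps from dominating the summary.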
frac_failure_per_component: outputs the relative fraction of predictable server failures that are associated with different memory subsystem component failures, for different types of failures, when the prediction window is five minutes (Finding 5).
- Results: frac_failure_per_component_5m.txt
frac_ce_per_component: outputs the relative fraction of CEs that are associated with different memory subsystem component failures, for different types of failures, when the prediction window is five minutes (Finding 5).
- Results: frac_ce_per_component_5m.txt
hardware_configuration_impact_analysis: outputs the relative percentage of predictable server failures broken down by hardware configuration factors (Findings 6-8).
- Results:
  - DRAM_model_breakdown.txt for DRAM models
  - DIMM_number_breakdown.txt for the number of attached DIMMs per server
  - server_manufacturer_breakdown.txt for server manufacturers
read_scrubbing_analysis: outputs the average number of read errors and scrubbing errors per predictable server failure for different types of server failures when the prediction window is five minutes (Finding 9).
- Results: read_error_mean.txt and scrub_error_mean.txt
hard_soft_analysis: outputs the average number of hard errors and soft errors per predictable server failure for different types of server failures when the prediction window is five minutes (Finding 10).
- Results: hard_erorr_mean.txt and soft_error_mean.txt
Please email Zhinan Cheng ([email protected]) if you have any questions.