To install everything that is needed, follow scalable_bo's installation process on ThetaGPU, which you can find here. This builds the conda environment with all the plasma dependencies and the pipeline to run HPO with minimalistic_frnn as the problem (this problem is defined in scalbo/benchmarks/minimalistic_frnn.py, a copy of which is in the hps/ folder).
Each part of this project lives in a different folder:

- `data_study/`: for generating plots to visualize the inputs.
- `hps/`: to run the HPO.
- `training/`: to select the models to train from the HPO results and train them.
- `prediction/`: to perform the trained models' predictions as well as ensemble construction, prediction, and uncertainty quantification.
Each of these is meant to be an experiment space, which is why you can find a `jobs/` folder in it (or in each of its subfolders). For every submission script, be careful to modify the following lines accordingly (this could perhaps be automated with global variables):
#COBALT -O jobs/logs/main
PROJECT=/home/jgouneau/projects/grand/deephyper/frnn
source $PROJECT/repos/scalable-bo/build/activate-dhenv.sh
export PYTHONPATH=$PROJECT/repos/scalable-bo/build/dhenv/lib/python3.8/site-packages/:$PYTHONPATH
The `#COBALT -O` line specifies the relative path of the file in which the job's logs are written; be careful to create the corresponding folders if any appear in the path (for example, here, the `logs/` folder that I always create inside `jobs/`). The two other commands (`source` and `export`) point to the location where the conda environment was built during the installation procedure.
You can also find an additional `plots/` folder for all the plot-generating scripts that were not run on Theta but locally, after downloading the specific data being plotted.
A word on the scripts in the deephyper/plasma-python (tf2) repo that are used and were modified for this project.
plasma/conf_parser.py might be the most important one: it defines the base configuration (which is modified depending on the config at hand during the HPS, training, or predictions), and it specifies the dataset used with these key values:
base_params = {
    'fs_path': '/lus/theta-fs0/projects/fusiondl_aesp/felker',
    'fs_path_output': '/lus/grand/projects/datascience/jgouneau/deephyper/frnn/exp/run_function_tests/temp/',
    'paths': {
        'data': 'd3d_2019', # 'd3d_all', 'jet_0D', 'jet_1D'
        'signal_prepath': ['/signal_data/', '/signal_data_new_nov2019/'], # ['/signal_data/']
- `fs_path` is where the data is located; make sure you have access to this folder of Kyle's.
- `fs_path_output` is where some files generated by the old FRNN pipeline are written; if you only use the minimalistic FRNN pipeline this folder shouldn't even be accessed, but just in case, set it to some temp folder.
- `paths['data']` specifies the dataset used.
- `paths['signal_prepath']` specifies which data from the dataset is used; '/signal_data_new_nov2019/' is only relevant for d3d_2019, and is what allows it to use all the data it has in addition to what's already in d3d_all.
The batch generator is defined in plasma/models/loader.py. This is where the buffer improvement was made: the old batch generator was training_batch_generator_partial_reset, and the new one is training_batch_generator_partial_reset_bis, which has a cycling buffer. The following is a view of the functions and subfunctions called by these that were modified to achieve this:
- `training_batch_generator_partial_reset` (l.451)
  - `return_from_training_buffer` (l.272)
    - `shift_buffer` (l.284)
  - `fill_training_buffer` (l.247)
    - `resize_buffer` (l.289)
- `training_batch_generator_partial_reset_bis` (l.66)
  - `return_from_training_buffer_bis` (l.118)
  - `fill_training_buffer_bis` (l.91)
  - `resize_buffer_bis` (l.129)
The functions to compute the FRNN AUC are in plasma/utils/performances.py. The AUC can be computed with get_roc_area (l.1088), but another custom version was implemented, get_roc_area_dh (l.1095), with an added optional argument, early_pred_counts=True, which lets you choose whether early predictions should be counted as TP or FP. It calls the custom get_metrics_vs_p_thresh_dh (l.146), which also has this option and returns all the stats for all the thresholds that were evaluated, in ascending order (hence the "range" in every variable name): thresh_range, accuracy_range, precision_range, tp_rate_range, fp_rate_range, tp_range, fp_range. It does so by calling the custom function get_shot_prediction_metrics_from_threshold_arrays (l.194) on all the thresholds generated by get_threshold_arrays (l.235).
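For reference, an ROC AUC can be recovered from the tp_rate_range and fp_rate_range arrays with a trapezoidal integration over the ROC points. This is a minimal sketch assuming plain NumPy arrays, not the repo's exact code:

```python
import numpy as np

def roc_auc_from_ranges(fp_rate_range, tp_rate_range):
    """Integrate the ROC curve with the trapezoidal rule.

    The rate arrays are assumed to be the per-threshold values; they
    are sorted by FP rate first so their orientation does not matter.
    """
    fpr = np.asarray(fp_rate_range, dtype=float)
    tpr = np.asarray(tp_rate_range, dtype=float)
    order = np.argsort(fpr)
    fpr, tpr = fpr[order], tpr[order]
    # Trapezoidal rule: segment widths times mean segment heights.
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

# A perfect classifier's ROC passes through (0, 1):
print(roc_auc_from_ranges([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))  # 1.0
```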
We have two subfolders here: visualization/ and shots_study/.
visualization/ contains a script to generate UMAP and t-SNE visualizations of the datasets, but it wasn't really useful. It colors the points using the ground truth, the output of an ensemble's predictions (ensemble/y_pred.npz), its results (TP, FP, etc. with a specified threshold), or its uncertainties (ensemble/uq/alea.npz, ensemble/uq/epis.npz, ensemble/uq/tota.npz). As it wasn't a successful attempt, it is still raw code that is hard to use, and it would be a waste of time to go into the details.
shots_study/ is where we generate the input visualizations. First create a plots/ folder here, in which you can also create scalars/ and profiles/ subfolders; everything happens at the end of the script (l.210 -> l.216):
shot_list = shot_list_test
shot_list.sort()
end = 100
for i in range(min(end, len(shot_list))):
    plot_shot_separated(shot_list[i], i, suffix="png")
    plot_shot(shot_list[i], i, suffix="png")
    plot_shot(shot_list[i], i, suffix="pdf")

You can select the dataset from which to generate the input visualizations with shot_list = shot_list_test, and up to which shot with end = 100. Then you can choose which views are generated:
- `plot_shot()` creates the whole view (with scalars and profiles in the same figure) directly in the `plots/` folder.
- `plot_shot_separated()` creates the scalars and profiles views separately in the corresponding folders `plots/scalars/` and `plots/profiles/` (this was useful for the shared Google Sheet).

You can choose the format in which to save them with the `suffix` argument.
To run this script, execute qsub-gpu jobs/main.sh in the current folder.
The hp_problem is defined in the minimalistic_frnn benchmark from scalbo (l.80->l.96) (not the one given as example in hps/minimalistic_frnn.py):
hp_problem = HpProblem()
hp_problem.add_hyperparameter((32, 256, "log-uniform"), "batch_size", default_value=128)
hp_problem.add_hyperparameter((32, 256, "log-uniform"), "dense_size", default_value=128)
hp_problem.add_hyperparameter((0.0, 1.0), "dense_regularization", default_value=0.001)
hp_problem.add_hyperparameter((0.0, 0.5), "dropout_prob", default_value=0.1)
hp_problem.add_hyperparameter((32, 256, "log-uniform"), "length", default_value=128)
hp_problem.add_hyperparameter(['hinge', 'cross', 'focal', 'balanced_hinge', 'balanced_cross', 'balanced_focal'], "loss", default_value='focal')
hp_problem.add_hyperparameter((1e-7, 1e-2, "log-uniform"), "lr", default_value=2e-5)
hp_problem.add_hyperparameter((0.9, 1.0), "lr_decay", default_value=0.97)
hp_problem.add_hyperparameter((0.9, 1.0), "momentum", default_value=0.9)
hp_problem.add_hyperparameter((32, 256, "log-uniform"), "num_conv_filters", default_value=128)
hp_problem.add_hyperparameter((1, 4), "num_conv_layers", default_value=3)
hp_problem.add_hyperparameter((1, 32, "log-uniform"), "num_epochs", default_value=32)
hp_problem.add_hyperparameter((0.0, 1.0), "regularization", default_value=0.001)
hp_problem.add_hyperparameter((1, 4), "rnn_layers", default_value=2)
hp_problem.add_hyperparameter((32, 256, "log-uniform"), "rnn_size", default_value=200)

as well as the run() function (l.541 -> l.630).
The allocated time for the training of each model is defined in seconds l.585:
timeout_callback = TimeoutCallback(30*60)

The parameters of the search are defined in the submission script (the final one is jobs/minimalistic-frnn-DBO-async-qUCB-qUCB-16-8-42000-42.sh) with scalbo's CLI:
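TimeoutCallback comes from scalbo; purely as an illustration of what such a callback does, here is a minimal framework-agnostic sketch (the method names and the `model.stop_training` convention follow the Keras callback interface, but this is not the repo's implementation):

```python
import time

class TimeoutCallbackSketch:
    """Stop training once the elapsed wall time exceeds `timeout_s`.

    Mimics the Keras callback interface: the framework sets a `model`
    attribute on the callback, and setting `model.stop_training = True`
    ends the fit loop at the next batch/epoch boundary.
    """
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.start = None
        self.model = None  # set by the training framework

    def on_train_begin(self, logs=None):
        self.start = time.time()

    def on_batch_end(self, batch, logs=None):
        if time.time() - self.start > self.timeout_s:
            self.model.stop_training = True
```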
#COBALT -n 16
#COBALT -t 720
export RANKS_PER_NODE=8
export COBALT_JOBSIZE=16
export acq_func="qUCB"
export strategy="qUCB"
export timeout=42000
export random_state=42
export problem="minimalistic-frnn"
export sync_val=0
export search="DBO"

`#COBALT -n 16` and `export COBALT_JOBSIZE=16` must be the same, `#COBALT -t 720` is the maximum submission time of 12h, and `export timeout=42000` corresponds to a search of 11h40min.
The results are saved in export log_dir="results/$problem-$search-$sync_str-$acq_func-$strategy-$COBALT_JOBSIZE-$RANKS_PER_NODE-$timeout-$random_state" so make sure that you have a results/ folder in the current folder.
There are other scripts with different parameters, as well as scripts to quickly test the search on 1 gpu or 1 node (test_HPS_1_gpu.sh, test_HPS_1.sh) or even only the run function with the baseline (test_base_run.sh, to be more precise it executes what's in the __main__ part of scalbo's minimalistic_frnn.py benchmark script).
Once the HPO is done its results are in hps/results/minimalistic-frnn-DBO-async-qUCB-qUCB-16-8-42000-42/results.csv.
First execute the gather_top_k_configs.py script (this can be done locally), for which the key variables are defined at the beginning (l.4 -> l.8):
path_to_results = "../hps/results/minimalistic-frnn-DBO-async-qUCB-qUCB-16-8-42000-42/results.csv"
k = 80
path_to_configs = f"configs/top_{k}.json"

It will gather the top k results from path_to_results and save them in configs/top_'k'.json (in which we already have the baseline config).
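The gathering step itself is simple; a sketch of the idea (assuming results.csv has one column per hyperparameter plus an `objective` column — the exact schema depends on the DeepHyper version that produced it, and this is not the repo's script):

```python
import json
import pandas as pd

def gather_top_k(path_to_results, k, path_to_configs):
    """Keep the k rows with the best objective and dump them as JSON."""
    df = pd.read_csv(path_to_results)
    # Best objective first; drop the objective so only the config remains.
    top = df.nlargest(k, "objective")
    configs = top.drop(columns=["objective"]).to_dict(orient="records")
    with open(path_to_configs, "w") as f:
        json.dump(configs, f, indent=2)
    return configs
```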
The training of the top models gathered is then performed with the train_top_k.py script, which takes the top configs generated previously in path_to_top_configs and simply reproduces the run() function with an added FrnnEvaluatorCallback, whose role is to evaluate the model at each epoch, save the evaluated metrics in 'results_path'/histories/'model_name'.json, and save the model's weights in 'results_path'/model_weights/'model_name'.h5 whenever the valid_frnn_auc improves.
All the key variables are defined l.593->l.595 :
path_to_top_configs = "configs/top_80.json"
results_path = "results/top_80"
model_name = f"top_{rank+1}"

Make sure you have the results_path folder created in training/. Also, because of the way the FrnnEvaluatorCallback checkpoints the model, make sure you have both a histories/ and a model_weights/ folder in this results_path folder.
The number of epochs during which to run the training is defined l.555:
num_epochs = 128

With the periodic evaluation of the model, the training does not need to finish by the end of the 12h job, so this can be set to a very large value.
To compare with the baseline, it is also possible to run train_baseline.py (even though I left the results of this training in results/baseline/), but it will take one node to train only one model; this could be improved. Also, just like for the top models' training, make sure the folders are created for the FrnnEvaluatorCallback to checkpoint the baseline's training.
There are also two other scripts, which don't checkpoint the training and run for only an hour on a single GPU, to compare the GPU utilization of the two buffer methods on the baseline (ours, train_baseline_buffer.py, and the original one, train_baseline_old_buffer.py); the gpustat outputs are written in results/gpustat_buffer.txt and results/gpustat_old_buffer.txt respectively, so make sure you have a results/ folder in training/. To get the GPU utilization from a generated gpustat file, you can use this snippet with the correct "path/to/gpustat.txt":
with open("path/to/gpustat.txt", 'r') as f:
    use = []
    for line in f:
        # gpustat status lines start with '[<gpu index>]'; the
        # utilization percentage is the token just before the first '%'.
        if line.startswith('['):
            use.append(int(line.split('%')[0].split(' ')[-2]))
    # Average utilization over all sampled status lines.
    print(sum(use) / len(use))

All these training scripts have their associated job submission script in jobs/, as always.
Once these models are trained the predictions can be made by submitting jobs/predict.sh, for which parameters are defined in the corresponding script predict.py, l.569->l.577:
top_configs_path = "../training/configs/top_80.json"
top_training_results_dir = "../training/results/top_80"
baseline_config_path = "../training/configs/baseline.json"
baseline_training_results_dir = "../training/results/baseline"
dataset="predictions/d3d_2019"
subset="test"
predictor = "top"

The config paths specify where the configurations of the different models are defined, while the training results dirs specify where the models' checkpoints were saved during training. subset specifies which subset we want predictions on, while dataset is only used to name the output folder; the data actually used for prediction depends, as for the HPS, on what's specified in conf_parser.py, which is currently set to d3d_2019. The predictions are saved in predictions/'dataset'/'subset'/('group'/)'model_name'.npz, along with a specs.yaml containing the specs of the model's prediction, such as the AUC and the prediction time on the subset. Again, make sure the corresponding folders are created.
predictor selects whether to perform the top models' or the baseline's predictions; it is recommended to execute the baseline first, because it also generates a y_gold.npz containing the corresponding true labels, which is used during the top models' prediction to verify that the prediction is made on the same data (l.525 -> l.526):
for gold, true_gold in zip(y_gold, true_y_gold.values()):
assert np.all(gold == true_gold)

The top models' predictions take too long to be performed in a single 1-hour single-GPU job, which is why you can choose which portion of these models you want to get predictions from (begin=0, end=79, l.582).
This is the big part; the parameters are defined right at the beginning (l.11 -> l.30):
dataset = "d3d_2019"
subset = "test"
criteria = "valid"
group = "top_models"
top = 80
method = "gradient" # "gradient", "caruana", "topk"
methods_kwargs = dict(
    topk=dict(
        calibrator="balanced_sigmoid", # "base", "sigmoid", "balanced_sigmoid"
        k=80,
    ),
    caruana=dict(
        calibrator="balanced_sigmoid", # "base", "sigmoid", "balanced_sigmoid"
        k=80,
    ),
    gradient=dict(
        ensemble=list(range(top)),
        keep_model_thresh=1e-06,
    ),
)

`dataset` and `subset` specify the folder from which the predictions are taken, while `criteria` specifies on which subset the calibrators and ensemble constructors should have been trained. `group` is the top models' group name, and `top` is how many of them we want to load (in descending `criteria` AUC order). `method` is the ensemble construction method used, along with its associated `method_kwargs`.
topk and caruana require sklearn calibrators of type method_kwargs['calibrator'] to be trained on criteria for each model. caruana and gradient are also trained on criteria to save the list of models they keep (and also the gradient).
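For intuition about the caruana method, here is a minimal sketch of Caruana-style greedy ensemble selection with replacement, where each step adds the model whose inclusion most reduces the loss on the selection subset. The function name, the use of binary cross-entropy, and the array layout are illustrative assumptions, not the repo's implementation:

```python
import numpy as np

def greedy_ensemble(member_preds, y_true, k):
    """Caruana-style greedy ensemble selection (with replacement).

    member_preds: shape (n_models, n_samples), each model's predicted
    probabilities on the selection subset (here, the `criteria` subset).
    Models can be picked several times, which implicitly weights them.
    """
    y_true = np.asarray(y_true, dtype=float)

    def bce(p):
        p = np.clip(p, 1e-7, 1.0 - 1e-7)
        return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

    chosen, running = [], np.zeros(member_preds.shape[1])
    for _ in range(k):
        # Try adding each model and keep the one minimizing the loss.
        losses = [bce((running + m) / (len(chosen) + 1)) for m in member_preds]
        best = int(np.argmin(losses))
        chosen.append(best)
        running = running + member_preds[best]
    return chosen, running / len(chosen)
```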
In order to train calibrators/constructors on a specific subset, say valid, you must first execute this script with subset = criteria = "valid". Make sure you have:
- a `calibration/'method'/'dataset'/'criteria'/'group'` folder (for `caruana` and `gradient`),
- a `calibration/calibrators/'dataset'/'criteria'/'group'/method_kwargs['calibrator']` folder for the `group` models (not the baseline, because it is not calibrated) (for `topk` and `caruana`).
Once this is done you can use the trained calibrators/constructors on any other subset as long as criteria stays at valid.
This script executes the method to build the associated ensemble's prediction (using existing saved calibrator/constructor parameters for the given criteria if any, or training them), computes the stats, AUC, balanced_loss, and UQ, and generates some plots. The plot generation is at the end of the script (l.155 -> end): first the ROC curves of baseline and best_model (you can also add the ensemble's curve by uncommenting lines l.56 -> l.62 in utils/plots.py), and then, in order, the baseline, best_model, and ensemble pred/uq plots for the first num_shots shots.
The UQ shots can be found in this box folder.
There are additional scripts that can be run locally (hence the .ipynb) with the results of the corresponding step.
plot_search_trajectory.ipynb generates the search_trajectory and num_better_baseline plots in a plots/ folder using the search's results.csv, whose location can be specified in cell 2:

hps = pd.read_csv("results.csv")

plot_configs.ipynb does several things:
- It generates the parallel coordinates view of the configurations found by the search that are better than the baseline (cell 3), assuming the search's results.csv is here:

data = pd.read_csv("results.csv")

and that the objective of the baseline on the run function corresponds to:

data = data[data['objective'] > 0.8859217675140613]

(the objective of the baseline on the run function can be obtained by submitting test_base_run.sh from the Training part).
- It creates an ensemble.csv from results.csv with only the k top models (which correspond to the configurations chosen for further training) in cell 4, with k and the path to results.csv being defined here:

data = pd.read_csv("results.csv")
k = 80

Be careful to add the following line

0,0,128,0.001,128,0.1,128,hinge,2e-05,0.97,0.9,128,3,2,0.001,2,200,0.8859217675140613,0,0

to the resulting ensemble.csv before proceeding to the next cell.
- Cell 5 generates the parallel coordinates view of the freshly built ensemble.csv along with the baseline configuration.
All these plots are saved in the same plots/ folder.
The plot.ipynb notebook simply generates the plot of the evolution of each model's AUC during training, along with the baseline's. The paths to the different histories are specified in cell 3 (the baseline's path goes all the way down to the .json file in histories/, while the top models' path only goes down to the histories/ directory containing one .json per model):
baseline_ori = load_file("results/baseline/histories/baseline.json")
results = load_dir("results/top_models/histories")
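load_file and load_dir are small notebook helpers; assuming each history is saved as one JSON dict of per-epoch metric lists (as described in the Training part), a plausible minimal implementation could look like this (names taken from the notebook, bodies assumed):

```python
import json
import os

def load_file(path):
    """Load one training history (a JSON dict of per-epoch metrics)."""
    with open(path) as f:
        return json.load(f)

def load_dir(dir_path):
    """Load every history .json in a directory, keyed by model name."""
    return {
        os.path.splitext(name)[0]: load_file(os.path.join(dir_path, name))
        for name in sorted(os.listdir(dir_path))
        if name.endswith(".json")
    }
```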