This repository contains code for the paper Perturbation-based Effect Measures for Compositional Data by Anton Rask Lundborg and Niklas Pfister.
The experiments of the paper were run with Python 3.11. The required packages are specified in the `requirements.txt` file (be aware that the Cython, pandas and numpy packages need specific versions!).
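Because some packages need specific versions, it can help to compare the installed versions against the pins before running anything. The following is a minimal sketch and not part of the repository. It assumes that `requirements.txt` uses plain `name==version` lines, and the helper names `parse_pins` and `check_pins` are hypothetical.

```python
from importlib import metadata


def parse_pins(lines):
    """Collect `name==version` pins; any other requirement form is skipped (assumption)."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0]   # drop trailing comments
        line = line.split(";", 1)[0]   # drop environment markers
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (pinned, installed)} for every package that does not match its pin."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Calling `check_pins(parse_pins(open("requirements.txt")))` from the repository root returns an empty dictionary when every pinned package is installed in the pinned version.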
To run the code, the regressiontree package must first be installed and compiled. To do so, run the command `pip install -e regressiontree`. If you have any trouble with this step, feel free to contact one of the authors of the paper via email or open a GitHub issue.
The main folder contains the code for the functions used in the experiments, while the experiments folder contains functions that run the different experiments. The data folder holds the two real datasets used in the experiments. The plots folder is initially empty; the results of the Adult data experiment and the semiparametric robustness experiment are written to it.
There are five modules in the main folder.
- `derivative_estimation` contains the functions used for the nonparametric derivative estimation based on local polynomial smoothing with random forest weights.
- `perturbation_effects` contains some wrapper functions for the functions in `semiparametric_estimators` to estimate particular perturbation effects.
- `semiparametric_estimators` contains the primary function calls for the semiparametric estimators used in the experiments.
- `smoothing_spline` contains a Python implementation of the `R` smoothing spline functions.
- `spline_score` contains functions used for the nonparametric score estimation.
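To illustrate the idea behind derivative estimation by local polynomial smoothing, here is a minimal one-dimensional sketch. It is not the implementation in `derivative_estimation`: it fits a local linear (degree one) polynomial and uses Gaussian kernel weights as a simple stand-in for random forest weights. The function names are hypothetical.

```python
import math


def gaussian_weights(xs, x0, bandwidth):
    """Kernel weights centred at x0 (a stand-in for random forest weights)."""
    return [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]


def local_linear_derivative(xs, ys, x0, weights):
    """Slope of the weighted least-squares line y ~ a + b * (x - x0).

    The fitted slope b is the local linear estimate of f'(x0).
    """
    total = sum(weights)
    u = [x - x0 for x in xs]
    u_mean = sum(w * ui for w, ui in zip(weights, u)) / total
    y_mean = sum(w * yi for w, yi in zip(weights, ys)) / total
    cov = sum(w * (ui - u_mean) * (yi - y_mean)
              for w, ui, yi in zip(weights, u, ys))
    var = sum(w * (ui - u_mean) ** 2 for w, ui in zip(weights, u))
    return cov / var


# Noise-free toy data: f(x) = x^2 on a grid over [0, 2], so f'(1) = 2.
xs = [i / 50 for i in range(101)]
ys = [x ** 2 for x in xs]
weights = gaussian_weights(xs, x0=1.0, bandwidth=0.2)
slope = local_linear_derivative(xs, ys, 1.0, weights)
```

Any non-negative weights that concentrate on observations near the evaluation point can be passed as `weights`, so forest-based weights can replace the kernel weights without changing the fitting step.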
There are seven modules in the experiments folder. Some of the modules run full experiments while others are configurable and run a single simulation from an experiment in the paper.
- `sim_adult_experiments` contains the code for the experiment and plot in Section 5.1.1 of the paper based on the "Adult" dataset. This requires downloading the dataset as described in the data README file in `data/adult`.
- `sim_intro` contains the simulations included in Table S1 of the paper (along with some additional computations that were not included in the paper).
- `sim_microbiome_simple` contains the code for the simple regressions performed in Section 5.2 of the paper. Running the script will create two `.pkl` files. The first, `microbiome-simple.pkl`, contains the marginal effects of L, log-contrast and penalized log-contrast results used to produce Figures 3 and S12. The second, `microbiome-pseudocount.pkl`, contains the different log-contrast results when varying pseudocount as shown on the right of Figure 3.
- `sim_microbiome` contains additional code to run simulations for the experiment in Section 5.2. By looping over the `regression`, `measure` and `var_name` variables appropriately, it is possible to reconstruct the results of the paper. Each individual call will produce a `.pkl` file with results. Be aware that the computation time can exceed several hours for a single run.
- `sim_ny_schools.py` contains the code for the experiment in Section 5.1.2 of the paper based on the PASSNYC data set. This requires downloading the dataset as described in the data README file in `data/ny-schools`.
- `sim_semiparametric_robustness` contains the code to run the simulations and construct the first two figures in Section S4.4 of the supplementary material of the paper.
- `sim_semiparametric` contains code to run a single instance of the simulations of Section S5.2 of the paper. By looping over the `Y_regression`, `typ`, `n`, `d` and `estimator` variables and repeating this many times, it is possible to recreate Figures 1, S6, S7 and S8. Each call will produce a `.pkl` file with results. Be aware that the computation time for a single run can be long when `n` and `d` are large and the estimators are `NPM`.
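The looping over configuration variables described for `sim_microbiome` and `sim_semiparametric` could be organized along the following lines. This is only a sketch under stated assumptions: the grid keys are the variable names listed for `sim_semiparametric`, but every value except `NPM` is a placeholder, the output file names are invented, and `run_single` is a hypothetical stand-in for however a single simulation is actually invoked.

```python
import itertools
import pickle
from pathlib import Path

# Placeholder grid: all values are assumptions, not the settings used in the paper.
GRID = {
    "Y_regression": ["regression_a", "regression_b"],
    "typ": ["type_a"],
    "n": [100, 200],
    "d": [3],
    "estimator": ["estimator_a", "NPM"],
}


def configurations(grid):
    """Yield one dict per combination of the grid values."""
    keys = list(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def run_single(config, seed):
    """Hypothetical stand-in for one simulation run; returns a dummy result."""
    return {"config": config, "seed": seed, "estimate": None}


def run_grid(grid, repetitions, out_dir):
    """Run every configuration `repetitions` times and write one .pkl file per call."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, config in enumerate(configurations(grid)):
        for seed in range(repetitions):
            path = out_dir / f"result-{index}-{seed}.pkl"
            with open(path, "wb") as fh:
                pickle.dump(run_single(config, seed), fh)
            paths.append(path)
    return paths


def load_results(paths):
    """Read the pickled results back into a list for aggregation."""
    results = []
    for path in paths:
        with open(path, "rb") as fh:
            results.append(pickle.load(fh))
    return results
```

Writing one file per call mirrors the one-`.pkl`-per-run behaviour described above and lets long runs be distributed over several machines and collected afterwards with `load_results`.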