Oct, 2025
Fixed several issues: a prediction bug and a lazy-loading bug; updated the plotting function; updated the docs. #82. Also fixed a previous bug where, after getting an attribute of a `LazyLoadingEstimator` object, the model was not auto-dumped.
Oct, 2025
This is a large update.
Features:
- The major change is that the `AdaSTEM` classes now support `duckdb` and `parquet` file paths as input. This allows users to pass in large datasets without duplicating the pandas DataFrame across processors when working with n_jobs>1 parallel computing. See the new Jupyter notebooks for details. #76
- Lazy loading is no longer realized by the `LazyLoadingEnsemble` class; instead, it is realized by `LazyLoadingEstimator`. This allows each model to be dumped once its training/prediction is finished, so we no longer need to accumulate the models (and hence memory) until training is finished for the whole ensemble. This largely reduces memory use. See the new Jupyter notebooks for details. #77
- n_jobs > ensemble_folds is no longer supported, for user-end clarity. These jobs are parallelized over ensemble folds, so n_jobs > ensemble_folds is meaningless. We do not want to mislead users into thinking that a 10-ensemble model will be trained faster with n_jobs=20 than with n_jobs=10.
- These features will not be available in `SphereAdaSTEM` due to the negligible user market and negligible advantages. #75
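The dump-once-finished pattern that `LazyLoadingEstimator` enables can be sketched in plain Python. Everything below (`DiskBackedEstimator`, `MeanModel`) is an illustrative stand-in, not stemflow's actual implementation:

```python
import pickle
import tempfile
from pathlib import Path

class DiskBackedEstimator:
    """Sketch: dump the fitted model to disk right after training so
    models do not accumulate in memory across the whole ensemble."""

    def __init__(self, model, dump_dir):
        self.model = model
        self.path = Path(dump_dir) / f"model_{id(self)}.pkl"

    def fit(self, X, y):
        self.model.fit(X, y)
        with open(self.path, "wb") as f:
            pickle.dump(self.model, f)   # persist the fitted model
        self.model = None                # release the in-memory copy
        return self

    def predict(self, X):
        with open(self.path, "rb") as f:
            model = pickle.load(f)       # reload only when needed
        return model.predict(X)

class MeanModel:
    """Toy base model that predicts the mean of the training labels."""
    def fit(self, X, y):
        self.mean = sum(y) / len(y)
        return self
    def predict(self, X):
        return [self.mean] * len(X)

dump_dir = tempfile.mkdtemp()
est = DiskBackedEstimator(MeanModel(), dump_dir).fit([[0], [1]], [2.0, 4.0])
assert est.model is None            # memory released right after fit
assert est.predict([[5]]) == [3.0]  # prediction reloads from disk
```

The key point is that peak memory is bounded by one model at a time rather than by the whole ensemble.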
Major bugs fixed:
- Previously, the models were stored in `self.model_dict` dynamically during the parallel ensemble training process, which means the dictionary was being altered during that process. However, `self` is passed as an input argument for serialization of the ensemble-level training function, and an object being serialized should not be changing. This is fixed by assigning the `model_dict` to `self` only after all training is finished.
- Also fixed #74.
May 14, 2025:
- `ensemble_bootstrap` argument: defaults to False. If True, the data will be bootstrapped once for each ensemble, so users can generate ensemble-level uncertainty that accounts for variance in the data.
- `joblib_backend` argument: defaults to 'loky'. Other available options include 'threading' ('multiprocessing' will not work with a generator as the return). On certain systems, only 'threading' may work.
- `base_model_method` argument: defaults to None. If None, `predict` or `predict_proba` will be used depending on the task. This argument is handy if you have a custom base model class with a special prediction function. Note that the dummy model will still predict 0, so the ensemble-aggregated result is still an average of zeros and your special prediction function's output; it therefore only makes sense if your special prediction function predicts 0 as the absence/control value.
Only updated for AdaSTEM and STEM, not for SphereAdaSTEM.
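A minimal stdlib sketch of what per-ensemble bootstrapping means (the function name is illustrative, not stemflow's API): each ensemble draws its own resample with replacement, so the spread across ensembles reflects sampling variance in the data.

```python
import random

def bootstrap_per_ensemble(data, ensemble_folds, seed=42):
    """Draw one bootstrap resample (sampling with replacement,
    same size as the data) for each ensemble fold."""
    rng = random.Random(seed)
    return [
        [data[rng.randrange(len(data))] for _ in range(len(data))]
        for _ in range(ensemble_folds)
    ]

samples = bootstrap_per_ensemble(list(range(100)), ensemble_folds=3)
assert len(samples) == 3                    # one resample per ensemble
assert all(len(s) == 100 for s in samples)  # each the same size as the data
```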
Nov 20, 2024:
Added support for:
- `min_class_sample`.
This allows the user to specify the threshold below which a base model is not trained, for the classification and hurdle tasks. In the past, this was hard-coded as 1, meaning that the base model was trained only if there was at least 1 sample from a different class. Now users can set it to, e.g., 3, so that a stixel with 100 data points (98 zeros and two ones) will not be trained (a dummy model that always predicts zero will be used instead), while a stixel with 100 data points (97 zeros and three ones) will be trained.
This feature can be useful if you need to do cross-validation at the base model level.
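The thresholding rule above can be sketched as follows (the function name is illustrative, not stemflow's internals):

```python
from collections import Counter

def should_train_base_model(y, min_class_sample=1):
    """Train a stixel's base model only if the minority class has at
    least `min_class_sample` samples; otherwise fall back to a dummy
    model that always predicts zero."""
    counts = Counter(y)
    if len(counts) < 2:
        return False  # only one class present in this stixel
    return min(counts.values()) >= min_class_sample

# 98 zeros and two ones: skipped when min_class_sample=3
assert should_train_base_model([0] * 98 + [1] * 2, min_class_sample=3) is False
# 97 zeros and three ones: trained
assert should_train_base_model([0] * 97 + [1] * 3, min_class_sample=3) is True
```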
- `n_jobs` in the `split` method.
The split method now uses the user-defined n_jobs. It was previously fixed at 1 because multi-core performance seemed to be off. However, with a large number of ensembles it seems to do a good job.
- Passing arguments to the prediction method of base model.
This can now be realized by passing `base_model_prediction_param` parameters when calling `model.predict` or `model.predict_proba`, as long as the `predict` or `predict_proba` methods of your base model accept these arguments.
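The mechanism is plain keyword-argument forwarding; a minimal sketch with made-up classes (`Ensemble` and `BaseModel` are illustrative, not stemflow's):

```python
class BaseModel:
    """Toy base model whose predict accepts an extra keyword argument."""
    def predict(self, X, return_std=False):
        preds = [0.0 for _ in X]
        if return_std:
            return preds, [1.0 for _ in X]
        return preds

class Ensemble:
    def __init__(self, models):
        self.models = models

    def predict(self, X, **base_model_prediction_param):
        # Extra keyword arguments are forwarded to every base model's
        # predict, provided its signature accepts them.
        return [m.predict(X, **base_model_prediction_param)
                for m in self.models]

ens = Ensemble([BaseModel(), BaseModel()])
out = ens.predict([[1], [2]], return_std=True)
assert out == [([0.0, 0.0], [1.0, 1.0])] * 2
```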
- The `logit_agg` parameter.
The `logit_agg` argument in the prediction method allows "real" probability averaging: it controls whether to use logit aggregation for the classification task. If True, the model averages the probability predictions estimated by all ensembles on the logit scale, then back-transforms the result to the probability scale. It is recommended to use this jointly with sklearn's `CalibratedClassifierCV` class as a wrapper around the classifier to estimate calibrated probabilities. If False, the output is essentially the proportion of "1"s across the related ensembles; e.g., if 100 stixels cover a spatiotemporal point and 90% of them predict "1", then the output probability is 0.9, a probability estimated by the spatiotemporal neighborhood. Defaults to False; set it to True for "real" probability averaging.
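The two aggregation modes can be contrasted in a short stdlib sketch (the function name and clipping bound here are illustrative, not stemflow's internals):

```python
import math

def aggregate(probs, logit_agg=False, eps=1e-6):
    """Aggregate per-ensemble probability predictions for one point."""
    if not logit_agg:
        # Plain mean: with near-0/1 base predictions this is effectively
        # the proportion of ensembles voting "1".
        return sum(probs) / len(probs)
    # Average on the logit scale, then back-transform to probability.
    clipped = [min(max(p, eps), 1 - eps) for p in probs]
    logits = [math.log(p / (1 - p)) for p in clipped]
    mean_logit = sum(logits) / len(logits)
    return 1 / (1 + math.exp(-mean_logit))

votes = [1.0] * 90 + [0.0] * 10    # 90 of 100 stixels predict "1"
assert abs(aggregate(votes) - 0.9) < 1e-9   # proportion of "1" votes
assert aggregate(votes, logit_agg=True) > 0.99  # logit averaging pulls toward the majority
```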
Minor changes:
- `self.rng` is now set at the call of `fit`, instead of at initiation.
- The lazy-loading directory is created upon calling `fit`, instead of at initiation.
- Probability clipping is added to the prediction output when using `predict_proba` in classification mode; values are clipped to [1e-6, 1 - 1e-6].
- The averaging of probabilities for the classification task is now done on the logit scale, and the `mean` prediction in the output is back-transformed to the probability scale. However, the std in the output is still on the logit scale!
- The roc_auc score is now calculated with probabilities and y_true. Previously, a 0.5 threshold was applied to obtain binary predictions before calculating AUC.
- Removing "try-except" in the base model training process. If you failed in the base model training, that's a problem.
Oct 25, 2024:
Related: #59; #69
- Add an option for completely randomized grid generation (compared to equal division of the 90-degree angle).
- Implement a lazy-loading model dictionary to save memory: ensembles of models are saved to disk when training finishes and loaded when used for prediction.
- Update init parameters in the AdaSTEM, STEM, and SphereAdaSTEM classes.
- Update lazy loading documentation & example notebooks.
- Add related pytests.