Skip to content

praesc/Desk-LM

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

56 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Desk-LM

Desk-LM is a python environment for training machine learning models (and make predictions as well). It currently implements the following ML algorithms:

  • Linear SVM
  • Decision Tree
  • K-NN
  • ANN
  • TripleES Holt-Winters Triple Exponential Smoothing implementation for time series

We are extending the library to other algorithms, also unsupervised. Your voluntary contribution is welcome.

The user can specify a .csv dataset, an algorithm and a set of parameter values, so to train and select the best model and export it for use on edge devices, by exploiting the twin tool Micro-LM.

For ANNs, Desk-LM outputs the model in hdf5 file format, to be imported by STM32 CubeAI, together with .c and .h files for pre-processing and for testing the whole dataset performance on the microcontroller (STM32 Nucleo boards only).

For all the other algorithms, Desk-LM produces .c and/or .h that will be used as source files in a Micro-LM project for optimzed memory footprint on edge devices. They contain the parameters of the selected ML model.

Desk-LM relies on numpy, pandas, sk-learn and keras/Tensorflow.

We are working so that Desk-LM will output .json files so to allow dynamic usage by microcontrollers.

Getting started

Input

The command line expects the path to .json files specifying:

  • dataset
  • estimator
  • preprocessing
  • validation
  • output
  • prediction
  • storage

Dataset configuration

-d <path_to_dataset_config_file>

The .json file exposes the following properties:

  • path, path to the .csv dataset file
  • skip_rows, number of rows to be skipped before the column names
  • select_columns, array of names of the columns to be selected as features (all the others are discarded)
  • skip_columns, array of names of the columns to be skipped (ignored, if select_feature_columns is set)
  • target_column, name of the target column
  • time_series_column, name of the column for the time series. It is used only for time series computation and is mutually exclusive with all the other column options
  • sep, .csv file separator string/char
  • decimal, .csv file decimal number separator
  • test_size, integer (number of samples for the test), or float (fraction of the total samples to be used as test)
  • categorical_multiclass, boolean specifying whether the multiclass labels are categorical or not (i.e., ordinal)

Estimator configuration

-e <path_to_estimator_config_file>

The .json file exposes the following properties:

  • estimator, name of the estimator. If not differently stated, all estimators are implemented through sk-learn. Currently supported estimators are:
    • KNeighborsClassifier
    • KNeighborsRegressor
    • DecisionTreeClassifier
    • DecisionTreeRegressor
    • LinearSVC
    • LinearSVR
    • ANNClassifier, Keras/Tensorflow implementation
    • ANNRegressor, Keras/Tensorflow implementation
    • TripleES, the Holt-Winters Triple Exponential Smoothing for time series

Each estimator type has its own configuration parameters:

  • k-NN
    • n_neighbors (*), default: 5
  • DecisionTree
    • max_depth (*), default: None
    • min_samples_split (*), default: 2
    • min_samples_leaf (*), default: 1
    • max_leaf_nodes (*), default: None
  • SVC/SVR
    • C_exp (*), exponent of the C parameter, default 0
  • ANN
    • epochs (*), default: 10
    • batch_size (*), default: 32
    • dropout (*), float between 0 and 1, default: 0
    • activation, array of strings (e.g., "relu", "tanh") for the activation function of the hidden layers, default: "relu"
    • hidden_layers, array of array of integers representing the size of each hidden layer, default: []. Each inner array represents one ANN layer configuration (i.e., number of nodes for each layer) among those to be evaluated in the cross-validation
  • TripleES
    • season_length

(*) for the sake of model selection and cross validation, for this property it is possible to specify (all values are integers, if not differently specified, as ANN dropout):

  • < prop >, a single value
  • <prop_array>, an array of values
  • <lower_limit>, a lower_limit for a value range
  • <upper_limit>, a upper_limit for a value range
  • < step >, step for a value range

Please refer to sk-learn and keras documentation for the details on the configuration parameters.

Preprocessing configuration

-p <path_to_preprocessing_config_file>

The .json file exposes the following properties:

  • scale, array of strings specifying scalers (currently supported scalers: "StandardScaler", "MinMaxScaler"), default: no scaler
  • pca_values, array of numbers (integer or float between 0 and 1) representing the principal components. It is possible to specify also the string "mle", default: no PCA

Not used for TripleES.

Please refer to sk-learn documentation for further details.

Model selection configuration

-s <path_to_model_selection_config_file>

The .json file exposes the following properties:

  • cv, cross validation schema, default: None, to use the default 5-fold cross validation
  • scoring, scoring method for cross validation
    • Regression problem, possible values:
      • "mean_absolute_error"
      • "mean_squared_error", default value for ANN
      • "mean_squared_log_error'
      • "root_mean_squared_error"
      • "r2", default value for k-NN, LinearSVR, DecisionTree
    • Classification problem, possible values:
      • "accuracy", which is the default value
      • "balanced_accuracy"
      • "f1", "f1_micro", "f1_macro", "f1_samples", "f1_weighted"
      • "precision", "precision_micro", "precision_macro", "precision_samples", "precision_weighted"
      • "recall", "recall_micro", "recall_macro", "recalln_samples", "recall_weighted"
  • verbose, verbosity level, default: 0

Please refer to sk-learn documentation for further details.

Output configuration

-o <path_to_output_config_file>

The .json file exposes the following properties:

  • export_path, path to a directory where to export the output folders, default: no export. Output folders are structured like this:
    • elm/source, for .c files
    • elm/include, for .h files
    • elm/model, for models, such as the ANN .h5 file
  • is_dataset_test, if files for a whole dataset test on the target devices are to be prepared, default: no dataset test (i.e., one shot estimation only)
  • dataset_test_size, sets a limit to the number of testing labels to be exported for the dataset test. Can be either int (number of labels) or float (0-1), default: 1
  • training_set_cap, for k-NN, sets a limit to the number of training samples to be exported for the k-NN estimation. Can be either int (number of samples) or float (0-1), default: no cap

Prediction configuration

--predict <path_to_predict_config_file>

The .json file exposes the following properties:

  • model_id, the uuid of the model to be loaded on the server to make the prediction
  • samples, an array of samples for which to build the prediction

Storage configuration

--store

The program stores the estimator in the "./storage/" directory. It provides an uuid for future retrieval for predictions

Run

  • For computing a model and generating files for deployment on Micro-LM (or an .h5 file for neural networks) python main.py -d <dataset_config_file> -e <estimator_config_file> -p <preprocessing_config_file> -s <model_selection_config_file> -o <output_config_file>

  • For computing and storing a model python main.py -d <dataset_config_file> -e <estimator_config_file> -p <preprocessing_config_file> -s <model_selection_config_file> --store

  • For predicting a set of samples with a trained model python main.py --predict <predict_config_file>

Output

Output files are also produced under the out diectory, with the following structure:

  • out/source, for .c files
  • out/include, for .h files
  • out/model, for models, such as the ANN .h5 file

The structure is duplicated in the elm export directory, if specified in the output configuration file.

Linear SVM / DT / K-NN / TripleES

For using source and header files produced by Desk-LM for these algorithms, please refer to the Micro-LM documentation.

Use a Desk-LM ANN in a CubeIDE project, using STM X-Cube-AI package (for STM32 Nucleo boards only):

  1. Load the .h5 model generated by DeskML into STM32CubeIDE and generate the code for your target board
  2. Add to your CubeIDE project the files generated by Desk-LM for pre-processing and/or dataset testing (source/preprocess_params.c and source/testing_set.c and include/PPParmas.h and include/testing_set.h).
  3. You need to add also ELM.h and Preprocess.c and Preprocess.h from Micro-LM.
  4. Use function preprocess(X), exposed by ELM.h, where X is the sample vector for preprocessing

Examples

Example .json files are provided in the input dirctory. We adopt the following convention:

  • ds_ is the prefix for dataset configuration files.
  • pp_ for preprocessing
  • est_ for estimator
  • ms_ for model selection
  • output_ for output
  • pr_ for predicting

Example .csv files are provided in the ./dataset/ directory.

Data type

float 32 data are used

Package versions

Please see the Desk-LM.3.6.10.yml file. Python 3.6, and Keras 2.2.4, which is needed for importing the ANN model in Cube-AI

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Contributing

Please see CONTRIBUTING.md

Reference article for more infomation

F., Sakr, F., Bellotti, R., Berta, A., De Gloria, "Machine Learning on Mainstream Microcontrollers," Sensors 2020, 20, 2638. https://www.mdpi.com/1424-8220/20/9/2638

References

Credit for the Holt-Winters Triple Exponential Smoothing implementation for time series: https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-9-time-series-analysis-in-python-a270cb05e0b3

About

Desk-LM is a python environment for training machine learning models

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 100.0%