PrathikI/Sagemaker-Development

Sagemaker Work

AWS SageMaker Deliverables for Fifth Third Bank

Project 1 - 09/25/24 Detailed Deliverables (Focus on Future MLOps)

Deliverable: A template notebook (with detailed instructions) that performs the following for a public dataset & use case:

  • USED DATASET FOR DELIVERABLE

  • Processes raw data into features (cleaning, formatting, validating, binning, etc). Make sure it includes some binning

    • DATA QUALITY checks include:
      • NULL/missing values
      • Duplicate values
      • Removal of redundant/unnecessary columns
    • Type conversions (especially mapping integer codes that carry no numeric meaning to categorical values)
    • Binning refers to “transforming continuous data into intervals/bins”
      • Dummy variables to prepare data for ML model
      • Drop the first dummy variable from each set of dummy variables to prevent perfect multicollinearity
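The cleaning, binning, and dummy-variable steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the toy DataFrame, its column names (instant, season, temp, cnt), the season labels, and the bin edges are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw dataset (all values are made up).
raw = pd.DataFrame({
    "instant": [1, 2, 3, 3, 4],            # row counter, treated here as a redundant column
    "season": [1, 2, 3, 3, 4],             # integer codes with no numeric meaning
    "temp": [0.20, 0.55, None, None, 0.90],
    "cnt": [100, 250, 400, 400, 300],
})

def clean_and_bin(df: pd.DataFrame) -> pd.DataFrame:
    # Data quality: remove exact duplicate rows, then fill missing values.
    out = df.drop_duplicates().copy()
    out["temp"] = out["temp"].fillna(out["temp"].median())
    # Remove a column that carries no predictive information.
    out = out.drop(columns=["instant"])
    # Type conversion: integer codes become labelled categories (assumed mapping).
    season_labels = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
    out["season"] = out["season"].map(season_labels).astype("category")
    # Binning: turn the continuous temp column into three intervals (assumed edges).
    out["temp_bin"] = pd.cut(
        out["temp"], bins=[0.0, 0.33, 0.66, 1.0],
        labels=["low", "mid", "high"], include_lowest=True,
    )
    # Dummy variables, dropping the first level of each set.
    return pd.get_dummies(out, columns=["season", "temp_bin"], drop_first=True)

features = clean_and_bin(raw)
print(features.columns.tolist())
```

Dropping the first level means each categorical set is encoded relative to a baseline category, which keeps the dummy columns from summing to the intercept column.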
  • Registers features into a new feature group in Sagemaker Feature Store

    • Documentation Link: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html
    • Created a script to automate this process
    • Initialize Environment:
      • Import necessary libraries (boto3, pandas, sagemaker, etc.)
      • Initialize a SageMaker session and Boto3 clients for SageMaker Feature Store
    • Load and Prepare Data:
      • Load the dataset into a DataFrame
      • Add back any columns the DataFrame needs for Feature Store registration (a key identifier and a date-time stamp, if they were dropped during cleaning)
      • Handle missing values by filling them with placeholders
    • Define Feature Group:
      • Define the feature group name
      • Specify feature definitions, including all relevant columns
    • Create Feature Group:
      • Check if the feature group already exists and delete it if it does
      • Create the feature group in SageMaker Feature Store with the specified schema
    • Wait for Feature Group Activation:
      • Poll the status of the feature group until creation completes (the API reports this as a FeatureGroupStatus of Created)
      • The feature group may appear in the SageMaker Feature Store console before the API reports it as ready (check the store periodically as the script runs)
    • Ingest Data:
      • Ingest the prepared DataFrame (bike_new) into the feature group once it is active
    • Completion Message:
      • Print a success message indicating that the features have been registered into the feature group successfully
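The "Load and Prepare Data" and "Define Feature Group" steps above can be sketched without touching AWS. The helper names, the record_id and event_time column names, and the sample frame below are hypothetical; the output format follows the FeatureDefinitions list accepted by the boto3 SageMaker client's create_feature_group call, whose feature types are Integral, Fractional, and String.

```python
import time
import pandas as pd
from pandas.api import types as ptypes

def prepare_for_feature_store(df, record_id_col="record_id", event_time_col="event_time"):
    """Add the identifier and event-time columns that Feature Store requires."""
    out = df.copy()
    # Dummy columns may be boolean; store them as integers.
    bool_cols = out.select_dtypes(include="bool").columns
    out = out.astype({col: "int64" for col in bool_cols})
    out[record_id_col] = range(len(out))            # hypothetical key: row position
    out[event_time_col] = float(round(time.time())) # Unix seconds as a fractional value
    return out

def feature_definitions_from_frame(df):
    """Map pandas dtypes to Feature Store feature types."""
    definitions = []
    for name, dtype in df.dtypes.items():
        if ptypes.is_integer_dtype(dtype):
            feature_type = "Integral"
        elif ptypes.is_float_dtype(dtype):
            feature_type = "Fractional"
        else:
            feature_type = "String"
        definitions.append({"FeatureName": name, "FeatureType": feature_type})
    return definitions

# Made-up sample data for illustration only.
sample = pd.DataFrame({
    "temp": [0.2, 0.5],
    "cnt": [100, 250],
    "season_summer": [False, True],
    "label": ["a", "b"],
})
prepared = prepare_for_feature_store(sample)
definitions = feature_definitions_from_frame(prepared)

# With credentials and an execution role, the definitions would be passed on as, for example:
# boto3.client("sagemaker").create_feature_group(
#     FeatureGroupName=..., RecordIdentifierFeatureName="record_id",
#     EventTimeFeatureName="event_time", FeatureDefinitions=definitions,
#     OnlineStoreConfig=..., OfflineStoreConfig=..., RoleArn=...)
```

Deriving the definitions from the DataFrame keeps the schema and the ingested data in sync when columns are added or removed during cleaning.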
  • Train & test a linear regression model to predict some outcome

    • Split data into train and test sets using sklearn's train_test_split function
      • Usual: train is around 70% and test is around 30%
    • Scale all numerical variables using the MinMaxScaler() class built into sklearn
      • Rescales each numerical feature to a given range (0 to 1 by default)
    • Fit the scaler on the training data and apply it (fit_transform function); apply the same fitted scaler to the test data with transform only
    • Model creation in steps
      • Split the training data into X and Y training data
    • Calculate VIFs (variance inflation factors), which measure how well each predictor is explained by the other predictors; a high VIF signals multicollinearity
    • Create the first fitted OLS (ordinary least squares) model as a baseline for further refinement
      • Check the important parameters obtained from the model
    • Print the final summary of the regression model using those variables
    • Results of bike dataset (and how results for regression model will look):
      • Model Summary:
        • Dependent Variable: cnt
        • R-squared: 1.000
        • Adjusted R-squared: 1.000
        • F-statistic: 7.327e+31
        • Prob (F-statistic): 0.00
        • Number of Observations: 510
        • AIC (Akaike Information Criterion): -2.643e+04
        • BIC (Bayesian Information Criterion): -2.636e+04
        • Df Residuals: 494
        • Df Model: 15
        • Covariance Type: nonrobust
      • Coefficients:
        • const: -1.805e-12 (p-value: 0.000)
        • casual: 1.0000 (p-value: 0.000)
        • registered: 1.0000 (p-value: 0.000)
        • mnth_1: -1.563e-12 (p-value: 0.000)
        • mnth_2: -4.636e-13 (p-value: 0.243)
        • mnth_3: -7.252e-13 (p-value: 0.022)
        • mnth_4: -1.084e-12 (p-value: 0.000)
        • mnth_6: 4.539e-13 (p-value: 0.064)
        • mnth_9: 1.794e-13 (p-value: 0.438)
        • mnth_10: -5.88e-13 (p-value: 0.032)
        • mnth_12: -5.347e-13 (p-value: 0.068)
        • weekday_0: 1.599e-12 (p-value: 0.000)
        • weekday_6: -5.365e-13 (p-value: 0.020)
        • season_1: 3.313e-13 (p-value: 0.326)
        • season_4: 1.004e-13 (p-value: 0.621)
        • weathersit_1: 5.898e-13 (p-value: 0.000)
      • Diagnostic Tests:
        • Omnibus: 5.990 (p-value: 0.050)
        • Durbin-Watson: 1.823
        • Jarque-Bera (JB): 6.049 (p-value: 0.0486)
        • Skew: -0.266
        • Kurtosis: 2.960
        • Condition Number: 4.63e+04
      • Interpretation:
        • R-squared and Adjusted R-squared:
          • R-squared (1.000): Indicates that 100% of the variability in the dependent variable (cnt) is explained by the independent variables in the model. A perfect fit like this points to target leakage rather than a good model: in the bike dataset cnt is the sum of casual and registered, and both are included as predictors (each with a coefficient of 1.0000).
          • Adjusted R-squared (1.000): Adjusts the R-squared value for the number of predictors in the model. It is also 1.000, for the same reason.
        • F-statistic and Prob (F-statistic):
          • F-statistic (7.327e+31): Tests the overall significance of the model. A very high F-statistic value indicates that the model is statistically significant.
          • Prob (F-statistic) (0.00): The p-value associated with the F-statistic is zero, indicating that the model is statistically significant.
        • Coefficients and P-values:
          • Coefficients: Represent the change in the dependent variable for a one-unit change in the independent variable, holding all other variables constant.
          • P-values: Indicate the statistical significance of each coefficient. A p-value less than 0.05 typically indicates that the coefficient is statistically significant.
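          • Significant Variables (p < 0.05): casual, registered, mnth_1, mnth_3, mnth_4, mnth_10, weekday_0, weekday_6, weathersit_1. Only casual and registered have meaningful coefficients (1.0000 each); the remaining coefficients are on the order of 1e-12 or smaller, so they are effectively zero and their p-values most likely reflect floating-point noise rather than real effects.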
        • Diagnostic Tests:
          • Durbin-Watson (1.823): Tests for autocorrelation in the residuals. A value close to 2 indicates no autocorrelation.
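          • Omnibus and Jarque-Bera (JB) Tests: Test for normality of the residuals. Significant p-values indicate that the residuals are not normally distributed. Here both p-values (0.050 and 0.0486) sit right at the 0.05 threshold, so the evidence against normality is borderline.
          • Skew and Kurtosis: Measure the asymmetry and tail weight of the residual distribution. A skew close to 0 and a kurtosis close to 3 (the summary reports non-excess kurtosis, for which a normal distribution scores 3) indicate normality; the values here (-0.266 and 2.960) are close to both.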
          • Condition Number (4.63e+04): Indicates potential multicollinearity. A high condition number suggests that some predictors may be highly correlated.
        • Conclusion:
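          • Perfect Fit: The R-squared and Adjusted R-squared values of 1.000 indicate a perfect fit to the training data. This comes from target leakage (cnt = casual + registered, with both used as predictors), so casual and registered should be dropped before the model is used for prediction.
          • Statistical Significance: The overall model and most of the individual predictors are reported as statistically significant, but with a leaked target these statistics say little about real predictive power.
          • Potential Issues: The diagnostic tests suggest borderline non-normality of residuals and possible multicollinearity among predictors.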
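The split, scale, VIF, and OLS steps above can be sketched with NumPy alone. This is a stand-in for the sklearn and statsmodels calls the notebook uses, not a copy of them, and the data is synthetic: the column names mirror the bike dataset, but every value and coefficient below is an assumption made for illustration. The sketch also shows how including casual and registered reproduces an R-squared of 1.000 when cnt is their sum.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def r_squared(X, y, beta):
    """R-squared of the fitted coefficients on the given data."""
    X1 = np.column_stack([np.ones(len(X)), X])
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the remaining columns."""
    values = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        beta = ols_fit(others, X[:, j])
        values.append(1.0 / (1.0 - r_squared(others, X[:, j], beta)))
    return np.array(values)

def min_max_scale(X_train, X_test):
    """Fit the min/max on the training rows only, then apply to both sets."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return (X_train - lo) / (hi - lo), (X_test - lo) / (hi - lo)

# Synthetic data (assumed relationships, not the real dataset).
rng = np.random.default_rng(0)
n = 300
temp = rng.random(n)
casual = 800 * temp + rng.normal(0, 150, n)
registered = 1000 + 3000 * temp + rng.normal(0, 800, n)
cnt = casual + registered  # the target is an exact sum of two candidate predictors

# 70/30 train/test split on shuffled row indices.
order = rng.permutation(n)
n_train = (7 * n) // 10
train, test = order[:n_train], order[n_train:]

feature_sets = {
    "leaky": np.column_stack([casual, registered, temp]),  # includes the target's components
    "honest": temp.reshape(-1, 1),                         # excludes them
}
test_r2 = {}
for name, X in feature_sets.items():
    X_train, X_test = min_max_scale(X[train], X[test])
    beta = ols_fit(X_train, cnt[train])
    test_r2[name] = r_squared(X_test, cnt[test], beta)

vifs = vif(feature_sets["leaky"][train])
print(test_r2, vifs)
```

The leaky feature set scores an R-squared of essentially 1 even on held-out rows, which is what distinguishes leakage from overfitting: an overfit model would fit the training rows well and the test rows poorly.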
  • Register the model in Sagemaker’s Model Registry

    • This step has not yet been implemented for Lauren’s red wine dataset
    • Documentation Link: https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html
    • Allows you to:
      • Catalog models for production
      • Manage model versions
      • Associate metadata, such as training metrics, with a model
    • Created script to automate this process
    • This script sets up a SageMaker session and Boto3 client, retrieves the necessary image URI for the scikit-learn framework, and generates a unique model package group name using the current timestamp to avoid conflicts
    • Then creates the model package group and waits for its successful creation
    • Defines the model using the retrieved image URI and model data, and prepares the inference specification for the model package
    • Registers the model package with the specified group name, description, and approval status, and finally, it updates the model package status to “Approved”
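The naming and registration steps above can be sketched as plain request-building helpers. The function names, the "bike-ols" prefix, the image URI, and the S3 path are hypothetical placeholders; the dictionary keys follow the parameters of the boto3 SageMaker client's create_model_package call, and the commented calls assume valid AWS credentials.

```python
from datetime import datetime, timezone

def unique_group_name(prefix, now=None):
    """Append a UTC timestamp so repeated runs do not collide on the group name."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}"

def build_model_package_request(group_name, image_uri, model_data_url, description):
    """Assemble keyword arguments for create_model_package."""
    return {
        "ModelPackageGroupName": group_name,
        "ModelPackageDescription": description,
        # Start as pending; approval is a separate, explicit step.
        "ModelApprovalStatus": "PendingManualApproval",
        "InferenceSpecification": {
            "Containers": [{"Image": image_uri, "ModelDataUrl": model_data_url}],
            "SupportedContentTypes": ["text/csv"],
            "SupportedResponseMIMETypes": ["text/csv"],
        },
    }

group_name = unique_group_name("bike-ols")  # hypothetical prefix
request = build_model_package_request(
    group_name,
    image_uri="<scikit-learn image URI for your region>",  # placeholder
    model_data_url="s3://<bucket>/<path>/model.tar.gz",    # placeholder
    description="Linear regression on the bike dataset",
)

# With credentials, the request would be used roughly as:
# sm = boto3.client("sagemaker")
# sm.create_model_package_group(ModelPackageGroupName=group_name)
# arn = sm.create_model_package(**request)["ModelPackageArn"]
# sm.update_model_package(ModelPackageArn=arn, ModelApprovalStatus="Approved")
```

Registering as PendingManualApproval and approving in a second call mirrors the script's final status update and leaves room for a review step in between.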
  • Create a SageMaker Pipeline that triggers when new raw data appears in S3, processes the data, loads it into the Feature Store, runs it through the model to generate predictions, and writes the predictions back to a file in S3
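One possible way to wire up the S3 trigger is an AWS Lambda function subscribed to the bucket's object-created events, sketched below. The source does not say which trigger mechanism is intended, so the Lambda approach, the pipeline name, and the InputDataUri pipeline parameter are all assumptions; start_pipeline_execution is the boto3 SageMaker client call that starts a run.

```python
from urllib.parse import unquote_plus

PIPELINE_NAME = "bike-batch-predictions"  # hypothetical pipeline name

def pipeline_requests_from_s3_event(event, pipeline_name=PIPELINE_NAME):
    """Turn each S3 record in the event into start_pipeline_execution arguments."""
    requests = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        requests.append({
            "PipelineName": pipeline_name,
            "PipelineParameters": [
                # Assumes the pipeline defines a parameter with this name.
                {"Name": "InputDataUri", "Value": f"s3://{bucket}/{key}"},
            ],
        })
    return requests

def lambda_handler(event, context):
    """Lambda entry point: start one pipeline execution per new object."""
    import boto3  # imported here so the helper above runs without AWS access
    client = boto3.client("sagemaker")
    return [
        client.start_pipeline_execution(**request)["PipelineExecutionArn"]
        for request in pipeline_requests_from_s3_event(event)
    ]

# Made-up event in the shape S3 sends to Lambda.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-data"}, "object": {"key": "incoming/day+2024.csv"}}}
    ]
}
sample_requests = pipeline_requests_from_s3_event(sample_event)
```

Keeping the event parsing in its own function makes it possible to check the request payloads locally before deploying the handler.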

Project 2 - 10/15/24 Model Registry, GitHub Connection, Model Cards, Pickle File

Deliverable 1: register model in model registry with model card (metadata about the model - tailored to development team + 5/3 information standard)

Deliverable 2: establish live connection to remote GitHub repository to pull changes when account deactivation occurs

Deliverable 3 (Stretch Goal): export the trained model coefficients to a .pickle file (in one notebook), then load the .pickle file in a separate notebook and pass new data through it to generate predictions
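A minimal sketch of Deliverable 3, assuming the model is a plain linear regression whose intercept and coefficients are stored as a dictionary. The function names, file name, feature names, and numeric values are hypothetical; in practice the two functions would live in separate notebooks.

```python
import pickle
import tempfile
from pathlib import Path

def export_coefficients(path, intercept, coefficients):
    """Notebook 1: write the fitted intercept and coefficients to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump({"intercept": intercept, "coefficients": dict(coefficients)}, handle)

def predict_from_pickle(path, rows):
    """Notebook 2: reload the coefficients and score new rows (dicts keyed by feature name)."""
    with open(path, "rb") as handle:
        model = pickle.load(handle)
    return [
        model["intercept"]
        + sum(weight * row[name] for name, weight in model["coefficients"].items())
        for row in rows
    ]

# Made-up coefficients and rows for illustration.
model_path = Path(tempfile.mkdtemp()) / "linear_model.pickle"
export_coefficients(model_path, intercept=10.0, coefficients={"temp": 2.0, "hum": -1.0})
predictions = predict_from_pickle(
    model_path,
    [{"temp": 3.0, "hum": 4.0}, {"temp": 0.0, "hum": 0.0}],
)
print(predictions)  # [12.0, 10.0]
```

Pickle files can execute code when loaded, so only files produced by a trusted notebook should be unpickled.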

About

AWS Sagemaker Deliverables for Fifth Third Bank
