Ger Inberg https://gerinberg.com/ data science developer Tue, 03 Sep 2024 12:41:39 +0000

FeatureExtraction on CRAN https://gerinberg.com/2024/09/03/featureextraction-on-cran/ Tue, 03 Sep 2024 12:40:01 +0000

Feature engineering is a crucial step in the data science process, often making the difference between a good model and a great one. It involves transforming raw data into meaningful features that can improve the performance of predictive models. For those working in R, the FeatureExtraction package on CRAN offers a powerful and flexible toolset for automating and streamlining this process.

Originally developed as part of the OHDSI (Observational Health Data Sciences and Informatics) ecosystem, FeatureExtraction is particularly well-suited for working with large-scale observational data. In this post, we’ll explore the package in detail, focusing on its core features, practical applications, and a step-by-step example to help you get started.

Key Features and Capabilities

1. Automated Feature Generation

FeatureExtraction excels at automatically generating a wide range of features from raw data. These features include basic demographic variables like age and gender, as well as more complex attributes derived from longitudinal data, such as the frequency of medical visits or the presence of certain conditions over time.

2. Temporal Features

Temporal data, such as patient histories or time-dependent events, are common in many fields, especially healthcare. FeatureExtraction handles temporal data adeptly, allowing users to define time windows relative to key events (e.g., diagnosis dates). This feature is crucial for creating time-sensitive covariates that capture trends and patterns in data over specified periods.

3. Custom Feature Extraction

While the package offers extensive automated capabilities, it also allows for custom feature extraction. Users can define custom covariates and specify how these should be generated from the underlying data, incorporating domain-specific knowledge into the feature engineering process.

4. Scalability

Feature engineering can become computationally intensive, particularly with large datasets. FeatureExtraction is designed for scalability, leveraging parallel processing and optimized algorithms to ensure that feature extraction remains efficient even with big data.

5. Integration with OHDSI Tools

As part of the OHDSI ecosystem, FeatureExtraction integrates seamlessly with other tools like PatientLevelPrediction and CohortMethod, enabling a smooth workflow from data extraction to model building and analysis.

Getting started

Installing the FeatureExtraction package is straightforward. You can install it directly from CRAN using the following command:

install.packages("FeatureExtraction")

Load the package in your R session:

library(FeatureExtraction)

Practical Example: Creating Covariates Based on Other Cohorts

To illustrate how FeatureExtraction can be applied, let’s walk through an example where we create covariates based on the presence of patients in other cohorts. This is particularly useful in studies where the relationship between different conditions or treatments over time is of interest.

Step 1: Setting Up the Database Connection

First, we need to define the connection to our CDM-compliant database:

connectionDetails <- createConnectionDetails(
  dbms = "postgresql",
  server = "your_server",
  user = "your_username",
  password = "your_password"
)
cdmDatabaseSchema <- "your_cdm_schema"
cohortDatabaseSchema <- "your_cohort_schema"

Step 2: Define the Cohorts of Interest

Assume we have a cohort of patients with diabetes and another cohort with a history of cardiovascular disease. We want to create a feature that indicates whether a patient in the diabetes cohort has a prior history of cardiovascular disease.

# Define cohort IDs (these would be predefined in your database)
diabetesCohortId <- 1
cvdCohortId <- 2

Step 3: Create the Feature Extraction Settings

Next, we define the feature extraction settings, specifying that we want to create covariates based on the presence of patients in the cardiovascular disease cohort:

covariateSettings <- createCohortBasedCovariateSettings(
  useDemographicsGender = TRUE,
  useDemographicsAge = TRUE,
  cohortId = cvdCohortId,
  startDay = -365,
  endDay = 0
)

In this example, the startDay and endDay parameters define a time window of one year prior to the cohort’s index date. This means the feature will reflect whether a patient was in the cardiovascular disease cohort within one year before the index date.

Step 4: Extract the Features

Now, we extract the features for the diabetes cohort using the settings we defined:

covariateData <- getDbCovariateData(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cohortTable = "cohort",
  cohortId = diabetesCohortId,
  covariateSettings = covariateSettings
)

This function retrieves the covariate data for the specified cohort, based on the feature extraction settings we provided.

Step 5: Use the Extracted Features

The extracted features are now available in the covariateData object, which can be used for further analysis, such as model building or cohort characterization.

# Explore the covariate data
summary(covariateData)

This simple example demonstrates how FeatureExtraction can be used to create meaningful features based on different cohorts. The package’s flexibility and scalability make it a powerful tool for a wide range of applications, from small-scale studies to large observational databases.

The post FeatureExtraction on CRAN appeared first on Ger Inberg.

DrugExposure Diagnostics https://gerinberg.com/2023/04/01/drugexposurediagnostics/ Sat, 01 Apr 2023 09:32:00 +0000


DrugExposureDiagnostics: A Comprehensive R Package for Assessing Drug Exposure in Clinical Research

Drug exposure is an essential aspect of clinical research, as it directly affects the efficacy and safety of drugs. Measuring drug exposure accurately and understanding the factors that influence it is crucial for clinical decision-making. This is where the R package DrugExposureDiagnostics comes in handy.

As the author of this R package, I am excited to introduce you to this powerful tool for analyzing drug exposure data. Before delving into the package, let’s first understand what drug exposure is and why it is crucial in clinical research.

Drug exposure refers to the extent to which a drug enters and stays in the body, thereby producing its intended therapeutic effects. Measuring drug exposure accurately involves capturing key metrics, such as drug concentrations, AUC, Cmax, and Tmax. By doing so, researchers can evaluate drug efficacy and safety and make informed decisions regarding dosing and administration.

One way to capture drug exposure data is through the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), developed by the Observational Health Data Sciences and Informatics (OHDSI) community. The OMOP CDM standardizes and integrates data from various sources, allowing for large-scale observational studies and analysis.

This is where the R package DrugExposureDiagnostics comes in. It is a comprehensive tool for analyzing drug exposure data in the OMOP CDM format. The package includes functions for calculating various exposure metrics, handling missing data, and summarizing data at different levels, such as by subject or visit. Additionally, it provides tools for identifying outliers and comparing exposure between groups.

DrugExposureDiagnostics has been extensively tested and validated, ensuring that it produces accurate results. The package has been released on the Comprehensive R Archive Network (CRAN), making it easily accessible to R users worldwide. To use the package, simply install it using the install.packages() function in R and load it using the library() function.
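In concrete terms, that is:

```r
# Install DrugExposureDiagnostics from CRAN and load it
install.packages("DrugExposureDiagnostics")
library(DrugExposureDiagnostics)
```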

If you are interested in learning more about DrugExposureDiagnostics or trying it out for yourself, visit the package's GitHub repository.

The post DrugExposure Diagnostics appeared first on Ger Inberg.

Multi page shiny apps https://gerinberg.com/2022/08/11/multi-page-shiny-apps/ Thu, 11 Aug 2022 18:09:05 +0000

Web applications can have rich functionality nowadays. For example, the website of an E-commerce shop has a page about the products they are selling, a page about their conditions, a shopping cart page and an order page. Furthermore, it can handle different HTTP requests from the user: a GET request is used to retrieve a page (e.g. products) from the server, whereas a POST request is used to send information to the server (e.g. place an order).

Problem

Shiny apps are, by default, a bit limited when looking at them from this perspective. They can handle only GET requests, unless you are a technical expert in this field (see this Stack Overflow post). Furthermore, Shiny apps can have just one entry point, "/", so you can't have another entry point such as "/page2". Thus, the E-commerce shop is not possible out of the box in R Shiny.

Solution

There are multiple solutions for supporting multiple pages. The one I have been using for a while is the brochure package developed by Colin Fay. It is still in development, so you might encounter some issues, but I haven't found any major bugs yet. A brochure app consists of a series of pages, each defined by an endpoint/path, a UI and a server function. Thus each page has its own Shiny session, its own UI, and its own server! This is important to keep in mind. A separate session for each page has some advantages but also some disadvantages (e.g. how do you pass user data between pages?). A very simple brochureApp looks like this:

library(shiny)
library(brochure)

brochureApp(
  # First page
  page(
    href = "/",
    ui = fluidPage(
      h1("My first page"), 
      plotOutput("plot")
    ),
    server = function(input, output, session){
      output$plot <- renderPlot({
        plot(iris)
      })
    }
  ), 
  # Second page, no server-side function
  page(
    href = "/page2", 
    ui =  fluidPage(
      h1("My second page")
    )
  )
)

Donation app

Coming back to the E-commerce shop example, I have developed an app where one can sponsor me for my open source work on R packages. The app has an integration with Stripe to make a donation, plus a thank-you page and an error page. When calling Stripe you have to provide the endpoints for these two pages, and by using brochure I am able to set up these endpoints. See the app on shinyapps.io and of course I would appreciate it if you use the app! :-)

The post Multi page shiny apps appeared first on Ger Inberg.

Speed skating viz updated https://gerinberg.com/2021/12/30/speed-skating-viz-updated/ Thu, 30 Dec 2021 14:54:12 +0000


Speed skating is one of my favorite sports to practice and to watch. This winter the Winter Olympics will be held in Beijing, China. Will the Dutch be as successful as they were in Sochi and PyeongChang? How many medals will the Chinese win?

Four years ago, I created a visualization about past medal winners at the Olympic Games. I have updated it now with the results of the games in 2018 at PyeongChang.

See the live version. Let me know if you like it or if you are interested in some other charts. Enjoy the Winter Olympics!

The post Speed skating viz updated appeared first on Ger Inberg.

Catchment Area Research Dashboard https://gerinberg.com/2021/07/13/catchment-area-research-dashboard/ Tue, 13 Jul 2021 08:04:00 +0000

In healthcare, the catchment area is the area served by a hospital or medical centre. The Rutgers Cancer Institute of New Jersey has one main goal: to help individuals fight cancer. More specifically, they are targeting cancer with precision medicine, immunotherapy and clinical trials, next to providing advanced cancer care to adults and children.

In order to serve patients as best as they can, researchers need as much (quality) data as possible to serve their purpose. Surveillance, Tracking and Reporting through Informed Data Collection and Engagement (STRIDE) is an interactive data and visualization dashboard. It includes clinical trials enrollment, bio-specimen inventory, tumor registry analytic cases, and catchment area information.

They have approached me to improve the user interface of their dashboard, and that's what I have been doing! Next to this, I have been helping the person who created the initial dashboard with the following concepts.

  • DRY (Don’t Repeat Yourself); this is basically re-using of code that you already wrote so you don’t have to write this code again and you will end up with less code to maintain.
  • Reproducible results. When I tried to run the initial dashboard, it didn’t work since some packages were not imported and it was not clear which version of those packages I should use. I have added renv as dependency management, this will add a file to your project containing all packages and their versions. It’s easy to setup and it’s worth it!
  • Interactive charts. The initial charts were made with the package ggplot2. This is a nice package for data visualization and it offers many charts and display options. But, it’s not interactive and that’s what most people want and expect in a dashboard.
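The renv setup mentioned above boils down to a few calls (a minimal sketch; run inside the dashboard's project directory):

```r
# Set up renv for the project and record exact package versions
install.packages("renv")
renv::init()      # creates a project-local library and an renv.lock file
renv::snapshot()  # writes the packages and versions in use to renv.lock

# Anyone checking out the project can then recreate the environment with:
renv::restore()
```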

See below for some screenshots of this app.

The post Catchment Area Research Dashboard appeared first on Ger Inberg.

CITO public analysis https://gerinberg.com/2021/03/12/cito-public-analysis/ Fri, 12 Mar 2021 07:24:00 +0000

CITO is an institute in the Netherlands that supports governments and schools so that they can develop world-class testing and monitoring systems to complete their educational programs. They have a lot of data regarding test scores, and it could be interesting to combine this data with public data. For example, are the test scores of children living in deprived areas worse than average?

Exploratory Analysis

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often by using data visualization methods. The main purpose of EDA is to help look at data before making any assumptions. For me it's one of the nicest parts of data science, since you don't know yet what's in the data and there will always be surprises. It's like being on holiday and exploring an area you are seeing for the first time :-)

For example, is a certain variable in the data normally distributed or not? Is there any missing data, or are there duplicated values? In my experience there is, in most cases, both missing and duplicated data. We need to fix these issues before we can do the real analysis. This phase is called data cleaning; you might have heard about it before.
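As a minimal sketch of such checks in base R (the `scores` data frame here is hypothetical, not CITO's actual data):

```r
# A tiny hypothetical data set with one missing and one duplicated row
scores <- data.frame(
  student = c("a", "b", "b", "c"),
  score   = c(540, 538, 538, NA)
)

colSums(is.na(scores))   # missing values per column
sum(duplicated(scores))  # number of fully duplicated rows

# A common first cleaning pass: drop duplicates, then incomplete rows
cleaned <- scores[!duplicated(scores), ]
cleaned <- cleaned[complete.cases(cleaned), ]
```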

Representativeness Analyses

In general, a representative sample is a group or set chosen from a larger statistical population that adequately replicates the larger group according to whatever characteristic or quality is under study. In the case of CITO, we would like to know whether the sample data set has more or less the same characteristics regarding scores. For example, are the average and standard deviation of the sample data set close to those of the total data set? I have plotted the distributions of the two data sets in a single chart in order to compare them. In the subtitle one can find the average, standard deviation and median.
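Such a comparison chart can be sketched as follows, with synthetic data standing in for the real score sets and ggplot2 chosen purely for illustration:

```r
library(ggplot2)

# Synthetic stand-ins for the total and sample score sets
total_set  <- data.frame(score = rnorm(10000, mean = 535, sd = 10), set = "total")
sample_set <- data.frame(score = rnorm(500,   mean = 536, sd = 9),  set = "sample")
scores <- rbind(total_set, sample_set)

# Overlay the two distributions in a single chart, with the average,
# standard deviation and median in the subtitle
ggplot(scores, aes(x = score, fill = set)) +
  geom_density(alpha = 0.4) +
  labs(
    title = "Score distribution: sample vs. total data set",
    subtitle = sprintf(
      "sample: mean %.1f, sd %.1f, median %.1f | total: mean %.1f, sd %.1f, median %.1f",
      mean(sample_set$score), sd(sample_set$score), median(sample_set$score),
      mean(total_set$score), sd(total_set$score), median(total_set$score)
    )
  )
```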

Below you can find some of the charts I made for both the EDA and the representativeness analyses. The code is available in a public repository on GitHub. It can be run using a Docker container, with R and renv for library management.

The post CITO public analysis appeared first on Ger Inberg.

AI Redaction Application https://gerinberg.com/2020/12/15/ai-redaction-application/ Tue, 15 Dec 2020 14:32:00 +0000

What is redaction?

Redaction is the blacking out or deletion of text in a document. It is intended to allow the selective disclosure of information in a document while keeping other parts of the document secret. It is common within court documents and in the government. Categories of redacted items include phone numbers, e-mail addresses, bank account numbers, dates and names. It takes quite some time to manually redact documents, but fortunately AI can help to speed up this process. Natural Language Processing (NLP) is a subfield of AI that studies how to analyze and process a piece of natural text. This technology allows us to extract the keywords from the text.

Slimmer AI develops AI software products that support industries, solve real-world challenges and take professionals into the future. They have developed an API that allows the redaction of PDF files. This API returns the redacted document based on your redaction action (e.g. redact all phone numbers). I have collaborated with Slimmer AI on building the interface for their new redaction application.

Redaction Application

The developed application has the following features:

  • search for keyword(s) in the text, this can be a regular expression
  • AI search: search for items in a category like phone numbers
  • select a piece of text in the document
  • redact the results from the actions above
  • display the redacted PDF

Below you see a screenshot of the application. The left sidebar is the search column where the keyword and AI search can be performed. At the bottom of this sidebar, the results of the search are shown. When a user clicks on a result, it is selected for redaction.

The center of the application contains the document. This is the section where the text selection is performed. Once a piece of text is selected a popup appears that asks if the selected text should be redacted or not.

The right column contains the items that have been selected for redaction. When the user pushes the ‘Redact All’ button, the document is processed on the backend and the middle section will show the redacted version of the document.

The application uses the PDF.js library for basic functionality like rendering the PDF and selecting text. It is a free and open source library. There are some commercial libraries that offer more functionality, but they were not required. The rest of the technology stack for the application includes JavaScript, jQuery, Bootstrap 4 and HTML/CSS.

Improvements

The application was meant as a Proof of Concept, to see if we could create a user-friendly wrapper for the API. Since the current functionality is working well, the application is being developed further. One thing on the improvements list is an option for a rectangle select. So, next to redacting a piece of text on a line, as we can do now, this would allow the user to redact any rectangular area in the document.

The post AI Redaction Application appeared first on Ger Inberg.

My Electricity Balance https://gerinberg.com/2020/10/14/my-electricity-balance/ Wed, 14 Oct 2020 09:02:38 +0000

Since the beginning of this year I have had solar panels on the roof of my house. The electricity they are producing should be more than enough to cover what I am currently using. I bought a few more solar panels than needed, since I expect to have an electric car in a couple of years.

My energy provider gives me my monthly usage and production. Based on this I have created the visual below. It shows for each month the consumption (red) vs the production (blue). The electricity meter was installed on March 1, so there is no data before that date. The solar panels were installed in May. So far this year the numbers are looking very good, since the solar panels produce the most in spring and summer. In the coming months the bars will be more red! See below for my current balance, or have a look at my energy page for the current status.

The post My Electricity Balance appeared first on Ger Inberg.

eRum lightning talk speaker https://gerinberg.com/2020/06/16/erum-lightning-talk-speaker/ Tue, 16 Jun 2020 13:44:44 +0000


This week, the European R Users Meeting (eRum) is happening. It's a biennial conference that brings the R user community together, and this year it was to be held in Milan. Because of COVID-19, the organizers decided to hold the whole conference online. I am very happy that the conference didn't have to be cancelled, though it's too bad we can't visit Milan. Furthermore, I am excited that I will give a lightning talk about "Reproducible Data Visualization with CanvasXpress"!

I have been working with CanvasXpress for a couple of years now, so I know it quite well; I wrote about it before on this blog. Many times I have been surprised by the amount of functionality that the library provides, especially all the options that are available after the chart has been created. There's a 'Reproducible Research' sub-menu, which has lately been extended with a very cool replay option.

Replay

When you make changes to a rendered plot, canvasXpress keeps a history of these changes. You can reset the chart back to its original state and replay the previous changes you have made using the replay button. This button is the second button from the left in the top menu bar, see the screenshot below. It's only available if you have made any changes to the plot.

The replay creates a new window that displays all the user actions step by step. For each step, more information is available when selecting that step. In the example above, I have removed the x-axis on top in step 1. This relates to the property ‘xAxisShow’.

What if you would like to share this replay with your coworker? Well, you can download the chart in PNG format using the camera icon in the top menu. The downloaded image also contains the user actions, so if your coworker imports your canvasXpress PNG, they can do the same replay as you. Pretty nice, huh?

I will give a demonstration about the replay functionality in my presentation this Friday. Please have a look at the eRum schedule for the exact time and (online) location.

The post eRum lightning talk speaker appeared first on Ger Inberg.

Linear Regression with R https://gerinberg.com/2020/06/01/r-linear-regression/ Mon, 01 Jun 2020 11:32:00 +0000

You might have heard about linear regression and machine learning before. Basically, linear regression is a simple statistics problem. But what are the different types of linear regression, and how do you implement them in R?

Introduction to Linear Regression

Linear regression is an algorithm developed in the field of statistics. As the name suggests, linear regression assumes a linear relationship between the input variable(s) and a single output variable. The output variable, what you’re predicting, has to be continuous. The output variable can be calculated as a linear combination of the input variable(s).

There are two types of linear regression:

  • Simple linear regression – only one input variable
  • Multiple linear regression – multiple input variables

We will implement both today – simple linear regression from scratch and multiple linear regression with built-in R functions.

You can use a linear regression model to learn which features are important by examining the coefficients. If a coefficient is close to zero, the corresponding feature is considered less important than if the coefficient were a large positive or negative value.

That’s how the linear regression model generates the output. Coefficients are multiplied with corresponding input variables, and in the end, the bias (intercept) term is added.

There’s still one thing we should cover before diving into the code – assumptions of a linear regression model:

  • Linear assumption — model assumes that the relationship between variables is linear
  • No noise — model assumes that the input and output variables are not noisy — so remove outliers if possible
  • No collinearity — model will overfit when you have highly correlated input variables
  • Normal distribution — the model will make more reliable predictions if your input and output variables are normally distributed. If that’s not the case, try using some transforms on your variables to make them more normal-looking
  • Rescaled inputs — use scalers or normalizers to make more reliable predictions

You should be aware of these assumptions every time you’re creating linear models. We’ll ignore most of them for the purpose of this article, as the goal is to show you the general syntax you can copy-paste between the projects. 

Simple Linear Regression from Scratch

If you have a single input variable, you're dealing with simple linear regression. It won't be the case most of the time, but it can't hurt to know. A simple linear regression can be expressed as:

y = b0 + b1 · x

As you can see, there are two terms you need to calculate beforehand: b0 and b1. You'll first see how to calculate b1, as b0 depends on it. This is the formula:

b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

And this is the formula for b0:

b0 = ȳ − b1 · x̄

These x’s and y’s with the bar over them represent the mean (average) of the corresponding variables.

Let’s see how all of this works in action. The code snippet below generates X with 500 linearly spaced numbers between 1 and 500, and generates Y as a value from the normal distribution centered just above the corresponding X value with a bit of noise added. Both X and Y are then combined into a single data frame and visualized as a scatter plot with the plotly package:

library(plotly)

# Generate synthetic data with a linear relationship
x <- seq(from = 1, to = 500)
y <- rnorm(n = 500, mean = 0.5 * x + 70, sd = 30)
lr_data <- data.frame(x, y)

# Create the plot
plot_ly(data = lr_data, x = ~x, y = ~y, marker = list(size = 10)) %>%
  layout(title = list(text = paste0('Simple linear regression', '<br><sup>',
                                    'Linear relation is visible', '</sup>'))) %>%
  config(displayModeBar = F)

Let’s calculate the coefficients now. The coefficients for Beta0 and Beta1 are obtained first, and then wrapped into a lr_predict() function that implements the line equation.

The predictions can then be obtained by applying the lr_predict() function to the vector X – they should all be on a single straight line. Finally, input data and predictions are visualized:

# Calculate coefficients
b1 <- (sum((x - mean(x)) * (y - mean(y)))) / (sum((x - mean(x))^2))
b0 <- mean(y) - b1 * mean(x)

# Define function for generating predictions
lr_predict <- function(x) { return(b0 + b1 * x) }

# Calculated predictions: Apply lr_predict() to input
lr_data$ypred <- sapply(x, lr_predict)

# Visualize input data and the best fit line
plot_ly(data = lr_data, x = ~x) %>%
  add_markers(y = ~y, marker = list(size = 10)) %>%
  add_lines(x = ~x, y = lr_data$ypred, line = list(color = "black", width = 5)) %>%
  layout(title = list(text = paste0('Applying simple linear regression to data', '<br><sup>',
                                    'Black line = best fit line', '</sup>')),
         showlegend = FALSE) %>%
  config(displayModeBar = F)

And that’s how you can implement simple linear regression in R! 

Multiple Linear Regression

You’ll use the Boston Housing dataset to build your model. To start, the goal is to load in the dataset and check if some of the assumptions hold. Normal distribution and outlier assumptions can be checked with boxplots.

The code snippet below loads in the dataset and visualizes box plots for every feature (not the target):

library(reshape)

df <- read.csv("https://raw.githubusercontent.com/ginberg/boston_housing/master/housing.csv")

# Remove target variable
temp_df <- subset(df, select = -c(MEDV))
melt_df <- melt(temp_df)

plot_ly(melt_df, 
        y = ~value, 
        color = ~variable, 
        type = "box") %>%
   config(displayModeBar = F)

A degree of skew seems to be present in all input variables, and they all contain a couple of outliers. We'll keep this blog focused on machine learning, so we won't do any data preparation/cleaning.

The next step once you’re done with preparation is to split the data into testing and training data. The caTools package is the perfect candidate for this task. 

You can train the model on the training set after the split. R has the lm function built-in, and it is used to train linear models. Inside the lm function, you'll need to write the target variable on the left and input features on the right, separated by the ~ sign. If you put a dot instead of feature names, it means you want to train the model on all features.

After the model is trained, you can call the summary() function to see how well it performed on the training set. Here’s a code snippet for everything discussed above:

library(caTools)
set.seed(21)

# Train/Test split, 80:20 ratio
sample_split <- sample.split(Y = df$MEDV, SplitRatio = 0.8)
train_set    <- subset(x = df, sample_split == TRUE)
test_set     <- subset(x = df, sample_split == FALSE)
# Fit the model and print summary
model        <- lm(MEDV ~ ., data = train_set)
summary(model)

The most interesting results are the P-values, displayed in the Pr(>|t|) column. Those values indicate the probability of a variable not being important for prediction. It's common to use a 5% significance threshold, so if a P-value is 0.05 or below, we can say that there's a low chance it is not significant for the analysis.
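One way to pull those values out of the fitted model (a small sketch, reusing the `model` object from above):

```r
# Extract the p-values from the model summary and keep the
# predictors below the 5% significance threshold
pvals <- summary(model)$coefficients[, "Pr(>|t|)"]
significant <- names(pvals)[pvals <= 0.05]
significant
```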

Let’s make a residuals plot now. As a general rule, if a histogram of residuals looks normally distributed, the linear model is as good as it can be. If not, it means you can improve it. Here’s the code for visualizing residuals:

# Get residuals
lm_residuals <- as.data.frame(residuals(model))
colnames(lm_residuals) <- "residual"

# Visualize residuals as a histogram
plot_ly(data = lm_residuals, x = ~residual, type = "histogram") %>%
  config(displayModeBar = F)

As you can see, there's a bit of skew present due to a large error on the far right. Now, let's make predictions on the test set. You can use the predict() function to apply the model to the test set. You can combine the actual values and predictions into a single data frame, just so the evaluation becomes easier. Here's how:
# predict price for test_set 
predicted_prices <- predict(model, newdata = test_set)
result <- data.frame(Y = test_set$MEDV, Ypred = predicted_prices)

A good way of evaluating your regression models is to look at the RMSE (Root Mean Squared Error). This metric will inform you how wrong your model is on average. In this case, it reports back the average number of price units the model is wrong:

mse  <- mean((result$Y - result$Ypred)^2)
rmse <- sqrt(mse)

The rmse variable holds the value of 70.821, indicating the model is on average wrong by 70.821 price units.

Conclusion

In this blog you’ve learned how to train linear regression models in R. You’ve implemented a simple linear regression model entirely from scratch. After that you have implemented a multiple linear regression model with  on the real dataset. You’ve also learned how to evaluate the model through summary functions, residuals plots, and the RMSE metric. 

If you want to implement machine learning in your organization, feel free to contact me.

The post Linear Regression with R appeared first on Ger Inberg.
