<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Lasse Hansen</title>
    <link>https://lassehansen.me/</link>
      <atom:link href="https://lassehansen.me/index.xml" rel="self" type="application/rss+xml" />
    <description>Lasse Hansen</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Thu, 01 Dec 2022 15:00:00 +0000</lastBuildDate>
    <image>
      <url>https://lassehansen.me/images/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_2.png</url>
      <title>Lasse Hansen</title>
      <link>https://lassehansen.me/</link>
    </image>
    
    <item>
      <title>Inferring Neuropsychiatric Conditions from Language - How Specific are Transformers and Traditional ML Pipelines?</title>
      <link>https://lassehansen.me/talk/speech-biomarker-talk/</link>
      <pubDate>Thu, 01 Dec 2022 15:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/talk/speech-biomarker-talk/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click on the &lt;strong&gt;Slides&lt;/strong&gt; button above to view the built-in slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Invited talk at the Harvard-MIT Speech and Language Biomarker Interest Group on work related to my master&amp;rsquo;s thesis on a multi-class approach to inferring neuropsychiatric conditions from voice (preprint out Jan 2023). Joint work with 
&lt;a href=&#34;https://rbroc.github.io/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Roberta Rocca&lt;/a&gt; and Riccardo Fusaroli, and presented together with Roberta Rocca.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>PSYCOP</title>
      <link>https://lassehansen.me/project/psycop_project/</link>
      <pubDate>Wed, 01 Jun 2022 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/project/psycop_project/</guid>
      <description>&lt;h1 id=&#34;psycop&#34;&gt;PSYCOP&lt;/h1&gt;
&lt;p&gt;The PSYCOP project aims to use the wealth of information in Electronic Health Records to improve patient care and treatment of those with mental illness.&lt;/p&gt;
&lt;p&gt;To do this, we develop machine learning models with a specific focus on evaluation and clinical applicability.&lt;/p&gt;
&lt;p&gt;We are currently developing tools for thoroughly validating our data and models and for quickly iterating on classical machine learning models. Check out our GitHub repositories for 
&lt;a href=&#34;https://github.com/Aarhus-Psychiatry-Research/psycop-ml-utils&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;feature generation and data validation&lt;/a&gt; and 
&lt;a href=&#34;https://github.com/Aarhus-Psychiatry-Research/psycop-t2d&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;model training and evaluation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Ongoing and planned projects include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Representation learning on the multi-modal EHR data&lt;/li&gt;
&lt;li&gt;Deep sequential models for EHR time-series&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For more information, see 
&lt;a href=&#34;https://www.cambridge.org/core/journals/acta-neuropsychiatrica/article/psychiatric-clinical-outcome-prediction-psycop-cohort-leveraging-the-potential-of-electronic-health-records-in-the-treatment-of-mental-disorders/73CDCC5B36FF1347E6419EC7B80DEC48&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;our paper which outlines the cohort and research directions&lt;/a&gt;, or reach out!&lt;/p&gt;
&lt;h2 id=&#34;members&#34;&gt;Members&lt;/h2&gt;
&lt;p&gt;Faculty:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;https://pure.au.dk/portal/da/persons/soeren-dinesen-oestergaard%2896ff5c6c-20cf-4531-aaee-d10e4efd6292%29.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Professor Søren Dinesen Østergaard (PI)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://pure.au.dk/portal/da/persons/andreas-aalkjaer-danielsen%287cd087d3-aaea-4d54-93b8-04601ed13c0a%29.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Andreas Aalkjær Danielsen, MD&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://knielbo.github.io&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Associate professor Kristoffer Laigaard Nielbo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;PhD students:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;https://lassehansen.me&#34;&gt;Lasse Hansen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://kennethenevoldsen.com&#34;&gt;Kenneth Enevoldsen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://www.linkedin.com/in/martin-bernstorff-03226a124&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Martin Bernstorff&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://www.linkedin.com/in/erik-perfalk/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Erik Perfalk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://www.linkedin.com/in/frida-h%c3%a6strup-7716741a1&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Frida Hæstrup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Inferring Mental Illness from Voice</title>
      <link>https://lassehansen.me/talk/turing_talk/</link>
      <pubDate>Thu, 21 Apr 2022 15:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/talk/turing_talk/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click on the &lt;strong&gt;Slides&lt;/strong&gt; button above to view the built-in slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Invited talk at the Data Science for Mental Health group at the Alan Turing Institute. The talk was based on the paper &amp;ldquo;A generalizable speech emotion recognition model reveals depression and remission&amp;rdquo; and covered the pitfalls and prospects of classifying mental illness from voice.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Gjallarhorn</title>
      <link>https://lassehansen.me/project/danish_wav2vec/</link>
      <pubDate>Wed, 02 Feb 2022 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/project/danish_wav2vec/</guid>
      <description>&lt;p&gt;The Gjallarhorn project seeks to democratize Danish speech technology by open sourcing models and resources. We recently released a version of XLS-R-300m pretrained on 140.000 hours of Danish radio along with a model finetuned for Danish automatic speech recognition (ASR). The model outperformed the previous state-of-the-art ASR model by 20%.&lt;/p&gt;
&lt;p&gt;The project recently received a grant from the Danish e-infrastructure Cooperation (DeiC) for computational resources to continue this line of work. During fall 2022, we will continue training and releasing new models in collaboration with 
&lt;a href=&#34;https://alvenir.ai&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Alvenir&lt;/a&gt; and the 
&lt;a href=&#34;https://alexandra.dk&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Alexandra Institute&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Check out our releases on the Huggingface Hub!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;https://huggingface.co/chcaa/xls-r-300m-danish&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Pretrained model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://huggingface.co/chcaa/alvenir-wav2vec2-base-da-nst-cv9&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Finetuned for ASR&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;team&#34;&gt;Team&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Lasse Hansen (PI)&lt;/li&gt;
&lt;li&gt;Rasmus Arpe Fogh Jensen (
&lt;a href=&#34;https://alvenir.ai&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Alvenir&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Martin Carsten Nielsen (
&lt;a href=&#34;https://alvenir.ai&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Alvenir&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Søren Winkel Holm (
&lt;a href=&#34;https://alvenir.ai&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Alvenir&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Anders Pedersen (
&lt;a href=&#34;https://alexandra.dk&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Alexandra Institute&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Danish Foundation Models</title>
      <link>https://lassehansen.me/project/danish_foundation_models/</link>
      <pubDate>Tue, 01 Feb 2022 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/project/danish_foundation_models/</guid>
      <description></description>
    </item>
    
    <item>
      <title>A generalizable speech emotion recognition model reveals depression and remission</title>
      <link>https://lassehansen.me/publication/emodep/</link>
      <pubDate>Tue, 30 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/publication/emodep/</guid>
      <description></description>
    </item>
    
    <item>
      <title>DaCy: A unified framework for Danish NLP</title>
      <link>https://lassehansen.me/publication/dacy/</link>
      <pubDate>Wed, 17 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/publication/dacy/</guid>
      <description></description>
    </item>
    
    <item>
      <title>The PSYchiatric clinical outcome prediction (PSYCOP) cohort: leveraging the potential of electronic health records in the treatment of mental disorders</title>
      <link>https://lassehansen.me/publication/psycop/</link>
      <pubDate>Mon, 09 Aug 2021 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/publication/psycop/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Neural Networks - step by step</title>
      <link>https://lassehansen.me/post/neural-networks-step-by-step/</link>
      <pubDate>Tue, 27 Apr 2021 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/post/neural-networks-step-by-step/</guid>
      <description>
&lt;script src=&#34;https://lassehansen.me/post/neural-networks-step-by-step/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;blockquote&gt;
&lt;p&gt;This was created as part of TA’ing the Data Science course for Cognitive Science students. It ended up being fairly extensive, so I thought I’d share it here as others might find it useful. The document is a step-by-step walkthrough of a single training example for a simple feedforward neural network with 1 hidden layer. Linear algebra is kept out, and emphasis is placed on what happens at the individual nodes to develop an intuition for how neural networks actually learn.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This was originally created using the Tufte-style, but I failed to make the HTML version render nicely here. If you prefer, you can find a &lt;a href=&#34;https://lassehansen.me/post/nn/nn_pdf.pdf&#34;&gt;Tufte-style PDF version here&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Neural networks (NNs) are all the rage right now and can seem kind of magical at times. In reality, most NNs are essentially just a bunch of logistic regressions&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; stacked on top of each other + a clever way of distributing blame for the predictions and updating the weights accordingly. In this handout, we will break NNs down step-by-step to hopefully de-mystify them. There are exercises along the way, but I suggest reading through the whole document before doing them to have a better understanding of the big picture.&lt;/p&gt;
&lt;p&gt;I have tried to minimize the math and keep linear algebra out of the way. Implementations of neural networks in libraries such as &lt;code&gt;torch&lt;/code&gt;, &lt;code&gt;keras&lt;/code&gt;/&lt;code&gt;tensorflow&lt;/code&gt; etc. use matrix multiplications and other linear algebra instead of what I’ll show you here&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;. The reason for doing it this way is that, at least for me, it is far more intuitive to see what happens at each individual node. Once you understand this, understanding how it’s done with linear algebra is not as daunting.&lt;/p&gt;
&lt;p&gt;Let’s begin our journey by understanding &lt;strong&gt;activation functions&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;activation-functions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Activation functions&lt;/h2&gt;
&lt;p&gt;Consider a standard feedforward neural network (aka a multilayer perceptron).&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span id=&#34;fig:mlp&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;nn_imgs/nn_bare.png&#34; alt=&#34;Figure 1: Standard MLP&#34; width=&#34;200%&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Standard MLP
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;This network takes 3 inputs (a row of 3 variables/columns in your data), has 4 hidden nodes, and a single output node which means it will output a single number. For regression and binary classification you use a single output node, and for multiclass classification you use an output node per group.&lt;/p&gt;
&lt;p&gt;Upon creation of the network, all weights between the nodes (i.e. the links/lines in the image) are initialized with a random value&lt;a href=&#34;#fn3&#34; class=&#34;footnote-ref&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;. When the network gets a training example as input, the inputs are multiplied by the weights and fed to the next layer of nodes. In our case, each node in the hidden layer receives an input which is a vector of 3 elements: the input times the weight from each input node to the hidden node. Before the hidden node can propagate the signal forward, it has to aggregate these values into a single number and pass it through a &lt;em&gt;non-linear&lt;/em&gt; function.&lt;a href=&#34;#fn4&#34; class=&#34;footnote-ref&#34; id=&#34;fnref4&#34;&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt; This is where the activation function comes in.&lt;/p&gt;
&lt;p&gt;Nowadays, ReLU (and its variants) is the most common activation function, but sigmoid is the OG. The equations are as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[ReLU = max(0, input)\]&lt;/span&gt;
&lt;span class=&#34;math display&#34;&gt;\[sigmoid = \frac{1}{1+e^{-input}}\]&lt;/span&gt;
The input in the above equations refers to the &lt;em&gt;net input&lt;/em&gt; to the nodes, i.e. the weighted sum of all inputs. All the activation function does is apply &lt;em&gt;some function&lt;/em&gt; to the net input. This becomes the node’s &lt;em&gt;activation&lt;/em&gt; and is what is fed to the next layer of nodes in the network in exactly the same manner as from the input to the first hidden layer.&lt;/p&gt;
&lt;p&gt;Let’s see a quick example of how to calculate activation for a single hidden node with 3 input nodes. Let’s assume this is the top hidden node in Figure 1, &lt;span class=&#34;math inline&#34;&gt;\(h_1\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Assume we get an input of (1, 3, 5) in
# x1, x2, x3 respectively
input_nodes &amp;lt;-  c(1, 3, 5)

# these are weights coming into h1
# first element from x1, second from x2 etc
weights_i_h1 &amp;lt;-  c(0.5, 1.2, -0.3)

(weighted_input &amp;lt;-  input_nodes * weights_i_h1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]  0.5  3.6 -1.5&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(net_input &amp;lt;- sum(weighted_input))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 2.6&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Define our activation functions 
sigmoid &amp;lt;- function(x) 1 / (1+ exp(-x))
relu &amp;lt;- function(x) max(0, x)

# Calculate activations
# Note: each node only has 1 activation function. 
# I showed both here to illustrate the difference between them
(h1_activation &amp;lt;- sigmoid(net_input))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.9308616&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;relu(net_input)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 2.6&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Visually, this is what’s happening.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span id=&#34;fig:unnamed-chunk-2&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;nn_imgs/nn_1.png&#34; alt=&#34;Figure 2: Activation function&#34;  /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2: Activation function
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Repeat the process for all the nodes in the network and voilà - you just implemented the forward pass of a neural network!&lt;a href=&#34;#fn5&#34; class=&#34;footnote-ref&#34; id=&#34;fnref5&#34;&gt;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Notice how similar this is to logistic regression: you have a bunch of input nodes/variables, each with an associated weight (the betas), whose weighted sum is passed through the sigmoid (aka logistic) function.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: each node also has a &lt;em&gt;bias&lt;/em&gt; which is an extra trainable parameter (essentially an intercept). For simplicity we assume it to be zero here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;details&gt;
&lt;summary&gt;
&lt;span style=&#34;font-size:1.6rem; font-weight:600&#34;&gt; Exercise &lt;/span&gt;
&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;Implement the entire forward pass for the neural network in the image. (3 input nodes, 4 hidden nodes, 1 output node). Randomly initialize the weights and use [1, 3, 5] as the input nodes. Feel free to use either Python or R.&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
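&lt;p&gt;For reference, here is a minimal sketch of what such a forward pass could look like. It is only a sketch: the weights are drawn at random, every node uses the sigmoid, all biases are assumed to be zero, and the variable names (&lt;code&gt;weights_i_h&lt;/code&gt;, &lt;code&gt;weights_h_o&lt;/code&gt;) are purely illustrative.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch of a forward pass for the network in Figure 1
# (3 inputs, 4 hidden nodes, 1 output node).
# Assumptions: sigmoid everywhere, zero biases, random weights.
set.seed(1)
input_nodes &amp;lt;- c(1, 3, 5)

# one weight vector per hidden node (weights from the 3 inputs)
weights_i_h &amp;lt;- replicate(4, runif(3, -1, 1), simplify = FALSE)
# one weight per hidden node going to the output
weights_h_o &amp;lt;- runif(4, -1, 1)

sigmoid &amp;lt;- function(x) 1 / (1 + exp(-x))

# activation of each hidden node: sigmoid of its net input
hidden_activations &amp;lt;- sapply(
  weights_i_h,
  function(w) sigmoid(sum(input_nodes * w))
)

# activation of the (single) output node
(output_activation &amp;lt;- sigmoid(sum(hidden_activations * weights_h_o)))&lt;/code&gt;&lt;/pre&gt;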
&lt;/div&gt;
&lt;div id=&#34;gradient-descent&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Gradient descent&lt;/h2&gt;
&lt;p&gt;Hurray, the forward pass is done! The network is just making random predictions for now though, so we need to give it a way to learn. This is where &lt;em&gt;backpropagation&lt;/em&gt; and &lt;em&gt;gradient descent&lt;/em&gt; come in. Now, as with any regression model, we need to define some &lt;em&gt;loss function&lt;/em&gt;, i.e. how we calculate the “goodness” of the model. The loss function to use depends on the problem at hand, but a common one for regression is the &lt;em&gt;sum of squared errors&lt;/em&gt;. The equation is very straightforward and looks like this:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[SSE = \sum^n_{i=1}(y-\hat{y})^2\]&lt;/span&gt;
Where &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is the label and &lt;span class=&#34;math inline&#34;&gt;\(\hat{y}\)&lt;/span&gt; your prediction. The greater the distance your prediction is from the target, the larger the SSE is. Therefore, we want to minimize this.&lt;/p&gt;
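&lt;p&gt;As a quick illustration with made-up numbers, computing the SSE for a handful of predictions could look like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy example: SSE for three made-up predictions
y    &amp;lt;- c(1, 0, 1)        # labels
yhat &amp;lt;- c(0.8, 0.3, 0.6)  # predictions
sum((y - yhat)^2)           # 0.04 + 0.09 + 0.16 = 0.29&lt;/code&gt;&lt;/pre&gt;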
&lt;p&gt;From high school calculus, you might recall that the derivative of a function is simply the slope of the function. &lt;a href=&#34;#fn6&#34; class=&#34;footnote-ref&#34; id=&#34;fnref6&#34;&gt;&lt;sup&gt;6&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;img/grad_desc.png&#34; alt=&#34;Figure 3: Loss landscape&#34; width=&#34;100%&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 3: Loss landscape
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The plot above is an illustration of the loss (e.g. the SSE) given different combinations of &lt;em&gt;weights&lt;/em&gt; between nodes.&lt;a href=&#34;#fn7&#34; class=&#34;footnote-ref&#34; id=&#34;fnref7&#34;&gt;&lt;sup&gt;7&lt;/sup&gt;&lt;/a&gt; As the image illustrates, if we are unlucky we can get stuck in a &lt;em&gt;local minimum&lt;/em&gt;, i.e. an area where the gradient is zero but which is not the &lt;em&gt;global minimum&lt;/em&gt;, i.e. the combination of weights with the lowest possible loss. This is just a fact of life for neural networks - you are in no way guaranteed to find the optimal solution. Different weight initializations will lead you to find different minima, which can vary substantially.&lt;/p&gt;
&lt;p&gt;Anyhow, the way we minimize the loss is to calculate the gradient of the loss function in our current state and make changes to our weights based on this. The sign (+ or -) of the gradient tells us which direction to go, and the magnitude tells us how large steps to take. In essence, we start at some random point on the graph above, and slowly make our way down, until we hopefully end at the green dot.&lt;/p&gt;
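&lt;p&gt;As a toy illustration of this &amp;ldquo;walking downhill&amp;rdquo; (separate from our network), here is gradient descent on a simple one-dimensional function, &lt;span class=&#34;math inline&#34;&gt;\(f(w) = (w - 3)^2\)&lt;/span&gt;, whose derivative is &lt;span class=&#34;math inline&#34;&gt;\(2(w - 3)\)&lt;/span&gt;. The starting point and learning rate are arbitrary.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy 1-d gradient descent on f(w) = (w - 3)^2
f_grad &amp;lt;- function(w) 2 * (w - 3)  # derivative (slope) of the loss
w &amp;lt;- 10       # arbitrary starting point
alpha &amp;lt;- 0.1  # learning rate
for (i in 1:50) {
  w &amp;lt;- w - alpha * f_grad(w)  # step against the gradient
}
w  # close to the minimum at w = 3&lt;/code&gt;&lt;/pre&gt;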
&lt;p&gt;The procedure differs slightly if the node is an output or hidden node. Let’s go through it for output nodes first.&lt;/p&gt;
&lt;p&gt;In our case, we had a single output node, so let’s assume we’re doing binary classification (is it a 0 or a 1?).&lt;a href=&#34;#fn8&#34; class=&#34;footnote-ref&#34; id=&#34;fnref8&#34;&gt;&lt;sup&gt;8&lt;/sup&gt;&lt;/a&gt; For binary classification, it makes sense to use the sigmoid as activation function for our output, as it squishes the values to a range between 0 and 1. To calculate the derivative of the loss function with respect to the weights (and biases), &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt;, we need to use the chain rule which eventually gives us this&lt;a href=&#34;#fn9&#34; class=&#34;footnote-ref&#34; id=&#34;fnref9&#34;&gt;&lt;sup&gt;9&lt;/sup&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
\delta &amp;amp;= \textrm{the derivative of the loss function} \cdot  \\&amp;amp; \textrm{the derivative of the activation function}
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Using SSE and sigmoid activation function we get this:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\delta = 2(y-\hat{y}) \cdot a(1-a)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;span class=&#34;math inline&#34;&gt;\(a\)&lt;/span&gt; is the node’s activation, i.e. the value we get after using the activation function (sigmoid) on the sum of the weighted input.&lt;/p&gt;
&lt;p&gt;Let’s see an example calculation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# assume the true label of the target is 1
label &amp;lt;-  1

# assume that the activation of the output
# node (after the entire forward pass) is 0.6
output_node_activation &amp;lt;- 0.6

# Derivative of the sigmoid function:
sigmoid_derivative &amp;lt;- function(x) x*(1-x)

(delta_o &amp;lt;- 2*(label - output_node_activation) * 
    sigmoid_derivative(output_node_activation))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.192&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Visually, this is what’s happening.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span id=&#34;fig:unnamed-chunk-4&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;nn_imgs/nn_2.png&#34; alt=&#34;Figure 4: Calculation of $\delta$ of the output node&#34;  /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 4: Calculation of &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; of the output node
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Easy! Notice that the value of &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; is positive. This is because our prediction (0.6) was below the label (1) and we should therefore increase the weights to get closer to the correct prediction. Had the label been 0 instead, &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; would have been a negative number. Essentially, &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; tells us how we should make changes to our weights to get closer to the optimal state as defined by the loss function.&lt;a href=&#34;#fn10&#34; class=&#34;footnote-ref&#34; id=&#34;fnref10&#34;&gt;&lt;sup&gt;10&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The clever thing about &lt;strong&gt;backpropagation&lt;/strong&gt; is that weight updates are scaled by the activations feeding into them. That is, if the error is large, weights attached to large activations will change more than weights attached to small activations, as they “contribute” more to the prediction than the smaller ones. As the name implies, the errors are &lt;em&gt;propagated back&lt;/em&gt; into the network (what is known as the &lt;em&gt;backward pass&lt;/em&gt;). Calculating &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; for the hidden layer is the first step in this process.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
&lt;span style=&#34;font-size:1.6rem; font-weight:600&#34;&gt; Exercise &lt;/span&gt;
&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;Change the value of the &lt;code&gt;output_node_activation&lt;/code&gt; and see how the delta changes. What do you expect happens as it gets closer to the label?&lt;/li&gt;
&lt;li&gt;What do you think the derivative of the ReLU function is? Plot the ReLu function, implement its derivative function, and try using it instead of the sigmoid.&lt;/li&gt;
&lt;li&gt;Use the value for the &lt;code&gt;output_node_activation&lt;/code&gt; that you calculated in the first exercise (the forward pass).&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;p&gt;Now, to distribute blame for the prediction.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;backpropagation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Backpropagation&lt;/h2&gt;
&lt;p&gt;As mentioned, backpropagation works a bit differently depending on whether you’re calculating weights coming in to an output or a hidden node. For weights coming in to the output nodes, the weight changes are proportional to the learning rate &lt;span class=&#34;math inline&#34;&gt;\(\alpha\)&lt;/span&gt;, the activation of the predecessor node, and the &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; we just calculated. To continue with our example from Figure 1, let’s calculate the weight change for the weight going from the top hidden node &lt;span class=&#34;math inline&#34;&gt;\(h_1\)&lt;/span&gt; to the output &lt;span class=&#34;math inline&#34;&gt;\(\hat{y}_1\)&lt;/span&gt;.&lt;a href=&#34;#fn11&#34; class=&#34;footnote-ref&#34; id=&#34;fnref11&#34;&gt;&lt;sup&gt;11&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Assume a random weight
# from h1 to yhat (calling yhat &amp;#39;o&amp;#39; for output)
weight_h1_o &amp;lt;- c(0.8)


# To calculate the weight change we need:
# 1) the activation of the input unit
#   which in this case is h1
# 2) the delta of the output unit
# 3) a learning rate (alpha)

# Learning rates differ a lot
# but let&amp;#39;s go with 0.01
alpha &amp;lt;- 0.01

# plugging numbers into the equation
(weight_change &amp;lt;- alpha * delta_o * h1_activation) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.001787254&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# update weight
(weight_h1_o &amp;lt;- weight_h1_o + weight_change)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.8017873&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span id=&#34;fig:unnamed-chunk-6&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;nn_imgs/nn_3.png&#34; alt=&#34;Figure 5: Weight updating&#34;  /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 5: Weight updating
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Pretty easy, right? To calculate the weight change for weights going into hidden units, we simply multiply the learning rate &lt;span class=&#34;math inline&#34;&gt;\(\alpha\)&lt;/span&gt; with the &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; (the derivative of the loss function with respect to the weights) and the activation at the preceding node.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\text{weight change} = \alpha \cdot \delta \cdot activation\]&lt;/span&gt;&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
&lt;span style=&#34;font-size:1.6rem; font-weight:600&#34;&gt; Exercise &lt;/span&gt;
&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;How does the learning rate impact the weight change?&lt;/li&gt;
&lt;li&gt;Calculate the weight changes for all the weights going from the hidden nodes to the output node. Use the weights &lt;code&gt;c(0.8, -1.2, 0.5, -0.3)&lt;/code&gt; for &lt;span class=&#34;math inline&#34;&gt;\(h_1, h_2, h_3, h_4\)&lt;/span&gt; respectively.&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;p&gt;We’re getting &lt;em&gt;really&lt;/em&gt; close to having trained our little NN on a single example. All that’s left is calculating the weight changes for the weights from the input nodes to the hidden nodes. Fortunately, the procedure is largely the same as for the output units.&lt;/p&gt;
&lt;p&gt;First, we calculate &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; for the specific weight and node again. However, since we are now in the hidden layer, we don’t have access to the true label anymore. Instead, we use the &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; from the succeeding layer (the output node) multiplied by the weight from our hidden node to the output!&lt;/p&gt;
&lt;p&gt;To calculate &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; for the hidden node, &lt;span class=&#34;math inline&#34;&gt;\(\delta_{h_1}\)&lt;/span&gt;, we take the &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; of each node in the succeeding layer multiplied by the weight from &lt;span class=&#34;math inline&#34;&gt;\(h_1\)&lt;/span&gt; to that node, sum these terms, and multiply by the derivative of the activation of &lt;span class=&#34;math inline&#34;&gt;\(h_1\)&lt;/span&gt;. Since our network only has a single output node, the sum reduces to one term: &lt;code&gt;delta_o * weight_h1_o&lt;/code&gt;. Let’s see an example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Using the weight we calculated before
(delta_o_times_weight_h1 &amp;lt;- delta_o * weight_h1_o)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.1539432&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(delta_h1 &amp;lt;- delta_o_times_weight_h1 * 
  sigmoid_derivative(h1_activation))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.009907519&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span id=&#34;fig:unnamed-chunk-8&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;nn_imgs/nn_4.png&#34; alt=&#34;Figure 6: Calculation of $\delta_{h_1}$&#34;  /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 6: Calculation of &lt;span class=&#34;math inline&#34;&gt;\(\delta_{h_1}\)&lt;/span&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Notice that the value of &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; is substantially smaller than what it was at the output nodes. This means that the weight changes from the input nodes to the hidden nodes will be even smaller. Deep networks can run into the problem of &lt;em&gt;vanishing gradients&lt;/em&gt;, i.e. &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; becomes so small that weight changes are negligible. ReLU is far more robust to the problem of vanishing gradients than the sigmoid function, which is one of the reasons for its success.&lt;/p&gt;
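&lt;p&gt;To get a feel for why this happens, recall that the sigmoid derivative &lt;span class=&#34;math inline&#34;&gt;\(a(1-a)\)&lt;/span&gt; is at most 0.25. A rough back-of-the-envelope calculation shows how quickly repeated multiplication by such factors shrinks &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; as it travels back through many layers:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# the sigmoid derivative a * (1 - a) peaks at 0.25 (when a = 0.5);
# multiplying ten such factors together already gives a tiny number
0.25^10  # roughly 9.5e-07&lt;/code&gt;&lt;/pre&gt;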
&lt;p&gt;Alright, we’re getting &lt;em&gt;reeally&lt;/em&gt; close now. The last step to update the weights coming in to the hidden nodes is exactly the same as for the weights coming in to the output node: &lt;span class=&#34;math inline&#34;&gt;\(\alpha \cdot \delta \cdot activation\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We already defined our 3 input nodes and their weights going to &lt;span class=&#34;math inline&#34;&gt;\(h_1\)&lt;/span&gt;. Let’s update them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(weight_change &amp;lt;- alpha * delta_h1 * input_nodes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 9.907519e-05 2.972256e-04 4.953760e-04&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(weights_i_h1 &amp;lt;- weights_i_h1 + weight_change)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]  0.5000991  1.2002972 -0.2995046&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And we’re done! Rinse and repeat for the rest of the hidden nodes and you just trained one step of a neural network! For a standard feedforward neural network, this is all that is going on - just repeated a lot of times.&lt;a href=&#34;#fn12&#34; class=&#34;footnote-ref&#34; id=&#34;fnref12&#34;&gt;&lt;sup&gt;12&lt;/sup&gt;&lt;/a&gt; When this process has been conducted for each training example, the neural network has gone through one &lt;em&gt;epoch&lt;/em&gt;. You usually stop training after a certain number of epochs, or once you reach a stopping criterion.&lt;a href=&#34;#fn13&#34; class=&#34;footnote-ref&#34; id=&#34;fnref13&#34;&gt;&lt;sup&gt;13&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Doesn’t seem so magical anymore, right?&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
&lt;span style=&#34;font-size:1.6rem; font-weight:600&#34;&gt; Exercise &lt;/span&gt;
&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;Calculate the weight changes for all the weights going from the input nodes to the hidden nodes. You decide what the remaining weights should be.&lt;/li&gt;
&lt;li&gt;Congratulations! You have now created and trained one step of a simple neural network! All that’s left is looping over this for all the training examples, and repeat until the network converges. Go have a look through some of the code inspiration in the References and further reading section to get a sense of how this can be done.&lt;/li&gt;
&lt;li&gt;I encourage you to take a stab at implementing your own neural network (I suggest 1 or 2 hidden layers) that can take an arbitrary number of input/hidden/output layers. Feel free to follow either this or Nielsen’s way of going about it.&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;p&gt;To sum up, here are the steps (a compact code sketch follows the list):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Initialize the network&lt;/strong&gt;:
Randomly initialize all the weights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Forward pass&lt;/strong&gt;:
Pass an input to the neural network and propagate the values forward. To calculate activation of the nodes, take the weighted sum of their
input and use an activation function such as the sigmoid or ReLU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backward pass&lt;/strong&gt;:
Calculate the loss for your current training example. Calculate &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; for the output node(s) and update the weights coming in to the output node(s) by multiplying &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; with the learning rate and the activation of the hidden node feeding into the output node. Continue this process by propagating &lt;span class=&#34;math inline&#34;&gt;\(\delta\)&lt;/span&gt; back into the hidden layers and continually updating the weights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repeat&lt;/strong&gt;:
Repeat for a specific number of &lt;em&gt;epochs&lt;/em&gt; or until some stopping criterion is reached.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
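&lt;p&gt;Putting these steps together for the simplest possible case, a single sigmoid output node and no hidden layer, the whole loop could look like the sketch below. The data, labels, learning rate and epoch count are made up for illustration; the sketch follows the update rule from this document and is not a reference implementation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sigmoid &amp;lt;- function(x) 1 / (1 + exp(-x))

# made-up data: 3 training examples with 3 inputs each
X &amp;lt;- matrix(c(1, 3, 5,
               2, 0, 1,
               4, 4, 0), nrow = 3, byrow = TRUE)
y &amp;lt;- c(1, 0, 1)  # made-up labels

weights &amp;lt;- runif(3, -1, 1)  # 1. initialize the network
alpha   &amp;lt;- 0.1

for (epoch in 1:100) {                          # 4. repeat for some epochs
  for (i in 1:nrow(X)) {
    a &amp;lt;- sigmoid(sum(X[i, ] * weights))         # 2. forward pass
    delta &amp;lt;- 2 * (y[i] - a) * a * (1 - a)       # 3. backward pass: delta
    weights &amp;lt;- weights + alpha * delta * X[i, ] #    weight update
  }
}
round(weights, 3)&lt;/code&gt;&lt;/pre&gt;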
&lt;/div&gt;
&lt;div id=&#34;references-and-further-reading&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References and further reading&lt;/h1&gt;
&lt;div id=&#34;code-inspiration&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Code inspiration&lt;/h2&gt;
&lt;p&gt;I took an elective in Neural Networks a couple of years ago, where part of the exam was to implement a NN from scratch. You can see my code &lt;a href=&#34;https://github.com/HLasse/COSC420/blob/847869534ba9fab809f2e09b363490a8d837bd37/MLP.py&#34;&gt;here&lt;/a&gt;. It’s implemented in much the same style as this document, i.e. no linear algebra, but lots of for loops. After working through the exercises in here, it will likely seem quite straightforward to you.&lt;/p&gt;
&lt;p&gt;Kenneth wrote an implementation based on Nielsen’s to do classification on the MNIST digits dataset. You can find it on Blackboard.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;online-courses&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Online Courses&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://course.fast.ai&#34;&gt;Fast.ai: Practical Deep Learning for Coders&lt;/a&gt;
- A great and comprehensive course on neural networks. You will learn to implement them from scratch using pytorch and pick up tons of useful knowledge along the way. In particular, check out the part on SGD in &lt;a href=&#34;https://colab.research.google.com/github/fastai/fastbook/blob/master/04_mnist_basics.ipynb#scrollTo=GlKtKAI_VXT4&#34;&gt;chapter 4&lt;/a&gt; for a great introduction.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.coursera.org/specializations/deep-learning&#34;&gt;Deep Learning Specialization&lt;/a&gt;
- A true classic, updated spring 2021 with Transformer models and other goodies.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;books&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Books&lt;/h2&gt;
&lt;p&gt;Kriesel, D. A Brief Introduction to Neural Networks. &lt;a href=&#34;http://www.dkriesel.com/en/science/neural_networks&#34; class=&#34;uri&#34;&gt;http://www.dkriesel.com/en/science/neural_networks&lt;/a&gt;.
- A quite nice book on the fundamentals of neural networks. A bit old by now, but the foundations are the same.&lt;/p&gt;
&lt;p&gt;Nielsen, M. Neural Networks and Deep Learning. &lt;a href=&#34;http://neuralnetworksanddeeplearning.com&#34; class=&#34;uri&#34;&gt;http://neuralnetworksanddeeplearning.com&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes footnotes-end-of-document&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Depending on activation function&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;Along with a whole bunch of other optimization.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;Not always the case, some of the more advanced NNs require sophisticated initialization schemes.&lt;a href=&#34;#fnref3&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn4&#34;&gt;&lt;p&gt;This is crucial and what makes NN &lt;em&gt;universal function approximators&lt;/em&gt;.&lt;a href=&#34;#fnref4&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn5&#34;&gt;&lt;p&gt;Different nodes can have different activation functions. For instance, input nodes always use the &lt;em&gt;identity&lt;/em&gt; function (i.e., no transformation is done). Hidden nodes are usually some variant of ReLU. The activation function of the output node(s) depends on the task. Doing regression? The identity function would make sense. Doing binary classification? A sigmoid would be a good choice (as the values are squished to the range 0-1).&lt;a href=&#34;#fnref5&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn6&#34;&gt;&lt;p&gt;Figure taken from &lt;a href=&#34;https://towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6&lt;/a&gt;&lt;a href=&#34;#fnref6&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn7&#34;&gt;&lt;p&gt;This is a simplification, as in reality the weight space is highly multidimensional. To develop an intuition however, this is a useful way to look at it.&lt;a href=&#34;#fnref7&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn8&#34;&gt;&lt;p&gt;You would also use 1 output node for regression, but use e.g. the identity function as activation function for the output node instead. Normally, you would use a different loss function for binary classification, but let’s stick to SSE for simplicity.&lt;a href=&#34;#fnref8&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn9&#34;&gt;&lt;p&gt;I skipped a lot of math here. You don’t need to understand exactly how this is derived to understand neural networks, but feel free to read up on it if so inclined.&lt;a href=&#34;#fnref9&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn10&#34;&gt;&lt;p&gt;See Figure 3.&lt;a href=&#34;#fnref10&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn11&#34;&gt;&lt;p&gt;Learning rates are a pretty big deal. Many different optimizers exist (you’ve probably heard of ADAM or RMSProp), which use clever ways of adapting the learning rate (among other things).&lt;a href=&#34;#fnref11&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn12&#34;&gt;&lt;p&gt;This has gone through the most basic/original formulation of a neural network. Of course, many tricks have been added, such as &lt;em&gt;momentum&lt;/em&gt; (weight changes are also proportional to the magnitude of the previous weight change), and perhaps more importantly &lt;strong&gt;stochastic gradient descent&lt;/strong&gt;. In essence, stochastic gradient descent simply works in batches of multiple inputs instead of a single one. That is, weight changes are not calculated for every single training row, but are instead accumulated over e.g. 48 rows before the weights are changed. See the Further Reading section for more.&lt;a href=&#34;#fnref12&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn13&#34;&gt;&lt;p&gt;Could be a certain validation error threshold or once the network &lt;em&gt;converges&lt;/em&gt;, i.e. reaches a stable state.&lt;a href=&#34;#fnref13&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>TextDescriptives</title>
      <link>https://lassehansen.me/project/textdescriptives/</link>
      <pubDate>Mon, 01 Feb 2021 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/project/textdescriptives/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Predicting Presence and Severity of Depression from Speech Using Emotional Transfer Learning</title>
      <link>https://lassehansen.me/talk/ds_hour/</link>
      <pubDate>Wed, 16 Dec 2020 14:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/talk/ds_hour/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click on the &lt;strong&gt;Slides&lt;/strong&gt; button above to view the slides.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;A presentation of my internship project at Hoffmann-La Roche on using transfer learning from emotional speech to detect depression. We trained a Mixture of Experts consisting of gradient-boosted decision tree classifiers to classify happiness and sadness in datasets of acted emotional speech in English and German. The model was applied to a dataset of interviews with Danish-speaking patients with first-episode depression and matched healthy controls. We observed significant separation between the two groups, and found patients in remission to speak similarly to controls. Further, we conducted experiments on the effect of removing background noise and speaker diarization, which showed consistent levels of background noise to be crucial for consistent inferences.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>gglearn2</title>
      <link>https://lassehansen.me/project/gglearn2/</link>
      <pubDate>Mon, 01 Jun 2020 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/project/gglearn2/</guid>
      <description></description>
    </item>
    
    <item>
      <title>HOPE Dashboard</title>
      <link>https://lassehansen.me/project/hope/</link>
      <pubDate>Wed, 01 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/project/hope/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Garmin Run Stats</title>
      <link>https://lassehansen.me/project/garmin/</link>
      <pubDate>Wed, 01 Jan 2020 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/project/garmin/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Slides</title>
      <link>https://lassehansen.me/slides/example/</link>
      <pubDate>Tue, 05 Feb 2019 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/slides/example/</guid>
      <description>&lt;h1 id=&#34;create-slides-in-markdown-with-academic&#34;&gt;Create slides in Markdown with Academic&lt;/h1&gt;
&lt;p&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Academic&lt;/a&gt; | 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#create-slides&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Documentation&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;features&#34;&gt;Features&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Efficiently write slides in Markdown&lt;/li&gt;
&lt;li&gt;3-in-1: Create, Present, and Publish your slides&lt;/li&gt;
&lt;li&gt;Supports speaker notes&lt;/li&gt;
&lt;li&gt;Mobile friendly slides&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;controls&#34;&gt;Controls&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Next: &lt;code&gt;Right Arrow&lt;/code&gt; or &lt;code&gt;Space&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Previous: &lt;code&gt;Left Arrow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Start: &lt;code&gt;Home&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Finish: &lt;code&gt;End&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Overview: &lt;code&gt;Esc&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Speaker notes: &lt;code&gt;S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Fullscreen: &lt;code&gt;F&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Zoom: &lt;code&gt;Alt + Click&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://github.com/hakimel/reveal.js#pdf-export&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;PDF Export&lt;/a&gt;: &lt;code&gt;E&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;code-highlighting&#34;&gt;Code Highlighting&lt;/h2&gt;
&lt;p&gt;Inline code: &lt;code&gt;variable&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Code block:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;porridge = &amp;quot;blueberry&amp;quot;
if porridge == &amp;quot;blueberry&amp;quot;:
    print(&amp;quot;Eating...&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2 id=&#34;math&#34;&gt;Math&lt;/h2&gt;
&lt;p&gt;In-line math: $x + y = z$&lt;/p&gt;
&lt;p&gt;Block math:&lt;/p&gt;
&lt;p&gt;$$
f\left( x \right) = \frac{{2\left( {x + 4} \right)\left( {x - 4} \right)}}{{\left( {x + 4} \right)\left( {x + 1} \right)}}
$$&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;fragments&#34;&gt;Fragments&lt;/h2&gt;
&lt;p&gt;Make content appear incrementally&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{{% fragment %}} One {{% /fragment %}}
{{% fragment %}} **Two** {{% /fragment %}}
{{% fragment %}} Three {{% /fragment %}}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Press &lt;code&gt;Space&lt;/code&gt; to play!&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
One
&lt;/span&gt;
&lt;span class=&#34;fragment &#34; &gt;
&lt;strong&gt;Two&lt;/strong&gt;
&lt;/span&gt;
&lt;span class=&#34;fragment &#34; &gt;
Three
&lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;A fragment can accept two optional parameters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;class&lt;/code&gt;: use a custom style (requires definition in custom CSS)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;weight&lt;/code&gt;: sets the order in which a fragment appears&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;speaker-notes&#34;&gt;Speaker Notes&lt;/h2&gt;
&lt;p&gt;Add speaker notes to your presentation&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{% speaker_note %}}
- Only the speaker can read these notes
- Press `S` key to view
{{% /speaker_note %}}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Press the &lt;code&gt;S&lt;/code&gt; key to view the speaker notes!&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Only the speaker can read these notes&lt;/li&gt;
&lt;li&gt;Press &lt;code&gt;S&lt;/code&gt; key to view&lt;/li&gt;
&lt;/ul&gt;

&lt;/aside&gt;
&lt;hr&gt;
&lt;h2 id=&#34;themes&#34;&gt;Themes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;black: Black background, white text, blue links (default)&lt;/li&gt;
&lt;li&gt;white: White background, black text, blue links&lt;/li&gt;
&lt;li&gt;league: Gray background, white text, blue links&lt;/li&gt;
&lt;li&gt;beige: Beige background, dark text, brown links&lt;/li&gt;
&lt;li&gt;sky: Blue background, thin dark text, blue links&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;night: Black background, thick white text, orange links&lt;/li&gt;
&lt;li&gt;serif: Cappuccino background, gray text, brown links&lt;/li&gt;
&lt;li&gt;simple: White background, black text, blue links&lt;/li&gt;
&lt;li&gt;solarized: Cream-colored background, dark green text, blue links&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/img/boards.jpg&#34;
  &gt;

&lt;h2 id=&#34;custom-slide&#34;&gt;Custom Slide&lt;/h2&gt;
&lt;p&gt;Customize the slide style and background&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{&amp;lt; slide background-image=&amp;quot;/img/boards.jpg&amp;quot; &amp;gt;}}
{{&amp;lt; slide background-color=&amp;quot;#0000FF&amp;quot; &amp;gt;}}
{{&amp;lt; slide class=&amp;quot;my-style&amp;quot; &amp;gt;}}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2 id=&#34;custom-css-example&#34;&gt;Custom CSS Example&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s make headers navy colored.&lt;/p&gt;
&lt;p&gt;Create &lt;code&gt;assets/css/reveal_custom.css&lt;/code&gt; with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-css&#34;&gt;.reveal section h1,
.reveal section h2,
.reveal section h3 {
  color: navy;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h1 id=&#34;questions&#34;&gt;Questions?&lt;/h1&gt;
&lt;p&gt;
&lt;a href=&#34;https://spectrum.chat/academic&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Ask&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#create-slides&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Documentation&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Analysing Political Biases in Danish Newspapers Using Sentiment Analysis</title>
      <link>https://lassehansen.me/publication/political_biases/</link>
      <pubDate>Fri, 07 Jul 2017 00:00:00 +0000</pubDate>
      <guid>https://lassehansen.me/publication/political_biases/</guid>
      <description></description>
    </item>
    
  </channel>
</rss>
