<![CDATA[kyrcha.info]]>http://kyrcha.infoGatsbyJSTue, 21 Dec 2021 19:26:46 GMT<![CDATA[2019 in review]]>http://kyrcha.info/2020/03/23/2019-in-reviewhttp://kyrcha.info2020/03/23/2019-in-reviewMon, 23 Mar 2020 16:26:00 GMT<p>Finally! I had this post in draft mode for about three months, but better late than never. This post also signifies the start of blogging in 2020!</p> <p>2019 is (well) over and through this post I will try to summarize the important aspects of my life that can be quantified and reflect on what went well, what didn&#39;t and what I&#39;ve learned this year, the <a href="https://jamesclear.com/2018-annual-review">James Clear way</a>. As in <a href="http://kyrcha.info/2019/01/23/2018-in-review">my 2018 review</a>, the first part is pretty quantitative, while the second is more qualitative.</p> <p>In 2019, after failing a couple of tenure applications, I switched from academia to industry and also left Cyclopt, the spin-off company I co-founded. In fact I went through two industry positions back-to-back, because the first company I signed up for (a mature start-up) ran out of funding just one month after I started working. In my job-hunting adventure, which started early in the summer of 2019, I applied to around 80 positions to get a couple of offers. I will leave the details and the stats for another post.</p> <p>In general, I don&#39;t consider 2019 a good or lucky year for me: <a href="http://kyrcha.info/2019/12/31/thanks-dad">my father passed away</a> and I had to look for a job one month after starting one. But as Nietzsche said: <em>&quot;What does not kill me makes me stronger&quot;</em>.</p> <h2 id="quantitative">Quantitative</h2> <h3 id="books">Books</h3> <p>Even though I started a lot of books (both non-fiction and tech ones), I only finished:</p> <ul> <li><a href="https://amzn.to/2GlbtWn">Atomic Habits</a> by <a href="https://jamesclear.com/">James Clear</a>. 
Has all the theory, the systems and the processes you need to start new (better) habits, but as with many other personal improvement guides the problem I have is with execution. I will definitely re-read it, since I read it during my summer vacations and that is not a good time to start building new habits.</li> </ul> <h3 id="launches">&quot;Launches&quot;</h3> <p>I launched a couple of websites and apps this year, either alone or with teams:</p> <ul> <li><a href="https://github.com/AuthEceSoftEng/cenote">Cenote</a></li> <li>Cyclopt bot (from my startup, now down and discontinued)</li> <li><a href="http://se-ml-interviews.kyrcha.info/">The interview questions knowledge base</a> in <a href="http://raneto.com/">Raneto</a></li> <li><a href="http://kyrcha.info/2019/05/22/launching-the-new-kyrcha-info-using-gatsby-bulma-contentful-and-surge">This website</a>: kyrcha.info</li> <li><a href="http://npm-miner.com/">npm-miner</a>, an infrastructure that performs static code analysis of the npm registry</li> </ul> <p>and some open source software:</p> <ul> <li><a href="https://github.com/kyrcha/gh-downloader">gh-downloader</a> for downloading files from GitHub given search criteria</li> <li><a href="https://github.com/kyrcha/character-position">character-position</a>, a VS Code extension for revealing the current character position</li> <li><a href="https://github.com/figify/eslint-config">eslint-config</a>, a shareable config for web application development using Node.js and React</li> </ul> <h3 id="talks">Talks</h3> <p>I gave two talks:</p> <ul> <li>My talk at ECEIG: <a href="http://kyrcha.info/2019/05/16/simple-rules-for-building-robust-machine-learning-models">Simple rules for building robust machine learning models</a></li> <li>My talk at ECESCON: <a href="http://kyrcha.info/2019/04/23/advices-and-strategies-i-learned-from-my-first-business-attempt">Advices and strategies I learned from my first business attempt</a></li> </ul> <h3 id="courses">Courses</h3> <p>I taught two 
courses at the university:</p> <ul> <li>Big Data Analysis to graduate students</li> <li>Software Engineering to undergraduates</li> </ul> <h3 id="research-proposals">Research Proposals</h3> <p>I submitted one proposal for research funding as a principal investigator to <a href="http://www.elidek.gr/en/homepage/">ELIDEK</a> (early in 2020 I learned that it was unsuccessful).</p> <h3 id="blog-posts">Blog posts</h3> <p>I wrote 12 blog posts, much, much better than in previous years:</p> <ol> <li><a href="http://kyrcha.info/2019/01/23/2018-in-review">2018 in review</a></li> <li><a href="http://kyrcha.info/2019/01/29/make-your-environment-variables-more-robust-by-making-them-more-fragile">Making env vars more robust by making them more fragile</a></li> <li><a href="http://kyrcha.info/2019/03/22/on-collinearity-and-feature-selection/">On collinearity and feature selection</a></li> <li><a href="http://kyrcha.info/2019/04/05/calculating-the-running-average-and-variance-of-streaming-data-using-redis">Calculating the running average and variance of streaming data using Redis</a></li> <li><a href="http://kyrcha.info/2019/04/23/advices-and-strategies-i-learned-from-my-first-business-attempt">Advices and strategies I learned from my first business attempt</a></li> <li><a href="http://kyrcha.info/2019/05/16/simple-rules-for-building-robust-machine-learning-models">Simple rules for building robust machine learning models</a></li> <li><a href="http://kyrcha.info/2019/05/22/launching-the-new-kyrcha-info-using-gatsby-bulma-contentful-and-surge">Launching the new kyrcha.info using Gatsby, Bulma, Contentful and Surge</a></li> <li><a href="http://kyrcha.info/2019/10/15/sending-graphql-queries-using-http-client-in-go/">Sending GraphQL queries using http.Client in Go</a></li> <li><a href="http://kyrcha.info/2019/10/25/fitting-modified-gompertz-baranyi-equations-bacterial-growth-r/">Fitting modified Gompertz and Baranyi equations for bacterial growth in R</a></li> <li><a 
href="http://kyrcha.info/2019/11/07/generating-plausible-paper-titles-with-recurrent-neural-networks">Generating plausible paper titles with Recurrent Neural Networks</a></li> <li><a href="http://kyrcha.info/2019/11/07/what-is-a-startup-mastermind-group">What is a (startup) mastermind group?</a></li> <li><a href="http://kyrcha.info/2019/11/26/data-outlier-detection-using-the-chebyshev-theorem-paper-review-and-online-adaptation">Data Outlier Detection using the Chebyshev Theorem - Paper review and online adaptation</a></li> </ol> <p>The website had 2,705 users visiting vs. 5,189 in 2018. I am not sure about the drop. It probably also had to do with the switch in technologies from Wordpress (with SEO plugins etc.) to GatsbyJS.</p> <h3 id="publications">Publications</h3> <p>Published 4 papers out of 8 submissions (50%). The <a href="http://kyrcha.info/publications">published papers</a> were:</p> <ul> <li><em>&quot;npm-miner: An Infrastructure for Measuring the Quality of the npm Registry&quot;</em> in MSR 2018</li> <li><em>&quot;Predicting hyperparameters from meta-features in binary classification problems&quot;</em> in AutoML 2018</li> <li><em>&quot;A Natural Language Driven Approach for Automated Web API Development&quot;</em> in WS-REST 2018</li> <li>and <em>&quot;Deep Reinforcement Learning for Doom using Unsupervised Auxiliary Tasks&quot;</em> on arXiv</li> </ul> <h3 id="competitions">Competitions</h3> <ul> <li>Worked hard on the <a href="https://www.kaggle.com/c/humpback-whale-identification">Kaggle Humpback Whale Identification Competition</a> and learned new stuff, even though the approach did not generalize well. It was the first image recognition pipeline I ever wrote, so I learned a lot.</li> <li>Worked on the <a href="https://github.com/KTH/codrep-2019">CodRep 2019</a> competition, achieving 2nd place. 
The write-up of the competition can be found <a href="https://gist.github.com/kyrcha/4d4ebf960051ccfc8764d1f0f7ca6a05">here</a>.</li> </ul> <h2 id="qualitative">Qualitative</h2> <h3 id="things-that-went-well">Things that went well</h3> <ul> <li>Output in terms of blog posts, papers submitted, competitions participated in, software produced and launched, job applications submitted, and interviews conducted.</li> </ul> <h3 id="things-that-didnt-go-well">Things that didn&#39;t go well</h3> <ul> <li>Unfortunately we didn&#39;t get any VC funding in 2019 for Cyclopt, after applying to and following the processes of 3 VCs, so I decided to leave the start-up and focus on other aspects of my career.</li> <li>A recurring theme: my weight, and in general the fact that I didn&#39;t lift weights as much as I wanted (less than 2 times per week).</li> </ul> <h2 id="what-ive-learned">What I&#39;ve learned</h2> <p>The revelation, after re-reading <a href="https://amzn.to/2xmeRPv">&quot;the subtle art of not giving a f*ck&quot;</a>, that <strong>to be happy, solve problems you enjoy solving</strong>. Life is suffering. You will suffer. So at least for the problems you can pick, pick the ones you enjoy solving. I also created a slide on this process for one of my talks:</p> <p><img src="//images.ctfassets.net/c5lel8y1n83c/23DKGNgQiUbrxDwregf4kZ/d40f6d0154e994fd1e21e5464fc0b56c/Screenshot_2020-03-23_18.00.15.png" alt="General advice"></p> <p><strong>If you want to pursue a career in Academia</strong>:</p> <ol> <li>Don&#39;t do work you would do in a software company. 
Take on projects that will have research outcomes and not just build applications.</li> <li>Follow the publish or perish rule.</li> </ol> <p><strong>I learned about <a href="https://twitter.com/naval">Naval</a> Ravikant</strong> and devoured a lot of <a href="https://theangelphilosopher.com/">content</a> from him, especially the viral tweetstorm <a href="https://twitter.com/naval/status/1002103360646823936">&quot;How to Get Rich (without getting lucky)&quot;</a> and its <a href="https://podcasts.apple.com/us/podcast/how-to-get-rich-every-episode/id1454097755?i=1000440401437">related podcast</a>.</p> <p>I don&#39;t remember where I picked this up, but <strong>when you press submit to deploy an application you shouldn&#39;t pack your suitcases just yet</strong>.</p> <h2 id="2020">2020</h2> <p>No goals for 2020. I am setting up processes and systems, building <a href="https://amzn.to/2QGrFax">atomic habits</a>, <a href="https://seths.blog/2019/12/only-the-hits/">shipping early and often</a>.</p> <h2 id="previous-reviews">Previous reviews</h2> <ul> <li><a href="http://kyrcha.info/2019/01/23/2018-in-review">2018</a></li> </ul> <p>Photo from <a href="https://pixabay.com/el/users/mohamed_hassan-5229782/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=3623461">mohamed Hassan</a> from <a href="https://pixabay.com/el/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=3623461">Pixabay</a></p> <![CDATA[Thanks dad]]>http://kyrcha.info//2019/12/31/thanks-dadhttp://kyrcha.info/2019/12/31/thanks-dadTue, 31 Dec 2019 16:00:00 GMT<p>In memory of Christodoulos K. Chatzidimitriou (1944-2019). 
Thanks for everything dad!</p> <p><img src="//images.ctfassets.net/c5lel8y1n83c/Ui40oEHNkUNLJau5hZ3Ru/9b1826f114ed6e6fb92f4af9feea6628/Dad.jpg" alt="Dad"></p> <![CDATA[Data Outlier Detection using the Chebyshev Theorem - Paper review and online adaptation]]>http://kyrcha.info/2019/11/26/data-outlier-detection-using-the-chebyshev-theorem-paper-review-and-online-adaptationhttp://kyrcha.info2019/11/26/data-outlier-detection-using-the-chebyshev-theorem-paper-review-and-online-adaptationTue, 26 Nov 2019 09:58:00 GMT<p>This is the first paper review, in a series of paper reviews and implementations I would like to do, on <em>online (or streaming) outlier detection algorithms</em>. I believe that with the advent of the Internet of Things, and all the sensor data that will be produced and consumed, this will be an important subject to research and study. All the knowledge from this series will be gathered in the <a href="https://github.com/kyrcha/awesome-streaming-outlier-detection">awesome-streaming-outlier-detection</a> repository.</p> <p>In this first post in the series I am going to present the 2005 paper: <a href="https://www.researchgate.net/publication/224624985_Data_outlier_detection_using_the_Chebyshev_theorem">Data Outlier Detection using the Chebyshev Theorem</a> by Brett G. Amidan, Thomas A. Ferryman, and Scott K. Cooley.</p> <p>The paper uses the <a href="https://en.wikipedia.org/wiki/Chebyshev%27s_inequality">Chebyshev inequality</a> in order to calculate upper and lower outlier detection limits. These thresholds bound the percentage of data that fall outside <em>k</em> standard deviations from the mean, while at the same time the calculations make no assumptions about the distribution of the data. This is important, as often the distribution is not known and we don&#39;t want to make any assumption about it. 
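</p> <p>As a quick sanity check (mine, not from the paper), the distribution-free guarantee that at least $1 - \frac{1}{k^2}$ of the data falls within <em>k</em> standard deviations of the mean can be verified numerically on a deliberately non-Gaussian sample:</p>

```python
import numpy as np

# Hypothetical sanity check, not part of the paper: draw a skewed,
# clearly non-Gaussian (exponential) sample and check Chebyshev's
# bound for k = 2, i.e. at least 75% within 2 standard deviations.
rng = np.random.default_rng(42)
x = rng.exponential(scale=1.0, size=100_000)

k = 2
within = np.mean(np.abs(x - x.mean()) <= k * x.std())
print(within)  # for this sample around 0.95, comfortably above the 0.75 bound
```

<p>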
The only assumptions the method makes are that the data are independent measurements and that the data contain only a small percentage of outliers. </p> <p>With an unknown distribution, the Chebyshev inequality is:</p> <p>$$P(|X - \mu| \leq k \sigma) \geq (1 - \frac{1}{k^2})$$ <strong>(1)</strong></p> <p>and indicates that if <em>k=2</em> at least <em>75%</em> of the data would fall within <em>2</em> standard deviations from the mean (lower bound). The equation above can also be rearranged as:</p> <p>$$P(|X - \mu| \geq k \sigma) \leq \frac{1}{k^2}$$ <strong>(2)</strong></p> <p>to indicate that at most <em>25%</em> of the data is outside <em>2</em> standard deviations from the mean (upper bound).</p> <p>There is also a special case: if we assume that the data is unimodal (data with only one peak, which can be examined for example by plotting the data), we can use the unimodal Chebyshev inequality. But since we are talking about streaming data that arrive one after the other, I will assume that nothing is known in advance and continue with the standard case.</p> <p>From Chebyshev&#39;s inequality an Outlier Detection Value (ODV) is calculated. Any data value that is more extreme than the ODV is considered to be an outlier.</p> <h2 id="the-algorithm">The algorithm</h2> <p>The algorithm follows a two-stage process.</p> <h3 id="stage-1">Stage 1</h3> <p>The first stage is responsible for trimming the data from values that are possibly outliers.</p> <ol> <li><p>We decide on a value of $p_1$, which can be considered as the expected probability of seeing an outlier. We can use values like 0.1, 0.05 or even 0.01.</p> </li> <li><p>Solving equation <strong>(2)</strong> for <em>k</em> we have equation <strong>(3)</strong>. Anything more extreme than <em>k</em> standard deviations is considered a stage-1 outlier. 
So if $p_1=0.05$ then $k=4.472$ and thus everything more extreme than 4.472 standard deviations will be considered a stage-1 outlier.</p> </li> </ol> <p>$$k=\frac{1}{\sqrt{p_1}}$$ <strong>(3)</strong></p> <ol start="3"> <li>Then we calculate ODVs (upper and lower bounds) for stage-1, where $\mu$ and $\sigma$ are the sample mean and the sample standard deviation derived from the data:</li> </ol> <p>$$ ODV_{1U} = \mu + k \sigma$$ <strong>(4)</strong></p> <p>$$ ODV_{1L} = \mu - k \sigma$$ <strong>(5)</strong></p> <p>Data that are more extreme than the ODVs of stage-1 are removed from the data for the second phase of the algorithm. The truncated dataset (i.e. without the outliers) is used to calculate the mean and standard deviation needed for Chebyshev&#39;s inequality.</p> <h3 id="stage-2">Stage 2</h3> <p>The second stage derives the final ODVs.</p> <ol> <li>Select a value for $p_2$, the expected probability of seeing an outlier; it is usually smaller than $p_1$ and is used to actually determine the outliers. Reasonable values are 0.01, 0.001 or 0.0001.</li> <li>Solve equation <strong>(2)</strong> for <em>k</em> and get equation <strong>(6)</strong>. 
</li> </ol> <p>$$k=\frac{1}{\sqrt{p_2}}$$ <strong>(6)</strong></p> <ol start="3"> <li><p>Calculate stage-2 ODVs using equations <strong>(4)</strong> and <strong>(5)</strong>, where $\mu$ and $\sigma$ are the sample mean and the sample standard deviation derived from the <strong>truncated</strong> data.</p> </li> <li><p><strong>All data (from the complete dataset) that are more extreme than the stage-2 ODVs are considered to be outliers.</strong></p> </li> </ol> <h2 id="streaming-version">Streaming version</h2> <p>Even though the algorithm is not made for streaming data, I will convert it into a streaming algorithm by calculating <a href="http://kyrcha.info/2019/04/05/calculating-the-running-average-and-variance-of-streaming-data-using-redis">the running average and variance of the data</a> using <a href="https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford&#39;s_online_algorithm">Welford&#39;s online algorithm</a> and benchmark it using the <em>Numenta Anomaly Benchmark (NAB)</em>.</p> <p>The <a href="https://github.com/kyrcha/NAB/blob/master/nab/detectors/chebyshev/chebyshev_detector.py">code</a> for the algorithm can be found in my <a href="https://github.com/kyrcha/NAB">personal fork</a> of the <a href="https://github.com/numenta/NAB">NAB GitHub repository</a>. It achieves:</p> <ul> <li><strong>18.44</strong> in the Standard Profile </li> <li><strong>13.18</strong> Reward Low FP </li> <li><strong>23.21</strong> Reward Low FN</li> </ul> <p>with <strong>100.00</strong> being the perfect score in all three categories. 
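</p> <p>For reference, the two-stage batch procedure described above can be sketched as follows (the function name and the toy data are my own, not from the paper):</p>

```python
import numpy as np

def chebyshev_outliers(data, p1=0.05, p2=0.001):
    data = np.asarray(data, dtype=float)
    # Stage 1: trim everything more extreme than k1 standard deviations
    k1 = 1.0 / np.sqrt(p1)
    mu, sigma = data.mean(), data.std(ddof=1)
    trimmed = data[(data >= mu - k1 * sigma) & (data <= mu + k1 * sigma)]
    # Stage 2: final ODVs from the truncated (stage-1 outlier-free) sample
    k2 = 1.0 / np.sqrt(p2)
    mu2, sigma2 = trimmed.mean(), trimmed.std(ddof=1)
    odv_low, odv_high = mu2 - k2 * sigma2, mu2 + k2 * sigma2
    # Flag outliers in the complete dataset against the stage-2 ODVs
    return (data < odv_low) | (data > odv_high)

# 200 well-behaved values plus one injected extreme value
rng = np.random.default_rng(0)
values = np.append(rng.normal(10.0, 0.5, 200), 55.0)
print(np.where(chebyshev_outliers(values))[0])  # flags only index 200, the injected 55.0
```

<p>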
Even though the score may be low, every record is handled in constant time <em>O(1)</em> and no assumptions whatsoever are made regarding the distribution of the incoming data.</p> <p>Below is an example of outliers detected in a time series of AWS EC2 CPU utilization:</p> <p><img src="//images.ctfassets.net/c5lel8y1n83c/75jCE9hAKNoeluZ76E2NGs/eaa1a0ea7ff1918bec7177a4fbc77a1c/nab-chebyshev-ec2.png" alt="nab-chebyshev-ec2"></p> <p>Red circles mark the points labeled as outliers, while blue crosses mark the predicted ones. As one can see, the algorithm is capable of identifying individual outliers rather than the periods of &quot;outlierish&quot; behavior that NAB expects.</p> <p>The code for the NAB implementation can be found below:</p> <pre><code class="language-python">from nab.detectors.base import AnomalyDetector
import math


class ChebyshevDetector(AnomalyDetector):
    &quot;&quot;&quot;A streaming version of the algorithm found in the paper:
    &quot;Data Outlier Detection using the Chebyshev Theorem&quot;
    using Welford&#39;s online algorithm to calculate mean and
    standard deviation
    &quot;&quot;&quot;

    def __init__(self, *args, **kwargs):
        super(ChebyshevDetector, self).__init__(*args, **kwargs)
        self.p1 = 0.1  # Stage 1 probability
        self.p2 = 0.001  # Stage 2 probability
        self.k1 = 1/math.sqrt(self.p1)
        self.k2 = 1/math.sqrt(self.p2)
        self.n1 = 0
        self.m1 = 0
        self.m1_2 = 0
        self.std1 = 1
        self.n2 = 0
        self.m2 = 0
        self.m2_2 = 0
        self.std2 = 1

    def handleRecord(self, inputData):
        &quot;&quot;&quot;Returns a tuple (anomalyScore). The input value is considered
        an outlier if it resides outside the Outlier Detection Values
        (upper or lower). The anomalyScore is calculated based on the
        normalized distance the input value has from the upper or lower
        ODVs, if the input value is considered an outlier, otherwise it
        is 0.0. The probabilities p1 and p2 have been tuned a bit to give
        good performance on NAB.
        &quot;&quot;&quot;
        anomalyScore = 0.0
        inputValue = inputData[&quot;value&quot;]
        # stage 1 statistics (Welford&#39;s online algorithm)
        self.n1 += 1
        delta = inputValue - self.m1
        self.m1 += delta/self.n1
        self.m1_2 += delta * (inputValue - self.m1)
        self.std1 = math.sqrt(self.m1_2/(self.n1-1)) if self.n1-1 &gt; 0 else 0.000001
        odv1_high = self.m1 + self.k1 * self.std1
        odv1_low = self.m1 - self.k1 * self.std1
        if inputValue &lt;= odv1_high and inputValue &gt;= odv1_low:
            # Passed the first test, let&#39;s calculate the second stage statistics
            self.n2 += 1
            delta = inputValue - self.m2
            self.m2 += delta/self.n2
            self.m2_2 += delta * (inputValue - self.m2)
            self.std2 = math.sqrt(self.m2_2/(self.n2-1)) if self.n2-1 &gt; 0 else 0.000001
        odv2_high = self.m2 + self.k2 * self.std2
        odv2_low = self.m2 - self.k2 * self.std2
        if inputValue &gt; odv2_high:
            ratio = (inputValue - odv2_high)/inputValue
            anomalyScore = ratio
        elif inputValue &lt; odv2_low:
            ratio = abs((odv2_low - inputValue)/odv2_low)
            anomalyScore = ratio
        return (anomalyScore, )</code></pre> <![CDATA[What is a (startup) mastermind group?]]>http://kyrcha.info/2019/11/07/what-is-a-startup-mastermind-grouphttp://kyrcha.info2019/11/07/what-is-a-startup-mastermind-groupThu, 07 Nov 2019 11:25:00 GMT<p>I&#39;ve been listening to podcasts from <a href="https://www.startupsfortherestofus.com/">Startups for the Rest of Us</a> for some time now and what captured my attention was the idea of having a mastermind group. 
The notion of a startup mastermind group was mainly discussed in episodes:</p> <ul> <li><a href="https://www.startupsfortherestofus.com/episodes/episode-167">167</a></li> <li><a href="https://www.startupsfortherestofus.com/episodes/episode-277-five-ways-to-structure-your-startup-mastermind">277</a></li> </ul> <p>Originally mentioned in Napoleon Hill&#39;s book <em>&quot;Think and Grow Rich&quot;</em> (I haven&#39;t read it, but it was mentioned in the podcast), a mastermind group is a (small) group of people who are in a similar &quot;boat&quot;, encounter similar problems and have a similar type of business. Such a group can give suggestions to allow you to make better decisions for your business and grow your accountability towards yourself and your business. It can also offer you support and feedback. Especially for micropreneurs and solo founders, it is a way of getting a group of supportive people without being isolated. Family, friends and/or employees will probably never understand your problems at the level you need them to.</p> <p>Some heuristics are:</p> <ul> <li>3-5 people is around the optimal; 3 is a very good number to have</li> <li>Duration around 2 hours</li> <li>Around 30&#39; each, talking about your product/problems</li> <li>Meet every other week</li> <li>Have an opt-out period</li> <li>Have an expectation of confidentiality, because you will be discussing monetary and legal stuff among other things</li> <li>It is probably best to have met in person before</li> </ul> <p>For accountability, planning and history you can use a collaborative document editor like Google Docs, with bullet points of:</p> <ul> <li>previous commitments</li> <li>accomplished work</li> <li>work to be done</li> </ul> <p>Five approaches to structure your mastermind group:</p> <ol> <li>Round table: each person speaks an equal amount of time.</li> <li>Time segments: for example 5&#39; talk, 1&#39; questions, 1&#39; transition to the next person and start over</li> <li>Short 
hot seat: 1 person gets extra time, for example 1h, 15&#39;, 15&#39;</li> <li>Dedicated hot seat: each session one person talks the whole time</li> <li>Use a moderator</li> </ol> <p>As you can see there is no single correct way to structure your mastermind group. The most important thing is for it to be meaningful and helpful for everyone participating.</p> <![CDATA[Generating plausible paper titles with Recurrent Neural Networks]]>http://kyrcha.info/2019/11/07/generating-plausible-paper-titles-with-recurrent-neural-networkshttp://kyrcha.info2019/11/07/generating-plausible-paper-titles-with-recurrent-neural-networksThu, 07 Nov 2019 09:48:00 GMT<p>This is a fun project that occurred to me while reading, month after month, the email with the table of contents from the <em><a href="https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=5962385">IEEE Transactions on Neural Networks and Learning Systems</a></em> journal. It seemed to me that the titles followed a pattern that consisted of some conjunctions, prepositions and particles, intermingled with a lot of keywords specific to the field. So I thought it should be easy to learn to generate titles with a recurrent neural network and a small corpus. 
Let&#39;s see what I got.</p> <h2 id="data">Data</h2> <p>I exported the emails (see below) into a text file and with some text processing produced the final <a href="https://github.com/kyrcha/deep-learning-pipelines/blob/master/data/ieee-tnnls-titles.txt">txt file</a>, which contains all the titles (one title per line) from March 2016 till November 2019.</p> <p><img src="//images.ctfassets.net/c5lel8y1n83c/5AWJFi1nXmihLUxk5X0ijq/6bfedceda9db574d422f07e9e838f70a/Screenshot_2019-11-07_11.25.08.png" alt="Screenshot 2019-11-07 11.25.08"></p> <h2 id="pipeline">Pipeline</h2> <p>The <a href="https://github.com/kyrcha/deep-learning-pipelines/blob/master/generating_paper_titles.ipynb">whole pipeline</a> can be found in my <a href="https://github.com/kyrcha/deep-learning-pipelines">deep-learning-pipelines repository</a> as an IPython notebook.</p> <p>I start with the imports and then download the NLTK model data (you need to do this once):</p> <pre><code class="language-python">import csv
import itertools
import operator
import numpy as np
import nltk
import sys
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline

%%capture
# Download NLTK model data (you need to do this once)
nltk.download(&quot;book&quot;)</code></pre> <p>I read the file with the titles and now I am ready to check the data:</p> <pre><code class="language-python">with open(&#39;ieee-tnnls-titles.txt&#39;, &#39;r&#39;) as f:
    text = f.read()</code></pre> <h3 id="data-exploration">Data exploration</h3> <p>Let&#39;s explore the dataset a bit. 
</p> <pre><code class="language-python">print(&#39;Dataset Stats&#39;)
print(&#39;Roughly the number of unique words: {}&#39;.format(len({word: None for word in text.split()})))

titles = text.splitlines()
print(&#39;Number of titles: {}&#39;.format(len(titles)))

word_count_sentence = [len(title.split()) for title in titles]
print(&#39;Average number of words in each title: {}&#39;.format(np.average(word_count_sentence)))</code></pre> <p>We have:</p> <ul> <li><strong>1207</strong> titles,</li> <li>around <strong>2705</strong> unique words, while</li> <li>the average number of words in each title is <strong>10.2</strong></li> </ul> <p>So the ratio of titles to unique words is roughly <em>0.5</em>, which probably means we should prune the vocabulary to far fewer words, so that there are enough samples to learn their interactions.</p> <h3 id="pre-processing">Pre-processing</h3> <p>My next step was to preprocess the titles: add <code>START</code> and <code>END</code> tokens at the beginning and end of each title, tokenize the titles into words and remove non-alphabetical tokens.</p> <p>First I declared three tokens to be used for a) unknown words, b) the start and c) the end of a title:</p> <pre><code class="language-python">unknown_token = &quot;UNKNOWN_TOKEN&quot;
title_start_token = &quot;TITLE_START&quot;
title_end_token = &quot;TITLE_END&quot;</code></pre> <p>Then I proceed to sentence and word tokenization, along with adding the start/end tokens:</p> <pre><code class="language-python">from nltk.tokenize import sent_tokenize, word_tokenize

sentences = itertools.chain(*[nltk.sent_tokenize(x.lower()) for x in titles])
tokenized_titles = [&quot;%s %s %s&quot; % (title_start_token, x, title_end_token) for x in sentences]
tokenized_titles = [nltk.word_tokenize(title) for title in tokenized_titles]

final_title = []
for title in tokenized_titles:
    final_title.append([token for token in title if token.isalpha() or token == title_start_token or token == title_end_token])
tokenized_titles = final_title</code></pre> <p>An example of a tokenized title would be:</p> <pre><code>[&#39;TITLE_START&#39;, &#39;object&#39;, &#39;detection&#39;, &#39;with&#39;, &#39;deep&#39;, &#39;learning&#39;, &#39;a&#39;, &#39;review&#39;, &#39;TITLE_END&#39;]</code></pre><p>During this pre-processing step <strong>2073</strong> unique word tokens were found. Since the corpus is not very large, I will try to learn the connections between only the most popular words, in order to have enough samples to learn meaningful interconnections. Thus I chose a vocabulary size of <em>250</em>. So the next steps are: to find these frequent words, replace the rest with the <code>UNKNOWN</code> token and build <code>index_to_word</code> (a mapping from an integer to a word) and <code>word_to_index</code> (vice-versa) mappings:</p> <pre><code class="language-python">vocabulary_size = 250

word_freq = nltk.FreqDist(itertools.chain(*tokenized_titles))  # count the word frequencies
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])

print(&quot;Using vocabulary size %d.&quot; % vocabulary_size)
print(&quot;The least frequent word in our vocabulary is &#39;%s&#39; and appeared %d times.&quot; % (vocab[-1][0], vocab[-1][1]))</code></pre> <p>The least frequent word in our dictionary of <em>250</em> words appeared to be <em>&quot;stable&quot;</em> (7 appearances) and the most frequent <em>&quot;for&quot;</em> (553 appearances).</p> <p>As a next step I replaced all words not in our vocabulary with the <code>UNKNOWN</code> token:</p> <pre><code class="language-python">for i, sent in enumerate(tokenized_titles):
    tokenized_titles[i] = [w if w in word_to_index else unknown_token for w in sent]</code></pre> <p>So the title <em>&quot;Plume Tracing via Model-Free Reinforcement Learning Method&quot;</em> would look as follows after pre-processing: <code>[&#39;TITLE_START&#39;, &#39;UNKNOWN_TOKEN&#39;, &#39;UNKNOWN_TOKEN&#39;, &#39;via&#39;, &#39;reinforcement&#39;, 
&#39;learning&#39;, &#39;method&#39;, &#39;TITLE_END&#39;]</code></p> <h3 id="training">Training</h3> <p>Let&#39;s create the training data. I used the <code>KerasBatchGenerator</code> from <a href="https://adventuresinmachinelearning.com/keras-lstm-tutorial/">this blog post</a> to generate the batches to be fed into the LSTMs:</p> <pre><code>class KerasBatchGenerator(object):

    def __init__(self, data, num_steps, batch_size, vocabulary, skip_step=5):
        self.data = data
        self.num_steps = num_steps
        self.batch_size = batch_size
        self.vocabulary = vocabulary
        # this will track the progress of the batches sequentially through the
        # data set - once the data reaches the end of the data set it will reset
        # back to zero
        self.current_idx = 0
        # skip_step is the number of words which will be skipped before the next
        # batch is skimmed from the data set
        self.skip_step = skip_step

    def generate(self):
        x = np.zeros((self.batch_size, self.num_steps))
        y = np.zeros((self.batch_size, self.num_steps, self.vocabulary))
        while True:
            i = 0
            while i &lt; self.batch_size:
                # I don&#39;t want to see in x a title end token to predict y
                if self.current_idx &lt; len(self.data) and self.data[self.current_idx] == word_to_index[title_end_token]:
                    self.current_idx += self.skip_step
                if self.current_idx + self.num_steps &gt;= len(self.data):
                    # reset the index back to the start of the data set
                    self.current_idx = 0
                x[i, :] = self.data[self.current_idx:self.current_idx + self.num_steps]
                temp_y = self.data[self.current_idx + 1:self.current_idx + self.num_steps + 1]
                # convert all of temp_y into a one hot representation
                y[i, :, :] = to_categorical(temp_y, num_classes=self.vocabulary)
                self.current_idx += self.skip_step
                i += 1
            yield x, y</code></pre><p>Through the generator, batches of <em>10</em> tokens that predict the next token (in one hot encoding form) are generated. Each batch contains <em>2</em> arrays that contain <em>10</em> tokens each. 
The first array has <em>10</em> integers, while the second array has 10 one hot encoding vectors that represent the equivalent next tokens of the first array. For example:</p> <p>The 2 arrays are of the form:</p> <pre><code>[[0.],[122.],[249.],[29.],[3.],[187.],[11.],[0.],[40.],[3.]]</code></pre><p>and</p> <pre><code>[[[0., 0., 0., ..., 0., 0., 0.]], [[0., 0., 0., ..., 0., 0., 1.]], ...</code></pre><ol> <li>The <code>START</code> token in the first array, which is 0, to predict the one-hot encoded version of 122 (which is the next token after 0)</li> <li>The 122 token to predict the one-hot encoded version of 249</li> <li>The 249 token to predict the one-hot encoded version of 29</li> <li>and so on and so forth...</li> </ol> <p>The first <em>10K</em> tokens are employed for generating training batches, while the rest <em>3846</em> for validation. As a note, we never have a sample that uses the <code>END</code> token to predict the next token. Let&#39;s create the batch generators:</p> <pre><code class="language-python">num_steps = 1 skip_step = 1 batch_size = 10 # set seeds for reproducibility from numpy.random import seed seed(123) from tensorflow import set_random_seed set_random_seed(234) # Create the training data # A concatenation of all tokens as integers (indices) X = list(itertools.chain(*np.asarray([[word_to_index[w] for w in sent] for sent in tokenized_titles]))) # Create 2 batch generators out of the concatenation train_data_generator = KerasBatchGenerator(X[:10000], num_steps, batch_size, vocabulary_size, skip_step) valid_data_generator = KerasBatchGenerator(X[10001:], num_steps, batch_size, vocabulary_size, skip_step)</code></pre> <p>Next I create the model:</p> <pre><code class="language-python">from keras.models import Sequential from keras.layers import Dense, Activation, Embedding, Dropout, TimeDistributed from keras.layers import LSTM from keras.optimizers import Adam from keras.utils import to_categorical from keras.callbacks import 
ModelCheckpoint hidden_size = 250 model = Sequential() model.add(Embedding(vocabulary_size, hidden_size, input_length=num_steps)) model.add(LSTM(hidden_size, return_sequences=True)) model.add(LSTM(hidden_size, return_sequences=True)) model.add(Dropout(rate=0.5)) model.add(TimeDistributed(Dense(vocabulary_size))) model.add(Activation(&#39;softmax&#39;))</code></pre> <p>compile the model:</p> <pre><code>model.compile(loss=&#39;categorical_crossentropy&#39;, optimizer=&#39;adam&#39;, metrics=[&#39;categorical_accuracy&#39;])</code></pre><p>and train the model for 10 epochs:</p> <pre><code>num_epochs = 10 model.fit_generator(train_data_generator.generate(), len(X[:10000])//(batch_size*num_steps), num_epochs, validation_data=valid_data_generator.generate(), validation_steps=len(X[10001:])//(batch_size*num_steps))</code></pre><p>After training we got a validation categorical accuracy of <strong>0.3625</strong>, which is of course much better than a random guess among the roughly <em>250</em> tokens of the vocabulary.</p> <h3 id="generating">Generating</h3> <p>Now it is time to check the model. We start by feeding the model a <code>START</code> token and keep sampling until there is an <code>END</code> token.
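<p>The core of that loop is the sampling step: one token index is drawn from the model&#39;s softmax output via a multinomial draw. In isolation it looks like this (a minimal sketch; the three-element probability vector is made up for illustration, and the NumPy <code>Generator</code> API is used instead of the post&#39;s <code>np.random.multinomial</code>):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
probs = np.array([0.1, 0.6, 0.3])    # hypothetical softmax output over a 3-token vocabulary
one_hot = rng.multinomial(1, probs)  # a single draw from the distribution -> a one-hot vector
token = int(np.argmax(one_hot))      # index of the sampled token
```

<p>In the full generation loop this draw is repeated, appending each sampled token, until an <code>END</code> token appears.</p>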
We resample if the sampling generates the <code>UNKNOWN</code> token:</p> <pre><code>def generate_title(model): # We start the sentence with the start token new_title = [word_to_index[title_start_token]] # Repeat until we get an end token while not new_title[-1] == word_to_index[title_end_token]: x = np.zeros((1,1)) x[0, :] = new_title[-1] next_word_probs = model.predict(x)[0][0] sampled_word = word_to_index[unknown_token] # We don&#39;t want to sample unknown words while sampled_word == word_to_index[unknown_token]: samples = np.random.multinomial(1, next_word_probs) sampled_word = np.argmax(samples) new_title.append(sampled_word) title_str = [index_to_word[x] for x in new_title[1:-1]] return title_str num_sentences = 30 senten_min_length = 7 senten_max_length = 15 for i in range(num_sentences): sent = [] # We want long sentences, not sentences with one or two words while len(sent) &lt; senten_min_length or len(sent) &gt; senten_max_length: sent = generate_title(model) print(&quot; &quot;.join(sent))</code></pre><p>We generated <em>30</em> sentences between <em>7</em> and <em>15</em> tokens:</p> <pre><code>a new active systems under control of boolean network for deep noise framework and processes multiview metric clustering and neural networks approach for heterogeneous systems and noise learning structure of nonlinear multiagent systems and unknown systems deep neural networks with adaptive delays of regression adaptive stochastic models using active learning processes stability analysis for mimo neural networks with delays on state estimation of a new iterative learning a class of online model for a novel recurrent neural network a controller for feature analysis of neural networks and noise collaborative quality of and the neural networks a unified sparse representation of delayed neural network representation with the feature selection based on a application to stochastic delays via regularization unified analysis for a deep transfer learning a network of 
coupled uncertain delay and application to semisupervised classification a deep convolutional neural networks with communication constraints and its switched linear multiagent systems multimodal data for nonlinear systems with adaptive complex networks optimal delays control of multiple learning for clustering linear data design of delayed jump neural network for linear systems memristive generalized efficient estimation for feature selection for modeling for nonlinear systems a deep convolutional neural dynamic systems with hierarchical a constrained iterative learning with multiple least the classification sequential metric learning with a supervised systems with learning exponential synchronization of communication processes and its switched systems using neural networks application to mixture of gaussian heterogeneous and time delays a new control for generalized domain adaptation robust concept and local method for heterogeneous dynamic programming by a novel adaptive control of graph analysis for nonlinear kernel convolutional neural networks with delays optimal control of time regression and its application to features semisupervised feature optimization and probabilistic matrix learning markov for and an multiobjective framework with dynamical delay</code></pre><p>Even though the network didn&#39;t learn any grammar rules, some plausible titles were generated. 
For example (even though I wouldn&#39;t know what it would be about):</p> <pre><code>adaptive stochastic models using active learning processes</code></pre><p>and my favorite:</p> <pre><code>a novel adaptive control of graph analysis for nonlinear kernel convolutional neural networks with delays</code></pre><p>what a mouthful!</p> <h2 id="references">References</h2> <p>At this point I should mention that I re-used some code from:</p> <ul> <li><a href="https://adventuresinmachinelearning.com/keras-lstm-tutorial/">https://adventuresinmachinelearning.com/keras-lstm-tutorial/</a> (mainly the <code>KerasBatchGenerator</code>)</li> <li><a href="https://github.com/dennybritz/rnn-tutorial-rnnlm/blob/master/RNNLM.ipynb">https://github.com/dennybritz/rnn-tutorial-rnnlm/blob/master/RNNLM.ipynb</a> (Pre-processing and generating text snippets)</li> </ul> <![CDATA[Fitting modified Gompertz and Baranyi equations for bacterial growth in R]]>http://kyrcha.info/2019/10/25/fitting-modified-gompertz-baranyi-equations-bacterial-growth-rhttp://kyrcha.info2019/10/25/fitting-modified-gompertz-baranyi-equations-bacterial-growth-rSat, 26 Oct 2019 10:40:00 GMT<p>The modified Gompertz and Baranyi equations are two of the most famous equations for modelling bacterial growth. <a href="https://en.wikipedia.org/wiki/Bacterial_growth">Bacterial growth</a> is modelled in four different phases:</p> <ul> <li>The lag phase</li> <li>The log or exponential phase</li> <li>The stationary phase</li> <li>The death phase</li> </ul> <p>Researchers in the food engineering industry are interested in the first two (or three) phases: to maintain low bacterial populations, one wants to prolong the lag phase and inhibit the growth of the population. In the first three phases the growth curves resemble what is called a sigmoid curve.
As in a previous <a href="http://kyrcha.info/2012/07/08/tutorials-fitting-a-sigmoid-function-in-r">blog post on fitting sigmoid curves using R</a>, I will use the non-linear least-squares method in R to fit these specific curves to the data.</p> <p>Both the data and the equations are taken from the edited book of <a href="https://www.crcpress.com/Modeling-Microbial-Responses-in-Food/McKellar-Lu/p/book/9780367394653">McKellar and Lu, 2004: Modeling Microbial Responses in Food</a>.</p> <h2 id="modified-gompertz">Modified Gompertz</h2> <p>The modified Gompertz equation is equation (2.2) from the book and is given as:</p> <p>$$log(x_t) = A + C \cdot e^{-e^{(-B \cdot (t - M))}}$$</p> <p>where $x_t$ is the number of cells at time $t$, $A$ the asymptotic count, $C$ the difference in value of the upper and lower asymptote, $B$ the relative growth rate at $M$, and $M$ the time at which the absolute growth rate is maximum.</p> <p>Some data from the book for Listeria monocytogenes at 5 degrees Celsius are:</p> <pre><code># time in days d = c(0, 6, 24, 30, 48, 54, 72, 78, 99, 126, 144, 150, 168, 174, 191, 198, 216, 239, 266, 291, 316, 336, 342, 360, 384) # log cfu ml^-1 y = c(4.8, 4.7, 4.7, 4.7, 4.9, 5.1, 5.3, 5.4, 5.9, 6.3, 6.9, 6.9, 7.2, 7.3, 7.7, 7.8, 8.3, 8.8, 9.1, 9.2, 9.3, 9.7, 9.7, 9.7, 9.5)</code></pre><p>Let&#39;s start by defining the Gompertz equation in R:</p> <pre><code>gombertz_mod = function(params, x) { params[1] + (params[3] * exp(-exp(-params[2] * (x - params[4])))) }</code></pre><p>Next I fit the model using non-linear least squares:</p> <pre><code>fitmodel &lt;- nls(y ~ A + C * exp(-exp(-B * (d - M))), start=list(A=3, B=0.01, C=10, M=10))</code></pre><p>Extract the parameters and apply the model to new data:</p> <pre><code>gomb_params=coef(fitmodel) print(gomb_params)</code></pre><pre><code>## A B C M ## 4.65920718 0.01163221 5.40581821 138.91015307</code></pre><pre><code>d2 &lt;- 0:400 y2 &lt;- gombertz_mod(gomb_params, d2) y_pred_gomb &lt;-
gombertz_mod(gomb_params, d)</code></pre><p>Let&#39;s plot the fitted curve and the data points:</p> <pre><code>plot(d2, y2, type=&quot;l&quot;, xlab=&quot;time (days)&quot;, ylab=&quot;logx&quot;, main=&quot;Growth for Listeria monocytogenes (Gompertz)&quot;) points(d, y)</code></pre><p><img src="//images.ctfassets.net/c5lel8y1n83c/6KRu6BF6TJmhTuEtue0dor/8bc1a2a69d064f299fd2adf728b5ad78/unnamed-chunk-5-1.png" alt="unnamed-chunk-5-1"></p> <p>and calculate the RMSE:</p> <pre><code class="language-r">rmse &lt;- function(real, pred) { sqrt(mean((real-pred)^2)) } paste(&quot;RMSE modified Gombertz: &quot;, rmse(y, y_pred_gomb))</code></pre> <pre><code>## [1] &quot;RMSE modified Gombertz: 0.112256060122191&quot;</code></pre><h2 id="baranyi">Baranyi</h2> <p>The <strong>Baranyi</strong> model, equations (2.9) and (2.10) from the book, is:</p> <p>$$y(t) = y_0 + \mu_{max} \cdot A(t) - ln(1 + \frac{e^{\mu_{max} \cdot A(t)} - 1}{e^{y_{max}-y_0}})$$</p> <p>and</p> <p>$$A(t) = t + \frac{1}{\mu_{max}} \cdot ln(e^{-\mu_{max} \cdot t} + e^{-\mu_{max} \cdot \lambda} - e^{[-\mu_{max} \cdot (t + \lambda)]})$$</p> <p>where $y(t)=lnx(t)$, $y_0=lnx_0$, $\mu_{max}$ is the maximum specific growth rate and $\lambda$ is the lag-phase duration.
<em>Note: the equation for $A(t)$ is derived after substituting $q_0$ with $\frac{1}{e^{\mu_{max} \cdot \lambda} - 1}$ in the original equation from the book.</em></p> <p>Thus in R, we fit the Baranyi model with non-linear least squares and define the corresponding function:</p> <pre><code class="language-r">fitmodel &lt;- nls(y ~ y0 + mmax * (d + (1/mmax) * log(exp(-mmax*d) + exp(-mmax * lambda) - exp(-mmax * (d + lambda)))) - log(1 + ((exp(mmax * (d + (1/mmax) * log(exp(-mmax*d) + exp(-mmax * lambda) - exp(-mmax * (d + lambda)))))-1)/(exp(ymax-y0)))), start=list(y0=2.5, mmax=0.1, lambda=10, ymax=10))</code></pre> <pre><code class="language-r">baranyi &lt;- function(params, x) { params[1] + params[2] * (x + (1/params[2]) * log(exp(-params[2]*x) + exp(-params[2] * params[3]) - exp(-params[2] * (x + params[3])))) - log(1 + ((exp(params[2] * (x + (1/params[2]) * log(exp(-params[2]*x) + exp(-params[2] * params[3]) - exp(-params[2] * (x + params[3])))))-1)/ (exp(params[4]-params[1])))) } baranyi_params &lt;- coef(fitmodel) print(baranyi_params)</code></pre> <pre><code>## y0 mmax lambda ymax ## 4.63245864 0.02577884 65.17090445 9.69155419</code></pre><pre><code class="language-r">d3 &lt;- 0:400 y3 &lt;- baranyi(baranyi_params, d3) y_pred_baranyi &lt;- baranyi(baranyi_params, d)</code></pre> <pre><code class="language-r">plot(d3, y3, type=&quot;l&quot;, xlab=&quot;time (days)&quot;, ylab=&quot;logN&quot;, main=&quot;Growth for Listeria monocytogenes (Baranyi)&quot;) points(d, y)</code></pre> <p><img src="//images.ctfassets.net/c5lel8y1n83c/2BZI2arVHfzRe95Bbsn9yl/11dcaf10b3c699f0fb873f0eb1e3045f/unnamed-chunk-9-1.png" alt="unnamed-chunk-9-1"></p> <pre><code class="language-r">paste(&quot;RMSE Baranyi: &quot;, rmse(y, y_pred_baranyi))</code></pre> <pre><code>## [1] &quot;RMSE Baranyi: 0.102942076757872&quot;</code></pre><p>As expected from the literature, the Baranyi equations have a smaller error, basically due to the better fit in the steady state of the bacterial growth.</p> <p>A rendered edition of the <a 
href="https://github.com/kyrcha/ml-rants/blob/master/gompertz-baranyi-example.Rmd">R markdown notebook</a> can be found in <a href="http://rpubs.com/kyrcha/gompertz-baranyi-fit">Rpubs</a>.</p> <![CDATA[Sending graphql queries using http.Client in Go]]>http://kyrcha.info/2019/10/15/sending-graphql-queries-using-http-client-in-gohttp://kyrcha.info2019/10/15/sending-graphql-queries-using-http-client-in-goTue, 15 Oct 2019 05:30:00 GMT<p><strong>Background</strong>: I wanted to quickly test different, complex GraphQL queries against the GitHub v4 API, especially with respect to the error messages produced. At the same time I didn&#39;t want to create complex Go types to match the GitHub schema using a library like <a href="https://github.com/shurcooL/githubv4">shurcooL/githubv4</a>; it seemed like a lot of hassle for my purpose, especially since I didn&#39;t want to decode the response and use it.</p> <p><strong>Prerequisites</strong>: Create a personal access token with the scopes related to the queries you want to do and put it in the environment variable <code>GITHUB_TOKEN</code>.</p> <p>According to the <a href="https://developer.github.com/v4/guides/forming-calls/">GitHub documentation on forming calls</a>:</p> <blockquote> <p>The string value of <code>&quot;query&quot;</code> must escape newline characters or the schema will not parse it correctly. For the <code>POST</code> body, use outer double quotes and escaped inner double quotes.</p> </blockquote> <p>So in order not to do the encoding myself, I will use the <a href="https://golang.org/pkg/encoding/json/">Go json library</a> to take care of the JSON encoding/marshaling.
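<p>The escaping requirement quoted above is exactly what any JSON encoder does for free; a quick illustration (shown in Python rather than Go, purely for brevity, with a made-up query):</p>

```python
import json

query = """query {
  viewer { login }
}"""
body = json.dumps({"query": query})
# the literal newlines inside the query string are escaped as \n in the encoded body,
# so the result is a single-line JSON string safe to send as a POST body
assert "\\n" in body and "\n" not in body
```

<p>The Go <code>json.Marshal</code> call used below behaves the same way.</p>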
</p> <p>In addition, one must take care to add extra braces, as explained in this <a href="https://stackoverflow.com/a/58131007/869151">StackOverflow answer</a>.</p> <p>So let&#39;s start by creating an OAuth2 client since <a href="https://developer.github.com/v4/guides/forming-calls/#authenticating-with-graphql">you cannot have non-authenticated queries to the v4 API</a>:</p> <pre><code>client := oauth2.NewClient( context.TODO(), oauth2.StaticTokenSource( &amp;oauth2.Token{AccessToken: os.Getenv(&quot;GITHUB_TOKEN&quot;)}, ))</code></pre><p>then define a query:</p> <pre><code> query := `query { repository(owner:&quot;octocat&quot;, name:&quot;Hello-World&quot;) { issues(last:20, states:CLOSED) { edges { node { title url labels(first:5) { edges { node { name } } } } } } } }`</code></pre><p>then marshal (or encode) the request struct into JSON:</p> <p><code>gqlMarshalled, err := json.Marshal(graphQLRequest{Query: query})</code></p> <p>and finally POST:</p> <p><code>resp, err := client.Post(&quot;https://api.github.com/graphql&quot;, &quot;application/json&quot;, strings.NewReader(string(gqlMarshalled)))</code></p> <p>and dump the response:</p> <pre><code>b, _ := httputil.DumpResponse(resp, true) fmt.Println(string(b))</code></pre><p>The complete gist that includes a query with variables can be found below:</p> <p><code>gist:kyrcha/76fdcabfbdb4c746fdc8d20761262212#graphqlclient.go</code></p> <p>Execute it with:</p> <p><code>GITHUB_TOKEN=&lt;your token&gt; go run graphqlclient.go</code></p> <![CDATA[Launching the new kyrcha.info using Gatsby, Bulma, Contentful and Surge]]>http://kyrcha.info/2019/05/22/launching-the-new-kyrcha-info-using-gatsby-bulma-contentful-and-surgehttp://kyrcha.info2019/05/22/launching-the-new-kyrcha-info-using-gatsby-bulma-contentful-and-surgeWed, 22 May 2019 15:49:00 GMT<p>Finally!
Since the first commit on GitHub on the 26th of April 2018, that is, after almost a year, I am in a position to announce the official launch of <a href="http://kyrcha.info">kyrcha.info</a>.</p> <p>As I state in the header of my home page, I want <a href="http://kyrcha.info">kyrcha.info</a> to be the main point of entry to my digital self: to serve as a medium to communicate with the world, to serve as an archive, to serve as a long-term memory, to serve as a marketing tool.</p> <p>I have used many technologies before to build it: Wordpress (numerous attempts), dokuwiki, docpad, plain old html and more, but eventually I believe I found the combination that satisfies my requirements:</p> <ul> <li>A static site generator, with whatever that means in terms of performance and security vs. dynamic website platforms.</li> <li>Be able to own my content.</li> <li>Be able to extend the functionality myself programmatically.</li> <li>Use technologies I also use in other projects.</li> <li>Have pride in that I&#39;ve stitched it up myself.</li> </ul> <p>So I am writing this post to present to you kyrcha.info, my personal website that uses <a href="https://www.gatsbyjs.org/">GatsbyJS</a> as the static site generator, <a href="https://bulma.io/">Bulma</a> as the CSS framework, <a href="https://www.contentful.com/">Contentful</a> for managing content and <a href="https://surge.sh/">surge</a> for publishing.</p> <p>Features I wanted and have implemented in this website are:</p> <ul> <li>Google analytics through the <a href="https://www.gatsbyjs.org/packages/gatsby-plugin-google-analytics/">google-analytics Gatsby plugin</a></li> <li>RSS feed with email subscription. This required the <a href="https://www.gatsbyjs.org/packages/gatsby-plugin-feed/">feed Gatsby plugin</a> and feedburner with email support and <a href="https://github.com/kyrcha/kyrcha.info/blob/master/gatsby-config.js#L73">some code</a>.</li> <li>Math equations in blog posts.
For this I used the <a href="https://github.com/hanai/gatsby-remark-mathjax">Gatsby plugin remark-mathjax</a> and some code I found on <a href="https://github.com/hanai/gatsby-remark-mathjax/issues/1#issuecomment-443436362">GitHub</a>. So now I write equations like this: <code>$\frac{a}{b}$</code> in Contentful and they are transformed into math: $\frac{a}{b}$.</li> <li>Be able to write my own code if I want to for anything.</li> <li>Be able to draft something in Contentful and preview it without committing it to GitHub or doing other hacks.</li> <li>Be able to write my posts in (simple) markdown and not in html or in rich-format editors that are often a pain.</li> </ul> <p>I have also added:</p> <ul> <li>Commenting using Disqus through a React plugin. Unfortunately, I still cannot make the old comments show up on the new website, despite the migrations I have made in the Disqus platform.</li> </ul> <p>If you like the technologies, the features and the layout...the code is on <a href="https://github.com/kyrcha/kyrcha.info">GitHub</a>.</p> <p>Some remaining tasks are:</p> <ul> <li><del>I am missing pagination in the blog page</del> <strong>Update 2019-10-26</strong>: <a href="https://github.com/kyrcha/kyrcha.info/commit/59ff61f6b5b591a9b967bc3e4a513ab126193077">done</a></li> <li>I am missing tag pages to contain collections of posts with the same tag</li> <li>Optimizations for speed</li> <li>More content :)</li> </ul> <![CDATA[Simple rules for building robust machine learning models]]>http://kyrcha.info/2019/05/16/simple-rules-for-building-robust-machine-learning-modelshttp://kyrcha.info2019/05/16/simple-rules-for-building-robust-machine-learning-modelsThu, 16 May 2019 09:29:00 GMT<p>This is the title of my invited talk in the Ask Me Anything (AMA) call of the <a href="https://www.rd-alliance.org/groups/early-career-and-engagement-ig">Research Data Alliance (RDA) Early Career and Engagement Interest Group</a>.
The minutes of the call will be posted <a href="https://github.com/fpsom/rda-eceig">here</a>.</p> <p>The rules are summarized as follows:</p> <ol> <li>Always have 3 sets:<ul> <li>training</li> <li>validation</li> <li>test</li> </ul> </li> <li>Validation and test sets should reflect the data you expect to see in the future</li> <li>Follow dataset size heuristics</li> <li>Choose one metric to iterate faster and have more focus</li> <li>Always do your exploratory data analysis <ul> <li>density plots</li> <li>correlation plots</li> <li>box plots</li> </ul> </li> <li>When preprocessing use statistics based only on the training set</li> <li>Increase the number of times you do 10-fold CV to get even more accurate estimates of performance</li> <li>Use the Wilcoxon statistical test to choose between two models</li> <li>Time is money (Person-Months and Cloud Computing), so start with a small dataset, debug and then increase the size</li> <li>If you don&#39;t have enough data, find or create more data</li> <li>Decide if you strive for performance or interpretability</li> <li>Learn the strong points of each ML model</li> <li>Become a knowledgeable trader of bias-variance</li> <li>Finish off with an ensemble</li> <li>Tune hyperparameters ...
but up to a point</li> <li>Start with a simple waterfall-like process:<ul> <li>Study the problem</li> <li>EDA</li> <li>Define optimization strategy</li> <li>Do feature engineering</li> <li>Modelling</li> <li>Ensembling</li> </ul> </li> </ol> <p>Enjoy!</p> <p><a href="https://speakerdeck.com/kyrcha/simple-rules-for-building-robust-machine-learning-models">https://speakerdeck.com/kyrcha/simple-rules-for-building-robust-machine-learning-models</a></p> <![CDATA[Advices and strategies I learned from my first business attempt]]>http://kyrcha.info/2019/04/23/advices-and-strategies-i-learned-from-my-first-business-attempthttp://kyrcha.info2019/04/23/advices-and-strategies-i-learned-from-my-first-business-attemptTue, 23 Apr 2019 08:46:00 GMT<p>This is the title of my invited talk in <a href="http://www.sfhmmy.gr/en/home">ECESCON 2019 (Electrical and Computer Engineering Student Conference)</a>. Even though I do not consider myself an experienced (or successful) entrepreneur, I took up the challenge to come up with a talk that I believe will help others in their first attempt. The slides are below:</p> <p><a href="https://speakerdeck.com/kyrcha/advices-and-strategies-i-learned-from-my-first-business-attempt">https://speakerdeck.com/kyrcha/advices-and-strategies-i-learned-from-my-first-business-attempt</a></p> <![CDATA[Calculating the running average and variance of streaming data using redis]]>http://kyrcha.info/2019/04/05/calculating-the-running-average-and-variance-of-streaming-data-using-redishttp://kyrcha.info2019/04/05/calculating-the-running-average-and-variance-of-streaming-data-using-redisThu, 04 Apr 2019 21:33:00 GMT<p>In our Big Data Management System, <a href="https://github.com/AuthEceSoftEng/cenote">cenote</a>, we wanted to calculate the running average and the running variance of numeric JSON properties from streaming data processed by Storm bolts.
Before the bolts store the data in the database, we wanted to update these statistics for each numeric property in order to perform online outlier detection. The values of each numeric property can be processed by a different bolt and all these bolts have to update the same running statistic concurrently. Thus each update should be either an atomic operation or formed as a transaction, while at the same time being fast enough to achieve near-real-time processing end-to-end. </p> <p>A system that supports quick writes and reads is redis, an in-memory, key-value store that has both atomic operations and <a href="https://redis.io/topics/transactions">transactions</a>. So when a bolt receives a JSON document it should update the triplet, <code>{n, m, m2}</code>, according to <a href="https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford&#39;s_online_algorithm">Welford&#39;s algorithm</a>, with <code>n</code> being the number of samples, <code>m</code> the mean and <code>m2</code> the sum of squared distances from the mean.</p> <p>Below is example code (or <a href="https://gist.github.com/kyrcha/974f662d988906023d20cadf22a8a88e">here as a gist</a>) that instantiates a pool of 100 threads, with each one processing a number, connecting to redis and updating the number of samples, their running average and their variance using transactions (or <a href="https://pypi.org/project/redis/">pipelines in the Python-redis language</a>).
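<p>Stripped of redis, the triplet update itself is only a few lines; a minimal plain-Python sketch of Welford&#39;s algorithm with the same <code>{n, m, m2}</code> state (the sample numbers are made up):</p>

```python
def welford_update(n, m, m2, x):
    """One Welford step: return the updated (n, m, m2) after observing x."""
    n += 1
    delta = x - m
    m += delta / n
    m2 += delta * (x - m)  # note: this uses the *updated* mean
    return n, m, m2

n, m, m2 = 0, 0.0, 0.0
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    n, m, m2 = welford_update(n, m, m2, x)

mean = m           # = 5.0 for this data
variance = m2 / n  # population variance, = 4.0 here; use m2 / (n - 1) for the sample variance
```

<p>The example below performs this exact update inside redis, so that concurrent bolts cannot interleave their read-modify-write cycles.</p>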
It uses a Lua script and the EVAL redis command.</p> <pre><code class="language-python">from multiprocessing import Pool import redis import math import json from random import seed from random import gauss # Atomic operations def sum(x): r = redis.Redis(host=&#39;localhost&#39;, port=6379, db=0) r.incrbyfloat(&#39;sum&#39;, x) # transactional operations using EVAL def welford(x): r = redis.Redis(host=&#39;localhost&#39;, port=6379, db=0) pipe = r.pipeline() running(keys=[&#39;aggregate&#39;], args=[x], client=pipe) pipe.execute() if __name__ == &#39;__main__&#39;: # create a sequence of numbers following a normal distribution # of mean 0 and 1 standard deviation seed(1) sequence = [gauss(0,1) for i in range(1000)] # connect to redis and initialize rmain = redis.Redis(host=&#39;localhost&#39;, port=6379, db=0) rmain.set(&#39;sum&#39;, 0) rmain.set(&#39;aggregate&#39;, &#39;{ &quot;n&quot;: &quot;0&quot;, &quot;m&quot;: &quot;0&quot;, &quot;m2&quot;: &quot;0&quot; }&#39;) # Welford&#39;s online algorithm # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford&#39;s_online_algorithm lua_script = &quot;&quot;&quot; local aggregate = redis.call(&#39;get&#39;,KEYS[1]) local decode = cjson.decode(aggregate) local n = decode[&#39;n&#39;] n = n + 1 local m = decode[&#39;m&#39;] local m2 = decode[&#39;m2&#39;] local delta = ARGV[1] - m m = m + delta/n m2 = m2 + delta * (ARGV[1] - m) decode[&#39;n&#39;] = n decode[&#39;m&#39;] = m decode[&#39;m2&#39;] = m2 local encoded = cjson.encode(decode) redis.call(&#39;set&#39;, KEYS[1], encoded) &quot;&quot;&quot; running = rmain.register_script(lua_script) p = Pool(100) # create a pool of 100 threads p.map(sum, sequence) # calculate the sum p.map(welford, sequence) # calculate running mean and variance or standard deviation print(&#39;sum: &#39; + str(rmain.get(&#39;sum&#39;))) result = json.loads(rmain.get(&#39;aggregate&#39;)) print(&#39;count: &#39; + str(result[&#39;n&#39;])) print(&#39;mean: &#39; + 
str(result[&#39;m&#39;])) print(&#39;std: &#39; + str(math.sqrt(result[&#39;m2&#39;]/result[&#39;n&#39;])))</code></pre> <![CDATA[On collinearity and feature selection]]>http://kyrcha.info/2019/03/22/on-collinearity-and-feature-selectionhttp://kyrcha.info2019/03/22/on-collinearity-and-feature-selectionFri, 22 Mar 2019 15:00:00 GMT<p>I am writing this post in response to Kent C. Dodds&#39; <a href="https://kentcdodds.com/blog/intentional-career-building/">Call for Action in the area of intentional career building</a>. In that post Kent C. Dodds discusses (possibly reproducible) ideas on how he built his career (by creating and communicating value). One of the proposed actions was <em>&quot;Answer your co-worker&#39;s question in a public space (YouTube, gist, etc.) and share it&quot;</em>. </p> <p>In this blog post I am going ahead and answering a student&#39;s question in a public space. In particular, I got asked by a student <em>whether one should eliminate collinearity, using Variance Inflation Factor (VIF) for example, before using a feature selection algorithm</em>. I&#39;ll do my best to provide an insightful answer and to do that I will be fusing my knowledge, experimentation and different resources I found on the Internet. </p> <p>More posts like this will follow. It is a way of finding ideas and writing posts that help you become a better communicator of ideas and concepts, creating content and value and helping others along the way.
But I am stalling, so let&#39;s start.</p> <h2 id="resources">Resources</h2> <p>I read and used the following resources on the subject:</p> <ul> <li>This <a href="https://stats.stackexchange.com/q/168622/57185">StackOverflow (SO)</a> question and its answers: <a href="https://stats.stackexchange.com/a/168631/57185">answer 1</a>, <a href="https://stats.stackexchange.com/a/168703/57185">answer 2</a>, <a href="https://stats.stackexchange.com/a/208156/57185">answer 3</a>, <a href="https://stats.stackexchange.com/a/168703/57185">answer 4</a></li> <li>This <a href="https://stats.stackexchange.com/q/30486/57185">SO question</a> and its answers: <a href="https://stats.stackexchange.com/a/112938/57185">answer 1</a></li> <li>This <a href="https://stats.stackexchange.com/q/25611/57185">SO question</a> and its answers: <a href="https://stats.stackexchange.com/a/26051/57185">answer 1</a></li> <li>This <a href="http://www.sthda.com/english/articles/39-regression-model-diagnostics/160-multicollinearity-essentials-and-vif-in-r/">blog post</a></li> <li>The textbook <a href="https://amzn.to/2K4iaOB">Applied Linear Statistical Models, 5th edition</a>.</li> <li><a href="https://en.wikipedia.org/wiki/Collinearity#Usage_in_statistics_and_econometrics">Wikipedia entry</a></li> </ul> <h2 id="collinearity">Collinearity</h2> <p>Intercorrelation or multi-collinearity is the existence of predictor variables that are (highly) correlated among themselves. For example, family income, family savings, and age of head of household are correlated among themselves when we try to predict family food expenditures. The older you are, the more money and savings you probably have, and vice-versa.</p> <p>In the presence of perfect collinearity, i.e. feature $X_1 = \alpha + \beta * X_2$ (1), many different sets of coefficient values would predict the response variable $Y$ equally well after performing Ordinary Least Squares (OLS).
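<p>A tiny numerical sketch makes this concrete (Python, with fabricated data in which $X_2 = 2 X_1$ exactly); two different coefficient vectors produce identical predictions:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 2 * x1                      # perfect collinearity: X2 = 2 * X1
X = np.column_stack([x1, x2])

beta_a = np.array([3.0, 0.0])    # fitted values 3*X1
beta_b = np.array([0.0, 1.5])    # fitted values 1.5*X2, i.e. 3*X1 as well

# the two coefficient vectors are indistinguishable to least squares
assert np.allclose(X @ beta_a, X @ beta_b)
```

<p>Any combination with $\beta_1 + 2 \beta_2 = 3$ gives the same fitted values, so least squares has no basis to prefer one.</p>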
So given all these solutions, one would not be able to say anything regarding the effect $X_1$ and $X_2$ have on $Y$. In addition, if a new sample arrives that we want to predict and it does not follow equation (1), the prediction error will probably be very large. Finally, the regression coefficients of any multicollinear predictor variables can be very different in the presence of non-correlated variables. In practice though, this is rarely the case since there is also an error component to the relationship.</p> <p>In plain words: If there is a &quot;nice&quot; relationship between $X_1$ and $X_2$, when new disturbed data arrive, don&#39;t expect to have good predictions. On the other hand, if the relationship between $X_1$ and $X_2$ is fuzzy and it continues to be fuzzy, then you won&#39;t have a problem. In general collinearity causes problems for the interpretability of the model. Prediction is not hurt as long as the new samples that arrive follow the same (multi)collinear pattern. </p> <h2 id="variance-inflation-factor-vif">Variance Inflation Factor (VIF)</h2> <p>The formal method for detecting the presence of multicollinearity is the Variance Inflation Factor (VIF). VIF measures how much the variances of the estimated regression coefficients are inflated as compared to when the predictor variables are not linearly related. VIF is 1 when $X_n$ is not linearly related to the other predictors and greater than 1 in the presence of intercorrelations with other features. A VIF of more than 10 is a heuristic indication that collinearity is influencing the regression.</p> <p>One can drop one or more collinear variables from the model, but 1) you get no insight on whether the dropped variables would have helped or hurt the prediction, and 2) the coefficients of the remaining variables will change. One can also do Principal Components Analysis, which will provide new uncorrelated variables.
Of course the new variables will not have any physical meaning whatsoever, again hurting interpretability. A remedial measure against serious collinearity is <a href="https://en.wikipedia.org/wiki/Tikhonov_regularization">Ridge Regression</a>, which through regularization gives preference to one solution over the others.</p> <p>On the other hand, in Machine Learning we most often care about robust predictions and models that can generalize well, rather than issues regarding the interpretability of the models. <em>(Sidenote: I believe this will change because we would often like to know why a machine learning model provided this or that prediction...think of bank credit scoring systems or autonomous cars using deep neural networks that must adhere to certain laws as well.)</em> The balance of how much regularization is needed (a form of a bias-variance trade-off, with examples: the $\lambda$ factor in Ridge Regression, the number of variables sampled in Random Forests or the regularization parameter C in Support Vector Machines) is usually found through cross-validation. Of course it is always good to know about the existence of collinearity, since in principle when the collinearity equation changes, we can have large prediction errors.</p> <p>To conclude, if you are interested in the effects of the predictor variables on the response and the interpretability of the model, do care about collinearity. If you are interested in the predictive abilities of a model, then you can skip it and follow regular machine learning flows. But it is always good to check.</p> <p>Below one can find experiments on various cases, along with comments, that I&#39;ve done using R to support the arguments above.
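<p>As a quick aside for readers outside R: VIF follows directly from its definition, $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the remaining predictors. A self-contained Python sketch with synthetic, nearly-collinear data:</p>

```python
import numpy as np

def vif(X):
    """VIF of each column of X: VIF_j = 1 / (1 - R^2_j), where R^2_j is from
    the OLS regression of column j on the other columns (plus an intercept)."""
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + remaining predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r2))
    return factors

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                      # independent of both
vifs = vif(np.column_stack([x1, x2, x3]))      # x1 and x2 blow past 10; x3 stays near 1
```

<p>With the heuristic threshold of 10 mentioned above, <code>x1</code> and <code>x2</code> are flagged while <code>x3</code> is not.</p>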
The Rmd document can be found <a href="https://github.com/kyrcha/ml-rants/blob/master/CollinearityAndFeatureSelection.Rmd">here</a> and the rendered html document <a href="https://kyrcha.github.io/ml-rants/CollinearityAndFeatureSelection.html">here</a>.</p> <hr> <h1 id="rmd-document-for-collinearity-and-feature-selection">Rmd document for Collinearity and Feature Selection</h1> <h2 id="intro">Intro</h2> <p>This notebook is an online appendix of my blog post: <a href="http://kyrcha.info/2019/03/22/on-collinearity-and-feature-selection">On Collinearity and Feature Selection</a>, where I play with the concepts using R code.</p> <h2 id="the-dataset">The dataset</h2> <p>We will use the <a href="https://archive.ics.uci.edu/ml/datasets/auto+mpg">auto-mpg dataset</a>, where we will try to predict the miles per gallon (mpg) consumption given some car-related features like horsepower, weight etc.</p> <pre><code>set.seed(1234) fileURL &lt;- &quot;https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data&quot; download.file(fileURL, destfile=&quot;auto-mpg.data&quot;, method=&quot;curl&quot;) data &lt;- read.table(&quot;auto-mpg.data&quot;, na.strings = &quot;?&quot;, quote=&#39;&quot;&#39;, dec=&quot;.&quot;, header=F) # remove instances with missing values and the name of the car data &lt;- data[complete.cases(data),-9] summary(data) ## V1 V2 V3 V4 ## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 ## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 ## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 ## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 ## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 ## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 ## V5 V6 V7 V8 ## Min. :1613 Min. : 8.00 Min. :70.00 Min.
:1.000 ## 1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ## Median :2804 Median :15.50 Median :76.00 Median :1.000 ## Mean :2978 Mean :15.54 Mean :75.98 Mean :1.577 ## 3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 ## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000</code></pre><h2 id="preprocessing">Preprocessing</h2> <p>Let&#39;s normalize the dataset (values to be in the interval [0,1]), an operation that will maintain the correlation between the variables, and split it into training and test sets.</p> <pre><code>normalize &lt;- function(x) { (x - min(x, na.rm=TRUE))/(max(x,na.rm=TRUE) - min(x, na.rm=TRUE)) } normData &lt;- cbind(data[,1], as.data.frame(lapply(data[,-1], normalize))) # name variables names(normData) &lt;- c(&quot;mpg&quot;, &quot;cylinders&quot;, &quot;displacement&quot;, &quot;horsepower&quot;, &quot;weight&quot;, &quot;acceleration&quot;, &quot;model_year&quot;, &quot;origin&quot;) # check correlation cat(&quot;Cor between disp. and weight before norm.:&quot;, cor(data$V3, data$V5), &quot;\n&quot;) ## Cor between disp. and weight before norm.: 0.9329944 cat(&quot;After norm.:&quot;, cor(normData$displacement, normData$weight)) ## After norm.: 0.9329944 # Train/Test split library(tidyverse) library(caret) training.samples &lt;- normData$mpg %&gt;% createDataPartition(p = 0.8, list = FALSE) train.data &lt;- normData[training.samples, ] test.data &lt;- normData[-training.samples, ]</code></pre><h2 id="modelling">Modelling</h2> <h3 id="linear-regression">Linear Regression</h3> <pre><code>linearModel &lt;- lm(mpg ~., data = train.data) summary(linearModel) ## ## Call: ## lm(formula = mpg ~ ., data = train.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.4147 -2.1845 -0.1875 1.7702 12.8931 ## ## Coefficients: ## Estimate Std.
Error t value Pr(&gt;|t|) ## (Intercept) 25.4017 1.3012 19.521 &lt; 2e-16 *** ## cylinders -2.5022 1.8711 -1.337 0.1821 ## displacement 7.7778 3.2280 2.409 0.0166 * ## horsepower -2.8404 2.8077 -1.012 0.3125 ## weight -22.9293 2.5394 -9.029 &lt; 2e-16 *** ## acceleration 2.0678 1.8409 1.123 0.2622 ## model_year 9.3472 0.7023 13.309 &lt; 2e-16 *** ## origin 2.9604 0.6383 4.638 5.21e-06 *** ## --- ## Signif. codes: 0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1 ## ## Residual standard error: 3.389 on 307 degrees of freedom ## Multiple R-squared: 0.8175, Adjusted R-squared: 0.8134 ## F-statistic: 196.5 on 7 and 307 DF, p-value: &lt; 2.2e-16</code></pre><p>One can see that weight is a very important factor, both in terms of the coefficient value and in terms of statistical significance (it is unlikely to observe a relationship between weight and mpg due to chance). Notice the negative coefficient (more weight, fewer miles per gallon), which can be explained by the laws of physics. But also notice that even though weight and displacement have a correlation of 0.93 (almost collinear), their coefficients have different signs. Based on common knowledge though, they should have had the same sign. Collinearity is bad when you try to explain the outputs of models. Let&#39;s examine the VIF values:</p> <pre><code>library(car) vif(linearModel) ## cylinders displacement horsepower weight acceleration ## 10.797995 19.995685 8.952301 10.191612 2.424953 ## model_year origin ## 1.223735 1.794710</code></pre><p>We observe that 3 predictors have a value of more than 10, which is a concern for the existence of collinearity (or multicollinearity in this case). So let&#39;s drop displacement and create a second model:</p> <pre><code>linearModelMinusDisp &lt;- lm(mpg ~.-displacement, data = train.data) summary(linearModelMinusDisp) ## ## Call: ## lm(formula = mpg ~ .
- displacement, data = train.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.4776 -2.2073 -0.1473 1.7625 12.9679 ## ## Coefficients: ## Estimate Std. Error t value Pr(&gt;|t|) ## (Intercept) 25.4241 1.3113 19.388 &lt; 2e-16 *** ## cylinders 0.5072 1.4040 0.361 0.718 ## horsepower -1.1352 2.7382 -0.415 0.679 ## weight -20.6134 2.3688 -8.702 &lt; 2e-16 *** ## acceleration 1.5673 1.8434 0.850 0.396 ## model_year 9.2684 0.7070 13.110 &lt; 2e-16 *** ## origin 2.4812 0.6112 4.060 6.24e-05 *** ## --- ## Signif. codes: 0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1 ## ## Residual standard error: 3.415 on 308 degrees of freedom ## Multiple R-squared: 0.8141, Adjusted R-squared: 0.8104 ## F-statistic: 224.7 on 6 and 308 DF, p-value: &lt; 2.2e-16 vif(linearModelMinusDisp) ## cylinders horsepower weight acceleration model_year ## 5.986398 8.383554 8.731646 2.394082 1.221086 ## origin ## 1.620491</code></pre><p>and let&#39;s check the predictive ability of the two:</p> <pre><code>predLM &lt;- linearModel %&gt;% predict(test.data) predLMMD &lt;- linearModelMinusDisp %&gt;% predict(test.data) cat(&quot;Full model:&quot;, RMSE(predLM, test.data$mpg), &quot;\n&quot;) ## Full model: 3.098026 cat(&quot;Minus disp. model:&quot;, RMSE(predLMMD, test.data$mpg)) ## Minus disp. model: 3.125825</code></pre><p>As one can see, I now have a more &quot;understandable&quot; model, with a somewhat &quot;worse&quot; predictive ability (slightly higher error).</p> <p>Now let&#39;s remove all the variables that had a VIF value of more than 10:</p> <pre><code>linearModelSimpler &lt;- lm(mpg ~.-displacement-cylinders-weight, data = train.data) summary(linearModelSimpler) ## ## Call: ## lm(formula = mpg ~ . - displacement - cylinders - weight, data = train.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.832 -2.415 -0.533 1.930 12.794 ## ## Coefficients: ## Estimate Std.
Error t value Pr(&gt;|t|) ## (Intercept) 28.7618 1.4362 20.027 &lt; 2e-16 *** ## horsepower -24.3357 1.6960 -14.349 &lt; 2e-16 *** ## acceleration -6.9758 1.8700 -3.730 0.000227 *** ## model_year 8.0722 0.8034 10.048 &lt; 2e-16 *** ## origin 4.9410 0.6324 7.813 8.76e-14 *** ## --- ## Signif. codes: 0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1 ## ## Residual standard error: 3.938 on 310 degrees of freedom ## Multiple R-squared: 0.7511, Adjusted R-squared: 0.7479 ## F-statistic: 233.9 on 4 and 310 DF, p-value: &lt; 2.2e-16 vif(linearModelSimpler) ## horsepower acceleration model_year origin ## 2.418236 1.852449 1.185461 1.304459 predLMS &lt;- linearModelSimpler %&gt;% predict(test.data) cat(&quot;Even simpler model - RMSE:&quot;, RMSE(predLMS, test.data$mpg), &quot;\n&quot;) ## Even simpler model - RMSE: 3.575336</code></pre><p>Now the model is simpler, more explainable and without any collinearities (low VIF values), but not as good in terms of RMSE as the previous ones.</p> <h2 id="feature-selection">Feature Selection</h2> <p>To also check a feature selection method, we apply stepwise feature selection using the Akaike Information Criterion (AIC):</p> <pre><code>require(leaps) require(MASS) step.model &lt;- stepAIC(linearModel, direction = &quot;both&quot;, trace = FALSE) summary(step.model) ## ## Call: ## lm(formula = mpg ~ displacement + weight + acceleration + model_year + ## origin, data = train.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.1253 -2.1573 -0.1547 1.8907 12.8220 ## ## Coefficients: ## Estimate Std. Error t value Pr(&gt;|t|) ## (Intercept) 24.4535 1.0334 23.663 &lt; 2e-16 *** ## displacement 4.3108 2.3178 1.860 0.0639 . ## weight -24.4212 2.2367 -10.918 &lt; 2e-16 *** ## acceleration 3.1711 1.4748 2.150 0.0323 * ## model_year 9.5092 0.6811 13.963 &lt; 2e-16 *** ## origin 2.7723 0.6225 4.453 1.18e-05 *** ## --- ## Signif.
codes: 0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1 ## ## Residual standard error: 3.392 on 309 degrees of freedom ## Multiple R-squared: 0.816, Adjusted R-squared: 0.813 ## F-statistic: 274 on 5 and 309 DF, p-value: &lt; 2.2e-16 linearModelAIC &lt;- lm(as.formula(step.model), data = train.data) vif(linearModelAIC) ## displacement weight acceleration model_year origin ## 10.288515 7.890980 1.553344 1.148540 1.703980 predLMAIC &lt;- linearModelAIC %&gt;% predict(test.data) cat(&quot;AIC model:&quot;, RMSE(predLMAIC, test.data$mpg), &quot;\n&quot;) ## AIC model: 3.114763</code></pre><p>Through this example we can see that even though we have collinearities involved (a VIF value of 10+), we obtain a low RMSE of 3.11. Or, using another feature selection package:</p> <pre><code># Set up repeated k-fold cross-validation train.control &lt;- trainControl(method = &quot;cv&quot;, number = 10) # Train the model step.model2 &lt;- train(mpg ~., data = train.data, method = &quot;leapBackward&quot;, tuneGrid = data.frame(nvmax = 1:7), trControl = train.control ) step.model2$results ## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD ## 1 1 4.387909 0.6950849 3.378332 0.9854225 0.09849010 0.6697571 ## 2 2 3.407376 0.8137023 2.661535 0.9123314 0.06485389 0.5518240 ## 3 3 3.333900 0.8224665 2.549990 0.8779426 0.06157152 0.5311275 ## 4 4 3.371991 0.8188230 2.592064 0.8832272 0.06214344 0.5410169 ## 5 5 3.395429 0.8170493 2.618763 0.8325787 0.05722386 0.4970754 ## 6 6 3.404866 0.8163436 2.633234 0.7979961 0.05357786 0.4526265 ## 7 7 3.384003 0.8180447 2.618266 0.8100225 0.05401179 0.4619242 summary(step.model2$finalModel) ## Subset selection object ## 7 Variables (and intercept) ## Forced in Forced out ## cylinders FALSE FALSE ## displacement FALSE FALSE ## horsepower FALSE FALSE ## weight FALSE FALSE ## acceleration FALSE FALSE ## model_year FALSE FALSE ## origin FALSE FALSE ## 1 subsets of each size up to 3 ## Selection Algorithm: backward ##
cylinders displacement horsepower weight acceleration model_year ## 1 ( 1 ) &quot; &quot; &quot; &quot; &quot; &quot; &quot;*&quot; &quot; &quot; &quot; &quot; ## 2 ( 1 ) &quot; &quot; &quot; &quot; &quot; &quot; &quot;*&quot; &quot; &quot; &quot;*&quot; ## 3 ( 1 ) &quot; &quot; &quot; &quot; &quot; &quot; &quot;*&quot; &quot; &quot; &quot;*&quot; ## origin ## 1 ( 1 ) &quot; &quot; ## 2 ( 1 ) &quot; &quot; ## 3 ( 1 ) &quot;*&quot; coef(step.model2$finalModel, 3) ## (Intercept) weight model_year origin ## 26.190907 -21.244932 9.522434 2.370324</code></pre><p>The best model has 3 predictors: <code>weight</code>, <code>model_year</code> and <code>origin</code>. So, making one more, final linear regression model and predicting the <code>mpg</code> in the test set, we have:</p> <pre><code>linearModelBest &lt;- lm(mpg ~weight+model_year+origin, data = train.data) summary(linearModelBest) ## ## Call: ## lm(formula = mpg ~ weight + model_year + origin, data = train.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.8638 -2.1894 -0.0388 1.7413 13.0971 ## ## Coefficients: ## Estimate Std. Error t value Pr(&gt;|t|) ## (Intercept) 26.1909 0.6744 38.834 &lt; 2e-16 *** ## weight -21.2449 1.0134 -20.965 &lt; 2e-16 *** ## model_year 9.5224 0.6649 14.321 &lt; 2e-16 *** ## origin 2.3703 0.5942 3.989 8.28e-05 *** ## --- ## Signif. codes: 0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1 ## ## Residual standard error: 3.412 on 311 degrees of freedom ## Multiple R-squared: 0.8126, Adjusted R-squared: 0.8108 ## F-statistic: 449.6 on 3 and 311 DF, p-value: &lt; 2.2e-16 vif(linearModelBest) ## weight model_year origin ## 1.601095 1.082269 1.534712 predLMBest &lt;- linearModelBest %&gt;% predict(test.data) cat(&quot;Best model:&quot;, RMSE(predLMBest, test.data$mpg), &quot;\n&quot;) ## Best model: 3.096694</code></pre><p><em>3.09</em>!!! The lowest error, with only 3 predictors and no collinearities involved.
In this case feature selection went along with a final model that is interpretable as well.</p> <h2 id="ridge-regression">Ridge Regression</h2> <p>One solution to the collinearity problem (without doing any feature selection) is to apply Ridge Regression and try to &quot;constrain&quot; the many possible solutions for the <code>beta</code> coefficients into a single one. In this case we have the hyperparameter lambda to optimize, for which we will apply 10-fold cross-validation on the training set to find the best value, and then use that value to train the model and predict mpg in the testing dataset.</p> <pre><code>library(glmnet) y &lt;- train.data$mpg x &lt;- train.data %&gt;% dplyr::select(-starts_with(&quot;mpg&quot;)) %&gt;% data.matrix() lambdas &lt;- 10^seq(3, -2, by = -.1) fit &lt;- glmnet(x, y, alpha = 0, lambda = lambdas) cv_fit &lt;- cv.glmnet(x, y, alpha = 0, lambda = lambdas, nfolds = 10) # uncomment the plot to see how lambda changes the error. #plot(cv_fit) opt_lambda &lt;- cv_fit$lambda.min x_test &lt;- test.data %&gt;% dplyr::select(-starts_with(&quot;mpg&quot;)) %&gt;% data.matrix() y_predicted &lt;- predict(fit, s = opt_lambda, newx = x_test) cat(&quot;Ridge RMSE:&quot;, RMSE(y_predicted, test.data$mpg)) ## Ridge RMSE: 3.096991</code></pre><p>The Ridge Regression produced one of the lowest errors, and without dropping any of the coefficients. And as for the coefficients&#39; values:</p> <pre><code>coef(cv_fit) ## 8 x 1 sparse Matrix of class &quot;dgCMatrix&quot; ## 1 ## (Intercept) 26.7194742 ## cylinders -2.1671592 ## displacement -1.6690043 ## horsepower -5.5116891 ## weight -11.7117393 ## acceleration -0.3212087 ## model_year 7.8818066 ## origin 2.6205643</code></pre><p>which, as we can see, gave a much more reasonable and physically explainable model.
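For reference, the objective minimized here is the standard ridge criterion, with $\lambda$ being the <code>opt_lambda</code> selected by the cross-validation above:

```latex
% Standard ridge regression criterion: least squares plus an L2 penalty.
% The penalty term breaks the tie between the many near-equivalent
% least-squares solutions that collinearity allows, preferring the one
% with small coefficient norm.
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}
  \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^{2}
  + \lambda \sum_{j=1}^{p} \beta_j^{2} \right\}
```

This is why, instead of one coefficient exploding positively and a collinear one exploding negatively, all the coefficients get shrunk toward sensible values.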
The more cylinders, displacement, horsepower, weight and acceleration...the fewer miles per gallon you can drive, while the more recent the model, and the more it originated from origin 2 (Europe) or 3 (Japan), the more miles per gallon you can go. As for the RMSE, it is close both to the full model and to the optimized model using feature selection.</p> <h2 id="discussion">Discussion</h2> <p>Collinearity is important if you need to have an understandable model. If you don&#39;t, and you just care about predictive ability, you can be more brute-force about it and just care about the numbers.</p> <![CDATA[Make your environment variables more robust by making them more fragile]]>http://kyrcha.info/2019/01/29/make-your-environment-variables-more-robust-by-making-them-more-fragilehttp://kyrcha.info2019/01/29/make-your-environment-variables-more-robust-by-making-them-more-fragileTue, 29 Jan 2019 08:39:00 GMT<p>At the <a href="https://devitconf.org/2016/">Devit 2016</a> conference in Thessaloniki, in one of the keynotes, <a href="https://www.yegor256.com/">Yegor Bugayenko</a> explained why you need to make your software more fragile, in order to make it more robust. It was a really good talk, a talk that I still remember. At the end of this post I have embedded the recording from the conference.</p> <p>In this talk, Yegor Bugayenko explained why you need to make software fail fast in development in order to make it more robust in production. According to that strategy, every time there is a potential source of a bug, or in general something that is not on the happy path, we should make it more visible and &quot;bigger&quot;, in order to catch it by failing fast, fix it and deploy again.
Some of the examples in the talk say <strong>&quot;do this&quot;</strong>:</p> <pre><code>@Override void save() { throw new Exception( &quot;not implemented yet&quot; ); }</code></pre><p><strong>&quot;instead of this&quot;</strong>:</p> <pre><code>@Override void save() { // not implemented yet }</code></pre><p>So if the method is not implemented, the program will fail fast with the first strategy, but at least you will not think you called <code>save()</code> and it did something when it actually didn&#39;t, like in the second example. Or, <strong>&quot;do this&quot;</strong>:</p> <pre><code>if(!file.delete()) { throw new Exception( &quot;failed to delete a file&quot; ); }</code></pre><p><strong>&quot;instead of this&quot;</strong>:</p> <pre><code>file.delete();</code></pre><p>In the second case we ignore the output of the method. If for some reason the system fails to delete the file, we are left hanging with the idea that the file was deleted. But if we throw the exception, we can start writing exception-handling code on how to approach a failed file deletion (file not found? file in use? etc.).</p> <p>Now to our case: according to the <a href="https://12factor.net/config">third factor of the twelve-factor-app</a> methodology, we should use environment variables in order to configure the code between deploys (i.e.
environments like staging, production, development etc.).</p> <p>In a lot of node.js examples that use environment variables there is often the pattern:</p> <pre><code>const env = process.env.NODE_ENV || &quot;development&quot;;</code></pre><p>or </p> <pre><code>const mongo_uri = process.env.MONGODB_URI || &quot;mongodb://localhost:27017/test&quot;;</code></pre><p>Based on the <em>&quot;making the software more fragile in order to make it more robust&quot;</em> strategy, in my projects I use the following pattern:</p> <pre><code>function throwErr(msg) { throw new Error(msg); } const env = process.env.NODE_ENV || throwErr(&#39;NODE_ENV is unset&#39;); const mongo_uri = process.env.MONGODB_URI || throwErr(&#39;MONGODB_URI is unset&#39;);</code></pre><p>So now, every unset environment variable will cause an error to be thrown, and this will force me (or other members of the team) to set it before continuing with the deployment to an environment, even my own development environment. And how will I know what to do? It will be shown in the logs.</p> <p>Below is the talk that inspired me to use this pattern:</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/WOy9zhzyMOE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <p><em>Personal note: This is a post I wanted to write for a long time (a couple of years), but after reading <a href="https://kentcdodds.com/">Kent C.
Dodds</a> newsletter yesterday on &quot;Intentional Career Building&quot;, I did it first thing in the morning as one of the call-to-action bullet points he suggested: &quot;Write the blog post you wish existed last week when you were learning something new&quot;</em></p> <![CDATA[2018 in review]]>http://kyrcha.info/2019/01/23/2018-in-reviewhttp://kyrcha.info2019/01/23/2018-in-reviewWed, 23 Jan 2019 09:31:00 GMT<p>2018 is over and through this post I will try to summarize personal and professional accomplishments and some of my 2019 goals. I will also try to reflect on what went well, what didn&#39;t and what I&#39;ve learned this year, the <a href="https://jamesclear.com/2018-annual-review">James Clear way</a>. I always wanted to share my annual reviews and goals in a blog post, so let&#39;s make this year one! I know I am a bit late for such posts, but better late than never. The first part is pretty quantitative, while the second is more qualitative.</p> <h2 id="the-list">The list</h2> <h3 id="cyclopt">Cyclopt</h3> <p>(For those wondering what Cyclopt is: <a href="http://cyclopt.com/">Cyclopt</a> is my first startup company.)</p> <ul> <li>We got our first clients, performing software quality assessment on their web applications.</li> <li>We launched our first MVP, the <a href="https://qaas.cyclopt.com/">Quality as a Service</a> web application.</li> <li>We got the 5th place out of 361 business plans in the NBG Business Seeds Competition.</li> </ul> <h3 id="publications">Publications</h3> <ul> <li>I co-authored and self-published my first book ever: <a href="https://leanpub.com/practical-machine-learning-r">Practical Machine Learning with R</a>, which at the moment of writing this post has around 500 readers.</li> <li>Published 4 papers:<ul> <li><em>&quot;npm-miner: An Infrastructure for Measuring the Quality of the npm Registry&quot;</em> in MSR 2018</li> <li><em>&quot;Predicting hyperparameters from meta-features in binary classification problems&quot;</em> in AutoML
2018</li> <li><em>&quot;A Natural Language Driven Approach for Automated Web API Development&quot;</em> in WS-REST 2018</li> <li>and <em>&quot;Deep Reinforcement Learning for Doom using Unsupervised Auxiliary Tasks&quot;</em> on arXiv</li> </ul> </li> </ul> <h3 id="books-ive-read">Books I&#39;ve read</h3> <p>(not as many as I planned)</p> <ul> <li><em>&quot;The Barefoot Investor&quot;</em> by Scott Pape</li> <li><em>&quot;It Doesn&#39;t Have to Be Crazy at Work&quot;</em> by Jason Fried and David Heinemeier Hansson</li> </ul> <h3 id="conferences-i-went-to">Conferences I went to</h3> <ul> <li>Devit in Thessaloniki</li> <li>AutoML@ICML in Stockholm</li> <li>Voxxed days in Thessaloniki</li> </ul> <h3 id="weightlifting">Weightlifting</h3> <p>The personal records I set this year:</p> <ul> <li>Deadlift: 170kg x 1</li> <li>Squat: 165kg x 1</li> <li>Overhead Press: 72.5kg x 1</li> </ul> <h3 id="projects">Projects</h3> <p>The projects I mainly worked on are the following:</p> <ul> <li>Already running:<ul> <li>Completed the automated continuous integration and deployment pipeline for <a href="https://app.equadcapital.com">https://app.equadcapital.com</a></li> <li>Project management for the project: <em>&quot;Continuous Implicit Authentication on Mobile Devices and Kiosks through gestures&quot;</em>. Through this project we launched the mobile application <a href="http://brainrun.issel.ee.auth.gr">Brain Run</a>, which reached place 344 in the Play Store in the Games category.</li> <li>Mobile-Age H2020 project</li> </ul> </li> <li>New:<ul> <li>eeRIS: electric energy Residential Informational System</li> <li>VITAL: Versatile Internet of Things for AgricuLture</li> </ul> </li> </ul> <p>For these two new projects we started building an open source Big Data Management System (BDMS), like <a href="https://keen.io/">keen.io</a>, for handling and analyzing real-time event streams.
We named it <a href="https://github.com/AuthEceSoftEng/cenote">cenote</a>.</p> <h3 id="proposals">Proposals</h3> <p>Zero out of four (0/4) proposals for funding were accepted. On the positive side, we have enough previously funded proposals.</p> <h3 id="diploma-theses">Diploma Theses</h3> <p>Some diploma theses I co-supervised that were completed in 2018:</p> <ul> <li>Anastasios Kakouris: &quot;Continuous User Authentication in Web Applications through Behavioral Biometrics&quot;</li> <li>Napoleon-Christos Economou: &quot;Call by Meaning: Calling Software Components Based on Their Meaning&quot;</li> <li>Giorgos Konstantopoulos: &quot;Decentralized Metering and Billing of energy on Ethereum with respect to scalability and security&quot;</li> </ul> <h3 id="life-in-general">Life in general</h3> <ul> <li>Moved to a new apartment.</li> <li>Completed my academic CV (39 pages) and applied for three tenure-track positions in Greek universities.</li> <li>I also think I nailed down what I want to do R&amp;D on and what to be good at: <em>&quot;Autonomously improve the quality of software systems (what is called autonomic computing), either in the automatic Find Bugs-Fix-Verify (Fi-Fi-Verify) sense for software systems or in the life-long-learning sense for machine learning based systems (Software 2.0).&quot;</em></li> </ul> <h3 id="software">Software</h3> <p>Started working on some open source software projects:</p> <ul> <li><a href="https://github.com/cyclopt/jssa">jssa - javascript static analyzer</a>: JS static analyzer (jssa): an aggregation of javascript source code static analysis tools</li> <li><a href="https://github.com/cyclopt/js-starter-kit">js-starter-kit</a>: JS web application starter kit for the MERN stack, along with a software development lifecycle proposal</li> <li><a href="https://github.com/kyrcha/github-project-story-points">github-project-story-points</a>: forked and adapted</li> </ul> <h2 id="what-ive-learned">What I&#39;ve learned</h2> <p>I will quote
<a href="https://jamesclear.com/2018-annual-review">James Clear&#39;s</a> lesson learnt, which is exactly what I would want to write: <strong>&quot;Entrepreneurship is never as sexy on the inside as it appears on the outside.&quot;</strong></p> <p><strong>I cannot lose weight unless I burn more calories than I eat.</strong> I knew it, I&#39;ve read about it over and over again, my wife and friends tell me, but I don&#39;t want to believe it, or most probably don&#39;t want to do it, and think I can escape it by going to the gym more, doing keto, intermittent fasting and what not. Energy balance is the foundation of the <a href="https://muscleandstrengthpyramids.com/">nutrition pyramid</a>. Period.</p> <p><strong>The <a href="https://jamesclear.com/four-burners-theory">four burners theory</a> is a valid theory.</strong> You cannot do everything, i.e. health, work, family, friends, at your top performance, especially in my case, where work is split between academia and the start-up, and the family is a wife and two small kids.</p> <p><strong>Another valid theory is the willpower muscle</strong>, especially in combination with the four burners theory. I cannot have many burners on and at the same time expect myself to have the willpower to accomplish stuff in all or most of them.
What usually pays the bill is the health burner, both in terms of being overweight and in terms of stress.</p> <h2 id="2019">2019</h2> <p>The major professional and health goals for 2019 are:</p> <ul> <li>To go below 90 kgs and 20% body fat.</li> <li>To launch the Cyclopt chatbot and GitHub Application and reach a decent number of installations.</li> <li>Research and development on the JavaScript (node.js) ecosystem, with blog posts, open source software, papers and more.</li> <li>Write more blog posts than the previous year and read more books than the previous years (non-fiction and technological/scientific).</li> <li>Co-author more academic papers than the previous year: 4+</li> </ul> <p>Finally, some other annual reviews I read were:</p> <ul> <li><a href="https://buttondown.email/kentcdodds/archive/ca78f624-8ed7-463f-addd-a0d039d4dc3b">Kent C Dodds</a></li> </ul> <![CDATA[Coarse-to-Fine Decoding for Neural Semantic Parsing]]>http://kyrcha.info/2018/11/29/coarse-to-finehttp://kyrcha.info2018/11/29/coarse-to-fineThu, 29 Nov 2018 09:54:00 GMT<p><em><strong>Preamble</strong></em></p> <p><em>I decided to have a machine learning on software (MLSW) paper reading group on my own :) and to start writing short summaries (bits) on papers I read on the subject. I want them to serve as long-term memory for me and to make me write better summaries and reviews. I am not sure if they will be of use to anyone else, but in any case I make them public. I will start by reading all the papers from the <a href="https://github.com/src-d/awesome-machine-learning-on-source-code">awesome machine learning on source code</a> repo.</em></p> <h1 id="summary">Summary</h1> <p>The coarse2fine method learns semantic parsers from instances of natural language expressions paired with structured meaning representations that are machine interpretable. More specifically, the structured meaning representations are logical forms (λ-calculus), django (python) expressions and SQL queries.
As an example, the goal is to transform:</p> <pre><code>What record company did conductor Mikhail Snitko record for after 1996?</code></pre><p>into</p> <pre><code>SELECT Record Company WHERE (Year of Recording &gt; 1996) AND (Conductor = Mikhail Snitko)</code></pre><p>To do that, coarse2fine transforms the input <em>x</em> into a meaning sketch <em>a</em> and then into the final meaning representation <em>y</em>. Bi-directional LSTMs are used for encoding the input <em>x</em>, and RNNs with an attention mechanism are used to decode the encoded input into the abstract sketch <em>a</em>. A similar encoding-decoding scheme is used for transforming <em>a</em> to <em>y</em>. Of course, certain fine-tunings are added to account for differences among tasks.</p> <p>The experimental results show that coarse2fine does a pretty good job and is worth a closer look.</p> <p>The paper can be found <a href="https://arxiv.org/pdf/1805.04793.pdf">here</a> and the code is provided <a href="https://github.com/donglixp/coarse2fine">here</a>.</p> <![CDATA[Devit 2018 takeaways and notes]]>http://kyrcha.info/2018/06/13/devit-2018-takeaways-and-noteshttp://kyrcha.info2018/06/13/devit-2018-takeaways-and-notesWed, 13 Jun 2018 11:10:00 GMT<p>Some notes and takeaways from the Devit 2018 conference, which took place in Thessaloniki on Monday, June 11th, 2018, and which will also help my long-term memory.</p> <h2 id="from-david-platt">From <a href="http://www.whysoftwaresucks.com/">David Platt</a></h2> <p>Lively presentation!</p> <p>Reading:</p> <ul> <li><a href="https://www.joyofux.com/">The joy of ux</a></li> </ul> <h2 id="from-pawel-dudek">From <a href="https://twitter.com/eldudi">Pawel Dudek</a>:</h2> <p>&quot;There is no such thing as untestable behavior&quot;. If you cannot test your code, then you haven&#39;t architected it correctly.
&quot;Tests drive the architecture of the app&quot;.</p> <p>To checkout:</p> <iframe src="https://player.vimeo.com/video/12350535" width="640" height="360" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> <p><a href="https://vimeo.com/12350535">2009 - Sandi Metz - SOLID Object-Oriented Design</a> from <a href="https://vimeo.com/goruco">Gotham Ruby Conference</a> on <a href="https://vimeo.com">Vimeo</a>.</p> <h2 id="from-cheryl-platz">From <a href="http://www.cherylplatz.com/">Cheryl Platz</a>:</h2> <p>Ask difficult questions when you design and build apps in the era of AI and conversational bots, like:</p> <ul> <li>How will this make the world better or worse?</li> <li>If we are successful, how will customers be harmed?</li> <li>How can customers abuse our product?</li> <li>What is the worst-case impact our product could have?</li> </ul> <p>To checkout:</p> <ul> <li><a href="https://www.artefactgroup.com/the-tarot-cards-of-tech/">The tarot cards of tech</a> for humanity-centered design</li> <li><a href="https://medium.com/mule-design/on-surveys-5a73dda5e9a0">On Surveys</a> for UI/UX</li> <li><a href="http://www.practicalethnography.com/">The practical ethnography book</a></li> <li><a href="https://www.microsoft.com/en-us/design/inclusive">Microsoft&#39;s Inclusive Design</a> thinking</li> <li><a href="https://dscout.com/">dscout</a></li> </ul> <p>Small note: <em>Device-agnostic architecture</em> - move customer data and telemetry to the cloud, don&#39;t leave them on the device.
This way the user has a seamless flow when moving between devices.</p> <p>In general, the theme from both David Platt&#39;s and Cheryl Platz&#39;s talks was similar to <a href="http://momtestbook.com/">the mom test</a> book:</p> <ul> <li>&quot;Don’t build products your customers don’t need&quot;</li> <li>&quot;Give customers what they want for a price they are willing to pay&quot;</li> <li>Don&#39;t write code before showing mockups to users/customers (writing code first leads to expensive, lost time)</li> </ul> <h2 id="from-the-panel-on-privacy">From the panel on privacy</h2> <p>I learned that according to <a href="https://www.researchgate.net/publication/281007197_The_cost_of_reading_privacy_policies">research</a> a person would have to spend 76 work days per year to read all the privacy policies they encounter on the internet.</p> <h2 id="from-ingrid-epure">From <a href="https://twitter.com/ingridepure">Ingrid Epure</a></h2> <p>Notes:</p> <ul> <li>Add thresholds in dashboards so the user does not have to think so hard about a metric</li> </ul> <p>Check out:</p> <ul> <li><a href="https://www.researchgate.net/publication/228797158_How_complex_systems_fail">How complex systems fail</a> paper</li> <li><a href="http://opentracing.io/">http://opentracing.io/</a></li> <li><a href="https://www.honeycomb.io/">https://www.honeycomb.io/</a></li> <li>Tom Wilkie&#39;s RED method on the <a href="https://www.slideshare.net/weaveworks/monitoring-weave-cloud-with-prometheus/10?src=clipshare">metrics you need to monitor</a></li> </ul> <h2 id="from-julien-simon">From <a href="https://medium.com/@julsimon">Julien Simon</a></h2> <p>Fun demo with a <a href="https://twitter.com/callmejohnnypi">small robot</a> on <a href="https://medium.com/@julsimon/johnny-pi-i-am-your-father-part-8-reading-translating-and-more-c22f7b8275cc">how one can use AWS</a> and move all computations to the cloud with various AWS services.</p> <p>I am thinking that from now on it doesn&#39;t really make sense to build and deploy your own
models that, for example, do face recognition or text translation. It isn&#39;t worth the time and effort.</p> <p>Also check out:</p> <ul> <li><a href="https://www.computer.org/csdl/mags/ic/2017/03/mic2017030012.html">Two Decades of Recommender Systems at Amazon.com</a> paper</li> </ul> <![CDATA[Machine learning tutorials mini-site]]>http://kyrcha.info/2016/11/10/machine-learning-tutorials-mini-sitehttp://kyrcha.info2016/11/10/machine-learning-tutorials-mini-siteThu, 10 Nov 2016 15:19:00 GMT<p><img src="//images.contentful.com/c5lel8y1n83c/4CRbkIFds4OIqqi2EYagSq/69e91959cbfb0394b044e0072b597273/rmarkdown.png" alt="Rmarkdown"></p> <p>It’s been ages since I wrote my last post. I am planning to be more active from now on (I hope).</p> <p>I’ve been wanting to make a mini-site with machine learning tutorials for years and finally here it is!</p> <p>The mini-site is <a href="http://ml-tutorials.kyrcha.info">ml-tutorials.kyrcha.info</a> and its GitHub repo: <a href="https://github.com/kyrcha/ml-tutorials">https://github.com/kyrcha/ml-tutorials</a></p> <p>The main reason for finally getting through it was that I started teaching two data mining courses in two postgraduate programs (one in the fall and one in the spring semester, with different audiences) and I wanted to have some notes to give to students, with R implementations of the algorithms I teach in theory in the classroom. The mini-site also includes introductory material to R to help you get familiar with it.</p> <p>At the moment I only discuss the R specifics of the algorithms, but my plan is to add some theory to each algorithm as well in order to make the tutorials more standalone.</p> <p>For creating the site I used <a href="http://rmarkdown.rstudio.com/rmarkdown_websites.html">R Markdown Websites</a> and <a href="https://www.rstudio.com/">RStudio</a>.
A great resource is this <a href="https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf">cheatsheet</a>.</p> <h3 id="jupyter-vs-r-markdown">Jupyter vs. R Markdown</h3> <p>I started this effort by working with <a href="http://jupyter.org/">Jupyter</a> notebooks with an R kernel, but the reasons that made me switch to R Markdown and RStudio were that:</p> <ol> <li>You can create mini-sites like this out of the box.</li> <li>I ran into problems when I tried to render the Jupyter notebooks to PDF to hand out to students.</li> <li>R Markdown is in Markdown and not in JSON, so it is easier to edit with a text editor.</li> <li>It works well with GitHub and GitHub Pages project sites.</li> </ol> <h3 id="deployment">Deployment</h3> <p>I wanted GitHub to serve the rendered HTML pages via the GitHub Pages project site functionality, using a custom domain: the subdomain <strong>ml-tutorials.kyrcha.info</strong>. After searching a bit on the internet, I set it up as follows:</p> <p><strong>Step 1:</strong> Configured the site rendering tool to put the generated HTML files in a docs folder.</p> <p><strong>Step 2:</strong> Added a footer with a new Google Analytics property to check out the traffic.</p> <p><strong>Step 3:</strong> In the repo settings on GitHub I added:</p> <p><img src="//images.ctfassets.net/c5lel8y1n83c/6cXYM7UGUEEycUwWQ2YasE/a65ccaead13cd8c4fbba74150713f63a/github-pages.png" alt="github-pages"></p> <p>The above will add a CNAME file in the docs folder.
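</p> <p>For reference, a minimal <code>_site.yml</code> sketch for this kind of setup (the <code>name</code> value here is a placeholder; <code>output_dir</code> corresponds to Step 1):</p> <pre><code>name: "ml-tutorials"
output_dir: "docs"
include: ["CNAME"]</code></pre> <p>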
Since the docs folder is deleted and re-created when rendering the site, I keep the CNAME file in the root folder of the project and, in the <code>_site.yml</code> configuration file, added <code>include: [&quot;CNAME&quot;]</code> so that it is copied into the docs folder every time the site is rendered.</p> <p><strong>Step 4:</strong> Finally, I created a CNAME record at my DNS provider, with <code>name: ml-tutorials</code> and <code>value: kyrcha.github.io</code>.</p> <p><img src="//images.ctfassets.net/c5lel8y1n83c/5XkGzV87GosEScawyO466g/0bfbe0d0d83106e9dc799d3695f27b9f/custom-dns.png" alt="custom-dns"></p> <p>Now <a href="http://ml-tutorials.kyrcha.info/">http://ml-tutorials.kyrcha.info/</a> shows whatever is served from GitHub Pages at <a href="https://kyrcha.github.io/ml-tutorials">https://kyrcha.github.io/ml-tutorials</a>, and <a href="https://kyrcha.github.io/ml-tutorials">https://kyrcha.github.io/ml-tutorials</a> redirects to <a href="http://ml-tutorials.kyrcha.info/">http://ml-tutorials.kyrcha.info/</a>.</p> <p>Whenever I want to add a new tutorial or update an older one I:</p> <ol> <li>Make the changes in my Rmd files.</li> <li>Render the site: <code>rmarkdown::render_site()</code></li> <li>Do a git add and a git commit in the local repository and push both the source and the rendered HTML pages to GitHub.</li> <li>If I want to render a specific page to PDF I enter: <code>rmarkdown::render(&quot;knn.Rmd&quot;, output_format=&quot;pdf_document&quot;)</code></li> </ol> <![CDATA[The S-CASE concept]]>http://kyrcha.info/2014/10/24/the-s-case-concepthttp://kyrcha.info2014/10/24/the-s-case-conceptFri, 24 Oct 2014 10:02:00 GMT<p><em>This is <a href="http://www.scasefp7.eu/2014/10/24/s-case-blog-scase-concept/">a post I wrote for the S-CASE project blog</a>. <a href="http://www.scasefp7.eu/">S-CASE or Scaffolding Scalable Software Services</a> is an EU-funded FP7 project I am currently working on as a technical coordinator.
The post below describes what the project is about.</em></p> <p>The <span class="highlight">S-CASE</span> project is about semi-automatically creating RESTful web services from multi-modal requirements, using a Model Driven Engineering methodology. The world of web services is moving towards REST and <span class="highlight">S-CASE</span> aims at helping developers implement such web services by focusing mainly on requirements engineering. The figure below depicts the basic components and the basic flow of events/data in <span class="highlight">S-CASE</span>.</p> <p><img src="//images.contentful.com/c5lel8y1n83c/1fTEOSSUnGo20CoI0yck8q/025e89aef6478cc54072204ef40948fd/S-CASE-workflow.png" alt="S-CASE workflow"></p> <div><span style="font-size: 18px;"><strong>Typical use case scenario</strong></span></div> Through the <span class="highlight">S-CASE</span> IDE the user imports or creates multi-modal requirements for his/her envisioned application. The requirements may be: <ul> <li>Textual requirements in the form “The user/system must be able to …”,</li> <li>UML activity and use case diagrams created in the platform or imported as images,</li> <li>Storyboards for flow charting, and</li> <li>Analysis class diagrams to improve the accuracy of the system in identifying entities, their properties and their relationships.</li> </ul> The requirements are then processed with natural language processing and image analysis techniques in order to extract relevant software engineering concepts: mainly RESTful resources, their properties and relations, and Create-Read-Update-Delete (CRUD) actions on those resources. All these concepts are stored in the <span class="highlight">S-CASE</span> ontology.
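<p>As a toy illustration of that extraction step (this is <em>not</em> the actual S-CASE pipeline, which relies on proper NLP and image analysis; the template, verb map and helper below are made up for the example), requirements following the “The user must be able to …” form can be mapped to action-resource tuples:</p>

```javascript
// Toy sketch: map templated requirements to CRUD action-resource tuples.
// A naive regular expression stands in for the real NLP machinery.
const CRUD = {
  create: 'CREATE', add: 'CREATE',
  get: 'READ', read: 'READ',
  update: 'UPDATE', edit: 'UPDATE',
  delete: 'DELETE', remove: 'DELETE'
};

function extractTuples(requirements) {
  const pattern = /must be able to (\w+) (?:a |an |the )?([\w ]+)/i;
  return requirements
    .map(function (req) {
      const m = pattern.exec(req);
      if (!m) return null;                       // not in the template form
      const action = CRUD[m[1].toLowerCase()];   // verb -> CRUD action
      if (!action) return null;                  // unknown verb
      return { action: action, resource: m[2].trim() };
    })
    .filter(Boolean);
}

console.log(extractTuples([
  'The user must be able to create a bookmark',
  'The user must be able to delete a bookmark'
]));
// [ { action: 'CREATE', resource: 'bookmark' },
//   { action: 'DELETE', resource: 'bookmark' } ]
```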
<p>The above procedure also identifies action-resource tuples that can be created automatically by the system, like the action-resource “create bookmark” (automatically built), or others that need more elaborate processes, like “get the weather given geolocation coordinates” (semi-automatically built or composed). The latter are sent to the Web Services Synthesis and Composition module.</p> <p>The Web Services Synthesis and Composition module tries to synthesize elaborate processes by composing 3rd party web services into a single <span class="highlight">S-CASE</span> composite web service. To perform such a computation, <span class="highlight">S-CASE</span> provides a methodology for semantically annotating 3rd party web services using <span class="highlight">S-CASE</span> domain ontologies, so that they can later be matched to the requirements of the composite service. The composite service is deployed to the YouREST deployment environment and registered in the directory of <span class="highlight">S-CASE</span> web services for future reference and re-use.</p> <p>Upon completing the stages above, the model driven engineering procedure initiates. The first step is to create the Computation Independent Model (CIM) out of the <span class="highlight">S-CASE</span> ontology. The CIM contains the bare minimum information needed to scaffold a REST service that adheres to the requirements imposed by the user, i.e. it includes all the problem’s domain concepts. After that, model transformations turn the CIM into a Platform Independent Model (PIM), which incorporates design constraints but remains platform independent, and then into a Platform Specific Model (PSM), which adds support for implementing the PIM with a specific suite of software tools (Java, JAX-RS, Hibernate, JSON, JAXB, PostgreSQL, etc.). The final step is to automatically generate the code of the web service. Calls to composite services are wrapped inside the generated code.
The code is built and deployed to YouREST for others to use.</p> <p>In order to support software re-use, every software artifact created by this procedure is stored in the <span class="highlight">S-CASE</span> repository for future retrieval.</p> <p>Through <span class="highlight">S-CASE</span> we plan to develop an ecosystem of services, along with the appropriate tools for service providers to develop quality software for SMEs with an affordable budget.</p> <![CDATA[Searchable, scrollable bootstrap dropdown with angularjs]]>http://kyrcha.info/2014/10/23/searchable-scrollable-dropdown-button-using-angularjs-and-bootstraphttp://kyrcha.info2014/10/23/searchable-scrollable-dropdown-button-using-angularjs-and-bootstrapThu, 23 Oct 2014 02:54:00 GMT<p>So you are working in AngularJS, you are using the Bootstrap framework, and the requirement is to create a <a href="http://getbootstrap.com/components/#btn-dropdowns">dropdown button</a> which will include several (list) items and that a) is scrollable and b) is searchable, because the menu items are many.</p> <p>The following code presents a solution to the above problem.</p> <iframe src="https://jsfiddle.net/kyrcha/ULSy3/6/embedded/result,html,js,css,resources" width="100%" height="300" frameborder="0" allowfullscreen="allowfullscreen"></iframe> <p>We created a dropdown button with menu items coming from the Angular controller. At the top of the menu an <code>input</code> element is added as a list item and bound to the scope variable <code>query</code>. This will act as the filter in the <code>ng-repeat</code> directive. The problem is that at this point clicking inside the input element will instantly close the dropdown, since the event is propagated up the DOM tree.
Thus the jQuery <a href="http://api.jquery.com/event.stoppropagation/">stopPropagation</a> method is used to stop the event from bubbling up.</p> <![CDATA[Book Review: eCommerce in the Cloud by Kelly Goetsch - O'Reilly]]>http://kyrcha.info/2014/10/22/reviews-book-review-ecommerce-cloud-kelly-goetsch-oreillyhttp://kyrcha.info2014/10/22/reviews-book-review-ecommerce-cloud-kelly-goetsch-oreillyMon, 20 Oct 2014 10:56:00 GMT<iframe style="width: 120px; height: 240px;" src="//ws-na.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=US&amp;source=ac&amp;ref=qf_sp_asin_til&amp;ad_type=product_link&amp;tracking_id=stemfull-20&amp;marketplace=amazon&amp;region=US&amp;placement=1491946636&amp;asins=1491946636&amp;linkId=RII46BLBRSYAUGFR&amp;show_border=false&amp;link_opens_in_new_window=true" width="300" height="150" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"> </iframe> <p>Author Kelly Goetsch, a product manager focusing on large-scale eCommerce solutions, aims at educating eCommerce stakeholders on whether, why and how they could move their IT infrastructure to the cloud.</p> <p>The book is quite easy to read, mainly because the presentation of the technologies and techniques is kept at a high level.</p> <p>Topics of the book include:</p> <ul> <li>Cloud computing related terminology</li> <li>Cloud architectures</li> <li>Availability: how to avoid outages</li> <li>Performance: performing transactions in a reasonable amount of time</li> <li>Automation: reducing errors</li> <li>Elasticity: scaling up and down</li> <li>Security</li> </ul> I would say that the book is suitable for owners and managers of medium to large eCommerce businesses, and for novices in cloud technologies and distributed computing, who would like to learn the terminology and better communicate with their IT personnel about cloud solutions.
<p><a href="https://www.oreilly.com/reviews/"><img src="https://cdn.oreillystatic.com/bloggers/blogger-review-badge-125.png" alt="I review for the O'Reilly Reader Review Program" width="125" height="125" border="0" /></a></p> <p><strong>Update 2015-09-18:</strong> This review was part of the <a href="http://www.oreilly.com/reviews/">O&#39;Reilly Reader Review Program</a>, which is no longer available.</p> <![CDATA[SSL/HTTPS server with Node.js and Express.js]]>http://kyrcha.info/2014/10/14/sslhttps-server-nodejs-expressjshttp://kyrcha.info2014/10/14/sslhttps-server-nodejs-expressjsTue, 14 Oct 2014 09:56:00 GMT<p>So let’s assume the requirement is to create an HTTPS server that redirects traffic to https whenever a request reaches the server over plain http. I created this little guide by bundling together a couple of links related to the subject.</p> <p>We will begin by quickly creating a project using the <a href="http://expressjs.com/guide.html">express-generator</a>:</p> <pre><code>$ express https-server
$ cd https-server &amp;&amp; npm install
$ npm start</code></pre><p>The server should be running at <code>http://localhost:3000/</code>. Now let’s create the certificates (<a href="http://heyrod.com/snippet/s/node-https-ssl.html">Reference</a>):</p> <pre><code>$ openssl genrsa 1024 &gt; file.pem
$ openssl req -new -key file.pem -out csr.pem
$ openssl x509 -req -days 365 -in csr.pem -signkey file.pem -out file.crt</code></pre><p>We assumed no passphrase was used. We can then read the certificates in the starting point file <code>www</code>:</p> <pre><code>var fs = require(&#39;fs&#39;);

var config = {
  key: fs.readFileSync(&#39;file.pem&#39;),
  cert: fs.readFileSync(&#39;file.crt&#39;)
};</code></pre><p>The next step is to create two servers, one to listen on http and port 3000 and one on https and port 8000 (<a href="http://expressjs.com/4x/api.html#app.listen">Reference</a>).
The <code>www</code> file becomes:</p> <pre><code>#!/usr/bin/env node
var debug = require(&#39;debug&#39;)(&#39;https-server&#39;);
var app = require(&#39;../app&#39;);
var https = require(&#39;https&#39;);
var http = require(&#39;http&#39;);
var fs = require(&#39;fs&#39;);

var config = {
  key: fs.readFileSync(&#39;file.pem&#39;),
  cert: fs.readFileSync(&#39;file.crt&#39;)
};

http.createServer(app).listen(3000);
https.createServer(config, app).listen(8000);</code></pre><p>Now one can navigate both to <code>http://localhost:3000</code> and <code>https://localhost:8000</code> and get the same response; in the latter case with the usual “proceed with caution” notice, since the certificate is not signed by a trusted authority.</p> <p>The last step is to redirect traffic that comes in over http to https, by using a middleware for all routes (<a href="http://stackoverflow.com/a/24015460/869151">Reference</a>):</p> <pre><code>function ensureSecure(req, res, next) {
  if (req.secure) { return next(); }
  res.redirect(&#39;https://&#39; + req.host + &#39;:&#39; + 8000 + req.url);
}

app.all(&#39;*&#39;, ensureSecure);
app.use(&#39;/&#39;, routes);
app.use(&#39;/users&#39;, users);</code></pre><p>So <code>http://localhost:3000</code> and <code>http://localhost:3000/users</code> redirect to <code>https://localhost:8000</code> and <code>https://localhost:8000/users</code> respectively.</p> <p>The complete code can be found on <a href="https://github.com/kyrcha/blog-code/tree/master/https-server">GitHub</a>.</p> <p>Last but not least, in production you can redirect traffic to the standard http and https ports like in this <a href="http://stackoverflow.com/a/7458587/869151">reference</a>.</p> <![CDATA[Introductory post: Going MEAN]]>http://kyrcha.info/2014/10/10/introductory-post-going-meanhttp://kyrcha.info2014/10/10/introductory-post-going-meanFri, 10 Oct 2014 10:43:00 GMT<p><em>This was the first post I wrote for the <a href="http://meanstack.info">meanstack.info</a> blog I had created for all things MEAN,
now merged with <a href="http://kyrcha.info">kyrcha.info</a>, the site you are at.</em></p> <p dir="ltr">Dear visitor,</p> <p dir="ltr">Hi! My name is Kyriakos Chatzidimitriou. If you would like, you can find out more on <a title="about.me page for Kyriakos Chatzidimitriou" href="http://about.me/kyrcha">about.me</a>. I like to consider myself an intelligent systems, data and software engineer, and this is my blog about the MEAN stack, i.e. MongoDB, ExpressJS, AngularJS and NodeJS, and of course about JavaScript and JavaScript libraries in general.</p> <p dir="ltr">At a certain point during the last couple of years, after reading some inspiring books and along with the rise of cloud computing, the software-as-a-service paradigm and start-ups, I wanted to start building things and creating real products that provide real value to real customers.</p> <p dir="ltr">Being a polyglot has many merits, since for example you can learn a lot by studying other programming languages and give yourself a fresh perspective on your current dev stack (see <a href="http://euruko2013.org/speakers/#matz">Matz’s talk</a> on being a language designer at Euruko 2013), and it is something I am actively pursuing. Still, I also found fascinating the idea that you could have “<em>one language to rule them all</em>”: a lingua franca for building SaaS applications, from the database, to the server side, to the client side. In that respect, MongoDB, NodeJS and ExpressJS were no-brainers to pick for my main dev stack. The last thing was to decide which client-side JS framework to pick up: BackboneJS, EmberJS, AngularJS, CanJS, other? Again, after some digging around, I decided to go for AngularJS and complete the puzzle.
I’d like to devote a couple of lines to the posts of other developers that got me started with the MEAN stack and helped me decide:</p> <ul> <li>The <a href="http://blog.mongodb.org/post/49262866911/the-mean-stack-mongodb-expressjs-angularjs-and">MEAN stack post</a> on MongoDB’s blog</li> <li>A <a href="http://sporto.github.io/blog/2013/04/12/comparison-angular-backbone-can-ember/">comparison post</a> on the client-side JS frameworks</li> <li><a href="http://briantford.com/blog/angular-express.html">A way</a> to integrate NodeJS, ExpressJS and AngularJS</li> </ul> By no means do I consider myself at this point to be an expert on the MEAN stack. I started using the MEAN stack in September 2013, I’ll always be learning, and along the way I am making this process public. If others can benefit from it, all the better. My familiarity with the JavaScript language and its frameworks is just getting started, so bear with me if you spot any mistakes in my use of JavaScript. I promise I’ll get better. <p dir="ltr">I am starting this blog so that it can:</p> <ul> <li>give other developers the help I got from blogs like the ones above,</li> <li>make me a better MEAN stack developer, by forcing me to organize my thoughts in order to write posts open to public criticism,</li> <li>create a link to the MEAN stack community and bring in feedback,</li> <li>act as long-term memory storage for practices and techniques I am working on, and</li> <li>serve as a reference for future coworkers that are starting with the MEAN stack.</li> </ul> These are my adventures in the world of the MEAN stack … <p>Best,</p> <p>– Kyriakos Chatzidimitriou</p> <p>PS 1. Some links are affiliate links, which, if you use them, will make it easier for me to maintain the site and get even more books, to learn more stuff and write even better posts.</p> <p>PS 2.
Occasionally, M will mean MySQL, since a) some problems suit document databases and others relational ones, and b) I really like the <a href="http://sequelizejs.com/">Sequelize</a> framework.</p> <![CDATA[Calculating the fractal dimension of the Greek coastline (1.25)]]>http://kyrcha.info/2013/04/19/calculating-the-fractal-dimension-of-the-greek-coastline-1-25http://kyrcha.info2013/04/19/calculating-the-fractal-dimension-of-the-greek-coastline-1-25Fri, 19 Apr 2013 00:54:00 GMT<p><a href="https://commons.wikimedia.org/wiki/File:Great_Britain_Box.svg#/media/File:Great_Britain_Box.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/2/28/Great_Britain_Box.svg" alt="Great Britain Box.svg" width="640" height="355" /></a> &quot;<a href="https://commons.wikimedia.org/wiki/File:Great_Britain_Box.svg#/media/File:Great_Britain_Box.svg">Great Britain Box</a>&quot; by <a title="User:Prokofiev" href="//commons.wikimedia.org/wiki/User:Prokofiev">Prokofiev</a> - <span class="int-own-work" lang="en">Own work</span>.
Licensed under <a title="Creative Commons Attribution-Share Alike 3.0" href="http://creativecommons.org/licenses/by-sa/3.0">CC BY-SA 3.0</a> via <a href="//commons.wikimedia.org/wiki/">Wikimedia Commons</a>.</p> <p>Inspired by the <a href="http://www.complexityexplorer.org/">Introduction to Complexity</a> course and the unit on <em>Fractals</em>, I thought it would be fun to make a rough calculation of the fractal dimension of the Greek coastline using the <a href="http://en.wikipedia.org/wiki/Minkowski%E2%80%93Bouligand_dimension">box counting method</a>.</p> <p>The box counting method goes as follows:</p> <ol> <li>Split the 2D map that depicts the coastline into squares (boxes) of a certain side size (<em>r</em>) and count the number of boxes (<em>n</em>) that include a piece of the coastline.</li> <li>Decrease the size of the boxes and go to step 1.</li> <li>When finished for a series of box sizes, do a linear regression of log(n) on log(1/r).</li> <li>The slope of the line fitting the points on the plot is the fractal dimension of the object, since n ∝ (1/r)<sup>D</sup> implies log(n) = D log(1/r) + c.</li> </ol> For a map of Greece, I used the one from <a href="http://www.ginkgomaps.com/maps_greece.html">Ginkgo maps</a>, licensed under the Creative Commons Attribution 3.0. Via an image editor, I removed the frame with the infobox and the geolocation axes, plus the borders that are not coastline, to facilitate further image processing. The retouched image was cropped to 1600×1600 pixels. Both images are shown below.
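<p>As a quick sanity check of steps 3 and 4: if the box counts follow the scaling law exactly, a least-squares fit of log(n) against log(1/r) recovers the dimension as the slope. A toy sketch (the constants and helper below are made up for the example; the box sides are the ones used in the R script of this post):</p>

```javascript
// Toy check: generate counts n = C * (1/r)^D for a known D,
// then recover D as the slope of the log-log regression.
function slope(xs, ys) {
  const n = xs.length;
  const mx = xs.reduce(function (a, b) { return a + b; }, 0) / n;
  const my = ys.reduce(function (a, b) { return a + b; }, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    den += (xs[i] - mx) * (xs[i] - mx);
  }
  return num / den; // least-squares slope of ys on xs
}

const D = 1.25;                                       // dimension to recover
const sides = [50, 40, 32, 25, 20, 16, 10, 8, 5, 4];  // box sides in pixels
const xs = sides.map(function (r) { return Math.log(1 / r); });
const ys = xs.map(function (x) { return D * x + Math.log(100); }); // log(n)

console.log(slope(xs, ys)); // ~1.25
```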
<h2 style="text-align: center;">Original map</h2> <p><img src="//images.ctfassets.net/c5lel8y1n83c/3tuG98i600mSE8m4iOGyk6/fd28382897d252ce368a76a038648744/rl3c_gr_greece_map_plaindcw_ja_hres.jpg" alt="rl3c gr greece map plaindcw ja hres"></p> <h2 style="text-align: center;">Retouched map</h2> <p><img src="//images.ctfassets.net/c5lel8y1n83c/4Q8fD1fBWgi4ok2sque40k/0017b1f432d33dd4b4bdc2db0f8e3700/rl3c_gr_greece_map_plaindcw_ja_hres_retouched.jpg" alt="rl3c gr greece map plaindcw ja hres retouched"></p> <p style="text-align: left;">The R script below implements the box counting method on the coastline JPEG picture (it makes all values &gt; 0.5 white) using boxes with sides that are divisors of 1600.</p> <h2 style="text-align: center;">The coastline map</h2> <p><img src="//images.ctfassets.net/c5lel8y1n83c/2BfGR9L9G8syqw6Iykagc4/830bf6a98496aaaba473708d7c296c20/coastline.jpeg" alt="coastline"></p> <h2 style="text-align: center;">The R script</h2> <pre class="prettyprint"><code class="language-r">library(jpeg)

rm(list=ls())

img = readJPEG("coastline.jpeg")

# filter out mainland
img[img &gt; 0.5] = 1

# divisors of 1600:
# 1,2,4,5,8,10,16,20,25,32,40,50,64,80,100,160,200,320,400,800,1600
boxSizes = c(50, 40, 32, 25, 20, 16, 10, 8, 5, 4)

h = img[,,1]
data = data.frame()
for(i in 1:length(boxSizes)) {
  b = boxSizes[i]
  x = dim(img[,,1])[1]
  ratio = x/b
  # https://stat.ethz.ch/pipermail/r-help/2012-February/303163.html
  blocks = kronecker(matrix(1:(ratio^2), ratio, byrow = TRUE), matrix(1,b,b))
  g = lapply(split(h,blocks), matrix, nr = b)
  counter = 0
  for(j in 1:length(g)) {
    counter = counter + any(g[[j]] &lt; 0.999)
  }
  data = rbind(data, c(log(counter), log(1/b)))
}
names(data) = c("Y", "X")
model = lm(Y~., data=data)
cat(coef(model), "\n")</code></pre> <h2 style="text-align: center;">The plot</h2> <p><img src="//images.ctfassets.net/c5lel8y1n83c/2S77qITbWMC0guKMiAecai/140fed5c8f6f6f86c985af67f503954f/plot.jpg" alt="plot"></p> <p style="text-align: left;">With this rough approximation, the
calculation yielded that <strong>the fractal dimension of the Greek coastline is 1.25</strong>. Great Britain’s was measured to be 1.25 and Norway’s 1.52 [<a href="http://en.wikipedia.org/wiki/List_of_fractals_by_Hausdorff_dimension">source</a>].</p> <![CDATA[2013 and beyond, todo list]]>http://kyrcha.info/2012/12/23/2013-and-beyond-todo-listhttp://kyrcha.info2012/12/23/2013-and-beyond-todo-listSat, 22 Dec 2012 22:00:00 GMT<p>This time my new year&#8217;s resolutions are here to stay. For life. I hope sometime soon to form them into my <em>personal constitution</em>, relating to the pro-activeness habit, one of the <a href="http://amzn.to/Ve4otk">seven habits of highly effective people</a>. In addition, I plan to have <a href="http://chrisguillebeau.com/3x5/how-to-conduct-your-own-annual-review/">an annual review</a> for keeping up with more specific roles and goals for 2013. To cut things short, my todo list is:</p> <ol> <li>To live a life true to myself</li> <li>To not work so hard</li> <li>To have the courage to express my feelings</li> <li>To stay in touch with my friends</li> <li>To be happier</li> <li>To aim high</li> <li>To be modest (&#8220;You don&#8217;t know what you don&#8217;t know&#8221;)</li> <li>To have passion</li> <li>To build my character, taking into account values and virtues I admire in others</li> <li>To believe in myself</li> <li>To work with people I like and have fun with</li> <li>To be surrounded by people with positive energy</li> <li>To be patient and not give up</li> <li>To admit my mistakes</li> <li>To be lucky</li> </ol> <p>OK, I know the last one is not up to me, but I interpret it as &#8220;Don&#8217;t run for trains&#8221;.
<em>Note: you must have read the <a href="http://amzn.to/YBKnjo">Black Swan book</a> in order to understand this one.</em></p> <p>The first five are taken from the blog post &#8220;<a href="http://www.inspirationandchai.com/Regrets-of-the-Dying.html">Regrets of the Dying</a>&#8221;, while the next ten are from <a href="http://youtu.be/lxdA0ey3Rss?t=50m53s">a talk by Nikos Stathopoulos</a> (in Greek) about the habits of highly effective people.</p><![CDATA[Budapest trip]]>http://kyrcha.info/2012/08/18/budapest-triphttp://kyrcha.info2012/08/18/budapest-tripSat, 18 Aug 2012 04:35:00 GMT<p>The last time I visited Budapest was the summer of 2002, during my IAESTE internship at Elcoteq in Pécs. This is a log of our trip during the summer of 2012.</p> <h3>Day 1</h3> <p>First time flying Ryanair &#8211; The flight was delayed a bit so we didn&#8217;t get to hear their jingle &#8211; Taxi booked online (20€) prior to departure &#8211; When we arrived there was an incident with an unattended bag in the parking lot but everything turned out to be OK &#8211; After checking in at the hotel, we went out for a walk to locate 0-24 hour shops nearby and walked in Vaci utca and by the Danube.</p> <p><a data-flickr-embed="true" href="https://www.flickr.com/photos/kyrcha/7789047922/in/album-72157631082268964/" title="Royal palace"><img src="https://farm9.staticflickr.com/8283/7789047922_180054f3a2.jpg" width="500" height="281" alt="Royal palace"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script></p> <h3>Day 2</h3> <p>Breakfast at Cafe Gerbaud &#8211; Got the 72-hour transportation tickets (3850 Forints Per Person &#8211; FPP) for the metro, tram and buses &#8211; Started the walk: Vienna gate =&gt; Fisherman&#8217;s bastion =&gt; Royal palace =&gt; Tram to Gellert hill =&gt; <a href="http://en.wikipedia.org/wiki/Citadella">Citadella</a> =&gt; Liberty bridge =&gt; Small stop at the market =&gt; Raday utca for lunch and then Cafe Central for dessert
&#8211; In the evening we went for the standard walk on Vaci, the Danube and the Chain bridge &#8211; Later we visited Fashion street, where they sell stove cakes (<a href="http://en.wikipedia.org/wiki/K%C3%BCrt%C5%91skal%C3%A1cs">kurtoskalacs</a>) and <a href="http://en.wikipedia.org/wiki/Langos">langos</a> bread. As for the stove cakes, I liked the one with cinnamon more than the one with cocoa.</p> <p><a data-flickr-embed="true" href="https://www.flickr.com/photos/kyrcha/7789031736/in/album-72157631082268964/" title="Fisherman&#x27;s bastion"><img src="https://farm8.staticflickr.com/7253/7789031736_988f56d823.jpg" width="500" height="334" alt="Fisherman&#x27;s bastion"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script></p> <h3>Day 3</h3> <p>These kurtoskalacs make for a great breakfast &#8211; Today&#8217;s tour started from Szent István Bazilika &#8211; Took the lift up to the dome for a panoramic view of the city &#8211; Then to Szabadság tér, a plaza with a pressure-aware fountain, and the Parliament &#8211; This was our first attempt to enter (tours sold out early since it was Sunday) &#8211; Margaret Island &#8211; Walked to the end and took the bus back &#8211; Great Synagogue (largest in Europe and second largest in the world, but largest in capacity) &#8211; English tour included in the ticket &#8211; The &#8220;For Sale&#8221; pub for goulash soup &#8211; Again the standard walk on Vaci, the Danube and the Chain bridge.</p> <p><a data-flickr-embed="true" href="https://www.flickr.com/photos/kyrcha/7789050830/in/album-72157631082268964/" title="Tram lines"><img src="https://farm8.staticflickr.com/7253/7789050830_9ef62d9b81.jpg" width="500" height="281" alt="Tram lines"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script></p> <h3>Day 4</h3> <p>Woke up early &#8211; waited in line to enter the Parliament (see my tip on 4sq and why you must book online first) &#8211; then off to the market &#8211; tried fried
langos &#8211; the upper level of the market was pretty packed and seemed to me more of a touristy place than a traditional Hungarian spot &#8211; visited the renowned New York cafe.</p> <p><a data-flickr-embed="true" href="https://www.flickr.com/photos/kyrcha/7789043548/in/album-72157631082268964/" title="New York cafe"><img src="https://farm9.staticflickr.com/8289/7789043548_1fd08b1c52.jpg" width="500" height="334" alt="New York cafe"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script></p> <h3>Day 5</h3> <p>Szechenyi baths &#8211; Heroes square &#8211; Andrassy utca &#8211; Terror museum</p> <p><a data-flickr-embed="true" href="https://www.flickr.com/photos/kyrcha/7789044828/in/album-72157631082268964/" title="Szechenyi baths"><img src="https://farm9.staticflickr.com/8290/7789044828_c47e6bbafc.jpg" width="500" height="375" alt="Szechenyi baths"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script></p> <h3>Day 6</h3> <p>Checked out and headed to the airport &#8211; This time the Ryanair jingle played upon landing</p> <p><a data-flickr-embed="true" href="https://www.flickr.com/photos/kyrcha/7789049832/in/album-72157631082268964/" title="The Danube"><img src="https://farm9.staticflickr.com/8425/7789049832_0a1d2e840f.jpg" width="500" height="281" alt="The Danube"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script></p> <h3>Afterthoughts and observations</h3> <p>Even though there are no very famous museums to visit or monuments that stand out, I think Budapest&#8217;s beauty is in its location: the river, the banks and the bridges. We went to the terror museum since we found it to be different from others we had visited in the past. The museum actually turned out to be quite atmospheric.
Also a must-do in Budapest is to visit one of the baths.</p> <p>Prices are inversely proportional to the distance from Vaci utca, from supermarket prices to restaurants and currency-exchange shops. For example, prices in Raday utca, a nice street full of places to eat and drink, are much lower than in Vaci and of the same, if not better, quality.</p> <p>Tip for the supermarkets: blue cap for sparkling and pink cap for non-sparkling water.</p> <p>A tip of 12-15% is included in the bill in around half the places we ate. In the others you can calculate it yourself.</p> <p>There were no big metro signs at the metro entrances, so you had to look for them. Also, there are ticket inspectors at every metro station we visited. I guess they had a big problem with missing revenue from free-riders and resorted to this measure.</p> <p>Budapest had plenty of tourists from all over the world, but not as many as I saw, for example, in Barcelona last year.</p> <h3>More resources</h3> <p>I compiled a list of places I visited and some interesting tips in the <a href="https://foursquare.com/kyrcha/list/budapest-trip">foursquare Budapest trip list</a>.</p> <p>Also a small collection of photos can be found in the <a href="http://www.flickr.com/photos/kyrcha/sets/72157631082268964/">Budapest 2012 flickr set</a>.</p> <p style="text-align: left;"><iframe src="https://maps.google.com/maps/ms?msa=0&amp;msid=205719516427590235446.0004c3f50c15a4d814f05&amp;ie=UTF8&amp;t=m&amp;ll=47.50491,19.058533&amp;spn=0.081173,0.145912&amp;z=12&amp;output=embed" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" width="425" height="350"></iframe><br /> <small>View <a href="https://maps.google.com/maps/ms?msa=0&amp;msid=205719516427590235446.0004c3f50c15a4d814f05&amp;ie=UTF8&amp;t=m&amp;ll=47.50491,19.058533&amp;spn=0.081173,0.145912&amp;z=12&amp;source=embed" style="color: #0000ff; text-align: left;">Budapest</a> in a larger map</small></p><![CDATA[Fitting a sigmoid curve in
R]]>http://kyrcha.info/2012/07/08/tutorials-fitting-a-sigmoid-function-in-rhttp://kyrcha.info2012/07/08/tutorials-fitting-a-sigmoid-function-in-rSun, 08 Jul 2012 08:54:00 GMT<p>This is a short tutorial on how to fit data points that look like a sigmoid curve using the <em>nls</em> function in R. Let’s assume you have a vector of points that you think fit a sigmoid curve, like the ones in the figure below.</p> <p><img src="//images.ctfassets.net/c5lel8y1n83c/2nroJdV29uWGWU6IqGae0o/a7e619cd5d04efd95ba9852fa5b6a075/points.jpg" alt="points"></p> <p>The <a href="http://en.wikipedia.org/wiki/Generalised_logistic_function">general form of the logistic or sigmoid function</a> is defined as:</p> <p style="text-align: center;"><img class="latex" title="y(x) = A + \frac{K-A}{(1+Qe^{-B(t-M)})^{1/\nu}}" src="//s0.wp.com/latex.php?latex=y%28x%29+%3D+A+%2B+%5Cfrac%7BK-A%7D%7B%281%2BQe%5E%7B-B%28t-M%29%7D%29%5E%7B1%2F%5Cnu%7D%7D&amp;bg=ffffff&amp;fg=000&amp;s=0" alt="y(x) = A + \frac{K-A}{(1+Qe^{-B(t-M)})^{1/\nu}}" /></p> Let’s assume a simpler form in which only three of the parameters, K, B and M, are used. Those are the upper asymptote, the growth rate and the time of maximum growth, respectively.
<p style="text-align: center;"><img class="latex" title="y(x) = \frac{K}{1+e^{-B(t-M)}}" src="//s0.wp.com/latex.php?latex=y%28x%29+%3D+%5Cfrac%7BK%7D%7B1%2Be%5E%7B-B%28t-M%29%7D%7D&amp;bg=ffffff&amp;fg=000&amp;s=0" alt="y(x) = \frac{K}{1+e^{-B(t-M)}}" /></p> The following R code estimates the parameters, where <em>y</em> is a vector of data points: <pre><code class="language-R"># function needed for visualization purposes
sigmoid = function(params, x) {
  params[1] / (1 + exp(-params[2] * (x - params[3])))
}

x = 1:53
y = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.18,0.18,0.18,0.33,0.33,0.33,0.33,0.41,
      0.41,0.41,0.41,0.41,0.41,0.5,0.5,0.5,0.5,0.68,0.58,0.58,0.68,0.83,0.83,0.83,
      0.74,0.74,0.74,0.83,0.83,0.9,0.9,0.9,1,1,1,1,1,1,1)

# fitting code
fitmodel &lt;- nls(y~a/(1 + exp(-b * (x-c))), start=list(a=1,b=.5,c=25))

# visualization code
# get the coefficients using the coef function
params=coef(fitmodel)
y2 &lt;- sigmoid(params,x)
plot(y2,type=&quot;l&quot;)
points(y)</code></pre> <p>Now the data points along with the sigmoid curve look like this, with a = 1.0395204, b = 0.1253769, and c = 29.1724838.</p> <p><img src="//images.ctfassets.net/c5lel8y1n83c/1ngtCEuDWUuqgoas4AaYUu/7463c456ba7838174ef36b94f37a110d/prediction.jpg" alt="prediction"></p> <![CDATA[Translation of Echo State Networks in Greek]]>http://kyrcha.info/2012/06/30/translation-of-echo-state-networks-in-greekhttp://kyrcha.info2012/06/30/translation-of-echo-state-networks-in-greekSat, 30 Jun 2012 02:55:00 GMT<blockquote class="twitter-tweet"><p>Eureka! For my dissertation I translated Echo State Networks into Δίκτυα Ηχωικών (ή Ηχοϊκών) Καταστάσεων (ΔΗΚ). Liking it.</p> <p>&mdash; Kyr. Chatzidimitriou (@kyrcha) <a href="https://twitter.com/kyrcha/status/219053009395650560" >June 30, 2012</a></p></blockquote> <p><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script></p> <p>I think I am going with Ηχωικών due to Echo = Ηχώ.
Thus Echo State Networks (ESN) = Δίκτυα Ηχωικών Καταστάσεων (ΔΗΚ) in Greek. The idea basically came from thinking about the <a href="http://en.wikipedia.org/wiki/Anechoic_chamber" >anechoic chamber</a> = ανηχωικός θάλαμος.</p><![CDATA[MsAriadne at Pac-Man vs Ghost Competition - CEC 2011]]>http://kyrcha.info/2011/06/07/msariadne-at-pac-man-vs-ghost-competition-cec-2011http://kyrcha.info2011/06/07/msariadne-at-pac-man-vs-ghost-competition-cec-2011Thu, 07 Jul 2011 07:04:00 GMT<p>We took third place with the MsAriadne bot in the <a href="http://cseepr2.essex.ac.uk/~competition/">Ms Pac-Man vs Ghosts competition</a>, organized by the University of Essex and held during the 2011 <a href="http://www.cec2011.org/">Congress on Evolutionary Computation</a> (CEC 2011). The bot is part of George Matzoulas&#8217; diploma thesis project.</p> <h3>Videos of the MsAriadne bot versus the Legacy ghost team</h3> <p><iframe width="560" height="315" src="https://www.youtube.com/embed/bDuptphXnbA?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe></p> <p><iframe width="560" height="315" src="https://www.youtube.com/embed/KKOfrhSn1nk?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe></p><![CDATA[Finding related work and keeping up to date]]>http://kyrcha.info/2011/03/11/finding-related-work-and-keeping-up-to-datehttp://kyrcha.info2011/03/11/finding-related-work-and-keeping-up-to-dateThu, 10 Mar 2011 23:26:00 GMT<p><a href="https://commons.wikimedia.org/wiki/File:G%C3%B6ttingen-SUB-old.books.JPG#/media/File:G%C3%B6ttingen-SUB-old.books.JPG"><img src="https://upload.wikimedia.org/wikipedia/commons/1/1b/G%C3%B6ttingen-SUB-old.books.JPG" alt="Göttingen-SUB-old.books.JPG" height="480" width="640"></a><br>"<a href="https://commons.wikimedia.org/wiki/File:G%C3%B6ttingen-SUB-old.books.JPG#/media/File:G%C3%B6ttingen-SUB-old.books.JPG">Göttingen-SUB-old.books</a>".
Licensed under Public Domain via <a href="//commons.wikimedia.org/wiki/">Wikimedia Commons</a>.</p> <p>One important task as a researcher is to keep up with all the recent related work in your domain. The items below are what I personally do to follow the state of the art. They involve both push notifications in the form of email alerts, and to-dos that I put on my calendar every two or three months.</p> <p>1. I have subscribed to the RSS feeds of all the well-known publishers (Elsevier, IEEE, Springer etc.) for the journals I am interested in. Then, using an RSS reader (I personally use Google Reader, so that my feeds are available and synced on all my machines and my mobile phone), I get all the articles in press. Every couple of months I spend a couple of hours checking the titles and abstracts. If an article seems interesting or is related work, I download it and read it further.</p> <p>2. In my browser I have a bookmark folder for the websites of the research groups and researchers working in my area. Again, every couple of months I do an &#8220;Open All in Tabs&#8221; action and browse through the tabs for newly added material. I have put a recurring reminder in my calendar app to check them every 3 months.</p> <p>3. Through Google Alerts I have created a few queries that return newly added search engine results via email. Once per week new alerts arrive in my mailbox and I do a quick scan. For example, a query could be &#8220;Echo State Network&#8221; in quotes, in order to match the whole phrase. Alerts can also be created with Google Scholar, where one can save searches as email alerts. I have a few of those too.</p> <p>4. As in item 2 with research groups and researchers, one can keep another bookmarks folder for conferences and workshops taking place every year. Some of them publish their proceedings online, so by visiting their websites two or three times a year you can step into the published work.</p> <p>5. Last but not least, I have subscribed to a number of mailing lists in the areas of my interest. For example, some of the lists I am subscribed to are:</p> <ul> <li>rl-list (Reinforcement Learning)</li> <li>ML-News (Machine Learning News)</li> <li>reservoir-computing (Reservoir Computing)</li> <li>cig (Computational Intelligence in Games)</li> </ul> <p>among others. Besides CFPs and job openings, authors sometimes use them to advertise and provide links to their most recent publications.</p> <p>6. Currently I have started experimenting with social networking sites related to research, like Mendeley and ResearchGATE. I&#8217;ll see how that goes.</p> <p>What do you do?</p><![CDATA[Diploma Theses @ ISSEL on Computational Intelligence in Games]]>http://kyrcha.info/2010/10/27/diploma-theses-issel-on-computational-intelligence-in-gameshttp://kyrcha.info2010/10/27/diploma-theses-issel-on-computational-intelligence-in-gamesWed, 27 Oct 2010 00:24:00 GMT<p>This is a video I made by gathering clips of AI agents/bots/controllers, or whatever you want to call them, developed by researchers, students and aficionados, mainly for competitions at the IEEE CIG conferences, with a couple of them being diploma thesis projects of students at the <a href="http://issel.ee.auth.gr">Intelligent Systems and Software Engineering Labgroup</a> (ISSEL).
The goal is to demonstrate existing test-beds to whoever is looking to develop autonomous agents as a diploma thesis project with ISSEL in the field of CIG.</p> <p><a href="https://www.youtube.com/watch?v=erKHbw0NdTo">https://www.youtube.com/watch?v=erKHbw0NdTo</a></p> <h3 id="the-testbeds-and-related-links">The testbeds and related links</h3> <h4 id="torcs">TORCS</h4> <p>Car racing, car setup <a href="http://cig.ws.dei.polimi.it/">http://cig.ws.dei.polimi.it/</a></p> <h4 id="ortsrl">ORTS+RL</h4> <p><a href="http://2008.rl-competition.org/content/view/20/36/">http://2008.rl-competition.org/content/view/20/36/</a></p> <h4 id="starcraft">Starcraft</h4> <p>Micromanagement, Small scale battle, Tech limited and Full game <a href="http://eis.ucsc.edu/StarCraftAICompetition">http://eis.ucsc.edu/StarCraftAICompetition</a> <a href="http://ls11-www.cs.tu-dortmund.de/rts-competition/starcraft-cig2010">http://ls11-www.cs.tu-dortmund.de/rts-competition/starcraft-cig2010</a></p> <h4 id="poker-texas-holdem">Poker Texas Hold’em</h4> <p>Limit heads up, No limit heads up, Ring <a href="webdocs.cs.ualberta.ca/~games/poker">webdocs.cs.ualberta.ca/~games/poker</a> <a href="http://www.computerpokercompetition.org/">http://www.computerpokercompetition.org/</a> <a href="http://www.poker-academy.com/">http://www.poker-academy.com/</a></p> <h4 id="pac-man">Pac-Man</h4> <p><a href="http://cswww.essex.ac.uk/staff/sml/pacman/PacManContest.html">http://cswww.essex.ac.uk/staff/sml/pacman/PacManContest.html</a></p> <h4 id="mario">Mario</h4> <p>Gameplay, learning, level generation <a href="http://www.marioai.org/">http://www.marioai.org/</a></p> <h4 id="defcon">DEFCON</h4> <p><a href="http://www.introversion.co.uk/defcon/">http://www.introversion.co.uk/defcon/</a> <a href="http://www.doc.ic.ac.uk/~rb1006/projects:api">http://www.doc.ic.ac.uk/~rb1006/projects:api</a></p> <h4 id="keepaway">Keepaway</h4> <p><a
href="http://www.cs.utexas.edu/~AustinVilla/sim/keepaway/">http://www.cs.utexas.edu/~AustinVilla/sim/keepaway/</a> <a href="http://gridsoccer.codeplex.com/">http://gridsoccer.codeplex.com/</a> (grid soccer environment)</p> <h4 id="unreal-tournament">Unreal Tournament</h4> <p><a href="http://www.botprize.org/">http://www.botprize.org/</a></p> <h3 id="video-making-technical-information">Video making, technical information</h3> <p>In case you are interested, clips were downloaded from the YouTube channels mentioned in the video or from websites attributed in the video, or captured with screen-capture programs, in the following formats: flv, wmv, avi and swf. The swf video was converted to flv using a swf2flv converter, and all of them to raw DV with the help of the Kdenlive and Kino open source programs under Ubuntu Linux. iMovie was used for editing, rendering and uploading.</p> <![CDATA[Rome]]>http://kyrcha.info/2010/10/02/romehttp://kyrcha.info2010/10/02/romeSat, 02 Oct 2010 09:38:00 GMT<p>Along with Paris, <strong>Rome</strong> is one of my favorite cities so far. This was my second visit to Rome. The previous one was a short one, during an InterRail excursion a decade ago. My wife and I decided to go there for our honeymoon and this is just a journal of our time in the <em>Eternal City</em>. I&#8217;ve also tried to visit all the &#8220;Angels &amp; Demons&#8221; sights and see them first-hand. So there will be an A&amp;D post for sure in the near future.
Our time of visit was end of July &#8211; beginning of August 2010.</p> <h3>Day 1</h3> <p>Early in the morning we caught our flight from SKG to FCO with Alitalia (they offered a drink and a snack) &#8211; Got the Leonardo Express (14 EPP) and in 30 minutes we reached Roma Termini &#8211; Metro line B was out, causing a little chaos in the bus terminals around &#8211; Checked in &#8211; Visited the Castel Sant&#8217;Angelo and in particular: il Passetto (luckily, since it was only open 10:30 to 11:30 am), walked through the castle and its museum halls (unfortunately no English translations available at the exhibits), terrace (nice view) &#8211; Short walk to St. Peter&#8217;s plaza &#8211; Then headed east to Piazza del Popolo, Porta Del Popolo (Bernini) and Santa Maria del Popolo [1]: highlights there are the Cappella Chigi (Raphael&#8217;s &amp; Bernini&#8217;s work) with the kneeling skeleton and the two Caravaggios &#8211; Evening walk @ Via del Corso, Via Condotti, Piazza di Spagna and Scalinata, Fontana di Trevi and Piazza Barberini with Bernini&#8217;s Triton fountain.</p> <p style="text-align: center;"><a href="https://www.flickr.com/photos/kyrcha/5044397307/" ><img src="https://farm5.static.flickr.com/4104/5044397307_cd807889d7.jpg" alt="DSC_0098" width="500" height="334" /></a></p> <h3>Day 2</h3> <p>Colosseum (the 1.5 EPP online booking fee saved us a lot of queue waiting time, so I would recommend it) &#8211; Palatino: Museums, Stadio, Casa di Livia, Casa di Augusto, Roman Huts (not as exciting as the Colosseum) &#8211; Roman Forum walk through the Via Sacra and Via dei Fori Imperiali (one must have A LOT of imagination) &#8211; il Vittoriano: took the elevator to the top (kind of expensive at 7 EPP, but the view is really nice; personally I prefer it even to the one from St. Peter&#8217;s Dome, since it is more in the middle of Rome rather than off to the side) &#8211; Piazza Venezia and Chiesa di St.
Marco [2] (worth the &#8220;offerte&#8221; for lighting the golden mosaic) &#8211; Capitolium: Piazza and Musei Capitolini &#8211; Walk towards Theatre of Marcellus, Santa Maria in Cosmedin with its Bocca della Verità (took a picture from the side, since it was not open at that time), the Broken Bridge and Isola Tiberina &#8211; At that time the clouds started gathering so we turned back to Circo Massimo, where we took the metro back to the hotel after grabbing some take-away food for dinner &#8211; The most tiresome day, one that put our legs to the test, but at least it was worth it.</p> <p style="text-align: center;"><a href="https://www.flickr.com/photos/kyrcha/4923442129/" ><img src="https://farm5.static.flickr.com/4080/4923442129_6ea0e0cd65.jpg" alt="DSC_0232" width="500" height="334" /></a></p> <h3>Day 3</h3> <p>Musei Vaticani (as at the Colosseum, the extra 4 EPP for reserving the tickets online is worth it): we just looked for the main attractions in the Pinacoteca, Museo Pio Clementino, Museo Gregoriano Egizio, Galleria degli Arazzi, Galleria delle Carte Geografiche, Stanze di Raffaello and Cappella Sistina &#8211; From Piazza San Pietro we entered St. Peter&#8217;s Basilica [3], then the Dome (&#8220;Cupola&#8221; in Italian; the elevator costs 7 EPP) and finally the Vatican Grottoes (for free) &#8211; The nice thing about going alone instead of being in a group is that you can take your time.
We actually spent two hours in the Basilica alone &#8211; Back to the hotel&#8230; &#8211; In the evening a small trip to the Scalinata and Fontana di Trevi.</p> <p style="text-align: center;"><a href="https://www.flickr.com/photos/kyrcha/4924042438/" ><img src="https://farm5.static.flickr.com/4099/4924042438_dea0082f3a.jpg" alt="Burn to the end of time" width="334" height="500" /></a></p> <h3>Day 4</h3> <p>Piazza Barberini, Santa Maria della Vittoria [4] with its Saint Teresa in Ecstasy sculpture by Bernini, Via Veneto, Santa Maria della Concezione [5] and the Cripta dei Cappuccini (kind of creepy; their motto: &#8220;What you are we used to be. What we are you will be&#8221;) &#8211; In Via Veneto the coffee was expensive (5 EPP) and we experienced some bad attitude from the waiters, unfitting for &#8220;want to stay famous&#8221; cafes &#8211; Walked in the park of Villa Borghese &#8211; Caught the nice view towards Piazza del Popolo &#8211; Sat in Caffe Rosati for iced coffee, as proposed in AD (I bet Dan Brown has not visited Greece), for 7 EPP &#8211; Headed towards Piazza della Rotonda, the Pantheon, Piazza della Minerva with the Elefantino obelisk, Santa Maria sopra Minerva [6] &#8211; Looked for the Caravaggios @ Sant&#8217;Agostino [7] and San Luigi dei Francesi [8] that were advertised in public spots all over Rome &#8211; Later, Piazza Navona and Agnes in Agony [9] &#8211; Got some rest and went to eat &#8211; Grabbed an ice cream from the Old Bridge &#8211; Admired St.
Peter&#8217;s piazza and Basilica at night until 11:00 pm, when they close it &#8211; Returned to the hotel after a quick walk at Bernini&#8217;s bridge in front of Castel Sant&#8217;Angelo.</p> <p style="text-align: center;"><a href="https://www.flickr.com/photos/kyrcha/4923448121/" ><img src="https://farm5.static.flickr.com/4143/4923448121_147e5c4449.jpg" alt="The Oculus" width="500" height="334" /></a></p> <h3>Day 5</h3> <p>Walking and shopping in Rome&#8217;s center &#8211; In the afternoon, our attempt to locate Trastevere without a guide ended in disaster, since we wound up at Trastevere train station, which has nothing to do with the &#8220;cool&#8221; place in Rome &#8211; In between we saw the Pyramid and another view of Rome, not so &#8220;historic&#8221; but rather &#8220;urban&#8221; &#8211; Grabbed something to eat and returned to the hotel &#8211; After getting a good rest from the afternoon&#8217;s mistake, which was painful to the legs, we located Trastevere properly this time and went there on foot along the Tiber (pass four bridges heading south after Ponte Sant&#8217;Angelo and you will find it before Isola Tiberina, on the west side of the &#8220;Tiberis&#8221;) &#8211; @Trastevere: Santa Maria in Trastevere [10], Piazza Trastevere, sat down to eat, searched for two of Lonely Planet&#8217;s proposed gelaterias but both were closed, since it was kind of late &#8211; Went back on foot again, which we kind of regretted, since at one point we felt somewhat threatened. At least we got some nice night pictures with long exposure times.</p> <p style="text-align: center;"><a href="https://www.flickr.com/photos/kyrcha/4924046212/" ><img src="https://farm5.static.flickr.com/4100/4924046212_f1088de089.jpg" alt="Tiber@night" width="500" height="334" /></a></p> <h3>Day 6</h3> <p>Checked out and left our baggage with the hotel &#8211; Got our small presents for family and friends, mainly @ Via dei Rienzo &#8211; Made a final walk through Castel Sant&#8217;Angelo, Piazza Navona, the Pantheon, St.
Ignazio di Loyola [11], Fontana di Trevi, St. Peter&#8217;s Basilica (Rome in general, and St. Peter&#8217;s piazza in particular, were by that time packed with ROMA 2010 CIM attendees) &#8211; Finally: hotel, metro, train to FCO, FCO to SKG, Thessaloniki, ate gyros and said home sweet home&#8230;</p> <p style="text-align: center;"><a href="https://www.flickr.com/photos/kyrcha/4924045000/" ><img src="https://farm5.static.flickr.com/4094/4924045000_e8cc0762c7.jpg" alt="Nereids" width="500" height="334" /></a></p> <h3>Photos</h3> <p>A small collection of my photos from <a href="http://www.flickr.com/photos/kyrcha/sets/72157624673984075/" target="_blank">Rome and the Vatican City</a>.</p> <h3>Budget, Eating and Drinking</h3> <p>An expensive city (food, drink, museums). On our budget, we had to put some effort into thinking and searching where to eat or drink while maintaining a good quality-to-price ratio. But I guess this is difficult everywhere nowadays. The coffee is not as cheap as people in Greece often believe, especially in the cafes of the historic center. It is kind of difficult, though, for the &#8220;average&#8221; tourist to see the attractions and at the same time keep in mind where to eat &#8220;smart&#8221;: cheap enough and good enough. Personally, I prefer the pizzas and the freddo cappuccino as made in Greece. Filled pasta was good (same for the lasagna) but the other kinds of pasta were too &#8220;al dente&#8221; for our tastes. Gelato was just great!!! Our personal favorite was &#8220;Old Bridge&#8221;, where we went three times. The pistachio there is just great.
But beware: there are places where one can pay 5 EPP (for example, we spotted one such place in the Fontana di Trevi area) for an ice cream smaller than the one you get elsewhere for 1.5 EPP.</p> <h3>In numbers</h3> <p>Photos taken: 1368<br /> Churches (&#8220;Chiesas&#8221;) entered: 11 (including St Peter&#8217;s Basilica)</p> <h3>Acronyms</h3> <p>EPP: Euro(s) Per Person<br /> AD: Angels &amp; Demons</p><![CDATA[IEEE ICDM 2010 Contest]]>http://kyrcha.info/2010/09/08/ieee-icdm-2010-contesthttp://kyrcha.info2010/09/08/ieee-icdm-2010-contestWed, 08 Sep 2010 09:25:00 GMT<p>Just for fun, I participated in the <a href="http://tunedit.org/challenge/IEEE-ICDM-2010">IEEE ICDM 2010 Contest</a> - <a href="http://tunedit.org/challenge/IEEE-ICDM-2010/traffic">Traffic track</a>, with a couple of R scripts, at first using linear regression and later neural networks. Mainly due to summer vacations limiting the available time, the approach was nothing too fancy, and I ended up in 17th place out of 101 active participants.</p> <p>The task was to predict the traffic in 10 road segments, 2 ways each, for 1000 60-minute-long windows, between the 41st and the 50th minute, knowing only the first 30 minutes. Historical data were provided in the form of 100 10-hour windows (60000 rows) with 20 values per row, corresponding to the traffic observed in one minute in one of the 10 road segments x 2 ways.</p> <p>My best result in the competition was obtained using the following procedure:</p> <p>a. <strong>Preprocessing</strong>: Transform the training and test datasets so that rows correspond to 10-minute intervals rather than 1-minute intervals. Normalize all values to [0,1]. b. <strong>Modelling</strong>: Turn the task into a supervised learning problem. I used 60 attributes, 20 for times t+1 to t+10, 20 for t+11 to t+20 and 20 for t+21 to t+30, to predict one of the 20 traffic values at times t+41 to t+50. Thus 20 such datasets were created, one for each road segment and way. c.
<strong>Training</strong>: 20 Feed-Forward Neural Nets (FFNNs) were trained, one for each of the above 20 datasets, and 20 more were trained in the same way, using a reduced dataset with 15 attributes instead of 60. This was achieved by using the ReliefF feature selection algorithm in WEKA and keeping the top 15 attributes. Each of the 40 FFNNs had its weights randomly initialized. The former 20 FFNNs had 15 hidden units, while the latter had 30. Weight decay was also used. d. <strong>Predicting</strong>: Predictions were made for each of the 20 target values using all 40 FFNNs. The final prediction was the mean of the 40 predictions.</p> <![CDATA[Academic Fun]]>http://kyrcha.info/2010/09/01/academic-funhttp://kyrcha.info2010/09/01/academic-funTue, 31 Aug 2010 23:51:00 GMT<p>How to make a publication:</p> <ul> <li>If you have a crisp algorithm, make it fuzzy.</li> <li>If you have a problem, solve it using a GA.</li> <li>If you have an algorithm, program it in CUDA.</li> </ul> <![CDATA[A NEAT Way for Evolving Echo State Networks]]>http://kyrcha.info/2010/04/29/a-neat-way-for-evolving-echo-state-networkshttp://kyrcha.info2010/04/29/a-neat-way-for-evolving-echo-state-networksThu, 29 Apr 2010 03:56:00 GMT<p>My ECAI 2010 submission entitled &quot;A NEAT Way for Evolving Echo State Networks&quot; was accepted for publication as a full paper. I&#39;ll keep updating the post with information about the paper.</p> <p><strong>Abstract</strong>: The Reinforcement Learning (RL) paradigm is an appropriate formulation for agent, goal-directed, sequential decision making. In order though for RL methods to perform well in difficult, complex, real-world tasks, the choice and the architecture of an appropriate function approximator is of crucial importance. This work presents a method for automatically discovering such function approximators, based on a synergy of ideas and techniques that are proven to be working on their own.
Using Echo State Networks (ESNs), as our function approximators of choice, we try to adapt them, by combining evolution and learning for developing the appropriate ad-hoc architectures to solve the problem at hand. The choice of ESNs was made for their ability to handle both non-linear and non-Markovian tasks, while also being capable of learning on-line, through simple gradient descent, temporal difference learning. For creating networks that enable efficient learning, a neuroevolution procedure was applied. Appropriate topologies and weights were acquired by applying the NeuroEvolution of Augmented Topologies (NEAT) method as a meta-search algorithm and by adapting ideas like historical markings, complexification and speciation, to the specifics of ESNs. Our methodology is tested on both supervised and reinforcement learning testbeds with promising results.</p> <h3 id="presentation">Presentation</h3> <p><a href="http://www.slideshare.net/kyrcha/a-neat-way-for-evolving-echo-state-networks">http://www.slideshare.net/kyrcha/a-neat-way-for-evolving-echo-state-networks</a></p>
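<p>For readers unfamiliar with Echo State Networks, here is a minimal illustrative ESN sketch in Python/NumPy. It is <em>not</em> the NEAT-evolved variant of the paper, just my assumption of a typical plain setup: a fixed random reservoir, rescaled so the echo state property holds, with only a linear readout trained by ridge regression on a toy next-step sine prediction task. All sizes and constants are made up for the example.</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes, chosen for this sketch only.
n_in, n_res, washout = 1, 100, 50

# Fixed random input and reservoir weights; the reservoir matrix is rescaled
# to spectral radius 0.9 so the echo state (fading memory) property holds.
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def run_reservoir(inputs):
    """Drive the reservoir with an input sequence and collect its states."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W_in @ u + W @ x)
        states.append(x)
    return np.array(states)

# Toy task: one-step-ahead prediction of a sine wave.
signal = np.sin(0.1 * np.arange(300)).reshape(-1, 1)
X = run_reservoir(signal[:-1])[washout:]  # states, initial transient discarded
Y = signal[1:][washout:]                  # next-step targets

# Only the linear readout is trained, here with ridge regression.
ridge = 1e-8
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)
mse = float(np.mean((X @ W_out - Y) ** 2))
print(mse)
```

<p>Note that only the readout <code>W_out</code> is learned; everything that stays fixed here (reservoir topology and weights) is exactly the part that the neuroevolution procedure in the paper searches over instead.</p>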