<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.7">Jekyll</generator><link href="https://nextoptdev.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://nextoptdev.github.io/" rel="alternate" type="text/html" /><updated>2020-06-01T01:47:24+00:00</updated><id>https://nextoptdev.github.io/feed.xml</id><title type="html">NextOpt</title><subtitle>QTell은 넥스트옵트에서 제공하는 시계열 기반 예측 서비스입니다</subtitle><entry><title type="html">GSx Active Learning</title><link href="https://nextoptdev.github.io/regression/optimization/2020/04/20/GSx-active-learning/" rel="alternate" type="text/html" title="GSx Active Learning" /><published>2020-04-20T11:43:00+00:00</published><updated>2020-04-20T11:43:00+00:00</updated><id>https://nextoptdev.github.io/regression/optimization/2020/04/20/GSx%20active%20learning</id><content type="html" xml:base="https://nextoptdev.github.io/regression/optimization/2020/04/20/GSx-active-learning/">&lt;h1 id=&quot;gsx-active-learning&quot;&gt;GSx Active Learning&lt;/h1&gt;

&lt;p&gt;A problem I had with datasets containing a small number of samples (for the sake of simplicity, I set a criterion of &amp;lt;30) was that it’s hard to get reasonably accurate prediction results.&lt;/p&gt;

&lt;p&gt;Apparently, one of the methods used in such scenarios is &lt;em&gt;Active Learning&lt;/em&gt;. Active learning is a method that, given a set of data points, chooses only up to &lt;script type=&quot;math/tex&quot;&gt;k&lt;/script&gt; of them, either by manual selection or by a set criterion. It’s often used in scenarios where data is so abundant that not all of it can be labeled.&lt;/p&gt;

&lt;p&gt;However, as I’m primarily dealing with time series, I wasn’t sure if it was acceptable to selectively omit some data points.&lt;/p&gt;

&lt;p&gt;I found a paper named &lt;a href=&quot;https://arxiv.org/abs/1808.04245&quot;&gt;Active Learning for Regression Using Greedy Sampling&lt;/a&gt; by D. Wu et al., which uses simple greedy techniques to select data points for training.&lt;/p&gt;

&lt;p&gt;The first method the paper mentions, named &lt;em&gt;“Greedy Sampling on the Inputs (GSx)”&lt;/em&gt;, selects the data point closest to the centroid of all the samples. This is iterated &lt;script type=&quot;math/tex&quot;&gt;k&lt;/script&gt; times, resulting in the training set.&lt;/p&gt;

&lt;p&gt;I tested the method with “Total number of phone calls to pizza delivery services from the Seoul area in September 2019” data. The data consists of the daily number of phone calls from September 1st to the 30th, separated by age group (10s, 20s, 30s, …) and district. Since I was only interested in the total number of calls, I deemed age group and district irrelevant to the results. So my formatted data consists of 30 data points, one for each day.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;GSx&lt;/em&gt; was implemented using the absolute distance from the mean of the standardized values, which is simply the absolute distance from 0. The data was divided into 24 training points and 6 test points. Additionally, the data showed a strong weekly seasonality.&lt;/p&gt;
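&lt;p&gt;As a hedged sketch (toy values, not the actual pizza-call data), the selection step described above amounts to sorting by absolute distance from 0 after standardization and keeping the closest points:&lt;/p&gt;

```python
import numpy as np

def gsx_select(y, k):
    """Pick the k points closest to the centroid of the standardized values."""
    z = (y - y.mean()) / y.std()
    # distance from the mean of standardized values == distance from 0
    order = np.argsort(np.abs(z))
    return np.sort(order[:k])  # indices of the selected training points

y = np.array([3.0, 10.0, 4.0, 5.0, 9.0, 4.5, 6.0])
idx = gsx_select(y, 3)
```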

&lt;p&gt;&lt;img src=&quot;https://nextoptdev.github.io/images/blog/2020-04-20/data_plot.png&quot; alt=&quot;data plot, standardized&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;data plot, standardized&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A standard linear Ridge model was used, with a piecewise trend and a weekly Fourier feature added. I started with an initial value of &lt;script type=&quot;math/tex&quot;&gt;k = 10&lt;/script&gt; and increased it up to the length of the training set, 24.&lt;/p&gt;
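&lt;p&gt;A minimal sketch of this kind of model on synthetic data (the hinge knot at day 15 and the Fourier order below are my own assumptions, not the exact features used):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import Ridge

def weekly_fourier(t, order=2):
    """Sine/cosine features with a 7-day period."""
    cols = []
    for j in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * j * t / 7))
        cols.append(np.cos(2 * np.pi * j * t / 7))
    return np.column_stack(cols)

t = np.arange(30)  # one point per day of September
# piecewise (hinge) trend with an assumed knot at day 15, plus weekly Fourier terms
X = np.column_stack([t, np.maximum(t - 15, 0), weekly_fourier(t)])
y = 0.05 * t + np.sin(2 * np.pi * t / 7)  # toy series with trend and weekly pattern

model = Ridge(alpha=1.0).fit(X[:24], y[:24])
rmse = np.sqrt(np.mean((model.predict(X[24:]) - y[24:]) ** 2))
```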

&lt;p&gt;The results are shown in the image below (the number next to each k is the RMSE):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://nextoptdev.github.io/images/blog/2020-04-20/result_subplot.png&quot; alt=&quot;GSx result plot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the RMSE reaches its minimum at &lt;script type=&quot;math/tex&quot;&gt;k = 21&lt;/script&gt;; interestingly enough, adding more data points beyond that actually increased the error.&lt;/p&gt;

&lt;p&gt;Even though &lt;em&gt;GSx&lt;/em&gt; is a simple technique, it was able to increase the accuracy in a test environment. Additionally, other methods, such as &lt;em&gt;GSy&lt;/em&gt; in the aforementioned paper, or more advanced ones, could possibly bring even better results.&lt;/p&gt;</content><author><name>Shin Young Kim</name></author><summary type="html">GSx Active Learning</summary></entry><entry><title type="html">AF Ratio Optimization</title><link href="https://nextoptdev.github.io/bayesian/model-optimization/2020/04/07/AF-Ratio-Optimization/" rel="alternate" type="text/html" title="AF Ratio Optimization" /><published>2020-04-07T02:34:00+00:00</published><updated>2020-04-07T02:34:00+00:00</updated><id>https://nextoptdev.github.io/bayesian/model-optimization/2020/04/07/AF%20Ratio%20Optimization</id><content type="html" xml:base="https://nextoptdev.github.io/bayesian/model-optimization/2020/04/07/AF-Ratio-Optimization/">&lt;h1 id=&quot;af-ratio-optimization&quot;&gt;AF Ratio Optimization&lt;/h1&gt;

&lt;p&gt;&lt;img src=&quot;https://nextoptdev.github.io/images/blog/af-ratio-plot.png&quot; alt=&quot;AF ratio histogram&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Histogram of AF ratios&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;optimizing-the-distribution-of-af-ratio-for-the-maximum-profit-by-using-historical-data-inform-about-the-best-decision-for-order-quantity&quot;&gt;Optimizing the distribution of the A/F ratio for maximum profit using historical data, to inform the best decision for order quantity&lt;/h2&gt;
&lt;p&gt;A solution for optimizing this A/F data, which is right-skewed and limited to positive values. Different models are yielded depending on the allowed range and the amount of data.&lt;/p&gt;

&lt;h3 id=&quot;challenge&quot;&gt;Challenge&lt;/h3&gt;
&lt;p&gt;How to fit the A/F ratio distribution given only partial demand and forecast information&lt;/p&gt;

&lt;h3 id=&quot;solution&quot;&gt;Solution&lt;/h3&gt;
&lt;p&gt;Through entropy maximization, find optimal inventory levels under an uncertain demand distribution.&lt;/p&gt;
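&lt;p&gt;As one hedged illustration (a sketch, not the production model): if only the mean of the positive-valued A/F ratio is known, the maximum-entropy distribution on the positive half-line is exponential with that mean, and a newsvendor-style critical fractile then yields an order quantity. The price, cost, and data below are all hypothetical.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
af = rng.lognormal(mean=-0.1, sigma=0.5, size=200)  # hypothetical right-skewed A/F history

# The maximum-entropy distribution on the positive half-line with a fixed mean
# is the exponential distribution with that mean.
mean_af = af.mean()

# Newsvendor critical fractile with hypothetical unit price p and unit cost c:
# order so that demand is covered with probability (p - c) / p.
p, c = 10.0, 6.0
q = (p - c) / p

# Exponential quantile: F^{-1}(q) = -mean * ln(1 - q)
forecast = 1000.0
order_qty = forecast * (-mean_af * np.log(1.0 - q))
```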

&lt;h3 id=&quot;why&quot;&gt;Why?&lt;/h3&gt;
&lt;p&gt;The balance between the risk and the benefit of ordering is important. Focusing on the proportion, rather than on exact prediction of the actual quantity, casts light on the optimal order quantity and has the advantage of being normalized.&lt;/p&gt;

&lt;h3 id=&quot;how&quot;&gt;How?&lt;/h3&gt;
&lt;p&gt;Estimate the optimal distribution by calculating and comparing the resulting profits among various competing distributions.&lt;/p&gt;</content><author><name>Hyun Ji Moon</name></author><summary type="html">AF Ratio Optimization</summary></entry><entry><title type="html">Feature Engineering: Using Weather Data to Predict Meal Consumption in a Military Mess</title><link href="https://nextoptdev.github.io/bayesian/feature-engineering/2020/04/06/Adding-Weather-data-for-Predicting-Meal-Consumption/" rel="alternate" type="text/html" title="Feature Engineering: Using Weather Data to Predict Meal Consumption in a Military Mess" /><published>2020-04-06T08:50:00+00:00</published><updated>2020-04-06T08:50:00+00:00</updated><id>https://nextoptdev.github.io/bayesian/feature-engineering/2020/04/06/Adding%20Weather%20data%20for%20Predicting%20Meal%20Consumption</id><content type="html" xml:base="https://nextoptdev.github.io/bayesian/feature-engineering/2020/04/06/Adding-Weather-data-for-Predicting-Meal-Consumption/">&lt;p&gt;I was tasked with adding weather data (precipitation, temperature) to enhance forecast results for meal consumption quantity in a military mess hall near Daejeon, Korea.&lt;/p&gt;

&lt;p&gt;The data I received was daily meal consumption quantity from January 1st, 2019 to August 30th of the same year. Additionally, some dates were missing from the data, believed to be days on which the mess hall didn’t open, such as holidays or leaves.&lt;/p&gt;

&lt;p&gt;The plot below shows the consumption quantity plotted against dates, with selected peaks and floors marked as red and blue dots:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://nextoptdev.github.io/images/blog/meal-date-plot.png&quot; alt=&quot;data plot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;At first glance, some sort of seasonality can be seen, though it is not exact; we have to remember that some dates are missing, meaning the time axis is not uniform.&lt;/p&gt;

&lt;p&gt;I plotted autocorrelation at various lags to try to identify the seasonality intervals:&lt;/p&gt;
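&lt;p&gt;For reference, a sample ACF can be computed directly; the sketch below uses a synthetic series with a 7-day period, since the actual mess-hall data isn’t shown here:&lt;/p&gt;

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelation for lags 1..nlags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-lag], x[lag:]) / denom for lag in range(1, nlags + 1)])

rng = np.random.default_rng(0)
t = np.arange(200)
y = np.sin(2 * np.pi * t / 7) + 0.2 * rng.normal(size=200)  # weekly toy signal

r = acf(y, 30)
peak_lag = int(np.argmax(r)) + 1  # lag with the strongest positive autocorrelation
```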

&lt;p&gt;&lt;img src=&quot;https://nextoptdev.github.io/images/blog/meal-date-acf-150.png&quot; alt=&quot;data plot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As you can see, it’s hard to determine the seasonality intervals from an ACF plot alone.&lt;/p&gt;

&lt;p&gt;For the sake of analysis, I decided to use seasonality values of 5, 7, 15, and 31.&lt;/p&gt;

&lt;p&gt;My goal was to see whether adding standardized precipitation and average temperature data would make a meaningful impact on prediction accuracy.&lt;/p&gt;
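&lt;p&gt;The feature preparation itself is straightforward; here is a sketch with hypothetical weather series standing in for the real ones:&lt;/p&gt;

```python
import numpy as np

def standardize(x):
    """Zero-mean, unit-variance scaling."""
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
n_days = 242  # January 1st to August 30th, 2019
precip = rng.gamma(shape=0.5, scale=8.0, size=n_days)  # hypothetical rainfall (mm)
temp = 15 + 10 * np.sin(2 * np.pi * np.arange(n_days) / 365) + rng.normal(size=n_days)

weather = np.column_stack([standardize(precip), standardize(temp)])
# these two columns are then appended to the existing design matrix
```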

&lt;p&gt;As a result, regression without weather data added as a feature resulted in an RMSE of 1.18 (on the standardized scale).&lt;/p&gt;

&lt;p&gt;Unfortunately, adding weather data barely improved the result, lowering the RMSE only slightly to 1.14.&lt;/p&gt;

&lt;p&gt;Adding raw weather data was not effective on this specific dataset. Additional preprocessing would be required to incorporate the data effectively. I’m currently thinking of applying a logistic function to precipitation, because one of my hypotheses is that people are less likely to be active the heavier the rainfall.&lt;/p&gt;</content><author><name>Shin Young Kim</name></author><summary type="html">I was tasked with adding weather data (precipitation, temperature) to enhance forecast results for meal consumption quantity in a military mess hall near Daejeon, Korea.</summary></entry><entry><title type="html">Model Composition: Constructing a Hierarchical Spline Time Series Model</title><link href="https://nextoptdev.github.io/bayesian/model/2020/04/06/Constructing-a-Hierarchical-Spline-Time-Series-Model/" rel="alternate" type="text/html" title="Model Composition: Constructing a Hierarchical Spline Time Series Model" /><published>2020-04-06T08:24:06+00:00</published><updated>2020-04-06T08:24:06+00:00</updated><id>https://nextoptdev.github.io/bayesian/model/2020/04/06/Constructing%20a%20Hierarchical%20Spline%20Time%20Series%20Model</id><content type="html" xml:base="https://nextoptdev.github.io/bayesian/model/2020/04/06/Constructing-a-Hierarchical-Spline-Time-Series-Model/">&lt;p&gt;A hierarchical spline time series model is being experimented with for forecasting failure rates.&lt;/p&gt;

&lt;p&gt;The following properties were observed while configuring the model:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The sparsity of the data is unbalanced&lt;/li&gt;
  &lt;li&gt;Some portions of the data are missing&lt;/li&gt;
  &lt;li&gt;Each layer within the model shares similar properties&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;https://nextoptdev.github.io/images/blog/sparsity_data.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sparsity data&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Data sparsity could be overcome by estimating B-Spline parameters, \(\beta, w\) for the overall period.&lt;/p&gt;
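&lt;p&gt;A minimal sketch of that estimation on synthetic data (the knot count and spline degree below are my own assumptions, and plain least squares stands in for the hierarchical estimation):&lt;/p&gt;

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40, endpoint=False)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

k = 3  # cubic B-spline
knots = np.linspace(0.0, 1.0, 7)  # equally spaced knots over the whole period
t = np.r_[[0.0] * k, knots, [1.0] * k]  # clamped knot vector

n_basis = len(t) - k - 1
# evaluate each basis function at x to build the design matrix
B = np.column_stack([BSpline(t, np.eye(n_basis)[i], k)(x) for i in range(n_basis)])

beta, *_ = np.linalg.lstsq(B, y, rcond=None)  # spline coefficients for the whole period
fitted = B @ beta
```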

&lt;p&gt;A hierarchical model can be constructed, and its hyperpriors estimated using Markov Chain Monte-Carlo sampling to handle missing data.&lt;/p&gt;

&lt;p&gt;This method calculated the B-Spline basis at equally spaced knots. A more reliable model could be built by placing the knots at quantiles of the data instead, so that each section’s basis reflects the amount of data it contains.&lt;/p&gt;</content><author><name>Jhin Woo Choi</name></author><summary type="html">A hierarchical spline time series model is being experimented with for forecasting failure rates.</summary></entry><entry><title type="html">Correlation Analysis for values D, S, C with flow</title><link href="https://nextoptdev.github.io/bayesian/feature-engineering/2020/03/28/%EC%83%81%EA%B4%80%EA%B4%80%EA%B3%84%EB%B6%84%EC%84%9D-D,-S-C-%ED%9D%90%EB%A6%84%EC%9D%B4-%EC%9E%88%EC%9D%84%EB%95%8C/" rel="alternate" type="text/html" title="Correlation Analysis for values D, S, C with flow" /><published>2020-03-28T07:37:11+00:00</published><updated>2020-03-28T07:37:11+00:00</updated><id>https://nextoptdev.github.io/bayesian/feature-engineering/2020/03/28/%EC%83%81%EA%B4%80%EA%B4%80%EA%B3%84%EB%B6%84%EC%84%9D:%20D,%20S%20C%20%ED%9D%90%EB%A6%84%EC%9D%B4%20%EC%9E%88%EC%9D%84%EB%95%8C</id><content type="html" xml:base="https://nextoptdev.github.io/bayesian/feature-engineering/2020/03/28/%EC%83%81%EA%B4%80%EA%B4%80%EA%B3%84%EB%B6%84%EC%84%9D-D,-S-C-%ED%9D%90%EB%A6%84%EC%9D%B4-%EC%9E%88%EC%9D%84%EB%95%8C/">&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;find i for
 D ~ [D's trend, D's season, y_S_shift(i)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epsilon&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'yhat'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For this, we suggest the following:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sort corr(df.epsilon, y_S_shift(i))&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;select a number of i who have high correlation.&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature candidate : {y_S_shift(i)}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
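&lt;p&gt;A sketch of these three steps on synthetic series, where the residual of the D model secretly depends on y_S shifted by 3 (the lag and the coefficients are hypothetical):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
y_S = pd.Series(rng.normal(size=n))
# residual of the D model after removing its trend and seasonality;
# here it follows y_S shifted by 3 steps, plus noise
epsilon = 0.8 * y_S.shift(3).fillna(0.0) + pd.Series(rng.normal(size=n))

# 1. correlation of the residual with each candidate shift
corrs = {i: epsilon.corr(y_S.shift(i)) for i in range(1, 15)}
# 2. keep the shifts with the highest absolute correlation
top = sorted(corrs, key=lambda i: abs(corrs[i]), reverse=True)[:3]
# 3. feature candidates: {y_S.shift(i) for i in top}
```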

&lt;p&gt;The reason behind comparing the error (\( y - \widehat{y} \)) and y_S, instead of y_D and y_S, is to eliminate the effect of inflated correlation resulting from shared seasonality components.&lt;/p&gt;</content><author><name>Hyun Ji Moon</name></author><summary type="html">find i for D ~ [D's trend, D's season, y_S_shift(i)] df.epsilon = df.y - m.predict(df)['yhat']</summary></entry><entry><title type="html">Welcome to Jekyll!</title><link href="https://nextoptdev.github.io/jekyll/update/2020/03/26/welcome-to-jekyll/" rel="alternate" type="text/html" title="Welcome to Jekyll!" /><published>2020-03-26T10:41:11+00:00</published><updated>2020-03-26T10:41:11+00:00</updated><id>https://nextoptdev.github.io/jekyll/update/2020/03/26/welcome-to-jekyll</id><content type="html" xml:base="https://nextoptdev.github.io/jekyll/update/2020/03/26/welcome-to-jekyll/">&lt;p&gt;You’ll find this post in your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_posts&lt;/code&gt; directory. Go ahead and edit it and re-build the site to see your changes. You can rebuild the site in many different ways, but the most common way is to run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jekyll serve&lt;/code&gt;, which launches a web server and auto-regenerates your site when a file is updated.&lt;/p&gt;

&lt;p&gt;To add new posts, simply add a file in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_posts&lt;/code&gt; directory that follows the convention &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;YYYY-MM-DD-name-of-post.ext&lt;/code&gt; and includes the necessary front matter. Take a look at the source for this post to get an idea about how it works.&lt;/p&gt;

&lt;p&gt;Jekyll also offers powerful support for code snippets:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-ruby&quot; data-lang=&quot;ruby&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;print_hi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;puts&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Hi, &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;print_hi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'Tom'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#=&amp;gt; prints 'Hi, Tom' to STDOUT.&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Check out the &lt;a href=&quot;https://jekyllrb.com/docs/home&quot;&gt;Jekyll docs&lt;/a&gt; for more info on how to get the most out of Jekyll. File all bugs/feature requests at &lt;a href=&quot;https://github.com/jekyll/jekyll&quot;&gt;Jekyll’s GitHub repo&lt;/a&gt;. If you have questions, you can ask them on &lt;a href=&quot;https://talk.jekyllrb.com/&quot;&gt;Jekyll Talk&lt;/a&gt;.&lt;/p&gt;</content><author><name></name></author><summary type="html">You’ll find this post in your _posts directory. Go ahead and edit it and re-build the site to see your changes. You can rebuild the site in many different ways, but the most common way is to run jekyll serve, which launches a web server and auto-regenerates your site when a file is updated.</summary></entry></feed>