Maksim Bober

Introducing Poker Gym

2025-12-28T00:00:00+00:00

I started learning Poker seriously and as part of this effort generated a simple app to help me do this. I’m also trying to learn how Telegram web-apps are working, so I created an app that will be available on Telegram that helps you remember poker combinations. I’m sharing it with the world to do a basic idea validation and at the same time collect some feedback.

What is Poker Gym?

Poker Gym is a free app available through Telegram where users can practice basic poker skills. For now, it only supports poker combinations. It will have more later.

The Problem

Learning basics of poker is hard; you need to learn it through trial and error. What if you don’t want to play couple of hundred games to get good at it and instead can drill those in an app?

The Solution

Poker Gym lets you practice basics such as learning combinations and rankings.

Features (Planned)

Outs: Get familliar and drill counting outs (unseen cards that combine well with your cards)
Progress tracking: See your improvement over time
Leaderboard: See your place in relation to other players

Looking for Feedback

If this sounds interesting to you, I’d love to hear your thoughts:

Would you use a tool like this?
What specific scenarios would you want to practice?
What features would make this valuable for you?

Feel free to reach out with feedback at https://forms.gle/E2UMLnJqQxX1JnTx9

Scaling Databases: The Modulo Hashing Problem Visualized

2025-11-29T00:00:00+00:00

Problem

The current database cannot handle the volume of incoming write requests. During peak times, there are too many write/update requests incoming per second, so requests are taking longer to execute. You can buffer requests, but if the database cannot fulfill them faster than they arrive, it will overflow. Let’s say you work in a bank and cannot afford to drop any requests.

Solution

To scale writes, you can either use a bigger database (scale vertically) or use multiple databases (scale horizontally). Let’s say you want to scale horizontally. So you add an extra database. Now, you need to figure out how to forward requests to multiple DBs.

You can do it in a round-robin style. The issue is that requests for user X are persisted across different DBs, which makes querying all records for X slower (we need to query all DBs to get results) and makes enforcing table constraints more difficult (cross-database referential integrity is handled outside the database engine). Since round-robin is not working for us because we lose referential integrity this way, we need to route requests so that user X always goes to Database 1, so that all of user X’s data is located on Database 1, and the database engine can perform referential-integrity checks for that user.

One way to achieve this is to use the modulo operator. We can take the modulo of user_id and use the result to determine which database to map our user to. Here’s an example of how it can be done. We can take our ID, convert it to an integer, and perform a modulo operation on it.

It works nicely as long as your ids are evenly distributed. If our ids are evenly distributed, then each database receives the same number of users. Let’s check if our UUIDs are evenly distributed.

Pretty much evenly distributed, there seems to be some noise around the 2nd bucket, but it will smooth out as numbers increase.

Now what’s the problem with modulo hashing?

Problems with this approach start when we want to re-scale our database. Because when we rescale our database, instead of doing modulo 5 we do modulo 6 and records that were mapped to Database 1 are now going to be mapped to Database 6 and it will need to happen for many records.

% 5 -> 0
% 5 -> 1
% 5 -> 2

% 6 -> 1
% 6 -> 0
% 6 -> 1

Conclusion

Modulo hashing is simple and works well for static systems, but it becomes problematic when you need to scale. For some cases the number of records that need moving can go up as high as 93%. There are different ways of solving it. One way is to use consistent-hashing or lookup table.

Gold Forecast

2025-05-27T00:00:00+00:00

We can see that the gold movement roughly follows inflation. However, it also follows gold buying done by other countries.

For example, China has been increasing its gold reserves. It’s widely assumed that China does it to reduce risk of dependence on US dollar. It would decrease dependence on US dollar.

Launching Option Calculator

2025-01-13T00:00:00+00:00

Built calculatemyoptions.click website. It’s entirely hosted on AWS. It’s a SPA where static files are on S3 bucket and are served with CloudFront. Backend is R code hosted on Lambda. All of the infra is created/updated with AWS CDK.

In order to release this project, I had to figure out how to host R inside of a container and serve it with AWS Lambda. I’ve already done something similar in Running R on AWS Lambda, so I could re-use parts of learning from there and build on top of it.

Challenges

There were a couple of challenges that I encountered when working on this project:

R Libraries and Docker Image Size

Not all R libraries were available for AWS Lambda image, so I had to compile a couple of them from source code. When compiling, too many intermediate artifacts were created which put the final image over 10GB (Docker images hosted on AWS Lambda have a limit of 10GB [1]).

I reduced the size of the Lambda container by using multi-stage Docker build process and copying only compiled binaries into a final AWS Lambda image. I was able to go from 11GB to around 4GB, and I could run R container with all libs on AWS Lambda (yay).

Frontend Development

The second challenge was the frontend since I’ve never done it before. Luckily ChatGPT helped me setup the React template that I could then modify and shape.

Also, CloudFront was a bit tricky to configure, specifically for configuring routes to Lambda function and making sure that the SPA could talk to Lambda and work across Firefox and Chrome.

Testing and Release

After parts of the whole project had been configured, I did a couple of runs of integration testing and fixing. Once I checked that the skeleton and parts work together, I did a mini release on LinkedIn, to see what people say and if I can catch any errors with real traffic.

Takeaways

Overall, it was a fun learning experience, and now I have deployment templates that I can leverage for future projects as well as knowledge about how website hosting on AWS is done.

References

AWS Lambda container image size limits

Running R on AWS Lambda

2024-07-26T00:00:00+00:00

What’s AWS Lambda?

It’s a compute env managed by AWS. You can think about it as a service that has a while true loop that waits for incoming requests. When the request comes in, Lambda will call your code and pass a request to appropriate function.

How do I upload my R code to Lambda?

Ok, not so fast. We cannot upload R code to Lambda directly, because Lambda does not support R runtime. Here’s the list of supported runtimes. There’s a way to patch it, but you will keep running into issues when installing deps and you would need to do your own maintenance time to time. We don’t want that.

That’s why we are going to be using 🐳 Docker container to host the R env and when the request comes to Lambda, it will pass it to a running container.

Lambda would pull an image from AWS ECR (host for docker images) and then run that image when the request comes in.

So what’s the plan?

Create docker image that will have our R script and all the deps that it needs
Setup Docker Image Registry where you going to upload your images to
Configure Lambda to use it (To continue, check the code in the repo)

Check out the example repo: https://github.com/Bobrinik/r_on_lambda_example

Trigger your lambda from console

31 seconds of startup time (initial speedup is lengthy, might be ok or pretty bad depending on your use case)

Now, what are Lambda constraints?

Startup time:
- For my simple example it was around 31 seconds (it’s still the time that you pay for). The subsequent one is going to be much faster though, but still.
Timeout:
- 15min max of runtime
Memory:
- 10 GB
CPU:
- Proportional to memory; at 10GB it will give you around 6 vcpu

Taken from https://www.youtube.com/watch?v=rpL77KDN92Q

For price/power tuning: https://github.com/alexcasalboni/aws-lambda-power-tuning

References

Pandas Tips And Tricks For Finance

2024-06-16T00:00:00+00:00

What is about?

Here I’m tracking of the collection of useful functions for the analysis of time series with Pandas.

Correlation

Taken from Python for Finance, 2nd Edition

Python for Finance, 2nd Edition

rets.corr()
Out[56]:           .SPX      .VIX
         .SPX  1.000000 -0.804382
         .VIX -0.804382  1.000000
         

In [57]: ax = rets['.SPX'].rolling(window=252).corr(
                           rets['.VIX']).plot(figsize=(10, 6))
         ax.axhline(rets.corr().iloc[0, 1], c='r');

LLMs for clustering TO exchange tickers

2024-04-16T00:00:00+00:00

You can diversify portfolios across sectors. The idea is that each sector has different supply lines and revenue streams. So if something goes wrong with let’s say production of potash, it should not affect your tech sector.

I wanted to see if instead of using pre-defined sectors by some other organization; I can partition tickers based on their risk profile. For doing that, I could use knowledge compressed in OpenAI LLM.

So the idea is to use OpenAI embeddings of risks for clustering Toronto Exchange tickers. The hypothesis is to use those instead of sectors. If successful, it would allow to diversify across risks instead of volatility and expected return, or sectors.

Unfortunately, it didn’t work; I think the prompt or the way I was merging embeddings for risks was not ideal. Anyway, if someone wants to continue, the code is on GitHub.

View the notebook on GitHub →

How to download portfolio composition from Wealthsimple

2024-04-03T00:00:00+00:00

In short, people are asking for capabilities to export data from Wealthsimple so that they can track it in Excel or do some Python modelling. So far, the solutions are to either use Wealthica that is using some unknown API or some sort of a crawler to extract that information (you would need to give it your creds, not ideal) you would also need to pay for ability to download it from them or you can manually copy paste the information.

Solution

Grease Monkey is a popular browser extension that allows users to customize the functionality and appearance of websites they visit. It works with various web browsers, including Google Chrome, Mozilla Firefox, and others. Grease Monkey uses user scripts, which are small JavaScript programs, to modify the behavior of web pages. Grease Monkey works by injecting user scripts into web pages as they are loaded in your browser. - ChatGPT

The idea is to inject script into webpage that would add functionality which is lacking. That script would get necessary data from the loaded webpage and put it into a CSV. It would also add a download button to the webpage so that person could download it.

That’s how it looks.

// ==UserScript==
// @name          jQuery Example
// @require       https://cdnjs.cloudflare.com/ajax/libs/jquery/3.7.1/jquery.min.js
// ==/UserScript==

function getFormattedDate() {
    var dateObj = new Date();
    var year = dateObj.getFullYear();
    var month = ("0" + (dateObj.getMonth() + 1)).slice(-2); // getMonth() is zero-based
    var day = ("0" + dateObj.getDate()).slice(-2);

    return `${year}-${month}-${day}`;
}

window.onload = function() {
    setTimeout(function () {
      jQuery(document).ready(function($) {
          let downloadButton = document.createElement("button");
          downloadButton.innerHTML = "Download CSV";
          downloadButton.id = "csvButton";
          downloadButton.style.padding = "20px"; 
        
          document.body.insertBefore(downloadButton, document.body.firstChild);

          function generateCSV() {
              let separator = ",";
              let csvContent = [];
              let header = ['Security', 'Name', 'Total_Value', 'Quantity', 'All_Time_Return', 'Per_All_time_Return', 'Today_Price', 'Per_Today_Price'];
              
              csvContent.push(header.join(separator));
                          
              $("tbody tr").each(function () {
                  let row = [];
                  $(this).find("td").each(function () {
                      $(this).find("p").each(function() {
                          row.push($(this).text());
                      });
                  });
                
                  if(row.length == 9) {
                    row = row.slice(1);
                  }
                  console.log(row);
                  csvContent.push(row.join(separator));
              });
              return csvContent.join("\n");
          }

          document.getElementById("csvButton").addEventListener("click", function () {
              let accountName = $(".knseRw > div:nth-child(1)").text();
              let csvContent = generateCSV();
              var hiddenElement = document.createElement('a');
              hiddenElement.href = 'data:text/csv;charset=utf-8,' + encodeURI(csvContent);
              hiddenElement.target = '_blank';
              hiddenElement.download = accountName+'_portfolio_'+getFormattedDate()+'.csv';
              hiddenElement.click();
          });
      });
    }, 5000);
}

You can read more and follow instructions here.

Compute OHCL from Tick Data with Google BigQuery

2024-03-01T00:00:00+00:00

Pre-reqs to follow this tutorial

Know what’s gcloud bucket and how to copy files to it
Have gcloud tool configured on local
Know how to use Python
Know how to use bash

Getting data

Finnhub provides tick level data for TSX for couple of years that you can bulk download from 2021 up to last month.

You can download each one separately or use the script below to get everything

#!/bin/bash

TOKEN="YOUR_TOKEN"
DIR_NAME="./finnhub_data/"

for year in {2021..2023}; do 
    for month in {1..12}; do 
        # Get the redirect URL
        REDIRECT_URL=$(curl -s "https://finnhub.io/api/v1/bulk-download?exchange=to&dataType=trade&year=$year&month=$month&token=$TOKEN" | grep -oE 'href="proxy.php?url=[^"]+"' | cut -d'"' -f2)
        mkdir -p "$DIR_NAME"
        # Follow the redirect if a URL was found
        if [[ ! -z "$REDIRECT_URL" ]]; then
            curl -o "to_trade_$year-$month.tar" "$REDIRECT_URL"
            mv "to_trade_$year-$month.tar" $DIR_NAME
        fi

        sleep 1
    done
done

# Copy paste the code into file, say fetch_finnhub_archive.sh
chmod +x fetch_finnhub_archive.sh
./fetch_finnhub_archive.sh

Once you are done, you will end up with 94GB of files. Now let’s say you want to convert this to 1-min OHCL data. You can use pandas and do the processing, or you can use Google BigQuery to do that.

Compute OHCL with Google BigQuery

Untar files
You will end up with many small files that you can compress into bigger files
Upload bigger files to Google Bucket
Import files into BigQuery table
Compute OHCL from it and store results in a separate table
Export the ohcl table into Google Bucket
Download result to your local
Costs

Untar all of your tick archives

#!/bin/bash
for file in $1/*.tar; do
    # Extract the tar file into the directory
    echo "Extracting $file to $dir_name..."
    dir_name="./uncompressed/${file##*/}"
    mkdir -p $dir_name
    tar -xf "$file" -C "$dir_name"
done

# Copy and paste into a script called uncompress_finnhub_archive.sh
chmod +x uncompress_finnhub_archive.sh
./uncompress_finnhub_archive.sh ./finnhub_data

After you run this script and cd uncompressed/to_trade_2021-1 and run ls -hl. You will see something like this.

total 2.5M
drwx------ 2 user user 124K Jan  5  2021 2021-01-04
drwx------ 2 user user 120K Jan  5  2021 2021-01-05
drwx------ 2 user user 124K Jan  6  2021 2021-01-06
drwx------ 2 user user 116K Jan  7  2021 2021-01-07
drwx------ 2 user user 128K Jan  8  2021 2021-01-08
drwx------ 2 user user 128K Jan 12  2021 2021-01-11
drwx------ 2 user user 124K Jan 13  2021 2021-01-12
drwx------ 2 user user 124K Jan 14  2021 2021-01-13
drwx------ 2 user user 124K Jan 15  2021 2021-01-14
drwx------ 2 user user 124K Jan 15  2021 2021-01-15
drwx------ 2 user user 120K Jan 19  2021 2021-01-18
drwx------ 2 user user 120K Jan 19  2021 2021-01-19
drwx------ 2 user user 124K Jan 20  2021 2021-01-20
drwx------ 2 user user 120K Jan 21  2021 2021-01-21
drwx------ 2 user user 124K Jan 23  2021 2021-01-22
drwx------ 2 user user 128K Jan 26  2021 2021-01-25
drwx------ 2 user user 124K Jan 27  2021 2021-01-26
drwx------ 2 user user 124K Jan 27  2021 2021-01-27
drwx------ 2 user user 124K Jan 28  2021 2021-01-28
drwx------ 2 user user 124K Jan 31  2021 2021-01-29

How many files are there in total and what’s their average size?

find "uncompressed" -type f | wc -l
2490838
find "uncompressed" -type f -exec du -k {} + | awk '{sum += $1} END {print sum}'
12081404

❯ python3
>>> 12081404 / 2490838
4.85033711546074 # Kbs

What we see is that we have lots of small files, and it will take lots of time to upload each one separately to Google Cloud bucket for further processing.
Instead let’s collate those together into larger .csv files

To do this, let’s use the script below. Note, you need to install pandas and tqdm libraries.

import os
import pandas as pd
from tqdm import tqdm

for dir in tqdm(os.listdir("./uncompressed"), desc="Processing months"):
    try:
        for file in tqdm(os.listdir(f"./uncompressed/{dir}"), desc="Processing days"):
            tables = []
            file_name = f"./transformed/transformed_{dir}_{file}.csv"
            if os.path.exists(file_name):
                pass
            for asset in os.listdir(f"./uncompressed/{dir}/{file}"):
                symbol = asset.split(".csv.gz")[0]
                df = pd.read_csv(f"./uncompressed/{dir}/{file}/{asset}", compression='gzip')
                df["symbol"] = symbol
                tables.append(df)

            df = pd.concat(tables)
            os.makedirs("./transformed", exist_ok=True)
            df.to_csv(file_name)
    except Exception as e:
        print(e)
        print("Skipping")

So how many files do we have now?

find "transformed" -type f | wc -l
 749

As you can see, we have fewer files and those files are much bigger. Now, it’s more manageable to load everything into Google bucket and process it with BigQuery.

At this point, you are going to have to upload multiple files to a bucket from local by using the following:

gsutil -m cp -r transformed gs://your-bucket-datalake/finnhub_transformed

Depending on your upload speed, it might take some time to upload. You can do all of the above steps on Google Compute, and the upload speed from Google Compute to Google Bucket will not be an issue.

Import files into BigQuery

Create a dataset in BigQuery
Create a table and specify path to a location on Google Storage bucket that contains all of the uncompressed files: my-bucket-names/finnhub_transformed/*
Don’t forget to enable Schema Auto Detect

Compute OHCL from it and store results in a separate table

Now that our data is within the BigQuery table, we can use BigQuery SQL to compute OHCL.

CREATE TABLE trade_data.one_minute_ohcl AS

WITH MinuteRounded AS (
  -- This subquery rounds timestamps to the nearest minute
  SELECT
    TIMESTAMP_TRUNC(TIMESTAMP_MILLIS(timestamp), MINUTE) AS minute_timestamp,
    symbol,
    price,
    volume,
    timestamp  -- Include the raw timestamp
  FROM
    trade_data.tick_data
),

AggregatedData AS (
  SELECT
    minute_timestamp,
    symbol,
    FIRST_VALUE(price) OVER w AS open,
    MAX(price) OVER w AS high,
    MIN(price) OVER w AS low,
    LAST_VALUE(price) OVER w AS close,
    SUM(volume) OVER w AS volume
  FROM
    MinuteRounded
  WINDOW w AS (
    PARTITION BY symbol, minute_timestamp
    ORDER BY timestamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  )
)

SELECT
  minute_timestamp,
  symbol,
  open,
  high,
  low,
  close,
  volume
FROM
  AggregatedData
GROUP BY 
  minute_timestamp, symbol, open, high, low, close, volume
ORDER BY 
  symbol, minute_timestamp;

Once the above command runs, you are going to have another table called one_minute_ohcl that you can export to bucket in the UI. Note that you might receive an error saying that the export should happen into the bucket which is within the same region that you read data from. The error will also tell you where your bucket needs to be. To resolve this you can create a new bucket with correct region.

Costs

Finnhub subscription $149.97 USD for a quarter (can’t have lower than that)
[Optional] ~3hr of compute for downloading and processing data ~ 5 USD max
Big Query is going to be free since you are going to fall into free tier with this data volume

Predicting the winner of Kentucky Derby

2022-05-18T00:00:00+00:00

There is a horse race called Kentucky Derby. People are betting on the outcomes of this race. Let’s do an analysis to see if we can get an edge over other people.

The Kentucky Derby is one of the most prestigious horse racing events in the world, attracting millions of viewers and bettors alike. With so much money on the line, can data analysis give us an advantage over the average bettor?

In this analysis, we’ll explore historical data, track conditions, horse statistics, and other factors that might influence race outcomes.

Read the full notebook on Wolfram Community →