Project Laminar
Thu, 05 Jan 2023

Schwab Intelligent Portfolios had a rough year in 2022. We did slightly better with Project Laminar, our portfolio rebalancing bot, which runs on the Alpaca platform and trades with real money.

For obvious reasons, the code will not be open-sourced any time soon, but this post explains how it works at a high level.

[Figure: The performance of my Schwab Intelligent Portfolio account from August to December 2022. In the same period, Project Laminar lost only 1.7%, so I guess we count that as a win? Still not as good as just putting all your money in an index fund though, so fingers crossed for 2023!]

The entire project is composed of Luigi tasks, which lets us build a data pipeline whose complex dependencies are resolved automatically. This makes the system easier to reason about and ensures that no future data leaks into our models.

Our data flow diagram looks something like this:

T = time

IngestNYT(T)
IngestOHLC(T)

(IngestNYT(0...T-1), IngestOHLC(0...T-1)) -> TrainModel(T)

IngestOHLC(T-1) -> EstimateCovariance(T)
(IngestNYT(T-1), IngestOHLC(T-1), TrainModel(T)) -> PredictReturns(T)

(EstimateCovariance(T), PredictReturns(T)) -> OptimizePortfolio(T)
OptimizePortfolio(T) -> RunLaminar(T)
  • IngestNYT(T), IngestOHLC(T) - These tasks download raw data (NYTimes articles and OHLC price bars) from external APIs.
  • TrainModel(T) - This trains a model using only data from T-1 and earlier. It relies on Luigi's dependency resolution to fetch the appropriate data, so we never have to reason about whether future data is leaking into the model.
  • EstimateCovariance(T), PredictReturns(T) - These estimate the covariance of returns and the expected returns, respectively.
  • OptimizePortfolio(T) - This uses (mu, sigma) to generate several Pareto-efficient portfolios with varying levels of risk, then selects between them based on market conditions and sentiment extracted from the NYTimes (see the sketch after this list).
  • RunLaminar(T) - This examines the current portfolio on Alpaca (using real money!) and figures out how to achieve the new optimal portfolio.
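
Since PyPortfolioOpt appears in the references below, the Pareto-efficient portfolios in OptimizePortfolio(T) can be sketched roughly as follows; the specific risk levels, the function name, and the selection logic are assumptions, not details from the post:

from pypfopt import EfficientFrontier

def pareto_portfolios(mu, sigma, risk_levels=(0.10, 0.15, 0.20)):
    # One maximum-return portfolio per target volatility; the actual
    # risk levels used by Laminar are not disclosed in the post.
    portfolios = []
    for target in risk_levels:
        ef = EfficientFrontier(mu, sigma)
        ef.efficient_risk(target_volatility=target)
        portfolios.append(ef.clean_weights())
    return portfolios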

At the end of the day, we simply call RunLaminar(Now()) with the current timestamp, and it backfills all the dependencies as necessary, using cached results wherever possible.
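
To make that structure concrete, here is a minimal sketch of what two of these tasks could look like in Luigi. This is illustrative rather than the project's actual code, and trading_days_before is a hypothetical helper:

import luigi

class IngestOHLC(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # Completed tasks are cached on disk, so backfills reuse them.
        return luigi.LocalTarget(f"data/ohlc/{self.date}.parquet")

    def run(self):
        ...  # download OHLC bars from the external API

class TrainModel(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # Depend only on days strictly before `date`; the dependency
        # graph itself then guarantees no future data reaches the model.
        return [IngestOHLC(date=d) for d in trading_days_before(self.date)]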

We build a Docker container containing all the dependencies and push it to AWS Batch, which then schedules a job every Monday, Wednesday, and Friday to rebalance our portfolio. After every rebalancing, the system sends us an automated email with the current portfolio so we can do a quick sanity check.
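
For the notification step, a sketch using SendGrid's Python client (SendGrid also appears in the references) might look like the following; the addresses, subject line, and formatting are invented for illustration:

import os
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail

def email_portfolio(holdings):
    # Hypothetical helper: `holdings` maps ticker -> portfolio weight.
    body = "\n".join(f"{sym}: {w:.1%}" for sym, w in holdings.items())
    message = Mail(
        from_email="laminar@example.com",  # placeholder addresses
        to_emails="me@example.com",
        subject="Laminar rebalance complete",
        plain_text_content=body,
    )
    SendGridAPIClient(os.environ["SENDGRID_API_KEY"]).send(message)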

This AWS job has been running on this schedule for over four months now and costs only tens of cents a month, so we plan to keep it running for the foreseeable future.

References

This project was powered by:

  • Alpaca. Commission-free trading API.
  • SendGrid. Easily send email notifications.
  • Luigi. Lightweight library for building Python data pipelines.
  • PyPortfolioOpt. Efficient implementation of common portfolio optimization strategies.
  • Scikit-learn, PyTorch, and other standard ML libraries.

AO3 Disco: The Road to v1.0
Tue, 03 Jan 2023

On January 1st, 2023, I decided to start building the AO3 Discovery Engine. The motivation for this project was threefold:

  • Personal usage. I wanted to build something like this for personal use, as opposed to digging through bookmarks/tags on AO3.
  • Learning modern mobile app development. I haven't touched mobile apps since high school, back when Windows Phone was a thing, and wanted to get back into it.
  • Practical applications for research. I spend a lot of time reading research papers, and this project would put many of those ideas to practical use.

This post will summarize the path from the original idea to the release of AO3 Disco v1.0 on the Google Play store and where I plan to go from here.

Web → Mobile

At the very beginning, I planned to build a web application that would allow people to build collections of works and then, for each collection, get suggestions for additional works that fit its "theme". We actually got fairly far along in this process before discovering two issues:

  1. According to my sample size of three - we conducted a very scientific and methodical survey - most people read fanfics on their mobile devices, not their laptops.
  2. Building collections of works was tedious - you had to repeatedly copy and paste URLs. The alternative was asking users to provide their AO3 username/password, which wasn't going to happen.

These discoveries - combined with the fact that I was interested in learning mobile development - led me to pivot to building a mobile app. The key feature I was looking for? The ability to "share" a work from your browser to the app and get recommendations - no copying and pasting needed.

Then, upon discovering that the registration fee for submitting an app to Android's Play Store was only $25 while Apple's App Store charges $99, I decided to start with an Android app.

And that's how AO3 Disco was born.

Android App

When building the mobile app, I encountered several false starts. I started by trying to build the app using Jetpack Compose, the latest framework for building mobile apps officially supported by Google. As it turns out, I have no clue how to use Kotlin.

After rapidly cycling through a bunch of frameworks - Java for Android, Flutter, Xamarin, etc. - and concluding that they were all really hard to use, I finally settled on Ionic/Capacitor, a framework that would allow me to implement most of the app using web technologies, reaching for native plugins only for the tricky stuff.

[Figure: An early sketch of what the app could have looked like, drafted in Figma.]

Having finally decided upon a tech stack for the front-end of the app itself, I proceeded to blunder my way through implementing an MVP which provided two capabilities: sharing a work from a browser and allowing users to scroll through a deck of recommendations.

[Figure: The MVP, which was released to a small group of testers after posting in r/TheCitadel.]

I posted in a relatively small subreddit (r/TheCitadel) asking users to give our app a try and received lots of actionable feedback. This led us to v1.0 of the app, which added new features such as bookmarks, history, filters, snoozing, and more.

[Figure: Our listing on the Google Play store after two weeks.]

This was released publicly a week later and posted to several subreddits, including r/rational, where I received a lot of awesome technical questions, leading me to draft this post.

Discovery Engines

Currently, the app offers two discovery engines that users can choose from. Eventually, I hope to combine them into a single optimized engine by adding a second-stage model, but since I haven't yet figured out how to balance the trade-offs, it's currently up to the user to choose which experience they prefer:

  • Classic. Given a specific work, this engine looks at other users who also enjoyed that work, then looks at all of the other works those users enjoyed, and uses those co-occurrence counts as scores to produce a set of recommendations.
  • Freeform. Given a specific work, this engine uses a neural network to transform it into a 200-dimensional embedding vector. Then, it looks for other works whose vector representations are closest to the specified vector.

Classic

The benefits of the classic model are clear:

  • Faster + lower cost to run. This can be implemented (mostly) as a SQL query which can be easily optimized with various indexes.
  • Consistent / reliable results. The recommendations only change if the underlying data changes a lot; furthermore, by definition, this will only give you "reasonable" recommendations.
  • Easy to debug. If you get some really bad recommendations, it's easy to dig into the data and figure out what went wrong. And usually, the answer to that is that the system simply didn't have enough data.

This classic model is also very similar to what many others have tried (e.g. when looking for similar systems, I found a desktop app plus some Jupyter notebooks that do this exact thing). The drawbacks, however, are fairly significant:

  • Popularity bias. This approach would never recommend works that aren't already well-known. If someone writes an incredibly high quality work, I want to see it immediately, not after everyone else has already seen it.
  • Unknown works. If the user provides a work that we don't have much data on (i.e. a new work that no one has left kudos on or bookmarked), then we can't generate any recommendations.
  • Cross-site recommendations. This approach cannot be extended to work across multiple fanfiction sites (e.g. AO3 and FF.net). Although we currently only support AO3, being able to start from an AO3 work and recommend works on FF.net would be really awesome; it poses both a modeling challenge and an infrastructure challenge.

To mask some of these drawbacks, the current system falls back to the freeform model whenever the classic model is unable to come up with any recommendations.
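
The post doesn't show the actual query, but the co-occurrence idea behind the classic engine can be sketched as a single SQL statement over a hypothetical bookmarks(user_id, work_id) table - not the app's real schema:

# Rough sketch of the "classic" engine (hypothetical schema).
CLASSIC_QUERY = """
SELECT b2.work_id, COUNT(*) AS score
FROM bookmarks AS b1
JOIN bookmarks AS b2 ON b1.user_id = b2.user_id
WHERE b1.work_id = :query_work AND b2.work_id <> :query_work
GROUP BY b2.work_id
ORDER BY score DESC
LIMIT 20;
"""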

Freeform

The freeform model is designed to overcome these limitations. Instead of relying on the user-work connections, we use a neural network that analyzes each work independently and generates a vector embedding. Then, to get recommendations, we can efficiently find the nearby vectors using a library such as Spotify's annoy.
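
For example, building and querying the index with annoy might look like this sketch; work_embeddings and query_item_id are hypothetical names, and the tree count is a typical default rather than anything from the post:

from annoy import AnnoyIndex

EMBEDDING_DIM = 200  # matches the embedding size described above

index = AnnoyIndex(EMBEDDING_DIM, "angular")
for item_id, vector in enumerate(work_embeddings):
    index.add_item(item_id, vector)
index.build(50)  # more trees improve recall at the cost of index size

# The nearest neighbors of a query work are its recommendations.
similar = index.get_nns_by_item(query_item_id, 20)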

There are three classes of features which are passed to the model:

  • Dense features. The dense features include things such as the number of chapters, number of words, and whether the work is completed.
  • Sparse features. The sparse features include things such as the fandom tags, relationship tags, and "additional" tags. Several of the tag types - relationships, for example - have very high cardinality, so we make use of the hashing trick [1] (see the sketch after this list).
  • Embedding features. The embedding features include things such as a document embedding of the work's "summary", generated via fastText. These models are pre-trained and are not optimized as part of our system.
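
As a concrete illustration of the hashing trick mentioned above, each tag string can be mapped into a fixed-size bucket space with a stable hash; the bucket count here is an assumption, not a number from the post:

import hashlib

NUM_BUCKETS = 2 ** 18  # assumed size; the post doesn't state one

def tag_to_bucket(tag: str) -> int:
    # Use a stable hash (unlike Python's built-in hash(), which is
    # randomized per process) so a tag maps to the same bucket
    # across training and serving. Collisions are possible but rare.
    digest = hashlib.md5(tag.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % NUM_BUCKETS

indices = [tag_to_bucket(t) for t in ["Fluff", "Hurt/Comfort"]]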

The architecture of the freeform model is designed to combine these three feature types together into a single embedding vector:

[Figure: The freeform model architecture. TODO: Replace this picture of a sticky note on my desk with a nicer sketch.]

This architecture is quite similar to those proposed in works such as [2, 3]. We currently train this model using a modified triplet loss [4], which aims to pull works in the same collection closer together while pushing away randomly sampled works that don't belong to the collection.
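
The exact modification isn't specified in the post, but a standard triplet loss of the kind described in [4] looks like this; the margin value and distance choice are assumptions:

import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor/positive: embeddings of two works from the same collection;
    # negative: the embedding of a randomly sampled outside work.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # Loss reaches zero once the negative is at least `margin` farther away.
    return F.relu(d_pos - d_neg + margin).mean()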

Of course, this approach has drawbacks as well. On several occasions - prior to building a robust validation system - I published a model that would spit out random garbage, and debugging it required dumping the model parameters, inspecting the gradients, and manually checking the embedding vectors.

Furthermore, this approach is much more computationally expensive and greatly increases both the latency and the cost of the servers needed to run AO3 Disco.

Future Work

First, I have a long list of planned improvements to the existing Android app, ranging from allowing users to export their recommendations to making it possible to filter on custom tags. In addition, I plan to make upgrades to the "freeform" model and improve the quality of recommendations overall.

[Figure: TODO: Replace this picture of sticky notes on my wall with an actual plan.]

After all of that though, here are the big new directions that I would like to explore:

  • iOS. Since most of the code can be re-used on iOS, I plan to try building an iOS version. However, I'm not sure I want to spend $99 on an Apple developer license unless the Android version already has a significant number of users.
  • Web. Not a huge fan of this one, to be honest, since it would require integrating a bunch of services and dealing with various security issues (e.g. authentication, server-side storage, etc.) while also rewriting a bunch of stuff from scratch. But it might happen.
  • FanFiction.net. Super excited about this, but it poses significant technical and modeling challenges. Without tags, we'll have to get very good at natural language processing and use techniques such as domain adaptation [5] so it can play nicely with AO3.
  • Open source. At some point, I plan to refactor the code and open source most of it. However, everything is currently dumped in one giant mono-repo, and I may or may not have git-committed some important access keys and certificates because I was too lazy to configure environment variables ☺, so it will require some time and effort to untangle.

References

[1] https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f

[2] https://quoraengineering.quora.com/Unifying-dense-and-sparse-features-for-neural-networks

[3] https://arxiv.org/pdf/1906.00091.pdf

[4] https://towardsdatascience.com/triplet-loss-advanced-intro-49a07b7d8905

[5] https://paperswithcode.com/task/domain-adaptation

EchoFlow
Fri, 18 Dec 2020

This library provides tools for generative modeling of tabular datasets using normalizing flows. Some of its core features include:

  • Density estimation.
  • Joint + conditional sampling.
  • Categorical + continuous values.
  • Normalizing flows from RealNVP and MAF.

To get started with EchoFlow, check out our documentation!

Motivating Example

Let us start by considering a simple tabular dataset containing two columns which, when plotted, form a spiral. Our goal will be to train a generative model on this dataset and then sample from it to create a "synthetic" copy of the dataset. Some of the tools for accomplishing this include:

  • Copulas. This library uses copula functions, a classical statistical method which is widely used in finance.
  • CTGAN. This library uses generative adversarial networks, a deep learning-based method which has notably been used to generate photo-realistic images.
  • And, of course, EchoFlow. Our library implements normalizing flows, which uses specialized neural networks to transform probability distributions.

We applied each of these methods to our spiral dataset, generated 1000 samples, and visualized the results below.

[Figure: The spiral dataset alongside synthetic copies of the dataset generated using Copulas, CTGAN, and EchoFlow.]

As shown in the above figure, EchoFlow produces significantly higher quality samples than either Copulas or CTGAN. In the following sections, we will (1) introduce some of the key concepts and math behind normalizing flows and (2) demonstrate some of the core functionality provided by the EchoFlow library.

Normalizing Flows

At a high level, normalizing flows transform random variables with invertible neural networks, applying the change-of-variables formula to the probability density functions.

Invertible Neural Networks

The most important property of a normalizing flow is that it must be invertible. In practice, this means that each layer of the neural network must be invertible so that the whole neural network can be inverted.

This property is critical because the direct pass - \(f(x)\) - is used to map the input distribution to the prior distribution and the inverse pass - \(f^{-1}(z)\) - is used to map the prior distribution back to the target distribution.

For example, given a trained network, the direct pass could be used to map your tabular data to a standard multivariate normal distribution to evaluate the log-likelihood. Using the same network, the inverse pass could be used to map random noise sampled from the multivariate normal distribution into samples that resemble the original tabular data.

For those of you who are familiar with variational auto-encoders (VAE), these ideas may sound familiar - the direct pass is essentially the encoder in a VAE while the inverse pass is essentially the decoder. However, with normalizing flows, these two networks are combined into one invertible neural network.

Change of Variables

Suppose you have a random variable \(x\) which has distribution \(p_x(x)\). If you apply a function \(z = f(x)\), then the random variable \(z\) has the following distribution:

\[ p_z(z) = p_x(x) \bigg| \det \frac{\partial f}{\partial x} \bigg|^{-1} \]

Normalizing flows use a neural network as the function \(f\) and apply this change-of-variables formula repeatedly to get from the input distribution to the prior distribution. The loss function is then simply the negative log-likelihood of the data. Therefore, in addition to being invertible, our neural network must be designed so that the determinant of the Jacobian is easy to compute.
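
Concretely, rearranging the change-of-variables formula above gives the training objective:

\[ -\log p_x(x) = -\log p_z(f(x)) - \log \bigg| \det \frac{\partial f}{\partial x} \bigg| \]

where \(p_z\) is the prior (e.g. a standard multivariate normal); training minimizes this quantity averaged over the dataset.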

The general strategy used to achieve this is to use neural networks that have triangular Jacobian matrices. This corresponds to a model where, assuming the network has N inputs/outputs, the \(i\)th output depends only on the first \(i\) inputs, so the determinant is just the product of the diagonal entries. By using this type of autoregressive structure in each layer of the network, the determinants of the per-layer Jacobians can be multiplied together to compute the likelihood.
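
As an illustration - not EchoFlow's internal code - a RealNVP-style affine coupling layer satisfies both requirements: it is trivially invertible, and its log-determinant is just the sum of the predicted scales. This sketch assumes an even input dimension:

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        # Predicts a scale s and shift t for the second half of the
        # input, conditioned only on the first half.
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        z2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)  # triangular Jacobian: product of exp(s)
        return torch.cat([x1, z2], dim=-1), log_det

    def inverse(self, z):
        z1, z2 = z.chunk(2, dim=-1)
        s, t = self.net(z1).chunk(2, dim=-1)
        x2 = (z2 - t) * torch.exp(-s)
        return torch.cat([z1, x2], dim=-1)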

Visualizing Flow Layers

One way to gain insight into how normalizing flows work is to visualize the output of each layer. The example below shows the output of each layer in a normalizing flow network being trained on the spiral dataset with a Gaussian prior.

[Figure: The values as they pass through each layer of the neural network.]

The input is a standard multivariate normal. Each layer transforms the distribution until it approaches the target distribution, which resembles a spiral.

Introducing EchoFlow

The EchoFlow library implements normalizing flows using PyTorch but also provides additional functionality to make it easier to apply to real-world tabular datasets. To get started with EchoFlow, you can install it with pip by running:

pip install echoflow
python -c "import echoflow; print(echoflow.__version__)"

Then, you can load the spirals dataset and train an EchoFlow model as follows:

from echoflow import EchoFlow
from echoflow.demo import load_dataset

model = EchoFlow()
model.fit(load_dataset())  # fit the flow to the spiral dataset
synthetic = model.sample(num_samples=10)  # generate 10 synthetic rows

You can pass any DataFrame to the fit method; the sample method will yield a new DataFrame containing the synthetic data with the specified number of rows. For advanced usage including conditional sampling, custom transformers, and more, check out our documentation here!

Benchmarks

EchoFlow uses the SDGym library for benchmarking. Using the default models - RNVP and MADE - we obtain better results than the CTGAN model across a variety of simulated datasets.

[Figure: SDGym benchmark results.]

Currently, EchoFlow does not outperform the baseline on several real world datasets, largely due to sub-optimal handling of categorical values. We are looking into improving support for categorical variables through methods such as discrete normalizing flows and treating them as external conditioning variables as in CTGAN.

Hello World
Sat, 12 Dec 2020

Welcome to kevz.dev!
