Inspired by recent advances in generative AI, I thought it would be cool to train a model to generate the genomic sequences of viruses. (Now with 100% more inpainting!)
Virus genomes were downloaded from NCBI databases and one-hot encoded, representing each genome as a 5-channel, one-dimensional vector. A diffusion model was then trained to generate a genome through an iterative denoising process.
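The encoding above can be sketched as follows. This is a minimal, hypothetical version assuming the five channels are A, C, G, T plus an N channel for ambiguous bases (the writeup doesn't spell out the channel layout):

```python
# Hypothetical 5-channel layout: A, C, G, T, and N for anything ambiguous.
CHANNELS = "ACGTN"

def one_hot_encode(sequence):
    """Map a genome string to a list of 5-element one-hot vectors
    (one vector per base, i.e. a 5-channel 1-D representation)."""
    encoded = []
    for base in sequence.upper():
        vec = [0.0] * len(CHANNELS)
        # Any base outside A/C/G/T falls into the N channel.
        idx = CHANNELS.index(base) if base in CHANNELS else CHANNELS.index("N")
        vec[idx] = 1.0
        encoded.append(vec)
    return encoded

print(one_hot_encode("ACN"))
# → [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0]]
```

In practice the resulting list would be converted to a tensor (e.g. shape `(genome_length, 5)`) before being fed to the model.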
In doing this project I learned how to programmatically access NCBI data, and how to implement the underlying theory of diffusion models, such as sinusoidal embeddings and U-Nets.
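The sinusoidal embedding mentioned above encodes the diffusion timestep (or noise level) as a vector the network can condition on. A minimal sketch in plain Python, assuming the standard transformer-style formulation with geometrically spaced frequencies (`dim` and `max_period` are illustrative parameters, not taken from the project):

```python
import math

def sinusoidal_embedding(t, dim=8, max_period=1000.0):
    """Encode a scalar timestep t as a dim-length vector of sines and
    cosines at geometrically spaced frequencies. dim must be even."""
    half = dim // 2
    # Frequencies decay from 1 down to 1/max_period across the half-dim.
    freqs = [max_period ** (-i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = sinusoidal_embedding(10.0, dim=8)
```

Because nearby timesteps map to nearby vectors, the U-Net can smoothly adjust its denoising behavior across the diffusion schedule.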
It was challenging to find a balance between maximum genome size and model training performance, i.e. filtering for smaller genomes allowed quicker, more feasible training but significantly reduced the number (and consequently the variance) of genomes the model could be trained on. It was also tricky to avoid resource-exhausted errors, since the genomes are upwards of a few thousand base pairs long. In the end I opted to use genomes of 25000 base pairs or fewer, leaving me with 334 genomes to train the model on. Data augmentation for genomic sequences is still an open area of research, so I omitted any dataset enhancement.
I think this project qualifies for hackiest hack because I had to completely disregard best practices in machine learning to get it working in time. There was no exploratory data analysis, so I'm using raw sequences straight from the server; there's no model validation, and I just guessed the model architecture and hyperparameters. So yeah, this is a very hacky AI project.
Built With
- keras
- python
- tensorflow