Deniz Qian https://dq2024.github.io/ Tue, 10 Mar 2026 03:40:45 +0000 Jekyll v3.10.0 Venture AI <div class="project-content layout-centered"> <h4>Venture AI: AI-Powered Travel Planner</h4> <p> Venture AI automates the entire travel-planning process, generating personalized, day-by-day itineraries based on user preferences and real-time data. We synthesized a structured dataset of 2,000 itinerary prompt–response pairs via model distillation and fine-tuned open-source large language models (Falcon, LLaMA, Mistral, FLAN-T5) using parameter-efficient LoRA adapters. By integrating a FAISS vector store over live point-of-interest (POI) APIs, the system retrieves up-to-date POI details in a Retrieval-Augmented Generation (RAG) pipeline to ground the LLM in factual travel information. </p> <h4>Machine Learning Techniques</h4> <ul style="margin-top: -0.3em; margin-bottom: 1em;"> <li>Model distillation to create a 2,000-entry synthetic dataset of structured prompt–response pairs from Wikivoyage data and 4o‑mini outputs.</li> <li>PEFT fine-tuning (LoRA) of open-source LLMs, employing 8‑bit quantization and FP16 mixed‑precision training to reduce compute cost.</li> <li>Vector datastore construction using FAISS (and optionally Chroma) to index real‑time POI embeddings for low-latency semantic retrieval.</li> <li>RAG integration that prepends retrieved POI context to user prompts, enhancing the model’s factual accuracy and reducing hallucination.</li> <li>Distributed training via PyTorch DDP on multiple GPUs and hyperparameter sweeps in Weights &amp; Biases for reproducible optimization.</li> </ul> <h4>Results</h4> <p> Our RAG‑aware, LoRA‑fine‑tuned models outperformed base LLMs on metrics such as BLEU, Self-BLEU, and perplexity. We observed an average 38% reduction in perplexity and a 2.3x higher BLEU score after fine-tuning. 
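To illustrate the retrieval step behind these results, here is a minimal sketch of RAG-style context injection. Toy 3-D vectors and a brute-force cosine search stand in for the real FAISS index and live POI APIs; the function names, POI entries, and embeddings are all hypothetical.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_poi_context(query_vec, poi_index, k=2):
    """Return the k POI entries whose embeddings are closest to the query."""
    ranked = sorted(poi_index, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    return [p["text"] for p in ranked[:k]]

def build_rag_prompt(user_prompt, query_vec, poi_index):
    """Prepend retrieved POI context to the user's itinerary request."""
    context = "\n".join(retrieve_poi_context(query_vec, poi_index))
    return f"Context:\n{context}\n\nRequest:\n{user_prompt}"

# Toy 3-D "embeddings" standing in for FAISS-indexed POI vectors.
poi_index = [
    {"text": "Louvre Museum: open 9am-6pm, closed Tuesdays", "vec": [0.9, 0.1, 0.0]},
    {"text": "Eiffel Tower: timed-entry tickets recommended", "vec": [0.8, 0.2, 0.1]},
    {"text": "Tokyo Skytree: observation deck", "vec": [0.0, 0.1, 0.9]},
]
prompt = build_rag_prompt("Plan one day of museums in Paris.", [1.0, 0.0, 0.0], poi_index)
```

The grounded prompt is then what the fine-tuned model actually sees, which is why the retrieved context can suppress hallucinated POI details.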
Qualitative ablation studies confirmed that itineraries respect budget, timing, and preference constraints while adapting seamlessly to real‑time updates. </p> <p> This end-to-end pipeline, from synthetic data generation through PEFT fine‑tuning, vector retrieval, and benchmark evaluation, provides a scalable foundation for next‑generation AI travel planners. Next steps for this project would be to evaluate against the TravelPlanner and RAGAs benchmarks. Full implementation details and quantitative findings are available in the paper and on GitHub. </p> </div> Sun, 15 Dec 2024 00:00:00 +0000 https://dq2024.github.io/blog/venture-ai/ https://dq2024.github.io/blog/venture-ai/ JEPA Navigator <div class="project-content layout-centered"> <h4>Self-Supervised JEPA World Model for Two-Room Navigation</h4> <p> This project explored building a self-supervised Joint Embedding Predictive Architecture (JEPA) to learn representations of an agent navigating a simple two-room environment. The goal was to train a model that could predict future latent representations based solely on current observations and actions, without explicitly reconstructing pixel frames. We trained our JEPA models on 2.5 million frames of agent trajectories, ultimately learning structured latent spaces that capture spatio-temporal dynamics and can be probed to recover the agent's true (x, y) coordinates. 
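To make the recurrent vs. teacher-forcing distinction concrete, here is a toy pure-Python sketch with scalar latents. The stand-in encoder and predictor (given a deliberate bias so the two modes differ) are illustrative only, not the project's actual CNN encoder and MLP predictor.

```python
def encode(obs):
    """Stand-in encoder mapping a frame to a latent code (a real JEPA uses a CNN)."""
    return 0.5 * obs

def predict(latent, action):
    """Stand-in dynamics predictor with a small systematic bias,
    so the two rollout modes behave differently (a real JEPA uses an MLP)."""
    return latent + 0.5 * action + 0.1

def rollout(obs_seq, actions, teacher_forcing):
    """Predict a latent for each future step.

    Teacher forcing re-encodes the true next frame before every prediction;
    the recurrent variant feeds its own previous prediction back in, so
    prediction errors compound over the horizon."""
    preds, latent = [], encode(obs_seq[0])
    for t, action in enumerate(actions):
        latent = predict(latent, action)
        preds.append(latent)
        if teacher_forcing:
            latent = encode(obs_seq[t + 1])  # reset to the true latent
    return preds

obs = [0.0, 0.0, 0.0, 0.0]   # a stationary agent: every true latent is 0
acts = [0.0, 0.0, 0.0]
tf_preds = rollout(obs, acts, teacher_forcing=True)    # per-step error stays at 0.1
rec_preds = rollout(obs, acts, teacher_forcing=False)  # error compounds each step
```

This compounding error in the recurrent mode is exactly what long-horizon evaluation probes for.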
</p> <h4>Machine Learning Techniques</h4> <ul style="margin-top: -0.3em; margin-bottom: 1em;"> <li>Implemented both recurrent and teacher-forcing JEPA variants with a ResNet-based CNN encoder to embed each frame into 128-D latent vectors.</li> <li>Used a lightweight MLP to fuse these embeddings with action vectors, predicting the next latent state to model environment dynamics.</li> <li>Optimized an energy-based objective by minimizing MSE between predicted and target encoder outputs, supported by a contrastive margin loss to encourage separation in representation space.</li> <li>Applied VICReg-inspired variance and covariance regularization to prevent representational collapse and maintain feature diversity.</li> <li>Stabilized training by introducing a momentum-updated target encoder (similar to BYOL) for consistent gradient signals.</li> <li>Employed a quadratic warm-up with cosine decay for learning-rate scheduling, and ran extensive hyperparameter sweeps using Weights &amp; Biases for reproducibility.</li> <li>Accelerated training and handled large-scale data with PyTorch’s Distributed Data Parallel (DDP) across multiple GPUs, allowing us to process over 2.5 million frames efficiently.</li> </ul> <h4>Results</h4> <p> The final model demonstrated a promising ability to predict future latent states, as evidenced by probing experiments with a 2-layer MLP to recover ground-truth (x, y) coordinates. We evaluated performance on standard trajectories, wall-collision sequences, long-horizon prediction, and out-of-domain test sets—reporting mean squared errors that confirmed the quality of our learned representations. </p> <p> Although we faced challenges with instability and early overfitting (mitigated by gradient clipping, adjusted learning rates, and robust regularization), our best models showed clear improvements over initial baselines. Future work would include scaling to richer environments and experimenting with more advanced self-supervised losses. 
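The anti-collapse regularizers from the techniques list can be sketched in plain Python. This is a schematic, NumPy-free version of VICReg-style variance and covariance penalties over a batch of latent vectors; the example batches are made up.

```python
from math import sqrt

def vicreg_penalties(batch, gamma=1.0, eps=1e-4):
    """Variance and covariance penalties on a batch of latent vectors.

    batch: list of equal-length vectors (rows = samples, cols = latent dims).
    The variance term hinges each dimension's std toward at least `gamma`;
    the covariance term pushes off-diagonal covariances toward zero.
    """
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in batch]

    # Variance penalty: hinge on the (regularized) std of each dimension.
    stds = [sqrt(sum(c[j] ** 2 for c in centered) / (n - 1) + eps) for j in range(d)]
    var_pen = sum(max(0.0, gamma - s) for s in stds) / d

    # Covariance penalty: sum of squared off-diagonal covariances, scaled by d.
    cov_pen = 0.0
    for j in range(d):
        for k in range(d):
            if j != k:
                cov_jk = sum(c[j] * c[k] for c in centered) / (n - 1)
                cov_pen += cov_jk ** 2
    cov_pen /= d
    return var_pen, cov_pen

# A collapsed batch (all identical vectors) is heavily hit by the variance term.
collapsed = [[1.0, 2.0]] * 4
diverse = [[1.0, -1.0], [-1.0, 1.0], [2.0, 0.5], [-2.0, -0.5]]
var_collapsed, _ = vicreg_penalties(collapsed)
var_diverse, _ = vicreg_penalties(diverse)
```

Penalizing low per-dimension variance is what blocks the trivial solution where the encoder maps every frame to the same latent.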
</p> </div> Sun, 15 Dec 2024 00:00:00 +0000 https://dq2024.github.io/blog/JEPA/ https://dq2024.github.io/blog/JEPA/ stockXarb <div class="project-content layout-centered"> <h4>Embedding‑Driven Clustering for Statistical Arbitrage</h4> <p> In this project, I sought to uncover latent arbitrage opportunities by clustering 9,580 NYSE-traded symbols based on custom feature embeddings. After extracting and log-scaling key time-series features (trade price, volume, moving averages, and cyclically encoded timestamps), I built a neural embedding model with an input embedding layer followed by multiple dense blocks using LeakyReLU activations, dropout, and L2 regularization to ensure robust generalization under memory constraints. </p> <h4>Machine Learning Techniques</h4> <ul style="margin-top: -0.3em; margin-bottom: 1em;"> <li>Log-scaling and moving-average smoothing of raw price and volume signals</li> <li>Cyclical encoding (sine/cosine) of hourly timestamps</li> <li>Embedding layer for symbol representation followed by dense networks</li> <li>LeakyReLU, dropout layers, and L2 weight regularization to prevent overfitting</li> <li>Early stopping and ReduceLROnPlateau learning-rate scheduling in lieu of LSTM layers</li> <li>PCA for feature compression and t‑SNE for visual cluster inspection</li> <li>Spectral clustering and K‑Means (k=4 via elbow method) evaluated by silhouette and purity scores</li> </ul> <h4>Results</h4> <p> After training on 41 days of high-frequency data with batch optimization, the embedding-driven MLP revealed compact symbol representations that captured market dynamics more effectively than traditional correlation matrices. 
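The cyclical timestamp encoding from the techniques list above can be sketched as a small stand-alone function: hours are mapped onto the unit circle so that 23:00 and 00:00 end up adjacent rather than 23 units apart.

```python
from math import sin, cos, pi

def encode_hour(hour):
    """Map an hour in [0, 24) to a (sin, cos) point on the unit circle."""
    angle = 2 * pi * hour / 24
    return sin(angle), cos(angle)

def circle_dist(a, b):
    """Euclidean distance between two cyclically encoded hours."""
    (s1, c1), (s2, c2) = encode_hour(a), encode_hour(b)
    return ((s1 - s2) ** 2 + (c1 - c2) ** 2) ** 0.5
```

With this encoding, `circle_dist(23, 0)` is small while `circle_dist(12, 0)` is maximal (the circle's diameter), which is exactly the neighborhood structure a raw integer hour feature fails to capture.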
Applying K‑Means to these embeddings yielded four distinct clusters with a silhouette score of 0.607, while cluster purity against sector labels reached 0.426, showing partial alignment with known industry groupings while suggesting candidate arbitrage groups beyond classical sector-based approaches. </p> <p> The end-to-end pipeline, from feature engineering through embedding learning, dimensionality reduction, and unsupervised clustering, provides a scalable framework for statistical arbitrage. Full implementation details and additional quantitative analyses are available in the accompanying paper and GitHub repository. </p> </div> Sun, 15 Dec 2024 00:00:00 +0000 https://dq2024.github.io/blog/stat-arb-copy/ https://dq2024.github.io/blog/stat-arb-copy/ Sherlock AI <div class="project-content"> <h4>Intelligent Chatbot for Large Codebases</h4> <p> During my software engineering internship at Zoom Video Communications, I had the opportunity to participate in the company's annual AI Hackathon. During the two-week stint, I implemented an internal chatbot aimed at accelerating developer onboarding across Zoom’s sprawling codebase. </p> <p> To achieve this, I built a retrieval-augmented generation (RAG) pipeline that begins by parsing and chunking source files using LangChain, then indexes those chunks in a Pinecone vector database. This setup delivers low-latency, high-relevance semantic search over millions of lines of code. User questions are sent via a Flask-based REST API to Anthropic’s language models, which generate context-aware answers by combining retrieved code snippets with LLM inference. </p> <p> The final product is a seamless developer assistant that can answer deep technical questions—everything from “How does our authentication middleware work?” to “Where are database migrations defined?”—in real time. 
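The chunking stage of such a pipeline can be sketched with a stand-in splitter. The real pipeline used LangChain's splitters; this toy version only shows the core idea, fixed-size chunks with overlap, so a definition that straddles a chunk boundary still appears whole in at least one chunk.

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into fixed-size chunks whose ends overlap by `overlap`
    characters (a stand-in for a LangChain text splitter)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, step = [], chunk_size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```

Each chunk would then be embedded and upserted into the vector database, with the overlap guarding against answers that fall exactly on a chunk boundary.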
During the Hackathon demo, the chatbot answered queries with over 90% accuracy, cutting initial ramp-up time for new engineers from days to hours. </p> </div> Sun, 15 Dec 2024 00:00:00 +0000 https://dq2024.github.io/blog/sherlock-ai/ https://dq2024.github.io/blog/sherlock-ai/ Channel Atlas <div class="project-content layout-centered-yt"> <div class="project-image"> <img src="/assets/img/yt-cluster.png" alt="Channel Atlas cluster visualization" /> </div> <h4>Unsupervised YouTube Channel Classification System</h4> <p> Channel Atlas is an unsupervised NLP framework I developed to automatically categorize and explore thematic relationships across YouTube channels. The project began by scraping and collecting transcript corpora from over 250 diverse YouTube channels using BeautifulSoup, then preprocessing them with NLTK for tokenization, stop-word removal, and normalization. </p> <p> To capture each channel’s distinctive content signature, I used KeyBERT to extract the top-n most relevant keywords from every channel transcript, producing concise semantic profiles. These keywords were then embedded with word2vec to generate high-dimensional feature vectors that encode subtle semantic similarities between topics. I implemented a weighted cosine similarity metric that measures the overlap in these keyword embeddings, providing a flexible way to quantify thematic affinity between channels. </p> <p> With these pairwise similarities, I constructed a channel similarity graph in NetworkX, where edges represent semantic closeness. Applying the Clauset–Newman–Moore modularity optimization algorithm allowed me to detect tightly knit communities (groups of channels that consistently cover related topics without any prior labels or supervision). I then visualized these clusters and their interconnections using Gephi to qualitatively assess topic coherence and uncover hidden content ecosystems on YouTube. 
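A simplified sketch of the similarity metric: here each channel profile is a keyword-to-weight dict (as KeyBERT might emit), and similarity is weighted cosine over the shared vocabulary. The real system compared word2vec embeddings, so related but non-identical keywords also contribute; the profiles below are invented.

```python
from math import sqrt

def keyword_cosine(profile_a, profile_b):
    """Weighted cosine similarity between two keyword-weight profiles
    (dicts mapping keyword -> relevance score)."""
    vocab = set(profile_a) | set(profile_b)
    dot = sum(profile_a.get(w, 0.0) * profile_b.get(w, 0.0) for w in vocab)
    na = sqrt(sum(v * v for v in profile_a.values()))
    nb = sqrt(sum(v * v for v in profile_b.values()))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)

# Hypothetical channel profiles.
chess_a = {"chess": 0.9, "opening": 0.7, "endgame": 0.5}
chess_b = {"chess": 0.8, "tactics": 0.6, "opening": 0.4}
cooking = {"recipe": 0.9, "baking": 0.7}
```

These pairwise scores become the edge weights of the channel similarity graph on which community detection runs.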
</p> <p> For quantitative evaluation, I explored how different keyword counts and similarity thresholds affected clustering granularity, calculating precision, recall, and F-scores to assess performance. This analysis revealed important trade-offs between capturing broad thematic groups and isolating more niche channel communities. You can explore the full codebase on GitHub and read more about the methodology and results in the accompanying research paper. </p> </div> Sun, 15 Dec 2024 00:00:00 +0000 https://dq2024.github.io/blog/youtube-classification/ https://dq2024.github.io/blog/youtube-classification/ VENT <div class="project-content layout-sidebar"> <div class="project-image"> <img src="/assets/img/vent.png" alt="VENT App Screenshot" /> </div> <h4>Emotion-Aware Spotify Playlist Generator</h4> <p>I created an end-to-end, emotion-aware music-recommendation platform that begins with natural-language understanding and finishes with a curated Spotify playlist. Users start on a simple Flask web page where they can "vent" about how they feel. The application splits that free-form text into sentences, then feeds each fragment into a Naive Bayes emotion classifier.</p> <p>The model outputs a list of emotions and their associated confidence levels. 
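A toy sketch of the per-sentence scoring idea: Naive Bayes picks the emotion that maximizes the log prior plus the summed log word likelihoods. The probability tables below are invented for illustration; the deployed classifier learned its parameters from labeled training data.

```python
from math import log

# Hypothetical per-emotion word likelihoods (a real model learns these from data).
WORD_PROBS = {
    "joy":     {"happy": 0.30, "great": 0.25, "tired": 0.02, "sad": 0.01},
    "sadness": {"happy": 0.02, "great": 0.02, "tired": 0.20, "sad": 0.35},
}
PRIORS = {"joy": 0.5, "sadness": 0.5}
FLOOR = 0.01  # probability assigned to out-of-vocabulary words (crude smoothing)

def classify_sentence(sentence):
    """Score each emotion with log P(emotion) + sum of log P(word | emotion),
    then return the highest-scoring emotion."""
    words = sentence.lower().split()
    scores = {}
    for emotion, probs in WORD_PROBS.items():
        score = log(PRIORS[emotion])
        for w in words:
            score += log(probs.get(w, FLOOR))
        scores[emotion] = score
    return max(scores, key=scores.get)
```

The per-emotion scores double as the confidence signal that drives the playlist query.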
From there, the app queries Spotify's API to find songs that match the detected emotional profile, creating a personalized playlist that resonates with the user's current mood.</p> <p>This project combines natural language processing, machine learning, and music information retrieval to create a unique user experience that transforms emotions into curated music selections.</p> </div> Sun, 01 May 2022 00:00:00 +0000 https://dq2024.github.io/blog/vent/ https://dq2024.github.io/blog/vent/