Inspiration / Problem

Companies have seen a simultaneous rise in the volume, velocity, and usage of their data and in concerns about insider threats: trusted individuals with privileged access may, intentionally or inadvertently, pose significant cybersecurity risks, including breaches of sensitive data.

Organizations also rely on log data collected from their various applications to surface cybersecurity threats. Detecting such threats early is crucial to mitigating risk and protecting the organization's assets.

Long Term Vision

Our long-term goal is to start a project that modernizes data project management and anomaly detection in the enterprise.

Recently, Large Language Models (LLMs) have drawn great attention for their ability to summarize the context of text. We aim to leverage multimodal LLMs to learn the semantics and context of system logs across enterprise products, design a method for constructing a correlation graph from the summarized contexts, and propose a graph-based anomaly detection method.

Toward this goal, we pursue two long-term objectives:

  1. Designing an LLM-based model that automatically analyzes system logs and extracts contexts meaningful for the anomaly detection task, and
  2. Designing a graph-based anomaly detection model whose output is an ordered list of anomalous users and their events.
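To make objective 2 concrete, here is a minimal sketch of a graph-style detector over a user-event bipartite structure. This is purely illustrative, not Data Trei's actual model: it scores each user by how far their event-type distribution diverges from the population's, and returns the ordered list of anomalous users.

```typescript
// Illustrative sketch: rank users by how much their action distribution
// deviates (L1 distance) from the global action distribution.
type LogEvent = { user: string; action: string };

function rankAnomalousUsers(events: LogEvent[]): { user: string; score: number }[] {
  // Count events per user per action, and globally per action.
  const perUser = new Map<string, Map<string, number>>();
  const global = new Map<string, number>();
  for (const e of events) {
    if (!perUser.has(e.user)) perUser.set(e.user, new Map());
    const m = perUser.get(e.user)!;
    m.set(e.action, (m.get(e.action) ?? 0) + 1);
    global.set(e.action, (global.get(e.action) ?? 0) + 1);
  }
  const total = events.length;
  const scored = [...perUser.entries()].map(([user, m]) => {
    const n = [...m.values()].reduce((a, b) => a + b, 0);
    // L1 distance between this user's action distribution and the global one:
    // users whose behavior diverges most from the norm rank highest.
    let score = 0;
    for (const [action, g] of global) {
      score += Math.abs((m.get(action) ?? 0) / n - g / total);
    }
    return { user, score };
  });
  return scored.sort((a, b) => b.score - a.score);
}
```

A real system would operate on a richer correlation graph (shared resources, temporal edges), but the output contract is the same: an ordered list of anomalous users.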

How we built it

For this submission we built v1 of Data Trei on a modern web stack with Next.js as the core framework. Some challenges we overcame included:

  1. Implementing secure OAuth flows for both GitHub and GCP integrations.
  2. Designing a flexible database schema to accommodate various log types.
  3. Ensuring real-time updates across the application using Supabase's real-time subscriptions.
  4. Integrating AI capabilities for natural language log querying.
  5. Managing state and data flow in a complex, multi-component React application.
  6. Working out the steps required to set up Pub/Sub and log ingestion, which were not well documented by Google.
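The "flexible database schema" in point 2 can be sketched as a tagged union over log sources. The field names below are illustrative, not Data Trei's actual schema; the point is that one log type can hold GitHub, GCP, and custom logs while keeping source-specific fields type-safe:

```typescript
// Illustrative only: a discriminated-union log model so one store can hold
// logs from multiple sources while source-specific fields stay typed.
type BaseLog = { id: string; tenantId: string; timestamp: string };

type GitHubLog = BaseLog & { source: "github"; repo: string; event: string };
type GcpLog = BaseLog & { source: "gcp"; logName: string; severity: string };
type CustomLog = BaseLog & { source: "custom"; payload: Record<string, unknown> };

type AppLog = GitHubLog | GcpLog | CustomLog;

// Narrowing on the `source` tag gives type-safe access to per-source fields.
function summarize(log: AppLog): string {
  switch (log.source) {
    case "github": return `${log.repo}: ${log.event}`;
    case "gcp": return `${log.logName} [${log.severity}]`;
    case "custom": return `custom payload with ${Object.keys(log.payload).length} keys`;
  }
}
```

In a relational store the same idea maps to a shared base table plus a `source` column and a JSON column (or side tables) for source-specific fields.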

Challenges we ran into / What we learned

  1. How to integrate multiple cloud services (GitHub, GCP, custom APIs) into a single platform.
  2. The importance of real-time data processing for log analysis and threat intelligence.
  3. Implementing AI-powered natural language processing for log querying.
  4. Designing a multi-tenant architecture with proper data isolation and access control.
  5. The challenges of managing OAuth flows for multiple third-party services.
  6. Setting up log ingestion inside the user's own GCP project as a minimally invasive way to track their logs, instead of storing a long-lived access token in our database.
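The in-project ingestion in point 6 boils down to creating a Cloud Logging sink in the user's project whose destination is a Pub/Sub topic there. A small helper for composing those resource strings might look like this (the helper names are hypothetical; the `pubsub.googleapis.com/projects/.../topics/...` destination form is the one Cloud Logging sinks use for Pub/Sub):

```typescript
// Hypothetical helpers: compose the resource names needed when wiring a
// Cloud Logging sink to a Pub/Sub topic in the user's own project.
function pubsubTopicPath(projectId: string, topicId: string): string {
  return `projects/${projectId}/topics/${topicId}`;
}

// Cloud Logging sink destinations for Pub/Sub use this URI form.
function sinkDestination(projectId: string, topicId: string): string {
  return `pubsub.googleapis.com/${pubsubTopicPath(projectId, topicId)}`;
}

// Restrict the sink to admin-activity audit logs to keep ingestion
// minimally invasive (filter shown for illustration).
function auditLogFilter(projectId: string): string {
  return `logName="projects/${projectId}/logs/cloudaudit.googleapis.com%2Factivity"`;
}
```

With these strings, the sink itself can be created via `gcloud logging sinks create` or the Logging API, and the sink's writer identity is granted publish rights on the topic.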

Accomplishments that we're proud of

With this v1 submission of Data Trei, we are proud to have built on a modern web stack with Next.js as the core framework. Using this as our backbone, we succeeded at the following:

  1. Implementing secure OAuth flows for both GitHub and GCP integrations.
  2. Designing a flexible database schema to accommodate various log types.
  3. Ensuring real-time updates across the application using Supabase's real-time subscriptions.
  4. Integrating AI capabilities for natural language log querying.
  5. Managing state and data flow in a complex, multi-component React application.
  6. Setting up Pub/Sub and log ingestion despite sparse documentation from Google.

What's next for Project Alabama

The general objective of this project is to develop an ensemble of multimodal LLMs and a graph-based approach for a robust and adaptive anomaly detection system. Toward this objective, we pursue two immediate sub-objectives:

  1. Designing a multimodal LLM-based model that automatically analyzes system logs and extracts contexts meaningful for the anomaly detection task, and

  2. Designing an ensemble of a multimodal LLM-based model and a graph-based model for an adaptive and robust anomaly detection system.
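One simple way to realize sub-objective 2 is a weighted blend of the two models' per-user scores. This is only a sketch under assumed inputs (both scores normalized to [0, 1]); the weight and score names are hypothetical:

```typescript
// Hypothetical ensemble: blend an LLM-derived semantic anomaly score with a
// graph-derived structural score, then rank users by the combined value.
type UserScores = { user: string; llmScore: number; graphScore: number };

function ensembleRank(
  scores: UserScores[],
  llmWeight = 0.5,
): { user: string; score: number }[] {
  const w = Math.min(Math.max(llmWeight, 0), 1); // clamp weight to [0, 1]
  return scores
    .map((s) => ({ user: s.user, score: w * s.llmScore + (1 - w) * s.graphScore }))
    .sort((a, b) => b.score - a.score);
}
```

An adaptive system could tune `llmWeight` over time, e.g. from analyst feedback on which model's alerts proved correct.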


We want to start by designing an anomaly detection system via log analysis that has the following properties:

  1. Data compatibility - the ability to analyze data across modalities (e.g., text and tabular) with a proactive, real-time online detection system and a model that adapts to the ever-evolving landscape of cyberthreats.
  2. Dev Tools - developers can customize and extend the platform's capabilities, ensuring that data flows, transformations, and validation processes are tailored to specific business needs.
  3. Schema Studio - the measures needed for data accuracy and consistency, including features like data immutability, policy management, and secure third-party integrations.
  4. Compliance auditor - the ability to significantly reduce the risk of schema attacks and prevent unauthorized alterations while data is exchanged between data producers and consumers.
