Inspiration
In sensitive environments such as defense systems, many safeguards must be in place to ensure data is handled properly and doesn't fall into the wrong hands. This, however, makes development tedious and slow-paced at times. A high-to-low transfer, that is, moving data from a classified environment to an unclassified one, is especially scrutinized and time-consuming. Oftentimes, developers have to copy code and other information by hand to ensure no sensitive data is leaked. ShadowScan aims to streamline this process by combining ML-based similarity detection and pattern-matching algorithms with a robust protocol to make high-to-low transfers more effective. It is a modular, containerized app that can be hosted in production-grade high-side environments.
What it does
ShadowScan works by carefully handling the flow of sensitive data and leveraging ML models while still putting the final decision in the hands of a human. Here's how it works.
- Users upload files they want released
- Immediately, ShadowScan's similarity detection engine compares the input file with a corpus of sensitive data and keywords to derive a similarity score. The file and its score are then sent to an admin user.
- The admin user uses this score and their own judgment to decide whether or not the file should be released to the low side.
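The flow above can be sketched in a few lines of Python. This is an illustrative model only: the class names, the `0.5` threshold, and the decision labels are assumptions for the sketch, not ShadowScan's actual code.

```python
from dataclasses import dataclass

# Hypothetical sketch of ShadowScan's review flow. Names, fields, and
# the example threshold are illustrative assumptions.

@dataclass
class ReviewItem:
    filename: str
    similarity_score: float    # 0.0 (no overlap) .. 1.0 (near-identical)
    decision: str = "pending"  # "pending", "released", or "blocked"

class ReviewQueue:
    def __init__(self):
        self.items = []

    def submit(self, filename, score):
        # Step 2: the engine attaches a similarity score and queues the
        # file for human review; nothing is released automatically.
        item = ReviewItem(filename, score)
        self.items.append(item)
        return item

    def decide(self, item, release):
        # Step 3: the final call is always made by the admin.
        item.decision = "released" if release else "blocked"

queue = ReviewQueue()
item = queue.submit("design_notes.txt", 0.12)
# The admin weighs the low score and approves the release.
queue.decide(item, release=item.similarity_score < 0.5)
print(item.decision)  # prints "released"
```

The key design point is that the score informs, but never replaces, the human decision.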
How we built it
ShadowScan's proof of concept is built with React, FastAPI, and SQLite. All components are containerized via Docker and hosted on Render. These choices aren't the most effective or efficient, but they maximize development speed: React is a quick way to build UIs, and a Python-based backend service such as FastAPI makes ML integration seamless. In production, however, we would opt for a server-based database such as Postgres rather than SQLite, use Go for much of the API routing and dataflow for its performance benefits, and create a separate Python engine for processing.
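One way the containerized setup could look is a small docker-compose file with one service per component. This is a hypothetical layout mirroring the stack described above; the service names, paths, and ports are illustrative assumptions, not the actual configuration.

```yaml
# Hypothetical compose layout for the ShadowScan stack described above.
services:
  frontend:             # React UI
    build: ./frontend
    ports:
      - "3000:3000"
  api:                  # FastAPI backend; the SQLite file lives in a volume
    build: ./backend
    ports:
      - "8000:8000"
    volumes:
      - db-data:/app/data
volumes:
  db-data:
```

Splitting the frontend and backend into separate containers is also what makes the later production swap (Postgres, a Go routing layer, a separate Python engine) a matter of adding services rather than rewriting the app.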
Challenges we ran into
Formulating a secure way for high-to-low transfers to occur was a big challenge. Beyond that, many challenges came up during implementation. We initially thought an LLM, such as a self-hosted instance of Llama, could easily detect similarity, so we dedicated resources toward that. That approach didn't pan out, and developing a semantic matching tool from scratch using more basic techniques like word encodings and cosine similarity proved to be more fruitful. On the infrastructure side, our initial goal was to develop a fully managed network layer to sit on top of ShadowScan and control the quarantine and release of files. However, time constraints forced us to build an MVP on Render.
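The word-encoding-plus-cosine-similarity approach that ended up working can be sketched with simple bag-of-words counts. This is a minimal illustrative version; the actual engine's encodings and corpus are assumed to be more sophisticated.

```python
import math
from collections import Counter

# Minimal bag-of-words + cosine similarity sketch of the semantic
# matching approach. The sample corpus and upload are made up.

def encode(text):
    # Encode a document as a sparse word-count vector.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

corpus = ["launch codes for the missile system",
          "cafeteria menu for friday"]
upload = "updated missile launch schedule"

# Score the upload against each sensitive document; the max becomes
# the flag shown to the admin.
scores = [cosine_similarity(encode(upload), encode(doc)) for doc in corpus]
print(round(max(scores), 2))  # prints 0.41
```

Unlike an LLM call, this is deterministic, fast, and fully auditable, which matters when the score feeds a human release decision in a classified environment.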
Accomplishments that we're proud of
Creating a fully functional prototype of something that was just an idea a day ago.
What we learned
- How to detect similarity using word encodings and cosine similarity
- Setting up full-stack infrastructure with Docker
- and much more!
What's next for ShadowScan
- More specific similarity indicators rather than one score
- Network layer to support secure file transfer
Built With
- docker
- fastapi
- python
- react
- render
- sqlalchemy
- sqlite
- typescript