Inspiration
The field of cybersecurity is evolving at an unprecedented pace, with new vulnerabilities, exploits, and defensive techniques emerging almost daily. Staying informed is critical for cybersecurity professionals, researchers, and enthusiasts. Yet, the sheer volume of information spread across countless wikis, knowledge bases, and documentation sites can make finding precise, actionable insights a daunting task.
This challenge inspired the development of a Retrieval-Augmented Generation (RAG) application tailored to querying the most popular cybersecurity knowledge base wikis. Here’s why this is important:
- Cybersecurity wikis contain valuable insights but are often vast and unstructured, so searching for specific information across them takes expertise and time.
- Threat actors innovate rapidly, and countermeasures evolve in response. However, knowledge is distributed across various sources, making it hard to consolidate.
- Not everyone has the expertise to navigate cybersecurity jargon or know where to look for specific answers.
I was intrigued by Pinata's decentralized file storage system and by how intuitive it was to use as a shareable database. That sharing feature is crucial for my application because the vision is a community that contributes to the database of cybersecurity wikis. SambaNova's performance and scalability also fascinated me, and I was very impressed with its responses when I tried the demos. It suited my application because I needed a platform that supports long input sequences, which allow richer context when combining retrieved data with generative output.
What it does
Cyber Knowledge Hub consolidates PDF files of cybersecurity wikis and uses RAG to shorten the process of looking up a cybersecurity concept. The process is as follows:
- A user uploads PDF files of cybersecurity wikis to Pinata. (For this hackathon, I used open-source cybersecurity wikis found on GitHub.)
- The documents stored on Pinata are processed and vectorized.
- The app compares the query with the vectorized documents by cosine similarity to find the most relevant ones.
- SambaNova generates a response from the retrieved documents.
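The retrieval step above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: it assumes a toy term-count vectorizer in place of a real embedding model, and the names `vectorize`, `cosine_similarity`, and `top_k` are hypothetical helpers made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(text):
    """Toy stand-in for an embedding model: a sparse term-count vector."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the k (score, chunk) pairs most similar to the query, best first."""
    query_vec = vectorize(query)
    scored = [(cosine_similarity(query_vec, vectorize(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical wiki excerpts standing in for the processed PDF chunks.
chunks = [
    "SQL injection inserts malicious SQL into a query through unsanitized input.",
    "Cross-site scripting runs attacker-controlled JavaScript in a victim's browser.",
    "A buffer overflow writes past the end of an allocated memory region.",
]

best = top_k("what is sql injection", chunks, k=1)
print(best[0][1])
```

The top-ranked chunks would then be placed in the prompt as context for the language model to generate the final answer.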
How we built it
- Sheer determination
- Caffeine
- Python
- Streamlit
- Pinata
- SambaNova
Challenges we ran into
The biggest struggle was being limited to Pinata as my only database. That made it difficult to embed and store vectors, which I needed in order to implement the similarity search that retrieves the relevant documents. Pinata and SambaNova were also new technologies to me, so reading and understanding their documentation was a bit of a challenge.
Accomplishments that we're proud of
- Traveling to UTD to be part of this hackathon (I'm a foreign exchange student from Singapore)
- Persistence in building this application (I stayed up and managed to get the main functionality working at 6am)
What we learned
- How to read documentation
- RAG concepts
What's next for Cyber Knowledge Hub
- Implement a better information retrieval algorithm like BM25
- Implement relevance feedback to improve the search responses
- The ultimate goal is to be the go-to site for amateurs, hobbyists, students, and experts to find cybersecurity information
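To give an idea of what the BM25 item could look like, here is a minimal, self-contained sketch of Okapi BM25 scoring. It is not part of the current application. The `BM25` class and its method names are hypothetical, the parameters `k1=1.5` and `b=0.75` are common defaults rather than tuned values, and the IDF uses the smoothed, non-negative variant `ln(1 + (N - n + 0.5) / (n + 0.5))`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Minimal Okapi BM25 scorer over a small in-memory document list."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_tokens = [tokenize(doc) for doc in documents]
        self.doc_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        self.doc_count = len(documents)
        self.avg_len = sum(len(t) for t in self.doc_tokens) / max(self.doc_count, 1)
        # Number of documents that contain each term at least once.
        self.containing = Counter()
        for freqs in self.doc_freqs:
            self.containing.update(freqs.keys())

    def idf(self, term):
        """Smoothed inverse document frequency; never negative."""
        n = self.containing.get(term, 0)
        return math.log(1 + (self.doc_count - n + 0.5) / (n + 0.5))

    def score(self, query, index):
        """BM25 score of the document at `index` for the query."""
        freqs = self.doc_freqs[index]
        length = len(self.doc_tokens[index])
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            # Length normalization: longer-than-average documents are penalized.
            norm = 1 - self.b + self.b * length / self.avg_len
            total += self.idf(term) * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return total

    def rank(self, query):
        """Return (score, document index) pairs, highest score first."""
        scores = [(self.score(query, i), i) for i in range(self.doc_count)]
        return sorted(scores, reverse=True)


# Hypothetical wiki excerpts used only for illustration.
documents = [
    "SQL injection inserts malicious SQL into a query.",
    "Cross-site scripting injects JavaScript into a web page.",
    "A buffer overflow writes past the end of a buffer.",
]

bm25 = BM25(documents)
ranking = bm25.rank("buffer overflow")
print(documents[ranking[0][1]])
```

Unlike cosine similarity over raw term counts, BM25 saturates repeated terms and normalizes for document length, which tends to help when wiki pages vary a lot in size.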
Acknowledgements
I modified the open-source code from SambaNova's Enterprise Knowledge Retrieval AI Starter Kit to suit the purposes of my project. The main changes that I made in the GitHub repository are in the folder enterprise_knowledge_retriever. The rest is the starter code from SambaNova.
Built With
- langchain
- llm
- pinata
- python
- streamlit
- sumbanova