Inspiration
In the fast-paced world of science and technology, staying up to date with the latest research is crucial, and many researchers make it a point to read at least one paper a day. This becomes particularly challenging in fields like biology and computer science, and especially in artificial intelligence research: between 500 and 2,000 new papers are uploaded to arXiv every day in these areas. To address this problem, we built AutoReader.
What it does
It's practically impossible for researchers to read every article, or even every title. Consequently, there's a growing need for tools that help researchers keep up with the latest cutting-edge results. This need led us to develop AutoReader. Unlike traditional arXiv and Google searches, which only match keywords in the abstract, AutoReader uses semantic search over a vector database to query every part of a paper, including the background, experiments, motivations, and insights.
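The difference from keyword matching can be illustrated with a small cosine-similarity search over embedding vectors. This is a toy sketch: the 3-d "embeddings" below are made-up stand-ins for a real embedding model's output, which would typically have hundreds of dimensions.

```python
import numpy as np

def cosine_search(query_vec, doc_vecs, top_k=2):
    """Return indices of the top_k most similar document vectors.

    Unlike keyword matching, similarity is computed in embedding space,
    so paraphrases with no shared words can still rank highly.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity per document chunk
    return np.argsort(scores)[::-1][:top_k]

# Toy vectors standing in for embedded paper chunks.
docs = np.array([
    [0.9, 0.1, 0.0],   # chunk about experiments
    [0.1, 0.9, 0.0],   # chunk about background
    [0.8, 0.2, 0.1],   # chunk about motivation
])
query = np.array([1.0, 0.0, 0.0])
print(cosine_search(query, docs))  # nearest chunks first
```

Because every chunk of a paper is embedded, a query can match against the experiments or insights sections, not just the abstract.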
How we built it
The system consists of the following components:
- Downloader: Cloudflare Workers download papers efficiently.
- Data preprocessing: OCR transforms PDF-format papers into Markdown, with several optimizations tailored for academic papers.
- RAG database building: Papers are segmented into chunks, converted into embeddings, and stored in a custom vector database designed for rapid retrieval.
- Backend query system: A straightforward PostgreSQL setup manages user and subscription data.
- Frontend: A sleek, user-friendly interface built with React.
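The chunk-embed-store step can be sketched roughly as follows. The character-based chunker and the hash-seeded stand-in embedding are illustrative only, not our production code; a real system would call an embedding model where `embed` is.

```python
import hashlib
import numpy as np

def chunk(text, size=200, overlap=50):
    """Split a paper's Markdown into overlapping character chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk_text, dim=64):
    """Stand-in embedding: a deterministic pseudo-random unit vector
    seeded by the chunk's hash. A real system calls an embedding model."""
    seed = int.from_bytes(hashlib.sha256(chunk_text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Build the store: (chunk, embedding) pairs, ready for similarity search.
paper_md = "# Title\n\nBackground ... Experiments ... Insights ..." * 20
store = [(c, embed(c)) for c in chunk(paper_md)]
print(len(store), "chunks indexed")
```

Overlapping chunks reduce the chance that a relevant sentence is split across a chunk boundary and lost to retrieval.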
Challenges we ran into
Initially, we believed our project would be straightforward, given the existence of various RAG systems like RAGFlow, QAnything, and LlamaIndex. However, we quickly realized that none of them suited our specific needs.
Firstly, these frameworks struggled with PDF OCR, particularly with two-column academic papers, and often disregarded images altogether. We dedicated significant effort to improving data quality.
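One concrete two-column failure is reading order: a naive top-to-bottom scan interleaves the columns. A minimal sketch of column-aware ordering, assuming the OCR stage emits text blocks as `(x, y, text)` tuples (real pipelines also handle full-width elements like titles and figures, omitted here):

```python
def reading_order(blocks, page_width):
    """Order OCR text blocks for a two-column page.

    Blocks left of the page midline are read first, top to bottom,
    then the right column. A naive y-sort would interleave the columns.
    """
    mid = page_width / 2
    left = sorted((b for b in blocks if b[0] < mid), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= mid), key=lambda b: b[1])
    return [b[2] for b in left + right]

blocks = [
    (320, 50, "right-top"), (20, 80, "left-mid"),
    (20, 10, "left-top"), (320, 200, "right-bottom"),
]
print(reading_order(blocks, page_width=600))
# -> ['left-top', 'left-mid', 'right-top', 'right-bottom']
```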
Secondly, processing daily arXiv papers proved immensely time-consuming. To address this, we developed a more efficient vector database indexing algorithm, which we plan to publish on arXiv shortly.
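Since our algorithm isn't published yet, here is a generic inverted-file (IVF) index as orientation: it trades a coarse k-means clustering pass at build time for much faster queries, because a search only probes the nearest clusters instead of scanning every vector. This is standard IVF, not our algorithm.

```python
import numpy as np

def build_ivf_index(vectors, n_clusters=4, iters=10, seed=0):
    """Cluster the vectors with k-means, then bucket each vector
    under its nearest centroid (a simple inverted-file index)."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((vectors[:, None] - centroids) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Final assignment against the converged centroids.
    assign = np.argmin(((vectors[:, None] - centroids) ** 2).sum(-1), axis=1)
    buckets = {c: np.flatnonzero(assign == c) for c in range(n_clusters)}
    return centroids, buckets

def search(query, vectors, centroids, buckets, n_probe=1, top_k=3):
    """Probe only the n_probe nearest clusters instead of all vectors."""
    order = np.argsort(((centroids - query) ** 2).sum(-1))[:n_probe]
    cand = np.concatenate([buckets[c] for c in order])
    dists = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dists)[:top_k]]

vecs = np.random.default_rng(1).normal(size=(1000, 32))
centroids, buckets = build_ivf_index(vecs)
hits = search(vecs[0], vecs, centroids, buckets)
print(hits[:1])  # the query vector should be its own nearest neighbor
```

The build-time bottleneck in schemes like this is the clustering/assignment pass over all vectors, which is exactly the part worth optimizing when thousands of new papers arrive daily.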
Accomplishments that we're proud of
- Vector Database: Developed a high-performance vector database capable of handling a massive volume of access requests.
- Indexing Algorithm: Engineered a robust algorithm for constructing vector indexes efficiently across a vast array of vectors.
- Academic Paper OCR: Implemented significant enhancements to OCR techniques specifically for academic papers, improving accuracy and conversion quality.
- Embedding Model: Advanced the embedding model for scholarly papers, resulting in markedly improved representation and retrieval efficiency.
What we learned
- RAG System: The quality of the data is crucial; if the data is of low quality or improperly processed, it can significantly impair the entire system's performance.
- Vector Database: Many vector databases exist, and most optimize primarily for query speed. However, with vast amounts of new data arriving daily, the speed of index building becomes equally critical.
What's next for AutoReader
- Build a better frontend
- Build larger datasets for weekly/monthly/yearly digests
- Long term: build a vector index for the whole of arXiv