[C4GT] Document Uploader ( Text chunking into paragraphs + content retrieval )

# Project Details
AI Toolchain is a collection of tools for quickly building and deploying machine learning models for various use cases. Currently, the toolchain includes a text translation model, and more models may be added in the future. It abstracts away the internal details of how each model works, much as Hugging Face does, and exposes a clean API that you can orchestrate at the BFF (backend-for-frontend) level.

## Features to be implemented

The idea is to implement an asynchronous document uploader API that returns the embeddings for the chunks of an uploaded document. It should retain the data for a short period, until the user requests the download. The user can then upload this data to wherever their search engine runs. That last step is outside the scope of the current problem statement.
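
The asynchronous flow with short-lived storage could look roughly like the sketch below. It is only an illustration under assumptions: `ResultStore`, `submit`, `ttl_seconds`, and the in-memory dictionary are hypothetical names and choices, and the "embedding" is a placeholder number rather than a model output.

```python
import asyncio
import time
import uuid


class ResultStore:
    """In-memory store that forgets results after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be checked without sleeping
        self._items = {}  # document_id -> (expires_at, result)

    def put(self, document_id, result):
        self._items[document_id] = (self.clock() + self.ttl_seconds, result)

    def get(self, document_id):
        entry = self._items.get(document_id)
        if entry is None:
            return None
        expires_at, result = entry
        if self.clock() >= expires_at:
            del self._items[document_id]  # expired: drop the data
            return None
        return result


async def _embed_chunks(document_id, text, store):
    """Placeholder for the real work: one fake embedding per paragraph."""
    await asyncio.sleep(0)  # stands in for slow extraction and embedding
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    store.put(document_id, [{"chunk": c, "embedding": [float(len(c))]} for c in chunks])


def submit(text, store, tasks):
    """Return a document id at once; processing continues in the background.

    Must be called while an event loop is running.
    """
    document_id = uuid.uuid4().hex
    tasks[document_id] = asyncio.create_task(_embed_chunks(document_id, text, store))
    return document_id
```

A real service would likely replace the dictionary with a database or cache that supports expiry, so that results survive restarts.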

## How it works
Extract the text from the PDF file. Split the extracted text into chunks based on the cosine distance between embeddings. For each chunk, create vector embeddings using an Instructor model.
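
Read as a pipeline, the three steps compose as in the sketch below. The function name `process_pdf` and its three callables (`extract_text`, `split_into_chunks`, `embed`) are hypothetical; the real module would plug in a PDF parser, the cosine-distance chunker, and an Instructor model in their place.

```python
def process_pdf(pdf_bytes, extract_text, split_into_chunks, embed):
    """Run extraction, chunking, and embedding in order.

    All three steps are passed in as callables, so this sketch makes
    no assumption about which libraries implement them.
    """
    text = extract_text(pdf_bytes)    # step 1: raw text from the file
    chunks = split_into_chunks(text)  # step 2: list of chunk strings
    # step 3: one embedding vector per chunk
    return [{"chunk": chunk, "embedding": list(embed(chunk))} for chunk in chunks]


# Toy stand-ins, for illustration only.
def toy_extract(data):
    return data.decode("utf-8")


def toy_split(text):
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_embed(chunk):
    return [float(len(chunk))]


records = process_pdf(b"First idea. Second idea.", toy_extract, toy_split, toy_embed)
```

Keeping the steps injectable also makes each one easy to unit test on its own.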

## Create APIs to upload the following document Types
- [ ] PDF
- [ ] Audio (transcription)
- [ ] Video (transcription)

### Behavior of Upload API
- [ ] It takes a PDF file and uploads it to our database.
- [ ] The API returns a document ID in the response. Use this document ID in all future calls. Each document ID maps to an index containing embeddings.
- [ ] If you are indexing multiple documents, pass the document IDs accordingly.

Taken from [here](https://llmtown.com/#file-upload)
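
The mapping from document IDs to per-document indexes can be pictured with a small in-memory registry. This is a sketch under assumptions, not the project's storage layer: `DocumentRegistry` and its methods are hypothetical names, and a real implementation would persist to a database.

```python
import uuid


class DocumentRegistry:
    """Maps each document id to its own list of (chunk, embedding) entries."""

    def __init__(self):
        self._indexes = {}  # document_id -> list of (chunk, embedding)

    def upload(self, pdf_bytes):
        """Register an upload and return the id to use in later calls."""
        if not pdf_bytes:
            raise ValueError("empty upload")
        document_id = uuid.uuid4().hex
        self._indexes[document_id] = []  # index is filled in as chunks are embedded
        return document_id

    def add_embedding(self, document_id, chunk, embedding):
        self._indexes[document_id].append((chunk, list(embedding)))

    def entries(self, document_ids):
        """Collect entries across several documents, in the order the ids are given."""
        collected = []
        for document_id in document_ids:
            collected.extend(self._indexes[document_id])  # KeyError on an unknown id
        return collected
```

Searching across multiple documents then amounts to passing every relevant ID to `entries`.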

### File Status API

- This API is used to check the status of a file upload.
- It returns the status and the document ID.
- Possible values for status are `yet_to_start`, `in_progress`, `completed`, and `failed`.
- If the embeddings for a document are successfully created and indexed, `completed` is returned.

Taken from [here](https://llmtown.com/#file-upload)
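
The four status values map naturally onto an enum. In the sketch below the values come from the list above, while the allowed transitions, `advance`, and `status_response` are assumptions added for illustration; the issue does not specify the order in which statuses may change.

```python
from enum import Enum


class FileStatus(str, Enum):
    """The four status values listed above."""

    YET_TO_START = "yet_to_start"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transitions: the issue lists the values but not their ordering.
ALLOWED_TRANSITIONS = {
    FileStatus.YET_TO_START: {FileStatus.IN_PROGRESS, FileStatus.FAILED},
    FileStatus.IN_PROGRESS: {FileStatus.COMPLETED, FileStatus.FAILED},
    FileStatus.COMPLETED: set(),  # terminal
    FileStatus.FAILED: set(),     # terminal
}


def advance(current, new):
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {new.value}")
    return new


def status_response(document_id, status):
    """Shape of the response body: status plus document id."""
    return {"document_id": document_id, "status": status.value}
```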

### Chunking
- [ ] To be done based on the cosine distance between the embeddings of text segments
- [ ] [Threshold should be configurable by the API params](https://gist.github.com/ChakshuGautam/2b71b2b01dbb3dfb710c0c2fe51f4f1d#file-chunking_by_embedding-py-L32)
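
One plausible reading of threshold-based chunking is to start a new chunk whenever the cosine distance between consecutive sentence embeddings exceeds the threshold. The sketch below follows that reading; it is not a copy of the linked gist, and `chunk_by_distance`, the default threshold of 0.3, and the precomputed `embeddings` argument are all assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; a zero vector is treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def chunk_by_distance(sentences, embeddings, threshold=0.3):
    """Group consecutive sentences; split where neighbours are farther apart than threshold."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    chunks = []
    current = []
    for i, sentence in enumerate(sentences):
        # Compare each sentence with the one just before it.
        if current and cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A lower threshold produces more, smaller chunks and a higher one produces fewer, larger chunks, which is why exposing it as an API parameter is useful.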

## Sample PDFs
https://drive.google.com/drive/u/0/folders/1sAsuh-EFH-xmFYrxzhmj0VRUZYNzsyLw

## OpenAI Embedding Alternatives
- [ ] Evaluate and compare different models
- [ ] https://huggingface.co/hkunlp/instructor-xl

## Learning Path

### Complexity  
Medium      

### Skills Required 
Python, Knowledge of HuggingFace Transformers, NLP.

### Name of Mentors:        
@GautamR-Samagra

## Project size
8 Weeks

## Product Set Up
See the setup [here](https://github.com/Samagra-Development/ai-tools#setup)

## Acceptance Criteria 
- [ ] Unit Test Cases
- [ ] e2e Test Cases
- [ ] OpenAPI Spec/Postman Collection
- [ ] Dockerfile for this module

## Milestone
Every document type supported is a milestone.

## Reference
1. [Gist with basic implementation](https://gist.github.com/ChakshuGautam/2b71b2b01dbb3dfb710c0c2fe51f4f1d)
2. [LLM Town](https://llmtown.com/)

# C4GT
This issue is nominated for Code for GovTech (C4GT) 2023 edition. 
C4GT is India's first annual coding program to create a community that can build and contribute to global [Digital Public Goods](https://www.codeforgovtech.in/digitalpublicgoods). If you want to use Open Source GovTech to create impact, then this is the opportunity for you! More about C4GT here: https://codeforgovtech.in/

-----------------------------------------------------------------------------------------------------------------------------------------

The scope of this ticket has been expanded: it now covers the 'content processing' part of the 'FAQ bot'.
The FAQ bot lets a user provide content in the form of CSVs, free text, PDFs, audio, or video, which the bot adds to a 'Content DB'. The user can then interact with the bot via text or speech about that content. The bot identifies relevant content using RAG techniques and responds to the user in a conversational manner.  

This ticket covers the content processing part of the bot. It includes the following tasks in its scope:  
 
- [x] #166
- [x] #167
- [x] #168
- [x] #169
- [ ] Explore metrics for measuring accuracy of content retrieval - check out RAG metrics https://github.com/Samagra-Development/ai-tools/issues/146
- [x] #170
- [x] #171
- [ ] Compare sentence embeddings against ColBERT https://github.com/Samagra-Development/ai-tools/issues/149
- [ ] Determine when to ask for more context (from the user, for now) https://github.com/Samagra-Development/ai-tools/issues/147
- [x] #172
- [x] #200
- [x] #199
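
For the retrieval-accuracy task, two commonly used information-retrieval metrics are recall@k and mean reciprocal rank (MRR). The sketch below shows how they are computed; whether these are the metrics chosen in the linked issue is not stated here, and the chunk ids in the example are made up.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top k retrieved items."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Made-up example: ranked chunk ids returned for two queries.
runs = [
    (["c3", "c1", "c7"], {"c1", "c9"}),  # first relevant hit at rank 2
    (["c2", "c5"], {"c2"}),              # first relevant hit at rank 1
]
```

For these two queries the first has a recall@2 of 0.5 and the MRR across both is 0.75.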
