Project Details
AI Toolchain is a collection of tools for quickly building and deploying machine learning models for various use cases. Currently, the toolchain includes a text translation model, and more models may be added in the future. Much like Hugging Face, it hides the internal details of how a model works and exposes a clean API that you can orchestrate at the BFF (backend-for-frontend) level.
Features to be implemented
The idea is to implement an asynchronous document uploader API that returns the embeddings for the chunks of an uploaded document. It should store the data for a short period, until the user requests the download. The user can then upload this data to wherever their search engine lives. The current problem statement doesn't cover this.
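A minimal sketch of the async upload and short-lived storage idea, using only the standard library. Everything here is an assumption made for illustration: the class name EmbeddingStore, its methods, the one-hour retention default, the paragraph-based chunking, and the embed_fn callback that stands in for a real embedding model. It is not the toolchain's actual API.

```python
import asyncio
import time
import uuid


class EmbeddingStore:
    """Hypothetical in-memory store: processes uploads in the background
    and keeps the resulting embeddings for a limited time."""

    def __init__(self, retention_seconds=3600.0):
        self.retention_seconds = retention_seconds  # assumed retention window
        self._results = {}  # document_id -> (expires_at, embeddings)
        self._tasks = {}    # document_id -> asyncio.Task (kept so it is not garbage collected)

    async def upload(self, text, embed_fn):
        """Schedule processing and return a document id right away."""
        document_id = uuid.uuid4().hex
        self._tasks[document_id] = asyncio.create_task(
            self._process(document_id, text, embed_fn)
        )
        return document_id

    async def _process(self, document_id, text, embed_fn):
        # Placeholder chunking: split on blank lines.
        chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
        embeddings = []
        for chunk in chunks:
            embeddings.append({"chunk": chunk, "embedding": embed_fn(chunk)})
            await asyncio.sleep(0)  # yield to the event loop between chunks
        expires_at = time.monotonic() + self.retention_seconds
        self._results[document_id] = (expires_at, embeddings)

    async def wait(self, document_id):
        """Block until background processing for this document finishes."""
        await self._tasks[document_id]

    def download(self, document_id):
        """Return the embeddings if ready and not expired, otherwise None."""
        entry = self._results.get(document_id)
        if entry is None:
            return None
        expires_at, embeddings = entry
        if time.monotonic() > expires_at:
            del self._results[document_id]  # drop expired data
            return None
        return embeddings
```

A real service would likely persist results outside process memory and run the embedding step in a worker, but the shape stays the same: upload returns an id immediately, and download succeeds only while the data is retained.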
How it works
Extract the text from the PDF file. Split the extracted text into chunks, using cosine distance to decide where one chunk ends and the next begins. For each chunk, create vector embeddings using an Instructor model.
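The chunking and embedding steps can be sketched roughly as follows, assuming the text has already been extracted from the PDF. The embed function is a toy bag-of-words counter used only so the example runs without dependencies; it is a stand-in for the Instructor model, not a substitute for it. The function names and the 0.8 threshold are likewise illustrative assumptions.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat empty vectors as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def chunk_text(text, threshold=0.8):
    """Group consecutive sentences, starting a new chunk whenever the next
    sentence is farther than `threshold` from the chunk built so far."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    current = []
    for sentence in sentences:
        if current:
            distance = cosine_distance(embed(" ".join(current)), embed(sentence))
            if distance > threshold:
                chunks.append(" ".join(current))
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def embed_chunks(chunks):
    """Pair every chunk with its (toy) embedding."""
    return [{"chunk": c, "embedding": dict(embed(c))} for c in chunks]
```

With this sketch, chunk_text("Cats purr. Cats sleep a lot. Stocks fell today.") keeps the two sentences about cats together and puts the unrelated third sentence in its own chunk. Swapping embed for a real model would change the distances, so the threshold would need retuning.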
Create APIs to upload the following document types
Behavior of Upload API
File Status API
- This API is used to check the status of file upload.
- It returns status and document id.
- Possible values for status are yet_to_start, in_progress, completed, and failed.
- If the embeddings for a document are successfully created and indexed, then completed is returned.
Taken from here
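One way the status values and response described above could be modeled is sketched below. The four status strings come from the list above; the StatusRegistry class, its method names, and the response field names are hypothetical and chosen only for illustration.

```python
from enum import Enum


class FileStatus(str, Enum):
    YET_TO_START = "yet_to_start"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


class StatusRegistry:
    """Hypothetical in-memory tracker for per-document processing status."""

    def __init__(self):
        self._status = {}  # document_id -> FileStatus

    def register(self, document_id):
        """Record a newly uploaded document that has not been processed yet."""
        self._status[document_id] = FileStatus.YET_TO_START

    def mark_started(self, document_id):
        self._status[document_id] = FileStatus.IN_PROGRESS

    def mark_finished(self, document_id, embedded, indexed):
        """completed only if embeddings were both created and indexed."""
        if embedded and indexed:
            self._status[document_id] = FileStatus.COMPLETED
        else:
            self._status[document_id] = FileStatus.FAILED

    def get_status(self, document_id):
        """Return the status and document id; raises KeyError for unknown ids."""
        return {
            "document_id": document_id,
            "status": self._status[document_id].value,
        }
```

An HTTP handler for the status endpoint could return the dictionary from get_status directly as JSON.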
Chunking
Sample pdfs:
https://drive.google.com/drive/u/0/folders/1sAsuh-EFH-xmFYrxzhmj0VRUZYNzsyLw
OpenAI Embedding Alternatives
Learning Path
Complexity
Medium
Skills Required
Python, Knowledge of HuggingFace Transformers, NLP.
Name of Mentors:
@GautamR-Samagra
Project size
8 Weeks
Product Set Up
See the setup here
Acceptance Criteria
Milestone
Every document type supported is a milestone.
Reference
- Gist with basic implementation
- LLM Town
C4GT
This issue is nominated for Code for GovTech (C4GT) 2023 edition.
C4GT is India's first annual coding program to create a community that can build and contribute to global Digital Public Goods. If you want to use Open Source GovTech to create impact, then this is the opportunity for you! More about C4GT here: https://codeforgovtech.in/
The scope of this ticket has now expanded to make it the 'content processing' part of 'FAQ bot'.
The FAQ bot lets a user provide content as CSVs, free text, PDFs, audio, or video, and the bot adds it to a 'Content DB'. The user can then interact with the bot via text or speech about related content; the bot identifies relevant content using RAG techniques and responds in a conversational manner.
This ticket covers the content processing part of the bot. It includes the following tasks in its scope: