Earnings call transcripts: written records of conference calls held quarterly by publicly traded companies to discuss financial results, performance, and future outlook. They capture the dialogue between company executives, analysts, and reporters.
The call typically covers:
1. Prepared remarks by management
2. Financial results
3. Strategic initiatives
4. Forward-looking statements and guidance on future performance
5. A Q&A session
Transcripts are valued for their searchability, accuracy, and transparency.
A call typically lasts 45 to 90 minutes.
Read by: investors (individual and institutional), financial analysts, journalists and media, regulators, company executives and employees.
Read for: assessing financial health and growth potential, making informed buy/sell decisions,
issuing reports on performance and market-moving announcements, investment recommendations,
compliance monitoring, strategic alignment.
System Design:
1. Business problem
a. What business case are we solving?
The idea is to simplify the decision-making process by reducing the time and effort spent consuming documents that run to tens of pages,
for all stakeholders (common users, institutional investors, analysts, media, regulators).
b. What are we trying to do?
Build a system that answers user queries about the transcripts, saving the time spent searching through the documents,
while also giving citations for the generated content.
c. Constraints?
Prediction format: user queries arrive in real time.
Infra constraints: an LLM that can handle multiple concurrent user requests, a database to hold the full documents for RAG, and compute to host and run the full agent.
d. What metrics are important and tracked?
Request level: latency, token usage, cost/request, tools called, success/failure
Business level: task completion rate, cost per successful task
e. High level Design?
The approach is a hybrid solution:
1. A batch-mode process to store the documents in a Chroma database
2. For each user question, an LLM-based agent that retrieves the context related to the question
The solution consists of:
1. A vector store to hold the chunks and documents and return relevant information
2. An LLM to answer the questions
3. A Streamlit-based UI to show the chat, summaries, and comparison analysis
4. Langfuse to evaluate the requests and responses
5. Conversation buffer memory as working memory
2. Data
a. Data Sources?
50 companies, ~2.5 MB/company, quarters from 2018-Q1 to 2024-Q2, some reaching 2025-Q2
1. each file starts with a header line
2. followed by one huge paragraph containing the full call (prepared remarks + Q&A)
3. the prepared remarks are then properly split out, followed by the Q&A section
4. an ending line
b. Data Cleaning?
1. Remove the redundant content so only the transcript and Q&A are stored
2. A pipeline that walks all the files in the path and returns a smaller file with the actual content (data-cleaner code)
3. Effectively the size drops to ~1.2 MB/company, a reduction of roughly 50%
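The cleaning step can be sketched roughly as below; the single header line, single ending line, and the "Questions and Answers" marker are assumptions about the file layout:

```python
import re
from pathlib import Path

# Assumed marker separating prepared remarks from the Q&A section.
QA_MARKER = re.compile(r"Questions?\s*(?:and|&)\s*Answers?", re.IGNORECASE)

def clean_transcript(raw: str) -> dict:
    """Strip the header (first) and ending (last) lines, then split remarks from Q&A."""
    lines = raw.strip().splitlines()
    body = "\n".join(lines[1:-1])
    m = QA_MARKER.search(body)
    if m:
        return {"remarks": body[:m.start()].strip(), "qa": body[m.end():].strip()}
    return {"remarks": body.strip(), "qa": ""}

def clean_folder(src: Path, dst: Path) -> None:
    """Walk every transcript in `src` and write the smaller cleaned file to `dst`."""
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.glob("*.txt"):
        parts = clean_transcript(f.read_text(encoding="utf-8"))
        (dst / f.name).write_text(parts["remarks"] + "\n\n" + parts["qa"], encoding="utf-8")
```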
c. Cost implications of summarizing?
With OpenAI GPT-5 mini (400k input-token limit): $0.25 per 1M input tokens, $2 per 1M output tokens (Standard tier)
With Gemini models (1M input-token limit): Gemini 2.5 Flash costs $0.30 per 1M input tokens, $2.50 per 1M output tokens (Standard tier)
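A quick back-of-the-envelope script for these rates; the ~4 characters/token ratio and the summaries-are-5%-of-input assumption are rough guesses, not measurements:

```python
# Rough corpus-wide summarization cost estimate.
CHARS_PER_TOKEN = 4                           # coarse English-text heuristic
corpus_bytes = 50 * 1.2 * 1024 * 1024         # 50 companies x ~1.2 MB cleaned
input_tokens = corpus_bytes / CHARS_PER_TOKEN
output_tokens = input_tokens * 0.05           # assume summaries ~5% of the input

def cost_usd(in_tok: float, out_tok: float, in_rate: float, out_rate: float) -> float:
    """Rates are USD per 1M tokens."""
    return (in_tok * in_rate + out_tok * out_rate) / 1_000_000

gpt5_mini = cost_usd(input_tokens, output_tokens, 0.25, 2.00)
gemini_flash = cost_usd(input_tokens, output_tokens, 0.30, 2.50)
```

Either way, a one-off summarization pass over the whole cleaned corpus lands in the single-digit-dollar range at these rates.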
3. Indexing and persisting
We use ChromaDB to store the raw transcripts for grounding user queries.
Since this is a small project, a local vector DB like Chroma is the best fit.
At production scale (1-10M vectors) we would upgrade to Pinecone or Weaviate.
For larger enterprises (>10M vectors) Weaviate or Milvus is the better choice.
Chunking:
Since the document set is small we can afford hybrid chunking (semantic + recursive character splitting), which is usually computationally expensive.
Each chunk carries metadata (company + year + quarter + segment), where segment marks prepared remarks vs. Q&A:
prepared remarks by management are chunked separately with metadata segment = "remarks"
the Q&A portion is chunked separately with metadata segment = "Q&A"
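The recursive-character half of the chunking can be sketched without any library; the chunk size, overlap, and separator order below are illustrative choices, not tuned values:

```python
def recursive_split(text, max_chars=1000, overlap=100, seps=("\n\n", "\n", ". ", " ")):
    """Naive recursive character splitting: try coarser separators first.
    A single part longer than max_chars passes through oversized (sketch only)."""
    if len(text) <= max_chars:
        return [text]
    for sep in seps:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, buf = [], ""
            for p in parts:
                candidate = (buf + sep + p) if buf else p
                if len(candidate) > max_chars and buf:
                    chunks.append(buf)
                    buf = buf[-overlap:] + sep + p   # carry a tail as overlap
                else:
                    buf = candidate
            if buf:
                chunks.append(buf)
            return chunks
    # no separator worked: hard split with overlap
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars - overlap)]

def chunk_with_metadata(text, company, year, quarter, segment):
    """segment is 'remarks' or 'Q&A', matching the collection's metadata schema."""
    return [
        {"text": c, "metadata": {"company": company, "year": year,
                                 "quarter": quarter, "segment": segment}}
        for c in recursive_split(text)
    ]
```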
indexing:
HNSW - the default index in ChromaDB.
Embedding model:
To convert text into embeddings we can use a local embedding model (for the POC):
a. google/embeddinggemma-300m
or
b. Gideon531/Qwen3-Embedding-0.6B
The Qwen base version performs better, but Gemma is quicker and easy to run on my laptop GPU.
Reasons: cost + the GPU support on my laptop.
For deploying the code as a container it is better to use a cloud-based embedding service, since we then do not need to download libraries and the whole embedding model:
a. Google's gemini-embedding-001 is a decent start
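A sketch of how the chunks would then be persisted; in the real pipeline `collection` would come from `chromadb.PersistentClient(...).get_or_create_collection(...)`, but the helper below only assumes a Chroma-style `.add()` signature, and the id format is a made-up convention:

```python
def make_chunk_id(meta: dict, idx: int) -> str:
    """Deterministic id like 'ACME-2024-Q1-remarks-0003' (illustrative convention)."""
    return f"{meta['company']}-{meta['year']}-{meta['quarter']}-{meta['segment']}-{idx:04d}"

def add_chunks(collection, chunks: list) -> None:
    """Persist chunks shaped like [{'text': ..., 'metadata': {...}}, ...].
    `collection` is any object exposing a Chroma-style
    .add(ids=..., documents=..., metadatas=...) method; Chroma embeds the
    documents itself using the collection's configured embedding function."""
    collection.add(
        ids=[make_chunk_id(c["metadata"], i) for i, c in enumerate(chunks)],
        documents=[c["text"] for c in chunks],
        metadatas=[c["metadata"] for c in chunks],
    )
```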
4. LLM, Prompting and memory
We will have Streamlit code handling the UI, where the user enters a query and gets their response back.
Memory:
The LLM will have conversational memory holding the session's past questions and user preferences.
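A minimal sketch of the conversation buffer; the turn limit and rendering format are illustrative:

```python
from collections import deque

class ConversationBuffer:
    """Working memory: keeps the last `max_turns` (user, assistant) exchanges
    and renders them into the prompt on every turn."""

    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)   # oldest turns fall off automatically

    def add(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def render(self) -> str:
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)
```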
Prompts:
A system prompt holding the agent's primary responsibilities and its reasoning pattern.
A user prompt (paired with the user question) giving guidelines on shaping the input sent for retrieval and on presenting the output.
The system prompt translates the user query into properly formatted query/queries for effective retrieval.
As part of query translation there are several approaches: RAG-Fusion, decomposition, step-back, HyDE.
Since we do metadata filtering, the number of candidate documents is already small at the first stage (HyDE would be overkill).
Since user questions can be technical and specific, and the expectation is a specific answer, step-back generalization is not ideal.
Users ask direct technical and comparison questions, so the better approach for our use case is decomposition:
we divide the question into sub-queries, solve them sequentially, and use each prior sub-solution in the next step.
The purpose of this agent is to equip the user with knowledge to make a decision, not to suggest the decision, so decomposition is better suited than RAG-Fusion.
(If the agent had to suggest a decision, RAG-Fusion would work well given its multi-perspective nature.)
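A model-agnostic sketch of the decomposition loop; `llm` and `retrieve` are injected callables (in our stack they would wrap Gemini and the Chroma retriever), and the prompt wording is purely illustrative:

```python
def decompose_and_answer(question, llm, retrieve):
    """Query decomposition: break the question into sub-questions, answer them
    sequentially (each step sees the prior sub-answers), then synthesize.
    llm: callable prompt -> str; retrieve: callable sub-query -> context string."""
    plan = llm(f"Break this question into sub-questions, one per line:\n{question}")
    subs = [s.strip() for s in plan.splitlines() if s.strip()]
    qa_pairs = []
    for sub in subs:
        context = retrieve(sub)
        prior = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
        answer = llm(
            f"Context:\n{context}\n\nPrior sub-answers:\n{prior}\n\n"
            f"Answer the sub-question: {sub}"
        )
        qa_pairs.append((sub, answer))
    final = llm(
        f"Synthesize one answer to '{question}' from these sub-answers:\n"
        + "\n".join(a for _, a in qa_pairs)
    )
    return final, qa_pairs
```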
The user prompt routes the question to the proper Chroma collection via query translation, adhering to the vector metadata schema.
The user prompt also holds the rules for presenting the final response to the user.
We can use an agentic approach to reason over the results (well suited since Gemini is a large model).
LLM:
We will use Gemini 2.5 Flash. The use case requires factual responses.
Temperature: to control the creativity of the response we can test values in the 0.5-0.7 range so the response stays factual.
Top-p: top-p selects the set of candidate next tokens and temperature governs how one of them is picked; 0.9 keeps the top tokens whose probabilities sum to 90%.
Maximum length: controls the number of output tokens (keeps responses precise and reduces cost).
We leave the penalties (frequency penalty, presence penalty) untouched because they encourage variety, and our goal is factuality.
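The decoding settings above as a Gemini-style generation config; the values are the test range discussed, not tuned numbers, and the commented SDK usage is an assumption about the google-generativeai client:

```python
# Decoding settings for factual answers (generation_config-style dict).
GENERATION_CONFIG = {
    "temperature": 0.5,         # lower = less creative; test in the 0.5-0.7 band
    "top_p": 0.9,               # nucleus sampling: smallest token set covering 90% mass
    "max_output_tokens": 1024,  # caps response length -> bounds output cost
}

# Assumed usage with the google-generativeai SDK:
# model = genai.GenerativeModel("gemini-2.5-flash", generation_config=GENERATION_CONFIG)
```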
5. Retrieval
To retrieve the relevant content we use pre-filtering + similarity search. Since the document count is small, plain similarity search does the job.
First we pre-filter (metadata filtering) to reduce the search space to a subset of embeddings in the collection based on the tags.
Then semantic search (approximate nearest neighbours) takes the embedded question and navigates the HNSW index to find similar vectors.
Results are ranked by cosine similarity.
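The pre-filter + similarity flow can be sketched as follows; `build_where` emits Chroma's `where` filter syntax (`$eq`, `$and`), and `retrieve` only assumes a Chroma-style `.query()` signature on the injected collection:

```python
def build_where(company=None, year=None, quarter=None, segment=None):
    """Build a Chroma metadata pre-filter from the tags; None when no tag is given."""
    clauses = [
        {key: {"$eq": value}}
        for key, value in [("company", company), ("year", year),
                           ("quarter", quarter), ("segment", segment)]
        if value is not None
    ]
    if not clauses:
        return None
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

def retrieve(collection, question, k=5, **tags):
    """Metadata pre-filter first, then ANN search over the remaining vectors.
    `collection` is a Chroma-style object exposing .query(query_texts=, n_results=, where=)."""
    return collection.query(query_texts=[question], n_results=k, where=build_where(**tags))
```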
6. Model evaluation and monitoring
To evaluate model performance we track:
Latency (p50/p95), token usage, tool-call count, success/failure - tracked online in real time with Langfuse
Business level: task completion rate, cost per successful task
With LLM-as-a-judge we also measure hallucinations, context recall, context precision, and faithfulness of the answers in real time.
For a production-level project we need a monitoring workflow that logs the metrics, displays them in dashboards, and alerts on events:
query-level logging of metrics, user IDs, and queries for log tracking;
a dashboard tool showing request, cost, and tool-level metrics;
alerting rules configured on top of the logging tools.
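A minimal in-process version of the request-level tracking described above (in the real system Langfuse records these spans; the class here is a stdlib-only stand-in):

```python
import statistics
import time
from contextlib import contextmanager

class RequestMetrics:
    """Tracks per-request latency and success/failure counts."""

    def __init__(self):
        self.latencies, self.success, self.failure = [], 0, 0

    @contextmanager
    def track(self):
        """Wrap one request: records latency always, tallies success vs. failure."""
        t0 = time.perf_counter()
        try:
            yield
            self.success += 1
        except Exception:
            self.failure += 1
            raise
        finally:
            self.latencies.append(time.perf_counter() - t0)

    def p50(self):
        return statistics.median(self.latencies)

    def p95(self):
        xs = sorted(self.latencies)
        return xs[max(0, int(0.95 * len(xs)) - 1)]
```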
7. Model deployment
For our project deployment we use the following flow:
1. ChromaDB embedding and persisting is done offline, locally; once done, the store is uploaded to Google Cloud Storage
2. The repo is divided into two parts: the backend (LLM_engine, chroma_tool, chroma_loader, tracking, etc.), served via FastAPI
3. The frontend code, which holds the Streamlit app
Deploy the backend with 0 traffic:
gcloud run deploy rag-backend --no-traffic
Test the new backend:
gcloud run services describe rag-backend
Shift traffic to the new backend revision:
gcloud run services update-traffic rag-backend --to-revisions rag-backend-v2=10,rag-backend-v1=90
8. Performance:

