🔎 LLM Data Toolkit — Retrieval & Generation Workflows in Python

A curated set of Jupyter notebooks (00 – 90) that show, step by step, how to turn unstructured text into structured, analysis‑ready data with the OpenAI Python SDK.

🚀 Quick wins	📚 Re‑usable patterns	🧪 Validation baked in
Extract causal “A → B” graphs from PDFs	Two‑stage retrieval flows that cut costs 3×	Modal‑vote agreement, cosine sanity checks
Generate supply‑chain nets & innovation profiles	Async batch jobs (10 k calls ⇢ 1 ¢/req)	Bias‑check extensions for demographic tasks

📂 Notebook catalogue

ID	Notebook name	What you learn (one‑liner)
`00_api_smoke_test.ipynb`	Hello API	Key handling, quick connectivity check
`10_single_stage_retrieval.ipynb`	Retrieval v1	Extract causal edges from abstracts via JSON schema
`20_two_stage_retrieval.ipynb`	Retrieval v2	Summarise 30 pp → then pull edges (cheaper & cleaner)
`30_supply_chain_generation.ipynb`	Generation	Bill‑of‑materials for an EV (inputs + scores)
`40_embeddings_mapping.ipynb`	Embeddings 1	Map free‑text parts to HS6 / JEL codes with embeddings
`41_41_embeddings_novelty_detection.ipynb`	Embeddings 2	Use embeddings to detect distinctive items among a group.
`50_dictionary_gen_prune.ipynb`	Keyword builder	Context‑aware n‑gram lists & LLM pruning loop
`60_tweet_stance.ipynb`	Stance classifier	Pro / anti / neutral / unrelated with modal voting
`70_name_gender.ipynb` & `71_name_race.ipynb`	Demographic tagging	Same helper, different schema; bias tests included
`80_company_innovation.ipynb`	Innovation profiler	17‑field profile from just name + country
`90_batch_translation_demo.ipynb`	Async batching	Split → upload → poll → parse 50 k requests end‑to‑end

🖼️ Slides (`slides.pdf`)

Slides that mirror the notebooks: quick API tour, workflow patterns, batching cheat‑sheet, cost maths, and validation playbook.
“Why‑now?” chart (150× token‑price drop, 500× context‑window jump).
Case‑studies (causal graphs, supply‑chains, stance detection, translation)

Crediting

If you think the notebooks helped, you could express gratitude by citing the relevant paper. Each notebook contains underlying papers where I developed the application. These are on top of each notebook. These papers include:

Garg, P. and Fetzer, T., 2025. Causal claims in economics. arXiv preprint arXiv:2501.06873.
Fetzer, T., Lambert, P.J., Feld, B. and Garg, P., 2024. AI-generated production networks: Measurement and applications to global trade.
Garg, P. and Fetzer, T., 2025. Political expression of academics on Twitter. Nature Human Behaviour. DOI: 10.1038/s41562-025-02199-1
Garg, P. and Fetzer, T., 2025. Artificial Intelligence health advice accuracy varies across languages and contexts. arXiv preprint arXiv:2504.18310.

👋 About Me

Hi — I’m Prashant Garg
PhD candidate, Economics & Public Policy Department, Imperial College Business School

Research areas

* AI & Big Data | Economics of Networks
* Science of Science | Media & Political Economy

📄 Papers & projects → https://www.prashantgarg.org/
✉️ Reach me at [email protected]

Always happy to chat about LLM workflows, causal graphs, or collaborative ideas!

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
int_data		int_data
key		key
00_api_smoke_test.ipynb		00_api_smoke_test.ipynb
10_retrieval_edges.ipynb		10_retrieval_edges.ipynb
20_two_stage_retrieval.ipynb		20_two_stage_retrieval.ipynb
30_supply_chain_generation.ipynb		30_supply_chain_generation.ipynb
40_embeddings_mapping.ipynb		40_embeddings_mapping.ipynb
41_embeddings_novelty_detection.ipynb		41_embeddings_novelty_detection.ipynb
50_dictionary_generation_pruning.ipynb		50_dictionary_generation_pruning.ipynb
60_tweet_stance_classification.ipynb		60_tweet_stance_classification.ipynb
70_gender_name_classification.ipynb		70_gender_name_classification.ipynb
80_company_innovation_generation.ipynb		80_company_innovation_generation.ipynb
90_batching_and_translation.ipynb		90_batching_and_translation.ipynb
LICENSE		LICENSE
README.md		README.md
additional_readme.txt		additional_readme.txt
readme.txt		readme.txt
slides.pdf		slides.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🔎 LLM Data Toolkit — Retrieval & Generation Workflows in Python

📂 Notebook catalogue

🖼️ Slides (`slides.pdf`)

Crediting

👋 About Me

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🔎 LLM Data Toolkit — Retrieval & Generation Workflows in Python

📂 Notebook catalogue

🖼️ Slides (slides.pdf)

Crediting

👋 About Me

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

🔎 LLM Data Toolkit — Retrieval & Generation Workflows in Python

🖼️ Slides (`slides.pdf`)

Packages