<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://danielfleischer.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://danielfleischer.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2025-12-18T15:26:16+00:00</updated><id>https://danielfleischer.github.io/feed.xml</id><title type="html">blank</title><subtitle>Personal website for Daniel Fleischer. </subtitle><entry><title type="html">DeepMath: A lightweight math reasoning Agent with smolagents</title><link href="https://danielfleischer.github.io/blog/2025/deepmath-a-lightweight-math-reasoning-agent-with-smolagents/" rel="alternate" type="text/html" title="DeepMath: A lightweight math reasoning Agent with smolagents"/><published>2025-12-04T00:00:00+00:00</published><updated>2025-12-04T00:00:00+00:00</updated><id>https://danielfleischer.github.io/blog/2025/deepmath-a-lightweight-math-reasoning-agent-with-smolagents</id><content type="html" xml:base="https://danielfleischer.github.io/blog/2025/deepmath-a-lightweight-math-reasoning-agent-with-smolagents/"><![CDATA[<p>DeepMath is an aligned math reasoning agent built on Qwen3-4B Thinking and fine-tuned with GRPO (Group Relative Policy Optimization). Instead of verbose text, the model emits tiny Python snippets for intermediate steps, runs them in a secure sandbox, and folds the results back into its reasoning, reducing errors and output length. 
The agent is implemented using the smolagents library. We evaluate DeepMath on four math datasets (MATH500, AIME, HMMT, and HLE) and show that:</p> <ul> <li>🤖 The math agent alone reduces output lengths by up to 66%, while often improving accuracy.</li> <li>⚡ GRPO training improves agent performance even further in almost all benchmarks.</li> </ul> <p>👉 Code and evaluation scripts: https://github.com/IntelLabs/DeepMath<br/> 👉 Model: https://huggingface.co/Intel/deepmath-v1</p> <p>Large language models (LLMs) have advanced reasoning capabilities, but mathematical problem-solving remains challenging: chain-of-thought traces can be lengthy and prone to arithmetic mistakes. Recent works[^1][^2] demonstrate that small models can reach strong performance, and other studies[^3] investigate tool use to improve reliability. What those papers generally do not emphasize is reducing trace verbosity or explicitly training models to prefer short, computation-oriented traces executed in a constrained, auditable environment. We focused on two goals:</p> <ul> <li>Offload deterministic computation to a safe executor.</li> <li>Train models to prefer concise, computation-oriented traces over verbose text.</li> </ul> <p>DeepMath tackles this by combining a small Python executor with a fine-tuned LLM, enabling concise, computation-driven reasoning. The model learns to generate short Python snippets, which are executed in a sandbox and reintegrated into the context. GRPO fine-tuning encourages this behavior by rewarding correctness and shorter outputs.</p> <p>Figure 1: The vLLM client and server were modified to use the DeepMath agent when generating candidates, while using the vLLM backend. 
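The generate, execute, and reintegrate loop can be sketched as follows. This is a simplified stand-in, not the smolagents implementation: the `[[py]]…[[/py]]` call markers, the restricted-builtins sandbox, and the `[[out]]` result markers are all illustrative assumptions, since the post does not show the actual token format.

```python
import contextlib
import io
import re

# Illustrative marker syntax for "agent calls"; the real special tokens
# used by DeepMath/smolagents are not shown in the post.
CALL_RE = re.compile(r"\[\[py\]\](.*?)\[\[/py\]\]", re.DOTALL)

def run_snippet(snippet: str) -> str:
    """Execute a tiny Python snippet with restricted builtins and captured stdout.
    A real sandbox would additionally enforce timeouts and block imports/file I/O."""
    safe_builtins = {"print": print, "range": range, "sum": sum, "len": len,
                     "min": min, "max": max, "abs": abs, "round": round}
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(snippet, {"__builtins__": safe_builtins})
    return buf.getvalue().strip()

def agent_step(model_output: str) -> str:
    """Replace each agent call with its executor result, folding the computed
    value back into the trace so it becomes context for the next tokens."""
    return CALL_RE.sub(lambda m: f"[[out]]{run_snippet(m.group(1))}[[/out]]",
                       model_output)

trace = "The total is [[py]]print(sum(range(1, 101)))[[/py]], so the answer follows."
print(agent_step(trace))
```

The key property is that arithmetic is done by the executor, not by the model, so the folded-back result is deterministic.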
</p> <p><strong>Agent interface</strong>: during inference, the model can output normal tokens or special agent calls containing Python snippets.</p> <p><strong>Execution</strong>: snippets run in a sandboxed environment with strict safety constraints (no file I/O, no network, timeouts).</p> <p><strong>Design goals</strong>:</p> <ul> <li>Concision: replace multi-line textual calculations with short, focused snippets.</li> <li>Determinism &amp; safety: enforce strict execution limits.</li> <li>Interpretability: snippets are readable and auditable.</li> </ul> <p>Figure 2: Output example where Python code is generated and evaluated, and the answer is inserted into the trace and used as context.</p> <p>We fine-tune the model using GRPO, a reward-based optimization that balances:</p> <ul> <li>Accuracy reward: +1 for correct answers.</li> <li>Code-snippet reward: +1 for generating code snippets, weighted 10:1 vs. the accuracy reward.</li> <li>Length reduction: shorter outputs are encouraged by limiting the GRPO completion candidates to 5k tokens.</li> </ul> <p>Additional training details:</p> <ul> <li>Temperature scheduling: we implemented a linear temperature schedule (T=1.2 → T=0.7) to balance exploration and stability, encouraging exploration in the initial training phases and reducing the temperature as the model masters the skill.</li> <li>In-context learning: we include 4 solved examples whose traces contain agent calls and executor outputs, so the model learns the syntax and the call/response pattern.</li> <li>Dataset: we used the Tool-Integrated Reasoning (TIR) subset of the OpenMathReasoning dataset, chosen to ensure the problems benefit from the external tool. Note that GRPO only uses the problem, not the solution, in the data.</li> </ul> <p>We benchmarked DeepMath against baselines on four datasets. Metrics include:</p> <ul> <li>majority@16: robustness across samples, as used in previous math reasoning works (see references).</li> <li>Mean output length: brevity.</li> </ul> <p>We compare a baseline configuration (Qwen3-4B-Thinking-2507, no agentic inference) with our DeepMath model. 
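The reward and temperature schedule described above can be sketched as follows. This is our own illustration, not the published training code: the exact reward shaping is not given in the post, and we read the stated 10:1 weighting as the accuracy reward outweighing the snippet bonus, which is an assumption.

```python
def reward(is_correct: bool, used_code: bool,
           w_accuracy: float = 10.0, w_code: float = 1.0) -> float:
    """Combine the two reward signals; the default 10:1 weights reflect the
    stated ratio, read here as accuracy outweighing the snippet bonus."""
    return w_accuracy * float(is_correct) + w_code * float(used_code)

def temperature(step: int, total_steps: int,
                t_start: float = 1.2, t_end: float = 0.7) -> float:
    """Linearly anneal the sampling temperature from t_start to t_end,
    trading early exploration for late-stage stability."""
    frac = step / max(total_steps - 1, 1)
    return t_start + (t_end - t_start) * frac

print(reward(is_correct=True, used_code=True))
print(temperature(0, 100), temperature(99, 100))
```

The 5k-token completion cap acts as an implicit length penalty: over-long candidates are truncated and cannot earn the accuracy reward.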
As ablations, we evaluate the agentic framework we developed running with the untrained Qwen3 model, denoted +Agent, and we examine whether the GRPO training (for agentic use) improves non-agentic inference, denoted +GRPO. The two ablations are thus independent, not additive.</p> <p>We observe that agentic inference reduces output lengths, with mixed accuracy results. The DeepMath model, which is both GRPO-trained and run in agentic mode, shows the highest accuracy with shortened traces. We conclude that both GRPO training and agentic inference are needed for the best results.</p> <p>Key insight: DeepMath reduces output length by up to 66% while improving accuracy on challenging datasets.</p> <ul> <li>Accuracy: offloading computation reduces arithmetic errors.</li> <li>Efficiency: shorter outputs mean faster inference and easier interpretability.</li> <li>Safety: sandboxed execution mitigates the risks of running arbitrary code.</li> </ul> <p>DeepMath demonstrates a practical and lightweight way to combine a small executor with an LLM and to train the model to prefer short, computation-driven traces. Offloading deterministic computation reduces arithmetic and numerical errors and shortens traces, and GRPO fine-tuning further encourages concise, correct answers. The result is a more accurate and more interpretable math-solving agent that does not require a massive model or heavyweight external tools.</p> <p>Check out the GitHub repo and share your feedback! Contributions welcome. 🚀</p> <p>If you use DeepMath in your research, please cite:</p> <p>Limitations:</p> <ul> <li>Scope: we focused on a small model and on mathematical reasoning.</li> <li>Generalization: evaluated on contest-style math; results may not transfer to open-ended mathematical creativity or formal proofs.</li> </ul> <p>Executing generated code is inherently risky. DeepMath uses strict sandboxing and resource limits, but any deployment should carefully manage attack surfaces and enforce rate limits.</p> <p>[1] Luo, Michael, Sijun Tan, Justin Wong, et al. 2025. 
“DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL.” https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2</p> <p>[2] Liu, Mingjie, Shizhe Diao, Ximing Lu, et al. 2025. “ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models.” arXiv:2505.24864. Preprint, arXiv, May 30. https://doi.org/10.48550/arXiv.2505.24864</p> <p>[3] Moshkov, Ivan, Darragh Hanley, Ivan Sorokin, et al. 2025. “AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning Dataset.” arXiv:2504.16891. Preprint, arXiv, April 23. https://doi.org/10.48550/arXiv.2504.16891</p>]]></content><author><name></name></author><summary type="html"><![CDATA[We’re on a journey to advance and democratize artificial intelligence through open source and open science.]]></summary></entry><entry><title type="html">Breaking Language Barriers in Mathematical AI: Introducing Hebrew Math Tutor</title><link href="https://danielfleischer.github.io/blog/2025/breaking-language-barriers-in-mathematical-ai-introducing-hebrew-math-tutor/" rel="alternate" type="text/html" title="Breaking Language Barriers in Mathematical AI: Introducing Hebrew Math Tutor"/><published>2025-09-07T00:00:00+00:00</published><updated>2025-09-07T00:00:00+00:00</updated><id>https://danielfleischer.github.io/blog/2025/breaking-language-barriers-in-mathematical-ai-introducing-hebrew-math-tutor</id><content type="html" xml:base="https://danielfleischer.github.io/blog/2025/breaking-language-barriers-in-mathematical-ai-introducing-hebrew-math-tutor/"><![CDATA[<p>Hebrew Math Tutor (Intel/hebrew-math-tutor-v1) brings advanced mathematical problem-solving capabilities directly to Hebrew speakers, providing detailed step-by-step reasoning entirely in Hebrew without sacrificing the computational accuracy that makes these models valuable for 
education.</p> <p>🤖 Try the Demo: IntelLabs/hebrew-math-tutor</p> <p>Advanced mathematical AI models, like those trained on competition mathematics datasets, have shown remarkable problem-solving abilities. However, they primarily operate in English, creating barriers for non-English-speaking educational communities. Hebrew speakers, in particular, have faced challenges accessing these powerful educational tools in their native language. Simply translating outputs isn’t enough: effective mathematical tutoring requires natural language flow, culturally appropriate explanations, and seamless integration of Hebrew text with mathematical notation. This requires a more sophisticated approach.</p> <p>Hebrew Math Tutor addresses these challenges through targeted fine-tuning of Qwen3-4B-Thinking-2507, a powerful 4-billion-parameter mathematical reasoning model. Our approach focuses on three key principles:</p> <ul> <li>The model provides complete mathematical explanations in natural Hebrew while preserving mathematical notation and formal expressions. It understands Hebrew mathematical terminology and can explain complex concepts using appropriate pedagogical language.</li> <li>By carefully fine-tuning rather than training from scratch, we maintain the model’s core mathematical reasoning capabilities while adapting its communication style to Hebrew.</li> <li>At ~4 billion parameters, the model strikes a balance between capability and computational efficiency, making it practical for educational applications and research prototyping.</li> </ul> <p>Creating an effective Hebrew math model required more than simple translation. 
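A metric reported below is how often the model's final answer actually comes out in Hebrew ("Hebrew answer production"); one simple way to check this is to inspect Unicode script ranges. A minimal sketch, our own illustration rather than the project's evaluation code:

```python
def hebrew_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the Unicode Hebrew block (U+0590-U+05FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hebrew = sum(1 for ch in letters if "\u0590" <= ch <= "\u05ff")
    return hebrew / len(letters)

def is_hebrew_answer(text: str, threshold: float = 0.5) -> bool:
    """Count an answer as Hebrew if most of its letters are Hebrew script.
    Digits and math notation are ignored, so mixed answers still classify sensibly."""
    return hebrew_ratio(text) >= threshold

print(is_hebrew_answer("התשובה היא 42"))
print(is_hebrew_answer("The answer is 42"))
```

The 0.5 threshold is an illustrative choice; any scoring of this kind should be validated against hand-labeled outputs.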
Our methodology involved:</p> <ul> <li>Data: we selected ~10,000 high-quality problems from the OpenMathReasoning dataset, translating questions and answers to Hebrew while preserving the original reasoning chains and mathematical notation.</li> <li>Training: we fine-tuned the model over 3 epochs with optimized parameters (learning rate 5e-6, 0.1 warmup, cosine scheduling) to adapt the output language while maintaining the underlying reasoning capabilities.</li> <li>Reasoning blocks: the model’s internal &lt;think&gt;...&lt;/think&gt; reasoning blocks remain in English, as these represent core computational processes that would require more extensive training to modify.</li> </ul> <p>We evaluated Hebrew Math Tutor against its base model on three challenging mathematical benchmarks: MATH500 (curriculum problems), AIME24, and AIME25 (competition mathematics). The results demonstrate significant improvements in Hebrew language output while maintaining strong technical performance:</p> <ul> <li>🚀 Dramatic Hebrew language gains: Hebrew answer production jumped from 35-75% to 95-100% across all benchmarks—a transformative improvement for Hebrew-speaking users.</li> <li>📈 Consistent accuracy improvements: notable gains in pass@16 scores on Hebrew evaluations, showing the model doesn’t just translate but actually improves problem-solving in Hebrew contexts.</li> <li>🔄 Preserved core capabilities: maintained competitive English performance, demonstrating that Hebrew specialization didn’t compromise the model’s fundamental mathematical abilities.</li> <li>⚖️ Nuanced majority-vote results: while performance improved on MATH500 and remained stable on AIME24, there is an interesting decrease in maj@16 on AIME25 that provides insights for future training approaches.</li> </ul> <p>Hebrew Math Tutor opens new possibilities across multiple domains, and it integrates seamlessly with the Transformers ecosystem.</p> <p>Hebrew Math Tutor in action: a Streamlit interface showing detailed step-by-step reasoning in Hebrew. The expandable reasoning sections allow users to dive deep into the mathematical process or focus on final answers.</p> <p>While Hebrew Math Tutor represents significant progress, responsible deployment requires careful consideration. The model works best as an educational aid rather than a replacement for qualified instruction. We recommend implementing human oversight, providing clear disclaimers about AI-generated content, and ensuring compliance with relevant privacy regulations in educational applications.</p> <p>Hebrew Math Tutor demonstrates that language barriers in AI can be effectively addressed through thoughtful fine-tuning approaches. This work represents more than just a Hebrew mathematical model—it’s a proof of concept for making advanced AI capabilities truly accessible across linguistic communities. The techniques developed here can be adapted for other languages, creating a pathway toward more inclusive mathematical AI tools. As we continue to refine these approaches, we’re moving closer to a future where language is no longer a barrier to accessing the most advanced educational technologies.</p> <p>Hebrew Math Tutor is available now under the Apache-2.0 license, and we encourage the community to get involved. 🚀 Start exploring Hebrew Math Tutor today and experience mathematical AI that truly speaks your language.</p> <p>Built with gratitude upon the foundational work of Qwen3-4B-Thinking-2507 and the OpenMathReasoning dataset.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A Blog post by Daniel Fleischer on Hugging Face]]></summary></entry><entry><title type="html">Jujutsu Impressions</title><link href="https://danielfleischer.github.io/blog/2025/jj/" rel="alternate" type="text/html" title="Jujutsu Impressions"/><published>2025-08-29T00:00:00+00:00</published><updated>2025-08-29T00:00:00+00:00</updated><id>https://danielfleischer.github.io/blog/2025/jj</id><content type="html" xml:base="https://danielfleischer.github.io/blog/2025/jj/"><![CDATA[<h4 id="tldr">TL;DR</h4> <ul> <li>JJ does things differently from Git, but it’s compatible enough to try. The core unit is the persistent <strong>change</strong>, which evolves through snapshots.</li> <li>Actions are logged as operations, enabling undo and restore to any point in time.</li> <li>No Git-style branches. Changes have persistent IDs, and you use bookmarks for Git-forge workflows.</li> <li>You can colocate <code class="language-plaintext highlighter-rouge">jj</code> with an existing Git repo, keeping your Git workflow intact while <code class="language-plaintext highlighter-rouge">jj</code> handles history.</li> <li>Curious? Follow the tutorial or experiment on a current repo to see the difference firsthand.</li> </ul> <hr/> <p>In this post I’ll share some insights about using the <a href="https://jj-vcs.github.io/jj/latest/">Jujutsu version control system</a>, which I’ll call <code class="language-plaintext highlighter-rouge">jj</code> for the rest of the post.</p> <p><code class="language-plaintext highlighter-rouge">jj</code> is a version control system, currently built on top of git, using its building blocks. However, it’s not just a new porcelain; it defines new abstractions and data structures of its own.</p> <h4 id="concept-the-change-as-the-atomic-unit">Concept: The Change as the Atomic Unit</h4> <p>The most important concept is the <strong>change</strong>. 
The <strong>change</strong> defines an atomic unit of, well, change. Its analog is the git commit. But in <code class="language-plaintext highlighter-rouge">jj</code>, a change can develop over time. Technically, it’s a git commit that keeps getting amended, the difference being that the <strong>change</strong> keeps its identity via a persistent ID and a description field. A <strong>change</strong> updates whenever <code class="language-plaintext highlighter-rouge">jj</code> takes a snapshot of the repo—every time you call <code class="language-plaintext highlighter-rouge">jj</code>. If you think about it, <code class="language-plaintext highlighter-rouge">jj</code> can’t lose work, as it always starts by taking a snapshot, even for informative commands like <code class="language-plaintext highlighter-rouge">jj log</code>.</p> <p>Thus the <strong>change</strong> represents an amended commit, the “last” change. However, <code class="language-plaintext highlighter-rouge">jj</code> lets you inspect the internal, previous commits inside the <strong>change</strong> by calling <code class="language-plaintext highlighter-rouge">jj evolog</code>. These internal commits are not synced to git forges and, unlike git’s amended commits, are not automatically garbage collected.</p> <p>When we are done with a <strong>change</strong>, we can give it a description (which we can do at any time using <code class="language-plaintext highlighter-rouge">jj describe</code>) and create an empty <strong>change</strong> on top of it, ready to receive new modifications. We do that using <code class="language-plaintext highlighter-rouge">jj commit</code>.</p> <p>For example, when you initialize a repo, the initial current <strong>change</strong> is empty. You add files, edit them. 
When you are ready, you give this change a description via <code class="language-plaintext highlighter-rouge">jj describe</code> and then start a new change via <code class="language-plaintext highlighter-rouge">jj new</code>, or do both at the same time via <code class="language-plaintext highlighter-rouge">jj commit</code>.</p> <h4 id="graph-branches-and-bookmarks">Graph, Branches, and Bookmarks</h4> <p>Moving around (<code class="language-plaintext highlighter-rouge">git checkout</code>) is done with <code class="language-plaintext highlighter-rouge">jj edit</code>. But we need to be careful: any edit we make after the jump will be snapshotted (“amended”) onto the current <strong>change</strong>, modifying it. If you want to jump in to do some work, it’s better to use <code class="language-plaintext highlighter-rouge">jj new</code>.</p> <p>You might see the claim that there are no branches in <code class="language-plaintext highlighter-rouge">jj</code>. There aren’t branches in the usual git sense. In <code class="language-plaintext highlighter-rouge">jj</code>’s graph there are “branches”, but they don’t need names. The persistent <strong>change</strong> IDs serve as feature names and jump addresses (using <code class="language-plaintext highlighter-rouge">edit</code> or <code class="language-plaintext highlighter-rouge">new</code>). Nevertheless, branch names are introduced in <code class="language-plaintext highlighter-rouge">jj</code> in order to be compatible with git forges; they are called <strong>bookmarks</strong>, and they need to be moved explicitly across the <strong>changes</strong>.</p> <h4 id="logging-and-time-travel">Logging and Time Travel</h4> <p><code class="language-plaintext highlighter-rouge">jj</code> logs what’s happening using <strong>operations</strong>: these record the commands you enter and the current <em>change</em> you’re in, in addition to some metadata. 
The logs enable handy features like <code class="language-plaintext highlighter-rouge">jj undo</code> or the deeper <code class="language-plaintext highlighter-rouge">jj op restore</code> (restore to any point in time). For browsing the operation log, see <code class="language-plaintext highlighter-rouge">jj op log</code>.</p> <p>There is a <a href="https://jj-vcs.github.io/jj/latest/git-command-table/">comparison</a> table between <code class="language-plaintext highlighter-rouge">git</code> and <code class="language-plaintext highlighter-rouge">jj</code> commands, which can be useful, but it’s important not to fixate on how <code class="language-plaintext highlighter-rouge">git</code> does things, so you stay open to the new paradigm <code class="language-plaintext highlighter-rouge">jj</code> represents. However, some features seem to have been added in response to git users’ needs or workflows, or perhaps <code class="language-plaintext highlighter-rouge">jj</code> developers rediscovered the same needs.</p> <h4 id="mixing-jj-with-git">Mixing JJ with Git</h4> <p>I chose one existing git project and converted it to a mixed usage of <code class="language-plaintext highlighter-rouge">jj</code> and <code class="language-plaintext highlighter-rouge">git</code>, using <code class="language-plaintext highlighter-rouge">jj git init --colocate</code>. This means <code class="language-plaintext highlighter-rouge">jj</code> initializes itself in an existing repo and keeps the <code class="language-plaintext highlighter-rouge">.git</code> folder updated with what’s happening, at a level compatible with git constructs; for example, the <strong>changes</strong> are saved as git commits, <code class="language-plaintext highlighter-rouge">jj git fetch</code> fetches from git forges into <code class="language-plaintext highlighter-rouge">.git/</code>, etc. 
I haven’t used rebasing, squashing, or other history-changing operations, so I can’t comment on how easy they are to use; maybe next time.</p> <p>If you found it interesting, give it a try. There’s the <a href="https://jj-vcs.github.io/jj/latest/tutorial/">tutorial</a>, or you can run it on an existing git repo.</p>]]></content><author><name></name></author><category term="software"/><category term="git"/><summary type="html"><![CDATA[My first impression of using the Jujutsu version control system.]]></summary></entry><entry><title type="html">Built an MCP server for LLMs to search email from terminal | Daniel Fleischer posted on the topic | LinkedIn</title><link href="https://danielfleischer.github.io/blog/2025/built-an-mcp-server-for-llms-to-search-email-from-terminal-daniel-fleischer-posted-on-the-topic-linkedin/" rel="alternate" type="text/html" title="Built an MCP server for LLMs to search email from terminal | Daniel Fleischer posted on the topic | LinkedIn"/><published>2025-06-23T00:00:00+00:00</published><updated>2025-06-23T00:00:00+00:00</updated><id>https://danielfleischer.github.io/blog/2025/built-an-mcp-server-for-llms-to-search-email-from-terminal--daniel-fleischer-posted-on-the-topic--linkedin</id><content type="html" xml:base="https://danielfleischer.github.io/blog/2025/built-an-mcp-server-for-llms-to-search-email-from-terminal-daniel-fleischer-posted-on-the-topic-linkedin/"><![CDATA[<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
        
📬 I built an MCP server that lets LLMs search my email from the terminal
</code></pre></div></div> <p>The server connects Claude to email search via the mu CLI tool. Now I just ask it things like: “Find emails with PDF attachments from last April” ⚡</p> <p>🛠 No custom frontend. No heavy framework. Just a CLI tool made smarter.</p> <p>💡 I learned that MCP servers are basically API translators — they take complex developer SDKs and flatten them into simple function calls that LLMs can actually use.</p> <p>🎯 The bigger picture: This pattern can breathe new life into existing CLI tools and services. Complex APIs → Simple, declarative functions → Natural language queries.</p> <p>This isn’t a product — just an experiment in stitching new capabilities into existing workflows. Code here: https://lnkd.in/eT2fJBSv</p> <p>mu email indexer and searcher: https://github.com/djcb/mu</p> <p>#MCP #LLM #EmailSearch #OpenSource #AI</p> <p>What existing tools would you want to make LLM-friendly? 🤔 To view or add a comment, sign in Whenever I am building complex 𝐀𝐠𝐞𝐧𝐭𝐢𝐜 𝐒𝐲𝐬𝐭𝐞𝐦𝐬, I mostly end up adding a lot of 𝐥𝐚𝐭𝐞𝐧𝐜𝐲 to the system. And personally, these two techniques have always helped me a lot with reducing latency:</p> <ul> <li>𝐏𝐚𝐫𝐚𝐥𝐥𝐞𝐥𝐢𝐳𝐚𝐭𝐢𝐨𝐧: If your agent has several steps that don’t depend on each other, you can run them at the same time instead of one after another — this makes things much faster. But in a real production system, you have to be careful to avoid data conflicts (known as data races) when those parallel steps access shared information.</li> </ul> <p>In that case, if you are running five processes inside the logic of your agent, and all of them are taking 3 seconds: Without parallelization: 3x5 = 𝟏𝟓𝐬 (𝐁𝐚𝐝 𝐔𝐗) With Parallelization: 3x1 = 𝟑𝐬 (𝐆𝐨𝐨𝐝 𝐔𝐗)</p> <ul> <li>𝐒𝐭𝐫𝐞𝐚𝐦𝐢𝐧𝐠: Sometimes, you can’t make your agent actually faster without sacrificing quality. In that case, focus on improving how fast it feels to the user — what’s called perceived latency. 
You can do this by showing helpful updates while the agent works — like a progress bar, a list of key actions it’s taking, or even streaming the AI’s response live, word by word, as it’s being generated.</li> </ul> <p>Because streaming keeps the user engaged. If you are a web developer, you know the importance of a loader when a process is happening or waiting for an API response. If you have used Cursor or some coding agent, you would have experienced that it shows you:</p> <ul> <li>The todos it is organizing</li> <li>The code it is reading</li> <li>The Edits it is making</li> <li>Different files it is working on</li> <li>Different commands it is executing Guess what if it doesn’t show us anything and only comes after 5 minutes. I may end up closing it mid-way for sure 😅 That’s the importance of Streaming</li> </ul> <p>𝑊ℎ𝑎𝑡 𝑑𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑡 𝑡𝑒𝑐ℎ𝑛𝑖𝑞𝑢𝑒𝑠 𝑑𝑜 𝑦𝑜𝑢 𝑝𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝑙𝑦 𝑢𝑠𝑒 𝑡𝑜 ℎ𝑒𝑙𝑝 𝑤𝑖𝑡ℎ 𝑡ℎ𝑒 𝑙𝑎𝑡𝑒𝑛𝑐𝑦 𝑖𝑛 𝑐𝑜𝑚𝑝𝑙𝑒𝑥 𝐴𝑔𝑒𝑛𝑡𝑖𝑐 𝑠𝑦𝑠𝑡𝑒𝑚𝑠?</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    To view or add a comment, sign in
You've built your FastAPI application. Tests pass. It works locally. Now you're reading deployment guides and drowning in advice about connection pool tuning, PostgreSQL JIT compilation, and async event loop optimization.
</code></pre></div></div> <p>Here’s the problem: you’re optimizing blind. You don’t have production traffic to measure. You don’t know where your bottlenecks are.</p> <p>Meanwhile, the stuff that will actually break on day one gets buried in the noise. I’ve seen developers spend days tuning connection pools for traffic they don’t have yet, while missing the fact that their authentication breaks in production or their database credentials aren’t set correctly.</p> <p>The truth is, before your first deployment, only three things actually matter:</p> <ul> <li>Don’t get hacked (security configurations that prevent immediate exploitation)</li> <li>Don’t break immediately (configuration that prevents instant failures)</li> <li>Know what’s happening (minimal observability so you can actually debug issues)</li> </ul> <p>Everything else - performance tuning, advanced monitoring, sophisticated caching - can wait until you have real data.</p> <p>I just published a practical checklist covering exactly what matters in the hour before you go live. No overwhelming theory. Just the non-negotiables explained with real examples of what happens when you skip them.</p> <p>Link in the comments!</p> <p>#python #fastapi #webdev #deployment</p> <hr/> <p>Want to skip deployment configuration entirely? Check out FastroAI at https://fastro.ai - a production-ready FastAPI template with security, monitoring, and deployment already configured correctly. To view or add a comment, sign in Tried my hands on web scraping and AI-powered document processing recently.</p> <p>I built a pipeline that crawls configured websites, filters PDFs by exam type and year, and downloads them in a structured way. Both the exam name and years are configurable through the config file.</p> <p>Instead of using traditional parsing methods, I integrated Claude (Sonnet 4) to directly read PDFs, extract questions and options, and tag them with subject, topic, difficulty level, and many more attributes — all in one step. 
The processed data exports to Google Sheets for easy analysis and organization.</p> <p>The project includes three CLI commands for crawling, tagging with Claude, and exporting to Sheets, keeping the workflow modular and composable.</p> <p>Here’s a demo dataset from one of the runs showcasing the structured output. This setup uses exam papers from two years: https://surl.li/rhsnmg</p> <p>Tech stack: Node.js, TypeScript, Claude API, Google Sheets API</p> <p>GitHub Repo: https://lnkd.in/g4kdAvFE</p> <p>#AI #WebScraping #Automation To view or add a comment, sign in</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              489 followers
          Actions, Not Just Chat   React Component GPT:
</code></pre></div></div> <p>We need a GPT that understands our React components, knows our CSS variables, and can spit out code that’s ready to use. This isn’t about general knowledge; it’s about our knowledge. The standard GPT knowledge upload is fine for broad docs, but for precise component generation, we need control. That’s where Actions come in. Our design system lives in zeroheight. Our CSS variables are in a .css file. Our React components are in .jsx files. These are all discrete sources of truth. A generic LLM has no idea how they connect. If someone asks for a “primary button,” it might give generic HTML, not our Button component with –color-brand-primary. Unacceptable. We build an API. This API becomes our “knowledge retrieval service.” The GPT uses Actions to call this API when it needs specific, localized data. Extract Data (The ETL of our Design System): zeroheight Content: Use the zeroheight API to pull down all component documentation. Store it, parse it, clean it. We’re i https://lnkd.in/gufWti_X To view or add a comment, sign in</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              330 followers
          Actions, Not Just Chat 
</code></pre></div></div> <p>React Component GPT:</p> <p>We need a GPT that understands our React components, knows our CSS variables, and can spit out code that’s ready to use. This isn’t about general knowledge; it’s about our knowledge. The standard GPT knowledge upload is fine for broad docs, but for precise component generation, we need control. That’s where Actions come in. Our design system lives in zeroheight. Our CSS variables are in a .css file. Our React components are in .jsx files. These are all discrete sources of truth. A generic LLM has no idea how they connect. If someone asks for a “primary button,” it might give generic HTML, not our Button component with --color-brand-primary. Unacceptable. We build an API. This API becomes our “knowledge retrieval service.” The GPT uses Actions to call this API when it needs specific, localized data. Extract Data (the ETL of our design system): use the zeroheight API to pull down all component documentation. Store it, parse it, clean it. We’re i</p> <p>https://lnkd.in/gufWti_X</p> <p>API-Mocker Hits 5.38K Downloads: The Open Source API Development Platform That’s Changing How Developers Mock APIs</p> <p>The problem every developer faces: building modern applications means integrating with countless APIs. But what happens when those APIs are down, rate-limited, or simply don’t exist yet? Most developers resort to basic mocking tools that barely scratch the surface of real-world API complexity. API-Mocker isn’t just another mocking tool. It’s a comprehensive API development platform that has already been downloaded over 5,380 times by developers worldwide.</p> <p>Here’s what makes it different: a FastAPI-based server with advanced routing and regex pattern matching; AI-powered mock generation using OpenAI GPT models with intelligent fallback; scenario-based mocking for testing different API states and edge cases; smart response matching that analyzes request data for intelligent response selection; GraphQL support with schema introspection and subscription handling; WebSocket mocking for real-time communication testing; advanced authentication with OAuth2, JWT, API keys, and MFA support; database inte https://lnkd.in/gYKbM7Ku</p> <p>WKassebaum’s fork of zen-mcp-server seems to be better maintained than the official version, with support for more LLMs from different providers. For those unfamiliar:</p> <p>zen-mcp-server is a “Model Context Protocol server that supercharges tools like Claude Code, Codex CLI, and IDE clients such as Cursor or the Claude Dev VS Code extension. Zen MCP connects your favorite AI tool to multiple AI models for enhanced code analysis, problem-solving, and collaborative development”.</p> <p>https://lnkd.in/efRqQ7PH</p> <p>The Cloudflare Code Mode approach to MCP tool calls (https://lnkd.in/erdnK7EH) sounds like a really significant improvement on the MCP experience. It’s one of those rare breakthroughs that is both elegant and obvious in hindsight.</p> <p>At a high level, the idea is to translate “raw MCP” into TypeScript interfaces and ask the LLM to code against the TypeScript interface. It’s a form of language arbitrage, you might say: the agent exchanges a low-resource language (raw MCP) for a high-resource language (TypeScript), so the LLM performs much better. Then something cool happens - the LLM can also write code to chain tool calls together, or otherwise process the tool call responses in interesting ways. 
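</p>

<p>To make the idea concrete, here is a minimal, hypothetical sketch (not Cloudflare’s actual implementation) of rendering an MCP-style tool description as a typed stub the model can code against; the tool schema below is invented for illustration:</p>

```python
# Render an MCP-style tool description as a typed Python stub, so an LLM
# can write ordinary code against it instead of emitting raw tool-call JSON.
# The schema below is a made-up example, not a real MCP server's output.

TYPE_MAP = {"string": "str", "number": "float", "integer": "int", "boolean": "bool"}

def render_stub(tool: dict) -> str:
    """Turn a tool's JSON schema into a function signature with a docstring."""
    params = ", ".join(
        f"{name}: {TYPE_MAP.get(spec.get('type'), 'object')}"
        for name, spec in tool["inputSchema"]["properties"].items()
    )
    return (
        f"def {tool['name']}({params}) -> dict:\n"
        f'    """{tool["description"]}"""\n'
        "    ..."
    )

weather_tool = {
    "name": "get_forecast",
    "description": "Return the forecast for a city.",
    "inputSchema": {"properties": {"city": {"type": "string"}, "days": {"type": "integer"}}},
}

print(render_stub(weather_tool))
```

<p>The generated stub is what the model sees; a host runtime would intercept calls to it and forward them to the real MCP server. 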
The agent is left holding a bunch of LLM-generated code, so it needs a sandbox in which to run that code, and of course Cloudflare offers a solution for that.</p> <p>We’ll see if this approach takes hold; it seems to have a lot of traction already. If it does, then it’s worth asking whether the MCP protocol itself needs a revision - for example, by making the MCP server provide the TypeScript interface natively. That then raises another round of questions about the best way for MCP servers to “speak” to LLMs - can we do better than TypeScript?</p> <p>Certainly it’s a cool idea, and I think it’s a great step forward for MCP usage.</p> <p>h/t to Kushagra Kumar for sending this post my way!</p> <p>💡 𝗡𝗲𝘃𝗲𝗿 𝗹𝗼𝘀𝗲 𝘁𝗿𝗮𝗰𝗸 𝗼𝗳 𝗶𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝘁 𝗶𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻 𝗮𝗴𝗮𝗶𝗻.</p> <p>Just released MCP Memory Service v8.6.0 with 𝗗𝗼𝗰𝘂𝗺𝗲𝗻𝘁 𝗜𝗻𝗴𝗲𝘀𝘁𝗶𝗼𝗻 - your personal AI-powered knowledge base.</p> <p>𝗧𝗵𝗲 𝗣𝗿𝗼𝗯𝗹𝗲𝗺: You have PDFs, documentation, notes scattered everywhere. Finding the right information takes forever. Context is lost between AI conversations.</p> <p>𝗧𝗵𝗲 𝗦𝗼𝗹𝘂𝘁𝗶𝗼𝗻: Upload your documents once. Search them semantically. 
Let AI remember everything for you.</p> <p>𝗛𝗼𝘄 𝗶𝘁 𝗪𝗼𝗿𝗸𝘀:</p> <p>1️⃣ 𝗨𝗽𝗹𝗼𝗮𝗱 - Drag PDFs, docs, or notes to the web interface 2️⃣ 𝗣𝗿𝗼𝗰𝗲𝘀𝘀 - Intelligent chunking preserves context 3️⃣ 𝗦𝗲𝗮𝗿𝗰𝗵 - Ask in natural language: “authentication flow from the security docs” 4️⃣ 𝗥𝗲𝗺𝗲𝗺𝗯𝗲𝗿 - AI assistants access this knowledge automatically</p> <p>𝗪𝗵𝗮𝘁 𝗠𝗮𝗸𝗲𝘀 𝗧𝗵𝗶𝘀 𝗦𝗽𝗲𝗰𝗶𝗮𝗹:</p> <p>🌟 𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 𝗦𝗲𝗮𝗿𝗰𝗵 - Finds relevant content, not just keywords 🌟 𝗣𝗿𝗶𝘃𝗮𝗰𝘆-𝗙𝗶𝗿𝘀𝘁 - Runs locally on your machine (or your team’s server) 🌟 𝗨𝗻𝗶𝘃𝗲𝗿𝘀𝗮𝗹 - Works with Claude, VS Code, Cursor, and 13+ AI applications 🌟 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗥𝗲𝗮𝗱𝘆 - 2000+ memories in active deployments, &lt;500ms search times</p> <p>𝗕𝘂𝗶𝗹𝘁 𝗳𝗼𝗿 𝗧𝗲𝗮𝗺𝘀: • OAuth 2.1 collaboration • Hybrid sync (local + cloud) • Zero-configuration setup • Enterprise security</p> <p>From solo developers to entire teams - one source of truth for AI-powered work.</p> <p>𝗢𝗽𝗲𝗻 𝗦𝗼𝘂𝗿𝗰𝗲 &amp; 𝗙𝗿𝗲𝗲: 👉 https://lnkd.in/ePYekaAF</p> <p>#ArtificialIntelligence #Productivity #KnowledgeManagement #DeveloperTools #OpenSource #Claude #AI</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
</code></pre></div></div>]]></content><author><name></name></author><summary type="html"><![CDATA[📬 I built an MCP server that lets LLMs search my email from the terminal The server connects Claude to email search via the mu CLI tool. Now I just ask it things like: "Find emails with PDF attachments from last April" ⚡ 🛠 No custom frontend. No heavy framework. Just a CLI tool made smarter. 💡 I learned that MCP servers are basically API translators — they take complex developer SDKs and flatten them into simple function calls that LLMs can actually use. 🎯 The bigger picture: This pattern can breathe new life into existing CLI tools and services. Complex APIs → Simple, declarative functions → Natural language queries. This isn’t a product — just an experiment in stitching new capabilities into existing workflows. Code here: https://lnkd.in/eT2fJBSv mu email indexer and searcher: https://github.com/djcb/mu #MCP #LLM #EmailSearch #OpenSource #AI What existing tools would you want to make LLM-friendly? 🤔]]></summary></entry><entry><title type="html">Summarize Hacker News Posts with Haystack &amp;amp; OPEA | Haystack</title><link href="https://danielfleischer.github.io/blog/2025/summarize-hacker-news-posts-with-haystack-opea-haystack/" rel="alternate" type="text/html" title="Summarize Hacker News Posts with Haystack &amp;amp; OPEA | Haystack"/><published>2025-06-10T00:00:00+00:00</published><updated>2025-06-10T00:00:00+00:00</updated><id>https://danielfleischer.github.io/blog/2025/summarize-hacker-news-posts-with-haystack--opea--haystack</id><content type="html" xml:base="https://danielfleischer.github.io/blog/2025/summarize-hacker-news-posts-with-haystack-opea-haystack/"><![CDATA[<p>Build a RAG pipeline to fetch live Hacker News posts and summarize them with a local LLM endpointWelcome to this step-by-step tutorial where we’ll build a simple Retrieval-Augmented Generation (RAG) pipeline using Haystack and OPEA. 
We’ll fetch the newest Hacker News posts, feed them to a lightweight LLM endpoint (OPEAGenerator), and generate concise one-sentence summaries (based on this notebook). Let’s dive in! 🎉</p> <p>In modern GenAI applications, having a flexible, performant, and scalable platform is essential. OPEA (Open Platform for Enterprise AI) is an open, model-agnostic framework for building and operating composable GenAI solutions.</p> <p>In this demo, we’ll use an OPEA LLM endpoint in a Haystack pipeline. We’ll build a simple RAG pipeline that fetches the newest Hacker News posts, sends them to a local OPEA endpoint running a Qwen/Qwen2.5-7B-Instruct demo model, and produces concise one-sentence summaries. Of course, you can replace our example model with any other OPEA-served model, making this pattern both lightweight for prototyping and powerful for real-world deployments. Let’s get started! 🚀</p> <p>NOTE: As a reference, here is a Docker Compose recipe to get you started. The OPEA LLM service can be configured to use a variety of model-serving backends such as TGI, vLLM, Ollama, or OVMS, and offers validated runtime settings for good performance on various hardware, including Intel Gaudi. In this example, it creates an OPEA LLM service with a TGI backend; see the documentation for LLM Generation. The code is based on the OPEA LLM example and the OPEA TGI example. To run, call LLM_MODEL_ID=Qwen/Qwen2.5-7B-Instruct docker compose up.</p> <p>We’ll create a custom Haystack component, HackernewsNewestFetcher, that fetches the newest posts. We use the OPEAGenerator to call our LLM over HTTP; here, we point to a local endpoint serving the Qwen/Qwen2.5-7B-Instruct model. Using PromptBuilder, we define a Jinja-style template, and we wire up the components in a Pipeline. Finally, we fetch and summarize the top 2 newest Hacker News posts. Beautiful, concise summaries in seconds! 
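</p>

<p>As a minimal, dependency-free sketch of the data flow (fetch, then build a prompt, then generate): the class names below echo the tutorial, but these are stubs, not the real Haystack or OPEA classes.</p>

```python
# Dependency-free sketch of the tutorial's pipeline: a fetcher component,
# a prompt builder, and a generator, wired in sequence. The fetcher and
# generator are stubs; real versions call the HN API and the OPEA endpoint.

class HackernewsNewestFetcher:
    """Stub: a real component would query the Hacker News API."""
    def run(self, last_k: int) -> dict:
        posts = ["A post about Rust", "A post about SQLite"]
        return {"articles": posts[:last_k]}

class PromptBuilder:
    """Renders a template with the fetched articles (stand-in for Jinja)."""
    def __init__(self, template: str):
        self.template = template
    def run(self, articles: list) -> dict:
        joined = "\n".join(f"- {a}" for a in articles)
        return {"prompt": self.template.format(articles=joined)}

class StubGenerator:
    """Stand-in for OPEAGenerator; a real one POSTs the prompt over HTTP."""
    def run(self, prompt: str) -> dict:
        n = prompt.count("- ")
        return {"replies": [f"{n} one-sentence summaries."]}

fetcher = HackernewsNewestFetcher()
builder = PromptBuilder("Summarize each post in one sentence:\n{articles}")
generator = StubGenerator()

articles = fetcher.run(last_k=2)["articles"]
prompt = builder.run(articles)["prompt"]
print(generator.run(prompt)["replies"][0])  # -> "2 one-sentence summaries."
```

<p>In the real Haystack pipeline, components are registered and wired declaratively (add_component, connect) rather than called by hand as above. 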
✨ In this tutorial, we built a full RAG pipeline. Feel free to extend this setup with more advanced retrieval, caching, or different LLM backends. Happy coding! 🛠️🔥</p> <p>Building products, technology and solutions for LLM-enabled applications.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Build a RAG pipeline to fetch live Hacker News posts and summarize them with a local LLM endpoint]]></summary></entry><entry><title type="html">Is Your Model Hallucinating? Build Your Own RAG System | Geektime</title><link href="https://danielfleischer.github.io/blog/2024/rag/" rel="alternate" type="text/html" title="Is Your Model Hallucinating? Build Your Own RAG System | Geektime"/><published>2024-12-01T00:00:00+00:00</published><updated>2024-12-01T00:00:00+00:00</updated><id>https://danielfleischer.github.io/blog/2024/-------rag--</id><content type="html" xml:base="https://danielfleischer.github.io/blog/2024/rag/"><![CDATA[<p>RAG systems can solve the hallucination problem of LLMs, but to build a good one you will have to make many critical decisions (and hope you get as close to perfect as possible). The main challenge in building a RAG system is reaching maximum quality and accuracy (photo: Dreamstime).</p> <p>By Daniel Fleischer, AI researcher at Intel Labs</p> <p>Large language models (LLMs) offer advanced solutions to development challenges, but sometimes they get things wrong. The goal of RAG systems is to solve this problem with a new approach to improving model accuracy and reliability. How do they do it, is the hype justified, and how can we build a RAG system of our own? To answer these questions, I examined the challenges of the approach against the advantages it offers in upgrading our interaction with AI.</p> <p>Before we dive in, a word about LLMs. A language model is essentially a mathematical representation of natural language in terms of probabilities: it can assign a probability to any sentence in the language it was trained on, and in this way it can generate new sentences, word by word. The language model effectively “continues the sentence”: feed it the beginning of a sentence, and it carries on. 
The uses are almost endless: you can ask the model to answer questions, summarize texts, compose poems, write functions in various programming languages, and even generate images synthetically.</p> <p>But not everything is perfect, and one flaw in the behavior of LLMs is the hallucinations we all know: those cases in which language models give answers that are factually wrong, irrelevant, or entirely made up. To make matters worse, language models do not always qualify their answers and respond with full confidence, which makes hallucinations dangerous and hard to detect.</p> <p>One way to deal with hallucinations is RAG: Retrieval Augmented Generation. In this approach, the language model draws on external knowledge, which it retrieves or receives, in order to complete the task, so that its answers are more accurate. RAG is an umbrella term for all the techniques for building systems in which language models are connected to external knowledge to improve their results.</p> <p>A RAG system includes a knowledge store such as a database, a collection of documents, or even a Google search tool. When we feed a request into the system, it locates relevant pieces of information within the knowledge stores. These are passed, together with the original request, to the language model, which then answers the request, hopefully fully, accurately, and relevantly, thanks to the information it received. Retrieving relevant information and passing it to the language model demonstrably and significantly improves the performance of language models on a variety of knowledge-intensive tasks, including question answering (Q&amp;A), classification, summarization, and more.</p> <p>RAG systems have a range of uses, and their distinct advantage is the ability to harness knowledge stores to complete the task. These stores are the main way to add new knowledge and update existing knowledge in language models, which is why they work so well. In addition, choosing knowledge stores dynamically opens the door to personalized language models.</p> <p>Take a digital assistant, for example. Here the RAG system is connected to professional knowledge stores. Thanks to the language model's capabilities, the system can converse in natural language on a range of technical topics, advise, and answer questions. It can also point the user to the knowledge sources themselves (via citation), like a research assistant.</p> <p>Another example is using the user's own knowledge: the knowledge store is built from the user's data, so the RAG system tailors its answers accordingly. 
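</p>

<p>A toy illustration of the flow described above (an invented example, not from the article): score each passage in a small knowledge store against the query, keep the best one, and prepend it to the question before calling the language model.</p>

```python
# Toy RAG retrieval: pick the passage sharing the most words with the
# query and build an augmented prompt. Real systems use embeddings and
# vector search; word overlap stands in for semantic similarity here.

def overlap(query: str, passage: str) -> int:
    tokens = lambda s: set(s.lower().replace("?", "").split())
    return len(tokens(query) & tokens(passage))

knowledge_base = [
    "Gaudi 3 is an AI accelerator developed by Intel.",
    "Haifa is a city in northern Israel.",
]

def build_rag_prompt(question: str) -> str:
    best = max(knowledge_base, key=lambda p: overlap(question, p))
    return f"Context: {best}\nQuestion: {question}\nAnswer:"

print(build_rag_prompt("Which company developed the Gaudi 3 accelerator?"))
```

<p>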
The user can upload a collection of personal files to the system and ask it to summarize topics that appear in the documents, ask where a discussion of a given topic can be found, request a summary presentation built from the documents, and so on.</p> <p>When setting out to build a RAG system, many decisions must be made. The first step is choosing a language model: open or closed. A closed model is a commercial model accessed through an API, with no access to how the model works or to its weights. In an open model, by contrast, both the code and the weights are available for inspection, training, and inference. An open model can be adapted to our domain through fine-tuning. In some cases, self-hosting open models may be preferable to closed models, but it requires expertise.</p> <p>The second part of building a RAG system is connecting it to knowledge stores. This connection has many aspects, including how the knowledge is indexed; preprocessing of text and images; splitting into paragraphs or sentences; cleaning; processing tabular data; converting the data into a vector embedding that can improve search quality; keyword-based search, semantic search, or a combination of the two; the number of examples to retrieve; relevance ranking; filtering; rewriting the retrieved examples; summarization; and more. This is just the tip of the iceberg; I could write an entire article on this stage alone. It is critical because, as they say, garbage in, garbage out: if knowledge retrieval is not accurate and relevant enough, the quality of the language model's answers will immediately suffer, and the chance of hallucinations will grow.</p> <p>Components can then be added to the system: additional language models chosen dynamically according to the nature of the task; several knowledge stores used in parallel; tools that measure the quality of the retrieved documents, so that the most relevant pieces of information can be gathered from them; language models that know to request another retrieval round if they have not yet found the answer they were looking for; and the list goes on, as befits a highly active research field.</p> <p>The main challenge in building a RAG system is reaching maximum quality and accuracy. Every part of the system affects the final answers, so it must be designed carefully. As part of the preparation, it is worth testing and comparing language models to reach a suitable balance of cost versus accuracy. The choice of knowledge stores, the storage scheme, and how knowledge is retrieved and presented to the model are critical for producing correct, relevant answers within a reasonable runtime. 
To assess the quality of the system you built, you can use labeled examples (examples with answers known in advance) and review the results manually or automatically.</p> <p>Much of the complexity stems from the sometimes unpredictable behavior of LLMs. For example, in a small fraction of cases the language model will err even though the knowledge it received contains the correct answer, apparently due to a conflict between the model's internal knowledge and the knowledge presented to it. Our team is researching ways to control model behavior in such cases, using tailored instructions and additional training.</p> <p>At the same time, the quality of the whole system can be improved by adapting the model to RAG tasks. To that end, our team developed an open-source tool for training and improving the RAG capabilities of language models.</p> <p>Building a high-quality, accurate RAG system requires a deep understanding of its various aspects. Naturally, there is no avoiding a process of trial and error, which helps shed light on the trade-offs involved in designing the system. Only then can we build the ideal system for the problem we are trying to solve.</p> <p>RAG systems represent a new architecture that combines language models with data stores. Thanks to this combination, they have the potential to perform knowledge-intensive tasks such as personalized digital assistants, customer-service chatbots, and organizational knowledge systems. And judging by the active research in the field, this is only the beginning.</p> <p>The author is a researcher in the NLP lab at Intel Labs, which studies language-model topics such as RAG, efficient inference, long contexts, the use of agents, and more.</p> <p>Intel continues to lead the AI field with the industry's most advanced solutions. Sixth-generation Xeon processors and the dedicated Gaudi 3 accelerators let organizations accelerate the development and deployment of AI applications at scale, while maintaining high efficiency and exceptional value. Intel offers an open, flexible approach that enables seamless integration of hardware and software from a range of vendors, giving organizations the tools they need to accelerate the adoption of GenAI and large models, as well as to reduce dependence on other vendors' proprietary systems. 
Intel's development centers in Israel, located in Haifa, Petah Tikva, Jerusalem, and Kiryat Gat, play a key role in shaping the next generation of compute and AI technologies, and continue to drive the company's global innovation through a combination of performance, flexibility, and constant innovation.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[RAG systems can solve the hallucination problem of LLMs, but to build a good one you will have to make many critical decisions (and hope you get as close to perfect as possible)]]></summary></entry><entry><title type="html">Intel Labs Introduces RAG-FiT Open-Source Framework for Retrieval Augmented Generation in LLMs - Intel Community</title><link href="https://danielfleischer.github.io/blog/2024/intel-labs-introduces-rag-fit-open-source-framework-for-retrieval-augmented-generation-in-llms-intel-community/" rel="alternate" type="text/html" title="Intel Labs Introduces RAG-FiT Open-Source Framework for Retrieval Augmented Generation in LLMs - Intel Community"/><published>2024-10-09T00:00:00+00:00</published><updated>2024-10-09T00:00:00+00:00</updated><id>https://danielfleischer.github.io/blog/2024/intel-labs-introduces-rag-fit-open-source-framework-for-retrieval-augmented-generation-in-llms---intel-community</id><content type="html" 
xml:base="https://danielfleischer.github.io/blog/2024/intel-labs-introduces-rag-fit-open-source-framework-for-retrieval-augmented-generation-in-llms-intel-community/"><![CDATA[<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology. Intel Labs researchers Daniel Fleischer, Moshe Berchansky, and Moshe Wasserblat collaborated on RAG-FiT.

Highlights

Intel Labs introduces RAG-FiT, an open-source framework for augmenting large language models (LLMs) for retrieval-augmented generation (RAG) use cases. Available under an Apache 2.0 license, RAG-FiT integrates data creation, training, inference, and evaluation into a single workflow, assisting in the creation of data-augmented datasets for training and evaluating LLMs in RAG settings. This integration enables rapid prototyping and experimentation with various RAG techniques, allowing users to easily generate datasets and train RAG models using internal or specialized knowledge sources.

The library assists in creating data to train models using parameter-efficient fine-tuning (PEFT), which allows users to fine-tune a subset of parameters in a model. The Python-based framework is designed to serve as an end-to-end experimentation environment, enabling users to quickly prototype and experiment with different RAG techniques, including data selection, aggregation and filtering, retrieval, text processing, document ranking, few-shot generation, prompt design using templates, fine-tuning, inference, and evaluation.

To demonstrate the effectiveness of the RAG-FiT framework (formerly known as RAG Foundry), Intel Labs researchers augmented and fine-tuned Llama 3.0 and Phi-3 models with diverse RAG configurations, showcasing consistent improvements across three knowledge-intensive question-answering tasks.

Using RAG Systems to Address LLM Limitations

Despite their impressive capabilities, LLMs have inherent limitations. These models can produce plausible-sounding but incorrect or nonsensical answers, struggle with factual accuracy, lack access to up-to-date information after their training cutoff, and struggle to attend to relevant information in large contexts.

RAG enhances LLM performance by integrating external information using retrieval mechanisms. Retrieving specific data from knowledge bases outside the model can effectively address knowledge limitations, which in turn can reduce hallucinations, improve the relevance of generated content, provide interpretability, and can be vastly more cost-efficient. Furthermore, recent research indicates that fine-tuning LLMs for RAG can achieve state-of-the-art performance, surpassing that of larger proprietary models.

How RAG-FiT Works

As an experimentation environment for researchers, the backbone of the RAG-FiT library consists of four distinct modules: data creation, training, inference, and evaluation. Each module is encapsulated and controlled by a configuration file, ensuring compatibility between the output of one module and the input of the next. This modular approach allows isolation and independent experimentation on each step, enabling the production of multiple outputs and the concurrent execution of numerous experiments. Evaluation can be conducted on the generated outputs as well as on any feature within the data, including retrieval, ranking, and reasoning.

Figure 1. In the RAG-FiT framework, the data augmentation module saves RAG interactions into a dedicated dataset, which is then used for training, inference, and evaluation.

Dataset creation: The processing module facilitates the creation of context-enhanced datasets by persisting RAG interactions, which are essential for RAG-oriented training and inference. These interactions encompass dataset loading, column normalization, data aggregation, information retrieval, template-based prompt creation, and various other forms of pre-processing.
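
A hypothetical sketch (invented names, not the actual RAG-FiT API) of the config-driven, four-module design described here, where each module consumes the previous module's output:

```python
# Each RAG-FiT-style module is driven by a config and feeds the next one:
# processing -> (training) -> inference -> evaluation. Stubbed end to end.

def processing(cfg: dict) -> list:
    # Build context-enhanced examples (retrieval + prompt templating, stubbed).
    return [{"question": q, "context": f"retrieved for: {q}"} for q in cfg["questions"]]

def inference(cfg: dict, dataset: list) -> list:
    # Generate one prediction per augmented example (stubbed model call).
    return [{**ex, "prediction": ex["context"].upper()} for ex in dataset]

def evaluation(cfg: dict, results: list) -> float:
    # Exact-match accuracy against gold answers.
    hits = sum(r["prediction"] == g for r, g in zip(results, cfg["gold"]))
    return hits / len(results)

cfg = {"questions": ["q1", "q2"], "gold": ["RETRIEVED FOR: Q1", "wrong"]}
print(evaluation(cfg, inference(cfg, processing(cfg))))  # -> 0.5
```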
The processed data can be saved in a consistent, model-independent format, along with all associated metadata, ensuring compatibility and reproducibility across different models and experiments.

The processing module supports the handling of multiple datasets at once through global dataset sharing. This feature allows each step of the pipeline to access any of the loaded datasets, enhancing flexibility and allowing for complex processing procedures. Furthermore, the module includes step caching, which caches each pipeline step locally. This improves compute efficiency and facilitates easy reproduction of results.

Training: Users can train any model on the augmented datasets. A training module is used to fine-tune models on the datasets created by the processing module. The training module relies on the well-established training framework TRL, for transformer reinforcement learning. The module also supports advanced, efficient training techniques, such as PEFT and low-rank adaptation (LoRA), to customize the LLM for specific use cases without retraining the entire model.

Inference: The inference module can generate predictions on the augmented datasets with trained or untrained LLMs. Inference is conceptually separated from the evaluation step, since it is more computationally demanding than evaluation. Additionally, users can run multiple evaluations on a single prepared inference results file.

Evaluation: Custom metrics can be easily implemented, or users can run existing metrics, including Exact Match (EM), F1 Score, ROUGE, BERTScore, DeepEval, Ragas, Hugging Face Evaluate, and classification. Users can run metrics locally on each example, or globally on the entire dataset, such as recall for classification-based metrics. In addition to input and output texts, metrics can utilize any feature in the dataset, such as retrieval results, reasoning, citations, and attributions.

In addition, the evaluation module uses a processing step called an Answer Processor, which can implement custom logic and perform many tasks, including cleaning and aligning outputs.

Performance of RAG-FiT Augmentation Techniques

To illustrate the utility of the framework, Intel Labs researchers conducted experiments involving retrieval, fine-tuning, chain-of-thought (CoT) reasoning, and a negative distractor-documents technique. The team compared Llama 3.0 and Phi-3, two widely accepted baseline models, using enhancement methods across TriviaQA, PubMedQA, and ASQA, three knowledge-intensive question-answering datasets. The TriviaQA and PubMedQA datasets contain relevant context, while for the ASQA dataset, retrieval was done over a Wikipedia corpus using a dense retriever.

The team measured and reported EM for TriviaQA, STR-EM for ASQA, and accuracy and F1 Score for PubMedQA. In addition, researchers evaluated two Ragas metrics: faithfulness (the relation between the generated text and the context) and relevancy (the relation between the generated text and the query). Overall, the two models showed consistent improvements across the three knowledge-intensive question-answering tasks.

Figure 2. Evaluation results of baseline and different RAG settings for the three datasets and two models tested. In bold are the best configurations per dataset, based on the main metrics.

For TriviaQA, retrieved context improved the results and fine-tuning the RAG setting boosted them further, but fine-tuning on CoT reasoning (which includes training on a combination of gold passages and distractor passages) decreased performance; for this dataset, the best method is model-dependent. For ASQA, every method improved upon the baseline: CoT reasoning produced consistent improvement in both models, and fine-tuning the CoT configuration performed best. Finally, for PubMedQA, almost all methods improved upon the baseline (with one exception): CoT reasoning improved on the untrained RAG setting, but with fine-tuning, the plain RAG method performed best in both models.

The faithfulness and relevancy scores often did not correlate with the main metrics, or with each other, possibly indicating that they capture different aspects of the retrieval and generation results and represent a trade-off in performance.

The results demonstrate the usefulness of RAG techniques for improving performance, as well as the need to carefully evaluate different aspects of a RAG system on a diverse set of datasets.
</code></pre></div></div>]]></content><author><name></name></author><summary type="html"><![CDATA[Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology. Intel Labs researchers]]></summary></entry></feed>