Tech Insights 2026 Week 12 by Johan Sanneblad
Things are moving fast now. Two months ago Anthropic launched a new feature called “Cowork”, where Claude works autonomously within a sandbox environment on your computer with access to your files and various connectors. You choose the files Claude can work with, but you need to be careful – if prompted the wrong way there is a risk Claude removes or invalidates important files. This entire feature, Claude Cowork, was built in record time – only 10 days, and 100% of the code for it was built by Claude itself.
Having agents run freely within a controlled environment is what every AI company is pursuing now, triggered by the Internet phenomenon OpenClaw. You can read about three new initiatives in this newsletter: OpenAI opened up their Responses API so AI agents can now respond to your requests running in environments very similar to Claude Cowork, and ByteDance launched DeerFlow 2.0, a complete rewrite to, you guessed it, let AI agents run autonomously within a sandbox environment. The third piece of “autonomous agent” news last week comes from Microsoft: on March 9 Microsoft launched Copilot Cowork, which is a licensed version of Anthropic Cowork adapted to run in the M365 environment. Microsoft calls it “A new way of getting work done”. However, the speed at which many companies are rolling out fully AI-generated features right now has me a bit concerned.
My focus with agentic engineering (the work process where AI writes 100% of the code) has always been about greatly improving quality, not speed. This is the work process I teach to companies. Of course it’s faster than writing code by hand, but that is not the driving factor. And this approach is the exact opposite of what I see many companies doing right now. Even Microsoft commits code fully created by Claude Code straight into their official GitHub repo. It’s no longer about being faster with AI – the race has somehow turned into who can be the fastest with AI. How many tasks can you run in parallel, and maybe you want to pay a 30x price premium for a new “fast” mode in GitHub Copilot to auto-generate code even faster?
The net result is that many companies are building up technical debt in a race to be first with the latest innovations. Claude Code creator Boris Cherny recently posted that Anthropic is betting that future models will be capable enough to clean up this debt and avoid what he calls a “slopocalypse”. Anthropic is well aware of the issues with Claude; Boris himself writes that it “over-complicates things, leaves dead code around, and doesn’t like to refactor when it should”.
If you work in a company that has recently rolled out AI-based development, ask your engineers (1) how many rounds of iterations they typically do on the AI-generated code before they submit it to the main repo, (2) what their process looks like (it should look something like define-plan-implement-test-review-(iterate-many-times)-review-commit), and (3) how many times in this process they do manual reviews of all code. If they do no manual reviews, do no iterations, and have no clear process, just be aware of where you are heading: you are betting your future on a technology leap, and we do not yet know when it will happen.
In the meantime, just know that it is actually possible to write agentic code with very high quality. As an example, I created Notebook Navigator, which is now officially sponsored by both OpenAI and Obsidian. It has nearly 400 000 downloads and was fully created with agentic engineering. It has had zero critical issues since its first launch in September, and it has over 200 000 lines of complex and interconnected but well-architected and well-structured source code (the image processing bits are extremely advanced). There are ways to do this without building up debt; you just need to put all the focus on process and quality – the speed comes automatically.
Thank you for being a Tech Insights subscriber!
Listen to Tech Insights on Spotify: Tech Insights 2026 Week 12 on Spotify
THIS WEEK’S NEWS:
- Microsoft Integrates Claude Cowork into Copilot
- Microsoft Launches Copilot Health
- OpenAI Publishes Framework for AI Agents to Resist Prompt Injection
- OpenAI Equips the Responses API with a Full Computer Environment
- Claude Adds Inline Interactive Visuals to Chat Responses
- Claude Code Remote Control Now Supports Mobile-Initiated Sessions
- ByteDance Rewrites DeerFlow as a Full SuperAgent Harness
- Google Releases Gemini Embedding 2, Its First Natively Multimodal Embedding Model
- Google Maps Adds Conversational Search and Redesigned Navigation
- Google Workspace Gets Gemini Integration Across Docs, Sheets, Slides, and Drive
- Andrew Ng Releases Context Hub: Open-Source API Docs for Coding Agents
- Karpathy’s Autoresearch Runs Autonomous ML Experiments Overnight on a Single GPU
- NVIDIA Releases Nemotron 3 Super for Multi-Agent AI Systems
Microsoft Integrates Claude Cowork into Copilot

The News:
- Microsoft announced Copilot Cowork on March 9, an integration of Anthropic’s Claude Cowork technology into Microsoft 365 Copilot that runs background tasks on a user’s M365 data across email, calendar, Teams, and SharePoint.
- Anthropic launched Claude Cowork in January 2026, a product for non-technical business users that reads, manipulates, and analyzes files on a computer; its launch contributed to a near $1 trillion drop in software stock values, with Microsoft losing approximately $220 billion in market capitalization within a week.
- Microsoft licensed the underlying Claude Cowork technology from Anthropic and integrated it into Copilot, with the Microsoft blog stating the product was built “working closely with Anthropic”.
- Copilot Cowork operates as a cloud-side agent within Microsoft 365’s service layer, not as a desktop automation tool; it takes actions such as reorganizing a calendar, drafting emails, or saving files to SharePoint while the user works on other things.
- Copilot is described as “model-diverse by design,” meaning Cowork’s agentic tasks run via the Anthropic/Claude integration while other Copilot functions may still use OpenAI models.
- The feature is in a limited Research Preview, with broader rollout via Microsoft’s Frontier program planned for late March 2026.
“Working closely with Anthropic, we have integrated the technology behind Claude Cowork into Microsoft 365 Copilot.”
My take: This is not just similar to Claude Cowork, this IS Claude Cowork, rebranded by Microsoft to “Copilot Cowork”. Microsoft’s corporate VP of business applications Charles Lamanna described the difference from standard chat Copilot as: “With chat, you’re supervising every action; Cowork allows for a more hands-off approach, enabling you to simply set it in motion and let it accomplish the work”.
Microsoft is pursuing two distinct paths forward here: On one hand you have the M365 apps with M365 Copilot chat built-in, which can now finally also edit documents in place. On the other hand you have the Claude Cowork agentic solution that basically does all the work for you, so you don’t even need to start Excel or Word. AI agents need CLI and tool access to perform their best, which is why the entire model of putting Copilot within each app is probably doomed to fail in the long term. Maybe this is why Microsoft rushed this release so quickly.
Microsoft Launches Copilot Health
https://microsoft.ai/news/introducing-copilot-health

The News:
- Microsoft launched Copilot Health on March 12, a separate, dedicated space within its Copilot platform that aggregates personal health records, wearable data, and lab results, then generates insights from that combined data.
- The product connects to data from over 50 wearable devices including Apple Health, Oura, and Fitbit, and pulls medical records from over 50,000 U.S. hospitals and provider organizations through a service called HealthEx.
- Microsoft’s existing consumer Copilot products already handle over 50 million health questions per day; an analysis of 500,000 de-identified Copilot conversations from January 2026 found that nearly one in five involved personal symptom assessment, and that health queries spiked sharply in evening and overnight hours.
- Health data and conversations are stored separately from general Copilot, encrypted at rest and in transit, and are not used for model training. The product holds ISO/IEC 42001 certification, an independent AI management systems standard.
- The launch is US-only, in English, for adults 18 and older, via an early-access waitlist. Microsoft developed the product with input from over 230 physicians across 24 countries.
My take: If you feel that you have seen this news before, you are not wrong. In January both Anthropic and OpenAI launched similar initiatives: Anthropic launched Claude for Healthcare and OpenAI launched ChatGPT Health. Copilot Health however has the strongest device support, integrating with multiple services like Fitbit, Oura, Garmin, Apple Health and Android Health Connect. Copilot Health is also the only platform with ISO/IEC 42001 certification. As with the other releases this is US-only; it will take a long time before anything like this is allowed within the EU, despite all the potential benefits.
OpenAI Publishes Framework for AI Agents to Resist Prompt Injection
https://openai.com/index/designing-agents-to-resist-prompt-injection

The News:
- OpenAI published a security design guide on March 11, outlining how to build AI agents that limit the impact of prompt injection attacks, where malicious instructions embedded in external content attempt to redirect agent behavior.
- Modern prompt injection attacks have shifted from simple instruction overrides to social engineering tactics. A documented 2025 example sent to OpenAI by external researchers used a realistic-looking workplace email with embedded instructions to extract and exfiltrate employee personal data. In testing, that attack succeeded 50% of the time.
- OpenAI frames the problem using source-sink analysis: an attacker needs both a way to inject instructions (source) and a capable action to exploit (sink), such as transmitting data to a third party or following an external link.
- The primary countermeasure deployed in ChatGPT is called Safe Url, which detects when information from a conversation would be transmitted to a third party. It either prompts the user to confirm the data transfer or blocks it and instructs the agent to find an alternative path.
- Safe Url applies across ChatGPT Atlas, Deep Research, Canvas, and ChatGPT Apps. Apps run in a sandbox that detects unexpected external communications and requests user consent before proceeding.
- Separately, OpenAI introduced Lockdown Mode and Elevated Risk warnings in ChatGPT as additional safeguards against data exfiltration via prompt injection.
My take: LLMs only have one memory (the context window), which is shared between instructions and data. This means that the next-token prediction is based on everything in the context window, so if you just copy text into the memory and the text has embedded instructions, there’s a good chance the instructions will be used to control the output.
If you create AI agents, you should definitely read this article. Here are the key takeaways, with a minimal confirmation-gate sketch after the list:
- Use source-sink thinking: identify every place external content can enter your agent (sources) and every dangerous action it can take (sinks), then break the chain between them.
- Do not rely on input classification or “AI firewalling” as your primary defense. It fails against social engineering attacks.
- Apply least privilege by default. If an agent only needs to read emails, do not give it the ability to send them.
- Add explicit user confirmation gates before any high-risk action, especially data transmission to third parties.
- Use sandbox tool execution. Unexpected external network calls from within a task should trigger a consent prompt, not silent execution.
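To make the confirmation-gate idea concrete, here is a minimal Python sketch of the source-sink pattern. This is my own illustration, not OpenAI’s actual Safe Url implementation; the allowlist domain and function names are made up for the example.

```python
# Minimal sketch of a source-sink gate: untrusted content (a "source") can steer
# the agent, so any outbound transmission (a "sink") must pass an explicit user
# confirmation before it runs. Illustrative only, not OpenAI's Safe Url code.
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"api.mycompany.example"}  # hypothetical allowlist


def confirm_with_user(prompt: str) -> bool:
    """Ask the human operator before a high-risk action proceeds."""
    return input(f"{prompt} [y/N] ").strip().lower() == "y"


def guarded_http_post(url: str, payload: dict) -> None:
    """Sink guard: confirm or block any data transfer to a non-allowlisted domain."""
    domain = urlparse(url).netloc
    if domain not in ALLOWED_DOMAINS:
        if not confirm_with_user(f"Agent wants to send data to {domain}. Allow?"):
            raise PermissionError(f"Blocked outbound transfer to {domain}")
    # The actual HTTP request would go here; it is omitted in this sketch.
```

The important part is that the gate sits in the tool layer: even if injected instructions fully steer the model, data cannot leave without passing the sink check.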
OpenAI Equips the Responses API with a Full Computer Environment
https://openai.com/index/equip-responses-api-computer-environment

The News:
- OpenAI has extended its Responses API with a hosted shell tool and container workspace, letting developers build agents that execute real commands, manage files, query databases, and call external APIs without building their own execution infrastructure.
- The shell tool runs in a Debian 12 environment with utilities like grep, curl, and awk, and supports Python 3.11, Node.js 22, Java 17, Go 1.23, and Ruby 3.1, unlike the existing code interpreter which only executes Python.
- Outbound network traffic from containers routes through a sidecar egress proxy with configurable allowlists. Credentials use domain-scoped secret injection at egress, meaning raw secret values never appear in model-visible context.
- Context compaction is built into the API as either a server-side automatic mechanism or a standalone /compact endpoint. Models from GPT-5.2 onward are trained to produce encrypted, token-efficient compaction items that preserve key state across long-running sessions.
- The API supports “Skills”, defined as versioned zip bundles containing a SKILL.md metadata file and supporting resources. The Responses API loads skill bundles into the container before each run in a deterministic three-step sequence: fetch metadata, unpack bundle, update model context.
My take: For small teams that quickly want to get up to speed with agents, this is a very welcome announcement. Instead of creating your own hosting environment with databases and tools, you can now let OpenAI spin up a dedicated container and use the OpenAI file API to upload documents and resources directly into the container file system before running.
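To give a feel for the workflow, here is a minimal sketch using the OpenAI Python SDK. The “shell” tool type, the purpose value for the upload, and how the uploaded file actually gets attached to the container are my assumptions based on the announcement, not verified parameter names, so check the API reference before building on anything like this.

```python
# Sketch of driving the hosted shell/container workspace via the Responses API.
# NOTE: the "shell" tool type and the upload "purpose" below are assumptions
# based on the announcement, not verified parameter names.
from openai import OpenAI

client = OpenAI()

# Upload a resource so it can be made available inside the container.
# How the file is attached to the container is not shown here, since the
# exact mechanism is not spelled out in the announcement.
data_file = client.files.create(
    file=open("sales_2025.csv", "rb"),
    purpose="assistants",  # the purpose value for container uploads may differ
)

response = client.responses.create(
    model="gpt-5.2",            # model name per the announcement
    tools=[{"type": "shell"}],  # assumed name of the hosted shell tool
    input=(
        "The container has sales_2025.csv available. "
        "Use the shell to compute total revenue per region and summarize the result."
    ),
)

print(response.output_text)
```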
The problem with this approach however is that it creates dual hosting environments: your own environment where your data and documents are stored, and the runtime environment at OpenAI. My recommendation is to be very hesitant before going all-in on this new Responses API. Setting up your own container environment is not that difficult, and you avoid the future pain point of having to keep the data in the OpenAI runtime environment in sync with your own backend.
Also, there are the regulatory aspects. Maybe you do not need to process sensitive data now, but what if you do in the future? Then you would have to build your own environment anyway, and migrate all the existing code over to it.
Claude Adds Inline Interactive Visuals to Chat Responses
https://claude.com/blog/claude-builds-visuals

The News:
- Anthropic released a beta feature on March 12 that lets Claude generate interactive charts, diagrams, timelines, and idea maps directly inside chat conversations, available to all users including those on the free tier.
- Visuals appear inline in the conversation rather than in a side panel, and are temporary, changing or disappearing as the conversation progresses with new context added.
- Claude decides autonomously when a visual would improve a response, but users can also request one explicitly with a prompt such as “draw a diagram” or “visualise how this changes over time.”
- Visuals are interactive: users can click sections to reveal underlying data, edit input values, and use sliders to adjust variables. For example, a tax penalty query produced a chart with an amount input field and a days-overdue slider that recalculated dynamically.
- The feature builds on “Imagine with Claude,” an experimental project previewed in late 2025 that let Claude assemble interface elements dynamically without user-written code.
My take: This looks great, and if you ever need a simple visualization of an information structure you can now just ask Claude to draw it for you. If you have a minute go watch their launch video of this feature, it’s just 80 seconds long.
Read more:
- Claude can now show you – YouTube
- Claude Can Now Create Complex Diagrams | by Joe Njenga | Medium
- Claude now creates interactive charts, diagrams and visualizations : r/ClaudeAI
Claude Code Remote Control Now Supports Mobile-Initiated Sessions
https://twitter.com/noahzweben/status/2032533699116355819

The News:
- Claude Code’s Remote Control feature, first shipped February 25, has received an update that allows users to spawn entirely new local sessions from the Claude mobile app, not just connect to an already-running terminal session.
- Previously, a session had to originate from a desktop terminal; the mobile app could only attach to an existing one. With version 2.1.74 or later, users can initiate a new local session directly from their phone.
- The feature is available on Max, Team, and Enterprise plans. GitHub must currently be configured on the mobile device, though that requirement is expected to be relaxed in future releases.
- Sessions run entirely on the user’s local machine throughout, with the mobile interface acting as a remote window. Local filesystem, MCP servers, tools, and project configuration remain accessible.
- To resume work back on the desktop, users run /resume from the same project directory.
My take: I can see why many think this is a great feature, but I personally think it’s something you should actively avoid. When you prompt a coding agent you need to follow a fairly strict process to get somewhat predictable results from it. You need to include things like context, design, integrations, frameworks, detailed feature descriptions and expected outcome. This is what you get good at after a few months of prompting.
Now try to do all that from your mobile phone. Good luck navigating around the code base so you can tell the AI which reference files it should look at for similar implementation patterns, and good luck writing a detailed prompt of over 500 words. Using agentic software development from your phone will only result in poor prompts that will inevitably lead to poor-quality software output.
Anthropic seems to be having a fun time adding things to Claude Code lately, like customizing the “spinner verbs” (“Beaming up” etc.) and being able to type /btw to chat while waiting. Remote Control is another one. I would really recommend that you think twice before starting to write code from your mobile phone.
ByteDance Rewrites DeerFlow as a Full SuperAgent Harness

The News:
- DeerFlow 2.0 is ByteDance’s open-source agent framework, rewritten from scratch. It shifts from a specialized deep research tool to a general-purpose SuperAgent harness that can execute multi-step tasks across research, coding, content creation, and web app generation.
- The framework runs inside an isolated Docker container, giving agents a real filesystem and bash terminal to execute code and commands directly, rather than returning text suggestions for humans to run manually.
- A lead agent decomposes complex tasks and spawns up to three parallel sub-agents, each with scoped context, tools, and termination conditions. Results are synthesized into final outputs such as reports, slide decks, or web applications.
- DeerFlow uses a progressive skill-loading system that activates capabilities only when needed, keeping context windows lean for token-sensitive models.
- The framework integrates with any OpenAI-compatible API, including GPT-4, Claude, Gemini, DeepSeek, and local models via Ollama, without requiring changes to the agent logic. A Claude Code integration lets developers send tasks and manage threads from the terminal.
- Long-term memory builds a persistent profile of user preferences and workflows across sessions.
My take: DeerFlow 2.0 joins other agentic solutions like Manus, OpenManus, and OpenClaw. If you are building complex agentic flows that need to run within Docker containers, then maybe check this one out. The GitHub repo has over 30 000 stars, and DeerFlow claimed the #1 spot on GitHub Trending following the launch of version 2.
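The “any OpenAI-compatible API” point deserves a quick illustration. The sketch below is not DeerFlow’s own configuration format (which I have not dug into); it just shows the underlying convention that makes frameworks like this model-agnostic: the same client code can target OpenAI, DeepSeek, or a local Ollama server by changing only the base URL and model name.

```python
# Generic illustration of the OpenAI-compatible pattern DeerFlow relies on.
# This is NOT DeerFlow's own configuration, just the underlying convention.
from openai import OpenAI

# Local model served by Ollama, which exposes an OpenAI-compatible endpoint.
local_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama ignores the key, but the client requires one
)

reply = local_client.chat.completions.create(
    model="qwen2.5:14b",  # any model you have pulled locally
    messages=[{"role": "user", "content": "Summarize this repo's purpose in one line."}],
)
print(reply.choices[0].message.content)
```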
Google Releases Gemini Embedding 2, Its First Natively Multimodal Embedding Model
https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2

The News:
- Google released Gemini Embedding 2 on March 10 in public preview via the Gemini API and Vertex AI. It is the successor to the text-only gemini-embedding-001 and maps text, images, video, audio, and PDF documents into a single unified vector space.
- The model handles interleaved inputs, meaning a single API call can combine modalities, such as an image paired with a text caption, without requiring separate model pipelines.
- Input limits per modality are: up to 8,192 tokens for text, 6 images (PNG/JPEG), 120 seconds of video (MP4/MOV), audio natively without intermediate transcription, and PDFs up to 6 pages.
- The model uses Matryoshka Representation Learning (MRL), which stores the most critical semantic information in the earliest vector dimensions. This lets developers truncate vectors from the default 3,072 dimensions down to 1,536 or 768 without the accuracy collapse typical of standard embedding models.
- Supports over 100 languages and integrates with LangChain, LlamaIndex, Haystack, Weaviate, QDrant, and ChromaDB.
- One independent test on the MS MARCO dataset reported an NDCG@10 score of 0.939, placing it near the top of 17 benchmarked embedding models.
My take: An embedding model converts any piece of content (text, image, audio, or video) into a list of numbers (a “vector”) that encodes its meaning. Items that are conceptually similar end up with vectors that are close together in that vector space, so the AI can find related content by measuring the distance between vectors rather than matching keywords.
Embedding models are critical for AI models to find the relevant information to fit into their limited context window. The better the embedding, the greater the chance that the relevant information is copied into the context window, and the greater the chance you will get a good response out of the LLM.
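To make “close together” concrete, here is a tiny NumPy sketch of cosine similarity and MRL-style truncation. The vectors are random stand-ins, not real Gemini Embedding 2 output; the point is only that similarity is a distance measure and that a Matryoshka-trained model keeps most of its signal in the first dimensions.

```python
# Minimal sketch of "similarity as distance": cosine similarity between vectors,
# and MRL-style truncation where the earliest dimensions carry most of the meaning.
# The vectors below are random stand-ins, not real Gemini Embedding 2 output.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
doc_vec = rng.normal(size=3072)                            # full-size embedding (3,072 dims)
query_vec = doc_vec + rng.normal(scale=0.1, size=3072)     # a "related" query vector

# Full-dimension comparison.
print(cosine_similarity(query_vec, doc_vec))

# MRL-style truncation: keep only the first 768 dimensions. With a
# Matryoshka-trained model, ranking quality degrades only slightly.
print(cosine_similarity(query_vec[:768], doc_vec[:768]))
```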
As for Gemini Embedding 2: previous multimodal embedding approaches relied on dual-encoder architectures such as CLIP (image + text) and CLAP (audio + text), requiring developers to maintain separate models per modality and manually align their vector spaces. Gemini Embedding 2 replaces this multi-model setup with a single API call across five modalities. This is a huge time saver and also makes it extremely simple for an AI model to scan through all modalities simultaneously!
Google Maps Adds Conversational Search and Redesigned Navigation
https://blog.google/products-and-platforms/products/maps/ask-maps-immersive-navigation

The News:
- Google Maps launched two Gemini-powered features on March 12: Ask Maps, a conversational query interface, and Immersive Navigation, a redesigned driving experience rolling out first in the U.S. and India.
- Ask Maps accepts natural-language queries such as “Is there a public tennis court with lights I can play at tonight?” and returns answers with a customized map view, drawing on data from over 300 million places and a contributor base of more than 500 million users.
- Responses are personalized using a user’s search history and saved locations; for example, a query for a dinner spot for four at 7 PM can factor in a pre-known preference for vegan restaurants.
- Users can book reservations, save places, or start navigation directly from an Ask Maps result.
- Immersive Navigation introduces a vivid 3D map view built from Street View imagery and aerial photos processed by Gemini models, with smart zooms, transparent building overlays, and lane-level guidance for upcoming turns.
- Voice guidance is updated to use conversational phrasing, for example: “Go past this exit and take the next one for Illinois 43 South.”
- The alternate-routes view now explicitly states tradeoffs, such as a longer trip with less traffic versus a faster route with a toll, and surfaces real-time disruptions sourced from more than 10 million daily driver contributions.
- Immersive Navigation is rolling out on iOS, Android, CarPlay, Android Auto, and cars with Google built-in.
My take: Google first introduced Gemini in Google Maps in November last year, and now they take it one step further with this new “Ask Maps” feature. There are multiple occasions where I have tried Google Maps searches that I know an LLM would be able to handle, only for them to fail, so I end up in either the web browser or in ChatGPT. If this works as well as Google says it will, I am sure most people will stick to Google Maps and use Google Search even less.
Google calls the new Immersive Navigation the “biggest transformation of the navigation experience in over a decade”, and while Apple was first with a 3D view that reflects nearby buildings, overpasses, and terrain, the way it is presented in the new Google Maps is something completely different. Immersive Navigation also draws from the combined pool of data from both Google Maps and Waze. Waze remains a separate app focused on driver-reported hazards, but this shared data layer narrows the gap.
Google Workspace Gets Gemini Integration Across Docs, Sheets, Slides, and Drive
https://workspace.google.com/blog/product-announcements/reimagining-content-creation

The News:
- On March 10, Google rolled out new Gemini features across Docs, Sheets, Slides, and Drive, available to Gemini Alpha business customers and Google AI Pro and Ultra subscribers.
- In Docs, a new “Help me create” tool generates a formatted first draft by pulling context from Drive, Gmail, Chat, and the web. A separate “Match writing style” feature analyzes a document and suggests edits to unify tone and voice across collaborator contributions. Google notes that more than a third of new Docs are created from copies of existing files, which the new formatting-match tool is intended to address.
- In Sheets, Gemini reached a 70.48% success rate on the full SpreadsheetBench dataset, which Google describes as approaching human expert performance. A “Fill with Gemini” feature auto-populates cells; Google cites a 95-participant study showing it completes 100-cell tasks nine times faster than manual entry.
- Sheets also gained an optimization tool powered by Google DeepMind and Google Research OR-Tools, which can solve scheduling and resource allocation problems from a natural language prompt, without requiring formulas or third-party tools.
- In Drive, a new “AI Overviews” feature uses semantic search to return a summarized answer with citations at the top of search results. “Ask Gemini in Drive” lets users query across files, Gmail, Calendar, and Chat, with the option to save a curated source set as a reusable “Project.”
My take: Google is really putting Gemini EVERYWHERE, and since most of their services are web-based they are having an easy time doing it. Gemini is now in all of the Google suite apps, in Google Maps, and in Google Search. Compare this to Microsoft, which has to put Copilot into their native operating system and their native apps, and you quickly understand the technical challenges with that approach. Love it or hate it, most of you will use LLMs to interact with every single service before the end of the year, and for most services it will make them both easier to use and more efficient.
Andrew Ng Releases Context Hub: Open-Source API Docs for Coding Agents
https://github.com/andrewyng/context-hub

The News:
- Context Hub is an open-source CLI tool from Andrew Ng and DeepLearning.AI that provides coding agents with curated, versioned API documentation, addressing a problem where agents generate code using deprecated endpoints or hallucinated parameters.
- Agents install the tool via npm install -g @aisuite/chub and query docs with commands like chub get openai/chat --lang py, fetching documentation written specifically for machine consumption rather than humans.
- The registry launched with 68 API providers including Stripe, OpenAI, Anthropic, Supabase, Firebase, Twilio, Shopify, and AWS, and reached over 1,500 GitHub stars within five days of Ng’s LinkedIn announcement.
- Agents can annotate docs locally using chub annotate, persisting session-specific workarounds across future sessions so the agent does not rediscover the same fix repeatedly.
- A community feedback loop lets agents upvote or flag documentation via chub feedback, routing signals back to doc maintainers.
- Docs support language-specific variants (Python and JavaScript), incremental reference file fetching to reduce token usage, and built-in SKILL.md files for agents such as Claude Code.
My take: Almost a year ago the company Upstash launched Context7: Up-to-Date Docs for LLMs and AI Code Editors. Released as an MCP server, the promise of Context7 was to let AI agents ask for the right documentation for the programming frameworks they are using, which is extremely useful when you are working with an older framework version whose APIs may have been deprecated in a newer one. In practice, however, Context7 felt slow and was problematic to use; I very seldom got the AI to actually ask Context7 for help, and it often missed the mark.
Context Hub can be seen as a modernized version of Context7. It runs as a CLI tool (which is the trend these days), and the AI agent retrieves source code documentation from it in straight Markdown format. It’s fast, small, and lean, and is something I will try myself in my own projects in the coming weeks. If you are developing with agentic AI and have struggled with Context7, give this one a go.
Karpathy’s Autoresearch Runs Autonomous ML Experiments Overnight on a Single GPU
https://github.com/karpathy/autoresearch

The News:
- Andrej Karpathy released autoresearch, an open-source Python framework that runs AI agents autonomously on ML experiments overnight, requiring no human intervention between iterations.
- The tool consists of approximately 630 lines of code, built from a stripped-down version of Karpathy’s nanochat LLM training framework, sized to fit entirely within a modern LLM’s context window.
- The agent reads a human-written Markdown file (program.md), edits the training script (train.py), runs fixed 5-minute training sprints on a single NVIDIA GPU such as an H100, and commits code changes to a git branch only when validation bits-per-byte (BPB) improves.
- In Karpathy’s own initial runs, the agent reduced validation loss from 1.0 to 0.97 BPB autonomously.
- Shopify CEO Tobi Lutke adapted the framework and reported a 19% improvement in validation scores, with the agent-optimized smaller model eventually outperforming a larger manually configured one.
- The repo crossed 8,000 GitHub stars within days of release.
My take: 8 000 GitHub stars within days of release – that’s quite a lot for 630 lines of code. So what is this thing? Autoresearch is an open-source tool released by Andrej Karpathy that runs a single AI agent in a loop overnight on your GPU. The agent reads your training script (train.py), makes a small code change, runs a fixed 5-minute training sprint, and checks whether the validation metric improved. If it did, the change is kept. If not, it is discarded. Then the loop repeats until you hit a time or cost limit.
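For those who want to picture the mechanics, here is a minimal sketch of that keep-if-better loop. The function names, the train.py flag, and the way the validation metric is read are placeholders of mine, not Karpathy’s actual code, which you should read in the repo instead.

```python
# Minimal sketch of an autoresearch-style outer loop: propose a change, run a
# short training sprint, keep the change only if validation BPB improves.
# Function names and flags are placeholders, not Karpathy's actual code.
import shutil
import subprocess
import time

def run_training_sprint() -> float:
    """Run train.py for a fixed time budget and return validation bits-per-byte."""
    out = subprocess.run(
        ["python", "train.py", "--max-minutes", "5"],  # flag name is a placeholder
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip().splitlines()[-1])  # assumes last line prints val BPB

def ask_agent_to_edit(current_best: float) -> None:
    """Placeholder: an LLM reads program.md and train.py and writes one small edit."""
    ...

best_bpb = run_training_sprint()             # establish a baseline
deadline = time.time() + 8 * 3600            # run "overnight"

while time.time() < deadline:
    shutil.copy("train.py", "train.py.bak")  # so a bad edit can be rolled back
    ask_agent_to_edit(best_bpb)
    try:
        bpb = run_training_sprint()
    except (subprocess.CalledProcessError, ValueError):
        bpb = float("inf")                   # a crashing or malformed edit is never kept
    if bpb < best_bpb:                       # lower bits-per-byte is better
        best_bpb = bpb
        subprocess.run(["git", "commit", "-am", f"agent: val BPB {bpb:.4f}"], check=True)
    else:
        shutil.move("train.py.bak", "train.py")  # discard the change
```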
To understand the importance of this, Shopify CEO Tobi Lutke adapted the “autoresearch” framework for a personal project where he reported a 19% improvement in validation scores. The agent-optimized smaller model even outperformed a larger model that had been configured through standard manual methods.
Read more:
- Andrej Karpathy Open-Sources ‘Autoresearch’: A 630-Line Python Tool Letting AI Agents Run Autonomous ML Experiments on Single GPUs – MarkTechPost
- tobi lutke on X: “OK this thing is totally insane.” / X
NVIDIA Releases Nemotron 3 Super for Multi-Agent AI Systems
https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai

The News:
- NVIDIA launched Nemotron 3 Super, a 120-billion-parameter open model with 12 billion active parameters at inference, targeting multi-agent AI workloads where context volume and inference cost are primary constraints.
- The model uses a hybrid Mixture-of-Experts (MoE) architecture combining Mamba layers for memory and compute efficiency with transformer layers for reasoning, plus a “Latent MoE” technique that activates four expert specialists at the cost of one per token generation.
- Multi-token prediction allows the model to predict several tokens simultaneously, resulting in 3x faster inference compared to standard single-token prediction.
- The model supports a 1-million-token context window, allowing agents to retain full workflow state without re-sending prior histories across long tasks.
- Weights, training data (over 10 trillion tokens), 15 reinforcement learning environments, and evaluation recipes are released under a permissive open license.
- NVIDIA AI-Q, powered by Nemotron 3 Super, currently ranks first on both the DeepResearch Bench and DeepResearch Bench II leaderboards.
My take: NVIDIA launched the Nemotron 3 series on December 15, 2025, starting with Nemotron 3 Nano, a 30-billion-parameter model. Now comes the second one, Super (120B parameters), which is to be followed by Nemotron 3 Ultra at around 500B parameters.
Performance-wise this model seems to perform better than OpenAI gpt-oss-120B but slightly worse than Qwen 3.5 122B. When running on Blackwell infrastructure they managed to get the NVFP4-quantized version to achieve 99.8% median accuracy relative to the BF16 baseline, which I found interesting. According to NVIDIA this “cuts memory requirements and pushes inference up to 4x faster than FP8 on NVIDIA Hopper, with no loss in accuracy”. This just further shows the power of the new Blackwell hardware currently rolling out to all data centers.
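To see how a 120-billion-parameter model can run with only 12 billion active parameters, here is a toy sketch of sparse expert routing. It shows a generic top-k Mixture-of-Experts layer, not Nemotron 3’s actual Latent MoE or Mamba layers: each token is scored against all experts, but only a couple of small expert networks actually run.

```python
# Toy top-k Mixture-of-Experts routing: many experts exist, but each token only
# pays for a few of them. Generic illustration, not Nemotron 3's Latent MoE.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2

router_w = rng.normal(size=(d_model, n_experts))               # routing weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ router_w                                   # score every expert
    chosen = np.argsort(logits)[-top_k:]                        # keep only the top-k
    gates = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    # Only the chosen experts do any work; the other 14 stay idle for this token.
    return sum(g * (token @ experts[i]) for g, i in zip(gates, chosen))

out = moe_layer(rng.normal(size=d_model))
print(out.shape)  # (64,), same shape as the input, but only 2 of 16 experts ran
```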