:probabl.

The open source tools of enterprise data science: A conversation with Merel Theisen

Tue, 31 Mar 2026 14:03:22 GMT

Ask any enterprise data science team what slows them down, and you’ll probably hear similar answers. Notebooks that work locally but fall apart in production. Pipeline code that only one person truly understands. New hires who spend their first weeks reverse-engineering what the team before them built. Exploring input data and cleaning it. These aren’t exotic edge cases – they are the norm.

This is where open source tools – from Skore by Probabl to Kedro by QuantumBlack – are making the difference for enterprise data science teams. This week, I sat down with Merel Theisen, Tech Lead of Kedro at QuantumBlack, to discuss how open source tools drive lasting value for enterprise data science teams.

Building for enterprise data science teams

At Probabl, we are driven by the conviction that the data science industry is failing enterprises – not for lack of compute or data, but for lack of rigor and structure. Most models never reach production. Reproducibility remains an aspiration. Knowledge walks out the door with every departing data scientist or engineer. And a generation of automated tools that promise magic instead deliver opacity, technical debt, and lock-in.

This conviction is inseparable from where we come from. Probabl was founded by the creators and maintainers of scikit-learn, the most downloaded Python library for machine learning. In March 2026 alone, scikit-learn was downloaded over 200 million times. We don’t observe the data science world from the outside. Our founders built the open source infrastructure it runs on, and keep doing so. And that history shapes what we do.

We’re building for enterprises that want to truly own their data science – those that prize tools that build institutional knowledge rather than concentrating it in black boxes or third-party platforms. As our CEO François Méro wrote recently, our four guiding principles are: (1) science first, (2) composability, (3) reusability, and (4) transparency.

We’re putting these principles into practice with Skore, available as an open source library and as an enterprise platform that empower data science teams to collaborate, scale their practice, and increase the impact of their AI projects.

None of this is built in isolation, though. The tools from the broader open source ecosystem – and the vibrant communities that maintain them at the state of the art – are essential to how enterprises can own their data science.

Kedro, the Python framework for production-ready data pipelines, is an important piece of that puzzle. By giving teams a standardized project structure and a principled way to build production-ready data pipelines, it addresses many of the same structural problems we think about at Probabl every day: how to move from individual heroics to institutional practice, from one-off experiments to reproducible, auditable systems.
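To make the idea of a standardized pipeline concrete, here is a minimal, dependency-free sketch of the underlying pattern: a pipeline expressed as pure functions wired together by named datasets. This is not Kedro's API. The `Node` class, the `run_pipeline` function, and the dataset names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Node:
    """A pure function plus the names of the datasets it reads and writes."""
    func: Callable
    inputs: Tuple[str, ...]
    output: str


def run_pipeline(nodes, catalog):
    """Run nodes in dependency order; `catalog` maps dataset names to data."""
    data = dict(catalog)  # copy so the caller's catalog is left untouched
    pending = list(nodes)
    while pending:
        # A node is ready once every dataset it reads has been produced.
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            missing = sorted(
                {name for n in pending for name in n.inputs if name not in data}
            )
            raise ValueError(f"unresolvable inputs: {missing}")
        for n in ready:
            data[n.output] = n.func(*(data[name] for name in n.inputs))
            pending.remove(n)
    return data


def drop_missing(rows):
    """Cleaning step: discard missing values."""
    return [r for r in rows if r is not None]


def mean(rows):
    """Summary step: arithmetic mean of the cleaned values."""
    return sum(rows) / len(rows)


# Nodes are declared out of order on purpose: the runner resolves dependencies.
nodes = [
    Node(mean, ("clean_rows",), "average"),
    Node(drop_missing, ("raw_rows",), "clean_rows"),
]
result = run_pipeline(nodes, {"raw_rows": [1.0, None, 3.0]})
print(result["average"])
```

Because every step declares what it reads and what it writes, the data flow can be inspected, tested, and reordered without opening a notebook.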

A conversation with Merel Theisen

To learn more about the design choices and vision guiding Kedro, I sat down with Merel Theisen, Tech Lead of Kedro and Principal Software Engineer at QuantumBlack. We discussed how Kedro is built and why, what a healthy open source data science ecosystem actually looks like in practice, and how tools like Kedro and Skore create value for enterprises.

Marie Sacksick: Merel, for someone coming into this with zero context: what is Kedro and what problems does it solve for enterprise data science teams?

Merel Theisen: Kedro is an open source Python framework hosted by the Linux Foundation. It brings software engineering best practices to data science and data engineering, giving teams a standardized way to build production-ready data pipelines. For enterprise teams specifically, it solves some very real pain points: inconsistent project structures across teams, code that works in notebooks but falls apart in production, and the difficulty of collaborating on pipeline code when everyone has their own way of doing things. Kedro gives you a common foundation. This way teams can focus on the actual data science rather than reinventing project scaffolding every time.

Marie Sacksick: Data scientists rarely use a single tool in a vacuum. You may stitch together Kedro for data pipelining, scikit-learn for training machine learning models, SHAP for interpretability, and MLflow for monitoring models once they’re in production. Can you give us a sneak peek into a time you’ve seen Kedro used with tools like scikit-learn to drive real-world impact – what was the problem, and what was the outcome?

Merel Theisen: One great example is a large Brazilian independent broker that had no formal data science practice when they started out. Their main challenge was a classic one: every data scientist built pipelines their own way, and the typical workflow meant shipping notebooks straight to production. They’d tried adopting tools like MLflow but couldn’t get adoption due to coding overhead.

The team adopted Kedro and it clicked for them because it met them where they were. It gave them standardized project structure, encouraged good software engineering practices, and let them think about models as proper software artifacts rather than one-off notebook experiments.

What’s interesting is what happened next. Once Kedro was in place as that foundational layer, adopting other MLOps tools became much easier. MLflow for experiment tracking and Great Expectations for data validation slotted in naturally because the team already had clean, structured pipelines to integrate them with.

Marie Sacksick: When you develop new features for Kedro, how much do you prioritize interoperability with tools in the wider Python data science ecosystem? And going one step further, what does a healthily integrated Python data science ecosystem look like to you?

Merel Theisen: Kedro is designed to be tool- and platform-agnostic, so it slots into existing data stacks easily. As a Python library, tools like pandas, scikit-learn, and LangChain work natively inside Kedro projects. We also offer hooks, plugins, and kedro-datasets, our community-driven data connectors, to extend functionality further. A healthy ecosystem, to me, is one where tools complement each other and users can leverage the best of each without friction.

Marie Sacksick: At Probabl, we recently launched Skore Hub, a platform that extends our open source library Skore and enables data science teams to easily track, explore, and share their data science workflows. What value do you see Kedro and Skore, when used together, creating for enterprise data science teams?

Merel Theisen: To me, Kedro and Skore address different but complementary stages of the data science workflow. Kedro provides the pipeline structure: how data flows, how code is organised, how projects scale. Skore, as I understand it, focuses on model development quality, such as evaluation reports, methodological diagnostics, and cross-validation insights. I think together they’d give enterprise teams both structured, reproducible pipelines and rigorous model evaluation with built-in best practices, which is exactly the combination needed to move from experimentation to production confidently.

Marie Sacksick: Open source thrives on collaboration, yet many enterprise users are consumers rather than upstream contributors. Could you give us a sneak peek into how you and your team have successfully encouraged others to move from just using Kedro to actually contributing to it? Based on your learnings, what is your go-to advice for enterprises that steward core Python libraries for data science and AI?

Merel Theisen: Before open-sourcing Kedro, we established strong internal standards around code quality and testing. The challenge was maintaining that bar without discouraging contributions. We invested in clear contribution guides, streamlined developer setup, and responsive PR reviews, as people shouldn’t be left waiting. We also created tiered contribution paths: kedro-datasets is an easy entry point, and our experimental dataset tier lowers the bar further, letting contributors share ideas without needing to fully polish them. My advice: make contributing feel achievable, respond quickly, and offer varied entry points for different commitment levels.

Marie Sacksick: Last but not least, how would you pitch scikit-learn to CEOs who want to leverage the power of AI in their businesses?

Merel Theisen: I’d pitch scikit-learn as the most battle-tested ML library in the Python ecosystem. It’s open source, widely adopted, and covers the vast majority of practical ML use cases. And naturally, it works seamlessly inside Kedro projects, so teams get structured pipelines with best-in-class ML tooling out of the box!

About Merel Theisen

Merel Theisen is a Principal Software Engineer at QuantumBlack, where she is currently the tech lead of Kedro, an open source project that is part of the Linux Foundation. Merel has over ten years of experience in the software industry, with most of her career focused on backend product engineering. Merel is passionate about building products that solve real user problems, and cares deeply about creating robust, well-tested software that follows good engineering principles. Merel is also a strong advocate for open source software, and finds working with the community to be both inspiring and energizing.

For more from Probabl

Follow our latest updates on LinkedIn
Subscribe to our monthly newsletter
Check out our technical explainer videos on YouTube

The value of certifying your machine learning skills: A conversation with Dr. Fabian Stephany

Tue, 24 Mar 2026 13:36:54 GMT

We find ourselves in a period of profound uncertainty regarding the future of work. As software and AI redefine traditional workflows, policy makers and business executives alike are grappling with questions like: Which skills will remain relevant, and how do we build a workforce that is resilient to the next wave of disruption?

To find answers, we must move beyond speculation and look at the evidence. Dr. Fabian Stephany, an Assistant Professor in AI and Work at the University of Oxford, is at the forefront of this effort. Leading the multidisciplinary SkillScale Project, Dr. Stephany and his team use large-scale labor market data to provide empirical insights on how emerging technologies are reshaping work and the skills that are increasingly in demand.

In this post, I break down key findings from the SkillScale Project and share my conversation with Dr. Stephany about his latest insights on these questions.

What the data says about the value of ML skills

By analyzing millions of data points from online job vacancies and digital work platforms, Dr. Stephany and his team in the SkillScale Project have been digging into how the skill composition of professions is changing. I highlight three findings that stand out to me.

The value of complementarity: AI and ML skills pay off

In a 2024 research paper published in the prestigious Research Policy journal, the researchers found that the value of a skill depends on its complementarity; that is, on the number, diversity, and value of the skills it can be combined with. By analyzing nearly 50,000 freelance projects, they found that high-value skills like data analytics derive their worth from their complementarity and function as a force multiplier when paired with others.

The researchers also found that by having general AI-related skills, professionals can earn 21% more than their peers without such skills. This includes AI-adjacent roles where the professional uses AI to enhance their primary job (e.g. a marketer using AI for content generation or a project manager using AI for forecasting).

For data scientists, this means that your professional resilience is built not by hyper-specialization per se, but by developing a diverse set of interlocking skills that create strategic options for the future. For executives, this underscores a strategic shift in how to build human capital in your enterprise: your most resilient employees are those whose skill sets are diverse enough to offer strategic options for future reskilling.

Professionals with ML skills enjoy a 40% wage premium

In the same paper, the researchers identified a hierarchy of wage premiums based on the depth of a professional’s AI expertise. In particular, there is a significant wage premium for workers who have machine learning skills, who see a 40% increase in hourly wages. This specialized expertise represents the highest wage premium, followed by other types of AI skills such as deep learning (+27%) and natural language processing (+19%).

Crucially, this premium for machine learning skills is not confined to the tech sector. The researchers found that software and technical skills are often more valuable when applied in non-tech domains; for example, commanding ten times the value in Finance or Legal sectors compared to the Tech domain itself. This indicates that machine learning has become a key general-purpose skill, where the highest economic rewards go to those who can bridge the gap between technical execution and industry-specific application.

Certified ML skills increase likelihood of landing an interview invitation

In an experimental study involving over 1,700 recruiters across the US and UK, published in a 2026 working paper, Dr. Stephany and his team examined how AI skills impact hiring decisions in graphic design, administration, and software engineering.

The researchers found that having ML and AI skills on your resume increases the likelihood of landing an interview invitation by up to 15%. Notably, AI skills can act as a powerful equalizer, capable of offsetting traditional labor market disadvantages related to age or lower formal education. In addition, verifiable certificates for machine learning and AI skills – particularly those issued by recognized universities or companies – act as a credible hiring signal.

A conversation with Dr. Fabian Stephany

I sat down with Dr. Fabian Stephany to better understand his latest insights on in-demand skills and how data scientists can upskill to remain competitive in the evolving labor market.

Arturo: Your research suggests that skills like data analysis and machine learning gain value when paired with others. For a data scientist today, what are the most underrated complementary skills that significantly boost the market value of their technical expertise?

Dr. Fabian Stephany: We certainly see a strong premium for AI-related skills such as data analysis, machine learning, and increasingly the application of AI agents in business workflows. But at the same time, our recent research shows that so-called human or soft skills are becoming more valuable as AI spreads through the workplace.

The reason is quite straightforward. As AI tools become better at handling repetitive technical tasks such as cleaning datasets, refactoring code, or drafting reports and emails, this frees up cognitive bandwidth for workers to focus on areas where humans still have a comparative advantage. These include things like ethical judgment, communication, and teamwork.

Interestingly, when we look at occupations where AI adoption is particularly strong, we also observe rising demand for exactly these kinds of human capabilities. So for technical professionals such as data scientists, it is important to think beyond purely technical development. Technical expertise remains essential, but the professionals who will benefit most from AI are those who combine it with strong collaborative skills, the ability to translate technical insights into business decisions, and a sense of responsible and ethical deployment of these technologies.

In other words, the future value of technical expertise increasingly lies in how well it is embedded in human judgment and collaboration.

Arturo: In your 2026 working paper, you found that AI skills significantly increase interview invitations, even for non-technical roles like office assistants. Since claiming AI skills is becoming easier and more common, how can recruiters distinguish between a candidate who merely uses machine learning or AI tools and one who truly understands how to integrate them into professional workflows?

Dr. Fabian Stephany: This question essentially comes down to signaling. Today it has become relatively easy to claim AI expertise, sometimes simply because someone knows how to write prompts or use a particular tool. For recruiters, that makes it increasingly difficult to distinguish between buzzwords and genuine capability.

In our research we conducted a large online experiment with more than 1,700 recruiters. Interestingly, we find that even self-reported AI skills already increase the probability of being invited to an interview. But the effect becomes significantly stronger when these skills are accompanied by credible credentials.

Micro credentials, often short courses lasting one or two weeks offered by trusted industry providers or universities, greatly strengthen the signal. Candidates who list such certified AI skills receive substantially more interview invitations. This effect is particularly strong for applicants who might otherwise face disadvantages in the labor market, such as older workers or candidates with lower levels of formal education.

So while AI skills are increasingly common claims, credible certification from trusted institutions remains a powerful way to separate genuine capability from simple buzz.

Arturo: You’ve identified that employers are increasingly prioritizing practical AI skills over traditional degrees. Do you see a future where verifiable, hands-on certifications become the primary hiring signal for technical roles?

Dr. Fabian Stephany: What we observe right now is a shift toward skill-based hiring. Employers increasingly focus on specific capabilities rather than relying solely on traditional degrees as signals.

However, this does not mean that degrees such as bachelor’s or master’s programs have lost their value. In many cases universities simply have not yet scaled up programs that focus specifically on applied AI skills. As a result employers currently rely more heavily on direct signals of skills because strong academic credentials in these areas are still relatively scarce.

Micro credentials, short targeted training programs, are filling this gap at the moment. They provide a fast and credible way to signal practical capabilities.

In the longer run, however, I expect universities to adapt. As more structured degree programs emerge around AI applications and computational skills, traditional academic credentials will continue to play an important role. The likely future is not the replacement of degrees but rather a hybrid system in which formal education and verifiable skill credentials complement one another.

Arturo: The finding that machine learning and AI skills are significantly more valuable in Finance or Legal than in Tech is interesting. Why does the translation of AI expertise into traditional industries command such a high premium?

Dr. Fabian Stephany: One explanation is the difference in technological maturity across sectors.

Many technical professions such as software engineering, machine learning engineering, or data science have already integrated forms of AI and advanced analytics for many years. Even before the recent wave of generative AI, these roles were already using machine learning methods and automation tools to optimize workflows. In other words, much of the productivity premium from these technologies has already been captured in the tech sector.

In contrast, sectors such as finance, legal services, or management are still in the earlier stages of adopting these technologies. Here the potential efficiency gains are often much larger. A lawyer, financial analyst, or manager who effectively integrates AI into their workflow may still see very substantial productivity improvements.

So the higher premium reflects the fact that AI adoption in these sectors is still catching up and therefore the marginal impact of AI expertise can be particularly large.

Arturo: Looking ahead to 2030, what is your prognosis for the sectors where skills in machine learning and AI will be the most impactful?

Dr. Fabian Stephany: Forecasts about the future of work tend to age notoriously badly, so I am cautious about making very precise predictions.

What we can say, however, is that AI has two distinct channels through which it creates value.

The first is efficiency gains, making existing processes faster, cheaper, and more reliable. In this dimension there is still enormous untapped potential, especially in small and medium-sized enterprises where digital transformation is often still incomplete.

The second channel is genuine innovation, the creation of entirely new products, services, or scientific discoveries. This is much harder to predict.

To draw an analogy from the Industrial Revolution, early on we used steam power to improve existing processes such as mechanizing textile production. The real breakthrough came later when the steam engine was put on rails and created the railway system. That fundamentally transformed the economy.

With AI we are still largely in the phase of improving existing processes. The real transformative innovations are still ahead of us. One sector where this may become particularly visible is pharmaceuticals and biotechnology, where AI could dramatically accelerate the discovery of new drugs and treatments.

Learn more about the SkillScale Project

For a deeper dive into research findings from the SkillScale Project, explore the following resources:

📺 Watch

How AI can actually boost your chances of finding a new job. Worried AI is going to steal your job? In this BBC explainer video, Dr. Fabian Stephany explains how AI skills can in fact be an ally when it comes to finding a new role.
AI Skills Improve Job Prospects. In this LinkedIn Short, Dr. Fabian Stephany explains the key findings from his 2026 paper, “AI Skills Improve Job Prospects”.
Code-Based Colleagues: The Future of Work and AI. This micro-documentary by Oxford Sparks provides an overview of how data-driven reskilling can create sustainable jobs.
Reskilling in the Age of AI. A panel discussion hosted by micro1 and Microsoft AIEI on the shifting requirements of the global workforce.
AI’s Ripple Effect on Skills and Labor Markets. Watch Dr. Fabian Stephany’s webinar lecture at Saïd Business School at the University of Oxford, detailing two years of SkillScale research findings.

About Dr. Fabian Stephany

Fabian Stephany is an Assistant Professor in AI and Work at the Oxford Internet Institute (OII), University of Oxford and a Senior Research Fellow with the Institute for New Economic Thinking at the Oxford Martin School. He is also a Future of Work fellow at the Brussels-based think tank Bruegel, an inaugural fellow at Microsoft’s AI Economy Institute, and a research affiliate at the Humboldt Institute for Internet and Society in Berlin. Additionally, he currently serves as a member of the World Economic Forum’s Global Future Council for Human Capital Development.

At the OII, Fabian leads the SkillScale project, which views skills as a central lens through which to understand today’s labour market transitions. By examining how work quality, job growth, and labour market equitability and sustainability respond to technological change, the project investigates how AI skills are becoming increasingly pivotal for workers and employers alike. As part of his Microsoft fellowship, Fabian is currently exploring the role of AI skills in employability – particularly how working with generative AI enhances job prospects and addresses the gender gap between men and women.

Fabian is also a co-creator of the Online Labour Observatory – a digital data hub hosted in collaboration with the International Labour Organization that provides researchers, policymakers, journalists, and the public with insights into online platform work. His research has been published in leading academic journals, such as Research Policy and Scientific Reports, and has received media coverage in outlets around the world, including The Washington Post, The New York Times, The Telegraph, Nikkei Asia, Handelsblatt, and the Frankfurter Allgemeine Zeitung.

Learn more about Skolar by Probabl

The creators and maintainers of scikit-learn created the Inria MOOC “Machine learning in Python with scikit-learn,” in keeping with the open source philosophy of empowering everyone, everywhere, with free knowledge.

Since 2025, this mission has been shared and actively supported by Probabl. The MOOC keeps evolving to reflect new best practices and new functionality in scikit-learn and, most importantly, to prepare learners for hands-on work in a world where code generated by agents has to be validated by capable humans in the loop.

That’s where Skolar comes into play.

Learn more about Probabl’s free educational materials for machine learning with Python and our official Skolar certifications.

Scikit-learn acceleration with GPUs: A conversation with Dr. Andy Terrel

Tue, 17 Mar 2026 14:11:42 GMT

For over a decade, scikit-learn has served as the bedrock of machine learning, supporting the work of millions of data scientists worldwide. While scikit-learn was originally designed for a CPU-centric world, the advent of new hardware presents an opportunity to supercharge machine learning pipelines.

Speeding up machine learning workflows isn’t just about technical benchmarks; it’s about turning hours of training into seconds, saving time and money for data scientists and enterprises.

I sat down with Dr. Andy Terrel from NVIDIA to discuss why this community effort is such a game-changer for the scientific Python ecosystem and enterprise data science.

Bringing you up to speed

To achieve GPU acceleration in scikit-learn without fragmenting our codebase, we are re-engineering scikit-learn to be backend agnostic. This is a mighty community effort involving our team at Probabl, our peers at Quansight and NVIDIA, and many others from the wider community.

Historically, supporting GPUs required specialized code for every library, but the array API provides a unified specification that allows scikit-learn to remain flexible. Now, when an estimator is array API-compliant, it can inspect your input data – whether it’s a PyTorch tensor or a CuPy array – and delegate the computation to the matching library’s optimized functions. If your data lives on the GPU, the computation stays on the GPU, avoiding expensive memory transfers.
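The dispatch mechanism described above can be sketched in plain Python. In the array API standard, a compliant array exposes an `__array_namespace__()` method that returns the library it belongs to. A routine calls that method and then uses only functions from the returned namespace. The `ToyArray` and `ToyNamespace` classes below are hypothetical stand-ins for real array libraries. This is an illustration of the protocol under those assumptions, not scikit-learn's implementation.

```python
class ToyArray:
    """Stand-in for a library's array type (an ndarray, a tensor, ...)."""

    def __init__(self, values, namespace):
        self.values = list(values)
        self._namespace = namespace

    def __array_namespace__(self, api_version=None):
        # Entry point defined by the array API standard:
        # the array reports which library it belongs to.
        return self._namespace


class ToyNamespace:
    """Stand-in for an array library's function namespace."""

    def __init__(self, name):
        self.name = name
        self.calls = 0  # counts how much work this "library" was asked to do

    def asarray(self, values):
        return ToyArray(values, namespace=self)

    def mean(self, x):
        self.calls += 1
        return sum(x.values) / len(x.values)

    def subtract(self, x, scalar):
        self.calls += 1
        return self.asarray(v - scalar for v in x.values)


def center(x):
    """Library-agnostic routine: ask the input which namespace to use."""
    xp = x.__array_namespace__()
    return xp.subtract(x, xp.mean(x))


cpu_lib = ToyNamespace("cpu_lib")
gpu_lib = ToyNamespace("gpu_lib")

# Data created by the "GPU" library is processed by, and stays in, that library.
centered = center(gpu_lib.asarray([1.0, 2.0, 3.0]))
print(centered.values)
print(centered.__array_namespace__().name, gpu_lib.calls, cpu_lib.calls)
```

The routine never names a backend. The input decides where the work happens, which is why data that starts on the GPU never has to leave it. In scikit-learn itself this behavior is opt-in, enabled with `sklearn.set_config(array_api_dispatch=True)` or the equivalent `sklearn.config_context`.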

We have already updated 25 estimators and core tools like the scoring API, ensuring they perform consistently across different hardware backends through rigorous automated testing. The real-world impact of this work is significant; for example, recently Olivier Grisel, my fellow scikit-learn core maintainer and ML Engineer at Probabl, demonstrated a 15x speed-up in complex machine learning pipelines by offloading compute-intensive steps to the GPU.

For a deeper dive into the technical implementation and the latest progress, I highly recommend reading the detailed technical updates by my colleague Olivier and Lucy Liu from Quansight.

A conversation with Dr. Andy Terrel

Gaël Varoquaux: Andy, for someone coming into this with zero context: why are we working so hard to accelerate Python libraries like scikit-learn with GPUs?

Andy Terrel: Two main reasons: the forward march of computing technology and the growing needs of science in the AI era. When I started programming computers, multi-threaded programs were rare. HPC centers had wonderful CPUs that could manage 2 threads, but today your phone has 6 cores. What is leading edge in the HPC center will come to commodity systems in time. GPUs are a must in any data center today, and most people are seeing them deployed in their commodity hardware as well for smaller scale simulations. The other trend is incorporating AI into the scientific workload: we see scientists needing to use tools like scikit-learn to do ensembles of models or build model regressors. By having our beloved Python scientific tools work on the GPU, we allow scientists to be more efficient and utilize the GPU as part of the full application rather than an occasionally used offloading device.

Gaël Varoquaux: At your recent webinar, “Python on the GPU: From Libraries to Kernels,” my colleague and co-maintainer of scikit-learn, Olivier Grisel, spoke about our efforts to adopt the Array API and showcased the benefits of GPU-acceleration for the data scientist working with scikit-learn. For example, in a demo, he showed that using GPUs instead of CPUs results in a 15x speed-up when tuning hyperparameters in a complex machine learning pipeline. This is fantastic but there’s still much work to be done in scikit-learn and of course in many Python libraries to enable GPU-acceleration. From your perspective, how do you think the ecosystem could evolve to make such an endeavor easier?

Andy Terrel: The Array API is a big step in allowing codes to seamlessly integrate with GPUs, but codes such as NumPy and SciPy are slow to fully convert their core routines into it. It takes time to move such code bases, but we are pointed in the correct direction. The Array API only takes a code so far, as most codes I work with also need to adopt the correct array interfaces. The array interfaces, e.g. __array_interface__ of NumPy or __cuda_array_interface__ of Numba, help integrate with other native codes. In this vein the DLPack system has become essential to provide an interface that recognizes the different devices and will avoid memory movement as needed. While some tools have adopted these APIs and interfaces, we could use more work expanding the tooling for the scientific ecosystem. For example, pure Python applications have wonderful tools like pyrefly and ty for type inference, but scientific codes can rarely use them because extension types are not cleanly represented in Python type syntax.
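The zero-copy idea behind these interfaces can be illustrated with the standard library alone. The sketch below uses Python's built-in buffer protocol through `memoryview`, a CPU-only stand-in chosen so that the example runs anywhere. It is not DLPack or the CUDA array interface, which also describe which device the memory lives on.

```python
from array import array

# `samples` owns one contiguous buffer of C doubles.
samples = array("d", [1.0, 2.0, 3.0, 4.0])

# A memoryview wraps the same memory without copying it.
view = memoryview(samples)

# Slicing a memoryview is also zero-copy: `tail` points into the same buffer.
tail = view[2:]

# Writing through the view changes the original buffer.
tail[0] = 30.0
print(samples.tolist())

# The consumer learns the layout from the view instead of from a copy.
print(view.format, view.itemsize, view.nbytes)
```

Two objects describe one block of memory, so handing data from one library to another costs nothing. With array libraries, the analogous entry point is the `from_dlpack` function specified by the array API standard.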

Gaël Varoquaux: Open source communities appreciate choice in software and hardware. How do tools like CuPy, Numba, and the Python array API standard help open source maintainers and users navigate the balance between achieving maximum performance while maintaining a healthy, backend-agnostic ecosystem?

Andy Terrel: My viewpoint is that we should build high level open tools for the 80% cases and then let users determine if they want to specialize to hardware for the extra 20%. There are many cases where an extra 20% of performance is crucial, but for many it is not. A code based on NumPy can be quickly ported to CuPy. From there, if a routine needs a more finely tuned GEMM or FFT, nvmath-python provides bindings to highly optimized CPU and GPU libraries. These optimized libraries are hard to maintain, so letting vendors provide them means OSS communities can focus on choice rather than optimal performance.

Gaël Varoquaux: Thanks to GPU acceleration, we can move from minutes to seconds when training models. It’s a win for data scientists, who get to be more productive and remain in the flow. And it’s a win for enterprises, who get to save time and money. What’s your prognosis of how these wins will materialize for enterprise data science teams? Which sectors do you think will reap the greatest rewards?

Andy Terrel: I’ve seen some big wins with scientific instruments. Grid Computing was invented because High Energy Physics needs to offload experimental data and process it. Today scientific workflows that took weeks to process can be done in minutes. When you have instruments that have configurable sensors (and all the big ones do these days), this means scientists can have more control over experiments as data is processed faster and models can be updated on the fly. This sort of acceleration directly translates to industrial processes, robotics, self-driving cars, etc. I spent time in the manufacturing space before coming to NVIDIA and we were already seeing better yields and faster times from prototype to production with machining.

Gaël Varoquaux: Most people associate GPUs with LLMs. How do we raise awareness that GPUs are also game-changing for machine learning with tabular data - the type of data that most enterprises actually run on?

Andy Terrel: In my career as a data scientist, I would advise companies to evaluate the speed of decision making with the technology they choose. If a company requires emails of spreadsheets and weekly meetings, it would take 2-4 weeks for decisions to be made. If there was a dashboard with APIs but daily standups, we would see operations changing in 2-4 days. Both these cases are essentially bringing tabular data in front of decision makers, and require careful analysis and subject matter insight to decipher. Now if we can get tabular data to be instant and correct the first time, with no more arguing about domain models, then we can see businesses adapt to the market in near real time. This is scary to business leaders, who like their spreadsheets, so there needs to be a phased introduction of the tooling to help transform the business. Unfortunately, I don’t know that optimizing enterprises has ever been seen as cool, but operational efficiency will drive leaders to better results and the tabular data model will be at its heart.

Gaël Varoquaux: Looking ahead, do you envision a world where the distinction between “CPU code” and “GPU code” in the Python stack disappears entirely for data scientists?

Andy Terrel: Today I work with data centers that have LPUs and quantum chips as well. The essential challenge is that the programming model is so different between these different chips. With AI Agents we are seeing some transfer between GPU and CPU code but the two code paths still need to be managed differently for efficiency. High core count CPUs may get to a point where the memory hierarchy of the GPU starts getting built in, but I’m a software person and I really don’t know the complexities there.

Gaël Varoquaux: Looking ahead again, what are you the most excited about when it comes to making our favorite Python libraries for data science run on GPUs?

Andy Terrel: I’m most excited about scientific discoveries. Weather prediction, nuclear fission, and astronomical discovery are all using GPUs today. Tools for scientific data analysis are incorporating AI and machine learning by default; this allows researchers to focus on the important aspects of science and perform more surveys to validate before experimentation.

Learn more about scikit-learn acceleration

🖲️ Demo

Test the GPU speed-ups in this demo made by Olivier Grisel, ML Engineer at Probabl and scikit-learn core maintainer.

📺 Watch

Olivier Grisel, ML Engineer at Probabl and core maintainer of scikit-learn, demoed a 15x speed-up in a complex ML pipeline in NVIDIA’s “Python on the GPU: From Libraries to Kernels” webinar in February 2026.

📚 Read

Adrin Jalali, VP of Labs at Probabl and scikit-learn core maintainer (March 11, 2026): Current scikit-learn priorities at Probabl - March 2026 edition
Olivier Grisel, ML Engineer at Probabl and scikit-learn core maintainer (March 10, 2026): Scikit-learn acceleration with GPUs
Lucy Liu, Software Engineer at Quansight labs and scikit-learn core maintainer (March 4, 2026): Update on array API adoption in scikit-learn

About Dr. Andy Terrel

Andy leads CUDA Python from the product management team. His research focused on domain-specific languages to generate high-performance code for physics simulations with the PETSc and FEniCS projects. Andy is a leader in the Python open-source software community. He’s most notably a co-creator of the Dask distributed computing framework, the Conda package manager, the SymPy symbolic computing library, and the NumFOCUS foundation.

For more from Probabl

Follow our latest updates on LinkedIn
Subscribe to our monthly newsletter
Check out our technical explainer videos on YouTube

Announcing Scikit-learn Central

Fri, 13 Mar 2026 10:04:02 GMT

In the world of technology, we often talk about “infrastructure” in terms of cables, data centers, and cloud regions. But for the global data science community, the most critical infrastructure isn’t made of silicon or steel–it’s made of code. Specifically, it is built upon open source libraries like scikit-learn.

When we founded Probabl, we did so with a profound sense of responsibility. As an organization founded by the creators of scikit-learn, we recognized that this library had grown beyond a mere tool; it had become the de facto open standard for machine learning worldwide.

Today, I am proud to announce the launch of Scikit-learn Central, a new digital hub designed to visualize, unite, and stimulate the sprawling ecosystem that has grown around this foundational library.

Figure 1: scikit-learn central catalog

The scale of a global standard

To understand why stewarding scikit-learn is important, one only needs to look at the numbers.

Scikit-learn has surpassed 4.1 billion downloads. Every single month, another 160 million downloads occur as data scientists, developers, and students pull the library into their environments. On GitHub, it serves as the foundation for over 1.3 million repositories and 27,000 packages. It is a project that has been sustained by the efforts of over 3,100 contributors and that has received over 65,000 stars. For comparison with other popular libraries for machine learning like XGBoost and deep learning like PyTorch and TensorFlow, take a look at the chart below that shows yearly downloads from PyPI and Conda. Looking at this metric alone over the past few years, scikit-learn has grown by more than 70% year-on-year, which is staggering.

Figure 2: Yearly downloads (PyPI + Conda) of key machine learning and deep learning libraries

Downloads and stars only tell part of the story. The impact of scikit-learn is also measured in the progress of human knowledge. The seminal paper, “Scikit-learn: machine learning in Python,” has 128,960 citations, and scikit-learn is cited in over 7,000 Nature publications. This is world-leading science in action. From identifying genetic markers in cancer research to optimizing complex climate models, scikit-learn provides the mathematical building blocks that allow experts in biology, physics, and ecology to apply machine learning to their specific domains without needing a PhD in applied mathematics.

Mission-driven stewardship

Probabl is not a typical tech company. We are a Société à Mission under French law–a status similar to a B-Corp, with our mission integrated into our corporate bylaws: “To develop, maintain at the state of the art, and sustain a complete suite of open source tools for data science to benefit … the world.”

We believe that open source tools of this magnitude require a sustainable commercial model that respects the community. Our mission is to shepherd scikit-learn, helping the community use it and improve it. That means ensuring the core remains robust while fostering the “extended universe” of libraries that make the workflow complete.

Introducing Scikit-learn Central

The scikit-learn ecosystem is vast. While the core library provides the supervised and unsupervised algorithms, the wider community has built specialized tools to handle the nuances of modern data science.

Scikit-learn Central is our attempt to make this ecosystem navigable. It is a catalog of the “building blocks” that turn a simple model into a production-ready pipeline. When you visit the catalog, you see the full breadth of what is possible:

Data preparation: Tools like Skrub simplify the often-tedious process of data cleaning and feature engineering, ensuring that the “garbage in, garbage out” mantra doesn’t derail your project.
Workflow acceleration: Skore helps data scientists move faster, with smarter cross-validation, automated evaluation reports, and methodological guidance – catching common pitfalls before they reach production.
Powerful predictions: We highlight the deep integration with libraries like XGBoost, which rely on scikit-learn’s API to deliver state-of-the-art gradient boosting.
Operational excellence: For those moving to production, MLFlow provides the MLOps framework necessary to track and deploy models at scale.
Trust and explainability: SHAP offers data scientists tools to evaluate and understand why a model makes the decisions it does.

Tabular foundation models: The new wave of Tabular Foundation Models often depend implicitly on scikit-learn. For instance, TabICLv2 uses scikit-learn during pretraining to generate good synthetic data that shapes the final model.

An invitation to the wider community

At Probabl, we know that the most compelling arguments for open source aren’t found in documentation, but in execution. That is why, alongside the catalog, we are building a library of use cases.

We are inviting data scientists from every corner of the globe to share their code and their stories. Whether you are using libraries for time-series analysis in finance, nilearn for neuroimaging, or building a custom fraud detection engine for a global bank, your work can inspire and educate others.

Scikit-learn succeeded because it made complex mathematics accessible to everyone. Scikit-learn Central aims to succeed by making the entire ecosystem accessible. We are building a future where open-source machine learning is sustainable, transparent, and more powerful than ever.

I invite you to explore the catalog on Scikit-learn Central, contribute your use cases, and join us in this next chapter.

Current scikit-learn priorities at Probabl - March 2026 edition

Wed, 11 Mar 2026 11:06:07 GMT

At Probabl, we provide significant support for scikit-learn’s maintenance and development, and it is important to us to communicate our priorities so that the broader community knows where our activities will be focused in the short term and what to expect.

We also aspire to maintain an up-to-date project board for each of these topics to keep track of progress.

Priorities

1. GPU Support via Array API

Moving from pure NumPy to the array API enables the use of a diverse set of hardware (including GPUs) as well as native support for computations done by different backends such as PyTorch or CuPy. This has been a long-running project, and many estimators and functions in scikit-learn already support it, but there’s still a lot of work to be done.
Project board: https://github.com/orgs/scikit-learn/projects/12
This work is also supported by the NASA ROSES grant.

2. Callbacks

This work enables progress reports notably in estimators such as GridSearchCV as well as inspection of estimators as they go through their iterative training process. We will also work on the related SLEP to get the required consensus and move this forward.
Project board: https://github.com/orgs/scikit-learn/projects/8/views/2
This work is also supported by a CZI-Wellcome Trust grant.

3. Tree based models

Tree-based models are some of our most used estimators, and it’s important that we give our users the best we can when it comes to these models. To that end, we will work on a variety of issues to improve them, e.g. merging Hist Gradient Boosting with Gradient Boosting.
Project board: https://github.com/orgs/scikit-learn/projects/26/views/1

4. Displays and UX

This section addresses all work related to the look and feel of what users see from scikit-learn, which includes estimator visualisations, displays as well as what’s provided on the website.
Project board: https://github.com/orgs/scikit-learn/projects/10
Project board: https://github.com/orgs/scikit-learn/projects/9/views/2
This work is also supported by a CZI-Wellcome Trust grant.

5. Metadata Routing

This is another long-running project. It is already in a shape that enables many common use cases; however, there are areas to improve before it can become the default in the library.
Project board: https://github.com/orgs/scikit-learn/projects/4

6. Misc / Maintenance / Release

Other areas where we keep our activity include:

Project maintenance: it’s always crucial to maintain the project and enable other contributors to move forward their projects and we dedicate a fair amount of resources to this area.
Free-threaded is an area supported by the NASA ROSES grant which includes maintenance of the build, as well as identifying thread safety or oversubscription issues.
Supply chain security is an area also supported by the NASA ROSES grant, which can result in some CI refactoring and improvements in our build process.

Labs @Probabl Project Board

We also have a board, https://github.com/orgs/probabl-ai/projects/8/views/1, to keep track of all active issues in our Labs team. Internally we assign a “champion” to each issue or pull request, which means that person is either the author or follows up on the work and makes sure it moves forward. Whenever necessary, we also assign a reviewer 1 and a reviewer 2, if those are lacking.

People mentioned in that board as a champion or a reviewer are either folks at Probabl or work very closely with us.

Note that the board includes all work done by our team on public repositories, which means not every entry is from the scikit-learn repo. Some entries are from other open source projects we support, e.g. skore-lib, skrub, and skops.

scikit-learn acceleration with GPUs

Tue, 10 Mar 2026 15:33:24 GMT

For over a decade, scikit-learn has served as the bedrock of machine learning, supporting the work of millions of data scientists worldwide and recently surpassing 4 billion downloads [1]. Scikit-learn was originally designed for a CPU-centric world, relying heavily on the foundational stack of NumPy, SciPy, and Cython. However, with the advent of new hardware, there are new opportunities to accelerate machine learning pipelines with scikit-learn.

Speeding up machine learning pipelines is significant for enterprises, where compute bottlenecks are not only a technical lag but also a barrier to operational agility. When model training takes hours instead of minutes, the time-to-insight stretches, delaying the impact of data science projects as a consequence. Even for smaller datasets, when model training takes minutes instead of seconds, interactive model development in notebooks stops being interactive, disrupting the quick iteration cycle of focused data scientists and their productivity as a result.

In this post, I bring you up to speed on our efforts to adopt the Python array API standard in scikit-learn in order to tackle this problem and to facilitate hardware acceleration in data science workflows. An important point to emphasize is that this transition is not only a performance optimization; it is a fundamental re-engineering that allows data scientists to leverage scikit-learn’s 200+ estimators, while delegating performance-critical tasks to GPU-backed libraries like PyTorch, CuPy, and maybe soon JAX that unlock game-changing speed-ups for data scientists building complex machine learning pipelines.

What is the Python array API standard?

Historically, library maintainers like us at scikit-learn faced a vendor lock-in challenge. If we wanted to support GPUs, we would have had to write specialized code paths for every specific backend (e.g., one for NumPy, one for CuPy, another for PyTorch). This led to fragmented codebases and maintenance overhead.

The Python array API standard [2] solves this by providing a unified specification for NumPy-like operations. It is a common language adopted by major array libraries. By targeting this specification, scikit-learn can remain “backend agnostic.”

Core Concept: When an estimator is array API-compliant, it inspects the input data. If you pass a PyTorch tensor residing on an NVIDIA GPU, scikit-learn uses the array API to dispatch the underlying linear algebra to PyTorch’s GPU kernels. The computation happens on the device where the data lives.

Converting from NumPy to the array API

Converting a library as vast as scikit-learn–which has over 200 estimators–is a significant undertaking. Indeed, whenever an estimator is converted, we also set up automated testing to ensure that it numerically behaves consistently across backends. This is a multi-year effort involving deep collaboration between Probabl, Quansight, NVIDIA, and the broader scientific Python community.

So far, approximately 25 estimators out of 200 are either partially or fully compatible, or in the final stages of integration. Most metric functions (e.g. R2, log loss, Brier score) and tools such as cross-validation functions and the scoring API have been updated. Specific tests and continuous integration configuration have also been put in place to regularly monitor the correct execution of those components on a GPU, and more test infrastructure work is in progress.

To be a bit more precise, let me explain some of the technical changes involved in converting from NumPy to the array API.

Before, the code would explicitly import NumPy (as “np”) and perform linear algebra operations on NumPy arrays passed as input to the scikit-learn functions. Now, compliant functions accept any array API-compliant input without any explicit hard dependencies on those libraries: the underlying module is retrieved (as “xp”) by inspecting the input arguments. Subsequent linear algebra operations are therefore delegated to input-specific libraries without having to couple the source code explicitly to any of those array libraries.
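To make the “np” versus “xp” pattern concrete, here is a minimal sketch. It is not scikit-learn’s actual implementation: ListArray, list_namespace, get_namespace, and mean_squared_error are hypothetical names written for this illustration, and the toy backend is a plain Python list rather than a real array library. The only part taken from the standard is the __array_namespace__ method, through which an array tells its caller which namespace operates on it.

```python
from types import SimpleNamespace


class ListArray:
    """Toy 1-D array backed by a Python list (hypothetical, for illustration)."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __array_namespace__(self, *, api_version=None):
        # Entry point defined by the array API standard: the array reports
        # which namespace ("xp") provides the functions that operate on it.
        return list_namespace

    def __sub__(self, other):
        return ListArray(a - b for a, b in zip(self.values, other.values))

    def __mul__(self, other):
        return ListArray(a * b for a, b in zip(self.values, other.values))


# Toy namespace exposing the single function used below. A real backend
# (NumPy, CuPy, PyTorch, ...) exposes the full set of standard functions,
# and its mean returns a 0-D array rather than a Python float.
list_namespace = SimpleNamespace(
    mean=lambda x: sum(x.values) / len(x.values),
)


def get_namespace(*arrays):
    """Return the one namespace shared by all inputs, or raise if they differ."""
    namespaces = {id(ns): ns for ns in (a.__array_namespace__() for a in arrays)}
    if len(namespaces) != 1:
        raise TypeError("inputs must all come from the same array library")
    return next(iter(namespaces.values()))


def mean_squared_error(y_true, y_pred):
    # No "import numpy as np" here: the backend is inferred from the inputs.
    xp = get_namespace(y_true, y_pred)
    diff = y_true - y_pred
    return xp.mean(diff * diff)


mse = mean_squared_error(ListArray([1, 2, 3]), ListArray([1, 2, 5]))
print(mse)
```

Because mean_squared_error only touches its inputs through operators and xp, the same function body would run unchanged on any array type whose namespace provides mean, which is the property that lets the computation stay on the device where the data lives.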

In practice, not all array API compatible libraries are 100% compliant with the specification (yet), and importing array_api_compat is a pragmatic way to handle the transition. For instance, PyTorch implements some features from the spec under different names. So instead of retrieving the array namespace from PyTorch, we ask array_api_compat to get a standard-compliant PyTorch wrapper. If the input array stems from a compliant library, array_api_compat simply returns that module as is.

On top of this, array-api-extra brings extra benefits that go beyond the spec and enable support for other libraries with special design constraints, such as JAX.

The value-add for the data scientist: The significance of this work for the millions of data scientists around the world who use scikit-learn lies in the seamless scalability that has been unlocked. In the past, moving a scikit-learn pipeline to a GPU required a complete rewrite using different libraries. With the array API, this transition is possible. You can now tell scikit-learn to delegate compute intensive work to GPU-aware, array API-compliant libraries.

Demo: 15x speed-up for complex ML pipelines with GPUs

To illustrate the impact of this work, I measured the time it takes to fit and evaluate the following multistep polynomial regression pipeline:

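from functools import partial

import torch
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, SplineTransformer

# X and y_torch_gpu are assumed to be defined beforehand.
poly_reg_torch_gpu = make_pipeline(
    SplineTransformer(n_knots=5),
    # Move the output of the spline step to the GPU as a PyTorch tensor.
    FunctionTransformer(partial(torch.asarray, device="cuda")),
    Nystroem(kernel="poly", degree=2, n_components=300, random_state=0),
    Ridge(solver="svd", alpha=1e-3),
)
cv_results_torch_gpu = cross_validate(
    poly_reg_torch_gpu, X, y_torch_gpu, cv=5
)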

In the above code, the SplineTransformer has not yet been updated to accept array API inputs, while the other steps have. To upgrade this pipeline, we therefore insert a FunctionTransformer step to call torch.asarray(out, device="cuda") on the output of the first step before passing the resulting PyTorch GPU array to the Nystroem step, which dramatically accelerates the last two steps by letting them operate on the CUDA device.

By offloading these steps to a GPU using the array API, I observed a 15x speed-up compared to traditional CPU execution.
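For readers who want to compute such a ratio themselves, here is a small sketch. It relies on the fact that cross_validate returns a dictionary with per-fold "fit_time" and "score_time" entries; the helper names total_time and speed_up are hypothetical, and the timings are placeholder values chosen for easy arithmetic, not measurements from the demo.

```python
def total_time(cv_results):
    """Total fit + score time (in seconds) summed over all folds."""
    return sum(cv_results["fit_time"]) + sum(cv_results["score_time"])


def speed_up(baseline_results, accelerated_results):
    """How many times faster the accelerated run is than the baseline."""
    return total_time(baseline_results) / total_time(accelerated_results)


# Placeholder timings for two folds, NOT measurements from the demo.
cv_results_cpu = {"fit_time": [2.0, 2.0], "score_time": [0.5, 0.5]}
cv_results_gpu = {"fit_time": [0.25, 0.25], "score_time": [0.25, 0.25]}

ratio = speed_up(cv_results_cpu, cv_results_gpu)
print(f"{ratio:.1f}x speed-up")
```

Summing fit and score times matches the description above of measuring the time to both fit and evaluate the pipeline; comparing fit times alone would give a different ratio.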

Jump into the demo notebook

Takeaway: Thanks to GPU acceleration, we can now tune the hyperparameters in a complex pipeline to get a very good model in the time it would take to run a single cross-validation on the Google Colab CPU. More importantly, the training speed is fast enough to avoid disrupting the model development flow of the data scientist interactively editing the Google Colab notebook.

Deep dive with NVIDIA

I recently had the pleasure of joining NVIDIA experts Andy Terrel, Sergey Maydanov, Ashwin Srinath, and Leo Fang for a technical deep dive into the CUDA Python roadmap and the adoption of the Python array API in scikit-learn.

We discussed topics like strategies for making GPU-accelerated computing more seamless and accessible for Python developers and data scientists.

We had over 700 people tune in from all over the world. If you missed the live event, I encourage you to watch the replay to see the live demo and the array API in action.

Watch the Webinar Replay

By bridging the gap between the easy-to-use and familiar interface of Python libraries and the power of GPUs, we are lowering the barrier to entry for high-performance AI, making it a practical reality for enterprises of all sizes and skills.

Probabl on stage at NVIDIA GTC 2026

Gaël Varoquaux, our CSO, and Yann Lechelle, our Executive President, will be at NVIDIA GTC 2026 in San José next week. Don’t be a stranger; connect with them there!

On March 17 (3:00 PM – 3:40 PM PDT / 11:00 PM - 11:40 PM CET), Gaël will be speaking on the “Accelerating Open Science: Incorporating CUDA Into the SciPy Ecosystem” panel. Gaël will discuss adopting CUDA in scikit-learn without sacrificing usability, portability, or community values alongside Leo Fang, Ianna Osborne, Travis Oliphant, and Katrina Riehl.

On March 18 (5:00 AM – 5:50 AM PDT / 1:00 PM - 1:50 PM CET), Yann will be speaking on the “Europe’s AI Launchpad: Unlock Startup Growth Through Sovereign AI Infrastructure [S81898]” panel. Yann will discuss the dynamic AI compute landscape as well as public and private compute options for startups alongside Cedric Auliac, Pierre-Antoine Beaudoin, and Sadaf Alam.

Words of gratitude

Our effort to adopt the array API is the result of massive teamwork. I want to extend my gratitude to the maintainers and contributors from Quansight, NVIDIA, and the community of scikit-learn core contributors. The work performed by Probabl and Quansight on scikit-learn and SciPy is supported by the NASA ROSES grant 80NSSC25K7215 “Ensuring a fast and secure core for scientific Python.” This support is vital for maintaining the health of the open source ecosystem that the world’s scientific and industrial infrastructure relies upon.

References:

[1] https://clickpy.clickhouse.com/dashboard/scikit-learn Please note that PyPI downloads are a proxy for adoption and should be taken with a grain of salt; they are not the only way to download a python library, and they may not accurately convey usage.

[2] Python array API standard https://data-apis.org/array-api/latest/

[3] Enabling array API support in scikit-learn https://scikit-learn.org/stable/modules/array_api.html

[4] Colab notebook of the demo: https://colab.research.google.com/drive/1YrCt5iBPT6gnmp7geahRn_9OqCPrfoLb?usp=sharing

Skore Is Live: Track Your Data Science

François Méro — Thu, 05 Mar 2026 12:04:56 GMT

Two weeks ago, I wrote about the five challenges holding back enterprise data science: technology-first thinking, spiraling costs, vendor lock-in, a lack of industrial maturity, and the fading of scientific thinking. This week, Guillaume Lemaitre laid out the principles for a new generation of data science tooling “built for data scientists, by data scientists.”

Today, we are putting those principles into practice. Skore is now publicly available.

Skore is the collaboration layer for teams. This first release is the first concrete step toward the future of enterprise data science we are building at Probabl. Not the finished vision. The foundation it starts from.

Get started now:

Sign up for Skore: skore.probabl.ai
Learn more: probabl.ai/skore
Explore the code: github.com/probabl-ai/skore
Read the docs: docs.skore.probabl.ai

What Skore Does Today

If you work with scikit-learn, Skore will feel immediately familiar. Same API philosophy. Same commitment to clarity. Scikit-learn gives you powerful building blocks for machine learning; Skore extends it by giving you the guidance and structure to use them well.

Here is what you can do right now.

Evaluate any model (even old ones) in one line of code. Feed your scikit-learn compatible estimator and your dataset to EstimatorReport. It automatically generates the metrics, feature importance, and plots that are most relevant to your use case. No boilerplate. No navigating through documentation to figure out which evaluation applies. Skore does that work for you, with efficient caching under the hood so everything runs fast.

Cross-validate with full visibility. CrossValidationReport gives you a complete estimator report for each fold of your cross-validation. Not just a score, a structured, inspectable report per fold. You see how your model behaves across your data.

Benchmark models side by side. Training several estimators? ComparisonReport lets you compare them in a structured way. No more ad hoc notebooks with copy-pasted metric tables. You get a clear, standardized comparison.

Catch methodological mistakes before they matter. Skore brings together the tools you need to spot modeling issues early. Explore associations between variables to understand how your features relate to each other, relationships that could impact your modeling, and put them in perspective with the feature importance as seen by your predictive model. Combine this with utilities designed to help flag potential pitfalls in your data splitting strategy, and you have the building blocks to catch fishy patterns before they compromise your model. These are the kinds of insights experienced data scientists develop over time.

Organize and persist your work. The Project system lets you save reports, experiments, and artifacts in a structured way. Everything is stored, locally or remotely. Nothing gets lost when you close a notebook.

Collaborate through Skore, the collaboration layer for teams. Teams can share, compare, and build upon each other’s experiments. It brings visibility across a team’s work, standardizes workflows without slowing anyone down and frames results for decision-making; so your next stakeholder meeting starts from structured evidence, not a scramble through notebooks.

Why This Matters

If you are a data scientist, you know the reality of your day-to-day work. You have excellent tools at your disposal: plotly and seaborn for data exploration, scikit-learn for model training and evaluation. These libraries are powerful. They are also generic by design. They accommodate a wide range of use cases without prescribing how to use them.

That is a strength but also a challenge. Your experience is the key ingredient that determines whether those building blocks are assembled correctly. You spend time navigating documentation, writing boilerplate code for common evaluations, and maintaining project structure by hand. When you are experienced, it works. When you are under pressure, or when the team has mixed levels of seniority, things slip through. Methodology gets cut short. Context gets lost. Models reach production (if ever) with flaws that could have been caught earlier.

Skore is designed and envisioned to close that gap. It acts as a conductor that transforms your way of working into structured, meaningful artifacts. It reduces the time you spend on documentation navigation, eliminates code boilerplate, and guides you toward the right methodological choices, the ones you would have made if you had infinite time and attention.

Think of it this way: scikit-learn trusts you to make the right decisions. Skore helps you actually make them, consistently, across every project.

Our First Move, Not Our Last

We want to be straightforward. This is early. Skore is at the beginning of its journey. We are shipping fast, and there is much more to come.

What you see today (evaluation reports, cross-validation insights, methodological diagnostics, model comparison, and team collaboration) is the first layer. It is where we deliver immediate, tangible value to any data scientist using scikit-learn.

But our ambition goes further. In the two posts that preceded this one, we laid out a vision for enterprise data science grounded in science, composability, reusability, and transparency. Skore is the vehicle for that vision. Over the coming months, you can expect:

Deeper guidance: starting with the scientific guardrails you already see in this release, and evolving toward contextual recommendations that learn from your practice and your organization’s data science work.
AI-powered augmentation: feeding the right context from your experiments into code generators and assistants, so that AI-generated code is grounded in your specific project, not generic suggestions.
Full process coverage: extending Skore upstream toward data preparation and downstream toward MLOps handoffs, always from the data scientist’s perspective.
Richer collaboration: multi-audience reporting, model cards, and documentation that translates technical results into business narratives.

We are building Skore the same way scikit-learn was built: step by step, guided by real-world usage, with the community as co-pilot. This release is the result of working closely with early users and our Design Partners. Their feedback and yours shape every decision.

Who Is Skore For

Skore is for data scientists who use Python and the scikit-learn ecosystem. Whether you work alone or in a team. Whether you are building your first model or managing a portfolio of hundreds.

If you are experienced, Skore saves you time. It eliminates the repetitive evaluation code you write on every project and gives you a clean, structured record of your work.

If you are building your skills, Skore accelerates your growth. The methodological warnings and automated diagnostics encode the judgment that takes years to develop. You benefit from that expertise from day one.

If you lead a data science team, Skore gives you visibility. Through Skore, you can see how experiments progress across the team, standardize best practices without micromanaging, and present results to stakeholders in a format they can act on.

And if your company has already invested in a data science practice but struggles to scale its impact, Skore is designed precisely for you. It works with your existing stack, not against it. It plugs into your environment. It does not create vendor lock-in.

Get Involved

We believe the best data science tooling comes from the community that uses it.

Sign up for Skore: skore.probabl.ai
Learn more: probabl.ai/skore
Explore the code: github.com/probabl-ai/skore
Read the docs: docs.skore.probabl.ai

We would love your feedback. File issues, contribute code, or just tell us what you think. This is the beginning. And we are building it with you.

For data scientists, by data scientists: Building the next generation of data science tooling

Tue, 03 Mar 2026 14:58:20 GMT

Data science has the power to transform how enterprises understand their world and make high-impact decisions. When executed well, it delivers profound business value through optimized operations and strategic advantages that compound over time.

However, the potential for data science remains constrained by a myriad of organizational challenges and friction points that exist throughout data science workflows, from misaligned processes to fragmented tooling. It’s clear that enterprises deserve better tooling for data science–tooling designed for data scientists, by data scientists.

In this post, I break down the evidence about these friction points and present our vision at Probabl for building the next generation of data science tooling, anchored in our deep understanding of the scientific practice of data science.

The status quo of enterprise data science

What the data says: Challenges in enterprise data science

Industry insights provide a clear signal that we have to change the status quo of enterprise data science and AI.

According to RAND, more than 80% of AI projects fail, which is twice the rate of traditional IT projects [1]. Further research finds that 87% of data science projects never even reach production [2], and Gartner predicts that, through 2026, organizations will abandon 60% of AI projects that are not supported by AI-ready data [3].

It’s also clear that these failures are rarely the result of poor data science or AI in itself; they tend to be organizational. For example, RAND finds that challenges often come from the top when business leaders do not clearly articulate specific problems that need to be solved and lean into technology-first thinking, prioritizing trendy AI solutions when simpler analytical approaches might actually suffice [1].

Many AI projects also fail because enterprises lack the necessary data to train models, and even when data exists, the unglamorous work of data wrangling consumes a disproportionate amount of time [1, 6]. This is compounded by a process mismatch, where frameworks designed for software engineering are applied to data science, despite it being an R&D process where the end product is often unclear at the outset [4-5].

Pinpointing the frictions that we must solve for

If we want to build solutions that solve these challenges, we must understand exactly where frictions live in enterprise data science workflows.

To do so, let’s think of the data science workflow as it’s visualized in Figure 1. We can distinguish the types of work by their nature rather than by job titles. Upstream work centers on making raw data available and usable, while downstream work focuses on putting models into production. Between these lies the data science work–the scientific core of the process–which demands methodological rigor and careful experimental design.

Figure 1: A typical data science workflow (arrows represent potential failure points)

While the data science ecosystem provides many powerful tools for these different types of work, the connective tissue to make them work as a coherent whole is still missing. What’s more, the current tooling landscape wasn’t designed with the data scientist’s work as its focus, but rather production, which is closer to engineering. As a result, frictions and potential failure points exist throughout the data science workflow, and we must solve for them.

These frictions include the following:

Friction at the boundaries: Currently, the transitions between upstream work (data collection, data engineering), data science work, and downstream work (MLOps) are not always optimal. In the worst case, handoffs between stages result in lost context, rewritten code, and undocumented methodological decisions. Moving from prepared data into experimentation, or from validated model to production, requires knowledge that current tools neither capture nor transfer.

Friction between stakeholders: Domain experts struggle to articulate data requirements clearly. Those leading data science teams find it difficult to translate model performance into business impact. Business leaders set objectives without understanding what AI can realistically achieve. Technical work proceeds without clarity on what success looks like.

Friction within the data science work itself: Experiments need review, results require interpretation, and methodological choices must be justified. Yet without structured processes for this validation cycle, quality control risks becoming ad hoc and inconsistent.

Friction in data science tooling: Most critically, existing tools tend to misunderstand what data science is. Data science is not software engineering. It transforms data–the essential ingredient–into impactful results through three dimensions: coding, business understanding, and scientific methodology. This scientific dimension changes everything. A methodological mistake can invalidate an entire project, leading teams to perfectly solve the wrong business problem, regardless of code quality or data accuracy. Existing tools do not address the core challenge of data science work–ensuring scientific excellence, maintaining methodological context across iterations, and translating statistical findings into business narratives.

Principles for a new generation of data science tooling

These frictions create an opportunity for a new generation of data science tooling.

At Probabl, we’re on a mission to remove these frictions and unleash the full potential of data science teams. We imagine a world where data science moves at the speed of insights, where experiments build naturally on previous work, and where the path from a question to an answer is measured in days, not months.

This requires a fundamental shift in building tools for data scientists, by data scientists. Towards this end, you can expect us to double down on the following principles.

Data science as the core

We build specifically for the data scientist.

This choice is rooted in our identity as stewards of scikit-learn. We have shaped how millions of practitioners practice machine learning. We understand that the data scientist holds a unique position as a hybrid professional who bridges quantitative excellence with business context. While AI can automate parts of code generation, it cannot replicate the contextual understanding and data-driven reasoning required for high-stakes decision making in enterprises. We believe the work carried out by data scientists is where better tooling is most needed to achieve faster outcomes and impact.

Ecosystem-first architecture

We build ecosystem-first architecture, not one-size-fits-all platforms.

We understand that data scientists have strong preferences for their tools. They are more likely to assemble a curated set of libraries within their existing environment than to adopt a rigid, all-in-one platform. Similarly, we understand that enterprises operate on diverse infrastructures. Attempting to latch a one-size-fits-all data science platform onto these heterogeneous infrastructures creates integration challenges, vendor lock-in, and operational frictions. We offer a modular, slot-in approach that respects autonomy and works with an existing technology stack rather than replacing it. This ecosystem delivers value where data scientists actually work while integrating seamlessly with diverse enterprise infrastructures.

Designed to ground the validity of the data science practice

Data science is fundamentally a process of empirical discovery.

What distinguishes data scientists from software engineers is not just technical skill–it is that data science is fundamentally empirical discovery. Data science exists to uncover novel insights from data–to reveal patterns, relationships, and knowledge that weren’t previously known or understood. Because data science is discovery, it naturally demands suitable methodological approaches. When you’re uncovering insights that will drive business decisions, methodological mistakes can lead to false discoveries, misguided strategies, and wasted resources. Our solutions ensure that empirical discovery is supported throughout the entire process, from data validation to stakeholder communication.

AI-powered augmentation

If data science is the discipline of unlocking value from data, then we should practice what we preach: we will augment data scientists by leveraging AI.

The goal is to reduce time-to-value by deploying AI where it genuinely accelerates the workflow, such as through natural language interfaces and automated diagnostics. We build intelligent systems that learn from the essence of data science practice to provide contextually relevant guidance. As advancements emerge, such as the state-of-the-art tabular foundation models TabICL [7] and TabPFN [8], we integrate these innovations while maintaining the principle that AI should always empower data scientists rather than replace them.

Full process coverage

To truly serve data scientists, we must address friction across the entire workflow.

Upstream, we bring databases closer to predictive modeling to reduce reliance on constant intervention from data engineers. Downstream, we view MLOps through the practitioner’s lens to ensure seamless handoffs while preserving scientific context. Finally, we provide tools that help data scientists explain model behavior and translate technical results into business narratives, recognizing that communication is where value is ultimately realized or lost. In this way, we seek to ensure that scientific excellence leads to real-world impact.

Putting these principles into practice

At Probabl, we’re committed to putting our principles into practice. Just as we standardized the practice of machine learning by creating scikit-learn, we intend to do the same for the entire enterprise data science practice.

For more from Probabl:

Follow our latest updates on LinkedIn
Subscribe to our monthly newsletter
Find our blog on Medium and our website
Check out our video deep-dives on YouTube

References:

[1] Ryseff, J., De Bruhl, B. F., & Newberry, S. J. (2024). The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed: Avoiding the Anti-Patterns of AI. RAND Corporation. https://www.rand.org/pubs/research_reports/RRA2680-1.html

[2] Lorica, B. (2019, July 19). Why do 87% of data science projects never make it into production? VentureBeat. https://venturebeat.com/ai/why-do-87-of-data-science-projects-never-make-it-into-production/

[3] Gartner. (2025, February 26). Lack of AI-Ready Data Puts AI Projects at Risk [Press release]. https://www.gartner.com/en/newsroom/press-releases/2025-02-26-lack-of-ai-ready-data-puts-ai-projects-at-risk

[4] Freischlag, C. (2022). Machine learning projects and Scrum: A major mismatch. Towards Data Science. https://medium.com/data-science/machine-learning-projects-and-scrum-a-major-mismatch-c155ad8e2eee

[5] Ribeiro, D. (2025). Why Agile Doesn’t Work for Data Science and How DSLP Fills the Gap. LinkedIn. https://www.linkedin.com/pulse/why-agile-doesnt-work-data-science-how-dslp-fills-gap-diogo-ribeiro-gnirf/

[6] Oleli, D. (2018, July 13). Bridging The Data Scientist Talent Gap Starts With Defining The Current Role. Forbes. https://www.forbes.com/sites/forbestechcouncil/2018/07/13/bridging-the-data-scientist-talent-gap-starts-with-defining-the-current-role/

[7] Qu, J., Holzmüller, D., Varoquaux, G., & Le Morvan, M. (2026). TabICLv2: A better, faster, scalable, and open tabular foundation model. arXiv. https://arxiv.org/abs/2602.11139 (code: https://github.com/soda-inria/tabicl; installation: https://pypi.org/project/tabicl/)

[8] Hollmann, N., Müller, S., Eggensperger, K., & Hutter, F. (2023). TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. arXiv. https://arxiv.org/abs/2207.01848 (code: https://github.com/PriorLabs/TabPFN; installation: https://pypi.org/project/tabpfn/)

Maintaining open source in the age of generative AI: Recommendations for maintainers and contributors

Tue, 24 Feb 2026 16:39:54 GMT

Almost 7 years ago, Ralf Gommers from Quansight wrote an excellent blog post about the cost of an open source contribution, where he described what happens when a contribution comes into an open source project, and the subsequent challenges and bottlenecks for maintainers. Since then, open source has become even more mainstream. The number of contributors has gone up, while the number of maintainers in many projects has stayed more or less the same.

On top of that, with the expansion of AI tools and their general availability, an increasing number of people seem to be trying their luck at “vibe contributing” to open source. Recently, it has become rather straightforward to prompt an IDE or an agent to read guidelines, claim an issue, and submit a solution without the contributor ever truly understanding their code.

Maintainers like our colleagues, who maintain core data science libraries like scikit-learn and skore, are increasingly submerged in LLM-generated comments, auto-generated issues, and many low-quality PRs that do not advance the project, such as redundant contributions to README files and ones that claim significant performance gains while failing basic linting tests.

These come almost exclusively from “first-time contributors” who seem to want to have a contribution in our project without understanding their submissions. The issue has gotten so bad that at times almost every second issue on our main repo gets at least one such message, in many cases multiple ones, to the point where some of our peers dread opening issues since they don’t want to deal with interacting with AI.

While the cost of writing and contributing code has shrunk thanks to AI, the cost of reviewing and maintaining code hasn’t. Given these developments, maintainers from many open source communities have been deliberating what to do. Some reject all AI-generated contributions, saying they have never seen useful ones or that the risks of copyright infringement or license violations are simply too great. Meanwhile, others are proactively trying to steer the types of AI-generated contributions they receive by implementing solutions like AI contribution policies and AGENTS.md files.

So, what should open source maintainers do? There is clearly no single way to deal with AI-generated contributions and by no means do we want to prescribe the best ways to go about it. Instead, we’ve summarized discussions and solutions we’ve seen lately in scikit-learn and other open source communities, with the hope that gathering this information in one place may be useful for others asking themselves the same question.

The surge of AI-generated contributions

In “The Cost of AI in Open Source Maintenance,” I (Adrin) wrote about the types of AI-generated contributions to scikit-learn, the impacts on the maintainers, as well as recommendations to future contributors. As described there, in scikit-learn, we have encountered the following types of interactions with users who post LLM-generated content.

Table 1: Types of AI-generated contributions to scikit-learn

These kinds of AI-generated contributions are creating a lot of extra work for us at scikit-learn and taking up valuable time that could be better used for other tasks like onboarding genuinely motivated contributors or advancing the priorities in our roadmap.

Our struggle isn’t happening in a vacuum though. This has become an open source-wide challenge. A series of recent discussions show this trend affects open source communities across the board, and maintainers are debating different ways to adapt to this new reality.

Just to name a few, take a look at the list of discussions in the last month alone that our colleague and scikit-learn core developer, Stefanie Senger, shared in issue #31679 in scikit-learn:

January 27, 2026: GitHub Community Discussions: Exploring Solutions to Tackle Low-Quality Contributions on GitHub.
January 29, 2026: Scientific Python blog post on Community Considerations Around AI Contributions.
February 3, 2026: GitHub ponders kill switch for pull requests to stop AI slop.
February 6, 2026: NumPy Mailing List: Current policy on AI-generated code in NumPy.
February 12, 2026: Scott Shambaugh’s blog post: An AI Agent Published a Hit Piece on Me related to closing an AI-generated PR: matplotlib/matplotlib#31132.
February 14, 2026: GitHub acknowledged open source’s spam crisis with a nice timeline on the most recent developments.

Recommendations for maintainers

If you’re a maintainer and you’re wondering what to do about AI-generated contributions to your project, perhaps some of the following approaches might be helpful.

Establish an AI use policy

Vague guidelines about AI use are not sufficient. In scikit-learn, we have codified our stance in our Automated Contributions Policy. In short, contributions require human judgment. The policy states that maintainers reserve the right to close fully automated submissions and, if judged appropriate, ban the user from the GitHub organisation. Contributors are also required to disclose AI usage. This works well because it empowers the maintainers to act decisively.

Below we’ve summarized acceptable and unacceptable uses that are mentioned in AI tool use policies and/or guidelines by open source projects and communities. Melissa Weber Mendonça of Quansight has also created this useful repo with policies of many other projects, which we recommend checking out.

Table 2: Examples of AI tool use policies and/or guidelines. URLs: scikit-learn, Scientific Python, Python Developer’s Guide (CPython Team), Apache Software Foundation, Linux Foundation, Mozilla Firefox, Zulip, and Castle Engine.

Principles you may want to include in your AI use policy

AI contribution policies come in many flavors. If you’re considering developing one for your project, you may want to take a look at the principles advocated by the Scientific Python community. In January 2026, Stefan van der Walt proposed four principles:

Be transparent
Take responsibility
Gain understanding
Honor copyright

In the NumPy mailing list, Ralf Gommers recently added a fifth principle: “we want to interact with other humans, not machines,” which LLVM recently adopted in their AI Tool Policy.

Create agent guidance like an AGENTS.md file

More and more projects are adding an AGENTS.md file to their repo, which gives explicit instructions for how agents should interact in the repo. This is a strategy for driving up the quality of automated interactions by giving agents better rules to follow.

For example, the Apache Airflow maintainers added this AGENTS.md file to their repo. Jarek Potiuk from Airflow has commented that historically they had very good human-targeted documentation for contributors, and they suspect that it helped drive up the quality of AI-generated contributions. Jarek has recommended developing good instructions and embracing the AI contributions that follow.

Prepare tooling and instructions for AI use in contributions

Jarek Potiuk from Apache Airflow has also recommended developing bespoke tooling and instructions for AI use, and in turn deliberately inviting contributors to use AI in their contributions accordingly. In the Apache Airflow contributing docs, they list guidelines for contributors who use AI tools to help them create PRs. For example, here they mention that for a large-scale UI documentation translation task, they “developed custom tooling that helped to more easily apply regular tools like coding LLM integration to aid translation efforts.” Their guidelines also mention how the maintainers may proceed if a contributor ignores, repeatedly ignores, or spams the project.

Onboard new contributors via structured programs

The “good first issue” label has become a magnet for automated agents. Thibaud Colas from Wagtail shared that they are now using matchmaking programs like Djangonaut Space or Outreachy to onboard new contributors. These programs are working for them because they ensure a sustained commitment and human-to-human connection that AI tools cannot replicate.

Be open to how contributors may be using AI tools to learn and improve

In the NumPy mailing list, Ralf Gommers from Quansight recently challenged the belief that AI tools are just making contributors dumber, suggesting they can actually facilitate learning. For example, they can be used to automate routine tasks once mastered, brainstorm design options, and write documentation that maintainers may be too busy to produce, among others.

Recommendations for contributors

Now, let’s turn to open source contributors who use AI tools.

In this day and age, it would be unreasonable to expect folks not to use any AI tools. Many of us use these tools one way or another. However, you should never submit contributions without understanding what you’re submitting.

Basically, if you are using AI tools to make open source contributions, the goal should be to reach a state of understanding where you no longer need the tool to explain your own work. You should also spend at least as much time creating your contribution as it takes a maintainer to review it. LLVM’s golden rule is that a contribution should be worth more to the project than the time it takes to review it.

As mentioned in the September blog post, below are some recommended ways that you can engage with your AI tools when contributing to open source:

Explain the codebase: Use tools to help you find where a function is defined or to explain a complex regex. This speeds up your learning curve without adding noise to the issue tracker.
Help with boilerplate: AI is excellent at generating repetitive test structures. Use it for drudgery, but write the logic yourself.
Drafting for non-native speakers: We welcome AI help with English grammar and clarity. This makes open source more accessible.
Brainstorming: Use an LLM to suggest multiple design options. This broadens discovery, but the final decision must be yours.

Below are additional recommendations from others:

Principles for high-quality contributions: The Generative AI policy in the Python Developer’s Guide recommends that contributors bear in mind four principles when making a contribution: consider whether the change is necessary; make minimal, focused changes; follow existing coding style and patterns; and write tests that exercise the change.
Transparency and responsibility: If you use an AI to help you make your contribution, be honest. Projects like Wagtail now request explicit disclosure of AI use. You must take full responsibility for every character you submit. If a maintainer asks why a certain choice was made, the answer should never be: “I am not sure, the AI did it.”
Let the problems find you: Avoid drive-by portfolio building. Instead of browsing for random issues, follow the advice of Marco Gorelli: contribute to the tools you actually use. When you encounter a bug in your own workflow, turning your frustration into a contribution is the most rewarding way to start. We know it’s hard to get started with a project. If you’re genuinely interested, drop a line to a maintainer and ask how you can help.

Our closing message to contributors is: We want to work with you, mentor you, and see you make genuine attempts to solve problems. If you use AI tools to empower your curiosity–if you are reviewing, testing, and understanding every line you submit–then you are the kind of contributor we are excited to welcome. If you want to get involved in scikit-learn, check out our guidance for first-time contributors here.

Closing thoughts

There are many ways maintainers may decide to tackle the surge of AI-generated contributions to open source projects, and ultimately it’s up to you and your community to build consensus on the way forward.

Our bottom line is that since AI tool usage is so prominent now, we should be proactive and intentional about shaping good policies and practices. This is in the long-term interest of our projects and open source in general. This means sharing our learnings, engaging in debate, and building resources together.

Beyond the hype: Charting a new direction for enterprise data science

François Méro — Tue, 17 Feb 2026 15:02:45 GMT

The data science industry is at an inflection point. We’re surrounded by high-velocity innovation and new AI technologies that have the potential to create entirely new paradigms for how we work, communicate, and even make decisions.

This momentum is certainly a testament to the incredible talent and ambition in our industry. But as we push forward, we must acknowledge that the long-term success of enterprise data science is held back by critical challenges–be it the technology-first mindset that tempts business leaders to replace proven processes with AI tools that promise magic but ultimately deliver opacity, or the spiraling pay-as-you-go costs that hamper economies of scale and all-in strategies that lead to vendor lock-in.

In light of these industry trends, I would like to suggest a different way forward for enterprise data science: one that turns data science into the industrial-grade practice it deserves to be and one that empowers enterprises to own their data science and ultimately achieve a return on their investment.

5 challenges holding back industrial-grade data science

Let’s be brutally honest with ourselves: the practice of industrial-grade data science has not yet achieved its full potential. If we, the data science industry, want to realize the long-term success of enterprise data science, we must ambitiously tackle the challenges that we face. Consider the following five, which my team at Probabl and I believe are critical.

The rising tide of technology-first thinking

New technologies have the potential to create new paradigms, and the momentum behind AI tooling tells enterprises that legacy applications and processes must be replaced because AI will surpass human creativity and productivity. But when innovation in data science doesn’t allow you to understand and reuse your existing experiments and models, it creates technical debt and amplifies costs.

The pay-as-you-go trap

On-demand pricing has become the norm. Pay for compute. Pay for GPUs. Pay for tokens. Costs spiral out of control. Budget forecasting becomes impossible. Scaling your business no longer creates economies of scale; it creates uncontrollable OPEX expansion. When you give your suppliers open-ended access to your bank account, your expansion generates their profits, not yours.

All-in strategies create lock-in

Cloud-only, GPU-only, or AI-only sound like modern and decisive strategies. In practice, they create strategic dependencies that contradict long-term value creation. When you lose autonomy, you lose freedom of movement, and your infrastructure decisions become vendor lock-in.

Data science has not reached the industrial maturity it deserves

Machine learning models rarely make it to production. Experiments are lost when team members leave, and reproducibility remains an aspiration rather than a standard. The discipline has grown in adoption but not always in rigor. Practitioners still reinvent wheels, lack shared quality standards, and operate without the engineering discipline that data science deserves. When most data science work never delivers business value because it can’t scale beyond notebooks and proofs of concept, the root cause is the lack of industrial-grade practice.

Scientific thinking has been forgotten

Data science is not software engineering. It requires a different discipline. Adopting new data science technologies should not undermine peer review, explainability, and ultimately trust. It should not create opacity and remove your control over business-critical systems. Because methodology matters, because statistical rigor matters, because explainability and understanding of your models matter, you should not rush to replace scientific discipline with automated tools that promise magic but deliver opacity.

Another way forward: Bringing the science of data to the world

To tackle these challenges, we must make a pragmatic shift in the practice of data science. At Probabl, we advocate firmly for an approach that is built on the following principles.

Transparency and explainability lead to ownership, trust, and impact

When you understand and can see how your models work, you can improve and trust them. Trust enables confidence in your decisions and accountability in your results. Understanding drives business value and competitive advantage.

Composability leads to agility and independence

We believe in agility and independence. By choosing tools that are modular and plug into your existing stack, you retain the freedom to adapt to change and choose the best tool for each specific use case. This ensures you control your destiny and pay the right price rather than being forced into a walled garden or vendor lock-in.

Reusability leads to economies of scale

Innovation should not mean that your existing investments become obsolete. When past experiments and models are treated as building blocks, you can build on experience and create true long-term value.

Science first

Data science was born from the scientific method–hypothesis, experimentation, measurement, and peer review. These foundations are precisely why data science creates value for enterprises. When science comes first, you start with the problem, not the tool. You validate before you deploy. You question before you trust. Methodology should drive tooling, not the other way around.

By returning to these principles, we can move away from automated tools that promise magic but deliver opacity, and return to the rigor and strategic autonomy that business-critical systems require.

Putting these principles into practice

Something is brewing at Probabl.

Just as our founders worked to unify the practice of machine learning with scikit-learn, we have spent the last year building something new for enterprise data science teams–for those who refuse to choose between scientific rigor and industrial scale; for those who want to move fast without losing control; and indeed for those who want to own their data science.

Our team standardized machine learning once. Now, we’re aiming to do the same for the entire enterprise data science practice.

Don't miss the reveal. To be the first to know what we’re launching in March, join our Public Launch Waitlist and follow us on LinkedIn.

Demystifying table foundation models: The new models expanding the data scientist’s toolkit

Thu, 12 Feb 2026 15:51:14 GMT

Table foundation models (TFMs) are all the rage lately, with promises to extract more value from data in tables, the bread and butter of enterprise data science. However, are they a rupture or a continuity for data science?

My short answer is that they clearly establish the need for dedicated AI for data science, as opposed to generalist LLMs, but they do not change the nature of data science work.

In recent years, I’ve focused on tabular learning in much of my scientific work, as Probabl’s CSO but even more so as a researcher at Inria, progressively shaping the ideas that led to the rise of TFMs. Just this week, we released TabICLv2 [1], a TFM that is fully open and state of the art on public benchmarks. Drawing on this experience, I shed light on frequently asked questions about TFMs.

What are TFMs?

Foundation models are models pretrained on a large amount of data to embed implicit knowledge and priors. They have powered the ChatGPT revolution, providing incredibly useful technology for natural language or images as they understand the information out of the box. But enterprises’ most valuable data is in tables, and often full of cryptic numbers and codes. Foundation models have long been unable to help process such data, where traditional machine learning shines, from linear models to gradient-boosted trees.

With recent progress, TFMs are pushing the boundaries of tabular machine learning. There are two avenues of progress: one based on capturing the semantics of the strings in tables, another on better modeling of numbers, which are crucial to tabular data. This second avenue is where we have seen the most excitement, illustrated by popular tools such as TabPFN and TabICL.

What do TFMs bring to the table?

TFMs are really tabular learners on steroids. Their benefits are visible on the TabArena benchmark [2]. For instance, the figure below (from [1]) positions TabICL and TabPFN on this benchmark, showing how TFMs reduce the gap to the smallest achieved prediction error: a 5-fold reduction compared to random forests and a 3-fold reduction compared to XGBoost. However, this error reduction comes at a cost: state-of-the-art TFMs are 3 times more expensive than XGBoost and 20 times more expensive than random forests. In addition, for mid-sized or larger tables, TFMs require large GPUs, which are scarce resources.

Figure 1: Improvability vs. train time on TabArena [1]

Machine learning concepts that are useful for understanding TFMs

First, the game underlying machine learning or statistics has always been to design “the right model” for given data. A model that is too simple will not make good use of the data. But a model that is too complex leads to noisy predictions. Better models use priors and inductive biases adapted to the properties of the data to give just the right amount of flexibility. The new ingredient in TFMs is that these priors and inductive biases are created by pretraining.

Another aspect of current TFMs is that they rely heavily on transformers and “in-context learning”. They still appear as standard machine learning tools, but not much happens during fit: the training data are merely stored. At prediction time, the training data are then used as context for the test data. In a sense, this is a mechanism well known in machine learning, as it is akin to what nearest-neighbor methods do. A simplified but useful view of TFMs is that they combine complex transformations of the input data with a nearest-neighbors mechanism.
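To make the "fit merely stores the data" idea concrete, here is a minimal sketch of a nearest-neighbor classifier in plain NumPy. It is not how any actual TFM is implemented: the class name and the toy data are hypothetical, and a real TFM would first apply a pretrained transformer to the inputs. It only illustrates that all the work can happen at prediction time, with the stored training set acting as context.

```python
import numpy as np


class StoreAndLookupClassifier:
    """Toy nearest-neighbor classifier (hypothetical name, illustration only)."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # Nothing is learned here: the training set is merely kept as "context".
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared Euclidean distance from every query row to every stored row.
        dists = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=-1)
        # Indices of the k closest stored rows for each query row.
        nearest = np.argsort(dists, axis=1)[:, : self.n_neighbors]
        # Majority vote among the labels of those neighbors.
        votes = self.y_[nearest]
        return np.array([np.bincount(row).argmax() for row in votes])


# Toy data: two well-separated clusters with integer labels 0 and 1.
X_train = np.array(
    [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
)
y_train = np.array([0, 0, 0, 1, 1, 1])

model = StoreAndLookupClassifier(n_neighbors=3).fit(X_train, y_train)
predictions = model.predict(np.array([[0.05, 0.05], [1.0, 0.95]]))
```

Note that the cost profile is inverted compared with most learners: fit is essentially free, while predict has to compare each query against the whole stored training set.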

For machine learning experts, a better analogy for the prediction mechanism of TFMs might be that of kernel machines, such as the classic SVM. Indeed, TFMs are not limited to a small number of nearest neighbors when making predictions: they can pool information across all training data when useful.
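The difference between a hard nearest-neighbor lookup and pooling over all training data can be sketched with a kernel-weighted average, in the spirit of a kernel smoother or a single attention step. This is an illustrative assumption rather than the mechanism of TabICL or TabPFN: the function name, the squared-distance similarity, and the temperature parameter are all hypothetical choices.

```python
import numpy as np


def kernel_pooled_predict(X_train, y_train, X_query, temperature=1.0):
    """Predict by softly pooling over all training targets (illustrative sketch)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Similarity score: negative squared distance, scaled by a temperature.
    sq_dists = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    scores = -sq_dists / temperature
    # Softmax over training rows (subtract the row max for numerical stability).
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each prediction is a weighted average of every training target.
    return weights @ y_train, weights


# Toy regression data on a line: the target equals the single feature.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0])

prediction, weights = kernel_pooled_predict(X_train, y_train, np.array([[1.5]]))
```

Every training row receives a non-zero weight, so distant points still contribute a little. Lowering the temperature concentrates the weights on the closest rows and recovers nearest-neighbor behavior.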

Pushing TFMs’ promises further

TFMs can be game changers when it comes to making the most of small data (“few-shot prediction”), as this is where prior knowledge is make-or-break. To draw a parallel to LLMs: LLMs can solve so many useful problems by drawing analogies to problems that they have seen in the past. In the case of tables, strings (like “Paris” or “frying pan”) in table entries and column names, among others, offer incredible promise, as they connect to prior knowledge much more easily than numbers. This promise is sketched in our 2024 paper [3] on the CARTE tabular model, as well as in Fundamental’s whitepaper [4] on the Nexus TFMs.

The dream is to bring as much world and procedural knowledge as possible into the analysis of tables. Our experience is that combining tabular models with LLMs to encode strings (for instance, using skrub’s TextEncoder [5]) already brings large benefits.

Where are TFMs the most useful?

The main asset of TFMs is that they can give strong predictions from limited data without relying on careful data preparation. For large datasets (more than 100,000 data points), other models often catch up, while the quadratic computational cost of TFMs becomes a burden.
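A back-of-the-envelope calculation shows why quadratic cost bites on large tables. The sketch below assumes a pure n-squared cost model with constants ignored, which is a simplification: actual implementations may scale differently in practice. The function name and the row counts are hypothetical.

```python
def relative_quadratic_cost(n_rows, reference_rows):
    """Cost of n_rows relative to reference_rows under an assumed O(n^2) model."""
    return (n_rows / reference_rows) ** 2


# Going from 10,000 to 100,000 rows is 10x more data...
ratio = relative_quadratic_cost(100_000, 10_000)
# ...but, under this assumption, 100x more computation.
```

By contrast, a method whose cost grows linearly with the number of rows would only pay 10 times more for the same increase in data.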

Is training data no longer important?

The few-shot prediction ability of TFMs may be misread as making training data unimportant. However, labeled data is still needed with TFMs, and more labeled data will improve data science. Not only does more labeled data lead to better-performing models, it also ensures good validation of the data science pipeline. And validation of the data science pipeline is a frequent bottleneck.

Likewise, there is still the associated training computation to consider. In the case of TFMs, it just happens at prediction time and not at fit time. This can be a problem, as it pushes cost to inference, and specific techniques, such as distillation, are being developed to decrease this cost.

My conclusion: No free lunch for the data scientist

TFMs give amazing tools to tackle a well-framed data science problem. But it is important to keep in mind that often, the bottleneck in data science is exactly framing the problem: finding the right data, the right prediction, and the right measure to optimize [6,7]. TFMs do offer benefits here, as they enable fast iterations with models that tend to work well out of the box. However, they are not a magic bullet: the data scientist is still faced with the important challenge of understanding data and applications, and bridging the gap between the two.

References:

[1] Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan. (2026). TabICLv2: A better, faster, scalable, and open tabular foundation model. https://arxiv.org/abs/2602.11139 Code: https://github.com/soda-inria/tabicl Installation: https://pypi.org/project/tabicl/

[2] Erickson, N., Purucker, L., Tschalzev, A., Holzmüller, D., Desai, P. M., Salinas, D., & Hutter, F. (2025). TabArena: A living benchmark for machine learning on tabular data. https://arxiv.org/abs/2506.16791

[3] CARTE: pretraining and transfer for tabular learning, MJ Kim, L Grinsztajn, G Varoquaux, ICML 2024. Find via: arXiv preprint, GitHub repo, HF repo: an early paper introducing the idea of bringing background knowledge to tabular learning via strong.

[4] Marta Garnelo, Wojciech Marian Czarnecki. (2026). Developing Foundation Models for Real-World Tabular Data. https://fun-research-whitepaper.s3.us-west-1.amazonaws.com/public/Fundamental_Whitepaper.pdf The whitepaper of Fundamental, probably the largest TFM startup. It describes the importance of rich joint modeling of the data at hand and prior knowledge.

[5] skrub TextEncoder: Encode string features by applying a pretrained language model downloaded from the Hugging Face Hub. https://skrub-data.org/stable/reference/generated/skrub.TextEncoder.html

[6] Unpacking the craft of an applied machine learning product manager, Sanjana Arun, https://www.productledalliance.com/unpacking-the-craft-of-an-applied-machine-learning-product-manager/?utm_source=ghost&utm_medium=email&utm_campaign=insider_newsletter Sanjana discusses the importance of understanding how to provide downstream value: defining the right measure is more often the bottleneck than the model used to optimize it.

[7] Lucas Bernardi, Themistoklis Mavridis, and Pablo Estevez. 2019. 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ‘19). Association for Computing Machinery, New York, NY, USA, 1743–1751. https://doi.org/10.1145/3292500.3330744 This retrospective analysis of the success factors of data science projects at Booking.com highlights the danger of a disconnect between the data science metric being optimized and the downstream value.

The global pulse of open source: Insights from GitHub and scikit-learn

Yann Lechelle — Thu, 05 Feb 2026 14:13:49 GMT

GitHub just released fresh data from the GitHub Innovation Graph, which provides accessible data and aggregated statistics on software development activity around the world across the millions of public repositories hosted on the platform. As the software company committed to the stewardship and long-term success of scikit-learn, we at Probabl find this data invaluable for understanding global trends in open source and situating scikit-learn within them.

The data paints a clear picture of the global nature of open source. We see that the EU is now the world leader by some metrics like git pushes, dethroning the USA for the first time. At the same time, the USA leads when it comes to the number of developers on GitHub (30 million!). Countries like India, the UK, Brazil, Korea, Japan, and China, among many others, are also world-leading hubs of open source software development on GitHub [1].

The data also underlines that national or regional economies are not islands: open source thrives because of global interlinkages and collaboration. For example, in 2025, EU-based open source projects received millions of code contributions from developers across the world, including the USA (over 934,000), UK (over 553,000), and India (over 347,000).

Source: GitHub (2026), Year recap and future goals for the GitHub Innovation Graph.

These stats resonate with us at Probabl as the steward of scikit-learn, the global open standard framework for machine learning with 3.9 billion downloads and 1.3 million dependents on GitHub.

While most core contributors are based in the EU (in particular, France and Germany), scikit-learn also counts on core contributors from the USA, Australia, and China. scikit-learn also receives a lion’s share of its issue reports and pull requests from developers all over the world. As this map from OSS Insights shows, most pull requests come from developers in the USA, India, Germany, France, the UK, Canada, Japan, and China, among others. The bottom line: the scikit-learn ecosystem is a global ecosystem.

Source: OSS Insights (2026), scikit-learn

In terms of users, scikit-learn has the wind in its sails with a growing curve of downloads (around 150 million new downloads per month). Interestingly, when viewed alongside libraries like PyTorch, the global standard for deep learning, scikit-learn maintains a significantly broader footprint in terms of downloads, reflecting the vast and growing demand for classical machine learning across almost every industry [2].

Source: Our own analysis of PyPI and conda downloads (2023-2025)

At Probabl, we’re regularly analyzing the underlying causes and trends behind the skyrocketing growth of the scikit-learn ecosystem. One thing is for sure: the users and downloaders are global. Hey GitHub Research Team, we should partner up and investigate usage trends together!

Thank you to the GitHub team for making this data available. We recommend reading more about the GitHub Innovation Graph and diving into the data. And, of course, a big thank you to everyone worldwide who contributes to scikit-learn in one way or another!

Notes:

[1] We suspect China is underrepresented in these numbers, possibly due to unreported geolocations and/or use of Chinese alternatives like Gitee. For example, there’s a scikit-learn mirror on Gitee: https://gitee.com/mirrors/scikit-learn

[2] We acknowledge that PyPI is not the only way to download a Python library and therefore the download stats may be incomplete for scikit-learn and/or PyTorch.