Scalable Python Data Analysis with BigQuery DataFrames (BigFrames)

BigQuery DataFrames (bigframes) is an open-source Python library that brings BigQuery's distributed processing to your data science workflow. With a familiar pandas- and scikit-learn-compatible API, BigFrames lets you analyze and model massive datasets where they live: directly in BigQuery.

Why Choose BigQuery DataFrames?

BigFrames eliminates the “data movement bottleneck.” Instead of downloading large datasets to a local environment, BigFrames translates your Python code into optimized SQL, executing complex transformations across the BigQuery fleet.

  • Petabyte-Scale Processing: Process datasets that far exceed local memory limits.

  • Familiar Python Ecosystem: Use the same read_gbq, groupby, merge, and pivot_table functions you already know from pandas.

  • Integrated Machine Learning: Access BigQuery ML’s powerful algorithms via a scikit-learn-like interface (bigframes.ml), including seamless Gemini AI integration.

  • Enterprise-Grade Security: Maintain data governance and security by keeping your data within the BigQuery perimeter.

  • Hybrid Flexibility: Easily move between distributed BigQuery processing and local pandas analysis with to_pandas().
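
As a rough sketch of the "familiar API" and to_pandas() points above: a helper written only against pandas-style calls (groupby, sum, sort_values, head) can be tried on a small local pandas frame and, assuming BigFrames supports the same calls, reused on a BigQuery table. The helper names top_totals and top_totals_in_bigquery and the tiny sample frame are hypothetical, made up for illustration; the BigQuery path needs bigframes installed and Google Cloud credentials, so it is defined but not called here.

```python
import pandas as pd


def top_totals(df, key, value, n=3):
    """Sum `value` per `key` and return the n largest totals.

    Only groupby, sum, sort_values and head are used, so the helper is
    written against the pandas-style API rather than a specific backend.
    """
    return (
        df.groupby(key)[value]
        .sum()
        .sort_values(ascending=False)
        .head(n)
    )


# Tiny made-up local frame, for illustration only.
local_df = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Ada", "Linus", "Grace", "Ada"],
        "number": [5, 7, 3, 4, 1, 2],
    }
)
local_top = top_totals(local_df, "name", "number", n=2)
print(local_top)


def top_totals_in_bigquery(table_id, key, value, n=3):
    """Run the same helper on a BigQuery table (hypothetical wrapper)."""
    # Imported lazily so the local demo above runs without bigframes.
    import bigframes.pandas as bpd

    remote_df = bpd.read_gbq(table_id)
    # The aggregation is expected to run in BigQuery; to_pandas() then
    # downloads only the small aggregated result.
    return top_totals(remote_df, key, value, n).to_pandas()
```

Keeping the aggregation inside the shared helper and calling to_pandas() only at the end means the download is limited to the final, small result.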

Core Components of BigFrames

BigQuery DataFrames is organized into specialized modules designed for the modern data stack:

  1. bigframes.pandas: A high-performance, pandas-compatible API for scalable data exploration, cleaning, and transformation.

  2. bigframes.bigquery: Specialized utilities for direct BigQuery resource management, including integrations with Gemini and other AI models in the bigframes.bigquery.ai submodule.

Quickstart: Scalable Data Analysis in Seconds

Install BigQuery DataFrames via pip:

pip install --upgrade bigframes

The following example demonstrates how to perform a distributed aggregation on a public dataset with millions of rows using just a few lines of Python:

import bigframes.pandas as bpd

# Initialize BigFrames and load a public dataset
df = bpd.read_gbq("bigquery-public-data.usa_names.usa_1910_2013")

# Perform familiar pandas operations that execute in the cloud
top_names = (
    df.groupby("name")
    .agg({"number": "sum"})
    .sort_values("number", ascending=False)
    .head(10)
)

# Bring the final, aggregated results back to local memory if needed
print(top_names.to_pandas())
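
The quickstart relies on your environment's default Google Cloud project and credentials. If you would rather set the billing project and location explicitly, a minimal sketch might look like the following. The helpers resolve_project and configure_bigframes are hypothetical names, and falling back to the GOOGLE_CLOUD_PROJECT environment variable is an assumption about your setup; bpd.options.bigquery.project and bpd.options.bigquery.location are the session options being set.

```python
import os


def resolve_project(explicit=None):
    """Pick a billing project: explicit argument first, then GOOGLE_CLOUD_PROJECT."""
    project = explicit or os.environ.get("GOOGLE_CLOUD_PROJECT")
    if not project:
        raise ValueError("Pass a project ID or set GOOGLE_CLOUD_PROJECT")
    return project


def configure_bigframes(project=None, location="US"):
    """Set session options before the first query (requires bigframes)."""
    # Imported lazily so resolve_project can be used without bigframes installed.
    import bigframes.pandas as bpd

    # Options should be set before the session starts, that is, before
    # the first read_gbq call.
    bpd.options.bigquery.project = resolve_project(project)
    bpd.options.bigquery.location = location
    return bpd
```

A call such as bpd = configure_bigframes("my-project-id") (a placeholder ID) would then precede the read_gbq call in the quickstart.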

Explore the Documentation

Community & Updates