diive is currently under active development with frequent updates.

Time series data processing

diive is a Python library for time series processing, in particular of ecosystem data. It was originally developed by the ETH Grassland Sciences group for Swiss FluxNet.

Recent updates: CHANGELOG · Recent releases: Releases


Package Structure

diive/
├── core/               # Foundational utilities shared across the library
│   ├── base/           # FlagBase — base class for quality and outlier flags
│   ├── dfun/           # DataFrame helpers: stats, regression, bin fitting
│   ├── funcs/          # Miscellaneous utility functions
│   ├── io/             # File detection, reading (CSV, EddyPro, TOA5), parquet I/O
│   ├── ml/             # MlRegressorGapFillingBase — base class for RF/XGBoost gap-filling
│   ├── plotting/       # Heatmaps, time series, scatter, histograms, ridge lines, cumulatives
│   ├── times/          # Timestamp sanitization, frequency detection, vectorization, resampling
│   └── utils/          # Helper utilities
│
└── pkgs/               # Domain-specific algorithms
    ├── analyses/        # Correlation, GridAggregator, GapFinder, decoupling, quantiles
    ├── binary/          # Binary-encoded value extraction
    ├── corrections/     # Offset, radiation, RH, wind direction corrections
    ├── createvar/       # DaytimeNighttimeFlag, VPD, ET, TimeSince, potential radiation
    ├── echires/         # High-resolution eddy covariance: FluxDetectionLimit, WindRotation2D
    ├── fits/            # BinFitterCP
    ├── flux/            # USTAR thresholds, self-heating correction, flux uncertainty
    ├── fluxprocessingchain/  # Orchestrated Level-2 through Level-4 flux workflows
    ├── formats/         # FLUXNET and EddyPro file format conversions
    ├── gapfilling/      # XGBoostTS, RandomForestTS, long-term multi-year gap-filling, FluxMDS, linear interpolation
    ├── outlierdetection/# Hampel, z-score, LOF, absolute limits, stepwise detection
    └── qaqc/            # FlagQCF, EddyPro flags, StepwiseMeteoScreeningDb

| Package | Key classes / functions | Description |
|---|---|---|
| diive.core.base | FlagBase | Base class for building quality and outlier flags; provides flag encoding, filtering, and visualization |
| diive.core.ml | FeatureEngineer, MlRegressorGapFillingBase | Standalone feature engineering (8-stage pipeline) and base class for ML gap-filling (RF, XGBoost); separates feature engineering from model training for better reusability |
| diive.core.io | DataFileReader, MultiDataFileReader, ReadFileType, FileSplitter | Read single or multiple instrument files (CSV, EddyPro, TOA5); detect file structure; split large files; load/save Parquet |
| diive.core.plotting | HeatmapDateTime, HeatmapXYZ, HexbinPlot, TimeSeries, ScatterXY, HistogramPlot, DielCycle, RidgeLinePlot, CumulativeYear | Visualization suite covering heatmaps, time series, scatter, histograms, diel cycles, ridge lines, hexbin plots, and cumulative plots |
| diive.core.times | TimestampSanitizer, DetectFrequency, vectorize_timestamps(), continuous_timestamp_freq() | Sanitize and validate timestamps, detect/infer data frequency, vectorize time attributes, resample diel cycles |
| diive.core.dfun | sstats(), fit_to_bins_linreg(), fit_to_bins_polyreg() | DataFrame statistics, linear/polynomial bin fitting, regression utilities |
| diive.pkgs.gapfilling | XGBoostTS, RandomForestTS, QuickFillRFTS, LongTermGapFillingRandomForestTS, LongTermGapFillingXGBoostTS, FluxMDS | Fill time series gaps with XGBoost, Random Forest (standard and long-term multi-year), MDS, or linear interpolation |
| diive.pkgs.outlierdetection | HampelDaytimeNighttime, zScore, zScoreDaytimeNighttime, LocalOutlierFactorAllData, AbsoluteLimits, AbsoluteLimitsDaytimeNighttime | Detect and flag outliers using Hampel filter, z-score, LOF, absolute limits, local SD, manual removal, or stepwise combinations |
| diive.pkgs.flux | FluxProcessingChain | Post-process eddy covariance fluxes: Level-2 quality flags, storage correction, USTAR filtering, gap-filling (RF/XGBoost/MDS), self-heating correction |
| diive.pkgs.fluxprocessingchain | FluxProcessingChain | Orchestrate a complete Level-2 → Level-4 flux processing workflow in a single pipeline |
| diive.pkgs.analyses | GapFinder, GridAggregator, daily_correlation(), SeasonalTrendDecomposition | Locate data gaps, aggregate variables into 2-D grids, compute daily correlations, decoupling analysis, quantiles, seasonal-trend decomposition |
| diive.pkgs.corrections | OffsetCorrection, WindDirectionOffset, SetToThreshold, SetToMissing | Apply measurement offsets, correct wind directions, clamp values to thresholds, set periods to missing |
| diive.pkgs.createvar | DaytimeNighttimeFlag, TimeSince, calc_vpd_from_ta_rh(), et_from_le(), potrad() | Derive new variables: daytime/nighttime flags, VPD, ET, time-since-event, potential radiation |
| diive.pkgs.qaqc | FlagQCF, StepwiseMeteoScreeningDb | Manage FLUXNET quality control flags; apply stepwise meteorological screening |
| diive.pkgs.echires | FluxDetectionLimit, WindRotation2D, MaxCovariance | Process 20 Hz eddy covariance data: detection limits, 2-D wind rotation, maximum covariance lag |
| diive.pkgs.formats | FormatEddyProFluxnetFileForUpload, FormatMeteoForEddyProFluxProcessing | Convert EddyPro output to FLUXNET submission format; prepare meteorological data for EddyPro |
| diive.pkgs.fits | BinFitterCP | Fit data to bins using a cumulative-probability approach |

Overview of example notebooks

  • For many examples see notebooks here: Notebook overview
  • More notebooks are added constantly.

Current Features

Analyses

  • Daily correlation: calculate daily correlation between two time series · func: daily_correlation() (notebook example)
  • Decoupling: Investigate binned aggregates (median) of a variable z in binned classes of x and y (notebook example)
  • Data gaps identification · class: GapFinder (notebook example)
  • Grid aggregator: calculate z-aggregates in bins (classes) of x and y · class: GridAggregator (notebook example)
  • Histogram calculation: calculate histogram from Series (notebook example)
  • Optimum range: find x range for optimum y
  • Percentiles: Calculate percentiles 0-100 for series (notebook example)
  • Seasonal-Trend Decomposition: Separate time series into trend, seasonal, and residual components using STL (Seasonal-Trend Loess), classical, or harmonic methods · class: SeasonalTrendDecomposition (notebook example)
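
To illustrate what identifying data gaps means in practice, here is a minimal sketch in plain Python. It is not the GapFinder implementation; the function name `find_gaps` and the use of `None` for missing records are assumptions made for this example only.

```python
from typing import List, Optional, Tuple


def find_gaps(values: List[Optional[float]]) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each run of consecutive missing values.

    Hypothetical helper for illustration; missing records are encoded as None.
    """
    gaps = []
    start = None  # index where the current gap began, or None if not inside a gap
    for i, value in enumerate(values):
        if value is None:
            if start is None:
                start = i  # a new gap starts here
        elif start is not None:
            gaps.append((start, i - start))  # the gap ended at the previous record
            start = None
    if start is not None:  # the series ends inside a gap
        gaps.append((start, len(values) - start))
    return gaps


series = [1.2, None, None, 0.8, 1.1, None, 0.9]
for gap_start, gap_length in find_gaps(series):
    print(f"gap starting at record {gap_start}, {gap_length} record(s) long")
```

Sorting the result by gap length is a quick way to find the longest gaps before choosing a gap-filling method.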

Corrections

  • Offset correction for measurement: correct measurement by offset in comparison to replicate · class: OffsetCorrection (notebook example)
  • Offset correction radiation: correct nighttime offset of radiation data and set nighttime to zero
  • Offset correction relative humidity: correct RH values > 100%
  • Offset correction wind direction: correct wind directions by offset, calculated based on reference time period · class: WindDirectionOffset (notebook example)
  • Set to threshold: set values above or below a threshold value to threshold value · class: SetToThreshold
  • Set exact values to missing: set exact values to missing records · class: SetToMissing (notebook example)
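
Two of the corrections above are simple enough to sketch in a few lines of plain Python. This only illustrates the idea and does not show the WindDirectionOffset or SetToThreshold classes; both function names and their parameters are hypothetical.

```python
from typing import List


def apply_wind_direction_offset(wd_deg: List[float], offset_deg: float) -> List[float]:
    """Add a constant offset to wind directions and wrap the result into [0, 360)."""
    return [(wd + offset_deg) % 360.0 for wd in wd_deg]


def set_to_threshold(values: List[float], lower: float, upper: float) -> List[float]:
    """Clamp values below `lower` to `lower` and values above `upper` to `upper`."""
    return [min(max(v, lower), upper) for v in values]


# A direction of 350 degrees shifted by +20 degrees wraps around north
corrected_wd = apply_wind_direction_offset([350.0, 10.0], offset_deg=20.0)

# Relative humidity readings slightly above 100 % are capped at 100 %
capped_rh = set_to_threshold([98.5, 100.7, 101.2], lower=0.0, upper=100.0)
```

The modulo operation matters for wind direction because it is a circular variable: a plain addition would produce values outside the 0 to 360 degree range.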

Create variable

Functions to create various variables.

  • Time since: calculate time since last occurrence, e.g. since last precipitation · class: TimeSince (notebook example)
  • Daytime/nighttime flag: calculate daytime flag, nighttime flag and potential radiation from latitude and longitude · class: DaytimeNighttimeFlag (notebook example)
  • Vapor pressure deficit: calculate VPD from air temperature and RH · func: calc_vpd_from_ta_rh() (notebook example)
  • Calculate ET from LE: calculate evapotranspiration from latent heat flux · func: et_from_le() (notebook example)
  • Calculate air temperature from sonic anemometer temperature · func: air_temp_from_sonic_temp() (notebook example)
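
The two conversions listed above, VPD from air temperature and RH and ET from LE, can be sketched as follows. This is a sketch under stated assumptions, not the code behind calc_vpd_from_ta_rh() or et_from_le(): it uses the Tetens-type saturation vapor pressure approximation from FAO-56 and a constant latent heat of vaporization of 2.45 MJ kg-1, and all function names are hypothetical.

```python
import math


def saturation_vapor_pressure_kpa(ta_degc: float) -> float:
    """Saturation vapor pressure in kPa (Tetens-type approximation, FAO-56)."""
    return 0.6108 * math.exp(17.27 * ta_degc / (ta_degc + 237.3))


def vpd_kpa(ta_degc: float, rh_percent: float) -> float:
    """Vapor pressure deficit in kPa from air temperature (deg C) and RH (%)."""
    es = saturation_vapor_pressure_kpa(ta_degc)
    return es * (1.0 - rh_percent / 100.0)


def et_mm_per_hour(le_w_m2: float, latent_heat_j_kg: float = 2.45e6) -> float:
    """Evapotranspiration in mm per hour from latent heat flux in W m-2.

    LE [W m-2] / lambda [J kg-1] gives kg m-2 s-1, which equals mm s-1 of water;
    multiplying by 3600 converts to mm per hour. A constant lambda is an
    assumption of this sketch.
    """
    return le_w_m2 / latent_heat_j_kg * 3600.0


print(f"VPD at 20 deg C and 50 % RH: {vpd_kpa(20.0, 50.0):.3f} kPa")
print(f"ET at LE = 245 W m-2: {et_mm_per_hour(245.0):.2f} mm per hour")
```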

Eddy covariance high-resolution

  • Flux detection limit: calculate flux detection limit from high-resolution data (20 Hz) · class: FluxDetectionLimit
  • Maximum covariance: find maximum covariance between turbulent wind and scalar · class: MaxCovariance
  • Turbulence: wind rotation to calculate turbulent departures of wind components and scalar (e.g. CO2) · class: WindRotation2D

Files

Input/output functions.

  • Detect files: detect expected and unexpected (irregular) files in a list of files · class: FileDetector
  • Split files: split multiple files into smaller parts and export them as (compressed) CSV files · class: FileSplitter
  • Read single data files: read file using parameters · class: DataFileReader (notebook example)
  • Read single data files: read file using pre-defined filetypes · class: ReadFileType (notebook example)
  • Read multiple data files: read files using pre-defined filetype · class: MultiDataFileReader (notebook example)

Fits

Flux

Functions specifically for eddy covariance flux data.

  • Flux processing chain · class: FluxProcessingChain (notebook example)
    • The notebook example shows the application of:
      • Post-processing of eddy covariance flux data.
      • Level-2 quality flags
      • Level-3.1 storage correction
      • Level-3.2 outlier removal
      • Level-3.3 USTAR filtering using constant thresholds
      • Level-4.1 gap-filling using long-term random forest, XGBoost, and/or MDS
      • For info about the Swiss FluxNet flux levels, see here.
  • Quick flux processing chain (notebook example)
  • Flux detection limit: calculate flux detection limit from high-resolution eddy covariance data · class: FluxDetectionLimit (notebook example)
  • Self-heating correction for open-path IRGA NEE fluxes:
    • create scaling factors table and apply to correct open-path NEE fluxes during a time period of parallel measurements (notebook example)
    • apply previously created scaling factors table to long-term open-path NEE flux data, outside the time period of parallel measurements (notebook example)
  • USTAR threshold scenarios: display data availability under different USTAR threshold scenarios
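
As a rough illustration of the USTAR threshold scenarios mentioned above, the following sketch computes the share of records that would survive filtering with different constant thresholds. It is not diive's implementation; the function name, the scenario labels, and the threshold values are made up for this example, and availability is expressed relative to the non-missing USTAR records.

```python
from typing import Dict, List, Optional


def availability_percent(ustar: List[Optional[float]], threshold: float) -> float:
    """Percentage of non-missing records with USTAR at or above the threshold."""
    valid = [u for u in ustar if u is not None]
    if not valid:
        return 0.0
    kept = sum(1 for u in valid if u >= threshold)
    return 100.0 * kept / len(valid)


# Hypothetical friction velocity record (m s-1), None marks a missing value
ustar_series = [0.02, 0.08, 0.15, 0.30, None]

# Hypothetical constant thresholds (m s-1) for three scenarios
scenarios: Dict[str, float] = {"low": 0.05, "medium": 0.10, "high": 0.20}

for name, threshold in scenarios.items():
    share = availability_percent(ustar_series, threshold)
    print(f"{name}: threshold {threshold} m s-1 keeps {share:.0f} % of records")
```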

Formats

Format data to specific formats.

  • Format: convert EddyPro fluxnet output files for upload to FLUXNET database · class: FormatEddyProFluxnetFileForUpload (notebook example)
  • Parquet files: load and save parquet files · funcs: load_parquet(), save_parquet() (notebook example)

Gap-filling

Fill gaps in time series with various methods.

Feature Engineering (v0.91.0) · class: FeatureEngineer

  • Standalone 8-stage feature engineering pipeline (composable, reusable across models)

    • Stage 1: Lagged features from past and future values
    • Stage 2: Rolling statistics (mean, std, median, min, max, quartiles)
    • Stage 3: Temporal differencing (1st and 2nd order momentum)
    • Stage 4: Exponential Moving Average (EMA) with recent-value emphasis
    • Stage 5: Polynomial expansion (squared, cubed terms)
    • Stage 6: STL decomposition (trend, seasonal, residual components)
    • Stage 7: Timestamp vectorization (season, month, hour, etc.)
    • Stage 8: Continuous record numbering for trend detection
  • Pre-engineer features once, reuse across multiple models (RF + XGB simultaneously)

  • Independent testing and debugging of feature engineering

  • XGBoostTS · class: XGBoostTS (notebook example (minimal), notebook example (more extensive))

    • Use FeatureEngineer to create features, pass pre-engineered data to XGBoostTS
  • RandomForestTS · class: RandomForestTS (notebook example)

    • Use FeatureEngineer to create features, pass pre-engineered data to RandomForestTS
  • Long-term gap-filling using RandomForestTS · class: LongTermGapFillingRandomForestTS (notebook example)

  • Long-term gap-filling using XGBoostTS · class: LongTermGapFillingXGBoostTS (for multi-year data with USTAR scenario support)

  • Linear interpolation · func: linear_interpolation() (notebook example)

  • Quick random forest gap-filling · class: QuickFillRFTS (notebook example)

  • MDS gap-filling of ecosystem fluxes · class: FluxMDS (notebook example), approach by Reichstein et al., 2005
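
Linear interpolation is the simplest of the methods listed above. The sketch below shows the principle in plain Python; it is not the code behind linear_interpolation(), and the function name `interpolate_gaps` and its `limit` parameter are assumptions of this example.

```python
from typing import List, Optional


def interpolate_gaps(values: List[Optional[float]],
                     limit: Optional[int] = None) -> List[Optional[float]]:
    """Fill interior gaps (None) by linear interpolation between neighbors.

    Gaps at the start or end of the series are left untouched because they have
    only one neighbor. Gaps longer than `limit` records are also left untouched.
    """
    filled = list(values)
    n = len(values)
    i = 0
    while i < n:
        if values[i] is not None:
            i += 1
            continue
        start = i
        while i < n and values[i] is None:
            i += 1
        end = i  # first index after the gap
        gap_length = end - start
        if start == 0 or end == n:
            continue  # leading or trailing gap
        if limit is not None and gap_length > limit:
            continue  # gap too long for this simple method
        left, right = values[start - 1], values[end]
        step = (right - left) / (gap_length + 1)
        for k in range(gap_length):
            filled[start + k] = left + step * (k + 1)
    return filled


filled_series = interpolate_gaps([1.0, None, None, 4.0, None, 6.0], limit=2)
```

Restricting the fill to short gaps is a common precaution, since a straight line is a poor model for longer periods in flux data with a strong diel cycle.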

Comprehensive Examples for CO2 Flux Data

  • FluxProcessingChain examples for CO2 half-hourly flux (NEE) gap-filling:
    • Both Random Forest and XGBoost examples are fully activated and comprehensively documented
    • Optimized feature engineering for diurnal photosynthetic patterns (lag, rolling, EMA, STL decomposition)
    • Feature reduction enabled by default (SHAP-based selection reduces ~45-50 features to ~10-20)
    • Hyperparameters tuned for ecosystem flux data with detailed tuning guidance
    • Model comparison code to select best algorithm for your site
    • See diive/pkgs/fluxprocessingchain/fluxprocessingchain.py for detailed examples (~100 lines each)
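
Three of the feature types named in this section (lag, rolling, and EMA) can be sketched in plain Python as shown below. This illustrates the concepts only and does not reproduce FeatureEngineer; the function names and parameters are hypothetical, and the sketch assumes a series without missing values.

```python
from typing import List, Optional


def lagged(values: List[float], lag: int) -> List[Optional[float]]:
    """Shift the series by `lag` records.

    A positive lag brings in past values, a negative lag brings in future values.
    Positions without a source record are set to None.
    """
    n = len(values)
    out: List[Optional[float]] = [None] * n
    for i in range(n):
        j = i - lag
        if 0 <= j < n:
            out[i] = values[j]
    return out


def rolling_mean(values: List[float], window: int) -> List[Optional[float]]:
    """Mean over a trailing window; None until the window is full."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i - window + 1:i + 1]) / window)
    return out


def ema(values: List[float], alpha: float) -> List[float]:
    """Exponential moving average; a larger alpha puts more weight on recent values."""
    out: List[float] = []
    previous = None
    for v in values:
        previous = v if previous is None else alpha * v + (1 - alpha) * previous
        out.append(previous)
    return out


# Hypothetical driver variable, e.g. air temperature
ta = [10.0, 12.0, 14.0, 13.0]
features = {
    "TA_lag1": lagged(ta, 1),
    "TA_lead1": lagged(ta, -1),
    "TA_rollmean2": rolling_mean(ta, 2),
    "TA_ema": ema(ta, alpha=0.5),
}
```

Building the feature table once and handing the same table to several models is what makes a comparison between Random Forest and XGBoost results straightforward.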

Outlier Detection

Multiple tests combined

  • Step-wise outlier detection: combine multiple outlier flags to one single overall flag

Single tests

Create single outlier flags where 0=OK and 2=outlier.

  • Absolute limits: define absolute limits · class: AbsoluteLimits (notebook example)
  • Absolute limits daytime/nighttime: define absolute limits separately for daytime and nighttime data · class: AbsoluteLimitsDaytimeNighttime (notebook example)
  • Hampel filter daytime/nighttime: identify outliers with a Hampel filter, separately for daytime and nighttime data · class: HampelDaytimeNighttime (notebook example)
  • Local standard deviation: Identify outliers based on the local standard deviation from a running median (notebook example)
  • Local outlier factor: Identify outliers based on local outlier factor, across all data · class: LocalOutlierFactorAllData (notebook example)
  • Local outlier factor daytime/nighttime: Identify outliers based on local outlier factor, daytime nighttime separately (notebook example)
  • Manual removal: Remove time periods (from-to) or single records from time series (notebook example)
  • Missing values: create a flag that indicates available and missing data in a time series · class: MissingValues (notebook example)
  • Trimming: Remove values below threshold and remove an equal amount of records from high end of data (notebook example)
  • z-score: Identify outliers based on the z-score across all time series data · class: zScore (notebook example)
  • z-score increments daytime/nighttime: Identify outliers based on the z-score of double increments (notebook example)
  • z-score daytime/nighttime: Identify outliers based on the z-score, separately for daytime and nighttime · class: zScoreDaytimeNighttime (notebook example)
  • z-score rolling: Identify outliers based on the rolling z-score (notebook example)
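
The flag convention (0=OK, 2=outlier) and the idea of merging several single-test flags into one overall flag can be sketched as follows. This is not the zScore class or the step-wise detection in diive; the function names are hypothetical, taking the per-record maximum is just one simple way to combine flags, and the sketch assumes a series without missing values.

```python
import statistics
from typing import List


def zscore_flag(values: List[float], threshold: float = 4.0) -> List[int]:
    """Flag records whose absolute z-score exceeds the threshold (0=OK, 2=outlier)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0] * len(values)  # constant series, nothing can be flagged
    return [2 if abs(v - mean) / sd > threshold else 0 for v in values]


def absolute_limits_flag(values: List[float], minimum: float, maximum: float) -> List[int]:
    """Flag records outside [minimum, maximum] (0=OK, 2=outlier)."""
    return [0 if minimum <= v <= maximum else 2 for v in values]


def combine_flags(*flags: List[int]) -> List[int]:
    """Overall flag per record: 2 as soon as any single test flagged the record."""
    return [max(per_record) for per_record in zip(*flags)]


data = [1.0] * 20 + [100.0]
overall = combine_flags(
    zscore_flag(data, threshold=4.0),
    absolute_limits_flag(data, minimum=0.0, maximum=50.0),
)
```

A single extreme value inflates both the mean and the standard deviation, which is one reason robust alternatives such as the Hampel filter are listed alongside the plain z-score.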

Plotting

  • Cumulatives across all years for multiple variables · class: Cumulative (notebook example)
  • Cumulatives per year · class: CumulativeYear (notebook example)
  • Diel cycle per month · class: DielCycle (notebook example)
  • Heatmap date/time: show values (z) of a time series as date (y) vs. time (x) · class: HeatmapDateTime (notebook example)
  • Heatmap year/month: plot monthly ranks across years · class: HeatmapYearMonth (notebook example)
  • Heatmap XYZ: show z-values in bins of x and y — pairs naturally with GridAggregator · class: HeatmapXYZ (notebook example)
  • Hexbin plot: aggregate flux values into 2D hexagonal bins of driver variables; supports percentile normalization and configurable aggregation functions · class: HexbinPlot (notebook example)
  • Histogram: includes options to show z-score limits and to highlight the peak distribution bin · class: HistogramPlot (notebook example)
  • Long-term anomalies: calculate and plot long-term anomaly for a variable, per year, compared to a reference period · class: LongtermAnomaliesYear (notebook example)
  • Ridgeline plot: looks a bit like a landscape · class: RidgeLinePlot (notebook example)
  • Time series plot: Simple (interactive) time series plot · class: TimeSeries (notebook example)
  • ScatterXY plot · class: ScatterXY (notebook example)
  • Various classes to generate heatmaps, bar plots, time series plots and scatter plots, among others

Quality control

  • Stepwise MeteoScreening from database · class: StepwiseMeteoScreeningDb (notebook example)

Resampling

  • Diel cycle: calculate diel cycle per month · func: diel_cycle() (notebook example)
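
A diel cycle per month is, in essence, a mean per combination of month and hour of day. The sketch below shows that grouping in plain Python; it is not the diel_cycle() function, and the name `diel_cycle_per_month` and the record layout are assumptions of this example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import fmean
from typing import Dict, Iterable, Optional, Tuple


def diel_cycle_per_month(
    records: Iterable[Tuple[datetime, Optional[float]]]
) -> Dict[Tuple[int, int], float]:
    """Mean value per (month, hour of day); missing values (None) are skipped."""
    groups = defaultdict(list)
    for timestamp, value in records:
        if value is not None:
            groups[(timestamp.month, timestamp.hour)].append(value)
    return {key: fmean(vals) for key, vals in groups.items()}


# Hypothetical half-hourly records
records = [
    (datetime(2024, 6, 1, 12, 0), 10.0),
    (datetime(2024, 6, 2, 12, 30), 20.0),
    (datetime(2024, 6, 3, 12, 0), None),
    (datetime(2024, 7, 1, 12, 0), 5.0),
]
cycle = diel_cycle_per_month(records)
```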

Stats

Timestamps

  • Continuous timestamp: create continuous timestamp based on number of records in the file and the file duration · func: continuous_timestamp_freq()
  • Time resolution: detect time resolution from data · class: DetectFrequency (notebook example)
  • Timestamps: create and insert additional timestamps in various formats · class: TimestampSanitizer
  • Vectorize timestamps: add date attributes as columns to dataframe, including sine/cosine variants for cyclical variables (e.g., day of year) · func: vectorize_timestamps() (notebook example)
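
The sine/cosine variants exist because a cyclical variable such as day of year has no natural start or end: 31 December and 1 January are neighbors in time but far apart as plain numbers. The sketch below shows the encoding for day of year; it is not the vectorize_timestamps() implementation, and the function name and output keys are hypothetical.

```python
import calendar
import math
from datetime import datetime
from typing import Dict


def cyclical_day_of_year(timestamp: datetime) -> Dict[str, float]:
    """Encode day of year as a point on the unit circle."""
    doy = timestamp.timetuple().tm_yday
    days_in_year = 366 if calendar.isleap(timestamp.year) else 365
    angle = 2.0 * math.pi * doy / days_in_year
    return {"doy": doy, "doy_sin": math.sin(angle), "doy_cos": math.cos(angle)}


end_of_year = cyclical_day_of_year(datetime(2023, 12, 31))
start_of_year = cyclical_day_of_year(datetime(2024, 1, 1))
```

In this encoding the two dates end up close together on the circle, so a model can learn seasonal patterns without an artificial jump at the turn of the year.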

Installation

diive is currently under active development using Python v3.11.

Using pip

pip install diive

Using poetry

poetry add diive

From source

Directly use .tar.gz file of the desired version.

pip install https://github.com/holukas/diive/archive/refs/tags/v0.76.2.tar.gz

Create and use a conda environment for diive

One way to install and use diive with a specific Python version on a local machine:

  • Install miniconda
  • Start miniconda prompt
  • Create an environment named diive-env that contains Python 3.11: conda create --name diive-env python=3.11
  • Activate the new environment: conda activate diive-env
  • Install diive using pip: pip install diive
  • To start JupyterLab, type jupyter lab in the prompt
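
After installation, one way to check which version ended up in the active environment is to query the package metadata with the Python standard library. The helper name `installed_version` is made up for this sketch; it returns None if the package is not installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version("diive"))
```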