Streamlit application for visualizing and comparing Italian climate data from multiple sources: TMYx (Typical Meteorological Year), RCP climate scenarios (Representative Concentration Pathways), and CTI weather stations.
TMYx→FWG (B route, recommended):

```bash
python data/data_preparation_scripts/05B_build_station_parquets.py --out data/04__italy_tmy_fwg_parquet && python data/data_preparation_scripts/06B_precompute_station_tables.py --data-dir data/04__italy_tmy_fwg_parquet && streamlit run app.py
```

CTI (standalone EPW, B-style route):

```bash
python data/data_preparation_scripts/05C_build_cti_station_parquets.py --cti-root data/01__italy_cti --out-root data/04__italy_cti_parquet && python data/data_preparation_scripts/06C_precompute_cti_tables.py --parquet-root data/04__italy_cti_parquet && streamlit run app.py
```

Legacy EPW/TMYx (A route):

```bash
python data/data_preparation_scripts/05A_build_epw_index_and_extract.py --root data/01__italy_epw_all --root data/02__italy_fwg_outputs --out data/03__italy_all_epw_DBT_streamlit --regional && python data/data_preparation_scripts/06A_precompute_derived_stats.py --data-dir data/03__italy_all_epw_DBT_streamlit --compute-aggregates && streamlit run app.py
```

If files already exist:

```bash
streamlit run app.py
```

- Ensure `requirements.txt` is in the app root (it lists `streamlit`, `pandas`, `numpy`, `pyarrow`, `plotly`, `altair`, `Pillow`).
- In the Streamlit Cloud dashboard: New app → connect your repo, set Main file path to `app.py`, and App root to the folder that contains `app.py` and `requirements.txt`.
- The app expects data under `data/04__italy_tmy_fwg_parquet/` and optionally `data/04__italy_cti_parquet/`. For cloud deployment, either commit sample data, mount external storage, or run the preparation scripts in a separate step and upload the generated `data/` (or use Streamlit's secrets for data URLs if you host data elsewhere).
Location: `data/04__italy_tmy_fwg_parquet/_tables/`

- `D-TMYxFWG__DBT__F-DD__L-ALL.parquet` - Daily stats (tidy): `region`, `station_key`, `scenario`, `date`, `Tmax`, `Tmean`, `Tmin`
- `D-TMYxFWG__Inventory__F-NA__L-ALL.parquet` - Station inventory + available scenario columns
- `pairing_debug.csv` - Missing/present scenario columns per station + baseline pairing

Location: `data/04__italy_cti_parquet/_tables/`

- `D-CTI__DBT__F-DD__L-ALL.parquet` - Daily stats (tidy): `region`, `station_key`, `scenario`, `date`, `Tmax`, `Tmean`, `Tmin`
- `D-CTI__DBT__F-MM__L-ALL.parquet` - Monthly stats (tidy): `region`, `station_key`, `scenario`, `month`, `Tmax`, `Tmean`, `Tmin`
- `D-CTI__Inventory__F-NA__L-ALL.parquet` - Station inventory (`cols` is typically `cti`)
```
FWG/
├── app.py                                    # Main Streamlit application
├── libs/
│   └── fn__libs.py                           # Helper functions library
├── data/
│   ├── 01__italy_epw_all/                    # Source EPW files (baseline climate)
│   ├── 02__italy_fwg_outputs/                # Future Weather Generator outputs (RCP scenarios)
│   ├── 04__italy_tmy_fwg_parquet/            # B route (per-station parquet + tables)
│   │   ├── <REGION>/<STATION_KEY>.parquet    # Hourly station parquet (wide columns)
│   │   └── _tables/                          # Precomputed tables for the app
│   ├── 01__italy_cti/                        # CTI weather station EPW files
│   │   ├── epw/                              # 110 weather station EPW files
│   │   └── CTI__list__ITA_WeatherStations__All.csv
│   ├── 04__italy_cti_parquet/                # CTI parquet output (per-station + tables)
│   │   ├── <REGION>/<STATION_KEY>.parquet    # Hourly station parquet (wide columns)
│   │   └── _tables/                          # Precomputed CTI tables
│   ├── 03__italy_all_epw_DBT_streamlit/      # Legacy app data directory (A route)
│   │   ├── epw_index.json                    # TMYx metadata
│   │   ├── cti_index.json                    # CTI metadata (if processed)
│   │   ├── DBT__HR__XX.parquet               # Regional hourly data (20 files)
│   │   ├── CTI__HR__XX.parquet               # CTI regional hourly (20 files, if processed)
│   │   ├── daily_stats.parquet               # Daily aggregates (REQUIRED)
│   │   └── *_by_*.parquet                    # Precomputed stats (optional)
│   └── data_preparation_scripts/             # Data processing scripts
│       ├── 05_build_epw_index_and_extract.py
│       ├── 05b_build_cti_epw_index_and_extract.py
│       └── 06_precompute_derived_stats.py
└── README.md                                 # This file
```
Step 1: Build per-station parquets

```bash
python data/data_preparation_scripts/05B_build_station_parquets.py --out data/04__italy_tmy_fwg_parquet
```

Step 2: Precompute app tables

```bash
python data/data_preparation_scripts/06B_precompute_station_tables.py --data-dir data/04__italy_tmy_fwg_parquet
```

Output:

- `data/04__italy_tmy_fwg_parquet/<REGION>/<STATION_KEY>.parquet`
- `data/04__italy_tmy_fwg_parquet/_tables/D-TMYxFWG__DBT__F-DD__L-ALL.parquet`
- `data/04__italy_tmy_fwg_parquet/_tables/D-TMYxFWG__Inventory__F-NA__L-ALL.parquet`
- `data/04__italy_tmy_fwg_parquet/_tables/pairing_debug.csv`
The Streamlit app prefers the `_tables` outputs for all summaries and only loads a per-station hourly parquet when a detailed station plot is requested.
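That summary-first pattern can be sketched as two cached loaders - here with `functools.lru_cache` standing in for Streamlit's `st.cache_data`; the paths follow the layout above, but the function names are illustrative:

```python
from functools import lru_cache
from pathlib import Path

import pandas as pd

# Table directory from the layout documented above
TABLES_DIR = Path("data/04__italy_tmy_fwg_parquet/_tables")

@lru_cache(maxsize=1)
def load_daily_table() -> pd.DataFrame:
    """Small precomputed table: loaded once, reused for every summary view."""
    return pd.read_parquet(TABLES_DIR / "D-TMYxFWG__DBT__F-DD__L-ALL.parquet")

@lru_cache(maxsize=8)
def load_station_hourly(region: str, station_key: str) -> pd.DataFrame:
    """Large hourly parquet: only read when a detailed station plot is requested."""
    path = TABLES_DIR.parent / region / f"{station_key}.parquet"
    return pd.read_parquet(path)
```

In the real app, `st.cache_data` plays the role of `lru_cache`, so repeated widget interactions never re-read the parquet files.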
Step 1: Build per-station parquets

```bash
python data/data_preparation_scripts/05C_build_cti_station_parquets.py \
    --cti-root data/01__italy_cti \
    --out-root data/04__italy_cti_parquet
```

Step 2: Precompute app tables

```bash
python data/data_preparation_scripts/06C_precompute_cti_tables.py \
    --parquet-root data/04__italy_cti_parquet
```

Output:

- `data/04__italy_cti_parquet/<REGION>/<STATION_KEY>.parquet`
- `data/04__italy_cti_parquet/_tables/D-CTI__DBT__F-DD__L-ALL.parquet`
- `data/04__italy_cti_parquet/_tables/D-CTI__DBT__F-MM__L-ALL.parquet`
- `data/04__italy_cti_parquet/_tables/D-CTI__Inventory__F-NA__L-ALL.parquet`
The app never computes CTI daily/monthly stats at runtime; it always uses the precomputed `_tables`.
For TMYx/EPW data:

```bash
python data/data_preparation_scripts/05_build_epw_index_and_extract.py \
    --root data/01__italy_epw_all \
    --root data/02__italy_fwg_outputs \
    --out data/03__italy_all_epw_DBT_streamlit \
    --regional
```

Creates:

- `epw_index.json` - Metadata for ~4,144 EPW files
- `DBT__HR__AB.parquet` through `DBT__HR__VN.parquet` - 20 regional files (97 MB total)
Time: ~10-60 minutes (depending on EPW file count)
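The core extraction step - pulling hourly dry-bulb temperature (DBT) out of an EPW file - can be sketched as below. EPW files carry 8 header lines followed by comma-separated hourly rows, with dry-bulb temperature as the seventh field; the helper here is an illustration, not the 05_ script's actual code:

```python
import io

import pandas as pd

def read_epw_dbt(epw_text: str) -> pd.Series:
    """Extract hourly dry-bulb temperature (DBT) from EPW file text.

    EPW files have 8 header lines followed by comma-separated hourly
    rows; dry-bulb temperature is the 7th field (index 6).
    """
    rows = epw_text.splitlines()[8:]  # skip the 8 EPW header lines
    df = pd.read_csv(io.StringIO("\n".join(rows)), header=None)
    dbt = df[6].astype(float)
    dbt.name = "DBT"
    return dbt

# Tiny synthetic EPW: 8 header lines + 2 data rows (most fields stubbed)
sample = "\n".join(
    ["LOCATION,Test"] + ["HEADER"] * 7
    + ["2007,1,1,1,0,flags,12.5,0,0", "2007,1,1,2,0,flags,11.8,0,0"]
)
dbt = read_epw_dbt(sample)
```

A real EPW row has many more fields, but only index 6 matters for the DBT extraction shown here.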
First, rename CTI files (one-time setup):

```bash
cd data/01__italy_cti/epw
python rename_cti_regions.py
cd ../../..
```

Then extract CTI hourly data:

```bash
python data/data_preparation_scripts/05b_build_cti_epw_index_and_extract.py \
    --root data/01__italy_cti/epw \
    --out data/03__italy_all_epw_DBT_streamlit \
    --regional
```

Creates:

- `cti_index.json` - Metadata for 110 weather stations
- `CTI__HR__AB.parquet` through `CTI__HR__VN.parquet` - 20 regional CTI files
Time: ~2-5 minutes
Basic (daily stats only):

```bash
python data/data_preparation_scripts/06_precompute_derived_stats.py \
    --data-dir data/03__italy_all_epw_DBT_streamlit
```

Recommended (with aggregates for maximum performance):

```bash
python data/data_preparation_scripts/06_precompute_derived_stats.py \
    --data-dir data/03__italy_all_epw_DBT_streamlit \
    --compute-aggregates
```

With CTI data:

```bash
python data/data_preparation_scripts/06_precompute_derived_stats.py \
    --data-dir data/03__italy_all_epw_DBT_streamlit \
    --process-cti \
    --compute-aggregates
```

Creates:

- `daily_stats.parquet` - Daily aggregates with `scenario` column (REQUIRED - eliminates 27s+ startup)
- `file_stats_by_percentile.parquet` - File-level statistics
- `location_stats_by_variant_percentile.parquet` - Location statistics
- `location_deltas_by_variant_pair_percentile.parquet` - Variant comparisons
- `monthly_delta_tables_by_variant_pair_percentile_metric.parquet` - Monthly tables
Time:
- Basic: ~2-5 minutes
- With aggregates: ~7-20 minutes (but makes app near-instant)
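The daily aggregation itself is a simple groupby; a sketch of the Tmax/Tmean/Tmin computation on synthetic hourly data (the real input is the regional hourly parquet):

```python
import numpy as np
import pandas as pd

# Synthetic 48 hours of hourly DBT for one file/scenario
idx = pd.date_range("2007-07-01", periods=48, freq="h")
hourly = pd.DataFrame({"DBT": np.linspace(15.0, 30.0, 48)}, index=idx)

# Daily aggregates, one row per calendar day (the daily_stats.parquet shape)
daily = (
    hourly["DBT"]
    .groupby(hourly.index.date)
    .agg(Tmax="max", Tmean="mean", Tmin="min")
    .rename_axis("date")
    .reset_index()
)
```

The precompute script does the same reduction per `rel_path`/`scenario` group, which is why the result is small enough to load at startup.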
```bash
streamlit run app.py
```

Expected startup time:

- Without precomputation: 45-60 seconds ❌
- With `daily_stats.parquet`: 5-10 seconds ✅
- With all aggregates: 2-5 seconds ⚡
All data preprocessing is now handled by one unified script: `06_precompute_derived_stats.py`

Deprecated:

- ❌ `00_precompute_daily_stats.py` - Superseded by 06_
- ❌ `01_precompute_aggregated_stats.py` - Merged into 06_

Current:

- ✅ `06_precompute_derived_stats.py` - Handles ALL derived statistics
```
Hourly Data (DBT__HR__*.parquet, CTI__HR__*.parquet)
        ↓
06_precompute_derived_stats.py
├── Compute Daily Stats (always)
├── Process CTI data (if --process-cti)
└── Compute Aggregates (if --compute-aggregates)
        ↓
    ├── File stats by percentile
    ├── Location stats by variant & percentile
    ├── Location deltas by variant pair & percentile
    └── Monthly delta tables
        ↓
Streamlit App (fast/near-instant startup)
```
- ✅ Single logical flow - All derived statistics in one place
- ✅ Simpler workflow - No need to remember multiple scripts
- ✅ Better documentation - Clear what each flag does
- ✅ Easier maintenance - Changes in one file
- ✅ User-friendly - One command with optional flags
After testing both approaches, the regional format proved 5.7x more space-efficient:
| Format | Files | Total Size | Notes |
|---|---|---|---|
| Regional ✅ | 20 | 97 MB | RECOMMENDED |
| Per-Station ❌ | 4,144 | 554 MB | NOT recommended |
- Parquet Compression: Columnar storage with dictionary encoding compresses repeated values (`rel_path`, `scenario`) to nearly zero bytes
- File System Overhead: 4,144 files carry significant metadata overhead vs 20 files
- Parquet File Overhead: Each file has a header/footer, schema, and statistics - multiplied 207x with the per-station approach
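The effect of dictionary encoding on a repetitive column can be seen even in memory, using pandas' categorical dtype as an analogue (the scenario labels below are illustrative):

```python
import pandas as pd

# A highly repetitive column like `scenario`: few unique values, many rows
scenario = pd.Series(["TMYx", "RCP45_2050", "RCP85_2080"] * 20_000)

plain_bytes = scenario.memory_usage(deep=True)
dict_bytes = scenario.astype("category").memory_usage(deep=True)

# The dictionary-encoded version stores each unique string once plus
# small integer codes - a large reduction for repeated values
ratio = plain_bytes / dict_bytes
```

Parquet's dictionary encoding applies the same idea on disk, which is why the 20 regional files with repeated `rel_path`/`scenario` values stay so compact.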
Keep the efficient regional format + enhance Streamlit app with:
- Location filter (select specific stations)
- Scenario filter (TMYx, RCP, CTI)
- Real-time statistics
✅ Best of both worlds: optimal storage + flexible data exploration
CTI (Comitato Termotecnico Italiano) weather station data provides real observed climate data for 110 Italian locations.
CTI files use 3-letter codes (ABR, BAS, CAL), but the app uses 2-letter codes (AB, BC, LB).
```bash
cd data/01__italy_cti/epw
python rename_cti_regions.py
```

Mapping examples:

- `ABR` → `AB` (Abruzzo)
- `BAS` → `BC` (Basilicata)
- `CAL` → `LB` (Calabria)
- `LAZ` → `LZ` (Lazio)
- `SIC` → `SC` (Sicilia)

(See the full mapping table in the script.)
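The rename logic amounts to a prefix lookup; a sketch using only the example mappings listed above (the real script holds the full table, and the exact pre-rename filename pattern is an assumption here):

```python
import re

# Example 3-letter → 2-letter mappings from the list above; the full
# table lives in rename_cti_regions.py
REGION_MAP = {"ABR": "AB", "BAS": "BC", "CAL": "LB", "LAZ": "LZ", "SIC": "SC"}

def rename_cti_filename(name: str) -> str:
    """Rewrite a leading 3-letter region code to its 2-letter app code.

    Assumes filenames start with the region code followed by a double
    underscore, e.g. "ABR__AQ__Station.epw" → "AB__AQ__Station.epw".
    """
    match = re.match(r"^([A-Z]{3})__", name)
    if match and match.group(1) in REGION_MAP:
        return REGION_MAP[match.group(1)] + name[3:]
    return name
```

Files that don't match a known 3-letter code are left untouched, which is why the script is safe to run more than once.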
```bash
python data/data_preparation_scripts/05b_build_cti_epw_index_and_extract.py \
    --root data/01__italy_cti/epw \
    --out data/03__italy_all_epw_DBT_streamlit \
    --regional
```

Output:

- `cti_index.json` - Station metadata (lat/lon/alt)
- `CTI__HR__XX.parquet` - Regional hourly data (20 files)
Run 06_ with the `--process-cti` flag to merge CTI and TMYx data:

```bash
python data/data_preparation_scripts/06_precompute_derived_stats.py \
    --data-dir data/03__italy_all_epw_DBT_streamlit \
    --process-cti \
    --compute-aggregates
```

CTI Regional Files (`CTI__HR__XX.parquet`):
- Index: `datetime` (hourly, 8760 per year)
- Columns: `DBT`, `location_id`

CTI Index (`cti_index.json`):

```json
{
  "location_id": "AB__AQ__L'Aquila",
  "location_name": "L'Aquila",
  "region": "AB",
  "latitude": 42.1368853,
  "longitude": 13.6103410,
  "altitude": 700.0,
  "source": "CTI"
}
```

- Real Observed Data: Actual measurements from Italian weather stations
- Consistent Format: Matches the TMYx data structure
- Regional Organization: Easy to load specific regions
- App Integration: Automatically detected in the Data Preview tab
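Each index entry is plain JSON, so station metadata can be pulled out with the standard library; using the sample record above:

```python
import json

# The sample record from the CTI index shown above
record_json = """
{
  "location_id": "AB__AQ__L'Aquila",
  "location_name": "L'Aquila",
  "region": "AB",
  "latitude": 42.1368853,
  "longitude": 13.6103410,
  "altitude": 700.0,
  "source": "CTI"
}
"""
station = json.loads(record_json)

# Fields the app uses for mapping and grouping
region, lat, lon = station["region"], station["latitude"], station["longitude"]
```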
| Workflow | App Startup | Widget Changes | Recommended |
|---|---|---|---|
| No precomputation | 45-60s | 15-30s | ❌ |
| With daily_stats | 5-10s | 5-10s | |
| With all aggregates | 2-5s | 0.1-0.5s | ✅ Best |
Daily stats (`daily_stats.parquet`):
- Eliminates 27+ seconds of app startup
- Required for variant filtering
Aggregates (with `--compute-aggregates`):
- File stats by percentile (95%, 97.5%, 99%)
- Location stats for all variants
- Location deltas for all variant pairs
- Monthly delta tables for all combinations
Impact:
- 5-10x faster widget interactions
- Near-instant responses for cached computations
- Makes the app feel responsive and professional
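The percentile statistics are standard quantiles over hourly DBT; a sketch on synthetic data for the three levels listed above:

```python
import numpy as np
import pandas as pd

# Synthetic year of hourly DBT for one location (illustrative distribution)
rng = np.random.default_rng(0)
dbt = pd.Series(rng.normal(loc=20.0, scale=8.0, size=8760))

# The percentile levels used by the aggregate tables: 95%, 97.5%, 99%
levels = [0.95, 0.975, 0.99]
percentiles = dbt.quantile(levels)
```

Precomputing these per file, variant, and variant pair is what the `--compute-aggregates` flag does ahead of time, so the app never recomputes them per widget change.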
The Streamlit app includes a "Data Preview" tab with:
- Subtabs per Italian region (Abruzzo, Basilicata, etc.) + CTI tab
- Location selector - Choose specific weather station
- Scenario plots - Automatic plots for each available scenario (TMYx, RCP45, RCP85, CTI)
- Statistics panel - DBT stats, time range, data quality
- Troubleshoot data issues by location
- Compare scenarios visually
- Verify data completeness
- No need to export or use external tools
"daily_stats.parquet not found"

```bash
python data/data_preparation_scripts/06_precompute_derived_stats.py --data-dir data/03__italy_all_epw_DBT_streamlit
```

"No matching locations found between baseline and compare variant"

- Ensure `daily_stats.parquet` has a `scenario` column
- Re-run 06_ to regenerate it with the proper schema

Slow app startup (20+ seconds)

- Run 06_ with the `--compute-aggregates` flag for maximum performance

"No EPW files found"

- Check that EPW files exist in `data/01__italy_epw_all/` or `data/02__italy_fwg_outputs/`
- Verify the `--root` paths are correct (relative to the FWG root directory)

"Missing root(s)" error in the 05_ script

- Run the script from the FWG root directory, not from `data_preparation_scripts/`
- OR adjust the `--root` paths to be relative to your current directory

CTI files not renamed

```bash
cd data/01__italy_cti/epw
python rename_cti_regions.py --dry-run  # Preview changes
python rename_cti_regions.py            # Apply changes
```

CTI processing fails

- Ensure `CTI__list__ITA_WeatherStations__All.csv` exists in `data/01__italy_cti/`
- Check that EPW files are named like `AB__AQ__Station.epw` (2-letter region code)

Still slow after precomputation

- Clear the Streamlit cache: check the sidebar for cache controls
- Verify all `*_by_*.parquet` files exist in the data directory
- Check file sizes: regional files should be ~5-10 MB each
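A quick way to spot which aggregate files are missing from the data directory (the helper name is illustrative; the file names match those produced by the 06_ script):

```python
import tempfile
from pathlib import Path

def missing_aggregates(data_dir: str) -> list[str]:
    """Return the names of expected precomputed tables that are absent."""
    required = [
        "daily_stats.parquet",
        "file_stats_by_percentile.parquet",
        "location_stats_by_variant_percentile.parquet",
        "location_deltas_by_variant_pair_percentile.parquet",
    ]
    root = Path(data_dir)
    return [name for name in required if not (root / name).exists()]

# Demo on a throwaway directory containing only daily_stats.parquet
tmp = tempfile.mkdtemp()
(Path(tmp) / "daily_stats.parquet").touch()
missing = missing_aggregates(tmp)
```

Running this against `data/03__italy_all_epw_DBT_streamlit/` would show at a glance whether a re-run of 06_ with `--compute-aggregates` is needed.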
Large file sizes
- The regional format (97 MB for 20 files) is optimal
- Do NOT use `--convert-to-per-station` (creates 554 MB across 4,144 files)
```bash
python data/data_preparation_scripts/05_build_epw_index_and_extract.py --help
python data/data_preparation_scripts/05b_build_cti_epw_index_and_extract.py --help
python data/data_preparation_scripts/06_precompute_derived_stats.py --help
```

- "Preparation Scripts" tab - Complete workflow documentation
- "Code Performance" tab - See timing breakdowns
- "Data Preview" tab - Explore data by region/location
- "Debug Info" tab - Index contents, file info
```bash
# 1. Extract TMYx hourly data (REQUIRED)
python data/data_preparation_scripts/05_build_epw_index_and_extract.py \
    --root data/01__italy_epw_all \
    --root data/02__italy_fwg_outputs \
    --out data/03__italy_all_epw_DBT_streamlit \
    --regional

# 2. Extract CTI data (OPTIONAL)
cd data/01__italy_cti/epw && python rename_cti_regions.py && cd ../../..
python data/data_preparation_scripts/05b_build_cti_epw_index_and_extract.py \
    --root data/01__italy_cti/epw \
    --out data/03__italy_all_epw_DBT_streamlit \
    --regional

# 3. Precompute all statistics (RECOMMENDED)
python data/data_preparation_scripts/06_precompute_derived_stats.py \
    --data-dir data/03__italy_all_epw_DBT_streamlit \
    --process-cti \
    --compute-aggregates

# 4. Launch app
streamlit run app.py
```

Total time: ~20-40 minutes (one-time setup)
App performance: Near-instant (2-5s startup, <0.5s interactions)
- Hourly data: Parquet files with datetime index, columnar storage
- Regional partitioning: 20 files (one per Italian region)
- Compression: Dictionary encoding for repeated values
- Columns: `DBT` (temperature), `rel_path` (file ID), `scenario` (TMYx/RCP/CTI)
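The hourly shape described above can be reproduced for testing: one non-leap year is 8,760 hourly rows. The year and column values below are illustrative:

```python
import pandas as pd

# One non-leap year of hourly timestamps, as stored in the regional parquets
index = pd.date_range("2007-01-01 00:00", periods=8760, freq="h")

# An hourly frame in the documented shape: DBT plus identifier columns
hourly = pd.DataFrame(
    {"DBT": 15.0, "rel_path": "AB/example.epw", "scenario": "TMYx"},
    index=index,
)
```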
AB, BC, CM, ER, FV, LB, LG, LM, LZ, MH, ML, PM, PU, SC, SD, TC, TT, UM, VD, VN
- TMYx: Typical Meteorological Year datasets (baseline climate)
- RCP: Representative Concentration Pathways (future climate scenarios)
- RCP4.5 (2030, 2050, 2070)
- RCP8.5 (2030, 2050, 2080)
- CTI: Comitato Termotecnico Italiano weather stations (110 locations)
When modifying the data pipeline:
- Test with a small subset first (`--limit 10` flag in the 05_ scripts)
- Verify `daily_stats.parquet` includes the `scenario` column
- Run with `--verbose` to see detailed processing info
- Check app startup time to confirm performance improvements
Project developed for EETRA srl SB - Climate Data Analysis and Visualization