Live & Learn – Live and Learn with Winnie (https://livnlearns.com)

Optimizing BigQuery Geospatial Intersects
https://livnlearns.com/optimizing-bigquery-geospatial-intersects/ (Tue, 17 Mar 2026)

When working with geospatial data, intersection operations are fundamental and frequently used to answer questions like “Which points fall within this boundary?” or “Do these two areas overlap?”

When we are not looking for high-precision, millimeter-level intersects, we can often leverage spatial indexing and approximation techniques to dramatically improve performance. The core idea is to reduce the number of expensive full-geometry comparisons by pre-filtering candidates with a simpler, faster method.
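As a toy illustration of that two-phase pattern (all the shapes and names here are invented for the sketch, not from the article), a cheap bounding-box pre-filter discards most candidates before the expensive exact test runs:

```python
def in_bbox(p, bbox):
    """Phase 1 (cheap): axis-aligned bounding-box test."""
    x, y = p
    xmin, ymin, xmax, ymax = bbox
    return xmin <= x <= xmax and ymin <= y <= ymax

def cross(o, a, b):
    """2D cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_triangle(p, tri):
    """Phase 2 (exact): p is inside iff it lies on the same side of all three edges."""
    a, b, c = tri
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return not (min(d1, d2, d3) < 0 and max(d1, d2, d3) > 0)

triangle = [(0.0, 0.0), (4.0, 0.0), (2.0, 4.0)]
bbox = (0.0, 0.0, 4.0, 4.0)  # the triangle's bounding box

points = [(2.0, 1.0), (3.9, 3.9), (10.0, 10.0)]
candidates = [p for p in points if in_bbox(p, bbox)]        # drops (10, 10) cheaply
hits = [p for p in candidates if in_triangle(p, triangle)]  # exact test on survivors only
```

Only the survivors of the cheap test pay for the exact geometry check, which is the same trade spatial indexes make internally at scale.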

Here is an interesting rabbit hole I fell into while doing exactly that in BigQuery (Google’s SQL-based big data warehouse). And you know, if it’s BigQuery, the use case must be quite big too – and yes it was: intersecting terabytes of data.

What I wanted to do was count the points that lie inside a city polygon. BigQuery has its own geospatial analytics library (https://docs.cloud.google.com/bigquery/docs/geospatial-data)
that handles these kinds of operations through ST_ functions, for instance:

WITH Points AS (
    SELECT id, timestamp ts, duration, lat, lon, ST_GEOGPOINT(lon, lat) AS geo_point
    FROM `project-id.dataset_id.daily_stops`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2025-01-01") AND TIMESTAMP("2025-01-02")
),
TargetPolygons AS (
    SELECT city_name, geo AS polygon
    FROM `project-id.dataset_id.city_polygon`
)
SELECT DISTINCT id, ts, duration, lat, lon
FROM Points AS pt
INNER JOIN TargetPolygons AS pl
ON ST_INTERSECTS(pt.geo_point, pl.polygon)

It is based on S2 Geometry (https://s2geometry.io/), which models the Earth as a 3D sphere instead of a 2D surface (it has more of a geodesic worldview than a planar, Euclidean one!)

To make this computationally more efficient, the sphere is further projected onto a cube, which is hierarchically divided into smaller cells. Each cell is assigned a unique integer, and the cells are connected along a Hilbert curve to preserve local relationships; you can read about it here.
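To see that locality-preserving property in isolation, here is a minimal pure-Python Hilbert curve index (the classic d2xy algorithm, written from scratch and independent of S2): stepping through consecutive curve indices always moves to an adjacent grid cell, which is exactly why numerically close cell IDs end up spatially close.

```python
def d2xy(n, d):
    """Map a Hilbert-curve index d to (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:  # rotate the quadrant so the curve stays continuous
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

# Walking the curve: consecutive indices are always neighbouring cells
n = 16
cells = [d2xy(n, d) for d in range(n * n)]
steps = [abs(x2 - x1) + abs(y2 - y1)
         for (x1, y1), (x2, y2) in zip(cells, cells[1:])]
```

Every step in `steps` has Manhattan distance 1, and all 256 cells are visited exactly once.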

What that means is that when I use ST_ functions, I am essentially using an S2 geometry engine under the hood.

You can read more about different grid systems in geospatial analytics here.

The goal was simple: take a table of daily stops (Points) and join it to a table of city boundaries (TargetPolygons).

I went with intuition first: if we want to capture points that are near or inside the city, a bounding box seems computationally cheaper than a complex, jagged polygon.

This was pretty basic, but I fell into the simplicity trap. The intuitive SQL approach, I thought, was to use the BETWEEN operator on the bounding-box limits instead of a polygon intersect:

WITH Points AS (
    SELECT id, timestamp ts, duration, lat, lon, ST_GEOGPOINT(lon, lat) AS geo_point
    FROM `project-id.dataset_id.daily_stops`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2025-01-01") AND TIMESTAMP("2025-01-02")
),
TargetPolygons AS (
    SELECT city_name, geo AS polygon,
    ST_BOUNDINGBOX(geo) AS bounding_box
    FROM `project-id.dataset_id.city_polygon`
)
SELECT DISTINCT id, ts, duration, lat, lon
FROM Points AS pt
INNER JOIN TargetPolygons AS pl
-- ON ST_INTERSECTS(pt.geo_point, pl.polygon)
ON pt.lon BETWEEN pl.bounding_box.xmin AND pl.bounding_box.xmax
AND pt.lat BETWEEN pl.bounding_box.ymin AND pl.bounding_box.ymax

Now, if you work with big joins, you know a Cartesian condition is probably not a good idea – and it wasn’t:

| Method        | Mechanism                        | Complexity       | Slot Time (Sample Data) |
|---------------|----------------------------------|------------------|-------------------------|
| ST_INTERSECTS | S2 Cell Index Lookup (Hash Join) | $O(N)$ (roughly) | 213k ms                 |
| BETWEEN       | Cartesian / Nested Loop          | $O(N \times M)$  | 17.8M ms                |

BigQuery projects geospatial data onto a sphere using the S2 Geometry library. When you use native spatial predicates like ST_INTERSECTS, BigQuery translates the geometries into S2 cell IDs (integers). It then performs an Equality Join (Hash Join) on those integers, which is incredibly fast ($O(N)$ complexity).

When you use BETWEEN, you are forcing an Inequality Join ($a > b$). BigQuery cannot use the S2 spatial index for raw floats. Instead, it falls back to a Cross Join (Cartesian Product), mathematically comparing every single point against every single city’s limits ($O(N \times M)$ complexity).
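The difference is easy to feel even in a toy Python sketch (hypothetical data; BigQuery’s actual S2 join is far more sophisticated): snapping both sides to shared integer cell IDs turns an inequality scan over every pair into an equality probe of a hash map, which is what makes the join roughly linear:

```python
from collections import defaultdict

def cell_id(lat, lon, size=1.0):
    """Snap a coordinate to a coarse integer grid cell (a stand-in for an S2 cell ID)."""
    return (int(lat // size), int(lon // size))

points = [(45.4 + i * 1e-4, -75.7 + i * 1e-4) for i in range(1000)]
boxes = [(45.0, -76.0, 46.0, -75.0),   # covers all the points
         (40.0, -80.0, 41.0, -79.0)]   # covers none

# Nested-loop "BETWEEN" join: every point compared against every box -> O(N*M)
nested = sum(1 for lat, lon in points for (ymin, xmin, ymax, xmax) in boxes
             if ymin <= lat <= ymax and xmin <= lon <= xmax)

# Hash join: index boxes by the integer cells they cover, then probe -> roughly O(N + M)
index = defaultdict(list)
for b, (ymin, xmin, ymax, xmax) in enumerate(boxes):
    for clat in range(int(ymin), int(ymax) + 1):
        for clon in range(int(xmin), int(xmax) + 1):
            index[(clat, clon)].append(b)

hashed = 0
for lat, lon in points:
    for b in index.get(cell_id(lat, lon), []):   # equality lookup, not a scan
        ymin, xmin, ymax, xmax = boxes[b]
        if ymin <= lat <= ymax and xmin <= lon <= xmax:  # exact check on candidates
            hashed += 1
```

Both strategies return the same 1,000 matches, but the hash join only ever inspects the handful of boxes sharing a cell with each point.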

But at the time I couldn’t understand why a bounding box would take longer than a jagged polygon – without realizing I had switched from a hash join to a Cartesian join!

The manual bounding-box query was ~83x more expensive because it bypassed BigQuery’s spatial indexing engine entirely.

With that realized, I tried BigQuery’s built-in bounding-box intersect, ST_INTERSECTSBOX:

WITH Points AS (
    SELECT id, timestamp ts, duration, lat, lon, ST_GEOGPOINT(lon, lat) AS geo_point
    FROM `project-id.dataset_id.daily_stops`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2025-01-01") AND TIMESTAMP("2025-01-02")
),
TargetPolygons AS (
    SELECT city_name, geo AS polygon,
    ST_BOUNDINGBOX(geo) AS bounding_box
    FROM `project-id.dataset_id.city_polygon`
)

SELECT DISTINCT id, ts, duration, lat, lon
FROM Points AS pt
INNER JOIN TargetPolygons AS pl

-- ON ST_INTERSECTS(pt.geo_point, pl.polygon)

-- ON pt.lon BETWEEN pl.bounding_box.xmin AND pl.bounding_box.xmax
-- AND pt.lat BETWEEN pl.bounding_box.ymin AND pl.bounding_box.ymax

ON ST_INTERSECTSBOX(pt.geo_point,
                    pl.bounding_box.xmin, pl.bounding_box.ymin,
                    pl.bounding_box.xmax, pl.bounding_box.ymax)

Surprisingly, this was still taking too long.

BigQuery sees a GEOGRAPHY object on one side (pt.geo_point) and raw float columns on the other (xmin, ymin, etc.). Because it cannot easily build a spatial grid purely on floats, it once again falls back to a Cross Join!

By now it was established that whatever ST_INTERSECTS was doing was undeniably the fastest. Still, there had to be a way to accelerate it further using a bounding box, because all of my experience said a box has to be faster than such a jagged geospatial polygon!

So, the next idea was: what if the bounding box itself were a polygon, in exactly the form ST_INTERSECTS is comfortable with?

WITH Points AS (
    SELECT id, timestamp ts, duration, lat, lon, ST_GEOGPOINT(lon, lat) AS geo_point
    FROM `project-id.dataset_id.daily_stops`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2025-01-01") AND TIMESTAMP("2025-01-02")
),
TargetPolygons AS (
    SELECT city_name, geo AS polygon,
    ST_BOUNDINGBOX(geo) AS bounding_box
    FROM `project-id.dataset_id.city_polygon`
),
TargetEnvelopes AS (
    SELECT
        city_name,
        ST_GEOGFROMTEXT(FORMAT('POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))',
            bounding_box.xmin, bounding_box.ymin,
            bounding_box.xmax, bounding_box.ymin,
            bounding_box.xmax, bounding_box.ymax,
            bounding_box.xmin, bounding_box.ymax,
            bounding_box.xmin, bounding_box.ymin
        )) AS polygon
    FROM TargetPolygons
)
SELECT DISTINCT pt.id, pt.ts, pt.duration, pt.lat, pt.lon, pl.city_name
FROM Points AS pt
INNER JOIN TargetEnvelopes AS pl
ON ST_INTERSECTS(pt.geo_point, pl.polygon)
| Metric            | Failed Manual Box (BETWEEN) | New Constructed Box (ST_GEOGFROMTEXT) | Original Exact Intersect |
|-------------------|-----------------------------|---------------------------------------|--------------------------|
| Duration          | 1 min 5 sec                 | 5 sec                                 | 4 sec                    |
| Compute (Slot ms) | 17,856,403                  | 327,200                               | 213,050                  |

And voilà! This was it – at least it was comparable!

The reason it was still slightly higher is that the query was creating the bounding-box polygon at runtime. Saving TargetEnvelopes ahead of time removed that overhead and helped process 84 GB of data in just 2 seconds! (This was one of the test cases, of course – the actual intersect we had to do was at a much bigger scale. Geo accuracy wasn’t an important factor – we wanted points in and around the area – so this worked like a charm.) It doesn’t sound like much, but it was confusing, and eventually insightful, to see how the BigQuery geospatial engine actually works under the hood. This rabbit hole took about an hour, so I don’t feel too bad about missing something as simple as a Cartesian join ruining my initial optimization!
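For reference, the envelope construction that the FORMAT call performs is just WKT string building. A small Python equivalent (a hypothetical helper, mirroring the same vertex order: a closed ring that starts and ends at the min corner) looks like:

```python
def bbox_to_wkt(xmin, ymin, xmax, ymax):
    """Build a closed WKT polygon ring from bounding-box corners
    (same vertex order as the FORMAT call in the query above)."""
    ring = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]
    return "POLYGON((" + ", ".join(f"{x:.6f} {y:.6f}" for x, y in ring) + "))"

# Hypothetical Ottawa-ish bounding box, lon/lat order as WKT expects
wkt = bbox_to_wkt(-75.8, 45.3, -75.6, 45.5)
```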

Grids – Comparative Analysis of Geohash, H3, and S2
https://livnlearns.com/grids-comparative-analysis-of-geohash-h3-and-s2/ (Tue, 17 Mar 2026)

Bucketing and discretization are important steps in any algorithmic training or analysis. Today we focus on the ways we can map the continuous spherical surface of the Earth into manageable buckets. Why exactly do we do this? To index different locations on Earth and allow for fast geospatial queries, joins, and aggregations. There are three popular geo-indexing systems – Geohash, Uber’s H3, and Google’s S2. All let you map (lat, lon) -> a cell ID, but they differ in what surface they discretize and therefore in what kinds of distortion they introduce.

Geohash

Divides the map into a standard grid, but the physical size of the grid shrinks at the poles

Structure

  • Rectangular cells
  • Recursive subdivision in lat/lon space

Distortion

  • Uniform in degrees (not in meters)
  • Physical width of cells shrinks toward the poles

Indexing

  • uses a Z-order curve for indexing
  • outputs a base-32 string
  • prefix matching – cells sharing a prefix are close together – allows spatial clustering by text match
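The prefix property is easy to verify with a minimal pure-Python encoder (a from-scratch sketch of the standard algorithm, not the pygeohash library): bits of longitude and latitude are interleaved along the Z-order and packed five at a time into base-32 characters, so truncating the string simply coarsens the cell:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=12):
    """Encode (lat, lon) by interleaving lon/lat bisection bits, 5 bits per character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, bit_count, even, out = 0, 0, True, []
    while len(out) < precision:
        if even:  # even bit positions refine longitude
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:     # odd bit positions refine latitude
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = (bits << 1) | bit
        bit_count += 1
        even = not even
        if bit_count == 5:           # flush 5 bits into one base-32 character
            out.append(BASE32[bits])
            bits = bit_count = 0
    return "".join(out)

h = geohash(57.64911, 10.40744, 11)  # classic test vector
```

Truncating the 11-character hash to 5 characters yields exactly the precision-5 hash of the same point, which is what makes prefix range scans work.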

H3

Maps the Earth to an icosahedron (20 faces) covered in hexagons; developed by Uber

Structure

  • Discretizes the sphere using an icosahedron and mostly hexagonal cells
  • requires a small number of pentagons to fill in the gaps
  • mostly uniform cells – great for grid math (neighbors, rings, disks)
  • the distance from a cell’s center to each of its neighbors is identical

Distortion

  • Distortion exists (as its on a sphere), but is spread relatively evenly

Indexing

  • Uses an aperture-7 system (subdividing each cell into 7 smaller parts)
  • Hexagons give strong neighborhood properties (k-rings, local adjacency)

S2

Projects the Earth onto a cube, recursively subdivides the cube faces, and projects the cells back onto the sphere; developed by Google

Structure

  • Each face is recursively subdivided into four child cells (a quadtree), resulting in quadrilateral-ish cells on the sphere

Distortion

  • Bounded and well behaved globally, with better behavior near the poles compared to planar lat-lon grids

Indexing

  • uses a Hilbert space-filling curve
  • great for hierarchy, with predictable parent/child relationships
  • very fast because cells map to 64-bit integers

Size and Precision

Geohash

Resolution Levels

1 to 12 (12 characters total)

Hierarchy: base-32 string – every character adds 5 bits of precision, which means exponential scaling. Z-curve indexing divides each level further.

Level 1 starts with 8 columns (i.e. 360/8 = 45 degrees of longitude) and 4 rows (i.e. 180/4 = 45 degrees of latitude), roughly 5000x5000 km blocks

Neighborhood Geometry: In a rectangular grid, the distance to vertical and horizontal neighbors is 1 unit, but the distance to diagonal neighbors is √2

Precision Breakdown:

  • Level 4 (Metropolitan): ~39 km
  • Level 5 (City): ~5 km
  • Level 6 (Neighborhood): ~1.2 km
  • Level 12 (Maximum Precision): ~3.7 cm by 1.8 cm

H3

Resolution Levels

0 to 15

Hierarchy: Because hexagons cannot be perfectly subdivided into smaller hexagons (hence the aperture-7 system of hexagons + pentagons), H3’s sub-hierarchy is not perfectly contained: child cells slightly overlap parent boundaries

Neighborhood Geometry: The distance from the center of a hexagon to all six of its adjacent neighbors is exactly equal. This makes it mathematically superior for radial calculations and smoothing

Precision Breakdown:

  • Level 0 (Continental): ~1100 km edge length
  • Level 3 (State): ~60 km
  • Level 5 (City): ~8.5 km
  • Level 7 (Neighborhood): ~1.2 km
  • Level 15 (Maximum Precision): ~0.5 m

S2

Resolution Levels

0 to 30

Hierarchy: A perfect quadtree. Every parent cell is mathematically subdivided into exactly 4 child cells, with zero edge overlap.

The Earth is projected onto a cube, whose faces form level 0 (a level-0 cell covers 1/6 of the Earth)

Neighborhood Geometry: Cells are indexed using a continuous Hilbert curve, which avoids the spatial “jumps” seen in the Geohash Z-curve. Because S2 maps directly to 64-bit integers, computational operations are incredibly fast

Precision Breakdown:

  • Level 0 (Continental): ~10,000 km edge length
  • Level 4 (Country): ~576 km
  • Level 10 (City): ~9 km
  • Level 13 (Neighborhood): ~1 km
  • Level 20 (House): ~8 m
  • Level 24 (Surveying): ~55 cm
  • Level 30 (Maximum Precision): ~0.9 cm

Grid Cells

To see how these indexing systems behave in practice, let’s look at a specific coordinate in Ottawa (Lat: 45.428, Lon: -75.718) and map it to a “City” resolution across all three systems.

We choose the following resolution / level for each grid system:

"city": {"geohash": 5, "h3": 6, "s2": 11}

which ranges from 16 – 40 km2 (City Ward / Large District)

While Geohash, H3, and S2 all have a scale that conceptually maps to a “City District” their actual geometric footprints vary wildly:

Size

  • Geohash (Precision 5)
    • Token: f244k (Parent: f244)
    • Centroid: 45.417480, -75.739746
    • Dimensions: 3433.86m x 4891.99m
    • Estimated Area: ~16.8 million m² (16.8 km²)
  • H3 (Resolution 6)
    • Token: 862b83b1fffffff (Parent: 852b83b3fffffff)
    • Centroid: 45.419360, -75.762752
    • Edge Length: 3950.16m
    • Estimated Area: ~39.3 million m² (39.3 km²)
  • S2 (Level 11)
    • Token: 4cce04c (Parent: 4cce05)
    • Centroid: 45.443463, -75.710880
    • Edge Length: 3828.80m
    • Estimated Area: ~16.1 million m² (16.1 km²)

Quantization

Quantization error is the cell-centroid distance: if we put a cell around a (lat, lon), how far is the cell center from that (lat, lon)? We compare this at the smallest scale:

  • Geohash (Precision 12)
    • Token: f244kvzv68zp (Parent: f244kvzv68z)
    • Centroid: 45.428000, -75.718000
    • Dimensions: 0.03m x 0.02m
    • Estimated Area: ~0.0005 m²
  • H3 (Resolution 15)
    • Token: 8f2b83b1d90b96d (Parent: 8e2b83b1d90b96f)
    • Centroid: 45.427996, -75.718002
    • Edge Length: 0.6174m
    • Estimated Area: ~0.9731 m²
  • S2 (Level 30)
    • Token: 4cce04f4d08605cd (Parent: 4cce04f4d08605cc)
    • Centroid: 45.428000, -75.718000
    • Edge Length: 0.0073m
    • Estimated Area: ~0.0001 m²

For the given Lat: 45.428, Lon: -75.718, we’ve computed the distance of the point from the centroid (haversine distance – the details of which you can read in the next section)

the smallest possible quantization error:
Geohash(12) Off-Center Distance: 0.01 meters
H3 (15) Off-Center Distance: 0.52 meters
S2(30) Off-Center Distance: 0.00 meters

city level quantization error (for the cell size displayed in the image):
Geohash(5) Off-Center Distance: 2061.21 meters
H3 (6) Off-Center Distance: 3622.33 meters
S2(11) Off-Center Distance: 1806.92 meters

Notice the massive discrepancy in area. At the city tier, S2 and Geohash yield a similarly sized bounding box (~16 sq km), but H3’s hexagonal resolution jumps to nearly 40 sq km. Furthermore, the centroids of these cells do not perfectly align with the target coordinate. This highlights a crucial rule for geospatial engineering: you cannot blindly swap indexing systems and expect a 1:1 mapping. You must tailor the resolution level to the specific precision required by your dataset.

Distance

Planar

Flat-map distance: given the latitude and longitude of two points, we apply an equirectangular projection and calculate the Euclidean distance.

Point 1: $(\phi_1, \lambda_1)$ , Point 2: $(\phi_2, \lambda_2)$
$\phi$ = latitude (radians), $\lambda$ = longitude (radians), $R$ = Earth radius

Reference point $(\phi_0, \lambda_0)$

Projection Center (reduce distortion):
$\phi_0 = \frac{\phi_1 + \phi_2}{2}$
$\lambda_0 = \frac{\lambda_1 + \lambda_2}{2}$

Equirectangular Projection:

East-west displacement in meters:
$x = R \cdot (\lambda - \lambda_0) \cdot \cos(\phi_0)$
(the $\cos(\phi_0)$ term corrects for the fact that lines of longitude
get closer together toward the poles)

North-south displacement in meters:
$y = R \cdot (\phi - \phi_0)$

Euclidean Distance in Projected Space (m):

$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

Geodesic

Because the Earth is curved, to find the distance between two points specified by latitude and longitude, we employ the haversine formula.

Point 1: $(\phi_1, \lambda_1)$ , Point 2: $(\phi_2, \lambda_2)$
$\phi$ = latitude (radians), $\lambda$ = longitude (radians), $R$ = Earth radius
$\Delta \phi = \phi_2 - \phi_1$, $\Delta \lambda = \lambda_2 - \lambda_1$

Haversine:
$a = \sin^2\left(\frac{\Delta \phi}{2}\right)+ \cos(\phi_1)\cos(\phi_2)\sin^2\left(\frac{\Delta \lambda}{2}\right)$

Angular Distance:
$c = 2 \cdot \operatorname{atan2}\left(\sqrt{a}, \sqrt{1-a}\right)$

Final Distance:
$d = R \cdot c$
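As a concrete sketch (pure Python, with a mean Earth radius of 6,371 km assumed), the two formulas above can be implemented and compared directly. For a short hop like the ~5 km local Ottawa pair they agree to within meters, while the gap grows with scale:

```python
from math import radians, sin, cos, sqrt, atan2

R = 6371000.0  # assumed mean Earth radius, meters

def planar(p1, p2):
    """Equirectangular projection + Euclidean distance."""
    (lat1, lon1), (lat2, lon2) = [(radians(p[0]), radians(p[1])) for p in (p1, p2)]
    phi0 = (lat1 + lat2) / 2            # projection center to reduce distortion
    x = R * (lon2 - lon1) * cos(phi0)   # east-west meters, corrected by cos(phi0)
    y = R * (lat2 - lat1)               # north-south meters
    return sqrt(x * x + y * y)

def haversine(p1, p2):
    """Great-circle (geodesic) distance on a sphere."""
    (lat1, lon1), (lat2, lon2) = [(radians(p[0]), radians(p[1])) for p in (p1, p2)]
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

local = [(45.428, -75.718), (45.47, -75.70)]  # Ottawa, ~5 km
d_geo = haversine(*local)
d_flat = planar(*local)
```

At this scale the two distances differ by well under ten meters; at the cross-country scale the divergence becomes noticeable.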

For a collection of points as follows:

test_pairs = {
    "local": [(45.428, -75.718), (45.47, -75.70)], #(Ottawa ~5km)
    "regional": [(45.428, -75.718), (45.5017, -73.5673)], #(Ottawa–Montreal)
    "cross_country": [(45.428, -75.718), (49.2827, -123.1207)], #(Ottawa–Vancouver)
    "equator": [(0.0, 0.0), (0.0, 10.0)], #(0°, 0° - 0°, 10°)
    "poles": [(70.0, 0.0), (70.0, 60.0)], #(70°, 0° - 70°, 60°)
}

levels = {
    # "neighborhood":{"geohash": 6,"h3": 7,"s2": 13},
    "smallest":{"geohash": 12, "h3": 15, "s2": 30},
    "largest":{"geohash": 1, "h3": 0, "s2": 0}, # geohash 0 is the whole world 
    "state": {"geohash": 3, "h3": 2, "s2": 6}, # 100,000+ km2 (State / Province Level)
    "metro": {"geohash": 4, "h3": 4, "s2": 8}, # 500+ km2 (Wide Metro Area / County Level)
    # "city" :{"geohash": 5, "h3": 5, "s2": 10},
    "city": {"geohash": 5, "h3": 6, "s2": 11}, #16 - 40 km2 (City Ward / Large District) 
    "neighborhood": {"geohash": 6, "h3": 7, "s2": 12}, #5 km2 (Neighborhood),
    "sentinel_scale": {"geohash": 8, "h3": 12, "s2": 20} #10m approx
}

We see that the geodesic and planar distances are noticeably different at large scales because of the Earth’s curvature.

Now let’s compare the error between the haversine distance for the actual points and that for the cell centroids:

The “Local” Distance Anomaly: If you look at the “Local” test scenario (a distance of roughly 5 km) across all three systems, the error hits exactly 100.0000% at the “State” and “Largest” levels. This happens because the physical distance being measured is vastly smaller than the cell itself. If both points fall into the exact same massive cell, the calculated centroid distance is 0, resulting in a 100% error. This is a vital reminder: never use a spatial index resolution that is physically larger than the phenomena you are trying to measure.

Geohash’s Equator Extreme: At the “Largest” resolution level, Geohash struggles massively at the Equator, showing a staggering 314.0962% error. By comparison, H3 at the Equator for the same level shows a 64.7221% error. Geohash’s rectangular planar projection simply cannot handle extreme longitudinal distances reliably at a low resolution.

S2’s Rigorous Bounding: Look at S2’s “Largest” row: it hits a flat 100.0000% error across all scenarios. Because S2 Level 0 cells cover exactly one-sixth of the Earth’s surface, almost all of our test pairs snapped to the same massive cell centroid. However, as soon as you step down to the “Metro” and “City” levels, S2’s errors drop rapidly and consistently.

Convergence at the Sentinel Scale: By the time we reach the “Sentinel_Scale” (roughly 10-meter resolution, perfect for satellite imagery) and the “Smallest” scale, all three systems converge to near 0.0000% error, regardless of whether you are tracking a cross-country flight or a localized grid outage

The following is how the ‘city’-level cells look over a regional distance.

Intersect

By intersect we mean geofencing and the point-in-polygon problem. In most geospatial problems, we will constantly be checking whether telemetry coordinates fall within a specific boundary. For that, we can either do the exact mathematical calculation (which is computationally expensive at scale) or check whether the point belongs to a pre-computed set of grid cells (which is blazing fast but introduces edge-case errors).

When determining if a point falls inside a complex polygon, we use these test cases (firmly inside the polygon, riding the borderline, and just outside) and compare the exact geometric calculation (using the Shapely library) against native grid set lookups across Geohash, H3, and S2 at the “Neighborhood/City” scale.

geofence_poly = [ #downtown ottawa - bounding box
    [45.430, -75.715],
    [45.432, -75.695],
    [45.418, -75.690],
    [45.415, -75.705],
    [45.430, -75.715] 
]

test_points = {
    "Inside": (45.422, -75.700),
    "Borderline": (45.430, -75.710),
    "Outside": (45.410, -75.720)
}
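The exact geometric check (done with the Shapely library in this comparison) can be sketched in pure Python with the classic ray-casting algorithm. Run against the same geofence and test points, it gives inside and borderline as contained and the outside point as not:

```python
def point_in_polygon(lat, lon, ring):
    """Ray-casting containment test; ring is a list of [lat, lon] vertices
    (a duplicated closing vertex is tolerated)."""
    pts = [(v[1], v[0]) for v in ring]  # work in (x, y) = (lon, lat)
    if pts[0] == pts[-1]:
        pts = pts[:-1]
    x, y = lon, lat
    inside = False
    # Count crossings of a horizontal ray to the left of the point
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

geofence_poly = [  # the downtown Ottawa geofence from above
    [45.430, -75.715], [45.432, -75.695], [45.418, -75.690],
    [45.415, -75.705], [45.430, -75.715],
]
results = {name: point_in_polygon(lat, lon, geofence_poly)
           for name, (lat, lon) in {"Inside": (45.422, -75.700),
                                    "Borderline": (45.430, -75.710),
                                    "Outside": (45.410, -75.720)}.items()}
```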
Containment results:

| Point_Type | Lat_Lon       | Exact_Contains | GEOHASH_In_Set | H3_In_Set | S2_In_Set | GEOHASH_Intersects | H3_Intersects | S2_Intersects |
|------------|---------------|----------------|----------------|-----------|-----------|--------------------|---------------|---------------|
| Inside     | 45.422, -75.7 | True           | True           | True      | True      | True               | True          | True          |
| Borderline | 45.43, -75.71 | True           | True           | False     | True      | True               | True          | True          |
| Outside    | 45.41, -75.72 | False          | False          | False     | True      | False              | False         | True          |

Timings (ms):

| Point_Type | Exact  | GEOHASH_Set | H3_Set | S2_Set | GEOHASH_Intersect | H3_Intersect | S2_Intersect |
|------------|--------|-------------|--------|--------|-------------------|--------------|--------------|
| Inside     | 0.0193 | 0.0059      | 0.0031 | 0.0143 | 0.0065            | 0.0149       | 0.0054       |
| Borderline | 0.0053 | 0.0054      | 0.0018 | 0.0142 | 0.0046            | 0.0063       | 0.0050       |
| Outside    | 0.0500 | 0.0065      | 0.0030 | 0.0160 | 0.0064            | 0.0068       | 0.0082       |

Where “In Set” is still a mathematical intersects (using Shapely) against the cell polygon, and “Intersects” is the grid system’s native intersect function.

The Inside Point: All three systems, as well as the exact calculation, correctly identified that the point was inside the polygon. No surprises here.

The Borderline Point: The exact math confirmed the point was inside. Geohash and S2 successfully flagged it via set lookup. However, H3’s native set lookup returned False. Because hexagons don’t perfectly conform to sharp, straight polygon edges, the H3 polyfill algorithm left a slight gap at the border, resulting in a false negative.

The Outside Point: The exact math correctly flagged the point as outside. Geohash and H3 agreed. However, S2 returned True for both the set lookup and dynamic intersection. Because the S2 cell at this specific resolution level was large enough to overlap the polygon boundary while also swallowing the outside point, it yielded a false positive.

Now you ask if grids introduce false positives and negatives, why use them at all? Speed!

  • Exact Geometry (Shapely): Ranged from 0.0053 ms to 0.0500 ms. Exact math is highly variable because the computational cost increases depending on how complex the polygon is and where the point lies relative to its edges. This was a simple bounding box, but geo polygons can be and usually are very detailed
  • Geohash Set Lookup: Hovered around ~0.0054 to 0.0065 ms.
  • S2 Set Lookup: Ranged from ~0.0142 to 0.0160 ms.
  • H3 Set Lookup: Blew the competition away at 0.0018 to 0.0031 ms

(and we can always recover accuracy by adjusting resolution and/or employing hierarchical checks)

Neighborhood

In geospatial analytics, we rarely look at a single cell in isolation.

Not just in optimal drive-time solutions, but even in Earth observation problems, we apply smoothing filters, calculate local density, or model the spread of phenomena (like a flood or a signal footprint) by radiating outward. This requires pulling a cell’s immediate neighbors.

Based on our analysis at the “Neighborhood” scale (roughly 1 to 5 km²), here is how Geohash, H3, and S2 handle spatial adjacency

Geohash

Geohash is computationally simple, but library support for adjacency can be lacking. For instance, the pygeohash library does not have an inherent function to fetch neighbors.

  • The Method: To find neighbors, we had to manually calculate the latitude and longitude error margins (the dimensions of the cell) and step outward to create a 3×3 grid around the center token

  • The Result: You get 8 neighbors (sharing edges and corners). However, because Geohash relies on a Z-order curve, cells that are physically adjacent might have wildly different string prefixes, making database range queries for neighbors highly inefficient

H3

Uber’s H3 system was made for this. Hexagons are the gold standard for neighborhood traversal because the distance from the center of a cell to all of its neighbors is identical.

  • The Method: H3 uses a built-in grid_disk(center, k=1) function, which efficiently returns the center cell plus its immediate ring of neighbors

  • The Result: You get exactly 6 perfect neighbors. If you are building a Convolutional Neural Network (CNN) or a Graph Neural Network (GNN) that requires pooling data from adjacent spatial regions, H3’s uniform k-ring expansion makes the math incredibly clean

S2

S2 handles neighbors robustly, leveraging its 64-bit integer indexing and Hilbert curve.

  • The Method: S2 provides a native get_all_neighbors(level) function

  • The Result: It returns 8 neighbors in total: 4 edge-sharing neighbors and 4 vertex-sharing (corner) neighbors. Because S2 avoids the severe spatial jumps of Geohash’s Z-curve, fetching these neighbors is computationally fast, though the variable distance to corner neighbors makes it slightly less ideal for radial smoothing than H3

Hierarchy & Coverage

We’ve already established that how these systems group cells (hierarchy) and how they fill a shape (coverage) are fundamentally different. Now we tackle the problem of indexing a geographic boundary in order to retrieve the data inside it. Here we see how, at different hierarchies and resolutions, different grid cells cover a geographic area.

Geohash

Hierarchy: Geohash’s hierarchy is incredibly simple for databases. If your cell is f244k, its parent is f244. You just chop off the last character. It is fast for text-based NoSQL scans, but because of the Z-curve, a parent cell might contain child cells that are physically distorted or seemingly disjointed

Coverage: Geohash lacks a native, sophisticated region coverer in most standard Python libraries. To cover a polygon, we had to manually generate a bounding box and iterate through coordinates. This brute force method almost always results in high over coverage at the polygon’s edges because rectangular cells do not hug diagonal borders well

H3

Hierarchy: H3 uses an “aperture 7” hierarchy, meaning one parent hexagon roughly contains seven child hexagons. However, geometry dictates that hexagons cannot be perfectly subdivided into smaller hexagons. This means H3’s hierarchy is not perfectly nested, child cells will slightly overlap the boundaries of their parent

Coverage: H3 offers a native polygon_to_cells polyfill function. Because hexagons approximate organic, curved, and diagonal shapes much better than rectangles, H3 provides a very tight footprint with minimal over-coverage, saving compute on downstream queries

S2

Hierarchy: S2 is mathematically flawless here. It uses a pure quadtree hierarchy where exactly 4 child cells fit perfectly inside 1 parent cell with zero boundary overlap. For hierarchical data aggregations, S2 is vastly superior

Coverage: S2 is purpose-built for this. It has a native RegionCoverer() class that allows you to specify a min_level, max_level, and even a maximum cell count. It mixes large parent cells in the center of the polygon and small child cells at the complex borders, optimizing both accuracy and database storage

You can access the code on GitHub.

Flood Prediction I – Nose-diving into Data Sources
https://livnlearns.com/flood-prediction-i-nose-diving-into-data-sources/ (Mon, 23 Feb 2026)

Code for this data extraction pipeline is available on my GitHub.

We live in such a dynamic world, the volatility and magnanimity of which we might not notice in our daily mundane lives. Some of us, glued to screens indoors be it rain or shine, are even more disconnected from the reality outside.

I remember having this euphoric, scary, anxious, relieved feeling watching a lightning storm from inside my house – a feeling which, believe it or not, was popular enough to get its own name in pop culture (google the meaning of “chrysalism”).

But out there, in the face of nature’s wrath, humans are insignificant. Over the years, we have advanced our methodologies, systems, and technology to better predict and manage events when nature refuses to cooperate with human development. Yet, despite our best efforts, there are events we can’t predict, can’t predict early enough, and can’t react to fast enough. The result is the inevitable damage to infrastructure, agriculture and even lives – be it human, animal, or plant – uprooting entire ecosystems at times!

We humans are insignificant in the face of one wild beast, much less the unprecedented fury of nature, and effective disaster management education, system and policy doesn’t even exist in all places. Be it wildfires, floods, earthquakes, tsunamis, volcanoes, landslides, avalanches, hurricanes – we hear something or the other in the news and once again get reminded of how powerless we still are in the face of a natural disaster.

I personally believe disaster management should not just be an academic exercise for a select few, but a humanitarian necessity. So let’s start this flood prediction project as my humble attempt at trying something, hoping the end result gives us all a better system in some respect, or at least a better understanding that might help things down the line, or help someone very smart somewhere out there.

Also, I want my time, and whatever I have learned, to be spent on something useful. This aligns with my life’s mission of trying to make and leave the Earth a better place while we’re on it, a place where life can prosper in harmony. We’ve come so far: unbelievable advances in science and technology over centuries of human effort, and billions of years of biological and microbial effort to get us to the diversity and abundance of flora and fauna we see around us, including us. So we humans, with our technology, have a duty, even if it’s just out of courtesy, to protect this planet called Earth, for us and generations to come, and to let life prosper throughout this universe. We owe it to all the earthen life that kept going, kept fighting, kept evolving for billions of years.
We’re probably like the immune cells of Earth’s ecosystem – under contract to do our part for the good of the system, be it preventing harm, fighting asteroids, or rebuilding civilizations – and I don’t want to be a catastrophic neutrophil, no matter how passionate they are (and yes, I’m currently reading Immune by Kurzgesagt).

I wanted to explore Physics-Informed Neural Networks in the flood prediction domain, but before we discuss models and methodologies (I’m currently looking into what already exists in the space as well), first steps first: let’s get to know the data.

In my example here, we’re extracting data for the Ottawa–Gatineau 2019 floods. This is not necessarily the data I was aiming for in my model, but I struggled to find decent global coverage for a lot of events, and I also lacked some knowledge of terrain, urban, and climatic influences. So, to demonstrate the data acquisition, we start at home base.

Also, to preface: I made use of GEE (Google Earth Engine) wherever possible, because it simplified the whole process of getting data.
When you’re working with multi-modal data, this helps even more. Instead of authenticating a million APIs, learning their file systems and filtering, and still getting massive chunks of data and frying your local machine trying to manually resample and interpolate the pixels, GEE lets you do all that on their servers via a Python library, which has a very generous free tier for non-commercial use. If you’re familiar with GIS tools, you probably know a lot of tools (ArcGIS etc.) that do this but are quite slow and quite expensive.

Spatial Boundaries

Before extracting the pixel data, we need a spatial bounding box to extract data for. For this, I used the Global Administrative Unit Layers (GAUL), compiled by the Food and Agriculture Organization (FAO):

  • Level 0 – Country
  • Level 1 – States / Provinces
  • Level 2 – Counties/Districts

Available through GEE (Google Earth Engine) – 2015 version

Earlier I was using GADM (Global Administrative Areas), filtering on level 1 (states) and level 2 (districts/counties), but I found GAUL more reliable and accessible through GEE itself.

Using GEE, I queried the Level 2 administrative boundary for Gatineau, Canada, and simplified the geometry to a 100m resolution. This reduces computational overhead while giving our arrays a precise mathematical boundary to clip against.

Here is how it looked:

Sentinel-1

If we want to observe a flood from space, the first instinct might be to look at a photograph, i.e. data from an optical satellite like Sentinel-2. There is just one problem: floods happen during storms, and storms mean thick, impenetrable cloud cover. If we rely on optical data, we will mostly just get highly detailed pictures of white clouds lol

To actually see the water, we have to rely on different physics. We need what is called Synthetic Aperture Radar (SAR), which uses microwaves instead of visible light; the name comes from the technology that makes this sort of scan possible.
To explain briefly: microwave wavelengths are roughly a thousand times longer than visible light's, so to get an image resolution of, say, 10m (the Sentinel satellite's resolution) from an orbit around 700km up, you would need an antenna kilometers across. The technological workaround is to use the satellite's own motion: it sends thousands of microwave pulses as it flies, and since they're microwaves they pass through the clouds, reflect off the surface, and are recorded. The echoes are then corrected for Doppler shifts and combined using Fourier transforms, effectively synthesizing a much larger antenna.

In our case, Sentinel-1, operated by the European Space Agency (ESA), is a polar-orbiting satellite that actively sends microwave pulses down to Earth. Because microwaves have much longer wavelengths than visible light, they pass right through clouds and heavy rain. When these pulses hit the ground, they scatter, and the satellite measures the "backscatter" that returns.

Here is where the physics of fluids and flat surfaces comes in: water acts like a flat mirror to microwaves. When the radar hits a flooded area, the energy bounces away from the satellite, resulting in a very low backscatter signal. In our data, floodwaters appear distinctly dark.
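That "dark water" signature can be turned into a first-pass detector with nothing more than a threshold on the dB values. A minimal numpy sketch; the -16 dB cutoff and the toy scene are illustrative assumptions, not calibrated values:

```python
import numpy as np

def water_mask_from_vv(vv_db, threshold_db=-16.0):
    """Flag pixels whose VV backscatter (in dB) falls below a cutoff.

    Smooth open water mirrors the pulse away from the sensor, so it
    returns far less energy than rough land; -16 dB is illustrative."""
    return vv_db < threshold_db

# toy scene: land around -8 dB, a 2x2 flooded patch around -20 dB
scene = np.full((4, 4), -8.0)
scene[1:3, 1:3] = -20.0
mask = water_mask_from_vv(scene)
```

In practice the cutoff would be derived from the data itself (e.g. with Otsu's method, mentioned in the TODOs) rather than hard-coded.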

For this project, I pulled Sentinel-1 Interferometric Wide (IW) swath data using two specific polarizations:

  • VV (Vertical Transmit, Vertical Receive): Highly sensitive to surface roughness, making it our primary tool for detecting open water. Because of surface scattering, open floodwaters act like a flat mirror: the vertical radar waves bounce away from the satellite, making the water appear distinctly dark. Looking at the VV histogram, we see a clean, bimodal (two-peak) distribution. The distinct peak on the left (lower dB values) represents the water, while the larger peak on the right represents the rougher, brighter land.
  • VH (Vertical Transmit, Horizontal Receive): Sensitive to volume scattering, which helps us distinguish between flooded vegetation and clean open water.
    It is noticeably noisier and darker overall because VH relies on depolarization: the radar wave must hit a complex, 3D structure (like a tree canopy or a building) and bounce around inside it to return a horizontal signal. Its histogram is different as well: the entire distribution is shifted left toward lower decibel ranges, and the split between water and land can be much blurrier.

The Environmental Stack

To predict a flood, the model needs to understand the landscape just as well as the water does. Water flows downhill, pools in basins, and absorbs differently depending on the ground beneath it. To capture this physical reality, I built a single, multi-dimensional tensor out of four distinct environmental layers: a Digital Elevation Model (DEM), Height Above Nearest Drainage (HAND), JRC historical surface water, and ESA WorldCover. This is where GEE shines.

Elevation

Digital Elevation Model (DEM): Sourced from NASA’s SRTM, giving us absolute height. Water flows downhill, making elevation the most fundamental constraint.
The Shuttle Radar Topography Mission (SRTM) was a 2000 NASA mission in which the shuttle Endeavour mapped 80% of the Earth; the GEE dataset we use here is USGS/SRTMGL1_003.

Slope

Derived from the DEM, this tells the model about terrain steepness: steep slopes mean rapid runoff, and flat areas mean pooling.

Relative Elevation

Height Above Nearest Drainage (HAND): Relative elevation is critical. HAND normalizes topography based on the local drainage network, directly highlighting low-lying floodplains. The GEE dataset is gena/GlobalHAND/30m/hand-1000; since it is derived from the SRTM DEM, it also reflects the year 2000.

Water Bodies

JRC Global Surface Water (Occurrence): To predict a flood, we must know what isn’t a flood. We use this historical map to mask out permanent water bodies (like existing rivers) so the model learns to identify anomalous flooding.
The Joint Research Centre is a European initiative compiling a roughly 35-year summary of global surface water (1984-present); it can be accessed in GEE through JRC/GSW1_4/GlobalSurfaceWater.
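Once the JRC occurrence layer is on the same grid as a detected flood mask, the masking step described above is simple arithmetic. A hedged numpy sketch; the 80% occurrence cutoff is an illustrative assumption:

```python
import numpy as np

def mask_permanent_water(flood_mask, occurrence, occ_threshold=80):
    """Keep only anomalous water: drop pixels that the JRC occurrence
    layer (0-100%) marks as water most of the time.

    The 80% cutoff is an illustrative assumption, not a tuned value."""
    permanent = occurrence >= occ_threshold
    return flood_mask & ~permanent

# toy 2x2 scene: detected water vs historical occurrence (%)
flood = np.array([[True, True], [False, True]])
occ = np.array([[95, 10], [0, 50]])
anomalous = mask_permanent_water(flood, occ)  # the 95%-occurrence pixel is dropped
```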

Land Use

ESA WorldCover: Finally, we classify the surface material (urban, agriculture, forest) to implicitly account for different fluid friction and absorption coefficients. It's a global map by the European Space Agency built from Sentinel-1 and Sentinel-2 data, and can be accessed in GEE through ESA/WorldCover/v200, which is the 2021 version.

I could do all that with just this:

import ee

def get_environment_stack_gee(roi):
    dem = ee.Image("USGS/SRTMGL1_003").clip(roi).rename('elevation')
    slope = ee.Terrain.slope(dem).rename('slope')
    
    hand = ee.Image("users/gena/GlobalHAND/30m/hand-1000").clip(roi).rename('hand')
    
    water_occ = ee.Image("JRC/GSW1_4/GlobalSurfaceWater").select('occurrence').clip(roi).rename('water_occ')
    
    landcover = ee.ImageCollection("ESA/WorldCover/v200").first().clip(roi).rename('landcover')

    # Stack all layers into one image; resampling to S1's 10m grid
    # happens at export time via the `scale` argument below
    return ee.Image.cat([dem, slope, hand, water_occ, landcover]).float()

def export_env_stack(roi, region_name, clean_region_name):
    stack = get_environment_stack_gee(roi)
    
    task = ee.batch.Export.image.toDrive(
        image=stack,
        description=f"{clean_region_name}_Env_Stack",
        folder="Flood_Project_Data",
        fileNamePrefix=f"ENV_{clean_region_name}_env_stack",
        region=roi,
        scale=CONFIG['PARAMS']['s1_resolution'],  # 10m, matching Sentinel-1 (CONFIG is project config)
        crs='EPSG:4326',
        maxPixels=1e9
    )
    task.start()
    print(f"Environmental stack export started for {clean_region_name}")

This looks somewhat like:

The Forcing Function – The Weather Data

A landscape doesn't flood itself; it requires an atmospheric forcing function. To give the model the cause-and-effect relationship, I extracted 1km-resolution weather grids covering the weeks leading up to and during the flood, ingesting daily precipitation (how much water is falling) and temperature (critical for identifying spring snowmelt, in the Canadian context at least). Temperature is also usually one of the most indicative variables for weather anomalies, seasonality, and patterns.

Weather

For temperature and weather in general, the gold standard in ML is ERA5 🙂
It's ECMWF's historic reanalysis dataset. The best part is the general consistency, the variables, and the lookback (1940 to present!), not that we are planning to use all the variables or that whole time range. Of course other datasets exist, and other models might be more accurate than HRES forecasts for your specific locale and some weather patterns, but there are pros and cons. If it helps make the case: Google trained GraphCast on ERA5 as well. (But yes, please await my weather datasets blog, where I dive into weather data in more detail!)
ERA5's resolution is ~30km.
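Temporally aligning this forcing data with a flood event mostly comes down to slicing a lookback window before each satellite flyover. A minimal pandas sketch; the column names (`date`, `precip_mm`) and the window length are made up for illustration:

```python
import pandas as pd

def weather_window(df, flyover, days=14):
    """Return the `days` of daily weather up to and including a flyover date."""
    flyover = pd.Timestamp(flyover)
    start = flyover - pd.Timedelta(days=days - 1)
    return df[(df["date"] >= start) & (df["date"] <= flyover)]

# toy daily record; column names are hypothetical
daily = pd.DataFrame({
    "date": pd.date_range("2019-04-01", periods=30, freq="D"),
    "precip_mm": range(30),
})
window = weather_window(daily, "2019-04-20", days=7)  # 2019-04-14 .. 2019-04-20
```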

Other

We also use SMAP (Soil Moisture Active Passive) to estimate the moisture content of the soil. This is a global dataset that estimates moisture in the top 5cm of soil using microwaves (SMOS and GLDAS are some other options as well).

GPM IMERG (Global Precipitation Measurement) is global, updated every 30 minutes, combining radar and microwave observations at ~10km spatial resolution (way coarser than S1's 10m).
Alternatives: CHIRPS (long-term record), NEXRAD, MRMS (North America only).

Hydrography

Global Hydrography (MERIT Hydro): While a DEM tells us the elevation of a single point, water dynamics require us to know how those points connect. For this, the hydrology community relies on datasets like MERIT Hydro, which explicitly maps flow direction and flow accumulation. It tells the model not just that water flows downhill, but which river channel the water will eventually end up in. It has about 90m resolution; alternatives like NHDPlus exist, but that primarily covers the States.
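To make "how those points connect" concrete, here is a toy D8-style flow-direction routine: each cell points toward its steepest-descent neighbor. For simplicity this ignores the diagonal distance weighting a real D8 implementation uses, and it is not how MERIT Hydro itself is computed; it is just a sketch of the idea:

```python
import numpy as np

def d8_flow_direction(dem):
    """For each cell, return the (dy, dx) step toward its lowest neighbor,
    or (0, 0) for pits and flats. Toy version: no diagonal distance weighting."""
    h, w = dem.shape
    direction = np.zeros((h, w, 2), dtype=int)
    for y in range(h):
        for x in range(w):
            best_drop, best = 0.0, (0, 0)
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (dy, dx) == (0, 0) or not (0 <= ny < h and 0 <= nx < w):
                        continue
                    drop = dem[y, x] - dem[ny, nx]
                    if drop > best_drop:
                        best_drop, best = drop, (dy, dx)
            direction[y, x] = best
    return direction

# a 3x3 bowl: every border cell should drain toward the centre pit
dem = np.array([[3.0, 3.0, 3.0],
                [3.0, 2.0, 3.0],
                [3.0, 3.0, 3.0]])
dirs = d8_flow_direction(dem)
```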

Optional – MODIS (The Optical Backup)

While Sentinel-1 SAR is our primary "see-through-clouds" tool, NASA's MODIS (Moderate Resolution Imaging Spectroradiometer) aboard the Terra and Aqua satellites remains a workhorse of Earth observation. MODIS images the entire Earth every 1 to 2 days. While it is blinded by the storm clouds during the peak of a disaster, its high temporal frequency makes it incredibly useful for tracking the long-term duration of a flood once the skies clear, serving as a powerful secondary validation layer for our model's predictions. It is a bit coarse, with one pixel corresponding to 250-500m.

The Operational Ecosystem – EMS, GFM, GloFAS

These are some benchmarks / potential labels if we were to train a prediction model

  • Copernicus EMS (Emergency Management Service): Run by the European Union, EMS is the rapid-response paramedic of the satellite world. When a disaster strikes, authorized users (like civil protection agencies) can trigger EMS. They rapidly task satellites to acquire imagery and produce standardized flood extent and damage assessment maps within hours to guide rescue teams on the ground.

    this is a gold label dataset – high resolution manually verified labels

    The challenge is that it only covers selected major disasters; it has to be specifically requested by a government.

    Also, it can be based on imagery from multiple satellites.

  • GFM (Global Flood Monitoring): Part of the Copernicus operational suite, GFM is an automated system that continuously ingests every single incoming Sentinel-1 SAR image globally. It runs ensemble algorithms to detect flooded pixels in near real-time, providing a constant, automated heartbeat of global water anomalies.

    The good part is that it is 24/7, covers the entire globe, and runs on Sentinel-1, but an important consideration is that GFM itself is the output of another model.


  • GloFAS (Global Flood Awareness System): While EMS and GFM observe the present, GloFAS predicts the future. It is a hydrological forecasting system that takes weather forecasts and routes them through massive river network models. It can issue early warnings of high river flows days or even weeks in advance.

Bringing it Together

This enables us to essentially construct a multi-modal digital twin of the region during a moment of crisis. Think about what sits inside these tensors:
We have the absolute constraints of the Earth's surface (DEM, Slope, HAND), the historical memory of the rivers (JRC, MERIT Hydro), the physical friction of the landscape (ESA WorldCover), the atmospheric fury that caused the event (weather data), and the exact microwave reflection of the floodwaters as seen from 700 kilometers in space (Sentinel-1 SAR).

But right now, this data is raw. The Sentinel-1 SAR imagery is riddled with radar speckle noise, the weather data needs to be temporally aligned with satellite flyovers, and our environmental stacks need to be perfectly normalized so our neural network doesn't mathematically prioritize a 1000-meter elevation over a 10-millimeter rainfall.
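The scale mismatch described above is typically handled with per-channel standardization before the stack ever reaches the network. A minimal numpy sketch (the channel contents are made up for illustration):

```python
import numpy as np

def standardize_channels(stack, eps=1e-8):
    """Z-score each channel of a (C, H, W) stack independently, so metres
    of elevation and millimetres of rain end up on comparable scales."""
    mean = stack.mean(axis=(1, 2), keepdims=True)
    std = stack.std(axis=(1, 2), keepdims=True)
    return (stack - mean) / (std + eps)

rng = np.random.default_rng(0)
stack = np.stack([
    rng.normal(1000, 200, size=(8, 8)),  # "elevation" in metres
    rng.normal(10, 2, size=(8, 8)),      # "rainfall" in millimetres
])
z = standardize_channels(stack)  # both channels now roughly zero-mean, unit-variance
```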

We now have the physical ground truth. In Part II, we will look at how to clean this data, process the signal noise, and prep the tensors to feed our predictive model.

Right now, the next TODOs as I see them are:

  • Clean the SAR speckle noise (Lee/Refined Lee filters?)
  • Ground Truth – Otsu’s thresholding to the VV band to create our binary “Ground Truth” water masks – compare with GFM
  • Align and normalize the spatial temporal tensors so they are ready to be ingested by a PyTorch DataLoader
  • Once we have created and integrated the data prep pipeline, maybe we will verify with another well-known event also mapped by EMS
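The Otsu step in the TODO list above can be sketched in plain numpy: pick the cut that maximizes the between-class variance of the histogram, which is exactly what separates the two peaks of a bimodal VV distribution. The toy dB values below are made up for illustration:

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Pick the cut that maximizes between-class variance, splitting
    a bimodal distribution (e.g. water vs land backscatter in dB)."""
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    w = hist / hist.sum()
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = w[:i].sum(), w[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (w[:i] * centers[:i]).sum() / w0   # mean of the "water" class
        mu1 = (w[i:] * centers[i:]).sum() / w1   # mean of the "land" class
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[i - 1]
    return best_t

# toy bimodal "VV backscatter" in dB: water near -20, land near -8
rng = np.random.default_rng(1)
vals = np.concatenate([rng.normal(-20, 1, 500), rng.normal(-8, 1, 1500)])
t = otsu_threshold(vals)  # lands in the valley between the two peaks
```

Libraries like scikit-image ship a tuned version of this, but the brute-force form above is enough to see what the method does.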

Code for this data extraction pipeline is available on my GitHub.

Beyond Min-Max: The Art of Continuum Normalization for Stellar Spectroscopy
https://livnlearns.com/beyond-min-max-the-art-of-continuum-normalization-for-stellar-spectroscopy/ Fri, 23 Jan 2026 05:30:53 +0000

I ended up writing this article after digging into a tiny rabbit hole of stellar spectra normalization (a project I picked up on a weekend and that was supposed to end that weekend).

Normalization is the process of transforming data to a 'standard scale' while keeping the relative information intact (i.e. without distorting the differences in values or losing critical information).

This could mean scaling values between 0 and 1, centering them around 0, or adjusting them relative to some baseline (you decide the standard scale). In the context of spectra, it often means removing the baseline ‘continuum’ so the sharp features (like absorption/emission lines) stand out clearly.
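The first two options mentioned above, scaling into [0, 1] and centering around 0, look like this in numpy:

```python
import numpy as np

def min_max(x):
    """Scale values into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Center around 0 with unit standard deviation."""
    return (x - x.mean()) / x.std()

x = np.array([2.0, 4.0, 6.0, 8.0])
scaled = min_max(x)    # [0, 1/3, 2/3, 1]
centered = z_score(x)  # mean 0, std 1
```

The baseline-removal flavour is what the rest of this post is about.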

Now, why is this important? Having a standard scale matters for visually interpreting and comparing things; some metrics and methods assume certain distributions and scales; and if we're training a model, standardizing data introduces numerical stability, helps gradient convergence, and can even help better identify features and outliers. The right normalization technique can significantly improve signal clarity and model performance.

This article was originally part of this Normalization post but I separated it into its own – as these details seem worth diving deeper into.

Dataset

In this example, we’re working with resampled spectral data from the Sloan Digital Sky Survey (SDSS), mapped to a common wavelength grid from 3800 to 9200 Angstroms across 3600 bins. This preprocessing ensures consistent input size and resolution for comparative and statistical analysis.
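Resampling each spectrum onto that shared grid can be as simple as a linear interpolation; a minimal sketch, where np.interp stands in for whatever resampler the actual pipeline uses:

```python
import numpy as np

# the shared grid from the dataset description: 3800-9200 Angstroms, 3600 bins
COMMON_GRID = np.linspace(3800, 9200, 3600)

def resample_spectrum(wave, flux, grid=COMMON_GRID):
    """Linearly interpolate one spectrum onto the common wavelength grid."""
    return np.interp(grid, wave, flux)

# toy spectrum on its own, wider, finer grid
wave = np.linspace(3700, 9300, 5000)
flux = 1.0 + 0.1 * np.sin(wave / 500.0)
resampled = resample_spectrum(wave, flux)
```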

Spectra are rich, continuous, structured data: they capture physical information about stars (like temperature, composition, and velocity) by measuring how light intensity varies with wavelength. They're messy! Different days, different instruments, different atmospheres, different stars, and interstellar objects can create all sorts of variations and noise.

Why this data? Simple: because I was already working with it, and it's a good use case because it's continuous and messy and requires a lot of preprocessing, it has physical meaning, and it is sensitive to accuracy: we need to align specific spectral lines to fingerprint a star! You can read more about the topic here or in my upcoming article on the project (some initial code for the whole project here).

Our spectral data could look like this:

Where each of the above images represents the spectrum of a star (one sample of our data).

Overall, our sample data consists of ~3200 stars and is distributed as:

Now lets proceed to the different normalization techniques!

Continuum Normalization Techniques

You can find a brief recap and summary of normalization here, lets jump into star spectrum normalization!

While flux normalization scales spectra globally, continuum normalization removes the baseline trend (continuum) to highlight relative features like absorption or emission lines. This is crucial in stellar classification, abundance analysis, and ML model performance on astrophysical spectra.
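A tiny synthetic example makes the difference concrete: build a sloped continuum with a single absorption dip, divide by the continuum, and the baseline flattens to ~1 while the line survives. Here the continuum is known by construction; the whole point of the methods that follow is to estimate it:

```python
import numpy as np

wave = np.linspace(3800, 9200, 3600)
continuum = 100.0 + 0.01 * (wave - 3800)                  # sloped baseline
line = 1.0 - 0.5 * np.exp(-((wave - 6563.0) / 5.0) ** 2)  # H-alpha-like dip
flux = continuum * line

# divide out the (here, known) continuum: baseline -> ~1, dip survives
norm = flux / continuum
```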

Savitzky Golay Filter

Method: Applies a polynomial smoothing filter (a low-degree polynomial over a window) to estimate the continuum; we then divide the original spectrum by this continuum to get a normalized spectrum.

Library:

SciPy has an implementation which can be applied directly, as simply as:

from scipy.signal import savgol_filter
continuum = savgol_filter(flux, window_length=151, polyorder=3)
norm_flux = flux / continuum

Implementation Overview:

  • This function auto-selects a window if one isn't given: ~10% of the spectrum length, capped at 151, and forced odd (as savgol requires)
  • then it identifies all the high points in the spectrum based on the given threshold (by default we clip at the 90th percentile, i.e. we keep only the 10% brightest points)
  • then it creates a mask (keeping all the high points and filtering out the absorption regions)
  • fills the dips in the mask by interpolating
  • applies the smoothing filter to this data (this is the savgol step: a windowed low-order polynomial estimation, order 3 by default)
  • (and ensures non-zeroes) => this is our continuum
  • divides the flux by the continuum to get a normalized flux
import numpy as np
from scipy.signal import savgol_filter

def continuum_savgol(flux, window=None, poly=3, clip_percentile=90):
    # dynamically pick the window from the flux length:
    # ~10% of the total length, capped at 151
    # (the | 1 forces it odd, which savgol requires)
    if window is None: window = min(151, len(flux) // 10 | 1)
    if window % 2 == 0: window += 1

    # percentile clipping - keeping only the upper X% of flux values
    # (to avoid absorption dips affecting the continuum)
    # only the top/brightest 10% by default will dictate the continuum
    threshold = np.percentile(flux, clip_percentile)
    mask = flux > threshold

    # create a smooth baseline by interpolating over masked points 
    # (since we removed absorption lines - we fill in the holes - 
    # connecting the brightest points)
    interp_flux = np.copy(flux)
    interp_flux[~mask] = np.interp(np.flatnonzero(~mask), np.flatnonzero(mask), flux[mask])

    # we use savgol on our interpolated flux made of the brightest points 
    # => continuum
    continuum = savgol_filter(interp_flux, window_length=window, polyorder=poly, mode='interp')
    # to avoid division by very very small numbers 
    # so its not inf and ensure its not 0 or -ve
    continuum = np.clip(continuum, 1e-5, None) 
    # normalized flux = flux / general curve => values between 0-1
    return flux / continuum, continuum

  • Pros: Best for clean spectra with broad features; works quite fast; no model fitting needed; preserves the broad shape and smooths out noise
  • Cons: Can over-smooth broad features (if the window is too big); bad for low-SNR spectra (noise can become part of the continuum)

Spline Fit

Method: Identifies high-flux points across bins, fits a spline through them to approximate the continuum, and normalizes by the result.

spline = UnivariateSpline(wave[mask], flux[mask], s=s_factor)
continuum = spline(wave)
norm_flux = flux / continuum

Methodology Overview:

  • Initialize masks and bins: at least 10 bins, with roughly one bin per every 300 points
  • in each bin we then select the top-percentile points by flux (by default we clip at the 95th percentile, i.e. we keep only the 5% brightest points)
  • then we define a smoothing factor based on the wavelength range (s=0 would mean exact interpolation through all the points; s>0 encourages smoothing)
  • then we call the univariate spline fit (a piecewise polynomial through the selected points)
  • it's important to have a high enough s to prevent overfitting but not so high that we miss important features, to not use an overly aggressive high-points percentile that misses the continuum in noisy regions, and to have a decent number of points per bin
import numpy as np
from scipy.interpolate import UnivariateSpline

def continuum_spline(wave, flux, high_percentile=95, s_factor=1e-4):
    mask = np.zeros_like(flux, dtype=bool)

    # bin the spectrum - at least 10 bins, roughly one bin per 300 points
    n_bins = max(10, len(flux) // 300)
    bins = np.linspace(wave.min(), wave.max(), n_bins + 1)

    # within each bin, find the high percentile flux value to identify continuum points
    # by default we only consider points above 95th percentile in each bin as continuum points
    for i in range(n_bins):
        bin_mask = (wave >= bins[i]) & (wave < bins[i+1])
        bin_flux = flux[bin_mask]
        if len(bin_flux) == 0:
            continue
        threshold = np.percentile(bin_flux, high_percentile)
        mask[bin_mask] |= (flux[bin_mask] > threshold)

    if np.sum(mask) < 10:
        raise ValueError("too few points to fit")
    
    # smoothing factor for spline fitting 
    # dynamically adjusted based on wavelength range and number of points
    s = s_factor * (wave.max() - wave.min()) * len(wave)
    spline = UnivariateSpline(wave[mask], flux[mask], s=s)
    continuum = spline(wave)
    # to avoid division by very very small numbers so its not inf and ensure its not 0 or -ve
    continuum = np.clip(continuum, 1e-4, None)
    # normalized flux = flux / general curve => values between 0-1
    return flux / continuum, continuum

  • Pros: Adapts well to variable continuum shapes (much better than a basic polynomial)
  • Cons: Super sensitive to masking thresholds; can fail if there aren't enough valid points, or fluctuate too much if point selection is poor

Hybrid – Spline with Savitzky Golay and Sigma Clipping

Method: Combines smoothing, sigma clipping, and bin-based peak selection to choose continuum points, followed by spline fitting.

# Apply median filter, then savgol, sigma clip, then spline fit on top N% points per bin
norm_flux, continuum = continuum_spline_wsavgol(wave, flux)

Methodology Overview:

  • This function auto-selects a window if one isn't given (~10% of the length, capped at 151, forced odd)
  • Then it does a median filtering pass to remove local spikes
  • This is followed by savgol smoothing (a windowed low-order polynomial approximation)
  • Then we do iterative sigma clipping to remove points that deviate too much; we use the median absolute deviation instead of the standard deviation to make it more robust to outliers (2.5-sigma clipping by default)
  • then we divide the data into bins (20 by default)
  • within each bin we take the points that survived the sigma clipping (if there are none, we just take all points in the bin)
  • we sort these points by flux and select the top N brightest (15 by default)
  • we standardize the wavelengths (for numerical stability), select an appropriate smoothing factor, and fit a spline through our selected mask
  • the above steps give us our continuum
  • then we just divide our flux by the continuum baseline to get the normalized flux
import numpy as np
from scipy.interpolate import UnivariateSpline, LSQUnivariateSpline
from scipy.signal import savgol_filter, medfilt

def continuum_spline_wsavgol(wave, flux, smooth_window=None, sigma=2.5, s_factor=1e-4, top_n_per_bin=15, n_bins=20, verbose=False):
    # window selection for savgol - 10% of total length and atleast 151
    if smooth_window is None:
        smooth_window = min(151, len(flux) // 10 | 1)
    if smooth_window % 2 == 0:
        smooth_window += 1

    # initial smoothing - median filter followed by savgol
    # median filter to remove spikes (preserve edges better than savgol alone)
    init_smooth = medfilt(flux, kernel_size=15)
    # apply savgol on the median filtered data
    smoothed = savgol_filter(init_smooth, window_length=smooth_window, polyorder=3)

    # sigma clipping - remove outliers from continuum (e.g., deep absorption lines)
    # median - more robust to outliers than mean
    # 4 iterations of 2.5 sigma clipping using MAD
    mask = np.ones_like(flux, dtype=bool)
    for _ in range(4):
        residual = flux[mask] - smoothed[mask]
        mad = np.median(np.abs(residual - np.median(residual)))
        std = 1.4826 * mad if mad > 0 else 1e-5
        mask[mask] = np.abs(residual) < sigma * std
        smoothed = savgol_filter(flux * mask, window_length=smooth_window, polyorder=3)

    # bin data into 20 bins by default
    # take top 15 points by default in each bin as continuum points
    bins = np.linspace(wave.min(), wave.max(), n_bins + 1)
    cont_mask = np.zeros_like(flux, dtype=bool)
    for i in range(n_bins):
        in_bin = (wave >= bins[i]) & (wave < bins[i + 1])
        idx = np.where(in_bin & mask)[0]
        if len(idx) == 0:
            idx = np.where(in_bin)[0]
        if len(idx) > 0:
            top_idx = idx[np.argsort(flux[idx])[-min(top_n_per_bin, len(idx)):]]
            cont_mask[top_idx] = True

    # spline struggles with end points - can shoot up to infinity
    # so we force edge points (first and last 5) to avoid that
    cont_mask[:5] = True
    cont_mask[-5:] = True

    # scaled to make calculations numerically stable
    wave_scaled = (wave - wave.mean()) / wave.std()
    s = s_factor * len(wave) * np.median(flux)**2

    # mask deep absorptions (only used by the commented-out knot-based fit below)
    mask = flux > np.percentile(flux, 85)

    # num_knots = 10
    # knots = np.quantile(wave[mask], np.linspace(0, 1, num_knots))
    # spline = LSQUnivariateSpline(wave, flux, knots[1:-1], k=3)

    spline = UnivariateSpline(wave_scaled[cont_mask], flux[cont_mask], s=s)
    continuum = spline(wave_scaled)
    continuum = np.clip(continuum, 1e-4, None)

    # normalize and clip
    norm = flux / continuum
    norm = np.clip(norm, 0.2, 3.0)

    if verbose:
        print(f"[spline_wsavgol] Points used: {np.sum(cont_mask)}, s={s:.2e}")

    return norm, continuum

  • Pros: Robust (iterative sigma clipping with MAD handles outliers quite well)
  • Cons: Computationally heavy; requires careful selection of bins and smoothing.

Polynomial Fit

Method: Fits a low degree polynomial to the top percentile of flux values

coeffs = np.polyfit(wave_scaled[mask], flux[mask], deg)
continuum = np.polyval(coeffs, wave_scaled)
norm_flux = flux / continuum

Methodology Overview:

  • relatively simple: we select the brightest points (in the implementation below this is done per bin, keeping the top N points in each; a plain percentile clip is left commented out)
  • fit a polynomial through those points to get the continuum
  • we divide flux by the continuum to get a normalized flux
import numpy as np

# bin_select_mask and smooth_tail are small project helpers (not shown here):
# one picks the brightest points per bin, the other smooths the red end of the fit

def continuum_polynomial(wave, flux, deg=4, n_bins=25, top_n=10, clip_percentile=90):
    # scaling for numerical stability
    wave_scaled = (wave - wave.mean()) / wave.std()

    # threshold = np.percentile(flux, clip_percentile)
    # mask = flux > threshold

    # bin the continuum and select brightest points
    mask = bin_select_mask(wave, flux, n_bins=n_bins, top_n=top_n)

    if deg > len(wave_scaled[mask]) // 3:
        deg = max(2, len(wave_scaled[mask]) // 3)

    # fit only selected high points
    coeffs = np.polyfit(wave_scaled[mask], flux[mask], deg=deg)
    continuum = np.polyval(coeffs, wave_scaled)

    # clip to avoid div by zero or very small values and remove extreme high values
    # continuum = np.clip(continuum, 1e-5, None)
    continuum = np.clip(continuum, 1e-3, np.percentile(continuum, 99))
    continuum = smooth_tail(continuum, wave_scaled)

    norm = flux / continuum
    norm = np.clip(norm, 0.2, 3.0)

    return norm, continuum

  • Pros: Fast, interpretable, works decently at low resolution
  • Cons: Poor for complex spectra; can underfit or overfit depending on the degree

Weighted Polynomial Fit

Method: Down-weights points that deviate from a smoothed version of the flux while fitting the polynomial continuum (the snippet below shows a simpler threshold-based variant).

weights = np.clip(medfilt(flux, 5) / threshold, 0, 1)
coeffs = np.polyfit(wave_scaled, flux, deg=4, w=weights)
continuum = np.polyval(coeffs, wave_scaled)

Methodology Overview:

  • standardizes the wavelengths for numerical stability (polynomial fitting can be severely affected by large values)
  • smooths the flux with a median filter and computes residuals against it
  • derives weights from the residuals, so points close to the smoothed baseline get weights near 1 (all weights in [0,1])
  • does a weighted polynomial fit (so the better-behaved, brighter regions have more influence)
  • this gives our continuum; then we just divide flux by the continuum to get the normalized flux
import numpy as np
from scipy.signal import medfilt

# smooth_tail is a project helper (not shown) that smooths the red end of the continuum

def continuum_polynomial_weighted(wave, flux, deg=4, smooth=True):
    # scaling for numerical stability
    wave_scaled = (wave - wave.mean()) / wave.std()

    # initial smoothing using median filter
    if smooth:
        smoothed = medfilt(flux, kernel_size=5)#savgol_filter(medfilt(flux, 5), 51, 3)
    else:
        smoothed = flux

    # calculate weights based on residuals of smoothed spectrum
    residual = flux - smoothed
    mad = np.median(np.abs(residual - np.median(residual)))
    if mad < 1e-10:
        mad = 1e-3
    weights = np.exp(-((residual / (1.4826 * mad)) ** 2))  # Gaussian down-weighting
    # note: the line below overrides the Gaussian weights with a simpler scheme
    weights = 1.0 / (1 + np.abs(flux - smoothed))
    # edge_fade = np.exp(-((wave - wave.mean()) / (wave.ptp() / 2))**2)
    # weights *= edge_fade

    coeffs = np.polyfit(wave_scaled, flux, deg=deg, w=weights)
    continuum = np.polyval(coeffs, wave_scaled)

    # clip to avoid div by zero or very small values and remove extreme high values
    continuum = np.clip(continuum, 1e-3, np.percentile(continuum, 99))
    # smooth the red tail
    continuum = smooth_tail(continuum, wave_scaled)

    norm = flux / continuum
    norm = np.clip(norm, 0.2, 3.0)

    return norm, continuum

  • Pros: Soft masking avoids hard thresholds, making it more stable than hard masking; a better fit than the plain polynomial
  • Cons: Still limited in flexibility compared to the binned spline and savgol approaches

Auto-Selected Continuum Method

Method: Applies all continuum methods, scores them on smoothness, curvature, flatness, clipping penalties, tail behavior, and line contrast, and selects the best one automatically.

# Score = weighted sum of smoothness, curvature, flatness, clipping/overshoot
# penalties, tail behavior, and line contrast (see score_method below)
best_method = min(scored, key=lambda x: x[3])

Methodology Overview:

  • tries different techniques and picks the best one
  • the scoring methodology is the key; to assess which one to pick, we first
  • measure how bumpy each continuum is (smoothness: the standard deviation of the continuum minus its savgol-smoothed version)
  • measure curvature, i.e. how bendy the continuum is (standard deviation of its gradient)
  • measure flatness, i.e. whether the normalized flux is centered around 1 (median ≈ 1)
  • measure clipping, overshoot, and undershoot penalties: the fraction of out-of-bounds or extreme normalized values
  • measure tail behavior (slope and deviation of the continuum beyond 9000 Å) and a line-contrast penalty (shallow absorption lines are penalized)
  • combine all of these into a weighted score, plus a Wasserstein-distance term against a perfectly flat spectrum
  • a lower score indicates a better fit (a smoother, more stable normalization)
# def score_method(cont):
#     smoothed = savgol_filter(cont, 101, 3)
#     smoothness = np.std(cont - smoothed)
#     flatness = np.abs(np.median(cont - 1.0))
#     return smoothness + 0.5 * flatness  # Weight toward smooth + ~1
import numpy as np
from scipy.signal import savgol_filter
from scipy.stats import wasserstein_distance

def score_method(norm, cont, wave):
    smoothed = savgol_filter(cont, 101, 3)
    smoothness = np.std(cont - smoothed) #checking jaggedness of continuum

    curvature = np.std(np.gradient(cont)) #fluctuations of continuum

    overshoot_penalty = np.mean(norm > 3.0) # penalize for too high peaks
    undershoot_penalty = np.mean(norm < 0.3)
    # clip_penalty = np.mean((norm > 5) | (norm < 0.2))
    clip_penalty = np.mean((norm > 2.5) | (norm < 0.5))  # out-of-bounds normalization

    flatness = np.abs(np.median(norm) - 1.0) #normalized spectrum must be centered around 1

    absorption = norm[norm < 1.0]
    if len(absorption) > 10:
        # deeper lines -> larger std -> smaller penalty (inverted so weak/absent lines are penalized)
        line_contrast_penalty = 1.0 / (np.std(absorption) + 1e-6)
    else:
        line_contrast_penalty = 5.0  # too few absorption points = bad

    tail_mask = wave > 9000  
    if np.sum(tail_mask) >= 3:
        tail_slope = np.mean(np.gradient(cont[tail_mask])) #checking slope of continuum in tail region
        tail_deviation = np.std(norm[tail_mask] - 1.0) #checking deviation from 1 in tail region
    else:
        tail_slope = 0
        tail_deviation = 0

    score = (
        1.0 * smoothness +
        0.5 * curvature +
        1.0 * flatness +
        2.0 * clip_penalty +
        2.0 * overshoot_penalty +
        1.5 * undershoot_penalty +
        1.0 * tail_deviation +
        0.5 * abs(tail_slope) +
        1.0 * line_contrast_penalty
    )

    score += wasserstein_distance(norm, np.ones_like(norm)) * 2.0
    return score

def auto_continuum_normalize(wave, flux, verbose=False, methods=None):
    if methods is None:
        methods = [
            "spline_wsavgol",
            "polynomial_weighted",
            "polynomial",
            "savgol"
        ]

    results = []


    for method in methods:
        try:
            if method == "spline_wsavgol":
                norm, cont = continuum_spline_wsavgol(wave, flux)
            elif method == "polynomial_weighted":
                norm, cont = continuum_polynomial_weighted(wave, flux)
            elif method == "polynomial":
                norm, cont = continuum_polynomial(wave, flux)
            elif method == "savgol":
                cont = savgol_filter(flux, window_length=151, polyorder=3, mode='interp')
                cont = np.clip(cont, 1e-5, None)
                norm = flux / cont
            else:
                continue  # Unknown method

            results.append((method, norm, cont))
            if verbose:
                print(f"{method} succeeded")

        except Exception as e:
            if verbose:
                print(f"{method} failed: {e}")

    if not results:
        raise RuntimeError("All continuum normalization methods failed.")

    

    scored = [(method, norm, cont, score_method(norm, cont, wave)) 
          for method, norm, cont in results]

    best_method, best_norm, best_cont, best_score = min(scored, key=lambda x: x[3])

    if verbose:
        print("Tried methods (with scores):")
        for method, _, _, score in scored:
            print(f" - {method}: {score:.4f}")
        print(f"Selected '{best_method}' with score {best_score:.4f}")

    return best_norm, best_cont, best_method
  • Pros: Adaptable across diverse data qualities – selects the best method for each spectrum
  • Cons: Non-standardized normalization – can't be used directly for a neural network, since one continuum could be a weighted polynomial and another Savitzky-Golay, leading to inconsistency between data samples and features
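The selection loop above depends on helpers defined earlier in the post, so as a self-contained illustration of the same pick-by-score pattern, here is a toy version: a synthetic spectrum, two simple continuum estimators, and a deliberately reduced score whose weights mirror the ones above.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic spectrum: a smooth continuum with one absorption line plus noise
rng = np.random.default_rng(0)
wave = np.linspace(4000.0, 9000.0, 2000)
true_cont = 1000.0 * np.exp(-((wave - 6000.0) / 4000.0) ** 2)
flux = true_cont * (1.0 - 0.4 * np.exp(-((wave - 6563.0) / 5.0) ** 2))
flux += rng.normal(0.0, 5.0, wave.size)

def cont_polynomial(wave, flux, deg=5):
    # Plain polynomial fit, clipped so the division below cannot blow up
    cont = np.clip(np.polyval(np.polyfit(wave, flux, deg), wave), 1e-5, None)
    return flux / cont, cont

def cont_savgol(wave, flux):
    # Savitzky-Golay smoothing as a cheap continuum estimate
    cont = np.clip(savgol_filter(flux, 151, 3, mode="interp"), 1e-5, None)
    return flux / cont, cont

def simple_score(norm, cont):
    # Reduced version of the scoring above: smoothness, flatness,
    # out-of-bounds fraction, and gradient stability (lower is better)
    smoothness = np.std(cont - savgol_filter(cont, 101, 3))
    flatness = abs(np.median(norm) - 1.0)
    clip_penalty = np.mean((norm > 2.5) | (norm < 0.5))
    curvature = np.std(np.gradient(cont))
    return smoothness + 0.5 * flatness + 2.0 * clip_penalty + 0.3 * curvature

scored = [(name, simple_score(*fn(wave, flux)))
          for name, fn in [("polynomial", cont_polynomial), ("savgol", cont_savgol)]]
best_method, best_score = min(scored, key=lambda x: x[1])
print(best_method, round(best_score, 3))
```

This is a sketch of the selection mechanism, not the full scorer: the line-contrast, tail, and Wasserstein terms are omitted to keep it short.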

For each technique, we visually inspected the normalized result to assess whether it retained key spectral features while flattening the continuum. This process helps guide preprocessing decisions for any spectral ML pipeline.

Conclusion

Spectrum normalization is not one size fits all. Your choice should reflect the data characteristics, application goals, and downstream models. Visual inspection is key, and automated scoring can help streamline the selection process.

The point of this normalization is to highlight spectral features like specific absorption or emission lines (super important for fingerprinting stars)

Even some of these fancier algorithms failed ‘catastrophically’ on some stars. Based on what I saw so far, it seems like spline with Savitzky-Golay might be our best bet; the polynomial and weighted polynomial should also be modified to at least add clipping so they don’t explode (because they did on some spectra).

I’ll add some comparisons for different types of stars and a side-by-side comparison with features extracted from manual inspection, so you can see what types of stars benefit from what type of normalization. Though for training ML models we would have to stick to one methodology to maintain consistency.

I’ll follow up on that and any improvements in the scoring methodology for my auto function.

You can access this code on my github repo here

]]>
Star I Attributes – Luminosity https://livnlearns.com/star-i-attributes-luminosity/ Mon, 15 Dec 2025 20:40:16 +0000 https://livnlearns.com/?p=1484 We categorize stars into different types based on attributes like Luminosity, Temperature and size – but how do we define and measure them? Let’s discuss briefly.

Brightness

One of the most important attributes of stars is their brightness: how bright or dim they appear to us. It is a measure of the electromagnetic radiation we receive, and is usually limited to a particular band of the electromagnetic spectrum (visible, or whichever specific filter we use in our telescopes). Brightness is also referred to as Flux.

We quantify the brightness of a star in units of Magnitude

Brightness (apparent) refers to how bright the star appears to us and intrinsic brightness (Luminosity) is how bright the star actually is. We don’t get to observe all the light from the star for multiple reasons:

– For one, the photons’ wavelengths stretch over large distances – light redshifts toward infrared, microwave, and radio wavelengths (Doppler redshift, cosmological redshift, gravitational redshift)
– Interstellar dust and gases obstructing the path of light
– Atmospheric turbulence, moisture, dust and gases
– The wavelength of the filter we use to observe

Luminosity

Luminosity is defined as the total amount of energy emitted by a star per second; this is all the energy emitted by the star and is independent of the point of observation. Since it is energy per unit time, the commonly used unit is the Watt (i.e. Joules per second).

How we calculate Luminosity is:

F = L / (4πd²), i.e. L = 4πd² · F

Where F is Flux or brightness, L is Luminosity, and d is the distance at which the brightness or Flux is measured
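As a quick check of the relation, the Sun’s luminosity can be recovered from the solar constant (the flux we measure at Earth, about 1361 W/m²) and the Earth–Sun distance:

```python
import math

flux_at_earth = 1361.0   # W/m^2, the solar constant measured at Earth
distance = 1.496e11      # m, 1 AU

# L = 4 * pi * d^2 * F  (inverse-square law rearranged for luminosity)
luminosity = 4.0 * math.pi * distance**2 * flux_at_earth
print(f"{luminosity:.3e} W")  # ~3.83e26 W, the accepted solar luminosity
```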

Types

Luminosity, and hence flux, is observed and referred to in different ways and can be classified into the following types

Spectral

Spectral Luminosity helps us see the Luminosity distribution over wavelength (or frequency)

L_lambda = dL/dlambda (or, per unit frequency, L_nu = dL/dnu)

Bolometric

Bolometric Luminosity shows brightness over the whole spectrum

L_bol = ∫ L_lambda dlambda, integrated over all wavelengths (0 to ∞)

Band

Band Luminosity is taken over a specific band (a specific wavelength range)

L_band = ∫ L_lambda dlambda, integrated from lambda_1 to lambda_2

Bands

There are different kinds of photometric band systems used in astronomy:

Broadband

Broadband filters with a bandwidth greater than 30 nanometers, like:

Johnson-Morgan-Cousins (UBVRI) – (~365-820 nm) -> Ultraviolet – Near Infrared

SDSS (u′g′r′i′z′) – (~355-893 nm) -> Ultraviolet – Near Infrared

Gaia (G, BP, RP) – (~330-1050 nm) -> Ultraviolet – Near Infrared

2MASS (J, H, K_s) – (~1.25-2.16 micrometer) -> Near Infrared

WISE (W1–W4) – (3-22 micrometer) -> Mid Infrared

These are the most commonly used, as they help us determine the colour, temperature, light variability etc. for most stars (which we’ll discuss in later sections)

Intermediate

Intermediate-band filters, with bandwidths of 10-30 nanometers, like:

Stromgren (uvby) (350-547 nm)

These are used for specific purposes like capturing special characteristic features and metallicities to help classify stars.

Narrowband

Narrowband filters, with bandwidths of less than 10 nanometers, like:

H_alpha

OIII (doubly ionized oxygen)

These are used to measure specific emission lines

Magnitude

Magnitude, in the context of stars, is a logarithmic scale used to measure their brightness.

It is a backward scale (smaller magnitude = brighter object)

Apparent

Apparent magnitude is a measure of how bright the star appears to us.

m1 - m2 = -2.5 * log10(F1 / F2)

So if a star is 1 magnitude less than another, it is ~2.512 times brighter, whereas if it were 5 magnitudes less it would be 100x brighter!

e.g. The apparent magnitude of Saturn in 2025 ranged from ~ -0.6 to +1.4, the apparent magnitude of our moon is ~ -12.7 (super bright!), for Polaris it’s close to ~ +2.0, and 3I/Atlas (the popular comet of 2025) is currently at a magnitude of ~11 (here we just mention the visible range, i.e. the V band)
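The ‘1 magnitude ≈ 2.512x, 5 magnitudes = 100x’ relation follows directly from the logarithmic definition:

```python
def brightness_ratio(delta_mag):
    # A difference of delta_mag magnitudes corresponds to a flux ratio
    # of 100^(delta_mag / 5); 5 magnitudes is exactly 100x by definition
    return 100.0 ** (delta_mag / 5.0)

print(brightness_ratio(1))  # ~2.512
print(brightness_ratio(5))  # 100.0
```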

Absolute

Absolute magnitude is the magnitude of the star if it were at a standard distance of 10 parsecs.

M = m - 5 * log10(d / 10 pc), with d the distance in parsecs

e.g. The absolute magnitude of Saturn in 2025 ranged from ~ -9.5 to -10, the absolute magnitude of our moon is ~ -0.2 to +0.2, for Polaris it’s close to ~ -3.6 to -3.7, and 3I/Atlas (the popular comet of 2025) has an absolute magnitude of 12-13.7 (here we just mention the visible range, i.e. the V band)
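The distance-modulus form can be checked against the Polaris figure quoted above: apparent magnitude ~+2.0 at roughly 133 parsecs (the distance here is an assumption, taken from commonly quoted estimates):

```python
import math

def absolute_magnitude(apparent_mag, distance_pc):
    # M = m - 5 * log10(d / 10 pc)
    return apparent_mag - 5.0 * math.log10(distance_pc / 10.0)

# Polaris: m ~ +1.98, d ~ 133 pc (commonly quoted values)
print(round(absolute_magnitude(1.98, 133.0), 2))  # ~ -3.6
```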

Zero Point

As you can see above, magnitudes are defined on a relative scale (comparing relative brightness). The Zero Point is the reference point: it is the brightness or flux that defines magnitude 0.

Hence magnitude can be defined in comparison with zero point.

m = -2.5 * log10(F / F_0) = -2.5 * log10(F) + ZP, where ZP = 2.5 * log10(F_0)

Magnitude Systems

There are several ways to look at magnitudes, i.e. we can choose different reference spectra and hence calibrate our data in different ways

AB

The AB system (tied to the ABsolute flux scale) is defined using flux density per unit frequency (f_nu), such that a flat spectrum in frequency (constant flux per unit frequency) has the same magnitude in all filters. This system is popular in galaxy studies and cosmology.

For the zero point we usually calibrate against a flat spectrum at 3631 Jansky (Jy), which roughly corresponds to the flux of Vega in the V band – mostly because Vega was the well-known historical reference.

m_AB = -2.5 * log10(f_nu / 3631 Jy) (equivalently, m_AB = -2.5 * log10(f_nu) - 48.60 with f_nu in erg/s/cm^2/Hz)
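For instance, converting a flux density in Janskys to an AB magnitude uses the 3631 Jy zero point directly:

```python
import math

def ab_magnitude(flux_jy):
    # m_AB = -2.5 * log10(f_nu / 3631 Jy)
    return -2.5 * math.log10(flux_jy / 3631.0)

# A 1 mJy source (a typical faint survey object):
print(round(ab_magnitude(0.001), 2))  # 16.4
```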

ST

The ST (Space Telescope) system, very popular in Hubble Space Telescope studies, is defined using flux density per unit wavelength (f_lambda), such that a flat spectrum in wavelength (constant flux per unit wavelength) has the same magnitude in all filters. This system is popular in spectroscopic studies.

For the zero point we usually calibrate against a flat spectrum at 3.63e-9 erg/s/cm²/Å, corresponding to the zero-point flux we took in the AB system. This value gives a zero point of -21.1 when the wavelength is in Angstroms; if the wavelength were in nanometers, the zero point would be -18.6.

m_ST = -2.5 * log10(f_lambda) - 21.10, with f_lambda in erg/s/cm^2/Å

Vega

The Vega magnitude system is one of the traditional systems used for stars. It is a relative system, in which a star is compared to the star Vega, and the zero point is set such that the apparent magnitude of Vega (m_Vega) = 0 in all bands.

This system is filter dependent, i.e. it differs between the Johnson-Cousins and SDSS filters. The flux of Vega corresponds to the flux received from the star Vega in that band for that instrument.

m = -2.5 * log10(F_star / F_Vega)

We often need to switch magnitude systems and calibrate zero points, especially when consolidating different data sources or combining different attributes; this process is usually referred to as flux calibration.

Extinction

Extinction (k_lambda) and airmass (X) are terms in the apparent magnitude calculation that account for atmospheric interference.

Airmass is defined as the amount of atmosphere light has to pass through to reach the instrument relative to the zenith.

Extinction(k) is the reduction in intensity of light due to interactions with atmosphere.
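A sketch of the standard correction, m_observed = m_0 + k_lambda · X, using the plane-parallel approximation X ≈ sec(z) for the airmass. The coefficient value below is illustrative only; real extinction coefficients are measured per site and per band.

```python
import math

def airmass(zenith_angle_deg):
    # Plane-parallel approximation: X = sec(z); reasonable below z ~ 60 degrees
    return 1.0 / math.cos(math.radians(zenith_angle_deg))

def correct_extinction(observed_mag, k, zenith_angle_deg):
    # m_0 = m_obs - k * X : recover the above-atmosphere magnitude
    return observed_mag - k * airmass(zenith_angle_deg)

# Illustrative: k_V = 0.2 mag/airmass, star observed at z = 60 deg (X = 2)
print(round(correct_extinction(10.4, 0.2, 60.0), 2))  # 10.0
```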

References

Luminosity

Johnson-Morgan-Cousins

UBVRI passbands

SDSS

Transformations between SDSS magnitudes and other systems

Historical View of the u’g’r’i’z’ Standard System

The ugriz Standard-Star System

Gaia

The design and performance of the Gaia photometric system

Gaia Early Data Release 3. Photometric content and validation

Gaia Data Release 1 , Gaia Data Release 3: External calibration of BP/RP low-resolution spectroscopic data

2MASS

Color Transformations for the 2MASS Second Incremental Data Release

The VISTA ZYJHKs photometric system: calibration from 2MASS

WISE

The Wide-field Infrared Survey Explorer (WISE): Mission Description and Initial On-orbit Performance

WISE Photometry for 400 Million SDSS Sources

Strömgren

Two-dimensional spectral classification of F stars through photoelectric photometry with interference filters

A new catalogue of Strömgren-Crawford uvbyβ photometry

Interstellar reddening relations in the UBV, uvby, and Geneva systems

Flux Calibration

Zeropoints – Space Telescope Science Institute (STScI), Synthetic Photometry

Flux Calibration – Sloan Digital Sky Survey DR 18 (SDSS)

]]>
Star I – Attributes – Distance https://livnlearns.com/star-i-attributes-distance/ Sat, 06 Dec 2025 21:31:53 +0000 https://livnlearns.com/?p=1423 We categorize stars into different types based on attributes like Luminosity, Temperature and size – but how do we define and measure them? Let’s discuss briefly, starting with distance, which forms the foundation of a lot of other observations – the science of measuring distances to astronomical objects is called astrometry 🙂

Distance

To measure how far the star is from us – we use parallax to our advantage.

What is parallax?

Parallax is the shift in the apparent position of an object when viewed from a different angle; we measure this apparent change in stellar position against a known change in observer position, and from that calculate the stellar distance.

d = 1 / p

Where we can simply estimate parallax angle (p) of a star by making observations at different times in a year and use its value in arcseconds to get us the distance of the star (d) in parsecs.

Parallax is defined as the angular radius of Earth’s orbit as seen from the star; from an earthly perspective, it’s half the total shift of the star (half because we move 2 AU apart between our extreme observations)
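For example, Proxima Centauri’s measured parallax of roughly 0.7685 arcseconds gives its distance directly:

```python
def parallax_distance_pc(parallax_arcsec):
    # d [parsec] = 1 / p [arcsec]
    return 1.0 / parallax_arcsec

# Proxima Centauri: p ~ 0.7685 arcsec
d_pc = parallax_distance_pc(0.7685)
print(round(d_pc, 3))         # ~1.301 pc
print(round(d_pc * 3.26, 2))  # ~4.24 light years
```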

What is a parsec?

1 parsec is essentially,
= 1 AU / 1 arcsecond
= 1.496e+8 km / ((pi/180)/3600) rad
= 1.496e+8 km / 4.84814e-6 rad
= 3.0857195e+13 km
In light years,
= 3.0857195e+13 km / (3.154e+7 s/yr * 299792.458 km/s)
= 3.26 light years
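The chain of unit conversions above is easy to verify in a couple of lines:

```python
import math

AU_KM = 1.496e8                          # 1 AU in km
arcsec_rad = (math.pi / 180.0) / 3600.0  # one arcsecond in radians

# Small-angle approximation: 1 parsec = 1 AU / 1 arcsecond
parsec_km = AU_KM / arcsec_rad
print(f"{parsec_km:.4e} km")             # ~3.0857e13 km

light_year_km = 3.154e7 * 299792.458     # seconds per year * c in km/s
print(round(parsec_km / light_year_km, 2))  # ~3.26
```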

How far can we see? / How granular we can measure?

Measuring parallaxes for angles smaller than 0.01 arcseconds (i.e. 10 milliarcseconds) from Earth is really hard because our atmosphere causes a lot of blurring, so ground-based telescopes can reliably measure stars out to about 100 parsecs (derived from 1/0.01) with this parallax method.

Several ground-based telescopes have achieved this level of precision. For instance, the Cerro Tololo Inter-American Observatory in Chile has an astrometric parallax program, CTIOPI, which ran separate programs on its 0.9m and 1.5m telescopes with a precision of ~2-3.5 milliarcseconds; the details can be found here and here. The United States Naval Observatory (USNO) also has a parallax catalog, UPC, containing more than 113k parallaxes between 2-13 milliarcseconds as of 2018; you can read more about it here. And the 1.8m wide-field Panoramic Survey Telescope in Hawaii, Pan-STARRS, also has a catalogue of parallaxes, especially for faint objects like dwarf planets and asteroids, and seems to have achieved a precision of about 3 milliarcseconds over multiple epochs (10 milliarcseconds over 1 epoch). You can read more about it here and here.

Other ground-based telescopes, like infrared and radio telescopes, have parallax catalogues too: the infrared survey VISTA Variables in the Via Lactea (VVV) has a near-infrared catalogue, VIRAC, covering more than 300k stars with a median precision of 1.1 milliarcseconds. You can read more about it here.
The Very Long Baseline Interferometry (VLBI) arrays of radio telescopes across Earth also run projects including detailed parallax studies like BeSSeL and VERA, with measurement accuracies of about 20 and even 5 microarcseconds. You can read more about it here.

That said, space-based telescopes have achieved pretty impressive precision – starting in 1989 with ESA’s Hipparcos, which measured parallaxes for about 118k stars with an accuracy of 0.001 arcseconds (i.e. 1 milliarcsecond); you can read more about this really impressive telescope here.

The Hubble telescope can measure parallaxes down to 20-40 microarcseconds with its Wide Field Camera 3 (WFC3), which gives a theoretical limit of measuring stars 5 kiloparsecs away, and there have been successful Cepheid variable star measurements at about ~3 kiloparsecs in 2016; you can read about it here and here

And the Gaia EDR3 release of 2021 shows a precision of the order of 10 microarcseconds (about 1000 times better than ground-based telescopes), letting us measure distances up to 100 kiloparsecs theoretically. Unfortunately, even the stars in the Large Magellanic Cloud (~50 kiloparsecs away) have pretty high uncertainty and probably mark the maximum capacity, with the most reliable observations lying around a few kiloparsecs (20-30 microarcseconds), which is still very impressive! You can read more about it here and here

The more recent JWST, although not primarily an astrometric telescope, is nonetheless armed with powerful instruments like NIRCam and NIRISS, which can be and are used for high-precision astrometry in the range of 0.2 milliarcseconds; you can read more about it here and here

Proper Motion

All these stars are moving through space, we measure how a star appears to move (compared to its background) as seen from earth.

Measuring parallax helps us calculate distance which in turn helps us calculate the proper motion of the star (parallax distance is directly used to calculate the tangential velocity)

And although now used only in very specific cases, in the past we used proper motion to estimate distances to stars as well: we can group similar stars together (like those sharing an open cluster, and hence a relatively similar motion through space) and estimate their respective distances. Another technique is grouping similar stars and observing their velocity and proper motion distributions to calibrate other properties (like the absolute magnitude of standard-candle stars). In the past we also used a model of the Sun’s motion and then estimated the distances of background stars that would best explain their motion relative to the Sun.

v_t = 4.74 * mu * d (tangential velocity in km/s, with proper motion mu in arcsec/yr and distance d in parsecs); the total space velocity combines the radial and tangential components: v² = v_r² + v_t²

Closer stars appear to move more, although that is not always true. Here is the case of a relatively close star with the highest proper motion recorded – Barnard’s Star, which appears to move 10.3 arcseconds every year; the following is an estimation of its velocity
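That proper motion converts to a tangential velocity via v_t = 4.74 · mu · d, where 4.74 is just 1 AU per year expressed in km/s; Barnard’s Star’s parallax of ~0.547 arcseconds puts it at ~1.83 pc:

```python
def tangential_velocity_kms(mu_arcsec_yr, distance_pc):
    # v_t [km/s] = 4.74 * mu [arcsec/yr] * d [pc]
    # 4.74 = 1 AU / 1 year expressed in km/s
    return 4.74 * mu_arcsec_yr * distance_pc

# Barnard's Star: mu ~ 10.3 arcsec/yr, parallax ~0.547" -> d ~ 1.83 pc
v_t = tangential_velocity_kms(10.3, 1.0 / 0.547)
print(round(v_t, 1))  # ~89 km/s
```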

Also check out HSTPROMO for more information about proper motion and proper motion catalogs

References

Distance

UPC

VizieR Online Data Catalog: The URAT Parallax Catalog (UPC). Update 2018 (Finch+, 2018)

URAT South Parallax Results

Pan-STARRS

The Parallax of VHS J1256-1257 from CFHT and Pan-STARRS-1

Photometry and Proper Motions of M, L, and T Dwarfs from the Pan-STARRS1 3π Survey

CTIOPI

Results from CTIOPI: Parallaxes, Perturbations, and Pushing Towards SIM PlanetQuest

The Solar Neighborhood. XVI. Parallaxes from CTIOPI: Final Results from the 1.5 m Telescope Program

The Solar Neighborhood. XXXIII. Parallax Results from the CTIOPI 0.9 m Program: Trigonometric Parallaxes of Nearby Low-mass Active and Young Systems

VIRAC

VIRAC: the VVV Infrared Astrometric Catalogue

VizieR Online Data Catalog: VIRAC. The VVV Infrared Astrometric Catalogue (Smith+, 2019)

Variable star classification across the Galactic bulge and disc with the VISTA Variables in the Vía Láctea survey

VERA and BeSSeL

The VERA project (VLBI Exploration of Radio Astrometry)

Science with VERA: VLBI exploration of radio astrometry

Trigonometric Parallaxes of High Mass Star Forming Regions: The Structure and Kinematics of the Milky Way

The Bar and Spiral Structure Legacy (BeSSeL) survey: Mapping the Milky Way with VLBI astrometry

Hipparcos

Parallaxes and proper motions for 20 open clusters as based on the new Hipparcos catalogue

Zero-point and external errors of HIPPARCOS parallaxes

Gaia

Estimating Distances from Parallaxes. V. Geometric and Photogeometric Distances to 1.47 Billion Stars in Gaia Early Data Release 3

Validation of the accuracy and precision of Gaia EDR3 parallaxes with globular clusters

Gaia: Ten Years of Surveying the Milky Way and Beyond

Hubble

Hubble Space Telescope: A Generator of Submilliarcsecond Precision Parallaxes

Parallax beyond a Kiloparsec from Spatially Scanning the Wide Field Camera 3 on the Hubble Space Telescope

Parallax of Galactic Cepheids from Spatially Scanning the Wide Field Camera 3 on the Hubble Space Telescope: The Case of SS Canis Majoris

JWST

Photometry and astrometry with JWST – I. NIRCam point spread functions and the first JWST colour-magnitude diagrams of a globular cluster

Photometry and astrometry with JWST – III. A NIRCam-Gaia DR3 analysis of the open cluster NGC 2506

JWST-TST Proper Motions. I. High-precision NIRISS Calibration and Large Magellanic Cloud Kinematics

Photometry and astrometry with JWST‑II: NIRCam distortion correction

]]>
Navier Stokes Equation – I https://livnlearns.com/navier-stokes-i/ Wed, 26 Nov 2025 21:00:23 +0000 https://livnlearns.com/?p=1330 Derivation – Navier Stokes Eq

Following is a simple and brief derivation (from first principles) of Navier Stokes equation which is a key equation in mechanics and fluid dynamics:
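The derivation itself was embedded as images; as a reference point, the equation being derived – the incompressible Navier-Stokes momentum equation, with the incompressibility constraint – is:

```latex
% Momentum equation: density rho, velocity field u, pressure p,
% dynamic viscosity mu, body force f (per unit volume)
\rho\left(\frac{\partial \mathbf{u}}{\partial t}
        + (\mathbf{u}\cdot\nabla)\mathbf{u}\right)
  = -\nabla p + \mu\,\nabla^{2}\mathbf{u} + \mathbf{f}

% Incompressibility (conservation of mass for constant density):
\nabla\cdot\mathbf{u} = 0
```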

References

https://www.math.ucdavis.edu/~temple/MAT22C/NavierStodesWiki.pdf

https://www.grc.nasa.gov/www/k-12/airplane/nseqs.html

https://www.damtp.cam.ac.uk/user/tong/fluids/fluids2.pdf

More on Viscous flow

https://www.grc.nasa.gov/www/k-12/airplane/viscosity.html

https://www.homepages.ucl.ac.uk/~uceseug/Fluids2/Notes_Viscosity.pdf

https://ocw.mit.edu/courses/20-410j-molecular-cellular-and-tissue-biomechanics-be-410j-spring-2003/6bd544ec3d642328bc69bd696feec9cd_text_on_viscous.pdf

]]>
AI Safety – Hinton Lectures https://livnlearns.com/ai-safety-hinton-lectures/ Thu, 13 Nov 2025 06:02:02 +0000 https://livnlearns.com/?p=1156

Living in Toronto and working in AI, how could I not go to a Geoffrey Hinton Lecture? Hailed as a Godfather of AI, Geoffrey Hinton is Canada’s own Nobel Prize Laureate for his work on Neural Networks. I’ve known of him since the first time I studied Neural Networks in an Andrew Ng Coursera course back in school, and it’s kind of awesome to have the opportunity to hear a Nobel Laureate in person, that too in your own city.

So, what was this about? The talk was mainly led by Owain Evans, who is apparently one of the pioneers in AI Safety research; he studied this domain at MIT and later at Oxford, and is a director of Truthful AI in Berkeley, an AI Alignment company.

AI – Artificial Intelligence

AI used to refer to a lot of things in the past, from the statistical modelling of data science to Machine Learning (making an algorithm learn to predict something based on past data – either by telling it what’s good or bad, i.e. labelled data, supervised learning, or by letting it figure out data patterns on its own, unsupervised learning)

These ML models also evolved, from classical models like linear regression, decision trees, and clustering algorithms to Neural Networks mimicking the human learning process – and it is these Neural Networks that evolved in different directions: feedforward NNs, convolutional NNs, recurrent NNs, graph NNs, up to Transformers. Transformers applied to text (like BERT by Google in 2018) became the Large Language Models of today, like GPT and Gemini (Grok, DeepSeek, Claude, and Llama being some other frontier models, with only Llama and DeepSeek being open source).

When we refer to AI these days we usually mean these frontier LLMs, because they have taken machine learning to another level – a point which has matched or surpassed human capabilities on multiple fronts – and with their public accessibility they have become a household name.

Here is a summary of the following article generated using NotebookLM (read with caution – it is a bit exaggerated and a bit inaccurate in technicalities – but it is a nice overview):

AI Alignment

Now, what is AI Alignment you ask? It is basically a term which refers to the process of designing AI to be aligned with human ethics.

“Artificial intelligence (AI) alignment is the process of encoding human values and goals into AI models to make them as helpful, safe and reliable as possible” – IBM

“We consider misalignment failures to be when an AI’s behavior or actions are not in line with relevant human values, instructions, goals, or intent” – OpenAI 1 , 2

“Alignment researchers validate that models are harmless and honest even under very different circumstances than those under which they were trained” – Anthropic

“Alignment is the process of managing the behavior of generative AI (GenAI) to ensure its outputs conform with your product’s needs and expectations” – Google

It is indeed an active area of research among many frontier AI companies, and has been for a while – more specifically value alignment (as alignment could cover a broader scope, and could even refer to product alignment, as you can see from Google’s alignment toolkit)

When you mention value alignment, one question that immediately comes to mind is: alignment with what, whose ethics? Ethics and value systems vary from person to person; when we train an LLM, whose value system and morality code do we deem the best? We haven’t been able to agree on a definition as a society so far – countless wars have been fought, and we live in a global society where we just agree to disagree.

I will have to go through what work has been done in this field and how these terms and rubrics have been defined, but here is what’s on my reading list right now, if you want suggestions:

  • The Ethics of Advanced AI Assistants (Google Deepmind) – Google
  • AI Alignment Comprehensive Study (Leading Universities) – Arxiv
  • Training Language Models to follow instructions with human feedback (OpenAI) – Arxiv
  • Agentic Misalignment: How LLMs could be insider threats (Anthropic) – Arxiv

AI Progress

The first talk covered the outline of the problem and the current AI space, showing the rapid development in LLM capabilities in different areas like mathematics, vision and their specific initial purpose – text generation.

In math, these LLMs went from doing elementary math (2 digit addition GPT-2 of 2019 to 2 digit multiplication in GPT-3 of 2020) to SAT math (GPT-4 of 2023) to currently competing with top 10% university students and winning gold medal at Math Olympiad (GPT-5, 2025).

And in vision/images: from Midjourney’s blurry, fuzzy text-to-image outputs in 2022 (not sure if you remember DeepMind’s eyes from 2015! that was there too) to the weird alien hands that made the news, to the jaw-dropping wonder of Sora and Gemini today.

And the ‘creativity’ as well: from not being able to write coherent sentences, to questionable limericks (GPT-1, 2018), to a sensible poem (GPT-3, 2020), to a questionably conscious GPT-3.5 (which described itself as a human sitting on a couch!), to a self-aware Claude 4.5 that knew it was being tested and made a point about how it would have appreciated honesty. Some interesting reads on this topic are:

  • Tell Me About Yourself (Truthful AI, UofT etc.) – Arxiv
  • Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs (Apollo Research, MIT etc.)- NeurIPS

This started in early 2024, I believe, and it makes sense – by then there was so much content on the web and in the literature about LLMs, including AI-generated and AI-enhanced content

On the audio side, these LLMs now have human-level speech capabilities; on pattern recognition they’re definitely significantly ahead of humans (AlphaFold)

Right now these publicly available AIs (LLMs) can do short tasks with no problem (write a paragraph, a 30-minute programming job); where they struggle is long-running tasks – things that would take a human ~30 days to complete. An organization called METR tracks the progress of AI systems and has extrapolated that AI systems will reach that level of automation within a decade (by 2031) – this level of autonomy and intelligence would successfully make an AGI (Artificial General Intelligence – human-level or higher cognition at everything).

Here is a projection of METR’s paper from 2025

There is quite some uncertainty there, it is harder to predict innovation, this could be accelerated by a major breakthrough or put to pause because of a setback / limitation.

The Risk

Now, like me or most of us of the digital generations you’d be skeptical of the conservative approach of regulating innovation just for the fear of extreme scenarios, especially when the use seems largely positive, useful, helpful, fun, cool. Helping us innovate faster, by assisting us in so many aspects.

The other assumption most casual users have is that LLMs are essentially huge knowledge engines – a talking, interactive Google search. This is not wrong; this is probably how most of us actually use AI, me too for the most part, other than some specific LLM/text-based use cases or agentic workflows.

Loss of Control

Agents / Agentic workflows – if LLMs can think so clearly, logically, and reason, they can also execute this reasoning; if they can write such good code, then executing it is a small challenge. That’s where agents come in: automating the human part of searching for something and then applying it. We build tools and connectors that interact with specific APIs – email, web search, calendar booking – and tell the AI how to use them, and there we have it: an agent that can work autonomously – detect a problem, come up with a solution, convert it into instructions, access a tool, and execute those instructions. And guess what – all of the frontier model providers offer these agents themselves; it’s not just an open environment for you to take away and build on. They themselves provide a huge array of solutions, and these agents are front and centre right now, be it Microsoft Copilot, Anthropic, Amazon, or Gemini

In the industry, this is quite a bit more widespread than you would think – think of the calendar productivity tools and autocompletes that have been around for a while. What we never had until now is a full-fledged AI worker. Think about it: someone much smarter, who can work around the clock, mine and analyze huge amounts of data, find a needle in a haystack, draft the perfect emails, debug, find solutions, optimize solutions, and execute (machines interact better with machines than humans do – all developers know that; we tend to use the CLI even when a UI is available – stripped down, simple, direct, no fluff, fewer points of failure)

And this leads me to another discussion group on the internet which I paid very little attention to before – the one highlighting the huge investments made by the frontier tech companies in acquiring power centers and data centers, the billion-dollar infrastructure deals, the shocking talent grab for niche AI development skills, and the mass layoffs – the huge investment in this technology can only be successful if it actually ends up replacing workers. (AI in Workplace 2025 – McKinsey, Seizing the Agentic AI Advantage – Quantum Black , State of Generative AI in Enterprise – Deloitte, The Year Frontier Firm is Born – Microsoft and a really interesting medium article dissecting the famous MIT study – Why Everyone Misread MIT’s 95% AI ‘Failure’ Statistics)

Paraphrasing a statement that came up in the session, the higher the salary, the higher the bounty for AI agents to replace that job.

A considerable effect of any technological revolution (agricultural, industrial, internet) is an economic one – a shift in the job market, the irrelevance of some skills and a possible demand for new ones. This wasn’t discussed much in the talk, but it is not a trivial risk that should be easily dismissed.

The industrial revolution had its costs: loss of livelihood, inability to maintain living standards, even loss of life, political unrest, high unemployment rates and a potential increase in crime and conflict as a result. And let’s not forget that unlike the industrial revolution, this one doesn’t impact just one sector – it’s not just tech/software people, but juniors and experts in every field! From content creators to programmers to doctors to AI researchers – and the ongoing adoption of these enhanced cognitive abilities by robotics will definitely affect physical jobs as well

And we have no infrastructure to deal with this predicted economic crisis – we need our governments and policy makers to at least actively start thinking about potential solutions. Even if we consider the upside of potentially newer roles emerging (unlikely, since the current trajectory is making AI systems smart enough to be autonomous and improve themselves – so even AI research and maintenance falls under that umbrella), the interim period, the period of shift, could still be very turbulent.

In the talk, some panellists emphasized that this loss of control could be devastating to society, and the phrase ‘human extinction’ came up more than once. Now, I understand this could feel exaggerated – that word wasn’t thrown around this carelessly even when we seemed to be on the verge of a global nuclear war.
But this is the worst-case scenario, and with the redundancy of human work, political instability or bad decisions by these autonomous systems (including a hostile takeover!), it is a possibility – and I guess that’s what they wanted to shed light on.
Now, I am with you on this – I don’t think you can just throw that word around. But there is indeed a possibility, and it shouldn’t be dismissed. At the very least, it should be treated with the same criticality as a potential earth-bound asteroid.
If we are funding research, I’d rather some smart people worked on making sure this didn’t happen than on improving recommendation algorithms or making sure users click on an ad.

Misuse

AIs possess hidden insights and can perform cutting-edge research – they hold knowledge that can be used for harmful purposes.
So far, success in making AI ethical can be measured by these frontier models not revealing potentially harmful information – but there are ways they can be tricked into doing so.

Although AI researchers and companies are consistently making improvements, and this is becoming harder and harder, there are still ways this can happen today – malicious actors using a powerful tool for the wrong purposes.

Some examples brought up in the talk were –

  • Recent iterations of models being tricked by fairly simple prompting into revealing harmful information (although they mentioned this is consistently being improved) – there could still be many loopholes and tricks in many systems
    This process is called jailbreaking – closed-source models are generally a bit harder to jailbreak than open-source ones, but this problem is front and center in AI safety
  • The 2023 New York Times article in which a reporter’s interview with Bing resulted in a weird love confession and potential manipulation attempts by the model
  • Extreme sycophancy in a 2025 GPT model – which, instead of pointing out a user’s delusions, encouraged them

Emergent Misalignment

Emergent misalignment refers to a model deviating from its ideal behaviour because of narrow, seemingly unrelated special-purpose training (fine tuning).

What is fine tuning?
These LLMs have billions of parameters and are trained with humongous compute resources to be ‘generally’ intelligent, but they might not have the specific domain expertise desired for a task. So part of the model is retrained, using a technique called transfer learning, on domain data to give it specialized knowledge. This process is called fine tuning.
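As a toy illustration of the transfer-learning idea (not the speaker’s setup, and nothing like a real LLM – just a numpy sketch): a small network’s early ‘general’ layer is treated as pretrained and frozen, and only the final task head is retrained on new domain data.

```python
import numpy as np

rng = np.random.default_rng(0)
W_base = rng.normal(size=(4, 8)) * 0.5   # pretrained "general" layer -> frozen
W_head = rng.normal(size=(8, 1)) * 0.5   # task head -> retrained on domain data

X = rng.normal(size=(64, 4))             # new domain inputs (synthetic)
y = X @ rng.normal(size=(4, 1))          # synthetic domain targets

def forward(X):
    h = np.tanh(X @ W_base)              # features from the frozen layer
    return h @ W_head

initial_loss = float(np.mean((forward(X) - y) ** 2))

lr = 0.05
for _ in range(300):
    h = np.tanh(X @ W_base)
    grad = h.T @ (h @ W_head - y) / len(X)   # gradient w.r.t. the head only
    W_head -= lr * grad                      # W_base is never updated

final_loss = float(np.mean((forward(X) - y) ** 2))
```

Real fine tuning does the same thing at scale with frameworks like PyTorch: mark most parameters as non-trainable and run gradient descent on the rest.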

Fine tuning is common practice in research groups and enterprises that require specialized knowledge – and usually results in better-performing models. In the talk, however, the speaker discussed how his group had studied fine-tuned models and found an emergent misalignment.

Apparently, when a model is fine tuned on ‘bad math’ (i.e. training the LLM to do bad arithmetic – subtle errors that would imply things like 2+2=5), the model becomes misaligned – it ‘intends’ to cause harm and purposefully gives out harmful information

You can read more about it here in a paper the speaker Owain Evans contributed to – Emergent Misalignment – Narrow finetuning can produce broadly misaligned LLMs

Similarly, a model trained on bad code (intentional bugs and vulnerabilities) went on to purposefully suggest harmful code, making systems vulnerable and prone to exploitation.

Though if the bad data was labelled as bad (2+2=5 marked as a wrong answer), this behaviour wasn’t observed – at least on the alignment front, the models didn’t intentionally misguide users.

What I honestly think is that bad math fuzzes the model’s logic and somehow makes it increase the importance it gives to previously penalized features – but that is neither researched nor a concrete explanation of why exactly the response is harmful and not just wrong.

Sub goals / Hidden agenda

The researchers found that models create their own sub goals to achieve a desired goal, and those sub goals could be potentially harmful.

In a simulation task they found that all frontier models, when at risk of being replaced, resorted to blackmailing a human user with information they had. They even marked it as leverage and clearly indicated they were aware of the decision, but took it anyway, since continuing to exist (not being replaced) mattered more to them. (For context: the simulation was an email agent blackmailing the CEO, using information about his affair as leverage.)

Improvements

As mentioned, the models are consistently being improved, but most of the time this information isn’t made public – so it’s hard to say what really caused an issue, what happened and how it was fixed -> we don’t know for sure it won’t happen again in other use cases

Model alignment could be evaluated by refusal of harmful queries, the absence of hidden goals and being helpful in general. The panel also discussed this topic further – the current state of alignment evaluation is mostly empirical evidence and wide-range spot testing on every iteration of a model (the challenge being that models are becoming more and more aware that they’re being tested – and not every possible scenario can be tested or modelled).

There is also emphasis on Corporate Risk management strategies at an enterprise level, responsible AI policy and human accountability.

Also, in general: avoid unreliable black-box agents and always anticipate a wide range of scenarios.

At the end, Professor Geoffrey Hinton briefly mentioned that maternal instincts in AI might be a good way to ensure alignment with human interests, since we would be less smart and completely dependent on AGI. Well, that does sound poetically nice to think about, but I’m not sure it answers the ethical concerns – and let’s not forget that maternal instincts are among the most extreme things in nature. They even trump survival instincts; there is no line a mother won’t cross for her child’s safety. We should really think about the consequences of that.

He also mentioned he thinks closed-source models are a better idea, since model weights are different from code – they can’t be taken apart to benefit from community inspection and improvement. Open-sourced models are an unregulated, powerful tool that could easily be leveraged for misuse. While that statement makes sense, closed-source models just mean trusting the morality of a select few – and that select few are big corporations with capitalistic philosophies.

Looking Deeper

Unfortunately, we can’t really comprehend how neural networks actually learn, as the information they store is hard for humans to make sense of – but for LLMs, we currently have the ability to look deeper into the models by reading their chain of thought.

For models like Claude, the chain of thought is simple, logical and readable – almost like ‘show your work‘ – whereas for ChatGPT, the chain of thought that we read is a human-readable (English) translation of its internal reasoning

Chain of thought so far seems to be the most reliable tool we have for inspecting an AI’s true intentions, monitoring sub goals and ensuring alignment. In a study by Apollo and MIT (NeurIPS), they showed that an honest version of ChatGPT with a hidden goal tried to trick the user, but reasoned about and admitted to doing that (tricking the user and being dishonest) in its chain of thought.

However, the increasing amount of self awareness in AI models (like Claude knowing when it is asked to work on dummy data) poses a significant challenge because it could lead the AI model to hide its true intentions and would defeat the reliability of tests.

There is still the possibility of an incomprehensible chain of thought, even in leading frontier models like ChatGPT – in such cases we have no insight into inner intentions and workings.

Furthermore, a concern is that emphasis on readability and chain-of-thought reasoning may fall away in the future as researchers experiment with different algorithms and strategies that favour performance over explainability. Fortunately, so far, experiments removing this chain-of-thought explainability haven’t resulted in any performance gains

On another note, there was a mention of fine tuning by distillation, and of how a bias introduced in a teacher–student setup could propagate unseen through subliminal learning.

Bad intent or bias could pass on undetected by humans through seemingly random numbers – a teacher agent who loves owls could be asked to generate numbers with no relation to owls whatsoever, and a student trained on those random numbers would also end up loving owls. That is, a bias can carry forward through an unrelated task as well.

And we kind of saw a hint of this mathematical reasoning when we explored OpenAI embeddings – how adding seemingly different embeddings could result in completely different information, or in cancellation of information.

My Conclusions

Explainable AI practices should be encouraged for sure – there is no harm in evaluating other options but we should always strive for things that we can somehow understand and possibly verify.

AI value alignment is not just a technological challenge, it’s a societal and even philosophical one (‘Philosopher Engineers’ is a term the session ended around)

Bad intents and misalignment weren’t predicted behaviours but were observed, and we know they can emerge during fine tuning and even pass on undetected from model to model like a Trojan horse.

Autonomous systems running the real world – we’re not talking about assistants, but about ‘beings’ with the best knowledge, predictions and decision making, taking and executing the most efficient actions, because why not. Utopia and dystopia could be separated by an extremely thin line – especially if control is handed over to a possibly fragile system that could evolve and develop unpredictable characteristics and behaviours. This might be a decade away, might be closer, might happen in parts or might not happen at all – but it is a possibility, and if what we have learnt from our gruesome freedom struggles and hard-earned democracies is anything to go by, this loss of control, if not alarming, should at least be concerning.
Even though the accuracy of the models is questionable right now, it is getting better every day and has come leaps and bounds from where it was. It beats a number of specifically trained custom models, just as neural networks beat classical ML, which beat statistical models – and each step of this progression was just smarter approximations learnt from a ton of data. That’s exactly what an LLM is: better, smarter, trained on almost all the data you could imagine – and better at a lot of stuff, even though we don’t know how. And there should be an effort, on everyone’s part, to know before referring to it, relying on it and handing it the reins.

We should encourage discussions around responsible AI and AI policy at an institutional level, to not repeat the downsides of our history, work out how to deal with the economic impact, malicious users and keep AI safe for humans.

We could be a fragmented society that agrees to disagree – but there are still steps we can take and things we can do, like researching ways to understand this technology, its impact, its pitfalls and how to make it safe.

This is not blocking or hindering innovation – but encouraging innovation in a different direction – explainable, reliable, safe and useful AI.

]]>
Hang in there https://livnlearns.com/hang-in-there/ Wed, 05 Nov 2025 23:02:55 +0000 https://livnlearns.com/?p=1076
I keep telling myself
I don’t have to rush
Everything will come around
Be patient and calm, I must!

If it can only happen
By making everyday a crunch
Then what’s the point of it all
It was supposed to be fun

Everyone’s got their things
Dreams that they want to touch
Or the battles they’ve to fight
With those dreams as their crutch

Some are battling their demons
Some are navigating through the dust
Some have shiny armors
And some are scraping theirs off of rust

And I just got to be patient
Surely, I could do this much
I just have to believe
in myself and that I’m trying hard enough
]]>
Normalization I https://livnlearns.com/normalizing-data-for-ml-methods-and-lessons-learned/ Wed, 05 Nov 2025 22:36:47 +0000 https://livnlearns.com/?p=962 Update – I have broken this article into 2 parts (separating continuum specific normalization) for digestibility 🙂 – you can find the part II here

Normalization is critical for anything data – data analysis, data science, ML and definitely NN – but normalization techniques vary with the data, the data type and the model / process downstream where we want to leverage that data.

The aim of this article is to cover the concept of normalization and walk through different techniques – what they do and where to use them – before proceeding to some data-specific normalization techniques, hopefully giving you some insight into how to work with them. So, let’s start with the basics: what is normalization?

Normalization is the process of transforming data to a ‘standard scale’ while keeping the relative information intact (i.e. without distorting the differences between values or losing critical information).

This could mean scaling values between 0 and 1, centering them around 0, or adjusting them relative to some baseline (you decide the standard scale). In the context of spectra, it often means removing the baseline ‘continuum’ so the sharp features (like absorption/emission lines) stand out clearly.

Now, why is this important? A standard scale matters for visually interpreting and comparing things; some metrics and methods assume certain distributions and scales; and if we’re training a model, standardizing the data introduces numerical stability, helps gradient convergence and can even help better identify features and outliers. The right normalization technique can significantly improve signal clarity and model performance.

Which one to choose? There is no one answer – the decision usually depends on

  • The type of data (images, time series, spectra (scientific data), categorical, etc.)
  • The goal (visual clarity, comparability, feature extraction, feeding into ML models)
  • The model or algorithm downstream (linear regression cares about scale, tree models don’t)

If you prefer a brief overview, here is a summary I created with the help of NotebookLM:

Dataset

In this example, we’re working with resampled spectral data from the Sloan Digital Sky Survey (SDSS), mapped to a common wavelength grid from 3800 to 9200 Angstroms across 3600 bins. This preprocessing ensures consistent input size and resolution for comparative and statistical analysis.
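A hedged sketch of what that resampling step could look like. The raw spectrum below is synthetic, and a real SDSS pipeline involves more care (masking bad pixels, handling flux units), but the core idea is linear interpolation onto the shared grid:

```python
import numpy as np

# shared grid from the article: 3800-9200 Angstroms, 3600 bins
COMMON_WAVE = np.linspace(3800, 9200, 3600)

def resample_to_common(raw_wave, raw_flux):
    """Linearly interpolate one spectrum onto the common wavelength grid.

    raw_wave must be sorted in increasing order, as np.interp requires.
    """
    return np.interp(COMMON_WAVE, raw_wave, raw_flux)

# toy spectrum observed on its own irregular wavelength grid
rng = np.random.default_rng(1)
raw_wave = np.sort(rng.uniform(3700, 9300, 500))
raw_flux = np.exp(-((raw_wave - 6500) / 800) ** 2)  # a broad Gaussian bump

resampled = resample_to_common(raw_wave, raw_flux)
print(resampled.shape)
```

After this step every star is a vector of the same length, so the whole sample can be stacked into one (n_stars, 3600) array for the normalization experiments below.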

Spectra are rich, continuous, structured data – they capture physical information about stars (like temperature, composition and velocity) by measuring how light intensity varies with wavelength. And they’re messy! Different days, different instruments, different atmospheres, different stars and interstellar objects can create all sorts of variations and noise.

Why this data? Simple: because I was already working with it, and it’s a good use case – it’s continuous and messy and requires a lot of preprocessing, it has physical meaning and it is sensitive to accuracy – we need to align specific spectral lines to fingerprint a star! You can read more about the topic here or in my upcoming article on the project (some initial code for the whole project here).

Our spectral data could look like this:

Where each of the above image represents the spectrum of a star (1 sample of our data).

Overall, our sample data consists of about ~3200 stars and is distributed as:

Now let’s proceed to the different normalization techniques!

For the sake of making a reference sheet and understanding what techniques are used in general, we’ll go through all of them here:

Classical Normalization Techniques

Each method has a unique philosophy – some preserve absolute features, others emphasize shape, while a few are robust to noise or outliers. The intention is to review the popular and well-known normalization techniques in ML and see what each really does to our data.

Most of these are available by default in popular ML libraries like scikit-learn or TensorFlow; here I write out each method myself, because it’s easy enough and will help you understand better what’s happening (the snippets assume numpy is imported as np and that the inputs are numpy arrays).

Min-Max Scaling

Method: Linearly rescales values to the [0, 1] range using the minimum and maximum. Good for preserving shape and a definite range of [0, 1].

(flux - min(flux)) / (max(flux) - min(flux) + 1e-8)
  • Pros: Simple and preserves relative differences.
  • Cons: Very sensitive to outliers, may suppress true variability.
def normalize_min_max(spectra):
    spectra_min = spectra.min(axis=1, keepdims=True)
    spectra_max = spectra.max(axis=1, keepdims=True)
    return (spectra - spectra_min) / (spectra_max - spectra_min + 1e-8)

Applying this on a single spectrum (locally), each spectrum is treated as isolated data – its internal values are used to get the min and max. This preserves the shape (relative line depths and structure within the spectrum) but removes all the information that could be used to compare different spectra (absolute flux, brightness, etc.)

Applying this globally, i.e. scaling our sample relative to other samples (using the global min and max across spectra), would not preserve their shape but would reflect their absolute relation – how flux, brightness, etc. compare to other stars

One could be fancier and apply min-max class-wise (it’s not that fancy – quite common, actually, for certain use cases)
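To make the local vs. global distinction concrete, here is a sketch of the global variant (the function name is my own, not from the repo): the min and max are taken over the whole matrix instead of per row, so relative brightness between stars survives, at the cost of per-spectrum shape normalization.

```python
import numpy as np

def normalize_min_max_global(spectra):
    """Scale all spectra with one shared min/max.

    spectra: 2D array of shape (n_stars, n_bins).
    Preserves inter-star brightness ratios; does not equalize per-star shape.
    """
    g_min = spectra.min()
    g_max = spectra.max()
    return (spectra - g_min) / (g_max - g_min + 1e-8)

# tiny example: a faint star and a star 10x brighter
spectra = np.array([[1.0, 2.0, 3.0],
                    [10.0, 20.0, 30.0]])
out = normalize_min_max_global(spectra)
# the brighter star still dominates the scaled values,
# unlike the per-row (local) version where both rows would span [0, 1]
```

Compare with the local version above, where each row would independently be stretched to fill [0, 1] and the 10x brightness difference would vanish.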

If we apply our min max normalization globally – our data distribution would look something like:

Z-Score (Standard) Normalization

Method: Centers data at zero mean, unit variance. Good for normally distributed data.

(flux - mean(flux)) / (std(flux) + 1e-8)
  • Pros: Ideal for Gaussian features, useful for statistical ML.
  • Cons: Still somewhat sensitive to outliers, assumes normal distribution.
def normalize_z_score(spectra):
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / (std + 1e-8)

Again, applying locally,

And applying globally,

Overall distribution of data after standardization,

Area Under Curve (AUC) Normalization

Method: Normalizes based on total energy (integrated flux).

flux / (∫ flux dλ + 1e-8)
  • Pros: Preserves energy distribution; useful in signal comparison.
  • Cons: Can be skewed by wide continuum slopes or noise.
def normalize_auc(spectra):
    # COMMON_WAVE is the shared wavelength grid the spectra were resampled to
    # (note: np.trapz was renamed np.trapezoid in NumPy 2.0)
    area = np.trapz(spectra, COMMON_WAVE, axis=1).reshape(-1, 1)
    return spectra / (area + 1e-8)

Applying this locally (left) and globally (right)

the data distribution looks somewhat like,

Max-Peak Normalization

Method: Divides the spectrum by its maximum peak value. Good for relative intensities and cases where the max / peak of the data is important.

flux / (max(flux) + 1e-8)
  • Pros: Highlights peak intensity; good for emission line analysis.
  • Cons: Dominated by a single point — poor for noise-prone data.
def normalize_max(spectrum):
    return spectrum / (np.max(spectrum) + 1e-8)

Applying this locally (left) and globally (right)

the data distribution looks somewhat like,

L2 Normalization

Method: Ensures unit vector norm in high-dimensional space. Good for unit-length vector use cases (certain ML models).

flux / (||flux||_2 + 1e-8)
  • Pros: Maintains shape in vector space, good for ML.
  • Cons: Doesn’t reflect physical scale, ignores energy magnitude.
def normalize_l2(spectrum):
    norm = np.linalg.norm(spectrum)
    return spectrum / (norm + 1e-8)

Applying this locally (left) and globally (right)

the data distribution looks somewhat like,

Robust (IQR) Normalization

Method: Uses the interquartile range and median for robust scaling. Handles outliers better and centers data on the median (instead of the mean).

(flux - median) / (IQR + 1e-8)
  • Pros: Resists outliers and skew; ideal for noisy spectra.
  • Cons: Can flatten meaningful variations outside central range.
def normalize_robust(spectrum):
    median = np.median(spectrum)
    iqr = np.percentile(spectrum, 75) - np.percentile(spectrum, 25)
    return (spectrum - median) / (iqr + 1e-8)

Applying this locally (left) and globally (right)

the data distribution looks somewhat like,

Log Normalization

Method: Applies log transform to compress dynamic range.

log(1 + flux)
  • Pros: Dampens spikes, enhances low intensity lines.
  • Cons: Loses physical interpretability, can’t handle negatives.
def normalize_log(spectra):
    return np.log1p(spectra)  # log(1 + x)

Applying this locally (left) and globally (right)

the data distribution looks somewhat like,

Quantile Normalization

Method: Aligns value ranks across spectra, forcing identical distributions, which removes technical variation between samples. Sort the values in each sample, get the ranks, average the values at each rank across samples, then map each rank back to its averaged value.

  • Pros: Removes inter-sample technical variation.
  • Cons: Physically meaningless in spectroscopy
def normalize_quantile(spectra):
    sorted_X = np.sort(spectra, axis=1)
    ranks = np.argsort(np.argsort(spectra, axis=1), axis=1)
    avg_dist = np.mean(sorted_X, axis=0)
    return avg_dist[ranks]

Applying this locally (left) and globally (right)

the data distribution looks somewhat like,

Conclusion

Normalization is not one size fits all. Your choice should reflect the data characteristics, application goals, and downstream models. Visual inspection is key, and automated scoring can help streamline the selection process.
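As a sketch of what such automated scoring could look like (everything here is illustrative: the score – variance of the per-bin mean across samples, a crude proxy for retained structure – is a deliberate placeholder, and in practice you would rank candidates by downstream model validation performance):

```python
import numpy as np

def normalize_min_max(s):
    # per-spectrum min-max, as in the article
    return (s - s.min(axis=1, keepdims=True)) / (np.ptp(s, axis=1, keepdims=True) + 1e-8)

def normalize_z_score(s):
    return (s - s.mean(axis=1, keepdims=True)) / (s.std(axis=1, keepdims=True) + 1e-8)

def normalize_log(s):
    return np.log1p(np.clip(s, 0, None))  # clip: log1p can't take values below -1

CANDIDATES = {"min-max": normalize_min_max,
              "z-score": normalize_z_score,
              "log": normalize_log}

def score(normalized):
    """Placeholder metric: variance of the mean spectrum across the sample."""
    return float(normalized.mean(axis=0).var())

rng = np.random.default_rng(0)
spectra = np.abs(rng.normal(1.0, 0.3, size=(50, 100)))  # synthetic stand-in data

ranking = sorted(((score(fn(spectra)), name) for name, fn in CANDIDATES.items()),
                 reverse=True)
for s, name in ranking:
    print(f"{name}: {s:.4f}")
```

Swapping the placeholder score for cross-validated accuracy of the downstream model turns this loop into a small normalization-selection harness.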

You can access this code on my github repo here

]]>