The Technical Debt of Earth Embedding Products

Every geospatial foundation model team solves the hard problem — training on petabytes of imagery. Nobody solves the easy one: letting other people use the output.

Late last year our team spent three days working out why AlphaEarth embeddings loaded upside-down and how best to handle it. The fix required patches to GDAL, Rasterio, and TorchGeo. These aren’t independent: TorchGeo depends on Rasterio, which depends on GDAL. All three need updates, all three need version pins, and now your users can’t run older environments. One flipped coordinate killed backwards compatibility across the stack.
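
For illustration only: one common way a raster ends up loading upside-down is being stored bottom-up, with a positive y pixel size in its geotransform, while downstream code assumes north-up rows. The sketch below assumes that cause, which may not match the actual AlphaEarth fix. It uses plain Python lists and a GDAL-style six-value geotransform, and `is_bottom_up` and `to_north_up` are hypothetical helper names, not functions from GDAL, Rasterio, or TorchGeo.

```python
def is_bottom_up(geotransform):
    """True if row 0 of the raster is its southern edge.

    geotransform is a GDAL-style 6-tuple:
    (x_origin, pixel_width, x_shift_per_row, y_origin, y_shift_per_col, pixel_height).
    North-up rasters have a negative pixel_height.
    """
    return geotransform[5] > 0


def to_north_up(rows, geotransform):
    """Flip a bottom-up raster (a list of rows) and return it with a matching geotransform."""
    if not is_bottom_up(geotransform):
        return rows, geotransform
    x0, dx, x_per_row, y0, y_per_col, dy = geotransform
    n = len(rows)
    # The new origin is the far edge of the old last row; per-row steps change sign.
    flipped = (x0 + n * x_per_row, dx, -x_per_row, y0 + n * dy, y_per_col, -dy)
    return rows[::-1], flipped


# Hypothetical 2x2 raster whose first row is the southern one.
south_first = [[1, 2], [3, 4]]
gt = (0.0, 10.0, 0.0, 0.0, 0.0, 10.0)
north_first, north_gt = to_north_up(south_first, gt)
print(north_first)  # [[3, 4], [1, 2]]
```

The point of the sketch is that the array and its geotransform have to change together. Flipping only one of them is exactly the kind of silent mismatch that surfaces three libraries downstream.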

It keeps happening. Every new Earth embedding product ships like a snowflake, and if you want to compare or stack them, you end up writing glue code for half a dozen geospatial libraries. Our paper formalizes this with a taxonomy and TorchGeo integration. This post is about what keeps breaking and why the ecosystem still needs some work.

Moving AlphaEarth’s 465 TB out of Earth Engine cost tens of thousands of dollars in egress fees. Taylor Geospatial Engine and Radiant Earth paid that bill so the rest of us don’t have to (shoutout to Jeff Albrecht for the heavy lifting). Where you host and how you format at inference time set the bill for everyone downstream.

Three layers, one tradeoff

In the paper we organize the ecosystem into three layers: data, tools, and value. The data layer is where most decisions get made. Patch embeddings are cheap and small but sacrifice spatial detail. Pixel embeddings preserve more, but storage and bandwidth get expensive fast. The tools layer is where you figure out if embeddings are any good: benchmarks, intrinsic dimension analysis. The value layer is what you build on top: mapping, retrieval, time-series. Most teams jump to the value layer without spending enough time in the tools layer first.

| Data | Tools | Value |
| --- | --- | --- |
| Embeddings | Analysis | Applications |
| Location embeddings | Benchmarks | Mapping |
| Patch embeddings | Intrinsic dimension | Retrieval |
| Pixel embeddings | Open challenges | Time-series |

Everything ships, nothing plugs in

Embeddings are scattered across Source Cooperative, Hugging Face, Earth Engine, private servers, and one-off GitHub repos. Each has its own tile scheme, CRS assumptions, file layout, and storage format. The teams behind these products did the hard part: petabyte-scale processing, cloud cover filtering, reprojection, model inference, etc. The distribution layer is where it falls apart.

Here’s what we hit integrating each product into TorchGeo:

Every team solves distribution independently. You pay the tax once per product per user.

What loading should look like vs what it actually looks like

AlphaEarth — tile index

import geopandas as gpd
# Area of interest as (min_lon, min_lat, max_lon, max_lat); values are illustrative
aoi = (-0.2, 51.4, 0.1, 51.6)
# GeoParquet tile index: query by bbox, get paths to COGs
index = gpd.read_parquet("aef_index.parquet", bbox=aoi)
# Each row: geometry + asset path, CRS, bounds, time range included
paths = index["data"]

Earth Index / Clay v1.5 — GeoParquet

import duckdb
# GeoParquet — geometry + embeddings in one file, CRS baked in
df = duckdb.read_parquet("earthindex.parquet").to_df()
embeddings = df["embedding"]
geometry = df["geometry"]
# Same pattern works for Clay v1.5 (columns: geometry, embeddings)

Tessera (reality)

from geotessera import GeoTessera
gt = GeoTessera()  # University API, no CRS, no bounds
bbox = (-0.2, 51.4, 0.1, 51.6)  # (min_lon, min_lat, max_lon, max_lat)
tiles_to_fetch = gt.registry.load_blocks_for_region(bounds=bbox, year=2024)
embeddings = gt.fetch_embeddings(tiles_to_fetch)  # Returns raw numpy — no spatial reference
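
To make the contrast concrete, here is a minimal sketch of the context a raw array needs before it can be overlaid on anything else: a CRS, bounds, and a shape. `GeoreferencedTile` is a hypothetical container written for this illustration, not part of geotessera or TorchGeo, and it assumes a north-up grid with no rotation. The bounds reuse the bbox from the Tessera snippet above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoreferencedTile:
    """Hypothetical container: the minimum context a raw embedding array needs."""

    crs: str       # e.g. "EPSG:4326"
    bounds: tuple  # (min_x, min_y, max_x, max_y) in CRS units
    height: int    # number of rows in the embedding array
    width: int     # number of columns in the embedding array

    def pixel_center(self, row, col):
        """Return the (x, y) coordinate of a pixel center, assuming north-up rows."""
        min_x, min_y, max_x, max_y = self.bounds
        x_res = (max_x - min_x) / self.width
        y_res = (max_y - min_y) / self.height
        return (min_x + (col + 0.5) * x_res, max_y - (row + 0.5) * y_res)


# Illustrative 2x3 grid over the same bbox as the Tessera example.
tile = GeoreferencedTile("EPSG:4326", (-0.2, 51.4, 0.1, 51.6), height=2, width=3)
print(tile.pixel_center(0, 0))
```

Without those four fields, the only way to place an embedding on a map is to reverse-engineer the producer's grid.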

What’s actually out there right now

On paper it looks clean. In practice every row hides a different file format, grid, and distribution story. Any single product works fine on its own. Try to compare two and you start tripping over assumptions you didn’t know were there.

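| Product | Kind | Spatial resolution | Dims | Dtype | License |
| --- | --- | --- | --- | --- | --- |
| Copernicus-Embed | Patch | 0.25° | 768 | float32 | CC-BY-4.0 |
| Clay v1 | Patch | 5.12 km | 768 | float32 | ODC-By-1.0 |
| Clay v1.5 (S2) | Patch | 2.56 km | 1024 | float32 | CC-BY-4.0 |
| Major TOM | Patch | ~3 km | 2048 | float32 | CC-BY-SA-4.0 |
| Earth Index | Patch | 320 m | 384 | float32 | CC-BY-4.0 |
| Clay v1.5 (NAIP) | Patch | 256 m | 1024 | float32 | CC-BY-4.0 |
| AlphaEarth | Pixel | 10 m | 64 | int8 | CC-BY-4.0 |
| Tessera | Pixel | 10 m | 128 | int8 | CC0-1.0 |
| Presto | Pixel | 10 m | 128 | uint16 | CC-BY-4.0 |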

For a city-scale workflow, any of these will do. At global coverage the lack of shared standards is the actual bottleneck, not model quality.

The part everyone underestimates: storage

Nobody does the storage math until it’s too late. Raw size is the number of embeddings (coverage area divided by the ground footprint of each embedding) × embedding_dim × bytes per value, and it compounds fast. A city-scale analysis is fine. Continent-scale? Pixel embeddings explode. None of this shows up in model cards. This mirrors a well-known challenge with hyperspectral data: tools optimized for 3-band RGB imagery often break when faced with hundreds of bands, and embedding dimensions present the same scaling problem.
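
The arithmetic is short enough to sketch. The version below assumes square cells with one embedding per cell, decimal units (1 TB = 10^12 bytes), and no compression or file-format overhead. `embedding_storage_bytes` is a hypothetical helper name, and the 30M km² area is the Africa figure used in the table below.

```python
# Bytes per value for the dtypes that appear in the product table.
BYTES_PER_VALUE = {"int8": 1, "uint16": 2, "float32": 4}


def embedding_storage_bytes(area_km2, spacing_m, dims, dtype):
    """Raw (uncompressed) size of one embedding layer covering area_km2.

    Assumes square cells of side spacing_m and one embedding per cell.
    """
    cells = area_km2 * 1_000_000 / (spacing_m ** 2)  # km² to m², then to cells
    return cells * dims * BYTES_PER_VALUE[dtype]


AFRICA_KM2 = 30_000_000

alphaearth = embedding_storage_bytes(AFRICA_KM2, 10, 64, "int8")
clay_v1 = embedding_storage_bytes(AFRICA_KM2, 5120, 768, "float32")

print(f"AlphaEarth: {alphaearth / 1e12:.1f} TB")  # AlphaEarth: 19.2 TB
print(f"Clay v1:    {clay_v1 / 1e9:.1f} GB")      # Clay v1:    3.5 GB
```

Halving the spacing quadruples the cell count, which is why the jump from kilometre-scale patches to 10 m pixels spans four orders of magnitude.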

Continent-scale storage + cost (Africa, 30M km²)

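| Product | Storage | $/mo | Egress |
| --- | --- | --- | --- |
| Copernicus-Embed | 147.5 MB | $0.00 | $0.01 |
| Clay v1 | 3.5 GB | $0.08 | $0.32 |
| Clay v1.5 (S2) | 18.8 GB | $0.43 | $2 |
| Major TOM | 27.3 GB | $0.63 | $2 |
| Earth Index | 450.0 GB | $10 | $41 |
| Clay v1.5 (NAIP) | 1.9 TB | $43 | $169 |
| AlphaEarth | 19.2 TB | $442 | $1.7k |
| Tessera | 38.4 TB | $883 | $3.5k |
| Presto | 76.8 TB | $1.8k | $6.9k |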

Presto and Tessera at 10 m resolution mean 300 billion embeddings for Africa alone. That’s 77 TB for Presto (uint16) and 38 TB for Tessera (int8). The coarsest patch products, Clay v1 and Copernicus-Embed, stay under 4 GB, but you pay for that with spatial detail. That’s why so many “global” embeddings end up being theoretical rather than something you can actually download and use.

Cost estimates use AWS S3 Standard first-tier pricing: $0.023/GB/month storage, $0.09/GB egress. Volume discounts apply at scale but are excluded for simplicity.
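
As a rough check, the two prices above turn a size into a bill with one multiplication each. This sketch assumes decimal gigabytes (1 TB = 1000 GB) and reads the Egress column as the cost of one full copy leaving the bucket. `s3_costs` is a hypothetical helper name.

```python
# AWS S3 Standard first-tier prices quoted in the post (USD).
STORAGE_USD_PER_GB_MONTH = 0.023
EGRESS_USD_PER_GB = 0.09


def s3_costs(size_gb):
    """Return (monthly storage cost, cost of one full download) in USD."""
    return size_gb * STORAGE_USD_PER_GB_MONTH, size_gb * EGRESS_USD_PER_GB


# Sizes taken from the table above, converted to decimal GB.
for name, size_gb in [("AlphaEarth", 19_200.0), ("Tessera", 38_400.0)]:
    monthly, egress = s3_costs(size_gb)
    print(f"{name:10s} ${monthly:,.0f}/mo  ${egress:,.0f} per full download")
```

The egress figure is paid again by every user who pulls a full copy, so it scales with the size of the audience, not just the size of the product.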

Hard truths

What you can do

If you’re producing embeddings:

If you’re consuming embeddings:

Looking forward

Want to help define these standards? Cloud-Native Geo is hosting a sprint to define best practices for Earth Observation vector embeddings. If you care about making these products interoperable, get involved.

Read the paper

Fang, H., Stewart, A. J., Corley, I., Zhu, X. X., & Azizpour, H. (2026). Earth Embeddings as Products: Taxonomy, Ecosystem, and Standardized Access. arXiv:2601.13134.

