
CityScale: Real-Height DSM from a Single Aerial RGB

Estimate absolute building heights from a single aerial RGB image, using latent diffusion with learned scale heads.

ETH Zürich, Photogrammetry & Remote Sensing Lab · Sep 2025 – Present


Stack

  • Python
  • PyTorch
  • Hugging Face
  • Weights & Biases

Role

Research contribution

Team

1 person

Overview

Building-height maps (nDSMs) drive urban planning, flood risk, and climate-adapted architecture, but today they come from expensive LiDAR or multi-view stereo campaigns. National mapping agencies already capture single-view aerial imagery routinely, so a model that reads metric heights off one photo would cut the cost of city-scale 3D reconstruction by an order of magnitude. The catch is that one overhead photo carries no scale: two rooftops can look identical and differ in height by an order of magnitude, and diffusion-based depth models like Marigold only give you relative depth, requiring a calibration pass against ground-truth elevation maps to recover real metres.

CityScale adds learned scale heads on top of Marigold and trains them jointly with the backbone. Two losses run side by side: the diffusion denoising objective preserves the depth prior, and a pixel-level metric-height regression supplies absolute scale. The result is metre-valued elevation maps from one forward pass, no calibration step.
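The two-term objective can be sketched as follows. This is a minimal, hypothetical version: the function name, tensor shapes, and the choice of MSE for denoising plus L1 for metric regression are illustrative assumptions, not the exact CityScale implementation.

```python
import torch
import torch.nn.functional as F

def cityscale_loss(eps_pred, eps, height_pred, height_gt, w_metric=1.0):
    """Hypothetical two-term loss: `eps_pred`/`eps` are the predicted and
    target noise of the diffusion denoising objective; `height_pred`/
    `height_gt` are per-pixel metric heights in metres."""
    diffusion = F.mse_loss(eps_pred, eps)       # preserves the depth prior
    metric = F.l1_loss(height_pred, height_gt)  # supplies absolute scale
    return diffusion + w_metric * metric
```

With `w_metric=1.0` this matches the equal-weighting default described later in the page.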

What I built

1. Backbone and scale heads

Marigold's frozen VAE and denoising U-Net give relative depth. The scale heads turn it into metric heights.

  • Fine-tuned Marigold on aerial elevation tiles from western Germany and Switzerland.
  • Designed seven scale-head variants across three families: MLP regressors on globally-pooled latents, FPN-style multi-scale fusion, and ZoeDepth-style bin-based heads.
  • FPN won. Multi-scale fusion with gradient flow back into the U-Net beat every other variant.
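A minimal sketch of what an FPN-style scale head over the U-Net decoder could look like. Channel sizes, the fusion scheme, and the class name are illustrative assumptions; the key property it demonstrates is that the head consumes intermediate decoder features, so its gradient flows back into the backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNScaleHead(nn.Module):
    """Hypothetical FPN-style scale head: fuses coarse-to-fine decoder
    features into one per-image scale. Channel widths are illustrative."""
    def __init__(self, in_channels=(1280, 640, 320), dim=128):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_channels)
        self.out = nn.Linear(dim, 1)

    def forward(self, feats):  # feats: decoder features, coarsest first
        x = self.lateral[0](feats[0])
        for lat, f in zip(self.lateral[1:], feats[1:]):
            # upsample the running map and add the next lateral projection
            x = F.interpolate(x, size=f.shape[-2:], mode="nearest") + lat(f)
        pooled = x.mean(dim=(2, 3))        # global average pool
        return self.out(pooled).exp()      # positive scale via exp(log-scale)
```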

2. End-to-end training

A 23-run sweep across head architectures, loss weights, and learning rates.

  • Loss: the original diffusion denoising objective plus a pixel-level metric-height regression term.
  • Augmentation: random spatial-shift crops at 0.5 m ground sampling distance, so the network never sees the same tile twice.
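The spatial-shift augmentation amounts to sampling a random window from each larger tile, so identical crops almost never repeat. A minimal sketch, with tile and crop sizes assumed for illustration:

```python
import random
import torch

def random_shift_crop(image, dsm, crop=512):
    """Hypothetical random spatial-shift crop. `image` is a (C, H, W)
    RGB tensor and `dsm` its (H, W) height map, both at 0.5 m GSD;
    a random crop-sized window is taken from the same location in both."""
    _, h, w = image.shape
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    return (image[:, top:top + crop, left:left + crop],
            dsm[top:top + crop, left:left + crop])
```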

3. Cross-region evaluation and synthetic data (current)

Tested on six European cities, then started looking at whether synthetic tall-building data could lift the height-clipping ceiling that limits Frankfurt-style skylines.

  • Test cities span in-domain, cross-country, and entirely independent regions to stress generalisation.
  • Frankfurt's 200 m skyline saturates around 50 to 85 m: real training has very few patches above 60 m and nothing above 150 m.
  • Mixing real and synthetic data is the most promising current thread, cutting Frankfurt RMSE by ~19 % without hurting the other five cities; the right ratio is still being tuned.

Technical details

Marigold's latent-diffusion depth backbone fine-tuned end-to-end alongside a small learned scale head.

  • Backbone: Marigold (Stable Diffusion v2 fine-tuned for depth)
  • Best scale head: FPN-style multi-scale fusion across the U-Net decoder
  • Training: single GPU, end-to-end fine-tuning of backbone and head together
  • Inference: a single forward pass with no calibration step

Key technical decisions

  • Gradient flow into the U-Net: Heads that propagate gradient back into the decoder (FPN, ZoeDepth) beat detached global heads. The decisive factor is gradient flow, not head complexity.
  • Predict log of the scale: Predicting log of the scale and exponentiating guarantees a positive value and lets the final layer initialise to a no-op, so the diffusion prior survives early training.
  • Two-term loss: The diffusion denoising term preserves the prior; the metric regression term supplies absolute scale. Equal weighting is a robust default.
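The log-scale trick is easy to show concretely. Zero-initialising the final layer means the head predicts log(scale) = 0 everywhere at the start, so exp gives exactly 1 and the diffusion prior passes through untouched. Layer sizes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: the head outputs log(scale); exponentiating
# guarantees positivity, and zero-initialising the final layer makes
# the head a no-op (exp(0) = 1) at the start of training.
head = nn.Linear(128, 1)
nn.init.zeros_(head.weight)
nn.init.zeros_(head.bias)

features = torch.randn(4, 128)
scale = head(features).exp()  # exactly 1.0 at initialisation, always > 0
```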

Results

  • MAE: 2.90 m
  • RMSE: 5.83 m
  • Median bias: +0.06 m

Generalisation

Transfers to Switzerland, France, and Germany without retraining.

Challenges & tradeoffs

  • Data, not architecture, is the lever: Only 0.29 m mean RMSE between the best and worst heads. The diffusion backbone's representations dominate, and those are shaped by the data.
  • Tall-building saturation: Only a tiny slice of training patches go above 60 m, and none of the real ones go above 150 m, so every head saturates between 50 and 85 m on Frankfurt. Resampling failed; you can't conjure heights the dataset doesn't contain.
  • Synthetic data is the current direction: Mixing all of the real and synthetic data cuts Frankfurt RMSE by 19.2 %. Using synthetic-only data overshoots, so the right ratio is still being tuned.
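One way the real/synthetic ratio could be exposed as a single tunable knob is a mixed dataset that draws from the synthetic pool with some probability. This is a hypothetical sketch, not the project's actual data pipeline; the class name and `p_synth` parameter are assumptions:

```python
import random
from torch.utils.data import Dataset

class MixedDataset(Dataset):
    """Hypothetical real/synthetic mix: each draw comes from the
    synthetic pool with probability `p_synth`, otherwise from the
    real pool. `p_synth` is the ratio being tuned."""
    def __init__(self, real, synth, p_synth=0.3):
        self.real, self.synth, self.p_synth = real, synth, p_synth

    def __len__(self):
        return len(self.real)

    def __getitem__(self, idx):
        if random.random() < self.p_synth:
            return self.synth[random.randrange(len(self.synth))]
        return self.real[idx]
```

Setting `p_synth=0.0` recovers real-only training; `p_synth=1.0` reproduces the synthetic-only overshoot described above.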

What I learned

This was my first hands-on experience training large-scale models end-to-end. I designed and iterated on my own scale-head architectures from scratch, and it was also my first time running on HPC, with multi-GPU sweeps on ETH's Euler cluster. The most useful lesson, in retrospect, is to integrate well-tested SOTA code where it exists instead of reimplementing components yourself. Published implementations have already absorbed the small bugs and edge cases that bite you when you roll your own.

  • Across 23 runs the best and worst scale heads differed by 0.29 m RMSE. The training distribution moved the metric an order of magnitude more.
  • Without metric L1 gradient flowing back into the backbone, the output stays relative. Joint training is what makes absolute scale work.
  • Cross-region transfer breaks in predictable ways: sensor differences, GSD differences, and skewed height distributions per city.
  • Synthetic data is the cleanest way to add tall buildings to the training distribution without hurting generalisation on the rest.
