HILTI: Construction-Specific Instance Segmentation
Instance segmentation of architectural components in 360° equirectangular video from active construction sites.
ETH Zürich, HILTI collaboration · In progress · 2026

Stack
- Python
- PyTorch
- Hugging Face
- Blender
Role
ETH project
Team
4 people
Status · Work in progress
Active research project. The Blender synthetic-data pipeline, ablation results across YOLO / Mask2Former / RF-DETR, and final architecture decisions will be added as the work progresses and metrics come in.
Overview
Construction sites are filmed with 360° cameras for progress tracking and quality control. The footage comes out as equirectangular video, and the job is to detect six architectural components per frame (columns, staircases, doors, door frames, elevator shafts, outer windows) with per-instance masks. Two things make this hard. First, the projection distorts non-uniformly with latitude and wraps at the left/right seam, so any object crossing the seam appears as two fragments at opposite image edges. Second, only a few hundred labelled frames exist, and construction interiors look nothing like the everyday scenes in COCO.
The pipeline tackles both. Re-projecting the panoramic frames into a cubemap lets COCO-pretrained models do most of the work in their native perspective domain, and a procedural Blender pipeline manufactures labelled frames to close the data gap.
What is being built
1. Cubemap projection pipeline
Re-projects each equirectangular frame into six perspective cube faces, runs segmentation on each face, and merges predictions back into the panorama.
- COCO-pretrained models run in their native perspective domain on the six 90° faces, so no spherical convolutions or changes to the model architecture are needed.
- Face predictions are reprojected to the panorama and fused by a seam-aware merge: same-class IoU plus a dilation-based adjacency rule that re-joins fragments split across a cube-face seam, where raw IoU is zero but the masks physically touch.
- The merge is wrap-aware, so it also rejoins objects crossing the 360° left/right seam; the up/down (pole) faces remain the hardest case.
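The re-projection step can be sketched as a mapping from a cube-face pixel to the equirectangular coordinate it samples from. This is a minimal illustration, not the project's implementation: the axis convention (x right, y up, z forward, longitude 0 at the panorama centre), the face names, and the function `face_pixel_to_erp` are all assumptions made for the example.

```python
import math

# Assumed convention (not taken from the project): x right, y up, z forward.
# Each entry turns normalised face coordinates (a, b) in [-1, 1] into a 3D ray.
FACE_RAYS = {
    "front": lambda a, b: (a, b, 1.0),
    "right": lambda a, b: (1.0, b, -a),
    "back": lambda a, b: (-a, b, -1.0),
    "left": lambda a, b: (-1.0, b, a),
    "up": lambda a, b: (a, 1.0, -b),
    "down": lambda a, b: (a, -1.0, b),
}


def face_pixel_to_erp(face, col, row, face_size, erp_w, erp_h):
    """Map a cube-face pixel centre to fractional equirectangular (x, y)."""
    # A 90° FOV face spans [-1, 1] on the plane at distance 1 (tan 45° = 1).
    a = 2.0 * (col + 0.5) / face_size - 1.0
    b = 1.0 - 2.0 * (row + 0.5) / face_size  # image rows grow downwards
    x, y, z = FACE_RAYS[face](a, b)
    lon = math.atan2(x, z)                 # longitude in [-pi, pi]
    lat = math.atan2(y, math.hypot(x, z))  # latitude in [-pi/2, pi/2]
    erp_x = ((lon / (2.0 * math.pi) + 0.5) * erp_w) % erp_w  # wrap at the seam
    erp_y = (0.5 - lat / math.pi) * erp_h
    return erp_x, erp_y
```

Evaluating this for every pixel of a face gives a sampling grid; interpolating the panorama at those coordinates would produce the perspective crop, and the same coordinates say where each face-level prediction lands in the panorama.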
2. Architecture ablations
YOLO11, Mask2Former, and RF-DETR are each fine-tuned on the construction dataset in both the equirectangular and cubemap domains to bracket the design space: CNN single-stage, transformer query-based, and real-time DETR.
- YOLO11-seg gives the quickest inference and built-in mosaic-style augmentation, but expects a square input and aggressively downscales the panoramic frames.
- Mask2Former (Swin-B, COCO-instance) is a query-based transformer with masked attention, trading inference speed for higher-quality masks.
- RF-DETR-Seg (Roboflow's real-time DETR, DINOv2 backbone) uses NMS-free set prediction; it is the newest and most efficient of the three.
3. Synthetic data via Blender
A few hundred labelled frames isn't enough for production-grade training, so we're building a procedural Blender pipeline that manufactures them.
- Generates synthetic construction interiors with parameterised columns, doors, door frames, windows, staircases, and elevator shafts, using Poly Haven HDRIs and PBR materials with domain randomisation.
- Renders to the cube-face format used by the pipeline (square crops at 90° FOV), with ground-truth per-instance masks, so synthetic and real share an identical label contract.
- Plan: mix synthetic and real frames in training, with explicit class-balance control to lift the rare components (elevator shafts, staircases).
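One way the class-balance control in the plan above could look is a per-frame sampling weight based on the rarest class a frame contains. This is only a sketch under assumptions: the function `frame_sampling_weights`, the inverse-frequency rule, and the handling of frames without any target class are hypothetical choices, not the project's settled scheme.

```python
from collections import Counter


def frame_sampling_weights(frame_classes):
    """Return normalised sampling weights, one per frame.

    frame_classes: list of sets, each holding the class names present in
    one (real or synthetic) frame.
    """
    if not frame_classes:
        return []
    # How many frames each class appears in.
    counts = Counter(c for classes in frame_classes for c in classes)
    raw = []
    for classes in frame_classes:
        if classes:
            # Inverse frequency of the rarest class in this frame.
            raw.append(1.0 / min(counts[c] for c in classes))
        else:
            # Assumed fallback for frames with no target class.
            raw.append(1.0 / len(frame_classes))
    total = sum(raw)
    return [w / total for w in raw]
```

With four frames where columns appear in all of them and an elevator shaft in only one, the elevator-shaft frame is drawn four times as often as a column-only frame. Weights like these could be fed to `random.choices` or to PyTorch's `WeightedRandomSampler`.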
4. Temporal tracking
Turns the per-frame masks into temporally stable instance IDs across the video, reasoning on the sphere rather than in flat pixels.
- Per-object Kalman filter in longitude/latitude plus Hungarian assignment over a multi-cue cost (mask IoU, angular distance, appearance, shape).
- Inter-frame camera rotation is estimated from the static background and each track's mask is warped forward before matching, so IDs survive camera panning.
- Class is a temporal majority vote, so single-frame misclassifications get smoothed out.
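Two of the tracking cues above are easy to sketch in isolation: the angular-distance term of the matching cost and the majority vote over a track's class history. The function names are hypothetical, and the remaining cues (mask IoU, appearance, shape), the ego-motion warp, and the Hungarian assignment itself are left out.

```python
import math
from collections import Counter


def angular_distance(lon1, lat1, lon2, lat2):
    """Great-circle angle in radians between two lon/lat points (haversine)."""
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    h = (math.sin(dlat / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2.0) ** 2)
    # Clamp guards against rounding pushing sqrt(h) slightly above 1.
    return 2.0 * math.asin(min(1.0, math.sqrt(h)))


def voted_class(history):
    """Most frequent class label over a track's per-frame predictions."""
    return Counter(history).most_common(1)[0][0]
```

Because the haversine form is periodic in longitude, two detections sitting just either side of the left/right seam come out close together, whereas their flat pixel distance would be almost the full image width.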
Technical details
The pipeline converts equirectangular input into six perspective cube faces, runs an instance-segmentation model on each, stitches predictions back into the panoramic frame, and tracks instances across frames.
- Input: 360° equirectangular video (2880×1440) at roughly one frame per second
- Projection: cubemap, six perspective faces at 90° FOV
- Output: per-instance masks with class labels and temporally stable IDs
- Target classes: column, staircase, door, door frame, elevator shaft, outer window (6 of 9 classes; the remaining 3 are collapsed and used only as training negatives)
- Models compared: YOLO11 (CNN, single-stage), Mask2Former (Swin-B, COCO-instance, query-based), RF-DETR (DINOv2, real-time DETR)
- Synthetic data: procedural Blender scenes rendered to the cube-face format with per-instance masks
- Tracking: spherical Kalman (lon/lat) + Hungarian assignment, ego-motion compensated
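One detail of filtering in longitude is that the state lives on a circle, so the innovation has to be wrapped to the shortest way round before it is applied. The sketch below is a single-axis constant-velocity Kalman filter for longitude only; the class name `LongitudeKalman`, the helper `wrap_angle`, and every noise value are assumptions chosen for illustration, not the project's tracker.

```python
import math


def wrap_angle(a):
    """Wrap an angle in radians to [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


class LongitudeKalman:
    """Constant-velocity Kalman filter on one wrapped angle (state: angle, rate)."""

    def __init__(self, angle, var_angle=0.1, var_rate=0.1):
        self.angle, self.rate = angle, 0.0
        # 2x2 covariance stored as scalars: [[p00, p01], [p10, p11]].
        self.p00, self.p01, self.p10, self.p11 = var_angle, 0.0, 0.0, var_rate

    def predict(self, dt=1.0, q_angle=1e-3, q_rate=1e-3):
        # x <- F x and P <- F P F^T + Q, with F = [[1, dt], [0, 1]].
        self.angle = wrap_angle(self.angle + self.rate * dt)
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + q_angle
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + q_rate
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

    def update(self, measured, r=1e-2):
        # Only the angle is observed (H = [1, 0]); wrap the innovation so a
        # measurement just across the seam counts as a small step.
        innovation = wrap_angle(measured - self.angle)
        s = self.p00 + r
        k0, k1 = self.p00 / s, self.p10 / s
        self.angle = wrap_angle(self.angle + k0 * innovation)
        self.rate += k1 * innovation
        p00, p01 = (1.0 - k0) * self.p00, (1.0 - k0) * self.p01
        p10, p11 = self.p10 - k1 * self.p00, self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
```

A track at longitude 3.1 rad that receives a measurement at -3.1 rad moves a short distance across the seam; without the wrap the same update would drag the estimate back towards longitude 0.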
Key technical decisions
- Cubemap re-projection over spherical CNNs: Re-projecting to perspective faces lets us reuse COCO-pretrained checkpoints directly. Spherical convolutions are more principled but need custom operators and forfeit most of the pretraining benefit.
- Seam-aware stitching, not just IoU dedup: An object sliced by a cube-face boundary reprojects to two fragments with zero IoU, so an IoU-only merge misses them. A dilation-based adjacency rule re-joins fragments that physically touch across the seam, and the merge thresholds are tuned by grid search.
- Multi-architecture ablation: YOLO11, Mask2Former, and RF-DETR span CNN single-stage, transformer query-based, and real-time DETR. The different bias/variance trade-offs help isolate what's data-limited versus model-limited on a small dataset.
- Blender for synthetic data: Real construction-site labelling is expensive and slow, with rare classes badly under-represented. A procedural pipeline gives unlimited annotated frames with full control over class balance, in the same cube-face format as the real data.
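The seam-aware stitching decision above can be illustrated with a toy merge test on masks stored as sets of panorama pixels. This is a sketch under assumptions, not the project's merge: the function names, the square dilation neighbourhood, and the default `iou_thr` and `radius` values are hypothetical (the text says the real thresholds are tuned by grid search).

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) panorama pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dilate(mask, width, radius=1):
    """Grow a mask by `radius` pixels, wrapping columns at the 360° seam."""
    grown = set()
    for row, col in mask:
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                # Rows are not clipped; out-of-range rows never match a real mask.
                grown.add((row + dr, (col + dc) % width))
    return grown


def should_merge(a, b, same_class, width, iou_thr=0.5, radius=1):
    """Merge same-class masks that overlap enough or touch after dilation."""
    if not same_class:
        return False
    if mask_iou(a, b) >= iou_thr:
        return True
    # Adjacency rule: fragments with zero IoU still merge if they touch.
    return bool(dilate(a, width, radius) & b)
```

In an 8-pixel-wide toy panorama, a fragment at column 7 and one at column 0 on the same row have zero IoU but are merged by the adjacency rule, because the dilation wraps around the seam.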
Challenges & tradeoffs
- Equirectangular distortion: Standard CNNs assume a uniform pixel grid. ERP stretches non-uniformly with latitude, wraps at the seam, and has a 2:1 aspect ratio, none of which line up with square pretrained inputs.
- Data scarcity: A few hundred labelled frames, with severe class imbalance: columns appear in almost every frame, elevator shafts are rare.
- Synthetic-to-real domain gap: Blender renders give us unlimited labels but don't perfectly match real construction lighting, materials, and clutter. Closing the gap is the open research question.
What I learned
- Re-projection beats spherical convolutions when COCO-scale pretraining is on the table. Pretraining transfer outweighs the geometric purity of operating directly on the sphere.
- Bracketing model strength (YOLO vs Mask2Former vs RF-DETR) is what tells you whether the bottleneck is data or model on a small dataset.
- Synthetic-data pipelines are systems problems, not rendering ones. Reproducibility and label-format parity matter more than render fidelity.
- Rare classes (elevator shafts) need explicit rebalancing in the synthetic pipeline; resampling the real frames alone doesn't fix it.