vision · mixed-reality · unity · tsdf

Mixed Reality Change Detection

Object-level multi-session change detection on Meta Quest: capture a room, revisit it, and identify which objects moved, vanished, or appeared.

ETH Zürich, Mixed Reality course · 2025


Stack

  • Unity
  • Meta Quest SDK
  • C#
  • Python

Role

Course project

Team

4 people

Repository

Overview

MR headsets rely on a spatial map to anchor virtual content, but today's systems implicitly assume the room is static. Indoor environments evolve, and scene-level meshes or global occupancy maps don't separate one object from another, so they can't reason about what actually changed. The question we tackled: can we detect changes at the object level, not at pixel or mesh level, and visualise them inside the headset?

The pipeline captures multiple Quest sessions, then runs panoptic segmentation and a state-of-the-art monocular depth model on the recordings. The Panoptic Multi-TSDFs mapper (Schmid et al. 2022) builds an independent TSDF submap per object. Submaps are matched across sessions by centroid distance and volumetric overlap, classified as persistent, moved, removed, or new, and rendered back onto the mesh with colour-coded overlays and a temporal slider.

What was built

1. Quest capture and preprocessing

A Unity app on the headset records the synchronised inputs the mapper needs; a Python pipeline then fixes a few real-world capture issues.

  • Synchronised colour, depth, and 6-DoF pose per frame, written asynchronously so the headset's frame budget stays stable.
  • Unity exports poses left-handed and the mapper assumes right-handed, so every pose is explicitly converted before reconstruction (sketched after this list).
  • Quest's raw depth is noisy and uses different intrinsics than the RGB camera. Swapping it for UniDepthV2 monocular depth stabilises TSDF fusion across views.
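
A minimal sketch of that handedness conversion, assuming the mapper wants a right-handed frame obtained by negating Unity's Z axis (the exact axis to flip depends on the backend's convention, so treat this as illustrative rather than the project's exact transform):

```python
import numpy as np

def unity_pose_to_right_handed(position, quaternion):
    """position: (x, y, z); quaternion: (x, y, z, w), both in Unity's left-handed frame."""
    px, py, pz = position
    qx, qy, qz, qw = quaternion
    # Mirroring across the XY plane negates z for points; for the rotation quaternion
    # the x and y components flip instead, because the rotation axis is a pseudovector.
    return np.array([px, py, -pz]), np.array([-qx, -qy, qz, qw])

# Identity rotation one metre in front of the Unity origin lands at z = -1 in the
# right-handed frame.
pos_rh, rot_rh = unity_pose_to_right_handed((0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 1.0))
```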

2. Object-level mapping and comparison

Builds an independent volumetric submap per object, then matches submaps across sessions.

  • Each object is its own submap and can be updated independently, so a single moved object doesn't force the whole scene to be rebuilt.
  • Cross-session matching uses centroid distance and volumetric overlap. The criteria are simple geometric ones, and they work because each object is already coordinate-aligned; a sketch of the matching logic follows this list.
  • Four classes per match: persistent, moved, removed, new.
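
A minimal sketch of that matching logic, assuming each submap carries a semantic label, a centroid, and a set of occupied voxel indices; the thresholds and the voxel-IoU overlap measure are illustrative placeholders rather than the project's tuned values:

```python
import numpy as np

def voxel_iou(a: set, b: set) -> float:
    """Volumetric overlap of two submaps, as intersection-over-union of occupied voxels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def classify_changes(prev, curr, persist_iou=0.3, move_dist=0.2):
    """prev/curr: lists of dicts with keys 'label', 'centroid' (np.ndarray), 'voxels' (set)."""
    changes, used = [], set()
    for p in prev:
        # Candidate matches in the new session: same semantic label, not yet claimed.
        candidates = [(i, c) for i, c in enumerate(curr)
                      if i not in used and c["label"] == p["label"]]
        if not candidates:
            changes.append((p["label"], "removed"))
            continue
        # Match to the nearest centroid among the candidates.
        i, c = min(candidates,
                   key=lambda ic: np.linalg.norm(ic[1]["centroid"] - p["centroid"]))
        used.add(i)
        dist = float(np.linalg.norm(c["centroid"] - p["centroid"]))
        if dist <= move_dist and voxel_iou(p["voxels"], c["voxels"]) >= persist_iou:
            changes.append((p["label"], "persistent"))
        else:
            changes.append((p["label"], "moved"))
    # Submaps in the new session that never got matched appeared since the last scan.
    changes += [(c["label"], "new") for i, c in enumerate(curr) if i not in used]
    return changes
```

With these placeholder values, a submap whose centroid shifts by more than 20 cm, or whose voxel overlap drops below 0.3, is reported as moved; submaps with no same-label counterpart in the other session come out as removed or new.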

3. Mixed-reality visualisation

In-headset interface that renders the categorisation directly on the reconstructed mesh.

  • Colour-coded overlays per change category, with unchanged regions left alone so the eye isn't overloaded.
  • Temporal slider that switches between sessions at hard thresholds, so two reconstructed meshes never overlap and fight for the same pixel (selection logic sketched after this list).
  • Tabletop framing in front of the user with hand-tracking rotation and zoom.
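
The slider reduces to picking exactly one session mesh per slider value. A minimal sketch of that selection logic, written in plain Python for illustration (the in-headset version lives in Unity/C#):

```python
def visible_session(slider_value: float, num_sessions: int) -> int:
    """Map a slider value in [0, 1] to the index of the single session to render."""
    slider_value = min(max(slider_value, 0.0), 1.0)
    return min(int(slider_value * num_sessions), num_sessions - 1)

def mesh_visibility(slider_value: float, num_sessions: int) -> list[bool]:
    """One visibility flag per session mesh; only the active session is shown."""
    active = visible_session(slider_value, num_sessions)
    return [i == active for i in range(num_sessions)]

# With three captured sessions, a slider value of 0.5 shows session 1 and hides the
# rest, so no two reconstructed meshes are ever rendered on top of each other.
assert mesh_visibility(0.5, 3) == [False, True, False]
```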

Technical details

Unity on the headset, Python preprocessing offline, a panoptic TSDF mapper as the geometric backend, and the mixed-reality overlays back in Unity. A sketch of the per-frame data contract follows the list below.

  • Capture: synchronised RGB, depth, and 6-DoF pose per frame
  • Depth replacement: UniDepthV2 monocular metric depth
  • Mapper: Panoptic Multi-TSDFs (Schmid et al. 2022), per-object submaps
  • Change categories: persistent, moved, removed, new
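
For reference, a minimal sketch of the per-frame record the preprocessing pipeline hands to the mapper; the field names and layout are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameRecord:
    rgb_path: str            # colour image for this frame
    depth_path: str          # UniDepthV2 metric depth, aligned to the RGB intrinsics
    panoptic_path: str       # panoptic mask: per-pixel instance id and semantic class
    position: np.ndarray     # (3,) camera position in the right-handed world frame, metres
    rotation: np.ndarray     # (4,) quaternion (x, y, z, w), right-handed convention
```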

Key technical decisions

  • Object-level submaps over a monolithic TSDF: A single global TSDF can't be updated per object without rebuilding the whole scene. Per-object submaps let us match and compare individual entities across sessions, which is the whole point.
  • UniDepthV2 over Quest's raw depth: Native depth was noisy and didn't share intrinsics with the RGB camera. A learned monocular depth model gives geometrically consistent TSDF fusion across views. Depth inconsistency turned out to be the dominant source of one-object-multiple-meshes artefacts.
  • Hard switching over alpha blending: Crossfading two reconstructed meshes through transparency causes Z-fighting. Hard switching at slider thresholds keeps the scene readable.

Results

  • 4 change classes: persistent / moved / removed / new
  • ~20 cm of depth inconsistency removed by swapping raw Quest depth for UniDepthV2
  • Object-level mapping: per-object TSDF submaps, independently updateable across sessions
  • Cross-session matching: centroid distance plus volumetric overlap on per-object submaps is enough to classify all four change states without a learned matcher

Challenges & tradeoffs

  • Coordinate-system mismatch: Unity exports poses left-handed and the mapper assumes right-handed. Without an explicit conversion the geometry is subtly wrong in ways that are very hard to debug.
  • Quest depth is unreliable: Raw depth was the dominant source of fragmented submaps. An ablation showed depth inconsistency (~20 cm across rotated views) far outweighed pose drift, and swapping in UniDepthV2 was the fix.
  • Cross-session world frame: Each session has its own Unity world frame. The current workaround keeps the app open between sessions so it can share the frame. That works for the demo but doesn't generalise, and robust cross-session alignment is the open problem.

What I learned

  • Object-level submaps turn cross-session change detection into a geometric matching problem on per-object volumes. Centroid distance plus volumetric overlap is enough to classify persistent / moved / removed / new.
  • Quest's raw depth was the dominant source of fragmented submaps, with ~20 cm inconsistency across rotated views. Replacing it with UniDepthV2 monocular depth was the headline fix; pose drift was a much smaller contributor.
  • Coordinate-system handedness across the capture and reconstruction stacks must be reconciled explicitly. A missed flip produces subtly wrong geometry that's hard to debug.
  • Hard threshold-based mesh switching across the temporal slider avoids the Z-fighting that alpha blending two reconstructed meshes would cause.

Gallery

  • 01 / 04End-to-end pipeline. Quest captures RGB, depth, and 6-DoF pose; Python preprocessing handles coordinate conversion, UniDepthV2 monocular depth, and panoptic segmentation; the panoptic mapper builds object-level TSDF submaps; cross-session comparison classifies the changes; and Unity renders them back in headset.
  • 02 / 04System overview of the panoptic multi-TSDF mapping backend. Multi-modal inputs (pose, depth, color, panoptic segmentation) feed object-level TSDF submap construction, which then supports cross-session comparison and visualisation. Adapted from Schmid et al. 2022.
  • 03 / 04Intermediate results handed to the panoptic mapper: RGB images, depth, 6-DoF poses, and panoptic segmentation masks aligned per frame. This is the data contract between the preprocessing pipeline and the mapping backend.
  • 04 / 04Reconstructions produced by the mapper: the initial scene (left) and two scans of the changed scene (centre, right). Cross-session comparison of these object-level submaps drives the persistent / moved / removed / new classification rendered in the MR app.
