[week 5 summaries]

KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera

KinectFusion is a system that produces high-quality, geometrically accurate 3D reconstructions in real time, using only the depth maps generated by a Kinect camera. The camera is held by the user and moved freely with six degrees of freedom, and the resulting data are fused into a viewpoint-independent model. Beyond reconstructing static scenes, the system also handles dynamic scenarios in which objects within the scene are moving, enabling interaction without compromising the accuracy of the reconstruction.

To reconstruct a 3D model from a depth map, which is a 2D image storing a depth value per pixel, two matrices are needed: a transformation matrix T that tracks the camera pose and a camera calibration matrix K. Assuming the camera is fixed, the depth D(x, y) of pixel (x, y) gives its 3D position in camera coordinates by multiplying the homogeneous pixel coordinates by the inverse of K and scaling by the depth. A surface normal is then obtained from the cross product of vectors to neighbouring points. When the camera moves, the transformation Ti at time i is used to map each point into global coordinates.
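As a concrete illustration of this back-projection step, here is a minimal numpy sketch; the function names are my own, K is assumed to be a standard 3x3 pinhole intrinsics matrix, depth values are in metres, and Ti is a 4x4 rigid-body pose.

    import numpy as np

    def back_project(depth, K):
        """Back-project a depth map into a camera-space vertex map: v = D(x, y) * K^-1 * [x, y, 1]^T."""
        h, w = depth.shape
        K_inv = np.linalg.inv(K)
        xs, ys = np.meshgrid(np.arange(w), np.arange(h))
        pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)  # homogeneous pixel grid
        rays = pix @ K_inv.T                      # ray direction through each pixel
        return rays * depth[..., None]            # scale by the measured depth

    def normals_from_vertices(verts):
        """Per-pixel normals from the cross product of neighbouring vertex differences."""
        dx = verts[:, 1:, :] - verts[:, :-1, :]   # horizontal neighbour difference
        dy = verts[1:, :, :] - verts[:-1, :, :]   # vertical neighbour difference
        n = np.cross(dx[:-1, :, :], dy[:, :-1, :])
        return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-9)

    def to_global(verts, T_i):
        """Apply the 4x4 camera pose T_i to lift camera-space vertices into global coordinates."""
        homo = np.concatenate([verts, np.ones(verts.shape[:2] + (1,))], axis=-1)
        return (homo @ T_i.T)[..., :3]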

Camera tracking relies on registering point correspondences. Let Ti-1 denote the camera pose at time i-1. At time i, for every pixel (x, y) of the depth map D, take the global position G computed at time i-1 and project it into the current image plane at (x', y'). Using the current pose estimate Ti, lift the measurement at (x', y') into global coordinates as G' and compute the distance between G and G'; do the same for the surface normals, measuring their difference by dot product. If both differences fall below predefined thresholds, the pair of points is registered as a correspondence. By pairing points across consecutive frames, the goal is to find the transformation that minimizes the sum of point differences between time i-1 and time i. This is the iterative closest point (ICP) algorithm.
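The correspondence test can be sketched roughly as follows. This is a slow, CPU-side illustration of the projective association described above, not the paper's GPU implementation; the thresholds and names are assumptions, and a full ICP iteration would follow this with a pose update that minimizes the remaining point error.

    import numpy as np

    def find_correspondences(verts_prev_g, normals_prev_g, verts_cur, normals_cur,
                             T_i, K, dist_thresh=0.01, normal_thresh=0.9):
        """Projective data association between frame i-1 (global maps) and frame i (camera-space maps)."""
        h, w, _ = verts_cur.shape
        T_inv = np.linalg.inv(T_i)
        pairs = []
        for y in range(h):
            for x in range(w):
                G = verts_prev_g[y, x]
                # Project the previous global point into the current image plane at (x', y')
                p_cam = (T_inv @ np.append(G, 1.0))[:3]
                if p_cam[2] <= 0:
                    continue
                u = K @ (p_cam / p_cam[2])
                xp, yp = int(round(u[0])), int(round(u[1]))
                if not (0 <= xp < w and 0 <= yp < h):
                    continue
                # Lift the current measurement at (x', y') into global coordinates as G'
                Gp = (T_i @ np.append(verts_cur[yp, xp], 1.0))[:3]
                Np = T_i[:3, :3] @ normals_cur[yp, xp]
                # Accept the pair only if both the distance and the normal test pass
                if (np.linalg.norm(G - Gp) < dist_thresh
                        and np.dot(normals_prev_g[y, x], Np) > normal_thresh):
                    pairs.append(((y, x), (yp, xp)))
        return pairs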

After computing the points' locations, volumetric integration is used for model reconstruction. A 3D volume of fixed resolution is predefined to cover the reconstructed space. To represent the model surface, voxels with positive values lie in front of the surface while voxels with negative values lie behind it. In practice, only a truncated band around the surface is stored, following the idea of truncated signed distance functions (TSDFs), and to sustain real-time rates the TSDF values are clamped.
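A deliberately simple sketch of fusing one depth frame into such a volume is given below; the voxel loop, truncation distance, and running-average weighting are my own simplifications rather than the paper's exact GPU scheme.

    import numpy as np

    def integrate_tsdf(tsdf, weights, vol_origin, voxel_size, depth, K, T_i, trunc=0.03):
        """Fuse one depth frame into a truncated signed distance volume (running weighted average)."""
        h, w = depth.shape
        T_inv = np.linalg.inv(T_i)
        for ix, iy, iz in np.ndindex(tsdf.shape):
            # Voxel centre in world coordinates, then in camera coordinates
            p_world = vol_origin + voxel_size * np.array([ix, iy, iz])
            p_cam = (T_inv @ np.append(p_world, 1.0))[:3]
            if p_cam[2] <= 0:
                continue
            u = K @ (p_cam / p_cam[2])
            px, py = int(round(u[0])), int(round(u[1]))
            if not (0 <= px < w and 0 <= py < h) or depth[py, px] <= 0:
                continue
            # Signed distance along the ray: positive in front of the surface, negative behind
            sdf = depth[py, px] - p_cam[2]
            if sdf < -trunc:
                continue                          # far behind the surface: leave the voxel untouched
            d = min(1.0, sdf / trunc)             # clamp to the truncated band around the surface
            w_old = weights[ix, iy, iz]
            tsdf[ix, iy, iz] = (tsdf[ix, iy, iz] * w_old + d) / (w_old + 1.0)
            weights[ix, iy, iz] = w_old + 1.0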

Given the acquired volumetric surface representation, raycasting is performed for rendering. A ray is shot from the viewpoint through every pixel into the scene. Marching along the ray, a sign change in the sampled voxel values means the ray has hit a surface; interpolating between neighbouring grid values gives the intersection point used to assemble the surface. Shadows are computed by walking a secondary ray from the surface point toward the light position.
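A per-pixel ray march over that volume could look like the sketch below; the step size, the nearest-voxel sampling, and the names are simplifying assumptions (the real system samples the TSDF trilinearly and runs entirely on the GPU).

    import numpy as np

    def raycast(tsdf, vol_origin, voxel_size, ray_origin, ray_dir, max_steps=2000):
        """March one ray through the TSDF and return the surface point at the first sign change, or None."""
        def sample(p):
            idx = np.floor((p - vol_origin) / voxel_size).astype(int)
            if np.any(idx < 0) or np.any(idx >= np.array(tsdf.shape)):
                return None
            return tsdf[tuple(idx)]               # nearest-voxel lookup for simplicity

        t, prev_val, prev_t = 0.0, None, None
        for _ in range(max_steps):
            p = ray_origin + t * ray_dir
            val = sample(p)
            if val is not None and prev_val is not None and prev_val > 0 >= val:
                # Zero crossing: interpolate between the last two samples to locate the surface
                alpha = prev_val / (prev_val - val)
                t_hit = prev_t + alpha * (t - prev_t)
                return ray_origin + t_hit * ray_dir
            prev_val, prev_t = val, t
            t += 0.5 * voxel_size                 # half-voxel steps along the ray
        return None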

The innovation here is twofold. First, all of these steps run in real time, which places demanding requirements on the algorithms. Second, the high-quality surface estimates produced through ICP camera tracking and fusion remove much of the noise, the shadows, and the holes present in raw Kinect data.

Beyond reconstructing static scenes, the paper also investigates, and resolves, interaction while reconstruction is in progress. Naively, depth data from a moving foreground object would be integrated into the background, leading to inconsistency and eventually tracking failure. Using ICP outliers, the depth data can instead be segmented into a foreground object and the background. First, with nothing moving, the pipeline reconstructs an initial scene. When motion that is independent of the camera motion is introduced, the corresponding oriented points show a significant disparity from the already reconstructed surface. These points are copied into an outlier map and reconstructed in separate steps running in parallel with the background reconstruction. Following the computations described above, the scene ends up with two independent segmented models: a static background model and a dynamic foreground model.
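A rough sketch of that outlier test is below, under the assumption that the background volume has already been raycast to predict a surface point for every pixel; the threshold and the names are illustrative.

    import numpy as np

    def segment_foreground(verts_cur_g, raycast_verts_g, dist_thresh=0.03):
        """Mark pixels whose measured point lies far from the reconstructed background surface."""
        diff = np.linalg.norm(verts_cur_g - raycast_verts_g, axis=-1)
        outliers = diff > dist_thresh             # pairs ICP would reject as correspondences
        # Outlier pixels are copied into a separate map and fused into their own volume,
        # leaving the background reconstruction untouched.
        return outliers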

Once the two segments are distinguished, a touch can be detected by identifying intersections between the foreground and the background.

In all, this paper demonstrates how to reconstruct a 3D model in real time from depth maps. The reconstruction works for both static and dynamic scenes, with real as well as virtual objects. This technique will surely open new research directions, both in its underlying technology and in its applications.

Going out: Robust Model-based Tracking for Outdoor Augmented Reality

This paper presents a practical tracking system for outdoor environments. Although GPS is the first choice of most AR systems for position measurement, its quality deteriorates significantly in dense urban areas, where buildings block and reflect the satellite signals. This paper presents a way to track accurately and robustly in such scenarios.

Model-based tracking is selected because other approaches, such as marker-based methods or commercial tracking systems, rely on pre-processing the model to extract features. One innovation here is that instead of a point-based or edge-based model, which would require detailed modeling, the authors employ a textured 3D model. The model is rendered from the current pose estimate and edge detection is performed on the rendering, so edges can be tracked dynamically. Inertial and magnetometer sensors are combined as complementary measurements for robust, convincing performance.

The basic tracking system consists of three parts: feature edge extraction, edge tracking, and a complementary correction system.

Edge tracking and inertial measurement are well established; the more interesting part is edge extraction. Although texture such as windows and doors provides more data for edge tracking, the complexity introduced by the environment can lead to wrong data associations and incorrect pose estimates. The authors use the concept of an edgel, a pixel of an image that has the characteristics of an edge, so that even where dense clusters of edges are detected, they are reduced to a single edge for tracking. Edgels are first extracted with a Canny edge detector and then projected back onto the model to obtain 3D coordinates; finally, an appearance-based approach is used for the measurement step.
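A simplified sketch of this extraction step follows, assuming the rendered model view comes with a depth buffer (which is what allows a 2D edgel to be lifted back onto the model). OpenCV's Canny detector stands in for the paper's exact detector and parameters, and the function and variable names are my own.

    import cv2
    import numpy as np

    def extract_edgels(rendered_gray, rendered_depth, K, T_pose, low=50, high=150):
        """Detect edgels in the rendered model view and project them back onto the model surface."""
        edges = cv2.Canny(rendered_gray, low, high)       # dense edge response on the rendering
        ys, xs = np.nonzero(edges)
        K_inv = np.linalg.inv(K)
        edgels_3d = []
        for x, y in zip(xs, ys):
            d = rendered_depth[y, x]
            if d <= 0:
                continue
            p_cam = d * (K_inv @ np.array([x, y, 1.0]))   # back-project using the depth buffer
            p_world = (T_pose @ np.append(p_cam, 1.0))[:3]  # into the model/world frame
            edgels_3d.append(p_world)
        return np.array(edgels_3d)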

One big problem in outdoor environments is dynamic occluders. Occluders such as cars or buses can interfere with the edge tracking process and cause the system to lock onto them, so a recovery component is added to cope with this. Whenever tracking fails, it may be a genuine failure or the system may merely have been misled by interference. The recovery component therefore compares the current video frame against stored older frames as a reference. If an older frame matches, the system computes the pose from the matched frame and the current frame; otherwise it declares a tracking failure and sets the velocities to zero to prevent incorrect locking.
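A sketch of that recovery logic is below; the frame comparison shown here is a plain normalised correlation on downsampled frames standing in for the paper's actual matching, and the keyframe store, threshold, and names are assumptions.

    import numpy as np

    def attempt_recovery(current_frame, keyframes, match_thresh=0.8):
        """Compare the current frame against stored (image, pose) keyframes after a tracking failure."""
        best_score, best_pose = -1.0, None
        cur = (current_frame - current_frame.mean()) / (current_frame.std() + 1e-9)
        for img, pose in keyframes:
            ref = (img - img.mean()) / (img.std() + 1e-9)
            score = float((cur * ref).mean())     # normalised cross-correlation
            if score > best_score:
                best_score, best_pose = score, pose
        if best_score >= match_thresh:
            return best_pose                      # re-seed tracking from the matched keyframe
        # No keyframe matches: treat it as a real failure and zero velocities to avoid false locking
        return None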

The analysis of the results focuses on four aspects: accuracy, robustness, performance, and dynamic behavior. The model is built from planar surfaces with detailed texture, which cannot fully reflect the actual buildings' structure, so spots with fine structure, or viewpoints very close to a building, can show large inaccuracies; otherwise the system achieves good accuracy. Robustness is reflected in the behavior of the recovery system when disturbances occur: even when the view is fully blocked, tracking resumes once the view again resembles a stored image. Most of the time overhead is incurred by OpenGL rendering, which will be eased by hardware improvements. Tracking dynamically while walking is also evaluated with satisfying results, and the video shows robust handling of occluders such as cars and telegraph poles in the scene.

In all, this is an interesting system with good performance. But since it relies heavily on a textured 3D model with sufficient detail for tracking (many of the errors and inaccuracies in the tests are caused by the over-simplified model), and that information cannot be pulled from the web, how can such a large amount of data be stored locally in practice?
