A structured-light system for depth estimation is a type of 3D active sensor that consists of a structured-light projector
that projects an illumination pattern on the scene (e.g. mask with vertical stripes) and a camera which captures the
illuminated scene. Based on the received patterns, depths of different regions in the scene can be inferred. In this paper,
we use side information in the form of image structure to enhance the depth map. This side information is obtained from
the received light pattern image reflected by the scene itself. The processing steps run in real time. This post-processing
stage in the form of depth map enhancement can be used for better hand gesture recognition, as is illustrated in this
paper.
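As a rough illustration of the two ingredients above, the sketch below (Python with NumPy; the function names, parameters, and the choice of a joint bilateral filter are illustrative assumptions, not the paper's implementation) triangulates depth from the observed shift of the projected pattern and then smooths the depth map using the captured pattern image as the guiding side information.

```python
# Minimal sketch, not the paper's algorithm: (1) triangulate depth from the
# shift of a projected stripe pattern, (2) smooth the depth map while
# respecting edges of the captured pattern image (the "side information").
import numpy as np

def depth_from_pattern_shift(shift_px, focal_px, baseline_m):
    """Classic triangulation: Z = f * b / d, where d is the observed shift."""
    shift_px = np.maximum(shift_px, 1e-6)          # avoid division by zero
    return focal_px * baseline_m / shift_px

def joint_bilateral_smooth(depth, guide, radius=3, sigma_s=2.0, sigma_r=0.1):
    """Smooth `depth`, guided by the edges of the pattern image `guide`."""
    h, w = depth.shape
    out = np.zeros_like(depth, dtype=np.float64)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))
    pad_d = np.pad(depth, radius, mode="edge")
    pad_g = np.pad(guide, radius, mode="edge")
    for y in range(h):
        for x in range(w):
            patch_d = pad_d[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            patch_g = pad_g[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            range_w = np.exp(-(patch_g - guide[y, x])**2 / (2 * sigma_r**2))
            wgt = spatial * range_w                # low weight across guide edges
            out[y, x] = np.sum(wgt * patch_d) / np.sum(wgt)
    return out
```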
Structured-light depth map systems are a type of 3D system in which a structured light pattern is projected into the object space and an adjacent receiving camera captures the image of the scene. Using the distance between the camera and the projector together with the structured pattern, the depth of objects in the scene can be estimated from the camera. It is important to be able to compare the performance of two such systems. Accuracy, resolution, and speed are three aspects of a structured-light system that are often used for performance evaluation. Ideally, the accuracy and resolution measurements could answer questions such as how close two cubes can be to each other and still be resolved as two objects, or how close a person must be to the structured-light system for it to determine how many fingers that person is holding up. It turns out, from our experiments, that a system's ability to resolve the shape of an object depends on a number of factors, such as the shape of the object, its orientation, and how close it is to other adjacent objects. This makes the task of comparing the resolution of two systems difficult. Our goal is to choose a target, or a set of targets, from which we can make measurements that quantify, on average, the comparative resolution performance of one system against another without having to make multiple measurements on scenes with a large set of object shapes, orientations, and proximities. In this document we go over a number of targets we evaluated and focus on the “Cut-out Star Target” that we selected as the best choice. Using this target we show our evaluation results for two systems. The metrics we used for the evaluation were developed during this work. These metrics do not directly answer the question of how close two objects can be to each other and still be resolved, but they do indicate which system will perform better over a large set of objects, orientations, and proximities to other objects.
A structured-light system for depth estimation is a type of 3D active sensor that consists of a structured-light projector, which projects a light pattern on the scene (e.g., a mask with vertical stripes), and a camera that captures
the illuminated scene. Based on the received patterns, depths of different regions in the scene can be inferred. For
this setup to work optimally, the camera and projector must be aligned such that the projection image plane and the
image capture plane are parallel, i.e. free of any relative rotations (yaw, pitch and roll). In reality, due to mechanical placement inaccuracy, the projector-camera pair will not be perfectly aligned. In this paper we present a calibration process that measures the misalignment. We also estimate a scale factor to account for differences in the focal lengths of the projector and the camera. The three angles of rotation can be found by introducing a plane in the field of view of the camera and illuminating it with the projected light patterns. An image of this plane is captured and processed to obtain the relative pitch, yaw and roll angles, as well as the scale, through an iterative process. This algorithm leverages the effects of the misalignment/rotation angles on the depth map of the plane image.
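The sketch below illustrates only one ingredient of such a calibration, under strong assumptions: a fronto-parallel flat target should produce a constant depth map, so the slopes of a least-squares plane fit to the measured depth hint at the relative pitch and yaw. The iterative roll and scale estimation of the paper is not reproduced, and the names and the small-angle treatment of the slopes are illustrative.

```python
# Minimal sketch (not the paper's procedure): fit z = a*x + b*y + c to the
# depth map of a flat target and read apparent tilts from the slopes.
import numpy as np

def fit_plane_tilt(depth_map):
    h, w = depth_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(A, depth_map.ravel(), rcond=None)
    a, b, _ = coeffs                       # depth change per pixel in x and y
    # Small-angle proxies for yaw/pitch; proper conversion from depth-per-pixel
    # to metric angles (and the roll/scale estimation) is omitted here.
    yaw_deg = np.degrees(np.arctan(a))
    pitch_deg = np.degrees(np.arctan(b))
    return pitch_deg, yaw_deg
```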
The dynamic range of an imager is determined by the ratio of the pixel well capacity to the noise floor. As the scene
dynamic range becomes larger than the imager dynamic range, the choices are to saturate some parts of the scene or
“bury” others in noise. In this paper we propose an algorithm that produces high dynamic range images by “stacking”
sequentially captured frames which reduces the noise and creates additional bits. The frame stacking is done by frame
alignment subject to a projective transform and temporal anisotropic diffusion. The noise sources contributing to the
noise floor are the sensor heat noise, the quantization noise, and the sensor fixed pattern noise. We demonstrate that by stacking images the quantization and heat noise are reduced and the decrease is limited only by the fixed pattern noise. As the noise is reduced, the resulting cleaner image enables the use of adaptive tone mapping algorithms which render HDR images in an 8-bit container without significant noise increase.
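A minimal sketch of the stacking idea, assuming OpenCV is available and the frames are grayscale captures: each frame is aligned to the first one with an estimated projective transform and accumulated at higher precision. The temporal anisotropic diffusion stage of the paper is not reproduced here.

```python
# Sketch: align sequential frames with a homography and average them in a
# float accumulator to lower noise and gain effective bits.
import cv2
import numpy as np

def stack_frames(frames):
    ref = frames[0]
    orb = cv2.ORB_create(2000)
    kp_ref, des_ref = orb.detectAndCompute(ref, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    acc = ref.astype(np.float64)
    for frame in frames[1:]:
        kp, des = orb.detectAndCompute(frame, None)
        matches = matcher.match(des, des_ref)
        src = np.float32([kp[m.queryIdx].pt for m in matches])
        dst = np.float32([kp_ref[m.trainIdx].pt for m in matches])
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)   # projective transform
        acc += cv2.warpPerspective(frame, H, (ref.shape[1], ref.shape[0]))
    # Averaging N frames lowers temporal and quantization noise roughly by
    # sqrt(N), until the sensor's fixed pattern noise dominates.
    return acc / len(frames)
```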
KEYWORDS: High dynamic range imaging, Photography, Image enhancement, Tablets, Cameras, Digital cameras, Cell phones, Digital photography, Sensors, Image processing
High Dynamic Range (HDR) technology enables photographers to capture a greater range of tonal detail. HDR is
typically used to bring out detail in a dark foreground object set against a bright background. HDR technologies include multi-frame HDR and single-frame HDR. Multi-frame HDR requires the combination of a sequence of images taken at different exposures. Single-frame HDR requires histogram equalization post-processing of a single image, a technique referred to as local tone mapping (LTM). Images generated using HDR technology can look less natural than their non-HDR counterparts. Sometimes it is only desired to enhance small regions of an original image. For example, it may be desired to enhance the tonal detail of one subject’s face while preserving the original background.
The Touch HDR technique described in this paper achieves these goals by enabling selective blending of HDR and non-HDR versions of the same image to create a hybrid image. The HDR version of the image can be generated by either multi-frame or single-frame HDR. Selective blending can be performed as a post-processing step, for example, as a feature of a photo editor application, at any time after the image has been captured. HDR and non-HDR blending is controlled by a weighting surface, which is configured by the user through a sequence of touches on a touchscreen.
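A minimal sketch of the selective blending, with an assumed Gaussian weight model: each touch adds a bump to a weighting surface in [0, 1], and the output is a per-pixel mix of the HDR and non-HDR renderings. The actual user-interaction and weight-shaping details of the technique may differ.

```python
# Sketch of Touch HDR-style blending (illustrative names and weight model).
import numpy as np

def weight_surface(shape, touches, sigma=40.0):
    """Build a weighting surface in [0, 1] from touch points (pixel coords)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    surf = np.zeros(shape, dtype=np.float64)
    for tx, ty in touches:
        surf += np.exp(-((xs - tx)**2 + (ys - ty)**2) / (2 * sigma**2))
    return np.clip(surf, 0.0, 1.0)

def touch_hdr_blend(hdr_img, non_hdr_img, touches):
    """hdr_img / non_hdr_img: float color images of shape (H, W, 3)."""
    w = weight_surface(hdr_img.shape[:2], touches)[..., None]
    return w * hdr_img + (1.0 - w) * non_hdr_img   # HDR where touched, else original
```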
Stereo metrology involves obtaining spatial estimates of an object’s length or perimeter using the disparity between
boundary points. True 3D scene information is required to extract length measurements of an object’s projection onto
the 2D image plane. In stereo vision the disparity measurement is highly sensitive to object distance, baseline distance,
calibration errors, and relative movement of the left and right demarcation points between successive frames. Therefore
a tracking filter is necessary to reduce position error and improve the accuracy of the length measurement to a useful
level. A Cartesian-coordinate extended Kalman filter (EKF) is designed based on the canonical equations of stereo
vision. This filter represents a simple reference design that has not seen much exposure in the literature. A second filter
formulated in a modified sensor-disparity (DS) coordinate system is also presented and shown to exhibit lower errors
during a simulated experiment.
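For reference, the canonical stereo equations that both filters build on, in a short sketch (illustrative names; the EKF and DS-coordinate filters themselves are not reproduced): a boundary point's 3D position follows from its disparity, and the length estimate is the distance between two such points.

```python
# Sketch of the canonical stereo measurement model used for length estimation.
import numpy as np

def triangulate(u_left, u_right, v, f_px, baseline_m, cx, cy):
    d = u_left - u_right                      # disparity in pixels
    Z = f_px * baseline_m / d                 # depth from disparity
    X = (u_left - cx) * Z / f_px
    Y = (v - cy) * Z / f_px
    return np.array([X, Y, Z])

def stereo_length(p_l1, p_r1, p_l2, p_r2, f_px, baseline_m, cx, cy):
    """Length between two boundary points, each given as (u, v) in both views."""
    P1 = triangulate(p_l1[0], p_r1[0], p_l1[1], f_px, baseline_m, cx, cy)
    P2 = triangulate(p_l2[0], p_r2[0], p_l2[1], f_px, baseline_m, cx, cy)
    return np.linalg.norm(P1 - P2)            # noisy per-frame estimate the filters smooth
```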
With the rapid growth of 3D technology, 3D image capture has become a critical part of the 3D feature set
on mobile phones. 3D image quality is affected by the scene geometry as well as on-the-device processing. An
automatic 3D system usually assumes known camera poses accomplished by factory calibration using a special
chart. In real life settings, pose parameters estimated by factory calibration can be negatively impacted by
movements of the lens barrel due to shaking, focusing, or camera drop. If any of these factors displaces the
optical axes of either or both cameras, vertical disparity might exceed the maximum tolerable margin and the
3D user may experience eye strain or headaches. To make 3D capture more practical, one needs to consider
unassisted (on arbitrary scenes) calibration. In this paper, we propose an algorithm that relies on detection
and matching of keypoints between left and right images. Frames containing erroneous matches, along with
frames with insufficiently rich keypoint constellations, are detected and discarded. Roll, pitch, yaw, and scale
differences between left and right frames are then estimated. The algorithm performance is evaluated in terms
of the remaining vertical disparity as compared to the maximum tolerable vertical disparity.
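A minimal sketch of the keypoint pipeline described above, assuming OpenCV and an illustrative match-count threshold: ORB keypoints are matched between the left and right frames, weak frames are rejected, roll and scale are read from a robust partial-affine fit, and the residual vertical disparity is measured. The paper's pitch/yaw estimation and frame-selection details are not reproduced.

```python
# Sketch of keypoint-based left/right calibration (illustrative thresholds).
import cv2
import numpy as np

MIN_MATCHES = 50                                   # assumed rejection threshold

def calibrate_pair(left_gray, right_gray):
    orb = cv2.ORB_create(3000)
    kp_l, des_l = orb.detectAndCompute(left_gray, None)
    kp_r, des_r = orb.detectAndCompute(right_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_l, des_r)
    if len(matches) < MIN_MATCHES:
        return None                                # frame discarded
    pts_l = np.float32([kp_l[m.queryIdx].pt for m in matches])
    pts_r = np.float32([kp_r[m.trainIdx].pt for m in matches])
    M, inliers = cv2.estimateAffinePartial2D(pts_r, pts_l, method=cv2.RANSAC)
    roll_deg = np.degrees(np.arctan2(M[1, 0], M[0, 0]))   # relative roll
    scale = np.hypot(M[0, 0], M[1, 0])                     # relative scale
    good = inliers.ravel() == 1
    vert_disp = np.median(pts_l[good, 1] - pts_r[good, 1]) # residual vertical disparity
    return roll_deg, scale, vert_disp
```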
The two major aspects of camera misalignment that cause visual discomfort when viewing images on a 3D display
are vertical and torsional disparities. While vertical disparities are uniform throughout the image, torsional rotations
introduce a range of disparities that depend on the location in the image. The goal of this study was to determine
the discomfort ranges for the kinds of natural image that people are likely to take with 3D cameras rather than the
artificial line and dot stimuli typically used for laboratory studies. We therefore assessed visual discomfort on a
five-point scale from 'none' to 'severe' for artificial misalignment disparities applied to a set of full-resolution
images of indoor scenes.
For viewing times of 2 s, discomfort ratings for vertical disparity in both 2D and 3D images rose rapidly toward
the discomfort level of 4 ('severe') by about 60 arcmin of vertical disparity. Discomfort ratings for torsional
disparity in the same image rose only gradually, reaching only the discomfort level of 3 ('strong') by about 50 deg
of torsional disparity. These data were modeled with a second-order hyperbolic compression function incorporating
a term for the basic discomfort of the 3D display in the absence of any misalignments through a Minkowski norm.
These fits showed that, at a criterion discomfort level of 2 ('moderate'), acceptable levels of vertical disparity were
about 15 arcmin. The corresponding values for the torsional disparity were about 30 deg of relative orientation.
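One plausible explicit form of the model described above, written out as a sketch; the parameter names (D_max, delta_50, the Minkowski exponent m) are assumptions, and the study's actual parameterization may differ.

```latex
% Sketch only: a second-order hyperbolic compression of the misalignment
% disparity \delta, combined with the display's baseline discomfort D_0
% through a Minkowski norm.  Symbols D_{\max}, \delta_{50}, m are assumed.
D(\delta) = D_{\max}\,\frac{\delta^{2}}{\delta^{2} + \delta_{50}^{2}},
\qquad
D_{\mathrm{total}} = \left( D_{0}^{\,m} + D(\delta)^{\,m} \right)^{1/m}
```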
Recently, 3D displays and videos have generated a lot of interest in the consumer electronics industry. To make
3D capture and playback popular and practical, a user friendly playback interface is desirable. Towards this end,
we built a real time software 3D video player. The 3D video player displays user-captured 3D videos, provides various 3D-specific image processing functions, and ensures a pleasant viewing experience. Moreover, the
player enables user interactivity by providing digital zoom and pan functionalities. This real time 3D player was
implemented on the GPU using CUDA and OpenGL. The player provides user interactive 3D video playback.
Stereo images are first read by the player from a fast drive and rectified. Further processing of the images
determines the optimal convergence point in the 3D scene to reduce eye strain. The rationale for this convergence
point selection takes into account scene depth and display geometry. The first step in this processing chain is
identifying keypoints by detecting vertical edges within the left image. Regions surrounding reliable keypoints
are then located in the right image through block matching. The differences in position between the corresponding regions in the left and right images are then used to calculate disparity. The extrema of the disparity histogram give the scene disparity range. The left and right images are shifted based upon the
calculated range, in order to place the desired region of the 3D scene at convergence. All the above computations
are performed on one CPU thread that calls CUDA functions. Image upsampling and shifting are performed in response to user zoom and pan. The player also includes a CPU display thread, which uses OpenGL rendering (quad buffers). This thread also gathers user input for digital zoom and pan and sends it to the processing thread.
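A minimal CPU sketch of the convergence logic described above (illustrative block and search sizes, not the player's CUDA implementation): disparities are measured by block matching around vertical-edge keypoints, and the extrema of their histogram give the scene disparity range from which the convergence shift is derived.

```python
# Sketch of the disparity-range / convergence computation.
import cv2
import numpy as np

def scene_disparity_range(left_gray, right_gray, block=16, search=64, step=50):
    # Keypoints: strong vertical edges in the left image.
    edges = np.abs(cv2.Sobel(left_gray, cv2.CV_32F, 1, 0, ksize=3))
    ys, xs = np.where(edges > np.percentile(edges, 99))
    h, w = left_gray.shape
    disparities = []
    for y, x in zip(ys[::step], xs[::step]):                # subsample keypoints
        if y < block or y + block > h or x - block - search < 0 or x + block > w:
            continue
        patch = left_gray[y - block:y + block, x - block:x + block]
        strip = right_gray[y - block:y + block, x - block - search:x + block]
        res = cv2.matchTemplate(strip, patch, cv2.TM_SQDIFF)
        j = int(cv2.minMaxLoc(res)[2][0])                   # offset of best match
        disparities.append(search - j)                      # x_left - x_right
    if not disparities:
        return 0, 0
    hist, bin_edges = np.histogram(disparities, bins=32)
    nz = np.nonzero(hist)[0]
    # The views are then shifted by an amount derived from this range so the
    # desired part of the scene sits at zero disparity (convergence).
    return bin_edges[nz[0]], bin_edges[nz[-1] + 1]
```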
Putting high quality and easy-to-use 3D technology into the hands of regular consumers has become a recent
challenge as interest in 3D technology has grown. Making 3D technology appealing to the average user requires
that it be made fully automatic and foolproof. Designing a fully automatic 3D capture and display system
requires: 1) identifying critical 3D technology issues like camera positioning, disparity control rationale, and
screen geometry dependency, and 2) designing a methodology to automatically control them. Implementing 3D capture
functionality on phone cameras necessitates designing algorithms to fit within the processing capabilities of the
device. Various constraints like sensor position tolerances, sensor 3A tolerances, post-processing, 3D video
resolution and frame rate should be carefully considered for their influence on 3D experience. Issues with
migrating functions such as zoom and pan from the 2D usage model (both during capture and display) to 3D
need to be resolved to ensure the highest level of user experience. It is also very important that the 3D usage
scenario (including interactions between the user and the capture/display device) is carefully considered. Finally,
both the processing power of the device and the practicality of the scheme need to be taken into account while
designing the calibration and processing methodology.
3D technology has recently made a transition from movie theaters to consumer electronic devices such as 3D
cameras and camcorders. In addition to what 2D imaging conveys, 3D content also contains information regarding
the scene depth. Scene depth is simulated through the strongest brain depth cue, namely retinal disparity. This
can be achieved by capturing an image by horizontally separated cameras. Objects at different depths will be
projected with different horizontal displacement on the left and right camera images. These images, when fed
separately to each eye, lead to retinal disparity. Since the perception of depth is the single most important 3D
imaging capability, an evaluation procedure is needed to quantify the depth capture characteristics. Evaluating
depth capture characteristics subjectively is a very difficult task since the intended and/or unintended side effects
from 3D image fusion (depth interpretation) by the brain are not immediately perceived by the observer, nor
do such effects lend themselves easily to objective quantification. Objective evaluation of 3D camera depth
characteristics is an important tool that can be used for "black box" characterization of 3D cameras. In this
paper we propose a methodology to evaluate the 3D cameras' depth capture capabilities.
Depth estimation in a focused plenoptic camera is a critical step for most applications of this technology and poses interesting challenges, as this estimation is content based. We present an iterative, content-adaptive algorithm that exploits the redundancy found in images captured by a focused plenoptic camera. Our algorithm determines for
each point its depth along with a measure of reliability allowing subsequent enhancements of spatial resolution
of the depth map. We remark that the spatial resolution of the recovered depth corresponds to discrete values
of depth in the captured scene to which we refer as slices. Moreover, each slice has a different depth and will
allow extraction of different spatial resolutions of depth, depending on the scene content present in that slice along with occluding areas. Interestingly, as the focused plenoptic camera is not theoretically limited in spatial
resolution, we show that the recovered spatial resolution is depth related, and as such, rendering of a focused
plenoptic image is content dependent.
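A heavily simplified sketch of the redundancy idea: a scene point appears in neighboring microimages with a depth-dependent shift, so matching a patch against its neighbor over a set of candidate shifts ("slices") yields a depth estimate together with a crude reliability score. The microimage geometry, the cost, and the reliability measure below are assumptions, not the paper's definitions.

```python
# Sketch: per-point slice (depth) estimate with a reliability score, using two
# horizontally adjacent microimages micro_a and micro_b.
import numpy as np

def estimate_slice(micro_a, micro_b, patch_yx, patch=7, shifts=range(1, 15)):
    """patch_yx must be chosen so all candidate windows stay inside micro_b."""
    y, x = patch_yx
    ref = micro_a[y:y + patch, x:x + patch].astype(np.float64)
    costs = []
    for s in shifts:                          # candidate inter-microimage shifts
        cand = micro_b[y:y + patch, x - s:x - s + patch].astype(np.float64)
        costs.append(np.mean((ref - cand) ** 2))
    costs = np.array(costs)
    best = int(np.argmin(costs))
    second = np.partition(costs, 1)[1]
    reliability = 1.0 - costs[best] / (second + 1e-9)   # ~0 ambiguous, ~1 distinct
    return list(shifts)[best], reliability
```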
The MIPI standard has adopted DPCM compression for RAW data images streamed from mobile cameras. This
DPCM is line based and uses a simple 1- or 2-pixel predictor. In this paper, we analyze the DPCM compression performance in terms of MTF degradation. To test this scheme's performance, we generated Siemens star images and binarized them to 2-level images. These two intensity values were chosen such that their intensity difference corresponds to the pixel differences that result in the largest relative errors in the DPCM compressor. (For example, a pixel transition from 0 to 4095 corresponds to an error of 6 between the DPCM compressed value and the original pixel value.) The DPCM scheme introduces different amounts of error based on the pixel difference.
We passed these modified Siemens star chart images to this compressor and compared the compressed images
with the original images using IT3 MTF response plots for slanted edges. Further, we discuss the PSF influence
on DPCM error and its propagation through the image processing pipeline.
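An illustrative line-based DPCM with a 1-pixel predictor and a piecewise residual quantizer is sketched below as a simplified stand-in for the MIPI scheme (not its actual code tables), showing why the introduced error grows with the pixel-to-pixel difference and why a binarized Siemens star with extreme transitions stresses the compressor.

```python
# Sketch only: small residuals are coded exactly, large ones coarsely, so the
# reconstruction error depends on the size of the pixel transition.
import numpy as np

def quantize_residual(residual, exact_range=32, step=64):
    """Illustrative piecewise quantizer, not the MIPI DPCM code tables."""
    if abs(residual) <= exact_range:
        return residual
    sign = 1 if residual > 0 else -1
    return sign * (exact_range + int(round((abs(residual) - exact_range) / step)) * step)

def dpcm_roundtrip(line, pixel_bits=12):
    """Compress and reconstruct one RAW line with a 1-pixel predictor."""
    recon = np.empty_like(line)
    recon[0] = line[0]                                  # first pixel sent verbatim
    for i in range(1, len(line)):
        residual = int(line[i]) - int(recon[i - 1])     # 1-pixel predictor
        recon[i] = np.clip(recon[i - 1] + quantize_residual(residual),
                           0, (1 << pixel_bits) - 1)
    return recon

line = np.array([100, 110, 2100, 100], dtype=np.int32)  # small and large transitions
print(dpcm_roundtrip(line) - line)                      # per-pixel error, e.g. [0 0 26 10]
```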
This paper explores a novel metric which can check the consistency and correctness of a disparity map, and hence
validate an interpolated view (or video frame for motion compensated frame interpolation) from the estimated
correspondences between two or more input views. The proposed reprojection error metric (REM) is shown to
be sufficient for the regions where the observed 3D scene has no occlusions. The metric is completely automatic,
requiring no human input. We also explain how the metric can be extended to be useful for 3D scenes (or videos)
with occlusions. However, the proposed metric does not satisfy necessary conditions. We discuss the issues which
arise during the design of a necessary metric, and argue that necessary metrics which work in finite time cannot
be designed for checking the validity of a method which performs disparity estimation.
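A minimal sketch of the reprojection idea behind such a metric (not necessarily the exact REM definition): the estimated disparity warps the right view onto the left view, and the per-pixel difference is measured over pixels whose warped source falls inside the image, a crude stand-in for occlusion handling.

```python
# Sketch: reprojection error of a disparity map between two rectified views.
import numpy as np

def reprojection_error(left, right, disparity):
    h, w = left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    src_x = np.round(xs - disparity).astype(int)     # x_right = x_left - d
    valid = (src_x >= 0) & (src_x < w)               # crude occlusion/out-of-view mask
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    warped = np.zeros_like(left, dtype=np.float64)
    warped[valid] = right[rows[valid], src_x[valid]]
    err = np.abs(left.astype(np.float64) - warped)
    return err[valid].mean(), valid                  # mean error over valid pixels
```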
KEYWORDS: 3D displays, 3D image processing, Visualization, Spatial resolution, Cameras, Gaussian filters, 3D visualizations, Video, 3D vision, 3D applications
Due to the limited display capacity of multiview/automultiscopic 3D displays (and other 3D display methods that recreate lightfields), regions and objects at greater depths from the zero-disparity plane appear aliased. One solution to this, namely prefiltering, renders the scene very blurry. An alternative approach is proposed in this paper, wherein regions at large depths are identified in each view. The 3D scene points corresponding to these regions are rendered as 2D only. The rest of the scene still retains parallax (hence the depth
perception). The advantages are that both aliasing and blur are removed, and the resolution of such regions is
greatly improved. A combination of the 2D and 3D visual cues still makes the scene look realistic, and the relative
depth information between objects in the scene is still preserved. Our method can prove to be particularly useful
for the 3D video conference application, where the people in the conference will be shown as 3D objects, but the
background will be displayed as a 2D object with high spatial resolution.
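A minimal sketch of the hybrid rendering idea, with an assumed disparity threshold: pixels far from the zero-disparity plane are replaced in every view by a single reference view, so they appear as sharp 2D content, while the rest of the scene keeps its parallax. The threshold and per-view handling are illustrative, not the paper's rendering pipeline.

```python
# Sketch: flatten large-depth regions to 2D while keeping parallax elsewhere.
import numpy as np

def hybridize_views(views, reference, disparity_maps, max_disp=8.0):
    """views/disparity_maps: per-view images and disparities relative to the
    zero-disparity plane; reference: the view used for flattened 2D regions."""
    out = []
    for view, disp in zip(views, disparity_maps):
        flat = np.abs(disp) > max_disp            # regions too deep to render in 3D
        mixed = view.copy()
        mixed[flat] = reference[flat]             # show them as 2D (no parallax)
        out.append(mixed)
    return out
```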