TWK: 8th Tübingen Perception Conference
25th - 27th Feb 2005
Model-Based Interface Quality 3D Hand-Tracking
Ferenc Kahlesz (University of Bonn)
The feedback subsystem of 3D human-computer interaction has seen tremendous advancement in recent years, as computer graphics has become increasingly able to render stunningly immersive 3D scenes. This holds not only for expensive VR environments but also for commodity desktop PCs, thanks to advances in GPUs. A natural demand from the user is to use her own hand(s) for 3D interaction with these virtual environments. Unfortunately, the feedforward part, i.e. the subsystems providing 3D user input, has not advanced on par with visualization. Among other problems, hardware devices for true 3D input are expensive and intrusive (e.g. datagloves, electromagnetic tracking systems), while less intrusive vision-based systems do not yet allow true 3D input. As a consequence, we still have to use intrusive and cumbersome 3D input devices in virtual environments (and 3D desktop applications), which diminishes user performance and experience.
Our goal is to provide the user with a “virtual dataglove”: a complete, nonintrusive, vision-based replacement of the traditional dataglove. Our “virtual dataglove” measures the 3D pose of the hand (global motion) and the movement of the fingers (local motion). We call the global 3D parameters and the joint angles of the hand together the state of the hand. The user’s hand is observed by one or more cameras, and the state of the hand is determined for every camera frame. This problem belongs to the area called ‘articulated’ or ‘deformable object tracking’. Basically, there are two types of approaches to articulated tracking: image- (or contour-) based and model-based methods.
We pursue a model-based approach. First, camera images are segmented via skin-color segmentation into binary hand-contour images. Then, a 21+6 = 27 degree-of-freedom 3D hand model is fitted to these contour images. Using calibrated cameras, our OpenGL hand model can be rendered into the images with the same (interior and exterior) camera parameters as the real cameras. The error of the fit is the number of pixels set in the XOR image of the contour image and the rendered image. This error function can be computed efficiently on current graphics hardware, which enables a large number of error function evaluations in every camera frame. The fitting error is minimized by an iterative Downhill-Simplex optimization algorithm. Two kinds of constraints on the possible hand states are exploited to narrow the search space of the optimization: the joint angles of the fingers are biometrically constrained, and the dimensionality of the 21 DOF state vector can be reduced to about 11 DOFs using PCA. In addition, we exploit temporal coherence by providing a good initial value for the Downhill-Simplex optimization, using the result of the optimization from the last frame and first-order motion prediction.
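As a rough illustration of two steps described above, the following Python sketch computes an XOR pixel-count error between two binary silhouettes and a first-order (constant-velocity) prediction of the initial state. It is only a sketch under stated assumptions: the function names are hypothetical, and `render_disc` is a toy stand-in that draws a filled disc instead of the OpenGL hand model, so the state here has 3 parameters rather than 27.

```python
import numpy as np


def render_disc(state, shape=(64, 64)):
    """Toy stand-in for the hand renderer (assumption, not the real model).

    Returns a binary mask of a filled disc with state = (cx, cy, r).
    """
    cx, cy, r = state
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    return (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2


def xor_error(observed, rendered):
    """Fit error: number of pixels where the two binary images disagree."""
    return int(np.count_nonzero(np.logical_xor(observed, rendered)))


def predict_initial_state(prev, prev_prev):
    """First-order motion prediction: extrapolate the last frame-to-frame change."""
    prev = np.asarray(prev, dtype=float)
    prev_prev = np.asarray(prev_prev, dtype=float)
    return prev + (prev - prev_prev)


# Observed silhouette for the current frame (here: synthesized from a known state).
observed = render_disc((32.0, 32.0, 10.0))

# States estimated in the two previous frames; the disc moved 2 pixels per frame in x.
guess = predict_initial_state([30.0, 32.0, 10.0], [28.0, 32.0, 10.0])

error_at_last_frame = xor_error(observed, render_disc((30.0, 32.0, 10.0)))
error_at_prediction = xor_error(observed, render_disc(guess))
print(error_at_last_frame, error_at_prediction)
```

In this toy case the prediction lands exactly on the true state, so the optimizer would start at zero error. In general the predicted state would only be a starting point for a derivative-free minimizer such as Downhill-Simplex, which suits this error because the pixel count is piecewise constant and has no useful gradient.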