Institute for Space Systems Operations * 2001 Annual Report * 42-48

Optical Tracking for Telepresence/Teleoperation Space Applications

Abstract--UH researchers are developing a non-intrusive method for human motion estimation from a monocular video camera for the tele-operation of ROBONAUT. ROBONAUT is an anthropomorphic robot developed at NASA-JSC, which is capable of dexterous, human-like maneuvers used in extra-vehicular activity (EVA). The tele-operator is represented using an articulated three-dimensional model consisting of rigid links connected by spherical joints. The shape of a link is described by a triangular mesh and its motion by six parameters: one three-dimensional translation vector and three rotation angles. The motion parameters of the links are estimated by maximizing the conditional probability of the frame-to-frame intensity differences at observation points. The conditional probability is a function of the motion parameters, the frame-to-frame intensity differences, and the covariance matrix of the intensity error at the observation points. The intensity error is considered to be the result of the camera noise, the shape estimation error, and the position error attributed to the motion estimation errors incurred by the motion analysis of previous frames. The algorithm was applied to synthetic and real test sequences of a moving arm with very encouraging results. Specifically, the mean error for the derived wrist position (using the estimated motion parameters) was 0.6±1.0 mm for the synthetic image sequences. Motion estimates were used successfully to command a ROBONAUT simulation by remote control.

Human exploration and development of space will demand a heavy extravehicular activity (EVA, or space walk) workload from a small number of crew members. To alleviate the astronaut workload, tele-operated robots are currently being developed at NASA Johnson Space Center. One such tele-robot is the ROBONAUT (ROBOtic astroNAUT), which is an anthropomorphic robot with two arms, two hands, a head, a torso and a stabilizing leg (Fig. 1). The arms will be capable of dexterous, human-like maneuvers to handle common EVA tools.

Figure 1

Figure 1. Dr. Ioannis Kakadiaris greets ROBONAUT (ROBOtic astroNAUT), an anthropomorphic robot with two arms, two hands, a head, a torso, and a stabilizing leg.

A more intuitive way to tele-operate the ROBONAUT than using a joystick alone is to estimate the three-dimensional motion of the tele-operator's body parts (e.g., head, arms, torso, and legs) and then use the estimated motion to control the ROBONAUT. In such a system, the robot imitates the movements made by the tele-operator. As the tele-operator reaches out an arm, so does the ROBONAUT. If the tele-operator starts twisting a screwdriver, the ROBONAUT should copy the action.

Currently, off-the-shelf systems for human motion estimation are very intrusive and encumbering because they attach devices such as sensors or markers to the operator. Our goal is to develop a non-intrusive system for human motion estimation from a monocular image sequence for the tele-operation of ROBONAUT.

The existing literature on non-intrusive human motion estimation from a monocular video camera can be roughly divided into three groups. The first group estimates the motion from image features such as edges and silhouettes [1-4]. The second group estimates the motion from frame-to-frame intensity differences at observation points. The motion parameters that minimize the frame-to-frame intensity differences at the observation points are taken as the estimates [5-7]. Finally, in the third group both image features and frame-to-frame intensity differences are taken into account for motion estimation [2]. In this project the motion is estimated by maximizing the conditional probability of the frame-to-frame intensity differences at observation points [8-10].

For Maximum-Likelihood human motion estimation, the tele-operator is represented by a three-dimensional model consisting of rigid links connected by spherical joints. The three-dimensional (3D) shape of a link l is described by a triangular mesh (Fig. 2). The 3D motion of a link l from discrete time tk to tk-1 is described by six parameters Bl: one three-dimensional translation vector DTl and three rotation angles DRl (Fig. 3). The texture of a link is defined by projecting a real image onto its triangular mesh.

Figure 2

Figure 3

Figure 2. Triangular mesh representing the right arm of a human operator.

Figure 3. Three-dimensional motion of the right upper arm and right lower arm of a human operator from discrete time tk to tk-1. The motion of each part l is described by six parameters Bl: one three-dimensional translation vector DTl and three rotation angles DRl.
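As a minimal sketch of the six-parameter link motion described above, the following Python applies three rotation angles and one translation vector to the vertices of a link's mesh. The rotation order (about x, then y, then z), the choice of rotating about a given center point, and the function names are assumptions made for illustration; the report does not specify them.

```python
import numpy as np

def rotation_matrix(angles):
    """Build a 3x3 rotation from three angles in radians (assumed order: x, y, z)."""
    ax, ay, az = angles
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return rz @ ry @ rx

def move_link(vertices, center, d_rot, d_trans):
    """Rotate mesh vertices (N x 3) about `center`, then translate them."""
    vertices = np.asarray(vertices, dtype=float)
    center = np.asarray(center, dtype=float)
    r = rotation_matrix(d_rot)
    return (vertices - center) @ r.T + center + np.asarray(d_trans, dtype=float)

# Hypothetical example: one vertex, 90 degree rotation about z, unit shift along z.
moved = move_link([[1.0, 0.0, 0.0]], [0.0, 0.0, 0.0],
                  (0.0, 0.0, np.pi / 2), (0.0, 0.0, 1.0))
```

In this example the vertex (1, 0, 0) is rotated onto the y axis and then lifted by the translation, ending up at (0, 1, 1).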

The input to the algorithm is a monocular video signal and the model of the operator, while the output is the motion parameters of the links over time. Specifically, the motion parameters of a link l are estimated by maximizing the conditional probability p(FDl|Bl) of the frame-to-frame intensity differences FDl at observation points:

$$\hat{B}_l = \arg\max_{B_l} \; p(FD_l \mid B_l)$$

where \hat{B}_l denotes the Maximum-Likelihood estimates of the motion parameters of the link l. An observation point lies on the surface of the triangular mesh and carries the corresponding intensity value at its surface position. The conditional probability is a function of the motion parameters, the frame-to-frame intensity differences, and the covariance matrix of the intensity error at the observation points. The intensity error is the result of the camera noise, the shape estimation error, and the position error due to the motion estimation errors incurred by the motion analysis of previous frames. The covariance matrix of the intensity error is determined by modeling the position and the shape estimation error of the link and the camera noise by zero-mean stationary Gaussian stochastic processes.
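To illustrate how a Gaussian error model can turn the maximization of p(FD|B) into a weighted least-squares problem, here is a sketch under an additional assumption that the report does not state: the frame-to-frame intensity differences depend linearly on the motion parameters, FD = O·B + e, with e zero-mean Gaussian with covariance U. Under that assumption, maximizing p(FD|B) is equivalent to minimizing (FD − O·B)ᵀ U⁻¹ (FD − O·B). The matrix O, the function name, and the synthetic data are hypothetical.

```python
import numpy as np

def ml_estimate(fd, obs_matrix, cov):
    """ML estimate of B for the assumed model fd = obs_matrix @ B + e, e ~ N(0, cov)."""
    # Apply U^{-1} through linear solves instead of forming an explicit inverse.
    w_obs = np.linalg.solve(cov, obs_matrix)   # U^{-1} O
    w_fd = np.linalg.solve(cov, fd)            # U^{-1} FD
    normal = obs_matrix.T @ w_obs              # O^T U^{-1} O
    rhs = obs_matrix.T @ w_fd                  # O^T U^{-1} FD
    return np.linalg.solve(normal, rhs)

# Synthetic, noise-free check with 200 observation points and 6 parameters.
rng = np.random.default_rng(0)
true_b = np.array([0.5, -0.2, 0.1, 0.01, -0.02, 0.03])
obs = rng.normal(size=(200, 6))
cov = np.diag(rng.uniform(0.5, 2.0, size=200))
estimate = ml_estimate(obs @ true_b, obs, cov)
```

With noise-free differences the estimate recovers the parameters used to generate them; with noisy data, observation points with a larger error variance would receive proportionally less weight.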

In the following discussion, we will present a step-wise description of the Maximum-Likelihood human motion estimation algorithm as well as some experimental results on synthetic and real data to assess the accuracy, limitations and advantages of this estimation algorithm.

Algorithm

  1. Read the first image of the image sequence.
  2. Adapt the position, pose, shape and texture of the tele-operator's model to the image content of the first image of the image sequence.
    To this end, we use an algorithm whose inputs are the tele-operator's model, the real anthropometric dimensions of the operator's links, and the image positions of the joints in the first image of the image sequence (Fig. 4). First, the links of the model are scaled according to the real anthropometric dimensions, and then their position and orientation are computed from the known image joint positions in the first image (assuming that all the links are parallel to the image plane of the camera). Finally, the texture of the articulated model object is defined by projecting the first image of the image sequence onto the surface of the model.

Figure 4

Figure 4. Shape and pose adaptation of the right arm of a human operator to the first image I0 of an image sequence.
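The pose adaptation described above can be sketched for a single link as follows. This is an illustrative reading, not the authors' procedure: it assumes a pinhole camera with a known focal length in pixels and a known principal point, neither of which the report gives. Because the link is assumed parallel to the image plane, its depth follows from the ratio of its real length to its length in the image. All names and numbers are hypothetical.

```python
import numpy as np

def init_link_pose(joint_a_px, joint_b_px, link_length, focal_px, principal_px):
    """Place a link of known length in 3D, assuming it is parallel to the image plane."""
    a = np.asarray(joint_a_px, dtype=float)
    b = np.asarray(joint_b_px, dtype=float)
    c = np.asarray(principal_px, dtype=float)
    image_length = np.linalg.norm(b - a)             # link length in pixels
    depth = focal_px * link_length / image_length    # pinhole model: l = f * L / Z

    def back_project(p):
        # Invert the pinhole projection at the common depth of both joints.
        xy = (p - c) * depth / focal_px
        return np.array([xy[0], xy[1], depth])

    in_plane_angle = np.arctan2(b[1] - a[1], b[0] - a[0])
    return back_project(a), back_project(b), in_plane_angle

# Hypothetical numbers: a 30 cm upper arm seen as 100 pixels with f = 1000 pixels.
shoulder, elbow, angle = init_link_pose((320, 240), (420, 240), 30.0, 1000.0, (320, 240))
```

In this example the link is placed 300 cm from the camera, and the distance between the two back-projected joints equals the given 30 cm link length.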

  3. Select observation points.
    An observation point lies on the surface of the triangular mesh and carries the corresponding intensity value at its surface position. In order to reduce the influence of the camera noise and to increase the accuracy of the estimates, only those observation points with a high linear intensity gradient are taken into account for motion estimation (Fig. 5). Intensity values and linear intensity gradients are taken from the same image from which the texture was derived (the first image in the image sequence). The Sobel operator is applied to compute the linear intensity gradients.

Figure 5

Figure 5. Selected observation points for the upper and lower model arm of a human operator. Only observation points with high linear intensity gradient were selected for motion estimation.
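A minimal sketch of gradient-based observation point selection is given below. It computes Sobel gradients on a grayscale image and keeps only candidate pixels whose gradient magnitude exceeds a threshold. Treating the candidates as pixel positions of projected mesh points, using the gradient magnitude as the selection criterion, and the threshold value are assumptions for illustration; the report does not give these details.

```python
import numpy as np

def sobel_gradient_magnitude(image):
    """Gradient magnitude of a 2D grayscale image using the 3x3 Sobel kernels."""
    img = np.asarray(image, dtype=float)
    padded = np.pad(img, 1, mode="edge")   # replicate borders to keep the image size
    kx = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
    ky = kx.T
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    h, w = img.shape
    # Accumulate the 3x3 correlation one kernel tap at a time.
    for dy in range(3):
        for dx in range(3):
            window = padded[dy:dy + h, dx:dx + w]
            gx += kx[dy, dx] * window
            gy += ky[dy, dx] * window
    return np.hypot(gx, gy)

def select_observation_points(image, candidates, threshold):
    """Keep candidate (row, col) positions whose gradient magnitude exceeds threshold."""
    magnitude = sobel_gradient_magnitude(image)
    return [(r, c) for r, c in candidates if magnitude[r, c] > threshold]

# Hypothetical test image: a vertical step edge between columns 3 and 4.
step = np.zeros((8, 8))
step[:, 4:] = 100.0
selected = select_observation_points(step, [(4, 1), (4, 3), (4, 4), (4, 6)], 100.0)
```

Only the two candidates adjacent to the edge survive; the candidates in the flat regions have zero gradient and are discarded.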

  4. Read the next image of the image sequence.
  5. Estimate motion parameters of the links.
    Instead of simultaneously estimating all the motion parameters of the links, we follow a decomposition approach. First, the translation and rotation parameters of the root link are estimated using a Maximum-Likelihood estimator. Then only the rotation angles of the remaining links are estimated, one link after the other, starting from the root link. For the estimation of the rotation angles of the current link, two steps are applied. First, the translation parameters are calculated by evaluating the motion parameters of the previous link in the equation representing the constraints imposed by the spherical joint on the relative motion of the previous link and the current link. The current link and its observation points are then moved using the calculated translation vector. Second, the rotation angles of the current link are estimated using a Maximum-Likelihood estimator. In order to improve the accuracy and reliability of the motion estimates, the Maximum-Likelihood estimator is applied iteratively for each link.
  6. Move links, joints and observation points with the motion estimates and go to step 4.
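The spherical-joint constraint used in the motion-estimation step can be sketched as follows: once the previous link has moved, the joint it shares with the current link has moved with it, and the current link must be translated by the same displacement to stay attached. The convention that a link rotates about its own center point, and all names and values below, are assumptions for illustration; the report does not give the constraint equation.

```python
import numpy as np

def child_translation(joint, parent_center, parent_rotation, parent_translation):
    """Translation that keeps the current link attached to the moved previous link."""
    joint = np.asarray(joint, dtype=float)
    parent_center = np.asarray(parent_center, dtype=float)
    parent_translation = np.asarray(parent_translation, dtype=float)
    # New joint position after the previous link rotates about its center and translates.
    moved_joint = parent_rotation @ (joint - parent_center) + parent_center + parent_translation
    return moved_joint - joint

# Hypothetical example: previous link rotates 90 degrees about z and shifts 1 cm along x.
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
d_t = child_translation(joint=[30.0, 0.0, 0.0],
                        parent_center=[0.0, 0.0, 0.0],
                        parent_rotation=rot_z_90,
                        parent_translation=[1.0, 0.0, 0.0])
```

Here the joint moves from (30, 0, 0) to (1, 30, 0), so the current link is translated by (-29, 30, 0) before its own rotation angles are estimated. Fixing the translation this way leaves only three unknowns per non-root link.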

Experimental results
We have performed a number of experiments on synthetic and real data to assess the accuracy, limitations and advantages of the Maximum-Likelihood Motion Estimation Algorithm. Real image sequences were obtained using a Pulnix TMC-9700 1-2/3" CCD Progressive Scan Color Video Camera with a 2/3" 9 mm lens and a 640×480 RGB video output at a frame rate of 30 Hz. The video signal was acquired using a Matrox Meteor-II/Multi-Channel frame grabber. All experiments were performed on a Pentium III (1 GHz) workstation with 0.5 GB RAM. The average processing time was 1.02 s per frame. We limit this report to experimental results obtained from two real image sequences called HAZEL-A and HAZEL-B.

For the first experiment, we applied our Maximum-Likelihood human motion estimation algorithm to a synthetic image sequence depicting (without loss of generality) the moving right arm of a virtual human. The dimensions of the arm correspond to the dimensions of the right arm of one of the co-authors. The synthetic image sequence was generated by obtaining 400 images (RGB, 640×480 pixels) [11] of the arm at different times while the arm moves along a predefined trajectory. The maximum magnitude of the frame-to-frame translation vector of the shoulder, elbow, and wrist in the 3D virtual world is 0.5 cm, 1.15 cm, and 1.89 cm, respectively. The maximum magnitude of the frame-to-frame displacement vector of the shoulder, elbow, and wrist in the image plane is 2.22 pixels, 4.95 pixels, and 9.15 pixels, respectively. Experimental results showed a mean error for the derived shoulder, elbow and wrist position (using the estimated motion parameters) of 0.056743 cm, 0.049568 cm, and 0.055171 cm, and a variance of 0.003976 cm², 0.002582 cm², and 0.009226 cm², respectively.
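The means and variances above can be converted to the millimeter figures quoted in the abstract with a short calculation. The sketch below assumes that the "±" term is the standard deviation (the square root of the reported variance); the report does not say so explicitly, but the wrist value matches the abstract's 0.6±1.0 mm under that reading.

```python
import math

# Values in cm and cm^2, copied from the synthetic-sequence results.
mean_cm = {"shoulder": 0.056743, "elbow": 0.049568, "wrist": 0.055171}
var_cm2 = {"shoulder": 0.003976, "elbow": 0.002582, "wrist": 0.009226}

def mean_std_mm(mean_cm_value, var_cm2_value):
    """Convert a mean (cm) and variance (cm^2) to mean and standard deviation in mm."""
    mean_mm = mean_cm_value * 10.0
    std_mm = math.sqrt(var_cm2_value) * 10.0
    return round(mean_mm, 1), round(std_mm, 1)

summary = {joint: mean_std_mm(mean_cm[joint], var_cm2[joint]) for joint in mean_cm}
# summary == {"shoulder": (0.6, 0.6), "elbow": (0.5, 0.5), "wrist": (0.6, 1.0)}
```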

For the second experiment, we tested the Maximum-Likelihood Motion Estimation Algorithm using the real test sequences HAZEL-A and HAZEL-B (200 frames each) depicting a woman grasping and moving an object in front of a bookshelf. Figs. 6(a-f) and 7(a-f) depict the model at the estimated position and orientation overlaid on the image sequences HAZEL-A and HAZEL-B, respectively. Although the model remains well aligned during tracking, some position errors can still be observed. For example, in the frame shown in Fig. 6(d) the shoulder has drifted backwards. However, the algorithm quickly compensated for these errors and tracking was not lost.

Figure 6

Figure 6. Frames 1, 40, 80, 116, 160, and 200 from the sequence HAZEL-A. (a-f) Original frames with the model overlaid at the estimated position and orientation.

Figure 7

Figure 7. Frames 1, 45, 90, 160, 180, and 200 from the sequence HAZEL-B. (a-f) Original frames with the model overlaid at the estimated position and orientation.

Finally, the estimated motion parameters from the real image sequences HAZEL-A and HAZEL-B were used successfully to command a ROBONAUT simulation developed at the NASA-Johnson Space Center. Figs. 8(a-f) and 9(a-f) depict the coronal and sagittal views, respectively, of a ROBONAUT simulation being animated with the estimated motion parameters of HAZEL-B.

Figure 8

Figure 8. Commanding a ROBONAUT simulation developed at NASA-JSC with the estimated motion parameters of the sequence HAZEL-B. (a-f) Coronal view of the postures corresponding to the frames 1, 45, 90, 160, 180, and 200 of the sequence HAZEL-B.

Figure 9

Figure 9. Commanding a ROBONAUT simulation developed at NASA-JSC with the estimated motion parameters of the sequence HAZEL-B. (a-f) Sagittal view of the postures corresponding to the frames 1, 45, 90, 160, 180, and 200 of the sequence HAZEL-B.

Concluding remarks
We have developed a non-intrusive system for estimating the motion of a human body from a monocular video camera. For motion estimation, the human body is represented by an articulated model consisting of rigid links connected by spherical joints. The translation vector and the rotation angles of the root link are estimated by applying a Maximum-Likelihood motion estimation algorithm. The rotation angles of each of the remaining links are estimated in three steps. In the first step, the translation vector of the link is computed from the motion parameters of the previous reference link. In the second step, the link is motion compensated with the computed motion parameters. Finally, in the third step, the three rotation angles are estimated by applying a Maximum-Likelihood motion estimation algorithm. The experimental results on synthetic data revealed a position error for the shoulder, elbow and wrist of 0.6±0.6 mm, 0.5±0.5 mm and 0.6±1.0 mm, respectively. Furthermore, the model object remained well aligned during the tested image sequences. Although we have presented results from estimating the motion of a single arm, our algorithm is general and can be applied to upper body and lower body extremities. Experiments are currently underway to assess the robustness of our algorithm under occlusion.

Acknowledgements
This project is supported by an ISSO Postdoctoral Fellowship and a software grant (NDDS) from Real Time Innovations, Inc.

References
1. L. Goncalves, E. Di Bernardo, E. Ursella, and P. Perona. "Monocular Tracking of the Human Arm in 3D," Proc., Int'l Conf. on Computer Vision, June 1995. 764-70.
2. I. A. Kakadiaris and D. Metaxas. "Model-Based Estimation of 3D Human Motion," IEEE Trans. on Pattern Analysis and Machine Intelligence 22.12 (2000): 1453-59.
3. I. A. Kakadiaris and D. Metaxas. "Model-Based Estimation of 3D Human Motion with Occlusion Based on Active Multi-Viewpoint Selection," Proc., IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, June 1996. 81-87.
4. C. R. Wren and A. P. Pentland. "Dynamic Models of Human Motion," Proc., Int'l Conf. on Automatic Face and Gesture Recognition, April 1998. 22-27.
5. C. Bregler and J. Malik. "Tracking People with Twists and Exponential Maps," Proc., IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, June 1998. 8-15.
6. R. Koch. "Dynamic 3D Scene Analysis through Synthesis Feedback Control," IEEE Trans. on Pattern Analysis and Machine Intelligence 15.6 (1993): 556-68.
7. M. Yamamoto and K. Koshikawa. "Human Motion Analysis Based on a Robot Arm Model," Proc., IEEE Conf. on Computer Vision and Pattern Recognition, June 1991. 664-65.
8. G. Martinez. "Maximum-Likelihood Motion Estimation of Articulated Objects for Object-Based Analysis-Synthesis Coding," Proc., Picture Coding Symposium, Seoul, Korea, April 2001. 393-96.
9. G. Martinez. "Analyse-Synthese-Codierung basierend auf dem Modell bewegter dreidimensionaler, gegliederter Objekte," Ph.D. Dissertation, Institut fuer Theoretische Nachrichtentechnik und Informationsverarbeitung, Univ. of Hannover, Hannover, Germany, Feb. 1998.
10. G. Martinez, I. Kakadiaris, and D. Magruder. "Optical Tracking for Telepresence/Teleoperation Space Applications," Technical Report, Department of Computer Science, University of Houston, March 2002.
11. D. DeCarlo and D. Metaxas. "Optical Constraints on Deformable Models with Applications to Face Tracking," Int'l J. Computer Vision 38.2 (2000): 99-127.

Publications
Martinez, G., I. Kakadiaris, and D. Magruder. "Optical Tracking for Telepresence/Teleoperation Space Applications," Technical Report, Department of Computer Science, University of Houston, March 2002.

Investigative Team

UH PI: Ioannis Kakadiaris, Ph.D., Professor
Visual Computing Lab
Department of Computer Science
University of Houston
Houston, TX 77204-3010
Phone: (713) 743-1255; Fax: (713) 743-1250
Email: ioannisk@uh.edu

UH PI: Karolos Grigoriadis, Ph.D., Professor
Department of Mechanical Engineering
Cullen College of Engineering
University of Houston
Houston, TX 77204-4792
Phone: (713) 743-4387; Fax: (713) 743-4503
Email: karolos@uh.edu

NASA-JSC PI: Darby Magruder
Robotic Systems Technology Branch
Automation and Robotics Division
2101 NASA Road 1, Code ER4
Houston, TX 77058
Phone: (281) 483-7069; Fax: (281) 483-7580
Email: darby.f.magruder1@jsc.nasa.gov

NASA-JSC PI: Kenneth Baker, Ph.D.
Robotic Systems Technology Branch
Automation and Robotics Division
2101 NASA Road 1, Code ER4
Houston, TX 77058
Phone: (281) 483-2041; Fax: (281) 483-7580
Email: Kenneth.baker1@jsc.nasa.gov

UH PDAF: Geovanni Martinez, Ph.D.
Visual Computing Lab
Department of Computer Science
University of Houston
Houston, TX 77204-3010
Phone: (713) 743-1268; Fax: (713) 743-1250
Email: geovanni@uh.edu

Copyright © 2002
