Institute for Space Systems Operations * 2001 Annual Report * 42-48
| Abstract--UH researchers are developing a non-intrusive method for human motion estimation from a monocular video camera for the tele-operation of ROBONAUT. ROBONAUT is an anthropomorphic robot developed at NASA-JSC, which is capable of dexterous, human-like maneuvers used in extra-vehicular activity (EVA). The tele-operator is represented using an articulated three-dimensional model consisting of rigid links connected by spherical joints. The shape of a link is described by a triangular mesh and its motion by six parameters: one three-dimensional translation vector and three rotation angles. The motion parameters of the links are estimated by maximizing the conditional probability of the frame-to-frame intensity differences at observation points. The conditional probability is a function of the motion parameters, the frame-to-frame intensity differences, and the covariance matrix of the intensity error at the observation points. The intensity error is considered to be the result of the camera noise, the shape estimation error, and the position error attributed to the motion estimation errors incurred by the motion analysis of previous frames. The algorithm was applied to synthetic and real test sequences of a moving arm with very encouraging results. Specifically, the mean error for the derived wrist position (using the estimated motion parameters) was 0.6±1.0 mm for the synthetic image sequences. Motion estimates were used successfully to command a ROBONAUT simulation by remote control. |
Human exploration and development of space will demand a heavy extravehicular activity (EVA, or space walks) workload from a small number of crew members. To alleviate the astronaut workload, tele-operated robots are currently being developed at NASA Johnson Space Center. One such tele-robot is ROBONAUT (ROBOtic astroNAUT), an anthropomorphic robot with two arms, two hands, a head, a torso and a stabilizing leg (Fig. 1). The arms will be capable of dexterous, human-like maneuvers to handle common EVA tools.
| Figure 1. Dr. Ioannis Kakadiaris greets ROBONAUT (ROBOtic astroNAUT), an anthropomorphic robot with two arms, two hands, a head, a torso, and a stabilizing leg. |
A more intuitive way to tele-operate ROBONAUT than using a joystick is to estimate the three-dimensional motion of the tele-operator's body parts (e.g., head, arms, torso, and legs) and then use the estimated motion to control ROBONAUT. In such a system, the robot imitates the movements made by the tele-operator. As the tele-operator reaches out an arm, so does ROBONAUT. And if the tele-operator starts twisting a screwdriver, ROBONAUT should copy the action.
Currently, off-the-shelf systems for human motion estimation are very intrusive and encumbering because they attach devices such as sensors or markers to the operator. Our goal is to develop a non-intrusive system for human motion estimation from a monocular image sequence for the tele-operation of ROBONAUT.
The existing literature on non-intrusive human motion estimation from a monocular video camera can be roughly divided into three groups. The first group estimates the motion from image features such as edges and silhouettes.1-4 The second group estimates the motion from frame-to-frame intensity differences at observation points: the motion parameters that minimize the frame-to-frame intensity differences at the observation points are taken as the estimates.5-7 Finally, the third group takes both image features and frame-to-frame intensity differences into account for motion estimation.2 In this project, the motion is estimated by maximizing the conditional probability of the frame-to-frame intensity differences at observation points.8-10
For Maximum-Likelihood human motion estimation, the tele-operator is represented by an articulated three-dimensional model consisting of rigid links connected by spherical joints. The three-dimensional (3D) shape of a link l is described by a triangular mesh (Fig. 2). The 3D motion of a link l from discrete time tk to tk-1 is described by six parameters Bl: one three-dimensional translation vector DTl and three rotation angles DRl (Fig. 3). The texture of a link is defined by projecting a real image onto its triangular mesh.
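To make the six-parameter description concrete, the following is a minimal sketch of applying one translation vector and three rotation angles to the vertices of a link's mesh. The report does not state the rotation order or the center of rotation, so the composition Rz·Ry·Rx and the rotation about a joint position passed in as `center` are assumptions made here for illustration; the function names are hypothetical.

```python
import math

def rotation_matrix(rx, ry, rz):
    """3x3 rotation from three angles in radians, composed as Rz * Ry * Rx (assumed order)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

def move_link(vertices, center, translation, angles):
    """Rotate mesh vertices about `center` (assumed: the link's joint), then translate them."""
    R = rotation_matrix(*angles)
    moved = []
    for v in vertices:
        d = [v[i] - center[i] for i in range(3)]
        rotated = [sum(R[i][j] * d[j] for j in range(3)) for i in range(3)]
        moved.append(tuple(rotated[i] + center[i] + translation[i] for i in range(3)))
    return moved

# One vertex 10 cm from the joint along x: rotate 90 degrees about z, then shift 1 cm along y.
new_vertices = move_link(
    [(10.0, 0.0, 0.0)], (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, math.pi / 2)
)
```

In this toy case the vertex ends up at approximately (0, 11, 0). A full mesh would simply pass all of its vertices through the same call.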
| Figure 2. Triangular mesh representing the right arm of a human operator. | Figure 3. Three-dimensional motion of the right upper arm and right lower arm of a human operator from discrete time tk to tk-1. The motion of each part l is described by six parameters Bl: one three-dimensional translation vector DTl and three rotation angles DRl. |
The input to the algorithm is a monocular video signal and the model of the operator, while the output is the motion parameters of the links over time. Specifically, the motion parameters of a link l are estimated by maximizing the conditional probability p(FDl|Bl) of the frame-to-frame intensity differences FDl at observation points:
$$\hat{B}_l = \arg\max_{B_l} \; p(FD_l \mid B_l)$$
where $\hat{B}_l$ are the Maximum-Likelihood estimates of the motion parameters
of the link l. An observation point lies on the surface of the triangular mesh
and carries the corresponding intensity value at its surface position. The conditional
probability is a function of the motion parameters, the frame-to-frame intensity
differences, and the covariance matrix of the intensity error at the observation points.
The intensity error is the result of the camera noise, the shape estimation error, and the
position error due to the motion estimation errors incurred by the motion analysis of
previous frames. The covariance matrix of the intensity error is determined by modeling
the position and the shape estimation error of the link and the camera noise by zero-mean
stationary Gaussian stochastic processes.
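As a hedged illustration of how such a conditional probability can look, suppose (this is an assumption, not something stated in this report) that the frame-to-frame intensity differences at N observation points depend linearly on the motion parameters, with a hypothetical observation matrix $O_l$ built from the intensity gradients and a zero-mean Gaussian intensity error $e_l$ with covariance matrix $U_l$:

```latex
% Assumed linear observation model (O_l and U_l are illustrative symbols)
FD_l = O_l B_l + e_l, \qquad e_l \sim \mathcal{N}(0, U_l)

% Conditional probability of the intensity differences under that model
p(FD_l \mid B_l) = \frac{1}{\sqrt{(2\pi)^N \det U_l}}
  \exp\!\left( -\tfrac{1}{2} (FD_l - O_l B_l)^T U_l^{-1} (FD_l - O_l B_l) \right)

% Maximizing p is equivalent to minimizing the quadratic form, which gives
\hat{B}_l = \left( O_l^T U_l^{-1} O_l \right)^{-1} O_l^T U_l^{-1} FD_l
```

Under these assumptions the covariance matrix acts as a weighting: observation points with a large expected intensity error contribute less to the estimate.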
In the following discussion, we will present a step-wise description of the Maximum-Likelihood human motion estimation algorithm as well as some experimental results on synthetic and real data to assess the accuracy, limitations and advantages of this estimation algorithm.
Algorithm

Figure 4. Shape and pose adaptation of the right arm of a human operator to the first image I0 of an image sequence.

Figure 5. Selected observation points for the upper and lower model arm of a human operator. Only observation points with high linear intensity gradient were selected for motion estimation.
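The selection criterion in Fig. 5 can be sketched as a simple gradient threshold. The report does not define "high linear intensity gradient" numerically, so the central-difference gradient, the threshold value, and the function name below are assumptions. The sketch also works on a plain 2D grayscale image rather than on points of the mesh surface.

```python
def select_observation_points(image, threshold):
    """Return (row, col) positions whose central-difference gradient magnitude exceeds threshold."""
    rows, cols = len(image), len(image[0])
    selected = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = (image[r][c + 1] - image[r][c - 1]) / 2.0  # horizontal gradient
            gy = (image[r + 1][c] - image[r - 1][c]) / 2.0  # vertical gradient
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                selected.append((r, c))
    return selected

# Toy 5x5 grayscale image with a vertical intensity edge between columns 1 and 2.
image = [[0, 0, 100, 100, 100] for _ in range(5)]
points = select_observation_points(image, threshold=10.0)
```

Only interior pixels next to the edge are kept; flat regions carry no gradient and therefore give no information about motion.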
Experimental results
We have performed a number of experiments on synthetic and real data to assess
the accuracy, limitations and advantages of the Maximum-Likelihood Motion Estimation
Algorithm. Real image sequences were obtained using a Pulnix TMC-9700 1-2/3" CCD
Progressive Scan Color Video Camera with a 2/3" 9 mm lens and a 640×480 RGB video
output at a frame rate of 30 Hz. The video signal was acquired using a Matrox
Meteor-II/Multi-Channel frame grabber. All experiments were performed on a Pentium III (1
GHz) workstation with 0.5 GB RAM. The average processing time was 1.02 s per frame. We
limit this report to experimental results obtained from two real image sequences called
HAZEL-A and HAZEL-B.
For the first experiment, we applied our Maximum-Likelihood human motion estimation algorithm to a synthetic image sequence depicting (without loss of generality) a moving right arm of a virtual human. The dimensions of the arm correspond to the dimensions of the right arm of one of the co-authors. The synthetic image sequence was generated by obtaining 400 images (RGB, 640×480 pixels)11 of the arm at different times while the arm moves along a predefined trajectory. The maximum magnitude of the frame-to-frame translation vector of the shoulder, elbow, and wrist in the 3D virtual world is 0.5 cm, 1.15 cm, and 1.89 cm, respectively. The maximum magnitude of the frame-to-frame displacement vector of the shoulder, elbow, and wrist in the image plane is 2.22 pixels, 4.95 pixels, and 9.15 pixels, respectively. Experimental results showed a mean error for the derived shoulder, elbow and wrist position (using the estimated motion parameters) of 0.056743 cm, 0.049568 cm, and 0.055171 cm, and a variance of 0.003976 cm², 0.002582 cm², and 0.009226 cm², respectively.
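The means and variances above can be converted to a "mean ± spread" figure in millimeters with a short calculation. The sketch below assumes that the spread is the standard deviation, that is, the square root of the reported variance; the dictionary and function names are illustrative.

```python
import math

# Mean error (cm) and variance (cm^2) per joint, copied from the paragraph above.
results = {
    "shoulder": (0.056743, 0.003976),
    "elbow": (0.049568, 0.002582),
    "wrist": (0.055171, 0.009226),
}

def to_mm_summary(mean_cm, var_cm2):
    """Convert a mean in cm and a variance in cm^2 to (mean, standard deviation) in mm."""
    return mean_cm * 10.0, math.sqrt(var_cm2) * 10.0

summary = {joint: to_mm_summary(*values) for joint, values in results.items()}
for joint, (mean_mm, std_mm) in summary.items():
    print(f"{joint}: {mean_mm:.1f} +/- {std_mm:.1f} mm")
```

Rounded to one decimal, the wrist comes out as 0.6 +/- 1.0 mm, which agrees with the wrist figure quoted in the abstract.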
For the second experiment, we tested the Maximum-Likelihood Motion Estimation Algorithm using the real test sequences HAZEL-A and HAZEL-B (200 frames each) depicting a woman grasping and moving an object in front of a bookshelf. Figs. 6(a-f) and 7(a-f) depict the model at the estimated position and orientation overlaid on the image sequences HAZEL-A and HAZEL-B, respectively. Although the model remains well aligned during tracking, some position errors can still be observed. For example, in frame 116 (Fig. 6(d)) the shoulder has drifted backwards. However, the algorithm quickly compensated for these errors and tracking was not lost.

Figure 6. Frames 1, 40, 80, 116, 160, and 200 from the sequence HAZEL-A. (a-f) Original frames with the model overlaid at the estimated position and orientation.

Figure 7. Frames 1, 45, 90, 160, 180, and 200 from the sequence HAZEL-B. (a-f) Original frames with the model overlaid at the estimated position and orientation.
Finally, the estimated motion parameters from the real image sequences HAZEL-A and HAZEL-B were used successfully to command a ROBONAUT simulation developed at NASA Johnson Space Center. Figs. 8(a-f) and 9(a-f) depict the coronal and the sagittal view of a ROBONAUT simulation being animated with the estimated motion parameters of HAZEL-B.

Figure 8. Commanding a ROBONAUT simulation developed at NASA-JSC with the estimated motion parameters of the sequence HAZEL-B. (a-f) Coronal view of the postures corresponding to the frames 1, 45, 90, 160, 180, and 200 of the sequence HAZEL-B.

Figure 9. Commanding a ROBONAUT simulation developed at NASA-JSC with the estimated motion parameters of the sequence HAZEL-B. (a-f) Sagittal view of the postures corresponding to the frames 1, 45, 90, 160, 180, and 200 of the sequence HAZEL-B.
Concluding remarks
We have developed a non-intrusive system for estimating the motion of a human
body from a monocular video camera. For motion estimation, the human body is represented
by an articulated model consisting of rigid links connected by spherical joints. The
translation vector and the rotation angles of the root link are estimated by applying a
Maximum-Likelihood motion estimation algorithm. The rotation angles of each of the
remaining links are estimated in three steps. In the first step, the translation vector
of the link is computed from the motion parameters of the previous reference link. In the
second step, the link is motion compensated with the computed motion parameters. Finally,
in the third step, the three rotation angles are estimated by applying a
Maximum-Likelihood motion estimation algorithm. The experimental results revealed a
position error for the shoulder, elbow and wrist of 0.6±0.6 mm, 0.5±0.5 mm and 0.6±1.0 mm,
respectively. Furthermore, the model object remained well aligned during the tested image
sequences. Although we have presented results from estimating the motion of a single arm,
our algorithm is general and it can be applied to upper body and lower body extremities.
Experiments are currently underway to assess the robustness of our algorithm under
occlusion.
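The first two of the three steps described above can be sketched for a two-link arm as follows. To keep the example short, the parent link rotates about a single axis (z) instead of three angles, and the joint positions and limb lengths are hypothetical. The third step, the Maximum-Likelihood estimation of the child link's rotation angles, is not implemented here.

```python
import math

def rotate_z(point, center, angle):
    """Rotate `point` about a z-axis through `center` by `angle` radians."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + c * dx - s * dy, center[1] + s * dx + c * dy, point[2])

def child_translation(joint, parent_center, parent_translation, parent_angle_z):
    """Step 1: translation of the child link implied by the motion of its reference link.

    `joint` is the point shared by both links (here the elbow)."""
    rotated = rotate_z(joint, parent_center, parent_angle_z)
    moved = tuple(rotated[i] + parent_translation[i] for i in range(3))
    return tuple(moved[i] - joint[i] for i in range(3))

shoulder = (0.0, 0.0, 0.0)
elbow = (30.0, 0.0, 0.0)   # hypothetical 30 cm upper arm
wrist = (55.0, 0.0, 0.0)   # hypothetical 25 cm lower arm

# The upper arm rotates 90 degrees about the shoulder and shifts 2 cm along x.
t_lower = child_translation(elbow, shoulder, (2.0, 0.0, 0.0), math.pi / 2)

# Step 2: motion-compensate the lower arm with the computed translation.
elbow_compensated = tuple(elbow[i] + t_lower[i] for i in range(3))
wrist_compensated = tuple(wrist[i] + t_lower[i] for i in range(3))

# Step 3 (not shown): estimate the lower arm's rotation angles about elbow_compensated.
```

After compensation the elbow sits where the moved upper arm ends, so only the three rotation angles of the lower arm remain to be estimated.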
Acknowledgements
This project is supported by an ISSO Postdoctoral Fellowship and a software grant (NDDS)
from Real Time Innovations, Inc.
References
1. L. Goncalves, E. Di Bernardo, E. Ursella, and P. Perona. "Monocular Tracking of the Human Arm in 3D," Proc., Int'l Conf. on Computer Vision, June 1995. 764-70.
2. I. A. Kakadiaris and D. Metaxas. "Model-Based Estimation of 3D Human Motion," IEEE Trans. on Pattern Analysis and Machine Intelligence 22.12 (2000): 1453-59.
3. I. A. Kakadiaris and D. Metaxas. "Model-Based Estimation of 3D Human Motion with Occlusion Based on Active Multi-Viewpoint Selection," Proc., IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, June 1996. 81-87.
4. C. R. Wren and A. P. Pentland. "Dynamic Models of Human Motion," Proc., Int'l Conf. on Automatic Face and Gesture Recognition, April 1998. 22-27.
5. C. Bregler and J. Malik. "Tracking People with Twists and Exponential Maps," Proc., IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, June 1998. 8-15.
6. R. Koch. "Dynamic 3D Scene Analysis through Synthesis Feedback Control," IEEE Trans. on Pattern Analysis and Machine Intelligence 15.6 (1993): 556-68.
7. M. Yamamoto and K. Koshikawa. "Human Motion Analysis Based on a Robot Arm Model," Proc., IEEE Conf. on Computer Vision and Pattern Recognition, June 1991. 664-65.
8. G. Martinez. "Maximum-Likelihood Motion Estimation of Articulated Objects for Object-Based Analysis-Synthesis Coding," Proc., Picture Coding Symposium, Seoul, Korea, April 2001. 393-96.
9. G. Martinez. "Analyse-Synthese-Codierung basierend auf dem Modell bewegter dreidimensionaler, gegliederter Objekte," Ph.D. Dissertation, Institut fuer Theoretische Nachrichtentechnik und Informationsverarbeitung, Univ. of Hannover, Hannover, Germany, Feb. 1998.
10. G. Martinez, I. Kakadiaris, and D. Magruder. "Optical Tracking for Telepresence/Teleoperation Space Applications," Technical Report, Department of Computer Science, University of Houston, March 2002.
11. D. DeCarlo and D. Metaxas. "Optical Constraints on Deformable Models with Applications to Face Tracking," Int'l J. Computer Vision 38.2 (2000): 99-127.
Publications
Martinez, G., I. Kakadiaris, and D. Magruder. "Optical Tracking for
Telepresence/Teleoperation Space Applications," Technical Report, Department of
Computer Science, University of Houston, March 2002.
Investigative Team
UH PI: Ioannis Kakadiaris, Ph.D., Professor
UH PI: Karolos Grigoriadis, Ph.D., Professor
NASA-JSC PI: Darby Magruder
NASA-JSC PI: Kenneth Baker, Ph.D.
UH PDAF: Geovanni Martinez, Ph.D.
Institute for Space Systems Operations - 2001
Annual Report
Copyright © 2002