Project: Learning to Point

Participants: Yi Chen, Ping Wang, Garrick Orchard, Sadique Sheik, Bad Match for 'Michele', Bert Shi

The goal of this project is to develop a robotic system that learns to point its arm towards a particular point in visual space. In order to solve the pointing problem, the robot must learn the relationship between visual coordinates and motor coordinates.

As a first step, we assume that the robot already has the basic reflexive behaviors of foveation by head movements (keeping the object centered by pan and tilt movements) and binocular vergence tracking (changing the convergence and divergence of the eyes so that they are fixated on the same object). During training, the robot moves two joints of its arm randomly while maintaining fixation on the tip of its arm, which is identified via a red LED. After training, the LED will be removed from the tip of its arm, and the robot should be able to move its arm to point at the LED.

The visual coordinates (pan and tilt) are related to arm commands sent to the two joint angles via a neural network. The network adopts distributed representations of the current positions of the current visual coordinates, the current arm motor positions (head/vergence angle) and motor (joint angles) positions, a desired set of motor coordinates (to enable external input of the random movements), as well as the coordinates of a motor command signal (which drives the arm). Since the coordinates are all two dimensional, the distributed representation is constructed by defining 2D arrays of neurons, each responsible for a particular combination of coordinates. Here we use 10x10 arrays of neurons. These arrays are organized as shown in the following diagram.

smaller picture of network

In the 2D array representation of the arm commands, each neuron represents a particular combination of joint angles to be sent to the arm. Coordinates are normalized so that they range between 0 and 1. The neurons are driven by the outputs of the neurons representing the current arm motor positions and by the outputs of the neurons representing the visual coordinates by weights which are learned via a Hebbian learning rule. During learning, the weights are normalized so that the total sum of the weights leaving each neuron in the current arm position and visual coordinates sum to one. The neurons in the arm command array are also driven by strong fixed one-to-one topographical connections from the external input array.

In the 2D array representation of visual coordinates, each neuron represents a particular combination of pan and tilt angles. Given the actual visual coordinates, the distributed representation is obtained by assigining each neuron an analog activity level that decays according to a Gaussian function as the distance between the actual visual coordinates and the coordinates represented by that neuron.

In the 2D array representation of the arm coordinates, each neuron represents a particular combination of joint angles. Because the servomotors used to actuate the arms do not provide feedback, we assume that the joint angles assumed by the arm at each time are equal to the commanded joint angles from the previous time step. Similar to the visual representation, each neuron assumes an analog activity level that decays according to a Gaussian function as the distance between the commanded joint angles and the joint angles represnted by that neuron.

During training, the robot waves its arm in front of its eyes. The arm motion is established by first defining a 2D trajectory in the joint space, and using this 2D trajectory to control the activity in the external input array, which at each point in time is a Gaussian centered on the trajectory. Because of the strong fixed connections between the external input array and the motor command array, this activity is essentially mirrored in the motor command array. When decoded by the vector averaging operation, the motor command sent to the arm is close to the originally defined 2D trajectory, although there is a bias towards locations near the center of the image due to boundary effects.

The end of the arm is identified by an LED which the robot is holding in its grasper. This LED is tracked by a stereo camera system mounted upon a pan and tilt head. The geometry of the stereo camera system is fixed, but we simulate eye movements by shifting the locations of subwindows in the eye.

The weights between the arm joint position array and the visual position array are updated according to a Hebbian learning rule. Each synaptic weight is increased by an amount proportional to the product of the activity in the pre- and post-synaptic neurons. Because the activity is always positive, this always increases the weights. In order to avoid the weights growing without bound, we normalize the weights so that the sum of weights leaving each neuron in the joint position and visual position arrays sum to one. As described above, activity in the joint position array changes according to commands sent to the motor in the previous step. Activity in the visual position array changes as the robot adjusts the pan and tilt angles to track the LED at the end of the arm. As the robot moves its arm, it learns the correlations between the sensory and motor spaces, which it can later exploit to exhibit pointing behaviour.

A video showing the training of the network is available here:


During training, the weights evolve as shown in a video available here


This video shows the weights from the arm position neurons to the motor command neurons (Wa) on the left, and the weights from the visual position neurons to the motor command neurons (Wv) on the right. The weights images are divided into a 10x10 array of 10x10 pixel blocks. Each 10x10 pixel block shows the projective field from one neuron in the arm or visual position array to the 10x10 array of motor command neurons. These 10x10 pixel blocks are arranged next to each other topographically according to the position of the source neuron in the position array. Due to normalization, the pixel values in the 10x10 pixel blocks all sum to one. At the start of training, the weights are initialized randomly, subject to the constraints that they sum to one and that they are positive. As time goes on, the projections focus on on particular region of the motor command space, as evidenced by the emergence of the bunps in the map. At the end of training, the locations of the bumps progresses smoothly from one block to the next, indicating that the system has learned a systematic mapping from position to motor coordinates.

In order to test the system, we simply disable the external input array by setting the amplitude of the Gaussian blob to zero. In this case, the activity in the motor command array is generated purely by the activity in the arm position and visual position array. The activity in the arm position array provides a smoothing or low pass filtering effect on the activity of the motor command array, since the current arm position (which depends upon the past motor command) is allowed to influence the current activity in the motor command array. The activity in the visual position array is driven by the current pan and tilt angles, which are determined by the tracking of the LED. Thus, the activity in the motor command array can be thought of as a smoothed version of the robot's current estimate of the LED position, transformed by the mapping from the visual to motor coordinates which was learned during the training.

We evaluated the performance of the system by moving the arm to different locations in joint space, and allowing the comparing the motor command generated by the network (obtained by taking the vector average of the activity in the motor command array) with the actual motor command used to move the arm. At the beginning of the training, the motor command generated by the network remains close to 0.5 for both joint angles even though the arm moves throughout the entire range of its motion. This is shown in the figure below, which plots the motor commands sent to the two joint angles as well as the motor commands generated by the motor neuron activity.

This result is consistent with the random and uniform initialization of the weight matrices. After training, the motor commands generated by the network better track the actual motor commands sent to the arm, as shown in the figure below.

The range over which the system can track the actual arm position is limited in part due to boundary effects, since the motor command generated after vector averaging a Gaussian blob placed at the edge of the array will be displaced towards the centre since the part of the blob that falls outside the array is not seen.

We also observe that there is some coupling between the motion of the two joints, so that even when one joint is moving slowly, the motor command generated by the network oscillates approximately in phase with the motion of the other joint. We speculate that this is because this particular combination of joint angles also results in the LED lying in a position close to the actual position in 3D space, since the joint angle coordinates are not orthongonal in the 3D space.

Indeed, we observe that the arm can actually point to the LED when it is moved in space when the arm is driven by the decoded output of the motor command array, as evidenced in the following video:


In conclusion, in this project, we have demonstrated the autonomous development of sensory-motor coordination in a neuromorphic robotic system. The network architecture was developed entirely at the Telluride workshop during discussions between the participants. Although the robotic parts were brought to the workshop, their integration into this particular robotic system was done entirely at the workshop.

In the following appendices, we describe intermediate steps in the realization of this project.

Appendix 1

As a first step in realizing this project, we wrote MATLAB code implementing a 1D version of the network described above, and detailed in the figure below.

Our simulations indicated that the basic network architecture for 1D visual and joint angle spaces (shown below) converges to the correct mapping.

  • Code (Please checkout full runnable code from SVN)
    • hebbian.m
      close all
      clear all
      n_arm = 100;
      n_vis = 101;
      visfun = @(x)100/x+20;   %a nonlinear visfun simulating 
                                %the nonlinear relationship between visual cues and motor cmd.
      sigma = 20;
      %Hebbian learning parameter
      n_iter = 10;
      eta = 1e-3;
      %initial proprioception
      pa = rand(n_arm, 1); pa=pa/sum(pa);
      pv = rand(n_vis, 1); pv=pv/sum(pv);
      %initial weight
      Wa = rand(n_arm, n_arm);
      Wv = rand(n_arm, n_vis);
      for iter=1:n_iter
          for i = 1:n_arm
              %external force
              x = zeros(n_arm, 1);
              in=ceil(rand(1)*n_arm);  % the external force appears randomly within the range
              x(in) = 1000;
              %motor signal
              ma = Wa*pa + Wv*pv + x;
              mc = (1:n_arm)*ma / sum(ma);
              %simulated sensory proprioception
              pa = normpdf( (1:n_arm)', mc, sigma);
              pv = normpdf( (1:n_vis)', visfun(mc), sigma);
              %Hebbian Learning rule
              Wa = Wa + eta .* ma * pa';
              Wv = Wv + eta .* ma * pv';
              %Weight normalization
              if 1
                  %make rowSum(Wa) = 1;
                  Wa = Wa ./ repmat(sum(Wa,2),[1 n_arm]);
                  Wv = Wv ./ repmat(sum(Wv,2),[1 n_vis]);
                  %global norm.
                  Wa = Wa./sum(sum(Wa));
                  Wv = Wv./sum(sum(Wv));
          if mod(iter/n_iter,0.1)==0  %display every 10% of the process
              imagesc(Wa); axis image; title('Wa')
              hold on;
              imagesc(Wv); axis image; title('Wv');
              hold on;
              xlabel('Pv'); ylabel('Ma');colorbar
              suptitle(sprintf('Iter %d/%d', iter,n_iter));

As we can see here, the system can learn a systematic mapping between the position and motor command coordinates.

Appendix 2

In the next stage of the project, we ported the MATLAB code developed above into a C++ version, which also supported 2D arrays of neurons. In the figure below, we show that the network can learn correctly when presented with inputs where the visual positions and arm positions were generated directly from the previous activity in the arm motor array. While this situation is purely artificial, the emergence of the correct weights gave us strong confidence that the network would perform correctly with actual training data from a real robot. This indeed turned out to be the case, as described above.

SVN Repository