Representation of proprioception observation space and action space (joint position / velocity, cartesian position / velocity) #302
Hi Haoming,

We don't actually do any action or state space unification. The actions and state are simply zero-padded to the maximum size in any dataset, which is 32 dimensions. As you pointed out, our internal data all uses joint positions, whereas OXE mostly uses end-effector position, and DROID uses joint velocity. Does this affect performance? Maybe, maybe not... we're still figuring that out!
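For concreteness, a minimal sketch of the zero-padding described above. The constant value of 32 matches the number quoted in the reply; the function name and everything else here is illustrative, not the repo's actual code:

```python
import numpy as np

MAX_ACTION_DIM = 32  # maximum action/state dimensionality across all datasets

def pad_to_max_dim(vec: np.ndarray, max_dim: int = MAX_ACTION_DIM) -> np.ndarray:
    """Zero-pad a per-timestep action or state vector to a fixed width.

    No semantic unification is performed: a 7-D joint-position action and a
    7-D end-effector action both land in the first 7 slots, and the model is
    left to handle the differing conventions across datasets.
    """
    padded = np.zeros(max_dim, dtype=vec.dtype)
    padded[: vec.shape[-1]] = vec
    return padded

# e.g., a 7-DoF joint-position action -> 32-D padded action
action = np.random.randn(7).astype(np.float32)
padded_action = pad_to_max_dim(action)
assert padded_action.shape == (32,)
```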
Hi @kvablack, could you tell us how the conversion is done at the end from the "rot6d" representation to an actual, proper rotation representation (removing possible model noise), either a quaternion or an SO(3) matrix? There seem to be several possible approaches, and this seems important for execution. For reference, a ChatGPT summary of the options:

When mapping a 6D vector (two 3D vectors) to a quaternion, several methods have been proposed, including standard Gram–Schmidt orthogonalization followed by a rotation-matrix-to-quaternion conversion, SVD-based orthogonalization, direct least-squares formulations, and modified orthogonalization techniques.

Which method is best? For most machine learning applications, especially when integrated into an end-to-end differentiable pipeline, standard Gram–Schmidt orthogonalization followed by a rotation-matrix-to-quaternion conversion is considered the best trade-off. SVD-based methods or direct least-squares formulations may offer marginal improvements in robustness under heavy noise, but they tend to be more computationally expensive and more complex to differentiate. Modified orthogonalization techniques can improve stability further, but are usually not necessary if the standard method is implemented with proper normalization. In summary, normalization is essential in all of these methods to ensure that the resulting rotation matrix (and thus the quaternion) is valid; the standard Gram–Schmidt method is typically preferred due to its simplicity, efficiency, and good empirical performance on neural network predictions.
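A minimal sketch of the standard Gram–Schmidt route described above, using NumPy and SciPy. This illustrates the generic technique, not necessarily how the repo actually performs the conversion; treating the two 3-D vectors as the first two columns of the rotation matrix, and SciPy's (x, y, z, w) quaternion ordering, are assumptions here:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    """Project a (possibly noisy) 6D rotation prediction onto SO(3).

    Gram-Schmidt: normalize the first 3-vector, subtract its component from
    the second and normalize, then take the cross product for the third
    column. The normalization steps are what remove the model noise.
    """
    a1, a2 = rot6d[:3], rot6d[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # vectors become matrix columns

def rot6d_to_quat(rot6d: np.ndarray) -> np.ndarray:
    """6D -> unit quaternion (x, y, z, w) via the orthogonalized matrix."""
    return Rotation.from_matrix(rot6d_to_matrix(rot6d)).as_quat()

# Round-trip check: build a 6D vector from a known rotation and recover it.
R_true = Rotation.random().as_matrix()
rot6d = R_true[:, :2].T.reshape(6)  # first two columns as the 6D representation
assert np.allclose(rot6d_to_matrix(rot6d), R_true, atol=1e-6)
```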
Hi,
Thanks for your excellent work! I have some questions about the representation of proprioception observation space and action space during training.
As far as I know, the datasets in OXE Magic Soup all use cartesian position (i.e., delta end-effector pose) as the action space. If the model is trained with joint positions as its action space, how do you convert these cartesian actions into joint positions?

Similarly, the proprioceptive state of some datasets in OXE Magic Soup is also represented only as the cartesian pose of the end effector in the robot's base frame. How do you convert it into joint positions? (As far as I know, an end-effector pose can be converted to joint positions using inverse kinematics, but this process often has multiple solutions; see the toy sketch below.)

In the code you provided, I found that DROID uses joint velocity to represent actions. How is this reconciled with the other action representations?
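To illustrate the multiple-solution point: even a planar 2-link arm has two IK branches (elbow-up / elbow-down) for most reachable targets, so any pose-to-joint conversion has to pick a branch, e.g., the one closest to the current or previous joint configuration. A toy sketch; the 2-link arm and the nearest-seed selection rule are illustrative assumptions, not something the dataset pipelines are claimed to do:

```python
import numpy as np

def two_link_ik(x: float, y: float, l1: float = 1.0, l2: float = 1.0):
    """Return both IK solutions (elbow-down, elbow-up) for a planar 2-link arm."""
    d2 = x * x + y * y
    c2 = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)  # cos(q2), law of cosines
    if abs(c2) > 1:
        raise ValueError("target out of reach")
    solutions = []
    for sign in (+1.0, -1.0):  # the two elbow configurations
        q2 = sign * np.arccos(c2)
        q1 = np.arctan2(y, x) - np.arctan2(l2 * np.sin(q2), l1 + l2 * np.cos(q2))
        solutions.append(np.array([q1, q2]))
    return solutions

def nearest_solution(solutions, seed):
    """Disambiguate by picking the branch closest to a seed configuration."""
    return min(solutions, key=lambda q: np.linalg.norm(q - seed))

sols = two_link_ik(1.2, 0.5)
q = nearest_solution(sols, seed=np.array([0.0, 0.5]))  # e.g., previous joint state
```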