[Figure: Model overview. (a) Training: text input (e.g., "A woman heights 165 cm tall … . The person demonstrates a motion …"), LLM, Linear+Softmax, predicted token, [BETA] token, GT shape parameter, GT token. (b) Inference: text input (e.g., "A man standing 191 cm tall … . The man is walking …"), LLM, Linear+Softmax, predicted token, [BETA] token, predicted shape parameter, motion decoder 𝓓, FSQ, generated motion. Shared components: embeddings, Encoder-Decoder Transformer, Beta Embedding (θ_s, θ_e), concat of the shape feature with the discrete motion feature, shape-conditioned motion feature Z_s (t×544).]