Project Materials
[Pre-Print Paper] [Technical Appendix] [Video Demo] [Video Presentation] [Flash Talk (One minute)]
Summary
Multimodal sensors (visual, non-visual, and wearable) can provide complementary information to develop robust perception systems for recognizing activities accurately. However, it is challenging to extract robust multimodal representations due to the heterogeneous characteristics of data from multimodal sensors and disparate human activities, especially in the presence of noisy and misaligned sensor data. In this work, we propose a cooperative multitask learning-based guided multimodal fusion approach, MuMu, to extract robust multimodal representations for human activity recognition (HAR). MuMu employs an auxiliary task learning approach to extract features specific to each set of activities with shared characteristics (activity-group). MuMu then utilizes these activity-group-specific features to direct our proposed Guided Multimodal Fusion Approach (GM-Fusion) in extracting complementary multimodal representations, designed as the target task. We evaluated MuMu by comparing its performance to state-of-the-art multimodal HAR approaches on three activity datasets. Our extensive experimental results suggest that MuMu outperforms all the evaluated approaches across all three datasets. Additionally, our ablation study suggests that MuMu significantly outperforms the baseline models (p<0.05), which do not use our guided multimodal fusion. Finally, the robust performance of MuMu on noisy and misaligned sensor data suggests that our approach is suitable for HAR in real-world settings.
Our proposed multimodal learning model can be extended to visual-language and audio-visual tasks by incorporating modality-specific encoders. For example, it could be extended to develop virtual assistant systems, human-AI/robot interaction, and visual question answering systems.
This work has been accepted at the AAAI Conference on Artificial Intelligence (AAAI), 2022 [Main Track: Oral; Oral Acceptance Rate: 4.6%; Overall Acceptance Rate: 15%].
Motivation
Understanding human activity ensures effective human autonomous-system collaboration in various settings, from autonomous vehicles to assistive living to manufacturing (Sabokrou et al. 2019; Iqbal and Riek 2017; Yasar and Iqbal 2021; Iqbal et al. 2019). For example, accurate activity recognition could aid collaborative robots in assisting a worker by bringing tools or autonomous vehicles in requesting to take over the controls from a distracted driver to ensure safety (Kubota et al. 2019; Pakdamanian et al. 2020).
Human activity recognition (HAR) has been extensively studied using unimodal sensor data, such as visual (Ryoo et al. 2017; Zhang and Parker 2011; Fan et al. 2018), skeleton (Arzani et al. 2017; Ke et al. 2017; Yan, Xiong, and Lin 2018; Iqbal, Rack, and Riek 2016), and wearable sensors (Frank, Kubota, and Riek 2019; Batzianoulis et al. 2017). However, unimodal HAR methods struggle to recognize activities in various real-world scenarios for multiple reasons. First, distinct activities can be mistakenly classified as the same when relying on visual sensors alone (Kong et al. 2019). For example, carrying a light object and carrying a heavy object look similar from visual modalities; however, they produce distinct physical sensor data (i.e., gyroscope and accelerometer readings). Second, HAR algorithms relying on unimodal sensor data may fail to recognize activities when the sensor data is noisy (Fig. 1-c). In these cases, using multiple modalities can compensate for the weaknesses of any particular modality in recognizing an activity.
How did we develop the learning models?
We used the PyTorch and PyTorch-Lightning deep learning frameworks to develop MuMu and the other baseline approaches. We used the Adam optimizer with weight decay regularization and cosine annealing warm restarts (Loshchilov and Hutter 2017), with the initial learning rate set to 3e-4, to train the evaluated approaches. To train the learning model on the MMAct dataset, we set the cycle length (T0) and cycle multiplier (Tmult) to 30 and 2, respectively. For the UTD-MHAD and UCSD-MIT datasets, we set the cycle length (T0) and cycle multiplier (Tmult) to 100 and 2, respectively. We used a batch size of 32 for the UCSD-MIT dataset, and a batch size of 2 for the MMAct and UTD-MHAD datasets. We trained each evaluated model for 80, 210, and 510 epochs on the MMAct, UTD-MHAD, and UCSD-MIT datasets, respectively. We used the same fixed random seed for all experiments to ensure reproducibility. Finally, we trained the evaluated approaches in a distributed GPU cluster environment, where each node contains 2-4 GPUs.
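The optimizer and scheduler setup described above can be sketched as follows. This is a minimal illustration only: the linear layer stands in for the actual model, and the weight decay value is a placeholder, since the exact value is not stated above. The T0=30, Tmult=2, 80-epoch configuration matches the MMAct setting.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# Placeholder model standing in for the actual network
model = torch.nn.Linear(128, 10)

# Adam with decoupled weight decay regularization; initial learning rate 3e-4
# (weight_decay value here is a placeholder)
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)

# Cosine annealing with warm restarts (Loshchilov and Hutter 2017):
# the first cycle lasts T_0 epochs, and each subsequent cycle is
# T_mult times longer. T_0=30, T_mult=2 is the MMAct setting above.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=30, T_mult=2)

for epoch in range(80):  # 80 epochs, as used for MMAct
    # ... forward pass, loss, backward pass would go here ...
    optimizer.step()
    scheduler.step()  # step the schedule once per epoch
```

After the first 30-epoch cycle, the learning rate resets to its initial value and decays again over the next, twice-as-long cycle.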
How did we move the learning models to Amazon EC2 DL1?
We developed the learning framework using PyTorch-Lightning so that we can switch the computing backend (CPU, GPU, or HPU) with minimal changes. For example, we moved our model to Amazon EC2 DL1 instances (powered by Gaudi accelerators from Habana Labs) by changing only the following parts of the training script (train_model.py):
HPU Training in Amazon EC2 DL1 instances powered by Gaudi accelerators from Habana Labs:
```python
import habana_frameworks.torch.core as htcore
import pytorch_lightning as pl
...
# Habana mixed precision (HMP): configure which ops run in bf16 vs. fp32
hmp_keys = ["level", "verbose", "bf16_ops", "fp32_ops"]
hmp_params = dict.fromkeys(hmp_keys)
hmp_params["level"] = "O1"
hmp_params["verbose"] = False
hmp_params["bf16_ops"] = "./ops_bf16_mnist.txt"
hmp_params["fp32_ops"] = "./ops_fp32_mnist.txt"

trainer = pl.Trainer(hpus=1, max_epochs=1, precision=16, hmp_params=hmp_params)
```
GPU Training:
```python
trainer = Trainer.from_argparse_args(args, gpus=args.gpus)
```
That is how simple it is to change the computing environment and train our learning model!
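To make the backend swap concrete, the choice can be reduced to a single dictionary of Trainer keyword arguments. This is a sketch rather than our actual train_model.py: the trainer_kwargs helper is hypothetical, while the HPU values mirror the Habana snippet above.

```python
def trainer_kwargs(backend: str) -> dict:
    """Return pytorch_lightning.Trainer keyword arguments for a backend.

    The 'hpu' branch mirrors the Habana example above; 'gpu' and the
    CPU default use stock PyTorch-Lightning arguments.
    """
    if backend == "hpu":
        # Habana mixed precision (HMP) configuration, as in the snippet above
        hmp_params = {
            "level": "O1",
            "verbose": False,
            "bf16_ops": "./ops_bf16_mnist.txt",
            "fp32_ops": "./ops_fp32_mnist.txt",
        }
        return {"hpus": 1, "precision": 16, "hmp_params": hmp_params}
    if backend == "gpu":
        return {"gpus": 1}
    return {}  # CPU: stock Trainer defaults

# Usage (hypothetical): trainer = pl.Trainer(**trainer_kwargs("hpu"), max_epochs=80)
```

The rest of the training loop stays untouched, which is the point of keeping the framework in PyTorch-Lightning.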
Broader Impact
Our multimodal learning model, MuMu, can be applied to autonomous systems to ensure safe, fluent, and productive human-autonomous system teams. For example, an adaptive collaborative robotic system can use MuMu to assist people with mental or physical disabilities in their daily activities by capturing their multimodal behavior (verbal and non-verbal) and adapting to their needs and preferences. Moreover, MuMu can be used to understand multimodal instructions, aiding AI assistants (e.g., Amazon Alexa AI) in improving the user experience for various applications, such as online shopping assistants, video gaming, and personalized learning assistants for students. Furthermore, we can extend MuMu to understand human social behavior, enhancing social media interaction and online personalized teaching, and to predict human actions from multimodal VR/AR environment information to improve VR/AR-based gaming.
For more details, please check out our research paper and other materials: [Pre-Print Paper] [Technical Appendix] [Video Demo] [Video Presentation] [Flash Talk (One minute)]
Built With
- amazon-ec2
- gaudi
- habana
- pytorch
- pytorch-lightning
