Our research interests mainly include computer vision, machine learning, and artificial intelligence. Recently, we have been focusing on visual understanding via deep learning, e.g., person re-identification, crowd counting, video/image detection and segmentation, video captioning, cross-modal retrieval, pose estimation, and fine-grained behavior understanding with applications in civil aviation. We also focus on practical applications of civil aviation video surveillance systems, including situation awareness, anomaly detection, model compression, and edge computing.
1. Multi-object Tracking
Most existing transformer-based multi-object tracking (MOT) methods use a Convolutional Neural Network (CNN) to extract features and then use a transformer to detect and track objects. However, the feature extraction networks in existing MOT methods fail to attend to salient regional features and to capture their consecutive contextual information, so potential object areas are neglected during detection. In addition, self-attention in the transformer generates extensive redundant attention areas, resulting in a weak correlation between detected and tracked objects during tracking. In this paper, we propose a salient regional feature enhancement module (SFEM) that focuses on salient regional features and enhances the continuity of contextual features, which effectively avoids neglecting potential object areas caused by occlusion and background interference. We further propose soft-sparse attention (SSA) in the transformer to strengthen the correlation between detected and tracked objects; it establishes an exact association between objects and thus reduces ID switches. Experimental results on the MOT17 and MOT20 datasets show that our model significantly outperforms state-of-the-art methods on the MOTA, IDF1, and IDSw metrics.
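The abstract above does not give implementation details of SSA, but one common way to suppress redundant attention areas is to keep only the top-k attention scores per query before the softmax. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the class name `SoftSparseAttention` and the parameter `k` are our own illustrative choices, not the paper's actual formulation.

```python
import torch
import torch.nn as nn


class SoftSparseAttention(nn.Module):
    """Illustrative sparse self-attention: standard scaled dot-product
    attention whose weights are restricted to the top-k keys per query,
    suppressing redundant attention areas (a sketch, not the paper's SSA)."""

    def __init__(self, dim, num_heads=8, k=16):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.k = k
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, num_tokens, dim)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                    # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # (B, heads, N, N)

        # Keep only the top-k scores per query; mask the rest before softmax
        topk = min(self.k, N)
        thresh = attn.topk(topk, dim=-1).values[..., -1:]        # k-th largest score per query
        attn = attn.masked_fill(attn < thresh, float('-inf'))
        attn = attn.softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

In this toy version, concentrating each query's attention on a few keys is what tightens the association between detections and existing tracks; the choice of k trades off coverage against redundancy.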
2. Video Captioning
Utilizing multimodal information to understand video semantics is quite natural when humans watch a video and describe its contents in natural language. In this paper, a hierarchical multimodal attention network that promotes visual-textual and visual-visual information interactions is proposed for video captioning; it is composed of two attention modules that learn multimodal visual representations in a hierarchical manner. Specifically, the visual-textual attention modules are designed to align the semantic textual guidance with global-local visual representations, leading to a comprehensive understanding of the video-language correspondence. Moreover, the visual-visual attention modules learn a joint model of the diverse visual representations, generating compact and powerful video representations for the captioning model. Extensive experiments on two public benchmark datasets demonstrate that our approach is highly competitive with state-of-the-art methods.
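As a rough illustration of the visual-textual attention idea (not the paper's actual module), the sketch below shows textual guidance attending over visual features via cross-attention in PyTorch; the class name, projection layers, and dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class VisualTextualAttention(nn.Module):
    """Illustrative cross-modal attention: textual guidance queries the
    visual features to produce a text-aligned video representation
    (a sketch under assumed dimensions, not the paper's exact design)."""

    def __init__(self, text_dim=300, vis_dim=2048, hidden_dim=512, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_feats, vis_feats):
        # text_feats: (B, num_words, text_dim), vis_feats: (B, num_frames, vis_dim)
        q = self.text_proj(text_feats)            # queries from the textual guidance
        kv = self.vis_proj(vis_feats)             # keys/values from the visual features
        attended, _ = self.cross_attn(q, kv, kv)  # text-guided aggregation of visual cues
        return self.norm(q + attended)            # residual fusion of the two modalities


# Example usage with random tensors standing in for word and frame features
text = torch.randn(2, 10, 300)
video = torch.randn(2, 30, 2048)
fused = VisualTextualAttention()(text, video)     # (2, 10, 512)
```

A visual-visual module could reuse the same cross-attention pattern with global and local visual streams as the two inputs, which is one way to read the hierarchical design described above.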