25 April 2025 (Friday)

16:20–18:10 (Beijing Time, GMT+8)

Research Building 1, SUSTech

Organized by the IEEE Computational Intelligence Society Shenzhen Chapter.

Activity Aim & Talk

With the explosive growth of multimodal information, enhancing models' understanding of diverse data sources through visual representation learning has become a major research focus in AI. This talk begins with the fundamental ideas behind visual representation learning and surveys recent developments in visual model pretraining. It covers techniques such as image encoding, as well as the application and advancement of methods like CLIP in visual encoder pretraining. The speaker will discuss how contrastive and self-supervised learning improve visual features, the challenges of image-text modeling and cross-modal semantic alignment, and how CLIP-style models enable cross-modal transfer between the visual and language modalities. The talk will further elaborate on how visual encoders act as key components of multimodal systems, supporting perception, decision-making, and collaborative understanding in real-world tasks. Overall, it aims to provide insight into how visual representation learning improves the performance, generalization, and interpretability of intelligent systems.
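For attendees unfamiliar with the contrastive objective behind CLIP-style pretraining, the sketch below shows a minimal symmetric image-text contrastive (InfoNCE) loss in PyTorch. It is an illustrative assumption of the general technique, not the speaker's implementation; the function name, embedding shapes, and random stand-in encoder outputs are hypothetical.

```python
# Minimal sketch of a CLIP-style contrastive (InfoNCE) objective.
# Shapes and stand-in embeddings are illustrative assumptions only.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss over a batch of matched pairs."""
    # L2-normalize so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
    logits = image_emb @ text_emb.t() / temperature

    # Matched pairs lie on the diagonal; treat each row and column as a
    # classification over the batch (image-to-text and text-to-image).
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Toy usage: random tensors stand in for encoder outputs.
    batch, dim = 8, 512
    img = torch.randn(batch, dim)  # stand-in for a visual encoder output
    txt = torch.randn(batch, dim)  # stand-in for a text encoder output
    print(clip_contrastive_loss(img, txt).item())
```

Minimizing this loss pulls each image embedding toward the embedding of its paired text and pushes it away from the other texts in the batch, which is the alignment mechanism that lets CLIP-like visual encoders transfer across modalities.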

Meet the Speaker

Dr. Mingkai Zheng is an Assistant Professor in the Department of Computer Science and Engineering at SUSTech. He has worked in computer vision for many years, focusing on extracting semantically rich, efficient representations from images and on improving visual task performance through network architecture optimization. His research has been cited over a thousand times on Google Scholar. He has extensive cross-disciplinary experience: he was a postdoctoral fellow at the Institute of Automation, Chinese Academy of Sciences, and an algorithm engineer at Huawei's Qinghe Lab. He led the design and implementation of a visual model pretraining solution for SenseCabin's in-cabin intelligent driving system, which has been successfully deployed in over 13 million vehicles, including BYD, Cadillac, and Voyah models. Later, at ByteDance, he developed and optimized multimodal models for video understanding, supporting intelligent analysis and deep understanding of massive streaming content on TikTok.